Since its popularization in 2008, ‘big data’ has remained a buzzword. And it only gets bigger, as data volumes continue to grow at a fast pace. According to IDC research, the information we create and copy is doubling in size every two years and will reach 44 zettabytes (44 trillion gigabytes) by 2020. Consequently, about 1.7 megabytes of new information will be generated every second for every person on the planet, IDC states.

In this guide, we review the concepts and approaches aimed at harnessing big data to bring practical value to a number of domains.


3Vs of big data

The 3Vs stand for Volume, Variety and Velocity.

Volume: Big data is characterized by its volume. While there’s no specific threshold that defines big data, the term usually describes data flows of 100 GB and more, captured over time.

Variety: Data comes in all forms and shapes – patient health records, customer profiles, information from wearables and smart watches, social media posts, video clips, pictures and more.

Velocity: This V refers to the speed at which data flows are received, stored and managed. For example, Facebook users uploaded more than 900 million photos a day in 2015, according to Data Center Frontier. The company has to collect, process and store all these images so that any user can retrieve one later.

Unstructured vs. structured data

Structured data is information that can be organized by a predefined structure, e.g. put into a table with relations to other data bits within this table. For example, it can be a simple customer table with names, contact information, average purchase size and more.

Unstructured data is, basically, everything else. This information doesn’t have any identifiable structure: for example, a Pinterest post, an email, a text message, a 911 phone call or a movie. Forbes also says that unstructured data prevails over the structured type, accounting for about 80% of all data in any organization. Accordingly, processing and organizing this information is a time- and energy-consuming task.
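To make the distinction concrete, here is a minimal Python sketch. The record fields and sample values are invented for illustration; the point is only that structured data can be queried directly by field, while unstructured data has to be parsed or searched first.

```python
# A structured record: every field has a known name, so it fits
# naturally into a table with a predefined schema.
structured_customer = {
    "name": "Jane Doe",              # hypothetical example values
    "email": "jane@example.com",
    "avg_purchase_usd": 42.50,
}

# Unstructured data: a free-form note with no predefined schema --
# any structure must be extracted after the fact.
unstructured_note = "Jane called at 3pm, unhappy about late delivery, wants refund"

# Querying structured data is a direct field lookup...
print(structured_customer["avg_purchase_usd"])   # 42.5

# ...while answering the same kind of question against unstructured
# data needs text search or parsing.
print("refund" in unstructured_note)             # True
```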

Data lakes

A data lake is a storage repository that contains information in its native format (structured, unstructured or semi-structured). It is an agile data analytics tool allowing users to configure and reconfigure its models, queries, and apps whenever necessary.

Data lakes are used to get complex business insights, as there’s no need to organize and structure information according to a particular schema before loading it into storage. Only when a question arises does a BI consultant or data scientist query the lake to retrieve a smaller data set to be analyzed for the answer.
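This "schema-on-read" idea can be sketched in a few lines of Python. The toy "lake" below is just an in-memory dictionary of raw blobs in their native formats (the paths and data are invented); structure is applied only at query time, when a specific question arises.

```python
import csv
import io
import json

# A toy "data lake": raw blobs kept in their native formats,
# with no schema imposed up front (paths and data are invented).
lake = {
    "sales/2023.csv": "customer,amount\nalice,120\nbob,80\n",
    "reviews/r1.json": '{"customer": "alice", "stars": 5}',
}

def query_total_sales(lake):
    """Schema-on-read: parse the raw CSV only when the question arises."""
    rows = csv.DictReader(io.StringIO(lake["sales/2023.csv"]))
    return sum(int(r["amount"]) for r in rows)

def query_stars(lake, path):
    """Same idea for a JSON blob: structure is applied at query time."""
    return json.loads(lake[path])["stars"]

print(query_total_sales(lake))                # 200
print(query_stars(lake, "reviews/r1.json"))   # 5
```

A real data lake would sit on object storage such as HDFS or S3 rather than a dictionary, but the loading-first, structuring-later workflow is the same.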

Data quality

In my view, data quality means how reliable data is for decision-making. To enable fact-based decision-making in a company, the first thing to establish is whether a given data bit is an actual fact or irrelevant noise. Data quality acceptance criteria can be identified with the help of business intelligence consulting.

Data quality assurance (DQA) is the process of verifying data relevance and effectiveness. The aspects of data quality include:

  • Accuracy
  • Relevance
  • Reliability
  • Completeness
  • Consistency across data sources
  • Update status
  • Appropriate presentation
  • Accessibility

The way data is collected, processed, stored and managed strongly affects data quality. Therefore, data should be regularly updated, organized, standardized and de-duplicated. This allows for a single view of the information, even when the data is stored across multiple disparate systems such as CRM, ERP, PLM and more.
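The standardize-and-de-duplicate step can be sketched as follows. This is a minimal illustration, assuming customer records keyed by email; the sample records mimic the same customer as seen by two different systems (all names are invented).

```python
def standardize(record):
    """Normalize formatting so the same customer looks the same everywhere."""
    return {
        "name": record["name"].strip().title(),
        "email": record["email"].strip().lower(),
    }

def deduplicate(records):
    """Keep one record per email, merging duplicates from disparate systems."""
    seen = {}
    for rec in map(standardize, records):
        seen.setdefault(rec["email"], rec)   # first standardized copy wins
    return list(seen.values())

# The same customer as seen by, say, a CRM and an ERP system:
raw = [
    {"name": "  jane doe ", "email": "Jane@Example.com"},
    {"name": "JANE DOE",    "email": "jane@example.com "},
]
clean = deduplicate(raw)
print(clean)   # [{'name': 'Jane Doe', 'email': 'jane@example.com'}]
```

Production data quality tools apply far richer matching rules (fuzzy name matching, address normalization and so on), but the principle — normalize first, then collapse duplicates — is the same.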

Business intelligence (BI)

BI is a set of tools, applications and methodologies for collecting, processing and transforming raw data into meaningful, actionable information for business decision-making. Such information can be presented as single reports, comprehensive dashboards and visualizations.

BI helps ventures to unlock a range of advantages, such as:

  • Identifying emerging trends
  • Improving decision making
  • Highlighting performance problems
  • Driving new revenues and more

Effective business intelligence gives a company a holistic view of the current state of things and supports fact-based decisions, whether it is about pricing, launching a new marketing campaign or estimating growth potential.
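At its core, much of BI reporting is aggregation: rolling raw records up into the figures a dashboard displays. Here is a minimal sketch, with invented transaction data, of the kind of roll-up behind a "revenue by region" widget.

```python
from collections import defaultdict

# Raw transaction records, as they might arrive from a sales system
# (regions and amounts are invented for illustration):
transactions = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 200.0},
]

def revenue_by_region(transactions):
    """Roll raw records up into a simple report -- the kind of
    aggregation behind a BI dashboard widget."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["region"]] += t["amount"]
    return dict(totals)

report = revenue_by_region(transactions)
print(report)   # {'North': 320.0, 'South': 80.0}
```

Real BI platforms do this at scale with SQL engines and OLAP cubes, but conceptually each dashboard number is an aggregate like this one.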

Data mining

Data mining is a major part of business intelligence. Sometimes it’s also called knowledge discovery, and that name says it all. Data mining is a multi-perspective data analysis used to:

  • Elicit previously unknown and practically useful information
  • Identify patterns
  • Establish relationships and dependencies

Data mining processes can include:

  • Association – defining patterns where one event is connected to another
  • Sequence (path analysis) – identifying patterns where one event leads to another
  • Classification – assigning items to predefined categories based on their patterns
  • Clustering – recognizing and visualizing previously unknown groups of facts
  • Forecasting – discovering patterns that can point to future events (i.e. predictions)
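Clustering is the easiest of these to show in code. Below is a minimal k-means sketch in plain Python, run on invented 2-D points that form two obvious groups; it is an illustration of the idea, not a production clusterer.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each 2-D point to its nearest
    centre, then move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        centres = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centres[i]                # keep old centre if cluster empties
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two well-separated groups of points (invented data):
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))   # [3, 3]
```

The algorithm "discovers" the two groups without being told what they are — which is exactly the previously-unknown-fact-groups idea above.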

Machine learning and deep learning

Machine learning works similarly to data mining, but with a twist of artificial intelligence. A program is taught to identify common patterns and then makes decisions based on its own experience rather than explicit human commands. However, developers still shape the program’s experience by providing examples and manually correcting its mistakes.
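That teach-by-example loop can be sketched with one of the oldest learning algorithms, a perceptron. The toy task below (invented data) is to output 1 only when both inputs are high; the training loop literally corrects the model's weights after every mistake.

```python
def train_perceptron(examples, epochs=10, lr=1.0):
    """Learn weights from labelled examples: after each wrong answer,
    nudge the weights toward the correct one -- the 'correcting its
    mistakes' loop described above."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, label in examples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = label - pred              # mistake: -1, 0 or +1
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def predict(model, x):
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Toy task (invented data): label is 1 only when both inputs are 1.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
model = train_perceptron(examples)
print([predict(model, x) for x, _ in examples])   # [0, 0, 0, 1]
```

Nothing in the code states the rule explicitly — the program derives it from examples, which is the essential difference from hand-written logic.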

Deep learning is an advanced form of machine learning aimed at creating more complex and independent self-learning programs. The system itself works out the necessary features, performs multi-tier calculations and draws conclusions about its environment. Deep learning relies on multi-layer neural networks and is used in image processing, speech recognition and other breakthrough technologies. Google, Amazon, Microsoft, IBM and other major IT players already use machine and deep learning.


So, there you have it. We broke down a few basic definitions to both refresh some familiar concepts and wrap our heads around the more recent ones. These terms can help a big data newcomer get a head start with the topic, while a savvy analyst might want to suggest other terms to cover or redefine some of the concepts above. Either way, please feel free to leave a comment.