CHAPTER 1:
Introduction to Big Data
Contents
I. Introduction
II. What is Big Data?
III. A brief history of Big Data
IV. Characteristics of Big Data
V. The different types of Big Data
VI. From data to knowledge
VII. Conclusion
I. Introduction
Every day, we generate about 2.5 quintillion bytes of data, and new applications and web pages are created to help people in their daily lives. So much so that 90% of the data in the world has been created in the last two years alone. Users increasingly store data about the different components of systems (people, products, courses, operations, geographic positions, etc.). This data comes from everywhere: sensors used to collect climate information, messages on social media sites, digital images and videos posted online, transaction records of online purchases, GPS signals from mobile phones, and so on.
With the appearance of social networks and Internet of Things applications, data is no longer just numbers that describe amounts, performance indicators, or scales. It also includes unstructured forms, such as website links, emails, Twitter responses, product reviews, pictures/images, and written text on various platforms. The old techniques of data analysis have become incapable of processing this huge quantity of information.
To cope with the explosion in the volume of data, a new technological field has emerged: BIG DATA. Invented by the web giants, these solutions are designed to offer real-time access to huge databases. BIG DATA aims to offer an alternative to traditional database and analysis solutions (SQL Server, Business Intelligence platforms, etc.). Web giants like Yahoo, Facebook, and Google were the first to deploy this type of technology.
BIG DATA encompasses a set of technologies and practices designed to store very large amounts of data and analyze it quickly.
In this course, we will explain this term to show its meaning and the means on which it relies to analyze huge quantities of data.
II. What is Big Data?
Several authors have tried to define the concept of “Big Data”. The Oxford English Dictionary defines it as: “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”.
Wikipedia defines it as “an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using on-hand data management tools or traditional data
processing applications”.
Another definition was given by McKinsey in 2011, which described big data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”.
III. A brief history of Big Data
Although the concept of Big Data is relatively new, large datasets date back to the 1960s and 1970s, when the data world was just starting to take off with the first data centers and the development of relational databases.
Around 2005, people began to realize just how much data users were generating through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze big datasets) was developed that same year. NoSQL also began to be used more and more around this time.
With the advent of the Internet of Things (IoT), more and more objects and devices are
connected to the Internet, collecting data on customer usage patterns and product performance.
The emergence of machine learning has produced even more data.
While big data has come a long way, its usefulness is only beginning to be felt. Cloud computing has further increased its possibilities. The cloud offers tremendous scalability: developers can quickly spin up dedicated clusters to test a subset of data. Graph databases are also gaining in importance, as they can display massive amounts of data in a way that makes analysis fast and comprehensive.
IV. Characteristics of Big Data
The major characteristics of BIG DATA are summarized in five letters “V”: Volume, Velocity, Variety, Veracity, and Value. Figure 1 represents these five characteristics.
1. Volume:
Volume refers to datasets large enough to have once required supercomputers; in the years 1990-2000, however, it became possible to use standard software to analyze or co-analyze large datasets.
2. Velocity:
Velocity refers to the speed at which data is generated, collected, and processed; many applications require data to be analyzed in near real time as it arrives.
3. Variety:
Beyond its volume, BIG DATA presents data centers with another real challenge: the variety of data. BIG DATA comes in the form of structured or unstructured data (text, sensor data, sound, video, route data, log files, etc.).
4. Veracity:
Veracity refers to the reliability and trustworthiness of the data: the biases, noise, and anomalies present in the data must be taken into account before it can be trusted.
5. Value:
Value refers to the benefit that can be extracted from data. For example, analyzing big data can help prevent fraud, save taxpayers' money, and provide better services to citizens, such as healthcare. Big data use cases are emerging across all industries.
The virtue and the ethical aspects of data must also be taken into account. The information must be processed and managed in accordance with privacy and data-compliance regulations such as the GDPR in Europe.
V. The different types of Big Data
Big Data comes from a variety of sources and can therefore take many forms. There are several main categories.
1. Structured data
Structured data follows a predefined model (schema): it is organized in rows and columns, as in relational database tables or spreadsheets, which makes it easy to query.
2. Unstructured data
Unstructured data has no predefined model: free text, images, audio, and video are typical examples.
3. Semi-structured data
Semi-structured data does not fit a rigid schema, but it contains tags or markers that separate its elements, as in JSON or XML documents.
Before you can process and analyze unstructured or semi-structured data, you must prepare and transform it using different types of data mining or data preparation tools.
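As a small illustration of the difference between these categories, the following sketch (with made-up example records) parses a semi-structured JSON document and flattens it into a structured, table-like form:

```python
import json

# A semi-structured document: fields are labeled with tags (keys),
# but the schema is flexible -- "phone" is present only sometimes.
raw = '''
[
  {"name": "Alice", "age": 30, "phone": "555-0100"},
  {"name": "Bob",   "age": 25}
]
'''

records = json.loads(raw)

# Flatten into a structured form: fixed columns, one row per record.
columns = ["name", "age", "phone"]
table = [[rec.get(col) for col in columns] for rec in records]

print(table)  # missing fields become None in the structured view
```

Once in this tabular form, the data can be loaded into a relational table and queried like any other structured data.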
VI. From data to knowledge
To extract knowledge from Big Data, the data passes through several steps. In each step, some of the Big Data “V” dimensions are involved. Figure 2 summarizes the main steps in the process of knowledge extraction from Big Data. The process includes five main steps, as shown in Figure 2.
Step 1: Selection
The selection of data sources is the first step. Each source should be evaluated and classified based on the reliability of its information. At the end of this phase, a classification of reliable sources is established. This step covers the five V dimensions of Big Data, including veracity, that is, the biases, noise, and anomalies present in the data. The key questions that the selection phase raises with experts are:
Statistical question
How to identify the criteria (variables) to be included in the source model and how
to extract these criteria from the sources? How to classify the sources?
Technical question
How to identify the data-modeling paradigm (e.g., relational, document, key-value, graph) best suited to storing a considerable amount of data? How to collect the data automatically? Do we need access to an API, or do we need to develop a scraper/crawler? How to program the automatic data-collection processes?
Domain expert
How to select the right sources? Have we selected the right sources?
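To make the technical question concrete, here is a minimal collection sketch in Python. The URL, the "items"/"id"/"text" field names, and the JSON layout of the response are purely hypothetical assumptions, not a real service:

```python
import json
from urllib.request import urlopen

def parse_page(payload):
    """Project one API response onto the fields our source model needs.

    The field names ("items", "id", "text") are assumptions about a
    hypothetical JSON API.
    """
    return [{"id": rec["id"], "text": rec["text"]} for rec in payload["items"]]

def collect(url):
    """Fetch one page of records from the (hypothetical) API and parse it."""
    with urlopen(url) as response:
        return parse_page(json.loads(response.read().decode("utf-8")))

# Offline example of what a parsed response might look like:
sample = {"items": [{"id": 1, "text": "first record", "extra": "ignored"}]}
print(parse_page(sample))
```

Scheduling `collect` to run periodically (e.g., with a cron job) would answer the question of programming the automatic data-collection process.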
Step 2: Preprocessing
This step involves cleaning the collected data and assessing its veracity, that is, the condition under which the data are accepted or considered to be true, real, and credible. Typical operations include handling missing values and eliminating duplicate records. The key questions that step 2 raises with experts are:
Statistical question
How to assess the consistency of the data? How do you measure the
accuracy of the data? How to estimate the importance of the data?
Technical question
How do I identify duplicates in data records? How to identify missing
values?
Domain expert
How do you identify synonyms that help improve data accuracy? How
to identify the criteria that characterize missing values and duplicates?
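The duplicate- and missing-value questions above can be sketched in a few lines of plain Python; the records and field names are invented for illustration:

```python
# Invented example records; None marks a missing value.
records = [
    {"name": "Alice", "city": "Algiers"},
    {"name": "Bob",   "city": None},
    {"name": "Alice", "city": "Algiers"},  # exact duplicate of the first record
]

# Identify duplicates: keep only the first occurrence of each record.
seen = set()
deduplicated = []
for rec in records:
    key = tuple(sorted(rec.items()))      # hashable signature of the record
    if key not in seen:
        seen.add(key)
        deduplicated.append(rec)

# Identify missing values: list the fields that are None in each record.
missing = {rec["name"]: [field for field, value in rec.items() if value is None]
           for rec in deduplicated}

print(len(deduplicated))  # 2 records remain after deduplication
print(missing)            # {'Alice': [], 'Bob': ['city']}
```

In practice, the domain expert's criteria (synonyms, equivalent spellings, etc.) would refine what counts as a "duplicate" beyond this exact-match signature.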
Step 3: Transformation
This step involves data reduction and projection, which aims to identify a
unified model to represent the data, depending on the purpose of the study. In
addition, it may include the use of dimensionality reduction or transformation
methods to reduce the effective number of variables or to find invariant
representations of the data. Like step 2, the transformation step reduces the complexity of the dataset by taking the variety dimension into account, and it is typically performed with techniques that also support the preprocessing phase. Globally, the data extracted from a source system undergoes a series of transformation procedures that analyze it, manipulate it, and then clean it up before loading it into a knowledge base. At the end of this step, which results in a clean and well-defined data model, the Big Data variety problem should be addressed. The key questions that the transformation phase raises with experts are:
Statistical question
How to measure the completeness of the identified target model? Does the target model preserve the importance of the data at the end of the process?
Technical question
How to develop Big Data procedures to transform raw data into a
target model in a scalable way?
Domain expert
How to identify the destination data format and taxonomy?
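A toy version of such a transformation, mapping raw records from two hypothetical sources onto one unified target model, could look like this (all source names, field names, and values are invented):

```python
# Raw records from two hypothetical sources, each with its own format.
source_a = [{"job_title": "Data Engineer", "town": "Oran"}]
source_b = [{"position": "data engineer", "location": "ORAN"}]

def to_target(record, title_field, city_field):
    """Project a raw record onto the unified target model,
    normalizing the text so that both sources become comparable."""
    return {
        "occupation": record[title_field].strip().title(),
        "city": record[city_field].strip().title(),
    }

unified = ([to_target(r, "job_title", "town") for r in source_a] +
           [to_target(r, "position", "location") for r in source_b])

print(unified)  # both records now share one schema and one spelling
```

The projection step (keeping only "occupation" and "city") is a crude form of the data reduction mentioned above: variables irrelevant to the study are simply dropped.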
Step 4: Data mining
This step applies analysis and mining algorithms (e.g., classification, clustering, association rules) to the transformed data in order to extract patterns and models.
Step 5: Interpretation/Assessment
This final step uses visual paradigms to represent the obtained knowledge, based on the user's goals. This means considering the user's ability to understand the data and their primary purpose. For example, government agencies might be interested in identifying the most popular occupations in their region; companies could focus on tracking skills trends and identifying new skills for certain occupations so that they can design training paths for their employees. As in the previous steps, this phase raises statistical, technical, and domain questions with experts.
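The occupation example above can be sketched with Python's `collections.Counter`; the job-posting labels are invented:

```python
from collections import Counter

# Invented occupation labels extracted from job postings in one region.
occupations = ["nurse", "developer", "nurse", "teacher", "nurse", "developer"]

# Rank occupations by frequency -- a minimal "interpretation" of the data.
ranking = Counter(occupations).most_common()

print(ranking)  # [('nurse', 3), ('developer', 2), ('teacher', 1)]
```

In a real system, a chart of this ranking (rather than the raw list) is what would be shown to the agency, matching the visual paradigms this step calls for.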
VII. Conclusion
Many concepts were presented in this chapter. We presented Big Data as a new field concerned with collecting, analyzing, assessing, and visualizing data. The process starts by collecting data from any source (sensors, smart machines, social networks, ...) in order to analyze it and extract knowledge according to the companies' needs.
In the next chapter, we will present some environments of Big Data. We will present the NoSQL database management system and its characteristics. Next, we will present some platforms for Big Data.