
CHAPTER 1:
Introduction to Big Data
Contents
I. Introduction
II. What is Big Data?
III. A brief history of Big Data
IV. Characteristics of Big Data
V. The different types of Big Data
VI. From data to knowledge
VII. Conclusion


I. Introduction

Every day, we generate 2.5 quintillion bytes of data, and new applications and web pages are created to help people in their daily lives. So much so that 90% of the data in the world has been created in the last two years alone. Users increasingly store data about the different components of systems (persons, products, courses, operations, geographic positions, etc.). This data comes from everywhere: sensors used to collect climate information, messages on social media sites, digital images and videos posted online, transaction records of online purchases, and GPS signals from mobile phones, among other sources.

Initially, information was structured on various supports (files or databases) in a way that allowed users to retrieve it in different forms according to their needs. Several applications can feed their information into one large common database, creating what is called a “data warehouse”, which gathers information about objects of interest. For example, to analyze the impact of an advertisement on customer behavior, we need this information to compute indicators related to each customer's purchases before and after the advertising campaign.
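As a simple illustration of this kind of indicator, the following sketch (hypothetical column names, values, and campaign date, using the pandas library) computes each customer's total purchases before and after a campaign:

    import pandas as pd

    # Hypothetical purchase records from the data warehouse
    purchases = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 3],
        "date": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-02-20",
                                "2023-03-15", "2023-01-30"]),
        "amount": [50.0, 80.0, 20.0, 45.0, 60.0],
    })

    ad_date = pd.Timestamp("2023-03-01")   # assumed date of the advertising campaign

    # Total purchases per customer before and after the campaign
    purchases["period"] = purchases["date"].apply(
        lambda d: "before" if d < ad_date else "after")
    indicator = purchases.pivot_table(index="customer_id", columns="period",
                                      values="amount", aggfunc="sum", fill_value=0)
    print(indicator)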

When social networks and Internet of Things applications appeared, data was no longer just numbers depicting amounts, performance indicators, or scales. It also includes unstructured forms, such as website links, emails, Twitter responses, product reviews, pictures/images, and written text on various platforms. The old data analysis techniques became incapable of processing this huge quantity of information.

To cope with the explosion in the volume of data, a new technological field has emerged: BIG DATA. Invented by the web giants, these solutions are designed to offer real-time access to huge databases. BIG DATA aims to offer an alternative to traditional database and analysis solutions (SQL Server, Business Intelligence platforms, etc.). Web giants like Yahoo, Facebook, and Google were the first to deploy this type of technology.

BIG DATA encompasses a set of technologies and practices designed to store very large
amounts of data and analyze it quickly.

In this course, we will explain this term, its meaning, and the means it relies on to analyze huge quantities of data.

II. What is Big Data?

Several authors have tried to define the concept of “Big Data”. The Oxford English Dictionary defines it as: “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”.

Wikipedia defines it as “an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using on-hand data management tools or traditional data
processing applications”.

Another definition, given by McKinsey (2011), states that big data refers to “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”.

According to these definitions, we are faced with a huge quantity of data that is voluminous, varied, fast-moving, and can be turned into business value. The old tools and techniques (used with traditional databases) are incapable of handling this quantity of data. As a consequence, many new tools and techniques have appeared to manage it.

III. A brief history of Big Data

Although the concept of Big Data is relatively new, large datasets date back to the 1960s and 1970s, when the data world was just starting to take off with the first data centers and the development of relational databases.

In 2005, people began to realize how much data users were generating through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze large datasets) was developed that same year. NoSQL also began to be used more and more around this time.

The development of open-source frameworks such as Hadoop (and, more recently, Spark) has made big data easier to work with and cheaper to store. Since then, the volume of Big Data has exploded. Users continue to generate phenomenal amounts of data, but they are no longer the only ones doing so.


With the advent of the Internet of Things (IoT), more and more objects and devices are
connected to the Internet, collecting data on customer usage patterns and product performance.
The emergence of machine learning has produced even more data.

While big data has come a long way, its usefulness is only beginning to be felt. Cloud computing has further increased its possibilities: the cloud offers tremendous scalability, and developers can quickly spin up dedicated clusters to test a subset of the data. Graph databases are also gaining in importance, as they can display massive amounts of data in a way that makes analysis fast and comprehensive.

IV. Characteristics of Big Data

The major characteristics of BIG DATA are summarized by five “V”s: Volume, Velocity, Variety, Veracity, and Value. Figure 1 represents these five characteristics.

1. Volume:
Large datasets once required supercomputers, but during the 1990s and 2000s it became possible to analyze, or co-analyze, large datasets with standard software.

The volume of stored data is expanding rapidly: digital data created around the world grew from 1.2 zettabytes per year in 2010 to 1.8 zettabytes in 2011, then 2.8 zettabytes in 2012, and was projected to reach 40 zettabytes in 2020. For example, in 2013, Twitter generated 7 terabytes of data every day and Facebook 10 terabytes. Businesses are inundated with growing volumes of data of all types, which can be counted in terabytes and even petabytes.

2. Velocity:

Velocity represents the frequency with which data is generated, captured, shared, and updated. Sometimes even two minutes is too long. For time-sensitive processes such as fraud detection, growing data flows must be analyzed in near real time (a small sliding-window sketch is given after the examples below). For example:


 Analyze 5 million business events per day to identify potential fraud.
 Analyze 500 million detailed call records per day in real time.
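A minimal sketch of this kind of near-real-time check, written in plain Python with entirely hypothetical events and thresholds, counts how many transactions a card makes inside a short sliding window:

    from collections import deque, defaultdict

    WINDOW_SECONDS = 60          # assumed size of the sliding window
    MAX_TX_PER_WINDOW = 5        # assumed alert threshold

    recent = defaultdict(deque)  # card_id -> timestamps of recent transactions

    def process(card_id, timestamp):
        """Flag a card that exceeds the allowed number of transactions per window."""
        window = recent[card_id]
        window.append(timestamp)
        # Drop transactions that have fallen out of the sliding window
        while window and timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_TX_PER_WINDOW:
            print(f"ALERT: card {card_id} made {len(window)} transactions in {WINDOW_SECONDS}s")

    # Hypothetical stream of (card_id, timestamp in seconds) events
    for card, ts in [("c1", 0), ("c1", 10), ("c1", 20), ("c1", 30), ("c1", 40), ("c1", 50)]:
        process(card, ts)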
3. Variety:

Beyond its volume, BIG DATA presents data centers with another real challenge: the variety of the data. BIG DATA comes in the form of structured or unstructured data (text, sensor data, sound, video, route data, log files, etc.). For example:

 Use hundreds of video feeds from surveillance cameras to monitor points of interest;
 Take advantage of the 80% growth in the volume of image, video, and documentary data to improve customer satisfaction.
4. Veracity:

Data veracity indicates the degree of accuracy or confidence in a set of data. The quality of internet data cannot be controlled at the source; it must be assessed during data collection and storage by means of ad hoc analyses, procedures, and tools. Bias, anomalies or inconsistencies, duplication, and volatility are some of the aspects that must be removed to improve big data accuracy. As one might expect, for a given data source, the higher the variety, the harder the veracity is to guarantee. Indeed, the use of natural language introduces a lot of noise that carries no information in a text (for example prepositions, terms unrelated to the subject at hand, conjunctions, and acronyms that must be expanded). All of these issues need to be properly addressed so that unstructured data can generate knowledge in the stages of the Knowledge Discovery in Databases (KDD) process.
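As a small illustration of removing this kind of textual noise, the sketch below (with a purely illustrative, non-exhaustive stop-word list) filters prepositions and conjunctions out of a short text before analysis:

    import re

    # Illustrative list of noise words: prepositions, conjunctions, articles, etc.
    STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "for", "with"}

    def clean_text(text):
        """Lowercase, tokenize, and drop stop words that carry no information."""
        tokens = re.findall(r"[a-z']+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    print(clean_text("Reviews of the product and comments on the delivery"))
    # -> ['reviews', 'product', 'comments', 'delivery']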

5. Value:

The goal of big data analytics is to create a unique competitive advantage for businesses, enabling them to better understand their customers' preferences, segment customers in a more granular fashion, and target specific offerings to specific segments. Public sector organizations are also using Big Data to prevent fraud, save taxpayers' money, and provide better services to citizens, for example in healthcare. Big data use cases are emerging across all industries.

Recently, a sixth V has been added to this list: Virtue.

The virtue, that is, the ethical aspect of data, must be taken into account. Information must be processed and managed in accordance with privacy and data-compliance regulations such as the GDPR in Europe.

Figure 1: The five "V"s of Big Data.

V. The different types of Big Data

Big Data comes from a variety of sources and can therefore take many forms. There are three main categories.

1. Structured data

When data can be stored and processed in a fixed and well-defined format, it is referred to as “structured” data. Thanks to the many advances made in the field of computing, today's techniques allow us to work effectively with this data and derive its full value.


However, even structured data can be problematic because of its massive volume. With data volumes reaching several zettabytes, storage and processing present real challenges.

2. Unstructured data

Data with an unknown format or structure is considered “unstructured” data. This type of data presents many challenges in terms of processing and exploitation because of its massive volume.

A typical example is a heterogeneous data source containing a combination of text files, images, and videos. In the digital and multimedia age, this type of data is increasingly common. Companies therefore have vast amounts of data in their databases but struggle to take advantage of it because of the difficulty of processing this unstructured information.

3. Semi-structured data

Finally, “semi-structured” data is halfway between these two categories. For example, this could be data that is structured in terms of format but is not clearly defined within a database.

Before you can process and analyze unstructured or semi-structured data, it is necessary
to prepare and transform it using different types of data mining or data preparation tools.
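A minimal sketch of the three forms (all values are invented for illustration): a structured record with a fixed schema, a semi-structured JSON document, and unstructured free text:

    import json

    # Structured: fixed, well-defined schema (e.g., a row in a relational table)
    structured_row = ("C001", "Alice", 29, 149.90)

    # Semi-structured: self-describing format, but fields may vary between records
    semi_structured = json.loads(
        '{"customer": "C001", "tags": ["vip", "mobile"], "last_login": "2023-04-01"}')

    # Unstructured: free text, images, video... no predefined schema
    unstructured = "Great phone, but the battery barely lasts a day and the camera app crashes."

    print(structured_row, semi_structured["tags"], len(unstructured.split()))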

VI. From data to knowledge

To extract knowledge from Big Data, the data passes through several steps, and each step involves some of the Big Data "V" dimensions. Figure 2 summarizes the five main steps of this knowledge-extraction process.

Step 1: Data selection

The selection of data sources is the first step. Each source should be evaluated and classified based on the reliability of its information. At the end of this phase, a classification of reliable sources is established. This step covers the five V dimensions of Big Data, including veracity, that is, the biases, noise, and anomalies present in the data. The key questions that the selection phase raises with experts are:

 Statistical question

How to identify the criteria (variables) to be included in the source model and how
to extract these criteria from the sources? How to classify the sources?

 Technical question

How to identify the data modeling paradigm (e.g., relational, document, key-value, graph) to store a considerable amount of data? How to collect the data automatically? Do we need access to an API, or do we need to develop a scraper/crawler (see the sketch after Figure 2)? How to program the automatic data collection processes?

 Domain expert

How to select the right sources? Have we selected the right sources?

Figure 2: Knowledge extraction from Big Data.
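Regarding the automatic collection question above, here is a minimal sketch of collection from a hypothetical REST API using the requests library; the URL, parameters, and field names are purely illustrative:

    import requests

    API_URL = "https://example.org/api/v1/job-ads"   # hypothetical endpoint

    def collect(pages=3):
        """Collect raw records page by page from the source API."""
        records = []
        for page in range(1, pages + 1):
            resp = requests.get(API_URL, params={"page": page}, timeout=10)
            resp.raise_for_status()
            records.extend(resp.json().get("results", []))
        return records

    if __name__ == "__main__":
        raw = collect()
        print(f"Collected {len(raw)} raw records")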

Step 2: Preprocessing

This step involves cleaning up the data to remove noise and aberrations, deciding how to treat missing data, and identifying a function to detect and remove duplicate records (a minimal sketch follows the questions below). Data quality assessment and cleaning are essential tasks in any data-driven decision-making approach, because they ensure the credibility of the overall process, that is, the condition under which the data are accepted or considered to be true, real, and credible. The key questions that step 2 raises with experts are:

 Statistical question
How to assess the consistency of the data? How to measure the accuracy of the data? How to estimate the importance of the data?
 Technical question
How to identify duplicates in data records? How to identify missing values?
 Domain expert
How to identify synonyms that help improve data accuracy? How to identify the criteria that characterize missing values and duplicates?
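A minimal preprocessing sketch using the pandas library; the column names, values, and imputation choices are assumptions for illustration:

    import pandas as pd
    import numpy as np

    # Hypothetical raw records with duplicates and missing values
    raw = pd.DataFrame({
        "customer_id": ["C1", "C2", "C2", "C3"],
        "age":         [29, np.nan, np.nan, 41],
        "city":        ["Algiers", "Oran", "Oran", None],
    })

    clean = raw.drop_duplicates().copy()                        # remove duplicate records
    clean["age"] = clean["age"].fillna(clean["age"].median())   # impute missing ages
    clean = clean.dropna(subset=["city"])                       # drop rows missing a key field

    print(clean)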

Step 3: Transformation

This step involves data reduction and projection, which aims to identify a unified model to represent the data, depending on the purpose of the study. In addition, it may include the use of dimensionality reduction or transformation methods to reduce the effective number of variables or to find invariant representations of the data (a minimal sketch follows the questions below). Like step 2, the transformation step reduces the complexity of the data set by taking the variety dimension into account. It is typically performed using techniques that support the data preprocessing and transformation phases. Globally, the data extracted from a source system undergoes a series of transformation procedures that analyze it, manipulate it, and clean it up before loading it into a knowledge base. At the end of this step, which results in a clean and well-defined data model, the big data variety problem should be addressed. The key questions that the transformation phase raises with experts are:

 Statistical question
How to measure the completeness of the identified target model? Does the target model preserve the importance of the data at the end of the process?


 Technical question
How to develop Big Data procedures to transform raw data into a
target model in a scalable way?
 Domain expert
How to identify the destination data format and taxonomy?
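A minimal dimensionality-reduction sketch using scikit-learn's PCA on a synthetic matrix; the data and the choice of two components are assumptions for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))      # 100 records described by 20 variables

    pca = PCA(n_components=2)           # project onto 2 directions
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                           # (100, 2)
    print(pca.explained_variance_ratio_.round(2))    # share of variance kept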

Step 4: Data mining and machine learning

The objective of this step is to identify suitable AI algorithms (e.g., for classification, prediction, regression, clustering, or information filtering) by searching for interesting patterns in a particular representational form, depending on the purpose of the analysis (a minimal sketch follows the questions below). For example, to classify texts, this step generally requires algorithms dedicated to text classification (based on ontologies or machine learning) to build a classification function that assigns data to one of several predefined classes. This step is crucial because it is mainly devoted to extracting knowledge from the data. The key questions that the data mining and machine learning phase raises with experts are:

 Statistical and technical question
How to select the best algorithm? How to adjust the parameters of the algorithms? How to assess the efficiency of algorithms? How to implement them?
 Domain expert
Which knowledge should be selected and which should be discarded?
How important is the knowledge obtained for domain experts?
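A minimal sketch of the text-classification example using scikit-learn; the training texts, labels, and the choice of a TF-IDF + naive Bayes pipeline are assumptions for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny invented training set: product reviews labelled positive/negative
    texts  = ["great battery and screen", "terrible support, broken on arrival",
              "excellent value for money", "worst purchase ever"]
    labels = ["positive", "negative", "positive", "negative"]

    # TF-IDF features + naive Bayes classifier as the classification function
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    print(model.predict(["the screen is great", "it arrived broken"]))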

Step 5: Interpretation/Assessment

This final step uses visual paradigms to represent the obtained knowledge graphically, based on the user's goals. This means considering the user's ability to understand the data and their primary purpose. For example, government agencies might be interested in identifying the most popular occupations in their region, while companies could focus on tracking skills trends and identifying new skills for certain occupations so that they can design training paths for their employees (a minimal sketch follows the questions below). The key questions that the interpretation/assessment phase raises with experts are:

 Statistical and technical question
How to select the visualization paradigm? How do we select an appropriate visualization model for the knowledge we want to visualize?
 Domain expert
How to provide appropriate knowledge according to companies’
needs? How to identify visual navigation paths for each company?
How to put knowledge at the service of companies?
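A minimal visualization sketch with matplotlib for the "most popular occupations" example; the occupation names and counts are invented:

    import matplotlib.pyplot as plt

    # Invented result of the mining step: occupation -> number of job advertisements
    occupations = {"Data analyst": 420, "Nurse": 380, "Web developer": 350, "Accountant": 290}

    plt.bar(list(occupations.keys()), list(occupations.values()))
    plt.title("Most requested occupations (illustrative data)")
    plt.ylabel("Number of job advertisements")
    plt.xticks(rotation=30)
    plt.tight_layout()
    plt.savefig("occupations.png")     # or plt.show() in an interactive session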

VII. Conclusion

Many concepts were presented in this chapter. We introduced Big Data as a new field concerned with collecting, analyzing, assessing, and visualizing data. The process starts by collecting data from any source (sensors, smart machines, social networks, etc.), then analyzing it to extract knowledge according to companies' needs.

In the next chapter, we will present some Big Data environments. We will present NoSQL database management systems and their characteristics. Next, we will present some platforms for Big Data.
