Module I Big Data
Data
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Big Data
Big Data is data, but of enormous size. Big data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. Such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution. A big data platform generally consists of big data storage, servers, databases, big data management tools, business intelligence tools, and other big data management utilities.
Examples of Big Data
The New York Stock Exchange is an example of Big Data; it generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
Types of Big Data
• Structured
• Unstructured
• Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file, such as the records below.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
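To make the contrast with structured data concrete, here is a minimal illustrative sketch in Python (not part of the original notes; the wrapping <records> root element and the subset of records are assumptions added so the snippet is well-formed and self-contained) that parses such semi-structured records into a fixed, table-like structure:

import xml.etree.ElementTree as ET

# The <rec> elements are wrapped in a root tag here so the snippet is well-formed XML.
xml_data = """
<records>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</records>
"""

# Each record becomes a row with the same fixed columns: name, sex, age.
root = ET.fromstring(xml_data)
rows = [
    {"name": rec.findtext("name"),
     "sex": rec.findtext("sex"),
     "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]
print(rows)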
Data Growth over the years
Characteristics of Big Data
• Volume
• Variety
• Velocity
• Variability
• Veracity
(i) Volume – The name Big Data itself refers to an enormous size. The size of data plays a crucial role in determining its value, and it is the size that determines whether a particular collection of data counts as big data. Hence, 'Volume' is one characteristic that needs to be considered when dealing with Big Data solutions.
(ii) Variety – Variety refers to the heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed at which data is generated and processed to meet demand; this speed determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that data can show at times, which hampers the process of handling and managing the data effectively.
(v) Veracity – The quality of the data being captured can vary greatly, and the accuracy of analysis depends on the veracity of the source data. Veracity relates to the truthfulness, believability, and quality of data. Big data can be messy, and it can contain a lot of misinformation. The reasons for poor reliability of data range from technical error to human error to malicious intent. Some of these are:
1. The source of information may not be authoritative. For example, not all websites are equally trustworthy; Wikipedia is useful, but not all of its content is equally reliable.
2. The data may not be communicated and received correctly because of technical failure. While communicating, a machine may malfunction and record or transmit incorrect data.
3. The data provided and received may also be intentionally wrong, for competitive or security reasons. Malicious information could be spread on social media for strategic reasons.
Big data is the storage and analysis of large data sets. These are complex data sets that can be either structured or unstructured, and they are so large that it is not possible to work on them with traditional analytical tools. One of the major challenges of conventional systems is the uncertainty of the data management landscape: big data is continuously expanding, and new companies and technologies are being developed every day. A big challenge for companies is to find out which technology works best for them without introducing new risks and problems.
Other challenges include:
• Data representation
• Storing
• Analyzing
Decision Making
• Big Data and natural language processing technologies are being used to read and evaluate consumer responses.
• Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the extraction of useful knowledge from large volumes of data, drawing on techniques from a variety of fields, such as artificial intelligence, high-performance computing, pattern recognition, and statistics. Data intelligence platforms and solutions are available from data intelligence companies such as Data Visualization Intelligence, Strategic Data Intelligence, and Global Data Intelligence.
Intelligent data analysis refers to the use of analysis, classification, conversion, extraction, organization, and reasoning methods to extract useful knowledge from data. This process generally consists of a data preparation stage, a data mining stage, and a result validation and explanation stage.
Data preparation involves the integration of required data into a dataset that will be used for data
mining; data mining involves examining large databases in order to generate new information; result
validation involves the verification of patterns produced by data mining algorithms; and result
explanation involves the intuitive communication of results.
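As a purely illustrative sketch of these three stages (the records, the frequency-counting "mining" step, and the support threshold are all invented for the example, not taken from the notes), the pipeline could look roughly like this in Python:

from collections import Counter

# Stage 1: data preparation - integrate raw records from several sources into one dataset.
raw_sources = [
    [{"user": "a", "item": "milk"}, {"user": "b", "item": "bread"}],
    [{"user": "a", "item": "bread"}, {"user": "c", "item": "milk"}],
]
dataset = [rec for source in raw_sources for rec in source if rec.get("item")]

# Stage 2: data mining - derive new information (here, simple item-frequency patterns).
patterns = Counter(rec["item"] for rec in dataset)

# Stage 3: result validation and explanation - keep only patterns that meet a minimum
# support threshold and report them in an understandable form.
min_support = 2
for item, count in patterns.items():
    if count >= min_support:
        print(f"'{item}' appears in {count} records (support >= {min_support})")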
The increased use of technology in the past few years has also led to an increase in the amount of data generated every minute; everything we do online generates some sort of data.
A report series, Data Never Sleeps, by DOMO, covers the amount of data being generated every minute. The eighth edition of the report shows that a single internet minute includes over 400,000 hours of video streamed on Netflix, 500 hours of video uploaded by users to YouTube, and almost 42 million messages shared through WhatsApp.
The number of internet users has reached 4.5 billion, nearly 63% of the total world population, and the number is expected to increase in the coming years as technology continues to expand. These huge amounts of structured, semi-structured, and unstructured data are referred to as big data. Businesses analyze and make use of these data to gain better knowledge about their customers.
Big Data Analytics is the process that enables data scientists to extract insight from the mass of big data being generated. This analysis of big data is done using tools that we call big data analytics tools.
R-Programming
R is a domain-specific programming language designed for statistical analysis, scientific computing, and data visualization. Ross Ihaka and Robert Gentleman developed it in 1993.
It is among the top big data analytics tools because R helps data scientists create statistical engines that can provide better and more precise insights, thanks to relevant and accurate data collection.
Scikit-learn
• A Python machine-learning library with a large collection of algorithms that can handle medium-sized datasets.
SPSS
• A product of IBM for statistical analysis.
• Mostly used to analyze survey data.
• It offers predictive models and delivers them to individuals, groups, systems, and the enterprise.
Apache Hadoop
Apache Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware.
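As a hedged illustration of how applications are often written for Hadoop via Hadoop Streaming, the sketch below (an assumed example, not taken from the notes) implements word count as a plain Python script whose mapper and reducer read from standard input and write to standard output while Hadoop distributes the work across the cluster; the file name wordcount.py and the local test pipeline in the comment are hypothetical.

import sys
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" pair for every word in the input split.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop delivers mapper output sorted by key, so identical words arrive adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    # Hypothetical local test of the same logic:
    #   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if stage == "map" else reducer
    for record in step(sys.stdin):
        print(record)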
• Hadoop
• Spark
• Microsoft HDInsight