Unit 1 Notes
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.
Big Data is also data, but of enormous size. Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
“Extremely large data sets that may be analyzed computationally to reveal patterns, trends and associations, especially relating to human behavior and interactions, are known as Big Data.”
Examples of Big Data
Stock Exchange
The Stock Exchange generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
Jet Engine
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
Unit        Equivalent          Size in bytes
Byte        8 bits              1
Zettabyte   1,024 exabytes      1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes    1,208,925,819,614,629,174,706,176
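These figures are simply successive powers of 1,024. A quick Python check of the two large values in the table (a minimal sketch, not specific to any Big Data tool):

# Verify the storage-unit figures above (each unit is 1,024 of the previous one).
KIB = 1024
zettabyte_in_bytes = KIB ** 7   # 2**70 bytes
yottabyte_in_bytes = KIB ** 8   # 2**80 bytes
print(f"{zettabyte_in_bytes:,}")   # 1,180,591,620,717,411,303,424
print(f"{yottabyte_in_bytes:,}")   # 1,208,925,819,614,629,174,706,176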
Types of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
● Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data.
● Over time, computer science talent has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it.
● However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, typical sizes being in the range of multiple zettabytes.
● Data stored in a relational database management system is one example of 'structured' data.
Unstructured
● Any data with unknown form or the structure is classified as unstructured data.
● In addition to the size being huge, un-structured data poses multiple challenges in terms
of its processing for deriving value out of it.
● A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos etc.
● Nowadays organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value from it since this data is in its raw or unstructured form.
Semi-structured
● Semi-structured data can contain both forms of data. It appears structured in form, but it is not actually defined in the way a table in a relational DBMS is.
● A typical example of semi-structured data is data represented in an XML or JSON file.
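For illustration, here is a small, hypothetical Python sketch contrasting the three forms: a structured record (fixed fields, like a table row), a semi-structured record (self-describing JSON with optional nesting), and unstructured data (free text). All values are made up.

import json

# Structured: fixed format, every record has the same fields (like a row in a table).
structured_row = {"id": 101, "name": "Asha", "city": "Pune", "age": 29}

# Semi-structured: self-describing tags, but fields and nesting can vary per record.
semi_structured = json.loads(
    '{"id": 102, "name": "Ravi", "contacts": {"email": "ravi@example.com"}}'
)

# Unstructured: no predefined schema at all (free text, images, video, ...).
unstructured = "Customer called to complain that the delivery was two days late."

print(structured_row["city"])                # easy: fields are known in advance
print(semi_structured["contacts"]["email"])  # possible, but the path may vary per record
print("late" in unstructured.lower())        # needs text processing to extract meaning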
Volume:
Volume means “how much data is generated”. Nowadays, organizations, human beings and systems are generating or receiving very vast amounts of data, from terabytes (TB) to petabytes (PB) to exabytes (EB) and more.
The size of data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends on its volume. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.
Volume = a very large amount of data
Velocity:
Velocity means “how fast data is produced”. Nowadays, organizations, human beings and systems are generating huge amounts of data at a very fast rate.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Velocity = data produced at a very fast rate
Variety:
Variety means “different forms of data”. Nowadays, organizations, human beings and systems are generating very large amounts of data at a very fast rate and in different formats.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
Variety = data produced in different formats
Veracity
Veracity means “the quality, correctness or accuracy of captured data”. Of the 4 Vs, it is the most important for any Big Data solution, because without correct information there is no use in storing large amounts of data at a fast rate and in different formats. The data should deliver correct business value.
Veracity = the correctness of data
History of big data
The first trace of big data is seen way back in 1663 when John Graunt dealt with overwhelming
amounts of information while he studied the bubonic plague, which was haunting Europe at the
time. Graunt was the first-ever person to use statistical data analysis.
Later, in the early 1800s, the field of statistics expanded to include collecting and analyzing data.
The world first saw the problem with the overwhelming amount of data in 1880.
The US Census Bureau announced that they estimated it would take eight years to handle and
process the data collected during the census program that year.
In 1881, a man from the Bureau named Herman Hollerith invented the Hollerith Tabulating
Machine that reduced the calculation work.
Throughout the 20th century, data evolved at an unexpected speed, and big data became the core of that evolution. Machines for storing information magnetically, machines for scanning patterns in messages, and computers were also created during that time.
In 1965, the US government built the first data centre, with the intention of storing millions of
fingerprint sets and tax returns.
Big Data Platform
A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.
It is an enterprise-class IT platform that enables organizations to develop, deploy, operate and manage a big data infrastructure/environment.
Big data platforms generally consist of big data storage, servers, databases, big data management, business intelligence and other big data management utilities. They also support custom development, querying and integration with other systems. The primary benefit of a big data platform is that it reduces the complexity of multiple vendors/solutions into one cohesive solution.
Big data platforms are also delivered through the cloud, where the provider offers an all-inclusive big data solution and services.
Features of a Big Data platform:
● Big Data platforms should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or due to changes in business processes.
● It should support linear scale-out.
● It should have the capability for rapid deployment.
● It should support a variety of data formats.
● The platform should provide data analysis and reporting tools.
● It should provide real-time data analysis software.
● It should have tools for searching through large data sets.
There are four main layers to a Big Data architecture:
1. Data Ingestion
This layer is responsible for collecting and storing data from various sources. In Big Data, data ingestion is the process of extracting data from various sources and loading it into a data repository. Data ingestion is a key component of a Big Data architecture because it determines how data will be ingested, transformed, and stored.
2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and preparing the data for analysis. This layer is critical for ensuring that the data is of high quality and ready to be used in the future.
3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analyzed. This layer is essential for ensuring that the data is accessible and
available to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data that
humans can easily understand. This layer is important for making the data accessible.
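As an illustration, the four layers can be viewed as stages of a single pipeline. The following is a minimal, hypothetical sketch in Python; the function names, the CSV files and the "country" field are assumptions used only to show how data moves from ingestion to visualization.

import csv
from collections import Counter

def ingest(path):
    # Data ingestion: read raw records from a source (here, a local CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def process(rows):
    # Data processing: clean and prepare records (drop rows missing a 'country' field).
    return [r for r in rows if r.get("country")]

def store(rows, path):
    # Data storage: persist the prepared records so other layers can access them.
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def visualize(rows):
    # Data visualization: summarize the data in a human-readable form.
    counts = Counter(r["country"] for r in rows)
    for country, n in counts.most_common(5):
        print(f"{country:15s} {'#' * n}")

rows = process(ingest("users_raw.csv"))   # hypothetical input file
store(rows, "users_clean.csv")
visualize(rows)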
Most big data architectures include some or all of the following components:
● Data sources: All big data solutions start with one or more data sources. Examples include:
○ Application data stores, such as relational databases.
○ Static files produced by applications, such as web server log files.
○ Real-time data sources, such as IoT devices.
● Data storage: Data for batch processing operations is typically stored in a distributed file
store that can hold high volumes of large files in various formats. This kind of store is often
called a data lake. Options for implementing this storage include Azure Data Lake Store or
blob containers in Azure Storage.
● Batch processing: Because the data sets are so large, often a big data solution must process
data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data
for analysis. Usually these jobs involve reading source files, processing them, and writing
the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics,
using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java,
Scala, or Python programs in an HDInsight Spark cluster (a minimal batch-job sketch appears after this list).
● Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This
might be a simple data store, where incoming messages are dropped into a folder for
processing. However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other message queuing
semantics. Options include Azure Event Hubs, Azure IoT Hubs, and Kafka.
● Stream processing: After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis. The processed stream
data is then written to an output sink.
● Analytical data store:
● Many big data solutions prepare data for analysis and then serve the processed data in
a structured format that can be queried using analytical tools.
● The data could be presented through a low-latency NoSQL technology such as HBase,
or an interactive Hive database that provides a metadata abstraction over data files in
the distributed data store.
● Orchestration:
● Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks,
load the processed data into an analytical data store, or push the results straight to a
report or dashboard.
● To automate these workflows, we can use an orchestration technology such as Azure
Data Factory or Apache Oozie and Sqoop.
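To make the batch processing component concrete, here is a minimal PySpark sketch of a long-running batch job that reads raw source files, filters and aggregates them, and writes the prepared output to new files. The input/output paths and the column names are hypothetical, not part of any specific solution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-prepare").getOrCreate()

# Read raw source files from the data lake (path and schema are assumptions).
logs = spark.read.json("hdfs:///data/raw/web_logs/")

# Filter and aggregate: count successful requests per day.
daily = (
    logs.filter(F.col("status") == 200)
        .groupBy("date")
        .agg(F.count("*").alias("requests"))
)

# Write the prepared output to new files for the analytical data store.
daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_requests/")

spark.stop()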
Big Data Technology Component:
1. Cost Saving:
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses
when they have to store large amounts of data. These tools help organizations in
identifying more effective ways of doing business.
2. Time Saving:
Tools like Hadoop help organizations analyze data immediately, thus helping them make quick decisions based on the learnings.
Big Data analytics has expanded its roots in all the fields. This results in the use of Big Data in a
wide range of industries including Finance and Banking, Healthcare, Education, Government,
Retail, Manufacturing, and many more.
There are many companies like Amazon, Netflix, Spotify, LinkedIn, Swiggy, etc. which use big data analytics. The banking sector makes the maximum use of Big Data analytics. The education sector is also using data analytics to enhance students’ performance as well as to make teaching easier for instructors.
Big Data analytics help retailers from traditional to e-commerce to understand customer behavior
and recommend products as per customer interest. This helps them in developing new and
improved products which help the firm enormously.
The term Big Data refers to a large amount of complex and unprocessed data.
Healthcare
Big data has started making a massive difference in the healthcare sector. With the help of predictive analytics, medical professionals and healthcare personnel can now provide personalized healthcare to individual patients as well.
Government
The government and military also use the technology at high rates. Consider the figures that the government keeps on record; in the military, a fighter plane may be required to process petabytes of data.
Government agencies use Big Data to run many agencies, manage utilities, deal with traffic jams, and tackle crime such as hacking and online fraud.
E-commerce
E-commerce is another application of Big Data. Maintaining relationships with customers is essential for the e-commerce industry. E-commerce websites use many marketing ideas to retail merchandise to customers, manage transactions, and implement better, more innovative strategies to improve their businesses with Big Data.
Big Data Security
Big Data Security is the collective term for all the measures and tools used to guard both the data
and analytics methods against attacks, theft, or other malicious activities that could cause a
problem or negatively affect them. Like other forms of data, it can be compromised by attacks originating either from the online or the offline sphere.
Like other areas, Big Data faces security issues and attacks every single minute; these attacks can be on different components of it, such as the stored data or the data source.
Why is it important?
Today almost every organization is thinking of adopting Big Data because they see its potential and want to utilize its power. They are using Hadoop to process these large data sets, and securing that data is the step they are most concerned about; regardless of organization size, everyone is trying to secure their data.
Because it stores different kinds of data from various sources, security is essential: almost every enterprise that uses Big Data has some form of sensitive data which needs to be protected. Sensitive data can be the user’s credit card details, banking details, and passwords. Big Data is not a small thing, and it cannot be described only in terms of size, as size is just one of its main features. To secure it, organizations can adopt various strategies, such as keeping out unauthorized users and intrusions with firewalls, making user authentication reliable, training end users, and many others.
The basic architecture to secure any Big Data platform contains several different stages.
Data Ethics
Data ethics encompasses the moral obligations of gathering, protecting, and using personally identifiable information and how it affects individuals.
1. Private customer data and identity should remain private: Privacy does not mean secrecy, as personal data might need to be audited based on legal requirements, but private data obtained from a person with their consent should not be exposed for use by other businesses or individuals with any trace of their identity.
2. Shared private information should be treated confidentially: Third-party companies share sensitive data (medical, financial or locational) and need restrictions on whether and how that information can be shared further.
3. Customers should have a transparent view of how their data is being used or sold, and the ability to manage the flow of their private information across massive, third-party analytical systems.
4. Big Data should not interfere with human will: Big data analytics can moderate and even
determine who we are before we make up our minds. Companies need to consider the kind
of predictions and inferences that should be allowed and those that should not.
5. Big data should not institutionalize unfair biases like racism or sexism. Machine learning
algorithms can absorb unconscious biases in a population and amplify them via training
samples.
BIG DATA ANALYTICS
● Big Data analytics is a process used to extract meaningful insights, such as hidden
patterns, unknown correlations, market trends, and customer preferences. Data analytics
technologies and techniques give organizations a way to analyze data sets and gather new
information.
● Private companies and research institutions capture terabytes of data about their users’ interactions, business and social media, as well as data from sensors on devices such as mobile phones and automobiles.
● Big data analytics involves collecting data from different sources, managing it so that it becomes available to be consumed by analysts, and finally delivering data products useful to the organization’s business.
● The process of converting large amounts of unstructured raw data, retrieved from
different sources to a data product useful for organizations forms the core of Big Data
analytics.
1. Data professionals collect data from a variety of different sources. Often, it is a mix of
semi-structured and unstructured data. While each organization will use different data
streams, some common sources include:
● cloud applications;
● mobile applications;
2. Data is prepared and processed. After data is collected and stored in a data warehouse or
data lake, data professionals must organize, configure and partition the data properly for
analytical queries. Thorough data preparation and processing makes for higher
performance from analytical queries.
3. Data is cleansed to improve its quality. Data professionals scrub the data using scripting
tools or data quality software. They look for any errors or inconsistencies, such as
duplications or formatting mistakes, and organize and tidy up the data.
4. The collected, processed and cleaned data is analyzed with analytics software. This
includes tools for:
● data mining, which sifts through data sets in search of patterns and relationships
● predictive analytics, which builds models to forecast customer behavior and other
future actions, scenarios and trends
● machine learning, which taps various algorithms to analyze large data sets
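As a small illustration of steps 2 to 4, the sketch below uses pandas to organize, cleanse and analyze a collected data set. The file name and the column names are hypothetical.

import pandas as pd

# Step 2: load and organize the collected data (hypothetical CSV export).
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Step 3: cleanse the data -- drop duplicates and rows with missing amounts.
orders = orders.drop_duplicates().dropna(subset=["amount"])

# Step 4: analyze -- revenue per customer segment, a simple search for patterns.
revenue_by_segment = (
    orders.groupby("segment")["amount"]
          .sum()
          .sort_values(ascending=False)
)
print(revenue_by_segment)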
The main benefits of Big Data analytics include:
● Cost savings, which can result from new business process efficiencies and optimizations.
● A better understanding of customer needs, behavior and sentiment, which can lead to better marketing insights, as well as provide information for product development.
● Improved, better-informed risk management strategies that draw from large sample sizes of data.
Conventional Systems
● Big data is a huge amount of data which is beyond the processing capacity of
conventional database systems to manage and analyze the data in a specific time interval.
● These are complex data sets that can be both structured or unstructured.
● They are so large that it is not possible to work on them with traditional analytical tools.
● One of the major challenges of conventional systems was the uncertainty of the Data
Management Landscape.
● Big data is continuously expanding; there are new companies and technologies that are being developed every day.
● A big challenge for companies is to find out which technology works best for them
without the introduction of new risks and problems.
Intelligent Data Analysis (IDA) is one of the most important approaches in the field of data
mining.
Based on the basic principles of IDA and the features of datasets that IDA handles, the
development of IDA is briefly summarized from three aspects :
● Algorithm principle
● The scale
Intelligent Data Analysis (IDA) is one of the major issues in artificial intelligence and
information.
Intelligent data analysis discloses hidden facts that are not previously known and provides potentially important information or facts from large quantities of data.
Based on machine learning, artificial intelligence, pattern recognition, and records and visualization technology, IDA helps obtain useful information, necessary data and interesting models from the large amount of data available online in order to make the right choices.
IDA generally includes three stages:
(1) Preparation of data: Data preparation involves selecting the required data from the relevant data sources and integrating this into a data set to be used for data mining.
(2) Rule finding: Rule finding is working out rules contained in the data set by means of certain
methods or algorithms.
(3) Data validation and Explanation: Result validation requires examining these rules, and
result explanation is giving intuitive, reasonable and understandable descriptions using logical
reasoning.
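As a small illustration of these three stages, here is a hypothetical sketch in Python using scikit-learn: the data is prepared and split, a shallow decision tree works out explicit rules from the data set, and those rules are then validated on held-out data. The bundled Iris data set stands in for data selected from a relevant source.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# (1) Preparation of data: select the relevant data and split it for later validation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (2) Rule finding: a shallow decision tree works out rules contained in the data set.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(tree))  # human-readable if/else rules

# (3) Data validation and explanation: examine how well the rules hold on unseen data.
print("held-out accuracy:", tree.score(X_test, y_test))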
Analytic processes
Big Data Analytics is the process of collecting large chunks of structured/unstructured data,
segregating and analyzing it and discovering the patterns and other useful business insights from
it.
These days, organizations are realizing the value they get out of big data analytics and hence
they are deploying big data tools and processes to bring more efficiency in their work
environment.
Many big data tools and processes are being utilized by companies these days in the processes of
discovering insights and supporting decision making.
Big data processing is a set of techniques or programming models to access large-scale data to
extract useful information for supporting and providing decisions.
1. Business Understanding:
● Business objectives are defined in this phase.
● Whenever any requirement occurs, we need to assess the situation, determine the data mining goals and then produce the project plan as per the requirement.
2. Data Exploration:
● In this phase, we gather initial data, describe and explore the data and verify data quality to ensure it contains the data we require.
● Data collected from the various sources is described in terms of its application and the need for the project in this phase. This is also known as data exploration.
3. Data Preparation:
● We need to format the data to get the appropriate data.
● Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
4. Data Modeling:
● In this phase, we select the modeling techniques, generate test designs, build a model and assess the model built.
● The data model is built to analyze relationships between the various selected objects in the data.
● Test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
5. Deployment:
● In this phase, we deploy the results of the analysis. This is also known as reviewing the project.
Reports and analytics help businesses improve operational efficiency and productivity, but in
different ways. While reports explain what is happening, analytics helps identify why it is
happening. Reporting summarizes and organizes data in easily digestible ways while analytics
enables questioning and exploring that data further. It provides invaluable insights into trends
and helps create strategies to help improve operations, customer satisfaction, growth, and other
business metrics.
Reporting and analysis are both important for an organization to make informed decisions by
presenting data in a format that is easy to understand. In reporting, data is brought together from
different sources and presented in an easy-to-consume format. Typically, modern reporting apps
today offer next-generation dashboards with high-level data visualization capabilities. There are
several types of reports being generated by companies including financial reports, accounting
reports, operational reports, market reports, and more. This helps understand how each function
is performing at a glance. But for further insights, it requires analytics.
Analytics enables business users to cull out insights from data, spot trends, and help make better
decisions. Next-generation analytics takes advantage of emerging technologies like AI, NLP, and
machine learning to offer predictive insights based on historical and real-time data.
One of the key differences between reporting and analytics is that, while a report involves
organizing data into summaries, analysis involves inspecting, cleaning, transforming, and
modeling these reports to gain insights for a specific purpose.
1. Purpose: Reporting involves extracting data from different sources within an organization and
monitoring it to gain an understanding of the performance of the various functions. By linking
data from across functions, it helps create a cross-channel view that facilitates comparison to
understand data easily. Analysis, on the other hand, means being able to interpret data at a deeper level and provide recommendations on actions.
4. People: Reporting requires repetitive tasks that can be automated. It is often used by
functional business heads who monitor specific business metrics. Analytics requires
customization and therefore depends on data analysts and scientists. Also, it is used by business
leaders to make data-driven decisions.
5. Value Proposition: This is like comparing apples to oranges. Both reporting and analytics
serve a different purpose. By understanding the purpose and using them correctly, businesses can
derive immense value from both.