Unit-1 BDA
What is Data?
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
What is Big Data?
Big data is a large collection of structured, semi-structured, and unstructured data. It is data that arrives in much higher volumes, at a much faster rate, in a wider variety of file formats, and from a wider variety of sources. Sources that generate big data include:
o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide. Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this, they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions. The New York Stock Exchange, for example, generates about one terabyte of new trade data per day.
Types of Big Data
a) Structured
b) Unstructured
c) Semi-structured
a) Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. It is stored and displayed in a fixed format of rows and columns, such as a table in a relational database.
b) Unstructured
Any data with an unknown form or structure is classified as unstructured data. Examples include text documents, images, audio, and video files.
c) Semi-structured
Semi-structured data can contain both forms of data. It appears structured in form, but it is not actually defined with, for example, a table definition in a relational DBMS. Semi-structured data has some structure but does not conform to a data model; it is also known as partially structured or self-describing data. An example of semi-structured data is data represented in an XML file, such as the records below:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
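For illustration, records like those above can be parsed with Python's standard xml.etree.ElementTree module and turned into fixed-format (structured) rows; the <recs> wrapper below is added only because the fragments have no single root element.

import xml.etree.ElementTree as ET

# The <rec> fragments lack a single root element, so wrap them first.
fragments = (
    "<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>"
    "<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>"
)
root = ET.fromstring("<recs>" + fragments + "</recs>")

# Convert each self-describing record into a fixed-format (structured) row.
rows = [
    (rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))
    for rec in root.findall("rec")
]
print(rows)  # [('Prashant Rao', 'Male', 35), ('Seema R.', 'Female', 41)]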
Characteristics of Big Data (the V's)
(1) Volume – The name Big Data itself is related to an enormous size. Big Data is the vast 'volume' of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Facebook, for example, generates approximately a billion messages, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
(2) Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide the demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
(3) Variety
Variety refers to the many forms the data takes: structured, semi-structured, and unstructured, arriving in a wide variety of file formats and from a wide variety of sources.
(4) Veracity:
Veracity refers to inconsistencies and uncertainty in data: the data that is available can sometimes be messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: data in bulk can create confusion, whereas a smaller amount of data may convey only half or incomplete information.
(5) Value
Value is an essential characteristic of big data. It is not the amount of data we store or process that matters; it is the valuable and reliable data that we store, process, and analyze.
(6) Variability:
How fast, and to what extent, is the structure of your data changing? How often does the meaning or shape of your data change?
Example: it is as if you were eating the same ice cream daily but the taste kept changing.
Phases of a Big Data Process
1. Ingestion
Ingestion refers to the process of gathering and preparing the data. You’d use the ETL (extract, transform, and load) process to prepare your data. In this phase, you have to identify your data sources, determine whether you’ll gather the data in batches or stream it, and prepare it through cleansing, massaging, and organization. You perform the extract step while gathering the data and the transform step while optimizing it.
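A minimal sketch of the extract and transform steps described above, assuming a hypothetical sales.csv source file with customer and amount columns, using only Python's standard library:

import csv

# Extract: read raw records from a hypothetical source file (sales.csv).
with open("sales.csv", newline="") as src:
    raw_rows = list(csv.DictReader(src))

# Transform: cleanse and organize the data (drop incomplete rows, fix types).
clean_rows = [
    {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
    for r in raw_rows
    if r.get("customer") and r.get("amount")
]

# The cleaned rows are now ready for the "load" step, which is performed
# in the Storage phase described next.
print(len(clean_rows), "rows ready to load")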
2. Storage
Once you have gathered the necessary data, you’d need to store it. Here, you’ll perform the final
step of the ETL, the load process. You’d store your data in a data warehouse or a data lake,
depending on your requirements. This is why it’s crucial to understand your organization’s goals
while performing any big data process.
3. Processing
In this phase, the stored data is processed so that it is ready for analysis: it is filtered, transformed, and aggregated, often using distributed processing frameworks (as described in the platform workflow below).
4. Analysis
In this phase of your big data process, you’d analyze the data to generate valuable insights for
your organization. There are four kinds of big data analytics: prescriptive, predictive, descriptive,
and diagnostic. You’d use artificial intelligence and machine learning algorithms in this phase to
analyze the data.
5. Consumption
This is the final phase of a big data process. Once you have analyzed the data and have found the
insights, you have to share them with others. Here, you’d have to utilize data visualization and
data storytelling to share your insights effectively with a non-technical audience such as
stakeholders and project managers.
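A minimal sketch of this consumption step, assuming the matplotlib plotting library and made-up regional sales figures produced by the analysis phase:

import matplotlib.pyplot as plt

# Hypothetical insight produced by the analysis phase: sales by region.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 143, 87]

plt.bar(regions, sales)                 # simple bar chart for stakeholders
plt.title("Quarterly sales by region")  # plain-language title, no jargon
plt.ylabel("Sales (units)")
plt.show()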
Big Data Platform
A big data platform acts as an organized storage medium for large amounts of data. Big data platforms utilize a combination of data management hardware and software tools to store aggregated data sets, usually in the cloud.
Big Data platform workflow can be divided into the following stages:
1. Data Collection
Big Data platforms collect data from various sources, such as sensors, weblogs, social
media, and other databases.
2. Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed File
System (HDFS), Amazon S3, or Google Cloud Storage.
3. Data Processing
Data Processing involves tasks such as filtering, transforming, and aggregating the data. This can be done using distributed processing frameworks such as Apache Spark, Apache Flink, or Apache Storm (a minimal Spark sketch follows this list).
4. Data Analytics
After data is processed, it is then analyzed with analytics tools and techniques, such as
machine learning algorithms, predictive analytics, and data visualization.
5. Data Governance
Data Governance (data cataloging, data quality management, and data lineage tracking)
ensures the accuracy, completeness, and security of the data.
6. Data Management
Big data platforms provide management capabilities that enable organizations to back up, recover, and archive their data.
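The sketch below, referenced in stage 3 above, shows what the filtering and aggregating step might look like with PySpark; the file path, column names, and threshold are illustrative assumptions.

from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed).
spark = SparkSession.builder.appName("processing-demo").getOrCreate()

# Hypothetical raw data collected in the earlier stages.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Filter and aggregate: keep large transactions and count them per country.
summary = (
    events.filter(events.amount > 100)
          .groupBy("country")
          .count()
)
summary.show()
spark.stop()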
Big data analytics is the process of collecting, examining, and analyzing large amounts of
data to discover market trends, insights, and patterns that can help companies make better
business decisions.
Advantages of Big Data
There are numerous advantages of Big Data for organizations. Some of the key ones are as follows:
1. Enhanced Decision-making
Big data implementations can help businesses and organizations make better-informed decisions in less time. They allow them to use outside intelligence, such as search engines and social media platforms, to fine-tune their strategies. Big data can identify trends and patterns that would otherwise have been invisible, helping companies avoid errors.
2. Improved Customer Service
Another huge impact big data can have across industries is in the customer service department. Companies are replacing the traditional customer feedback system with data-driven solutions. Such solutions can analyze customer feedback more efficiently and help companies offer better service to their consumers.
3. Efficiency Optimization
Organizations use big data to identify their weak areas. They then use these findings to resolve those issues and enhance their operations substantially. For example, Big Data has substantially helped the manufacturing sector improve its efficiency through IoT and robotics.
4. Real-time Decision Making
Big Data has transformed several areas by enabling real-time tracking, such as inventory management, supply chain optimization, anti-money laundering, and fraud detection in banking & finance.
5. Reduced Costs
Surveys conducted by New Vantage and Syncsort (now Precisely) reveal that big data analytics has helped businesses reduce their expenses significantly. 66.7% of survey respondents from New Vantage claimed that they had started using big data to reduce expenses. Furthermore, 59.4% of survey respondents from Syncsort claimed that big data tools helped them reduce costs and increase operational efficiency.
6. Fraud Detection
Financial companies, in particular, use big data to detect fraud. Data analysts use machine learning algorithms and artificial intelligence to detect anomalies in transaction patterns. Such anomalies indicate that something is out of order or mismatched, giving clues about possible fraud.
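A minimal sketch of this anomaly-detection idea, assuming scikit-learn and a handful of made-up transaction amounts; real fraud systems use far richer features.

from sklearn.ensemble import IsolationForest

# Made-up transaction amounts; the last value is an obvious outlier.
amounts = [[25.0], [31.5], [28.0], [30.2], [27.8], [29.1], [5000.0]]

model = IsolationForest(contamination=0.15, random_state=0)
model.fit(amounts)

# predict() returns -1 for anomalies and 1 for normal-looking transactions.
for amount, label in zip(amounts, model.predict(amounts)):
    if label == -1:
        print("Possible fraud:", amount[0])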
7. Increased productivity
According to a survey from Syncsort, 59.9% of survey respondents have claimed that they were
using big data analytics tools like Spark and Hadoop to increase productivity. This increase in
productivity has, in turn, helped them to improve customer retention and boost sales.
Modern big data tools help data scientists and analysts to analyze a large amount of data
efficiently, enabling them to have a quick overview of more information. This also increases
their productivity levels.
8. Targeted Marketing
Since big data analytics provides businesses with more information, they can use that data to create more targeted marketing campaigns and special, highly personalized offers for each individual client.
Disadvantages:
1. Lack of talent
According to a survey by AtScale, the lack of big data experts and data scientists has been the
biggest challenge in this field for the past three years. Currently, many IT professionals don’t
know how to carry out big data analytics as it requires a different skill set. Thus, finding data
scientists who are also experts in big data can be challenging.
Big data experts and data scientists are two highly paid careers in the data science field.
Therefore, hiring big data analysts can be very expensive for companies, especially for startups.
Some companies have to wait for a long time to hire the required staff to continue their big data
analytics tasks.
2. Security risks
Most of the time, companies collect sensitive information for big data analytics. Such data needs protection, and a lack of proper maintenance creates security risks.
Besides, holding huge data sets can attract unwanted attention from hackers, and your business may become the target of a potential cyber-attack. As you know, data breaches have become the biggest threat to many companies today.
Another risk with big data is that unless you take all necessary precautions, important
information can be leaked to competitors.
3. Compliance
The need to comply with government legislation is also a drawback of big data. If big data contains personal or confidential information, the company must make sure that it follows government requirements and industry standards for storing, handling, maintaining, and processing that data.
So, data governance tasks, transmission, and storage will become more difficult to manage as the
big data volumes increase.
History of Big Data
The emergence of data, and big data, has a long and storied history. There were many advancements in technology during World War II, which were primarily made to serve military purposes. Over time, though, those advancements became useful to the commercial sector and eventually to the general public, with personal computing becoming a viable option for the everyday consumer.
The first personal desktop computer to feature a Graphical User Interface (GUI) was Lisa,
released by Apple Computers in 1983. Throughout the 1980s, companies like Apple, Microsoft,
and IBM would release a wide range of personal desktop computers, which led to a surge in
people buying their own personal computers and being able to use them at home for the first time
ever. Thus, electronic storage was finally available to the masses.
2000s to 2010s – Controlling Data Volume, Social Media and Cloud Computing
During the early 2000s, companies such as Amazon, eBay, and Google helped generate large
amounts of web traffic, as well as a combination of structured and unstructured data. Amazon
also launched a beta version of AWS (Amazon Web Services) in 2002, which opened
the Amazon.com platform to all developers. By 2004, over 100 applications were built for it.
AWS then relaunched in 2006, offering a wide range of cloud infrastructure services, including
Simple Storage Service (S3) and Elastic Compute Cloud (EC2). The public launch of AWS attracted a wide range of customers, such as Dropbox, Netflix, and Reddit, who were eager to become cloud-enabled; all of them would partner with AWS before 2010.
Evolution of Big Data Technologies
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution for storing and analyzing large volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud Computing technology helps companies store their important data in remote data centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine Learning algorithms work on large data sets, analyzing huge amounts of data to extract meaningful insights. This has led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in
real time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Challenges of Big Data
Many companies get stuck at the initial stage of their Big Data projects because they are neither aware of the challenges of Big Data nor equipped to tackle them. The challenges that conventional systems face with Big Data need to be addressed. Below are some of the major Big Data challenges and their solutions.
1. Insufficient understanding of Big Data
Companies fail in their Big Data initiatives due to insufficient understanding. Employees may not know what data is, how it is stored and processed, why it matters, or where it comes from. Data professionals may know what is going on, but others may not have a clear picture.
For example, if employees do not understand the importance of data storage, they might not keep
the backup of sensitive data. They might not use databases properly for storage. As a result,
when this important data is required, it cannot be retrieved easily.
Solution
Big Data workshops and seminars must be held at companies for everyone. Basic training
programs must be arranged for all the employees who are handling data regularly and are a part
of the Big Data projects. A basic understanding of data concepts must be instilled at all levels of the organization.
2. Data growth and storage issues
One of the most pressing challenges of Big Data is storing all these huge sets of data properly. The amount of data stored in the data centers and databases of companies is increasing rapidly. As these data sets grow exponentially with time, they become extremely difficult to handle.
Most of the data is unstructured and comes from documents, videos, audio, text files and other sources. This means it cannot be kept in traditional relational databases. This poses a huge Big Data analytics challenge that must be resolved as soon as possible, or it can delay the growth of the company.
Solution
In order to handle these large data sets, companies are opting for modern techniques, such
as compression, tiering, and deduplication. Compression is used for reducing the number of bits
in the data, thus reducing its overall size. Deduplication is the process of removing duplicate and
unwanted data from a data set.
Data tiering allows companies to store data in different storage tiers. It ensures that the data is
residing in the most appropriate storage space. Data tiers can be public cloud, private cloud, and
flash storage, depending on the data size and importance.
Companies are also opting for Big Data tools, such as Hadoop, NoSQL and other technologies.
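A small sketch of the deduplication and compression techniques mentioned above, using only Python's standard library and made-up records:

import gzip
import hashlib

records = [b"order:1001;amount:250", b"order:1002;amount:180",
           b"order:1001;amount:250"]          # the last record is a duplicate

# Deduplication: keep only records whose content hash has not been seen before.
seen, unique_records = set(), []
for rec in records:
    digest = hashlib.sha256(rec).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique_records.append(rec)

# Compression: reduce the number of bits needed to store the remaining data.
compressed = gzip.compress(b"\n".join(unique_records))
print(len(unique_records), "unique records,", len(compressed), "compressed bytes")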
3. Confusion while Big Data tool selection
Companies often get confused while selecting the best tool for Big Data analysis and storage.
Is HBase or Cassandra the best technology for data storage? Is Hadoop MapReduce good enough
or will Spark be a better option for data analytics and storage?
These questions bother companies and sometimes they are unable to find the answers. They end
up making poor decisions and selecting inappropriate technology. As a result, money, time,
efforts and work hours are wasted.
Solution
The best way to go about it is to seek professional help. You can hire experienced professionals who know much more about these tools. Another way is to go for Big Data consulting, where consultants recommend the best tools for your company’s scenario. Based on their advice, you can work out a strategy and then select the best tool for you.
4. Lack of data professionals
To run these modern technologies and Big Data tools, companies need skilled data professionals. These professionals include data scientists, data analysts and data engineers who are experienced in working with the tools and making sense of huge data sets.
Companies face a problem of lack of Big Data professionals. This is because data handling tools
have evolved rapidly, but in most cases, the professionals have not. Actionable steps need to be
taken in order to bridge this gap.
Solution
Companies are investing more money in the recruitment of skilled professionals. They also have
to offer training programs to the existing staff to get the most out of them.
Another important step taken by organizations is the purchase of data analytics solutions that are
powered by artificial intelligence/machine learning. These tools can be run by professionals who
are not data science experts but have basic knowledge. This step helps companies to save a lot of
money for recruitment.
5. Securing data
Securing these huge sets of data is one of the daunting challenges of Big Data. Companies are often so busy understanding, storing and analyzing their data sets that they push data security to later stages. But this is not a smart move, as unprotected data repositories can become breeding grounds for malicious hackers.
Companies can lose up to $3.7 million for a stolen record or a data breach.
Solution
Companies are recruiting more cybersecurity professionals to protect their data. Other steps
taken for securing data include:
Data encryption (a minimal sketch follows this list)
Data segregation
Identity and access control
Implementation of endpoint security
Real-time security monitoring
Use of Big Data security tools, such as IBM Guardium
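As referenced in the list above, a minimal data-encryption sketch, assuming the third-party cryptography package; in practice, keys would be held in a key-management service rather than generated in application code.

from cryptography.fernet import Fernet

# In practice the key would live in a key-management service, not in code.
key = Fernet.generate_key()
cipher = Fernet(key)

sensitive = b"customer: Seema R.; card: **** **** **** 1234"
token = cipher.encrypt(sensitive)          # store/transmit only the ciphertext
print(cipher.decrypt(token) == sensitive)  # True: original data is recoverable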
6. Integrating data from a variety of sources
Data in an organization comes from a variety of sources, such as social media pages, ERP applications, customer logs, financial reports, e-mails, presentations and reports created by employees. Combining all this data to prepare reports is a challenging task.
This is an area often neglected by firms. But, data integration is crucial for analysis, reporting
and business intelligence, so it has to be perfect.
Solution
Companies have to solve their data integration problems by purchasing the right tools. Some of
the best data integration tools are mentioned below:
Talend Data Integration
Centerprise Data Integrator
ArcESB
IBM InfoSphere
Xplenty
Informatica PowerCenter
CloverDX
Microsoft SQL
QlikView
Why is Big Data Analytics Important?
Big data analytics helps organizations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient operations,
higher profits and happier customers. Businesses that use big data with advanced
analytics gain value in many ways, such as:
1. Reducing cost. Big data technologies like cloud-based analytics can significantly reduce costs
when it comes to storing large amounts of data (for example, a data lake). Plus, big data analytics
helps organizations find more efficient ways of doing business.
2. Making faster, better decisions. The speed of in-memory analytics – combined with the ability
to analyze new sources of data, such as streaming data from IoT – helps businesses analyze
information immediately and make fast, informed decisions.
3. Developing and marketing new products and services. Being able to gauge customer needs
and customer satisfaction through analytics empowers businesses to give customers what they
want, when they want it. With big data analytics, more companies have an opportunity to
develop innovative new products to meet customers’ changing needs.
4. Risk Management: More informed risk management techniques based on large data sample
sizes.
5. Increased Efficiency: Savings due to the increased efficiency and optimization of business
processes.
Types of Big Data Analytics
i. Prescriptive Analytics
This type of analytics is based on rules and recommendations, prescribing a certain analytical path for an enterprise. At this level, prescriptive analytics automates decisions and actions: how can we make a desired outcome happen?
Building on the previous types of analytics, neural networks and heuristics are applied to the data to recommend the best possible actions to achieve the desired outcomes.
ii. Diagnostic Analytics
In diagnostic analytics, most enterprises start to apply big data analytics to answer diagnostic
questions such as how and why something happened. Some may also call this behavioral
analytics.
Diagnostic analytics is about looking into the past and determining why a certain thing
happened. This type of analytics usually revolves around working on a dashboard.
Diagnostic analytics with big data helps in two ways: (a) the additional data brought by the
digital age eliminates analytic blind spots, and (b) the how and why questions deliver insights
that pinpoint the actions that need to be taken.
iii. Predictive Analytics
This type of analytics predicts the likely future course of events. Answering the how and why questions reveals specific patterns that can be used to detect when outcomes are about to occur.
Predictive analytics builds on diagnostic analytics to look for these patterns and see what is going to happen. Machine learning is also applied to continuously learn as new patterns emerge (a minimal sketch follows this list).
iv. Descriptive Analytics
In this type of analytics, work is done on incoming data. We apply analytics to mine this data and come up with a description based on it.
Many enterprises have spent years generating descriptive analytics, answering the "what happened" questions. This information is valuable but only provides a high-level, rearview-mirror view of business performance.
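The minimal sketch referenced under predictive analytics above, assuming scikit-learn and made-up monthly sales figures; it learns a pattern from past data and predicts the next value.

from sklearn.linear_model import LinearRegression

# Made-up historical data: month number vs. sales.
months = [[1], [2], [3], [4], [5], [6]]
sales = [100, 110, 121, 133, 146, 160]

model = LinearRegression().fit(months, sales)

# Predict what is likely to happen in month 7 based on the learned pattern.
print(round(model.predict([[7]])[0]))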
Popular Big Data Tools
Apache Hadoop: Hadoop helps in storing and processing large data sets.
Apache Spark: Spark enables in-memory computation.
Apache Flink
Apache Storm: Storm helps in faster processing of unbounded data streams.
Apache Cassandra: Cassandra provides high availability and scalability as a database.
MongoDB: MongoDB provides cross-platform capabilities.
Tableau
RapidMiner
R Programming
Qubole
SAS
Datapine
Hadoop:
An open-source framework that stores and processes big data sets. Hadoop can handle and
analyse structured and unstructured data.
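A minimal sketch of the MapReduce idea behind Hadoop, written in plain Python and simulated locally on a tiny in-memory data set; on a real cluster, the map and reduce steps would run distributed across many nodes (for example via Hadoop Streaming).

from itertools import groupby

def map_line(line):
    # Mapper logic: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_pairs(pairs):
    # Reducer logic: sum the counts for each word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local simulation of the MapReduce flow on a tiny "data set".
lines = ["big data is big", "data is everywhere"]
pairs = [kv for line in lines for kv in map_line(line)]
print(dict(reduce_pairs(pairs)))  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}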
Spark:
An open-source cluster computing framework for real-time processing and data analysis.
APACHE Cassandra:
APACHE Cassandra is an open-source NoSQL distributed database used to manage large amounts of data. It is one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created at Facebook and released publicly in 2008.
Qubole
Qubole is an open-source big data tool that helps in fetching data across a value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end service, reducing the time and effort required to move data pipelines. It can be configured across multi-cloud services such as AWS, Azure, and Google Cloud, and it is also claimed to lower cloud computing costs by 50%.
MongoDB
MongoDB is a cross-platform, document-oriented NoSQL database that stores data in flexible, JSON-like documents rather than fixed tables, which makes it well suited to large volumes of semi-structured data.
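A brief usage sketch, assuming the pymongo driver and a MongoDB server running locally; the database and collection names are illustrative.

from pymongo import MongoClient

# Connect to a locally running MongoDB instance (assumed to exist).
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents are flexible, JSON-like records; no fixed table schema is required.
orders.insert_one({"customer": "Prashant Rao", "amount": 250, "items": 3})
for doc in orders.find({"amount": {"$gt": 100}}):
    print(doc["customer"], doc["amount"])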
Apache Storm
Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming language barrier and can work with any language. It was designed to handle large pools of data in a fault-tolerant, horizontally scalable manner. When it comes to real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, which is why many tech giants use APACHE Storm in their systems. Some of the most notable names are Twitter, Zendesk, and NaviSite.
SAS
Today, SAS is one of the best statistical modeling tools used by data analysts. Using SAS, a data scientist can mine, manage, extract or update data in different variants from different sources. SAS (Statistical Analysis System) allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it offers a cloud platform for business analytics called SAS Viya, and to strengthen its grip on AI & ML, it has introduced new tools and products.
Datapine
Datapine is an analytics tool used for BI, founded in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small-to-medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in 4 different price brackets, starting from $249 per month, and provides dashboards by function, industry, and platform.
RapidMiner
RapidMiner is a fully automated visual workflow design tool used for data analytics. It is a no-code platform, so users are not required to write code to segregate data. Today, it is heavily used in many industries such as ed-tech, training, and research. Though it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).