Acknowledgement
The success and final outcome of this seminar required a great deal of guidance and
assistance from many people, and I am extremely fortunate to have received it throughout
the completion of my seminar work. Whatever I have accomplished is due to such guidance
and assistance, and I would not forget to thank them.
I owe profound gratitude to our in-charge principal, Priyanka Parmar, my seminar
guide, Prof. Vijay Shah, and all the other assistant professors of Smt. Z. S. Patel College of
Computer Application, who took a keen interest in my seminar work and guided me
until its completion by providing all the information necessary for presenting a good
concept. I am extremely grateful to them for their support and guidance despite their
busy schedules managing college affairs.
I am thankful and fortunate to have received support and guidance from all the teaching
staff of the Bachelor of Computer Application department, which helped me successfully
complete my seminar work. I would also like to extend my sincere regards to all the
non-teaching staff of the department for their support.
Name: MANDLEWALA HAMID
Roll No.: 49
CONTENTS
1) BIG DATA
2) HISTORY AND EVOLUTION OF BIG DATA
3) CATEGORIES OF BIG DATA AND USES OF BIG DATA
4) BIG DATA ANALYTICS
5) BIG DATA TECHNOLOGIES
6) DIFFERENCE BETWEEN TRADITIONAL DATA AND BIG DATA
7) DATA VISUALIZATION
BIG DATA
Big data refers to extremely large and complex data sets that cannot
be effectively processed or analyzed using traditional data processing
methods.
In other words, it is data that is too large or too complex to be managed using
traditional data processing, analysis, and storage techniques.
HISTORY OF BIG DATA
The history of Big Data dates back to the 1960s and 1970s, when
computers were first introduced for data processing. However, it was
not until the 1990s that the term "Big Data" was coined to describe
the growing volume, variety, and velocity of data being generated by
various sources.
In the early 2000s, the emergence of the internet and the proliferation of
digital devices led to a massive increase in the amount of data being
generated and collected. This, in turn, created a need for new tools and
technologies to store, process, and analyze the data.
The first major data project was created in 1937 and was ordered by
the Franklin D. Roosevelt administration after the Social Security Act
became law. The government had to keep track of contributions from
26 million Americans and more than 3 million employers. IBM got the
contract to develop a punch-card-reading machine for this massive
bookkeeping project.
The first data-processing machine appeared in 1943 and was
developed by the British to decipher Nazi codes during World War II.
This device, named Colossus, searched for patterns in intercepted
messages at a rate of 5,000 characters per second, reducing the length
of time the task took from weeks to merely hours.
EVOLUTION OF BIG DATA
Early Computing and Data Collection (1960s-1980s)
The concept of managing large datasets began with early
computers and databases.
Organizations started collecting and storing data from
transactions and scientific research.
Traditional databases were limited in scale and capacity,
leading to the need for more advanced systems.
The Rise of the Internet (1990s)
The rapid expansion of the internet led to massive data
generation.
Websites, emails, and e-commerce platforms created vast
amounts of digital data.
Traditional databases struggled to manage the volume
and variety of this data, paving the way for big data
concepts.
Web 2.0 and User-Generated Content (Early 2000s)
The emergence of Web 2.0 allowed users to generate
content, leading to an explosion of unstructured data.
Social media platforms and blogs contributed significantly to
the data landscape.
New technologies like NoSQL databases and frameworks
such as Hadoop were developed to handle unstructured
data.
Big Data Technologies and Advanced Analytics (2010s-
Present)
The development of cloud computing and distributed
systems enabled the storage and processing of massive
datasets.
Companies began using advanced analytics tools, including
machine learning and AI, to extract insights from data.
Real-time data processing became essential for businesses
to remain competitive.
The Internet of Things (IoT) and Future Trends
The rise of IoT has introduced a new dimension of data
generation, connecting millions of devices that continuously
collect data.
This has led to an exponential increase in data volume,
requiring sophisticated processing and analysis tools.
The future of big data will focus on real-time analytics,
AI-driven insights, and the ethical use of data, including
compliance with regulations like GDPR and CCPA.
Key Characteristics of Big Data
Volume:
The sheer amount of data generated daily has exploded,
moving from gigabytes to zettabytes.
Velocity:
Data is generated at unprecedented speeds, necessitating
real-time processing.
Variety:
Data comes in various forms, including structured, semi-
structured, and unstructured formats.
CATEGORIES OF BIG DATA
STRUCTURED DATA
SEMI-STRUCTURED DATA
UNSTRUCTURED DATA
STRUCTURED DATA
Structured data can be crudely defined as the data that resides in a
fixed field within a record.
It is the type of data most familiar from our everyday lives, for example
birthdays and addresses.
Structured data is also called relational data. It is split into multiple tables
to enhance the integrity of the data by creating a single record to depict an
entity.
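For illustration, here is a minimal sketch of structured data using Python's built-in sqlite3 module; the table and column names (customers, name, birthday, address) are invented examples, not from any particular system.

```python
# Minimal illustration of structured data: records with fixed fields in a
# relational table (table and column names are only examples).
import sqlite3

conn = sqlite3.connect(":memory:")          # throw-away in-memory database
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, birthday TEXT, address TEXT)"
)
conn.execute(
    "INSERT INTO customers (name, birthday, address) VALUES (?, ?, ?)",
    ("Asha", "1999-04-12", "Surat"),
)

# Every row has the same fixed fields, so querying is straightforward.
for row in conn.execute("SELECT id, name, birthday FROM customers"):
    print(row)
```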
SEMI-STRUCTURED DATA
Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly
organized into rows and columns like that in a spreadsheet.
Because semi-structured data does not need a structured query language, it is
commonly called NoSQL data.
A data serialization language is used to exchange semi-structured data
across systems that may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business
process but it can also include files containing machine instructions for
computer programs.
This type of information typically comes from external sources such as
social media platforms or other web-based data feeds.
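As a small illustrative sketch, the following Python snippet parses JSON, a common serialization format for semi-structured data; the records and field names are made up for illustration.

```python
# Semi-structured data: JSON records share no rigid schema -- each record can
# carry different fields (the records below are invented for illustration).
import json

records = [
    '{"user": "ria", "likes": 42, "tags": ["travel", "food"]}',
    '{"user": "sam", "location": {"city": "Mumbai"}}',
]

for raw in records:
    doc = json.loads(raw)                   # parse the serialized record
    # Fields that may be missing are read with .get() instead of fixed columns.
    print(doc["user"], doc.get("likes", 0), doc.get("location"))
```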
UNSTRUCTURED DATA
Unstructured data is the kind of data that doesn’t adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally
considered unstructured data. Even though the metadata accompanying
an image or a video may be semi-structured, the actual data being dealt
with is unstructured.
Additionally, Unstructured data is also known as “dark data” because
it cannot be analyzed without the proper software tools.
USES OF BIG DATA
1) BUSINESS INTELLIGENCE AND ANALYTICS:
DESCRIPTION: Organizations use big data analytics to gain
insights into customer behavior, market trends, and operational
efficiency.
APPLICATION:
Sales forecasting
Customer segmentation
Performance metrics analysis
2) CUSTOMER EXPERIENCE ENHANCEMENT:
DESCRIPTION: Businesses analyze customer data to
personalize experiences and improve satisfaction.
APPLICATION:
Targeted marketing campaigns
Recommendation systems (e.g., Netflix, Amazon)
Customer feedback analysis
3) HEALTH CARE AND MEDICAL RESEARCH:
DESCRIPTION: Big data is used to improve patient care,
conduct research, and manage healthcare operations
APPLICATION:
Patient data analysis for personalized medicine
Predictive modeling for disease outbreaks
Drug discovery and clinical trials analysis
4) SUPPLY CHAIN MANAGEMENT:
DESCRIPTION: Companies leverage big data to optimize supply
chain operations and improve logistics.
APPLICATIONS:
Inventory management
Demand forecasting
Route optimization for delivery
5) SOCIAL MEDIA ANALYSIS:
DESCRIPTION: Organizations analyze social media data to
understand public sentiment and brand perception.
APPLICATIONS:
Sentiment analysis for brand monitoring
Trend analysis for marketing strategies
Crisis management through real-time monitoring
6) FINANCIAL SERVICES:
DESCRIPTION: Financial institutions use big data for risk
management, customer insights, and regulatory compliance.
APPLICATIONS:
Credit scoring and risk assessment
Algorithmic trading
Regulatory reporting and compliance monitoring
7) TELECOMMUNICATIONS:
DESCRIPTION: Telecom companies analyze big data to
improve network performance and customer services.
APPLICATION:
Churn prediction and customer retention strategies
Network optimization and fault detection
Usage pattern analysis for service improvement
8) EDUCATION:
DESCRIPTION: Educational institutions use big data to enhance
learning experiences and operational efficiency.
APPLICATION:
Student performance analytics
Curriculum development based on learning outcomes
Predictive analytics for student retention
BIG DATA ANALYTICS
Definition:
Big data analytics refers to the process of examining large and
complex datasets to uncover hidden patterns, correlations, trends,
and insights that can inform decision-making and drive business
strategies.
Big data analytics refers to the systematic processing and
analysis of large amounts of data and complex data sets, known
as big data.
Key components of big data analytics:
1) DATA SOURCES
2) DATA PROCESSING
3) DATA ANALYTICS
DATA SOURCES:
Structured Data: Organized data in fixed fields, such as databases and
spreadsheets.
Unstructured Data: Data that does not have a predefined format,
such as text, images, and videos.
Semi-Structured Data: Data that has some organizational properties
but is not strictly structured, such as JSON and XML files.
DATA PROCESSING:
Data Collection: Gathering data from various sources, including IoT
devices, social media, transaction records, and more.
Data Storage: Storing data in data lakes, warehouses, or cloud
storage solutions that can handle large volumes of data.
Data Cleaning: Removing inaccuracies, duplicates, and irrelevant
information to ensure data quality.
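A minimal data-cleaning sketch is shown below using the pandas library (an assumption; the report does not name a specific tool). The column names and values are invented for illustration.

```python
# Minimal data-cleaning sketch with pandas: drop duplicates and bad rows.
# The column names and values are invented for illustration.
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3],
        "amount": [250.0, 250.0, None, 90.0],
        "country": ["IN", "IN", "IN", "??"],
    }
)

clean = (
    raw.drop_duplicates()                   # remove duplicate records
       .dropna(subset=["amount"])           # remove rows missing key values
)
clean = clean[clean["country"] != "??"]     # drop clearly invalid entries

print(clean)
```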
DATA ANALYSIS:
Descriptive Analytics: Analyzing historical data to understand
what has happened in the past.
Diagnostic Analytics: Investigating data to understand why certain
events occurred.
Predictive Analytics: Using statistical models and machine learning
techniques to forecast future outcomes based on historical data.
Prescriptive Analytics: Recommending actions based on data
analysis to achieve desired outcomes.
THE FIVE V’S OF BIG DATA ANALYTICS
The 5 V's are a set of characteristics of big data that define the opportunities
and challenges of big data analytics. These include the following:
1. VOLUME
2. VERACITY
3. VELOCITY
4. VALUE
5. VARIETY
1) VOLUME:
This refers to the massive amounts of data generated from different
sources.
The sheer volume of data generated today, from social media feeds,
IoT devices, transaction records and more, presents a significant
challenge.
Big data technologies and cloud-based storage solutions enable
organizations to store and manage these vast data sets cost-effectively,
protecting valuable data from being discarded due to storage limitations.
The larger the volume, the deeper the analysis can be, revealing trends and
patterns that smaller datasets may miss.
2) VERACITY:
Veracity refers to the accuracy and quality of data.
Data reliability and accuracy are critical, as decisions based on
inaccurate or incomplete data can lead to negative outcomes.
Veracity refers to the data's trustworthiness, encompassing data quality,
noise and anomaly detection issues.
Techniques and tools for data cleaning, validation and verification are
integral to ensuring the integrity of big data
3) VELOCITY:
Velocity refers to the speed at which this data is generated and how
fast it's processed and analyzed.
Data is being produced at unprecedented speeds, from real-time social
media updates to high-frequency stock trading records.
The velocity at which data flows into organizations requires robust
processing capabilities to capture, process and deliver accurate analysis
in near real-time.
Stream processing frameworks and in-memory data processing are
designed to handle these rapid data streams and balance supply with
demand.
4) VALUE:
Value refers to the overall worth that big data analytics should
provide.
Large data sets should be processed and analyzed to provide real-
world meaningful insights that can positively affect an organization's
decisions.
Big data analytics aims to extract actionable insights that offer
tangible value.
This involves turning vast data sets into meaningful information that
can inform strategic decisions, uncover new opportunities and drive
innovation.
5) VARIETY:
This refers to the data types, including structured, semi structured and
unstructured data.
It also refers to the data's format, such as text, videos or images.
The variety in data demands flexible data management systems that can
handle, integrate, and analyze disparate data types for comprehensive
analysis.
BIG DATA ANALYTICS MODELS
DEFINITION:
The primary models are descriptive analytics, which summarizes past data;
diagnostic analytics, which explores the reasons behind past outcomes;
predictive analytics, which forecasts what is likely to happen; and
prescriptive analytics, which recommends what should be done.
THERE ARE FOUR TYPES OF BIG DATA ANALYTICS:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
These models rely on various tools for processes such as data mining, cleaning,
integration, and visualization, among others, to improve the analysis of data
and ensure the company benefits from the data it gathers.
1. Descriptive Analytics:
Descriptive analytics is one of the most common forms of
analytics that companies use to stay updated on current trends
and the company’s operational performances.
It is one of the first steps of analyzing raw data by performing
simple mathematical operations and producing statements about
samples and measurements.
WHAT HAS HAPPENED?
Descriptive analytics, such as data visualization, is important in
helping users interpret the output from predictive and
prescriptive analytics.
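As a small sketch, the snippet below computes simple descriptive statistics over invented historical sales figures using the pandas library (an assumed tool, not one named in this report).

```python
# Descriptive analytics sketch: simple summary statistics over historical
# sales figures (the numbers are invented for illustration).
import pandas as pd

sales = pd.DataFrame(
    {"month": ["Jan", "Feb", "Mar", "Apr"], "revenue": [120, 135, 150, 142]}
)

print(sales["revenue"].describe())   # count, mean, std, min, max, quartiles
print("total revenue:", sales["revenue"].sum())
```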
2. Diagnostic Analytics:
Diagnostic analytics is one of the more advanced types of big data
analytics that you can use to investigate data and content.
So, by analyzing data, you can comprehend the reasons for certain
behaviors and events related to the company you work for, their
customers, employees, products, and more.
WHY DID IT HAPPEN?
Identifies the fundamental reasons for issues using methods like the
"5 Whys" and fishbone diagrams.
3. Predictive Analytics
As the name suggests, this type of data analytics is all about making
predictions about future outcomes based on insight from data.
In order to get the best results, it uses many sophisticated predictive
tools and models such as machine learning and statistical modeling.
Predictive analytics is one of the most widely used types of analytics
today. The market size and shares are projected to reach $10.95
billion by 2022, growing at a 21% rate for six years.
WHAT WILL OCCUR?
Marketing is the target for many predictive analytics
applications.
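A minimal predictive-analytics sketch follows, fitting a simple linear regression with scikit-learn (an assumed tool) on toy historical figures to forecast the next period.

```python
# Predictive analytics sketch: fit a simple statistical model on past data
# and forecast the next period (toy numbers; scikit-learn assumed installed).
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4]])        # past periods
revenue = np.array([120, 135, 150, 142])       # observed revenue

model = LinearRegression().fit(months, revenue)
print("forecast for month 5:", model.predict([[5]])[0])
```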
4. Prescriptive Analytics:
Prescriptive analytics takes the results from descriptive and predictive
analysis and finds solutions for optimizing business practices through
various simulations and techniques.
It uses the insight from data to suggest what the best step forward
would be for the company.
WHAT SHOULD OCCUR?
Prescriptive analytics can benefit healthcare strategic planning
by leveraging operational and usage data combined with data on
external factors such as economic conditions, population
demographic trends, and population health trends, to plan more
accurately for future capital investments such as new facilities
and equipment utilization.
IMPORTANCE OF BIG DATA ANALYTICS
Combining big data with high-powered analytics can have a great
impact on your business strategy, for example:
Finding the root cause of failures, issues, and defects in real-time
operations.
Generating coupons at the point of sale based on the customer's
buying habits.
Recalculating entire risk portfolios in just minutes.
Detecting fraudulent behavior before it affects and puts your
organization at risk.
Analyzing big data enables businesses to gain deep customer
insights.
This leads to increased customer satisfaction, loyalty, and ultimately
higher revenue.
Big data analysis is crucial in optimizing business processes and
improving operational efficiency.
It enables predictive maintenance, where potential equipment
failures can be anticipated, minimizing downtime and maximizing
productivity.
This ensures that strategic planning, resource allocation, and risk
management are backed by solid evidence, leading to better
outcomes and improved business performance.
BIG DATA TECHNOLOGIES
1. APACHE CASSANDRA:
It is a NoSQL database that is highly scalable and offers high
availability.
Data can be replicated across multiple data centers.
Fault tolerance is one of Cassandra's big strengths: failed nodes
can be replaced without any downtime.
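A minimal sketch using the DataStax cassandra-driver package for Python (an assumption; the host, keyspace, and table names are placeholders) might look like this:

```python
# Cassandra sketch with the DataStax Python driver (cassandra-driver).
# Host, keyspace, and table names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Asha"))
for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)

cluster.shutdown()
```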
2. APACHE HADOOP:
Hadoop is one of the most widely used big data technologies. It handles
large-scale data and large file systems using the Hadoop Distributed
File System (HDFS).
It provides parallel processing through the MapReduce framework.
Hadoop is a scalable system, offering a solution capable of handling
very large data volumes.
For example, NextBio uses Hadoop MapReduce and HBase to process
multi-terabyte data sets of the human genome.
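As an illustrative sketch of the MapReduce style, here is the classic word count written with the mrjob Python library (an assumed helper; Hadoop jobs can also be written in Java or via Hadoop Streaming).

```python
# Word-count sketch in the MapReduce style, written with the mrjob library
# (an assumption -- the report only names Hadoop's MapReduce framework).
from mrjob.job import MRJob


class WordCount(MRJob):
    def mapper(self, _, line):
        # Map phase: emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce phase: sum the counts emitted for each word.
        yield word, sum(counts)


if __name__ == "__main__":
    WordCount.run()      # e.g. python wordcount.py input.txt
```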
3. APACHE HIVE:
It is used for data summarization and ad hoc querying, which means
querying and analyzing big data easily.
It is built on top of Hadoop, providing data summarization, ad hoc
queries, and analysis of large datasets using an SQL-like language
called HiveQL.
It is not a relational database and is not a language for real-time
queries.
Its features include being designed for OLAP, an SQL-type language
called HiveQL, and being fast, scalable, and extensible.
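A hedged sketch of submitting a HiveQL summarization query from Python with the PyHive package (an assumption; the host, table, and column names are placeholders):

```python
# HiveQL sketch submitted from Python with the PyHive package (an assumed
# client; the report itself only describes HiveQL). Host, table, and column
# names are placeholders.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Ad hoc summarization query written in HiveQL.
cursor.execute(
    "SELECT country, COUNT(*) AS orders FROM sales GROUP BY country"
)
for country, orders in cursor.fetchall():
    print(country, orders)
```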
4. APACHE FLUME:
It is a distributed and reliable system that is used to collect,
aggregate, and move large amounts of log data from many data
sources toward a centralized data store.
5. APACHE SPARK:
The main objective of Spark is to speed up Hadoop's computational
processing; it was introduced by the Apache Software Foundation.
Spark can work with Hadoop in two ways: for storage and for processing.
In practice, Spark often uses Hadoop only for storage, because Spark has
its own cluster management and computation engine.
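A minimal PySpark sketch (the file path and column names are placeholders) that reads data from HDFS-style storage and aggregates it:

```python
# Minimal PySpark sketch: read a file from HDFS-style storage and aggregate
# it with a distributed groupBy (path and column names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
df.groupBy("country").count().show()       # distributed aggregation

spark.stop()
```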
6. APACHE KAFKA:
It is a distributed publish-subscribe messaging system and more
specifically you can say it has a robust queue that allows you to
handle a high volume of data.
You can pass messages from one point to another, from a sender to a
receiver.
Messages can be processed in both offline and online modes, so it is
suitable for both.
To prevent data loss Kafka messages are replicated within the cluster.
For real-time streaming data analysis, It integrates Apache Storm and
Spark and is built on top of the ZooKeeper synchronization service.
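A small publish/subscribe sketch using the kafka-python package (an assumption; the broker address and topic name are placeholders):

```python
# Publish/subscribe sketch with the kafka-python package (an assumed client;
# the broker address and topic name are placeholders).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b"user=42 page=/home")   # publish a message
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,               # stop polling after 5 s of silence
)
for message in consumer:
    print(message.value)                    # consume messages from the queue
```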
7. MONGO DB:
It is a cross-platform database built around the concepts of collections
and documents.
It uses document-oriented storage, meaning data is stored in a
JSON-like form.
Any attribute can be indexed. Its features include high availability,
replication, rich queries, auto-sharding, and fast in-place updates.
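A minimal sketch with the pymongo driver (the database, collection, and field names are invented placeholders):

```python
# Document-store sketch with the pymongo driver (database, collection, and
# field names are invented placeholders).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]

# Documents are stored as JSON-like records; no fixed schema is required.
collection.insert_one({"order_id": 1, "items": ["pen", "book"], "total": 250})
collection.create_index("order_id")          # an index on any attribute

for doc in collection.find({"total": {"$gt": 100}}):
    print(doc)
```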
8. ELASTIC SEARCH:
It is a real-time, distributed, open-source full-text search and
analytics engine.
It is highly scalable, handling structured and unstructured data up to
petabytes, and it can be used as a replacement for document-oriented
stores such as MongoDB or RavenDB.
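A small indexing-and-search sketch with the official Elasticsearch Python client (an assumption; the exact parameter names can differ between client versions, and the index name and fields are placeholders):

```python
# Full-text search sketch with the elasticsearch Python client (parameter
# names follow the 8.x client; index name and fields are placeholders).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id=1,
         document={"title": "Big data basics", "body": "Volume, velocity, variety"})
es.indices.refresh(index="articles")         # make the document searchable

hits = es.search(index="articles", query={"match": {"body": "velocity"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["title"])
```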
DIFFERENCES BETWEEN
TRADITIONAL DATA AND BIG DATA
Traditional data is the kind of information that is easy to organize and store in
simple databases, like spreadsheets or small computer systems. Big data, however,
is much larger and more complex; it includes huge amounts of information from many
different sources, such as social media, online videos, sensors in machines, or
website clicks.
Traditional data is generated at the enterprise level, whereas big data is
generated outside the enterprise level.
Traditional data volumes range from gigabytes to terabytes, whereas big data
volumes range from petabytes to zettabytes or exabytes.
Traditional database systems deal with structured data, whereas big data systems
deal with structured, semi-structured, and unstructured data.
Traditional data sources are centralized and managed in a centralized form,
whereas big data sources are distributed and managed in a distributed form.
Integration of traditional data is very easy, whereas integration of big data is
very difficult.
A normal system configuration can process traditional data, whereas a high-end
system configuration is required to process big data.
The size of traditional data is very small, whereas big data is much larger than
traditional data.
Normal functions can manipulate traditional data, whereas special kinds of
functions are needed to manipulate big data.
Traditional data is easy to manage and manipulate, whereas big data is difficult
to manage and manipulate.
BIG DATA VISUALIZATION
DEFINITION:
Big data visualization is the process by which large amounts of analyzed
data are converted into an easy-to-comprehend visual format.
By presenting complex data as graphs, charts, tables, diagrams, or other
visuals, users are able to grasp the meaning behind the information more
easily and quickly.
Big Data Visualization refers to the techniques and tools used to
graphically represent large and complex datasets in a way that is
easy to understand and interpret.
Given the volume, variety, and velocity of big data, traditional
visualization methods often fall short, requiring more sophisticated
approaches to make sense of such vast amounts of information.
HISTORY OF DATA VISUALIZATION
WHAT MAKES A GOOD CHART?
(Figure: Napoleon's 1812 march by Charles Joseph Minard)
Prior to the 17th century, data visualization existed mainly in the realm of
maps, displaying land markers, cities, roads, and resources. As the demand
grew for more accurate mapping and physical measurement, better
visualizations were needed.
In 1644, Michael Florent Van Langren, a Flemish astronomer, is believed to
have provided the first visual representation of statistical data.
The one-dimensional line graph he produced showed the twelve known estimates at
the time of the difference in longitude between Toledo and Rome, as well as
the name of each astronomer who provided an estimate.
This period also gave us William Playfair, who is widely considered to be
the inventor of many of the most popular graphs we use today (line, bar,
circle, and pie charts). Many statistical chart types, including histograms,
time series plots, contour plots, and scatterplots, were invented during
this period.
The latter half of the 19th century is what Friendly calls the Golden Age of
statistical graphics. Two famous examples of data visualization from that era
include John Snow’s (not that Jon Snow!) map of cholera outbreaks in the
London epidemic of 1854 and Charles Minard’s 1869 chart showing the
number of men in Napoleon's infamous 1812 Russian campaign army, with
army location indicated by the X-axis, and extreme cold temperatures
indicated at points when frostbite took a fatal toll.
This time also provided us with a new visualization, the Rose Chart, created
by Florence Nightingale.
A number of factors contributed to this "Golden Age" of statistical graphing,
including the industrial revolution, which created the modern business.
The latter half of the 20th century is what Friendly calls the 'rebirth of data
visualization', brought on by the emergence of computer processing and by
John Tukey in the United States and Jacques Bertin in France, who developed
the science of information visualization in the areas of statistics and
cartography, respectively.
WHY DATA VISUALIZATION IS IMPORTANT
Data visualization is important in big data analytics because it simplifies
complex datasets, making it easier to identify patterns, trends, and outliers. It
enhances decision-making by providing clear visual insights that facilitate
understanding and effective communication among stakeholders.
IS DATA VISUALIZATION PART OF BIG DATA?
Data science and data visualization are not two different entities;
they are bound to each other. Data visualization is a subset of data
science, which is not a single process, method, or workflow.
Which are the best data
visualization software tools of 2019?
Sisense
Looker
Periscope Data
Zoho Analytics
Tableau
DOMO
Microsoft Power BI
QlikView
DATA VISUALIZATION TECHNIQUES
The type of data visualization technique you leverage will vary based on the
type of data you’re working with, in addition to the story you’re telling with
your data.
Here are some important data visualization techniques to know (a small plotting sketch follows the list):
Pie chart
Bar Chart
Histogram
Gantt Chart
Heat Map
Box and Whisker Plot
Waterfall Chart
Area Chart
Scatter Plot
Pictogram Chart
Timeline
Highlight Table
Bullet Graph
Choropleth Map
Word Cloud
Network Diagram
Correlation Matrices
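As promised above, here is a minimal plotting sketch using matplotlib (an assumed library; the data values are invented) showing two of these chart types:

```python
# A couple of the chart types above, drawn with matplotlib (the data values
# are invented for illustration).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 150, 142]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(months, revenue)                              # bar chart of revenue per month
ax1.set_title("Bar chart")

ax2.pie(revenue, labels=months, autopct="%1.0f%%")    # pie chart of the same data
ax2.set_title("Pie chart")

plt.tight_layout()
plt.show()
```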