
Acknowledgement

The success and the final outcome of this seminar required a lot of guidance and
assistance from many people, and I am extremely fortunate to have received all of it
along the completion of my seminar work. Whatever I have done is only due to such
guidance and assistance, and I would not forget to thank them.

I owe profound gratitude to our in-charge principal, Priyanka Parmar, my seminar
guide, Prof. Vijay Shah, and all the other assistant professors of Smt. Z.S. Patel
College of Computer Application, who took keen interest in my seminar work and guided
me all along, till its completion, by providing all the necessary information for
presenting a good concept. I am extremely grateful to them for providing such nice
support and guidance even though they had busy schedules managing the college's
affairs.

I am thankful and fortunate enough to have received support and guidance from all
the teaching staff of the Bachelor of Computer Application department, which helped
me in successfully completing my seminar work. I would also like to extend my sincere
regards to all the non-teaching staff of the Bachelor of Computer Application
department for their support.

Name: MANDLEWALA HAMID

Roll no: 49
CONTENT

1) BIG DATA

2) HISTORY AND EVOLUTION OF BIG DATA

3) CATEGORIES OF BIG DATA AND WHO USES BIG DATA

4) BIG DATA ANALYTICS

5) BIG DATA TECHNOLOGIES

6) DIFFERENCE BETWEEN TRADITIONAL DATA AND BIG DATA

7) DATA VISUALIZATION
BIG DATA

 Big data refers to extremely large and complex data sets that cannot
be effectively processed or analyzed using traditional data processing
methods.

 It is data that is too large or too complex to be managed using
traditional data processing, analysis, and storage techniques.
HISTORY OF BIG DATA

 The history of Big Data dates back to the 1960s and 1970s, when
computers were first introduced for data processing. However, it was
not until the 1990s that the term "Big Data" was coined to describe
the growing volume, variety, and velocity of data being generated by
various sources.

 In the early 2000s, the emergence of the internet and the proliferation of
digital devices led to a massive increase in the amount of data being
generated and collected. This, in turn, created a need for new tools and
technologies to store, process, and analyze the data.

 The first major data project was created in 1937 and was ordered by
the Franklin D. Roosevelt administration after the Social Security Act
became law. The government had to keep track of contributions from
26 million Americans and more than 3 million employers. IBM got the
contract to develop a punch card-reading machine for this massive
bookkeeping project.

 The first data-processing machine appeared in 1943 and was developed
by the British to decipher Nazi codes during World War II. This device,
named Colossus, searched for patterns in intercepted messages at a rate
of 5,000 characters per second, reducing the length of time the task
took from weeks to merely hours.

EVOLUTION OF BIG DATA

Early Computing and Data Collection (1960s-1980s)


 The concept of managing large datasets began with early
computers and databases.
 Organizations started collecting and storing data from
transactions and scientific research.
 Traditional databases were limited in scale and capacity,
leading to the need for more advanced systems.

The Rise of the Internet (1990s)


 The rapid expansion of the internet led to massive data
generation.
 Websites, emails, and e-commerce platforms created vast
amounts of digital data.
 Traditional databases struggled to manage the volume
and variety of this data, paving the way for big data
concepts.

Web 2.0 and User-Generated Content (Early 2000s)


 The emergence of Web 2.0 allowed users to generate
content, leading to an explosion of unstructured data.
 Social media platforms and blogs contributed significantly to
the data landscape.
 New technologies like NoSQL databases and frameworks
such as Hadoop were developed to handle unstructured
data.

Big Data Technologies and Advanced Analytics (2010s-Present)
 The development of cloud computing and distributed
systems enabled the storage and processing of massive
datasets.
 Companies began using advanced analytics tools, including
machine learning and AI, to extract insights from data.
 Real-time data processing became essential for businesses
to remain competitive.

The Internet of Things (IoT) and Future Trends


 The rise of IoT has introduced a new dimension of data
generation, connecting millions of devices that continuously
collect data.
 This has led to an exponential increase in data volume,
requiring sophisticated processing and analysis tools.
 The future of big data will focus on real-time analytics,
AI-driven insights, and the ethical use of data, including
compliance with regulations like GDPR and CCPA.
Key Characteristics of Big Data

Volume:
 The sheer amount of data generated daily has exploded,
moving from gigabytes to zettabytes.

Velocity:
 Data is generated at unprecedented speeds, necessitating
real-time processing.

Variety:
 Data comes in various forms, including structured, semi-
structured, and unstructured formats.
CATEGORIES OF BIG DATA

 STRUCTURED DATA

 SEMI-STRUCTURED DATA

 UNSTRUCTURED DATA
STRUCTURED DATA

 Structured data can be crudely defined as the data that resides in a
fixed field within a record.

 It is the type of data most familiar from our everyday lives, for
example: birthdays, addresses, etc.

 Structured data is also called relational data. It is split into multiple tables
to enhance the integrity of the data by creating a single record to depict an
entity.
SEMI-STRUCTURED DATA

 Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly
organized into rows and columns like that in a spreadsheet.

 Because semi-structured data doesn't need a structured query language, it
is commonly called NoSQL data.

 A data serialization language is used to exchange semi-structured data
across systems that may even have varied underlying infrastructure.

 Semi-structured content is often used to store metadata about a business
process, but it can also include files containing machine instructions for
computer programs.

 This type of information typically comes from external sources such as
social media platforms or other web-based data feeds, as in the sketch below.
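As a hedged illustration of semi-structured data, here is a minimal Python sketch
that parses a JSON record; the record's fields and values are invented for this
example:

import json

# A hypothetical semi-structured record, e.g. from a social media feed.
# Fields can vary from record to record; there is no rigid schema.
record = """
{
  "user": "alice",
  "posted_at": "2024-01-15T10:30:00Z",
  "text": "Hello, big data!",
  "tags": ["bigdata", "analytics"],
  "location": {"city": "Surat", "country": "India"}
}
"""

post = json.loads(record)           # parse the JSON text into a Python dict
print(post["user"], post["tags"])   # nested and repeated fields are allowed

Note that the nested object and list would be awkward to fit into fixed rows and
columns, which is exactly what separates this from structured data.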

UNSTRUCTURED DATA

 Unstructured data is the kind of data that doesn't adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.

 Photos, videos, text documents, and log files can generally be
considered unstructured data. Even though the metadata accompanying
an image or a video may be semi-structured, the actual data being dealt
with is unstructured.

 Additionally, unstructured data is also known as "dark data" because
it cannot be analyzed without the proper software tools.

USE OF BIG DATA

1) BUSINESS INTELLIGENCE AND ANALYTICS:


 DESCRIPTION: Organizations use big data analytics to gain
insights into customer behavior, market trends, and operational
efficiency.
 APPLICATIONS:
 Sales forecasting
 Customer segmentation
 Performance metrics analysis
2) CUSTOMER EXPERIENCE ENHANCEMENT:
 DESCRIPTION: Businesses analyze customer data to
personalize experiences and improve satisfaction.
 APPLICATIONS:
 Targeted marketing campaigns
 Recommendation systems (e.g., Netflix, Amazon)
 Customer feedback analysis

3) HEALTH CARE AND MEDICAL RESEARCH:


 DESCRIPTION: Big data is used to improve patient care,
conduct research, and manage healthcare operations.
 APPLICATIONS:
 Patient data analysis for personalized medicine
 Predictive modeling for disease outbreaks
 Drug discovery and clinical trials analysis

4) SUPPLY CHAIN MANAGEMENT:


 DESCRIPTION: Companies leverage big data to optimize supply
chain operations and improve logistics.

 APPLICATIONS:
 Inventory management
 Demand forecasting
 Route optimization for delivery
5) SOCIAL MEDIA ANALYSIS:
 DESCRIPTION: Organizations analyze social media data to
understand public sentiment and brand perception.
 APPLICATIONS:
 Sentiment analysis for brand monitoring
 Trend analysis for marketing strategies
 Crisis management through real-time monitoring

6) FINANCIAL SERVICES:
 DESCRIPTION: Financial institutions use big data for risk
management, customer insights, and regulatory compliance.
 APPLICATIONS:

 Credit scoring and risk assessment


 Algorithmic trading
 Regulatory reporting and compliance monitoring

7) TELECOMMUNICATIONS:
 DESCRIPTION: Telecom companies analyze big data to
improve network performance and customer services.
 APPLICATIONS:

 Churn prediction and customer retention strategies


 Network optimization and fault detection
 Usage pattern analysis for service improvement
8) EDUCATION:
 DESCRIPTION: Educational institutions use big data to enhance
learning experiences and operational efficiency.
 APPLICATIONS:
 Student performance analytics
 Curriculum development based on learning outcomes
 Predictive analytics for student retention
BIG DATA ANALYTICS
 Definition:
 Big data analytics refers to the process of examining large and
complex datasets to uncover hidden patterns, correlations, trends,
and insights that can inform decision-making and drive business
strategies.
 In other words, it refers to the systematic processing and
analysis of large amounts of data and complex data sets, known
as big data.

 Key components of big data analytics:

1) DATA SOURCES
2) DATA PROCESSING
3) DATA ANALYSIS

 DATA SOURCES:

 Structured Data: Organized data in fixed fields, such as databases and
spreadsheets.
 Unstructured Data: Data that does not have a predefined format,
such as text, images, and videos.
 Semi-Structured Data: Data that has some organizational properties
but is not strictly structured, such as JSON and XML files.

 DATA PROCESSING:
 Data Collection: Gathering data from various sources, including IoT
devices, social media, transaction records, and more.
 Data Storage: Storing data in data lakes, warehouses, or cloud
storage solutions that can handle large volumes of data.
 Data Cleaning: Removing inaccuracies, duplicates, and irrelevant
information to ensure data quality (see the sketch after this list).
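To make the data-cleaning step concrete, here is a minimal, hedged sketch using
pandas; the file name and column names are assumptions for illustration:

import pandas as pd

# Load raw transaction records (hypothetical file and columns).
df = pd.read_csv("transactions.csv")

df = df.drop_duplicates()                 # remove duplicate records
df = df.dropna(subset=["customer_id"])    # drop rows missing a key field
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # bad values -> NaN

print(df.describe())  # quick statistical check on the cleaned data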

 DATA ANALYSIS:

 Descriptive Analytics: Analyzing historical data to understand
what has happened in the past (see the sketch after this list).
 Diagnostic Analytics: Investigating data to understand why certain
events occurred.
 Predictive Analytics: Using statistical models and machine learning
techniques to forecast future outcomes based on historical data.
 Prescriptive Analytics: Recommending actions based on data
analysis to achieve desired outcomes.
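As a hedged example of the descriptive step, here is a small Python sketch using
pandas; the sales figures and column names are invented:

import pandas as pd

# Descriptive analytics: summarize what has already happened.
sales = pd.DataFrame({
    "region":  ["North", "South", "North", "South"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "revenue": [120.0, 95.5, 130.25, 101.0],
})

# Total and average revenue per region: a simple descriptive summary.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)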
THE FIVE V’S OF BIG DATA ANALYTICS

 The 5 V's are a set of characteristics of big data that define the
opportunities and challenges of big data analytics. These include the following:
1. VOLUME
2. VERACITY
3. VELOCITY
4. VALUE
5. VARIETY
1) VOLUME:
 This refers to the massive amounts of data generated from different
sources.
 The sheer volume of data generated today, from social media feeds,
IoT devices, transaction records and more, presents a significant
challenge.
 Big data technologies and cloud-based storage solutions enable
organizations to store and manage these vast data sets cost-effectively,
protecting valuable data from being discarded due to storage limitations.

 The larger the volume, the deeper the analysis can be, revealing trends and
patterns that smaller datasets may miss.
2) VERACITY:
 Veracity refers to the accuracy and quality of data.
 Data reliability and accuracy are critical, as decisions based on
inaccurate or incomplete data can lead to negative outcomes.
 Veracity refers to the data's trustworthiness, encompassing data quality,
noise and anomaly detection issues.
 Techniques and tools for data cleaning, validation and verification are
integral to ensuring the integrity of big data.
3) VELOCITY:
 Velocity refers to the speed at which this data is generated and how
fast it's processed and analyzed.
 Data is being produced at unprecedented speeds, from real-time social
media updates to high-frequency stock trading records.
 The velocity at which data flows into organizations requires robust
processing capabilities to capture, process and deliver accurate analysis
in near real-time.
 Stream processing frameworks and in-memory data processing are
designed to handle these rapid data streams and balance supply with
demand.
4) VALUE:
 Value refers to the overall worth that big data analytics should
provide.
 Large data sets should be processed and analyzed to provide real-
world meaningful insights that can positively affect an organization's
decisions.
 Big data analytics aims to extract actionable insights that offer
tangible value.
 This involves turning vast data sets into meaningful information that
can inform strategic decisions, uncover new opportunities and drive
innovation.
5) VARIETY:
 This refers to the data types, including structured, semi-structured and
unstructured data.
 It also refers to the data's format, such as text, videos or images.
 This variety demands flexible data management systems that can handle,
integrate and analyze disparate data types for comprehensive analysis.
BIG DATA ANALYTICS MODELS

DEFINITION:

 The primary models include descriptive analytics, which summarizes past
data; diagnostic analytics, which explores reasons behind past outcomes;
predictive analytics, which forecasts future outcomes; and prescriptive
analytics, which recommends actions.
 THERE ARE FOUR TYPES OF BIG DATA ANALYTICS:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
 They use various tools for processes such as data mining, cleaning,
integration, visualization, and many others, to improve the process of
analyzing data and ensure the company benefits from the data it gathers.

1. Descriptive Analytics:
 Descriptive analytics is one of the most common forms of
analytics that companies use to stay updated on current trends
and the company's operational performance.
 It is one of the first steps of analyzing raw data, performing
simple mathematical operations and producing statements about
samples and measurements.

 WHAT HAS HAPPENED?

 Descriptive analytics, such as data visualization, is important in
helping users interpret the output from predictive and
prescriptive analytics.
2. Diagnostic Analytics:
 Diagnostic analytics is one of the more advanced types of big data
analytics that you can use to investigate data and content.
 By analyzing data, you can comprehend the reasons for certain
behaviors and events related to the company you work for, its
customers, employees, products, and more.

 WHY DID IT HAPPEN?

 It identifies the fundamental reasons for issues using methods like
the "5 Whys" and fishbone diagrams.

3. Predictive Analytics:

 As the name suggests, this type of data analytics is all about making
predictions about future outcomes based on insights from data.
 To get the best results, it uses many sophisticated predictive
tools and models such as machine learning and statistical modeling
(a minimal sketch follows below).
 Predictive analytics is one of the most widely used types of analytics
today. The market size was projected to reach $10.95 billion by 2022,
growing at a 21% rate over six years.

 WHAT WILL OCCUR?

 Marketing is the target of many predictive analytics applications.
 Descriptive analytics, such as data visualization, is important in
helping users interpret the output from predictive and prescriptive
analytics.
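As a hedged illustration of predictive modeling, here is a minimal Python sketch
using scikit-learn's LinearRegression; all figures are invented for the example:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: six months of sales.
months = np.array([[1], [2], [3], [4], [5], [6]])   # past months
sales  = np.array([100, 108, 115, 125, 131, 140])   # observed sales

model = LinearRegression().fit(months, sales)       # fit a simple trend line
forecast = model.predict(np.array([[7], [8]]))      # predict months 7 and 8
print(forecast)

In practice far richer models (classification, time series, deep learning) are
used, but the workflow is the same: fit on historical data, then forecast.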
4. Prescriptive Analytics:
 Prescriptive analytics takes the results from descriptive and predictive
analysis and finds solutions for optimizing business practices through
various simulations and techniques.
 It uses the insights from data to suggest what the best step forward
would be for the company.

 WHAT SHOULD OCCUR?

 Prescriptive analytics can benefit healthcare strategic planning
by using analytics to leverage operational and usage data
combined with data on external factors such as economic data,
population demographic trends and population healthcare
trends, to more accurately plan for future capital investments such
as new facilities and equipment utilization.
IMPORTANCE OF BIG DATA ANALYTICS
 Combining big data with high-powered analytics can have a great
impact on your business strategy, for example:
 Finding the root causes of failures, issues, and defects in real-time
operations.
 Generating coupons at the point of sale based on the customer's
buying habits.
 Recalculating entire risk portfolios in just minutes.
 Detecting fraudulent behavior before it affects and puts your
organization at risk.

 Analyzing big data enables businesses to gain deep customer
insights.
 This leads to increased customer satisfaction, loyalty, and ultimately
higher revenue.
 Big data analysis is crucial in optimizing business processes and
improving operational efficiency.
 It enables predictive maintenance, where potential equipment
failures can be anticipated, minimizing downtime and maximizing
productivity.
 It ensures that strategic planning, resource allocation, and risk
management are backed by solid evidence, leading to better
outcomes and improved business performance.
BIG DATA TECHNOLOGIES

1. APACHE CASSANDRA:
 It is one of the NoSQL databases; it is highly scalable and has
high availability.
 Data can be replicated across multiple data centers.
 In Cassandra, fault tolerance is one of the big factors: failed nodes
can be easily replaced without any downtime (a minimal client sketch
follows below).
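Here is a hedged sketch of talking to Cassandra from Python with the DataStax
cassandra-driver package; the contact point, keyspace, and table are assumptions:

from cassandra.cluster import Cluster

# Connect to a (hypothetical) local node and keyspace.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_keyspace")

# Writes and reads keep working even if individual nodes fail,
# because rows are replicated across the cluster.
session.execute(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    (1, "alice"),
)
for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)

cluster.shutdown()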

2. APACHE HADOOP:
 Hadoop is one of the most widely used big data technologies. It
handles large-scale data and large file systems using the Hadoop
Distributed File System (HDFS).
 It provides parallel processing through its MapReduce framework
(sketched below).
 Hadoop is a scalable system, providing a solution capable of
handling large capacities and workloads.
 For example, in real use cases, NextBio uses Hadoop MapReduce
and HBase to process multi-terabyte data sets of the human genome.
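To show the MapReduce idea itself (not Hadoop's own Java API), here is a tiny
self-contained Python sketch of a word count: map emits (word, 1) pairs, the
shuffle groups them by key, and reduce sums each group:

from collections import defaultdict

documents = ["big data is big", "data needs processing"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the values by key, as the framework would.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}

On a real cluster the map and reduce phases run in parallel across many nodes,
with HDFS supplying the input splits.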

3. APACHE HIVE:
 It is used for data summarization and ad hoc querying, which means
querying and analyzing Big Data easily.
 It is built on top of Hadoop, providing data summarization, ad hoc
queries, and the analysis of large datasets using an SQL-like language
called HiveQL.
 It is not a relational database and not a language for real-time
queries.
 It is designed for OLAP and is fast, scalable, and extensible
(a query sketch follows below).
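As a hedged sketch, here is a HiveQL query issued from Python with the PyHive
package; the host, port, and the page_views table are assumptions:

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL: an SQL-like summarization over a (hypothetical) large table.
cursor.execute(
    "SELECT country, COUNT(*) AS views "
    "FROM page_views GROUP BY country ORDER BY views DESC LIMIT 10"
)
for country, views in cursor.fetchall():
    print(country, views)

Behind the scenes Hive compiles this query into batch jobs on Hadoop, which is
why it suits analysis rather than real-time lookups.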

4. APACHE FLUME:
 It is a distributed and reliable system that is used to collect,
aggregate, and move large amounts of log data from many data
sources toward a centralized data store.
5. APACHE SPARK:
 The main objective of Spark is to speed up the Hadoop computational
process; it was introduced by the Apache Software Foundation.
 Spark can work with Hadoop in two ways: for storage and for
processing.
 In practice, Spark often uses Hadoop for storage only, because Spark
has its own cluster-management computation (see the sketch below).
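Here is a hedged PySpark word-count sketch; the input file name is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("input.txt")               # one row per line of text
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.collect())
spark.stop()

Because Spark keeps intermediate results in memory, iterative jobs like this run
much faster than the equivalent disk-based MapReduce pipeline.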

6. APACHE KAFKA:
 It is a distributed publish-subscribe messaging system; more
specifically, it has a robust queue that allows you to handle a high
volume of data.
 You can pass messages from one point to another, that is, from a
sender to a receiver (see the sketch below).
 You can perform message computation in both offline and online
modes; it is suitable for both.
 To prevent data loss, Kafka messages are replicated within the cluster.
 For real-time streaming data analysis, it integrates with Apache Storm
and Spark, and it is built on top of the ZooKeeper synchronization
service.
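A hedged publish-subscribe sketch using the kafka-python package follows; the
broker address and the clickstream topic are assumptions:

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": "alice", "page": "/home"}')
producer.flush()  # make sure the message is actually sent

# Consumer: subscribe to the same topic and read messages.
consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # stop after the first message in this sketch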

7. MONGODB:
 It is cross-platform and works on the concepts of collections and
documents.
 It has document-oriented storage, which means data is stored in
JSON-like form (see the sketch below).
 It can index on any attribute. It has features like high
availability, replication, rich queries, auto-sharding, and fast
in-place updates.
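A hedged sketch with the PyMongo client follows; the connection string, database,
and collection names are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
posts = client["demo_db"]["posts"]

# Each record is a JSON-like document; documents in the same
# collection do not need identical fields.
posts.insert_one({"user": "alice", "text": "hello", "tags": ["intro"]})
posts.create_index("user")  # an index can be built on any attribute

for doc in posts.find({"user": "alice"}):
    print(doc)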

8. ELASTICSEARCH:
 It is a real-time distributed system and an open-source full-text
search and analytics engine.
 It is highly scalable, handling structured and unstructured data up
to petabytes, and it can be used as a replacement for document-based
stores such as MongoDB or RavenDB (a search sketch follows below).
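Here is a hedged full-text search sketch, assuming the 8.x Elasticsearch Python
client and a hypothetical articles index:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document (it becomes searchable in near real time).
es.index(index="articles", id=1,
         document={"title": "Big Data Basics",
                   "body": "An introduction to big data."})

# Full-text match query against the body field.
result = es.search(index="articles",
                   query={"match": {"body": "big data"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"])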
DIFFERENCES BETWEEN
TRADITIONAL DATA AND BIG DATA

 Traditional data is the kind of information that is easy to organize and
store in simple databases, like spreadsheets or small computer systems.
Big data, however, is much larger and more complex; it includes huge
amounts of information from many different sources, such as social media,
online videos, sensors in machines, or website clicks.

 Traditional data is generated at the enterprise level. Big data is
generated outside the enterprise level.

 Traditional data volume ranges from gigabytes to terabytes. Big data
volume ranges from petabytes to zettabytes or exabytes.

 Traditional database systems deal with structured data. Big data systems
deal with structured, semi-structured, and unstructured data.

 Traditional data sources are centralized and managed in centralized form.
Big data sources are distributed and managed in distributed form.

 Integrating traditional data is very easy. Integrating big data is very
difficult.

 A normal system configuration is capable of processing traditional data.
A high system configuration is required to process big data.

 The size of traditional data is very small. The size of big data is much
larger.

 Normal functions can manipulate traditional data. Special kinds of
functions are needed to manipulate big data.

 Traditional data is easy to manage and manipulate. Big data is difficult
to manage and manipulate.
BIG DATA VISUALIZATION

DEFINITION:

 Big data visualization is the process by which large amounts of analyzed
data are converted into an easy-to-comprehend visual format.

 By presenting complex data as graphs, charts, tables, diagrams, or other
visuals, users are able to grasp the meaning behind the information more
easily, and do so quickly.

 Big data visualization also refers to the techniques and tools used to
graphically represent large and complex datasets in a way that is
easy to understand and interpret.
 Given the volume, variety, and velocity of big data, traditional
visualization methods often fall short, requiring more sophisticated
approaches to make sense of such vast amounts of information.
HISTORY OF DATA VISUALIZATION

 WHAT MAKES A GOOD CHART?

[Figure: Napoleon's 1812 march, by Charles Joseph Minard]

 Prior to the 17th century, data visualization existed mainly in the realm of
maps, displaying land markers, cities, roads, and resources. As the demand
grew for more accurate mapping and physical measurement, better
visualizations were needed.
 In 1644, Michael Florent van Langren, a Flemish astronomer, is believed to
have provided the first visual representation of statistical data.
 His one-dimensional line graph showed the twelve estimates known at
the time of the difference in longitude between Toledo and Rome, as well as
the name of each astronomer who provided an estimate.

 This period also gave us William Playfair, who is widely considered to be
the inventor of many of the most popular graphs we use today (line, bar,
circle, and pie charts). Many statistical chart types, including histograms,
time series plots, contour plots, scatterplots, and others, were invented
during this period.

 The latter half of the 19th century is what Friendly calls the Golden Age of
statistical graphics. Two famous examples of data visualization from that era
include John Snow's (not that Jon Snow!) map of cholera outbreaks in the
London epidemic of 1854 and Charles Minard's 1869 chart showing the
number of men in Napoleon's infamous 1812 Russian campaign army, with
army location indicated on the X-axis and extreme cold temperatures
indicated at points where frostbite took a fatal toll.

 This time also gave us a new visualization, the rose chart, created
by Florence Nightingale.
 A number of factors contributed to this "Golden Age" of statistical
graphing, including the industrial revolution, which created the modern
business.
 The latter half of the 20th century is what Friendly calls the "rebirth of
data visualization", brought on by the emergence of computer processing.
 This rebirth was driven by John Tukey in the United States and Jacques
Bertin in France, who developed the science of information visualization
in the areas of statistics and cartography, respectively.

WHY DATA VISUALIZATION IS IMPORTANT

 Data visualization is important in big data analytics because it simplifies
complex datasets, making it easier to identify patterns, trends, and outliers. It
enhances decision-making by providing clear visual insights that facilitate
understanding and effective communication among stakeholders.
IS DATA VISUALIZATION PART OF BIG DATA?
 Data science and data visualization are not two different entities.
 They are bound to each other: data visualization is a subset of data
science, which is not a single process, method, or workflow.
WHICH ARE THE BEST DATA
VISUALIZATION SOFTWARE OF 2019?
 Sisense
 Looker
 Periscope Data
 Zoho Analytics
 Tableau
 DOMO
 Microsoft Power BI
 QlikView
DATA VISUALIZATION TECHNIQUES
 The type of data visualization technique you leverage will vary based on the
type of data you’re working with, in addition to the story you’re telling with
your data.

 Here are some important data visualization techniques to know (a simple
chart sketch follows the list):


 Pie chart

 Bar Chart
 Histogram
 Gantt Chart
 Heat Map
 Box and Whisker Plot
 Waterfall Chart
 Area Chart
 Scatter Plot
 Pictogram Chart
 Timeline
 Highlight Table
 Bullet Graph
 Choropleth Map
 Word Cloud
 Network Diagram
 Correlation Matrices
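To close with a concrete example, here is a minimal, hedged Matplotlib sketch of
one of the techniques above, a bar chart; the monthly sales figures are invented:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 130, 110]

plt.bar(months, sales)            # one bar per category
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.show()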
