Unit 1 PPT
Ghaziabad
Big Data (KCS061)
Session 2023-24
Course : B.Tech CSE VI sem
Dr K. P. Jayant
Department of Computer Science & Engineering
What is Data?
Raw facts & figures.
What is Big Data?
Apart from its significant volume, big data is also so complex that
conventional data management tools cannot effectively store or process it.
Classification of Data
Data Classification:
The process of classifying data into relevant categories so that it can be
used or applied more efficiently.
The classification of data makes it easy for the user to retrieve it.
Data classification is important when it comes to data security and
compliance, and also for meeting different business or personal objectives.
It is also a major requirement, as data must be easily retrievable
within a specific period of time.
1. Structured Data:
Structured data follows a pre-defined format or schema (typically rows and
columns), so it can be stored, accessed, and processed directly in a
relational database.
Examples –
• DWS (Data Ware House) ,
• DM (Data Mart),
• OLTP (Online transaction process) ,
• ODS (operational data source/store ),
• APIs (Application programming interface) ,
• ERP (Enterprise resource planning) ,
• CRM (Customer relationship management ),
• MIS (management information system)
• etc
2. Semi-Structured Data:
Semi-structured data is information that does not reside in a
relational database but that has some organizational properties
that make it easier to analyze.
With some processing, it can be stored in a relational database, but for
some kinds of semi-structured data this is very hard to do (a minimal
sketch follows the examples below).
In other words, it is data that does not conform to a rigid tabular schema
but carries tags or markers that separate and label its fields.
Examples –
• HTML,
• XML data,
• NoSQL databases,
• emails,
• CSV files,
• JSON (JavaScript Object Notation) files,
• log files,
• Excel files,
• etc.
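For illustration, here is a minimal Python sketch (standard library only; the record and field names are made up) of the point above: a semi-structured JSON record has tags but no fixed schema, yet with some processing it can be flattened into a relational (SQLite) table.

import json
import sqlite3

# A hypothetical semi-structured record: it carries tags (keys) but no fixed schema.
record = json.loads('{"id": 101, "name": "Asha", "contact": {"email": "asha@example.com"}}')

# "With some processing" the nested structure is flattened into a fixed set of columns.
row = (record["id"], record["name"], record["contact"]["email"])

conn = sqlite3.connect(":memory:")  # throwaway in-memory relational store
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?, ?)", row)
print(conn.execute("SELECT * FROM users").fetchall())  # [(101, 'Asha', 'asha@example.com')]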
3. Unstructured Data:
It is defined as data that does not follow a pre-defined standard or any
organized format.
This kind of data does not fit a relational database, because a relational
database expects data in a pre-defined, organized form.
Unstructured data is also very important in the big data domain, and there
are many platforms to manage and store it, such as NoSQL databases.
In other words
Unstructured data is not organized and does not have a
specific structure.
This type of data is often generated by humans or machines
and includes data such as text, images, audio, and video.
Analyzing unstructured data requires advanced data processing techniques
such as natural language processing,
image processing,
and machine learning (a minimal text-analysis sketch follows the examples below).
Examples –
• Word,
• PDF,
• text,
• media logs,
• audio, video,
• www,
• Geo location,
• social media like Twitter, Facebook, Instagram, etc.
• Mobile phone,
• smart watch,
• Wi-Fi,
• etc
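As a very small illustration of the processing techniques mentioned above, the following Python sketch (standard library only; the post text is invented) takes a piece of unstructured text and extracts a simple word-frequency pattern, a first step towards natural language processing.

from collections import Counter
import re

# A hypothetical piece of unstructured text, e.g. a social-media post.
post = "Big data is big. Big data needs big storage and fast processing."

# Tokenise crudely and count word frequencies.
words = re.findall(r"[a-z]+", post.lower())
print(Counter(words).most_common(3))  # e.g. [('big', 4), ('data', 2), ('is', 1)]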
Area/Uses
• Transportation.
• Advertising and Marketing.
• Banking and Financial Services.
• Government.
• Media and Entertainment.
• Meteorology.
• Healthcare.
• Cyber security.
• Banking and Securities
• Communications, Media and Entertainment
• Insurance
• Retail and Wholesale trade
• etc
Evolution of Technology
1990s - Emergence of Data Warehousing:
2000s - Rise of NoSQL Databases:
2006 - Introduction of Hadoop:
2008 - Growth of Cloud Computing:
2010s - Expansion of Big Data Ecosystem:
2010s - Emergence of Data Lakes:
2010s - Advanced Analytics and Machine Learning:
2010s - Real-time Processing
Big data technologies continue to evolve rapidly, with a focus on improving
performance, scalability, and ease of use. The adoption of containerization and
orchestration tools, such as Docker and Kubernetes, has also played a role in
streamlining big data deployments.
What is Big Data Architecture?
The term "Big Data architecture" refers to the systems and software used to
manage Big Data. A Big Data architecture must be able to handle the scale,
complexity, and variety of Big Data. It must also be able to support the needs of
different users, who may want to access and analyze the data differently.
1. Data Ingestion
This layer is responsible for collecting and storing data from various sources.
In Big Data, data ingestion is the process of extracting data from various
sources and loading it into a data repository.
Data ingestion is a key component of a Big Data architecture because it
determines how data will be ingested, transformed, and stored.
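A minimal ingestion sketch in Python (standard library only; the source data and table are hypothetical): data is extracted from a CSV source and loaded into a data repository, here an in-memory SQLite table.

import csv
import io
import sqlite3

# Hypothetical source data arriving as CSV text (in practice: files, APIs, message queues).
source = io.StringIO("sensor_id,reading\ns1,21.5\ns2,19.8\n")

# Extract: read the raw rows from the source.
rows = [(r["sensor_id"], float(r["reading"])) for r in csv.DictReader(source)]

# Load: write them into the data repository.
repo = sqlite3.connect(":memory:")
repo.execute("CREATE TABLE readings (sensor_id TEXT, reading REAL)")
repo.executemany("INSERT INTO readings VALUES (?, ?)", rows)
print(repo.execute("SELECT COUNT(*) FROM readings").fetchone())  # (2,)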
1. Volume:
Volume refers to the huge amount of data that is generated and stored.
2. Variety:
Variety describes the diversity of data types and their
heterogeneous sources.
3. Velocity:
Velocity describes how rapidly the data is generated and
how quickly it moves.
4. Veracity:
Since the data is pulled from diverse sources, the information can have
uncertainties, errors, redundancies, gaps, and inconsistencies.
It's bad enough when an analyst gets one set of data that has accuracy
issues; imagine getting tens of thousands of such datasets, or maybe
even millions.
5. Value:
The ultimate goal of big data architecture is to extract insights and value
from the data that can help organizations make better decisions.
Big data has become increasingly important in today's digital world, where
massive amounts of data are generated every second by individuals,
businesses, and machines. Here are some of the key reasons why big data is
important:
1. Improved decision-making:
By analyzing large and complex datasets, organizations can make more
informed and data-driven decisions, leading to better outcomes.
2. Cost savings:
Big data analytics can help identify inefficiencies in business operations,
leading to cost savings.
Big Data Analytics
It describes the process of uncovering trends, patterns, and correlations
in large amounts of raw data to help make data-informed decisions.
Descriptive Analytics
Through this type of analytics, we use the insight gained to answer the question
“What is happening now based on incoming data?”
Benefits
• It helps companies to make sense of the large amounts of raw
data they gather by focusing on the more critical areas.
• It helps companies understand their current business situation better in
comparison to the past.
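A minimal descriptive-analytics sketch (assuming the pandas library is available; the sales figures are invented): summarising the raw data a company already has to show what is happening now.

import pandas as pd

# Hypothetical daily revenue figures gathered by the company.
sales = pd.DataFrame({"day": range(1, 8),
                      "revenue": [120, 135, 128, 150, 147, 160, 155]})

# Descriptive analytics: summarise the data we already have.
print(sales["revenue"].describe())  # count, mean, std, min, quartiles, max
print("Change over the week:", sales["revenue"].iloc[-1] - sales["revenue"].iloc[0])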
Predictive Analytics
Through this type of analytics, we use the insight gained to answer the question
“What is likely to happen in the future?”
Benefits
• Reliable and more accurate forecast of the future.
• Companies can find ways to save and earn money, manage shipping
schedules, and stay on top of inventory requirements.
• It can help organizations attract new customers and retain existing
ones.
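A minimal predictive-analytics sketch (assuming NumPy is available; the demand figures are invented): a simple trend line, standing in for a real forecasting model, is fitted to past data and used to predict the next value.

import numpy as np

# Hypothetical historical demand for the past six months.
months = np.array([1, 2, 3, 4, 5, 6])
demand = np.array([100, 108, 118, 125, 133, 142])

# Fit a straight-line trend (a stand-in for a real predictive model).
slope, intercept = np.polyfit(months, demand, 1)

# Forecast what is likely to happen next month.
print("Forecast for month 7:", slope * 7 + intercept)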
Prescriptive Analytics
“What action should be taken?”
Others are……….
Text Analytics (or Text Mining)
Spatial Analytics
Purpose: Spatial analytics involves analyzing geographic or location-based
data to understand patterns, relationships, and trends associated with
specific locations.
Examples: Geographic Information System (GIS) analysis,
location-based recommendation systems, and mapping.
Streaming Analytics
Preservation Analytics
Purpose: Preservation analytics focuses on maintaining and
ensuring the quality, reliability, and integrity of data over time.
Video Analytics
• Collecting,
• Processing,
• Cleaning,
• and analyzing large datasets using Big Data technology.
Feature Extraction
Pattern Recognition
Identifying meaningful patterns and structures in the data that
can provide appropriate information.
This can include the detection of anomalies, trends, correlations,
and other important relationships.
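A minimal pattern-recognition sketch (standard library only; the latency values are invented): anomalies, one of the relationships mentioned above, are flagged as points lying far from the mean.

from statistics import mean, stdev

# Hypothetical response times in milliseconds; one value is clearly unusual.
latencies = [102, 98, 105, 99, 101, 240, 97, 103]

mu, sigma = mean(latencies), stdev(latencies)

# Flag anomalies: points more than two standard deviations from the mean.
anomalies = [x for x in latencies if abs(x - mu) > 2 * sigma]
print("Anomalies:", anomalies)  # [240]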
Key Component of Intelligent Data Analysis (IDA)
Decision Making
Visualization
Presenting the results in a visual and interpretable format,
making it easier for stakeholders to understand and act upon the
findings.
Visualization tools help in conveying complex information in a
more accessible manner.
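A minimal visualization sketch (assuming the matplotlib library is available; the regions and figures are invented): aggregated results are presented as a bar chart so stakeholders can read them at a glance.

import matplotlib.pyplot as plt

# Hypothetical aggregated results produced by an earlier analysis step.
regions = ["North", "South", "East", "West"]
revenue = [420, 380, 510, 295]

# Present the findings visually.
plt.bar(regions, revenue)
plt.title("Revenue by region")
plt.ylabel("Revenue")
plt.savefig("revenue_by_region.png")  # or plt.show() in an interactive session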
Data Collection
Data Storage
Data Sharing
Big Data Ethics
Big data ethics means ensuring the responsible and ethical use of data in
the context of big data analytics.
• Informed Consent
• Transparency
• Fairness and Bias
• Accountability
• Legal Compliance
• Continuous Monitoring and Auditing
1. Privacy Concerns:
Data Collection:
Big data often involves the collection of extensive and diverse
datasets. Privacy concerns arise when personally identifiable
information (PII) is included without adequate consent.
Data Storage
Safeguarding data during storage is crucial to prevent
unauthorized access and data breaches. Encryption and access
controls are common measures.
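A minimal encryption-at-rest sketch (assuming the third-party cryptography package; the record is invented): a symmetric key encrypts a record before it is written to storage, so only key holders can read it back.

from cryptography.fernet import Fernet

# Generate a symmetric key and encrypt a record before it is written to storage.
key = Fernet.generate_key()
f = Fernet(key)

ciphertext = f.encrypt(b"name=Asha;email=asha@example.com")  # what is stored at rest
print(f.decrypt(ciphertext))  # only holders of the key can read the data back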
Data Sharing
Sharing data among organizations or third parties for collaborative
projects can pose privacy risks. Organizations must ensure proper
agreements and safeguards are in place.
2. Transparency and Accountability:
Transparency
Organizations should be transparent about their data practices, providing
clear information on data collection, storage, and usage policies.
Accountability
Organizations should be accountable for the consequences of their data
practices. This includes taking responsibility for any negative impacts on
individuals or groups.
3. Legal Compliance:
Anonymization
De-identification
Data Governance:
Security Measures
Monitoring
Auditing
Analysis Report
Test -1