
UNIT 1

▪ Data continues to be a precious and irreplaceable asset for enterprises.
▪ Data is present both inside and outside the four walls of the enterprise.
▪ Data is present in homogeneous sources as well as in heterogeneous sources.
Approximate Distribution of Digital Data

▪ The need of the hour is to understand, manage and process the data, and take it up for analysis to draw valuable insights.

Data -> Information
Information -> Insights
Classification of digital data

1. Unstructured data
▪ Data which does not conform to a data model, or is not in a form which can be used easily by a computer program.
▪ 80–90% of data is in this form.
▪ Examples: PPTs, images, videos, the body of an email, etc.

2. Semi-structured data
▪ Does not conform to a data model but has some structure. Still not in a form which can be used easily by a computer program.
▪ Examples: email, XML, HTML, etc.

3. Structured data
▪ Data in an organised form.
▪ Understandable by a computer program.
▪ Example: data stored in databases.
Structured data

-> When is data structured? When the data conforms to a pre-defined schema/structure.
-> Think structured data, and think data model.
-> In the context of an RDBMS, data conforms to the relational model: rows and columns.
-> Cardinality: the number of rows.
-> Degree: the number of columns.
Steps
1) Design a relation/table
   -> Fields to store -> Data types
2) Constraints we would like our data to conform to:
   – Unique
   – Not Null
   – Business constraints
   – Permissible values a column can accept
3) Figure out the relationships between tables
4) Referential integrity constraints (see the SQL sketch below)
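The constraints above map directly onto SQL. Below is a minimal, hypothetical sketch using Python's built-in sqlite3 module (the table and column names are invented; a production system would use an RDBMS such as MySQL or PostgreSQL):

import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity

# 1) Design the relation: fields and data types.
# 2) Constraints: UNIQUE, NOT NULL, permissible values (CHECK).
conn.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    )
""")
conn.execute("""
    CREATE TABLE student (
        roll_no  INTEGER PRIMARY KEY,                        -- unique, not null
        name     TEXT NOT NULL,
        semester INTEGER CHECK (semester BETWEEN 1 AND 8),   -- permissible values
        dept_id  INTEGER REFERENCES department(dept_id)      -- 4) referential integrity
    )
""")

conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute("INSERT INTO student VALUES (101, 'Asha', 3, 1)")
# This would raise IntegrityError: semester 9 violates the CHECK constraint.
# conn.execute("INSERT INTO student VALUES (102, 'Ravi', 9, 1)")
conn.commit()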
Sources
When data is structured, leverage the available RDBMS offerings.

Sources of structured data:
▪ Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
▪ Spreadsheets
▪ OLTP systems: databases that store the operational data generated and collected by day-to-day business activities.
Ease with Structured Data
▪ Input / Update / Delete
▪ Security
▪ Indexing / Searching
▪ Scalability
▪ Transaction processing
ACID

Atomicity: a transaction happens in its entirety or not at all.

Consistency: if the same information is stored at two or more places, the copies are in complete agreement.

Isolation: resources are allocated to a transaction in such a way that the transaction gets the impression it is the only one running; transactions happen in isolation.

Durability: changes made to the database during a transaction are permanent.
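A minimal sketch of atomicity and durability using Python's built-in sqlite3 module (the account names and amounts are made up): either both legs of the transfer commit, or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # Both updates form one transaction: debit alice, credit bob.
    conn.execute("UPDATE account SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()      # durability: once committed, the change is permanent
except Exception:
    conn.rollback()    # atomicity: on any failure, neither update survives

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'alice': 70, 'bob': 80}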
Semi-structured data

▪ Self-describing structure.
▪ Examples: emails, XML, markup languages like HTML, etc.

Features
▪ Does not conform to a data model.
▪ Uses tags to segregate semantic elements.
▪ Tags are also used to enforce hierarchies of records and fields within the data.
▪ No separation between data and schema.
▪ Entities belonging to the same class, and even grouped together, need not have the same attributes, and the attributes need not be in the same order (see the sketch below).
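A small illustration in Python (the field names are invented): two JSON records of the same class carry their own labels, and one record has an extra attribute the other lacks.

import json

records = json.loads("""
[
  {"name": "Asha", "email": "asha@example.com"},
  {"email": "ravi@example.com", "name": "Ravi", "phone": "98xxxxxx01"}
]
""")

for rec in records:
    # Schema is blended with the data: each label/value pair describes itself;
    # attributes differ across records and arrive in no fixed order.
    print(sorted(rec.keys()))
# ['email', 'name']
# ['email', 'name', 'phone']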
Characteristics of Semi-structured Data
▪ Inconsistent structure
▪ Self-describing (label/value pairs)
▪ Schema information is often blended with data values
▪ Data objects may have different attributes, not known beforehand
Sources of Semi-structured Data
▪ XML (eXtensible Markup Language): popularised by web services.
▪ Other markup languages.
▪ JSON (JavaScript Object Notation): used to transmit data between a server and a web application. MongoDB stores data natively in a JSON-like format (BSON).
Unstructured Data

▪ This is data which does not conform to a data model, or is not in a form which can be used easily by a computer program.

▪ About 80–90% of an organization's data is in this format.

▪ Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
Sources of Unstructured Data
▪ Web pages
▪ Images
▪ Free-form text
▪ Audios
▪ Videos
▪ Body of email
▪ Text messages
▪ Chats
▪ Social media data
▪ Word documents
Issues with Terminology – Unstructured Data

▪ Structure can be implied despite not being formally defined.

▪ Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.

▪ Data may have some structure, or may even be highly structured, in ways that are unanticipated or unannounced.
Dealing with Unstructured Data
▪ Data mining
▪ Natural Language Processing (NLP)
▪ Text analytics
▪ Noisy text analytics

Data Mining
▪ First, we deal with large data sets.
▪ Second, we use methods at the intersection of AI, machine learning, statistics and database systems to unearth consistent patterns in large data sets and systematic relationships between variables.

▪ Association rule mining: market basket analysis, i.e. what goes with what (e.g. bread and cheese).
▪ Regression analysis: models the relationship between two variables, a dependent and an independent variable; the value to be predicted is the dependent variable (see the sketch below).
▪ Collaborative filtering: predicting a user's preferences based on the preferences of a group of users.
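To make the regression bullet concrete, here is a minimal sketch using NumPy least squares (the spend and sales figures are invented):

import numpy as np

# Hypothetical data: advertising spend (independent) vs. sales (dependent).
spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales = np.array([25, 41, 62, 79, 102], dtype=float)

# Fit sales = slope * spend + intercept by least squares.
slope, intercept = np.polyfit(spend, sales, deg=1)

# Predict the dependent variable for a new value of the independent variable.
predicted = slope * 60 + intercept
print(f"sales ~ {slope:.2f} * spend + {intercept:.2f}; predicted at spend=60: {predicted:.1f}")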
What is Big Data?

▪ Big data is defined as collections of datasets whose volume, velocity or variety is so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools.

▪ In recent years, there has been exponential growth in both the structured and unstructured data generated by information technology, industrial, healthcare, Internet of Things, and other systems.
Definition of Big Data

Big data refers to
• datasets whose size is typically beyond the storage capacity of, and also too complex for, traditional database software tools.

Big data is anything beyond the human and technical infrastructure needed to support storage, processing and analysis.

Big data is
• high-volume, high-velocity and high-variety information assets that
• demand cost-effective, innovative forms of information processing
• for enhanced insight and decision making.
Part I of the definition: talks about voluminous data that may have great variety and will require a good speed/pace for storage, preparation, processing and analysis.

Part II of the definition: talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate and visualize the high-volume, high-velocity and high-variety data.

Part III of the definition: talks about deriving deeper, richer and more meaningful insights, and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.

Data -> Information -> Actionable intelligence -> Better decisions -> Enhanced business value
Below are some key pieces of data from the report:

• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4,300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
Big Data analytics deals with the collection, storage, processing and analysis of this massive-scale data.

Specialized tools and frameworks are required for big data analysis when:

(1) the volume of data involved is so large that it is difficult to store, process and analyze the data on a single machine,

(2) the velocity of data is very high and the data needs to be analyzed in real-time,

(3) there is a variety of data involved, which can be structured, unstructured or semi-structured, and is collected from multiple data sources,

(4) various types of analytics need to be performed to extract value from the data, such as descriptive, diagnostic, predictive and prescriptive analytics.
Big data analytics is enabled by several technologies such as cloud computing, distributed and parallel processing frameworks, non-relational databases and in-memory computing.

Some examples of big data are listed as follows:

• Data generated by social networks, including text, images, audio and video data
• Click-stream data generated by web applications, such as e-Commerce, to analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock market data
Characteristics of Big Data

Volume

▪ Big data is a form of data whose volume is so large that it would not fit on a single machine; therefore, specialized tools and frameworks are required to store, process and analyze such data.

▪ The volume of data generated by modern IT, industrial, healthcare, Internet of Things, and other systems is growing exponentially, driven by the lowering costs of data storage.

▪ There is no fixed threshold for the volume of data to be considered big data; typically, the term is used for massive-scale data that is difficult to store, manage and process using traditional databases and data processing architectures.
A Mountain of Data

Bits - Bytes - Kilobytes - Megabytes - Gigabytes - Terabytes - Petabytes - Exabytes - Zettabytes - Yottabytes

1 Kilobyte (KB) = 1,000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
Where does this data get generated?

- A multitude of sources: XLS, DOC, YouTube videos, chat conversations, customer feedback, CCTV coverage.

1. Typical internal data sources: data present within an organization's firewall
   - Data storage: file systems, SQL, NoSQL, ...
   - Archives: archives of scanned documents, paper archives, customer correspondence records, patients' health records, student admission records, and so on.

2. External data sources: data residing outside an organization's firewall
   - Public web: Wikipedia, weather, regulatory, census data.

3. Both internal and external data sources
Velocity

▪ Velocity of data refers to how fast the data is generated.

▪ Data generated by certain sources can arrive at very high velocities, for example, social media data or sensor data.

▪ Velocity is another important characteristic of big data and the primary reason for the exponential growth of data.

▪ High velocity of data results in a very large volume of data accumulating in a short span of time.

▪ Some applications have strict deadlines for data analysis (such as trading or online fraud detection), and the data needs to be analyzed in real-time.

Processing modes: Batch -> Periodic -> Near real-time -> Real-time processing
Variety

▪ Variety refers to the forms of the data.

▪ Big data comes in different forms such as structured, unstructured or semi-structured, including text, image, audio, video and sensor data.

▪ Big data systems need to be flexible enough to handle such variety of data.
Veracity

▪ Veracity refers to how accurate the data is.

▪ To extract value from the data, the data needs to be cleaned to remove noise.

▪ Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate.

▪ Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out.
Value

▪ Value of data refers to the usefulness of the data for the intended purpose.

▪ The end goal of any big data analytics system is to extract value from the data.

▪ The value of the data is also related to the veracity, or accuracy, of the data.

▪ For some applications, value also depends on how fast we are able to process the data.
Analytics is a broad term that encompasses the processes, technologies, frameworks and algorithms to extract meaningful insights from data.

Analytics is the process of extracting and creating information from raw data by filtering, processing, categorizing, condensing and contextualizing the data.

The information obtained is then organized and structured to infer knowledge about the system and/or its users, its environment, its operations and its progress towards its objectives, thus making the systems smarter and more efficient.

The choice of the technologies, algorithms and frameworks for analytics is driven by the goals of the analytics task.
The goals of the analytics task may be:

(1) to predict something (for example, whether a transaction is a fraud or not, whether it will rain on a particular day, or whether a tumor is benign or malignant),

(2) to find patterns in the data (for example, finding the top 10 coldest days in the year, finding which pages are visited the most on a particular website, or finding the most searched celebrity in a particular year),

(3) to find relationships in the data (for example, finding similar news articles, or finding similar patients in an electronic health record system).
The National Research Council [1] has characterized the computational tasks for massive data analysis (called the seven "giants").

These computational tasks include:

(1) Basic Statistics, (2) Generalized N-Body Problems, (3) Linear Algebraic Computations, (4) Graph-Theoretic Computations, (5) Optimization, (6) Integration and (7) Alignment Problems.

This characterization of computational tasks aims to provide a taxonomy of tasks that have proved to be useful in data analysis, grouping them roughly according to mathematical structure and computational strategy.
Descriptive Analytics

▪ Descriptive analytics comprises analyzing past data to present it in a summarized form which can be easily interpreted. Descriptive analytics aims to answer: What has happened?

▪ A major portion of analytics done today is descriptive analytics, through the use of statistical functions such as counts, maximum, minimum, mean, top-N and percentage.

▪ These statistics help in describing patterns in the data and presenting the data in a summarized form.

▪ For example, computing the total number of likes for a particular post, computing the average monthly rainfall, or finding the average number of visitors per month on a website.
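A minimal sketch of these descriptive statistics in Python (the daily like counts are invented):

from statistics import mean

# Hypothetical daily like counts for a post over two weeks.
likes = [120, 95, 143, 210, 88, 176, 132, 99, 251, 187, 140, 115, 163, 201]

print("count:", len(likes))
print("min:  ", min(likes))
print("max:  ", max(likes))
print("mean: ", round(mean(likes), 1))
print("top-3:", sorted(likes, reverse=True)[:3])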
Diagnostic Analytics

▪ Diagnostic analytics comprises the analysis of past data to diagnose the reasons why certain events happened.

▪ Diagnostic analytics aims to answer: Why did it happen?

▪ Consider, for example, a system that collects and analyzes sensor data from machines for monitoring their health and predicting failures.

▪ While descriptive analytics can be useful for summarizing the data by computing various statistics (such as mean, minimum, maximum, variance, or top-N), diagnostic analytics can provide more insight into why a certain fault has occurred, based on the patterns in the sensor data for previous faults.
Predictive Analytics

▪ Predictive analytics comprises predicting the occurrence of an event, predicting the likely outcome of an event, or forecasting future values using prediction models.

▪ Predictive analytics aims to answer: What is likely to happen?

▪ For example, predictive analytics can be used for predicting when a fault will occur in a machine, predicting whether a tumor is benign or malignant, predicting the occurrence of natural emergencies (events such as forest fires or river floods), or forecasting pollution levels.

▪ Predictive analytics is done using prediction models which are trained on existing data. These models learn patterns and trends from the existing data and predict the occurrence of an event.
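As a hedged sketch of such a prediction model (the feature values and labels are fabricated), here is a scikit-learn decision tree classifying machine sensor readings as 'ok' or 'fault':

from sklearn.tree import DecisionTreeClassifier

# Fabricated training data: [temperature, vibration] readings and their labels.
X_train = [[60, 0.2], [65, 0.3], [90, 1.1], [95, 1.4], [70, 0.4], [88, 1.0]]
y_train = ["ok", "ok", "fault", "fault", "ok", "fault"]

# Train the prediction model on existing data ...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ... then predict the likely outcome for new, unseen readings.
print(model.predict([[62, 0.25], [93, 1.2]]))  # expected: ['ok' 'fault']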
Prescriptive Analytics

▪ While predictive analytics uses prediction models to predict the likely outcome of an event, prescriptive analytics uses multiple prediction models to predict various outcomes and the best course of action for each outcome.

▪ Prescriptive analytics aims to answer: What can we do to make it happen?

▪ Prescriptive analytics can predict the possible outcomes based on the current choice of actions.

▪ For example, prescriptive analytics can be used to prescribe the best medicine for the treatment of a patient, based on the outcomes of various medicines for similar patients.

▪ Another example of prescriptive analytics would be suggesting the best mobile data plan for a customer, based on the customer's browsing patterns.
Domain Specific Examples of Big Data

The applications of big data span a wide range of domains including (but not limited to) homes, cities, environment, energy systems, retail, logistics, industry, agriculture, Internet of Things, and healthcare.

Various applications of big data for each of these domains are described below.

Web

• Web Analytics: Web analytics deals with the collection and analysis of data on user visits to websites and cloud applications.

Analysis of this data can give insights into user engagement and help track the performance of online advertisement campaigns.

For collecting data on user visits, two approaches are used.

In the first approach, user visits are logged on the web server, which collects data such as the date and time of the visit, the resource requested, the user's IP address and the HTTP status code.

The second approach, called page tagging, uses JavaScript code which is embedded in the web page.

The benefit of the page tagging approach is that it facilitates real-time data collection and analysis. It also allows third-party services, which do not have access to the web server (serving the website), to collect and process the data.

These specialized analytics service providers (such as Google Analytics) offer advanced analytics and summarized reports on user sessions, page visits, top entry and exit pages, bounce rate and most visited pages.
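A minimal sketch of the first (server-side logging) approach: parsing one Common Log Format line with Python's standard library. The sample log line is made up.

import re

# One made-up line in Common Log Format: IP, timestamp, request, status, bytes.
line = '203.0.113.7 - - [12/Mar/2015:10:15:32 +0530] "GET /products HTTP/1.1" 200 5123'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

m = pattern.match(line)
if m:
    # The fields a web analytics pipeline would aggregate per visit.
    print(m.group("ip"), m.group("time"), m.group("resource"), m.group("status"))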
• Performance Monitoring: Multi-tier web and cloud applications, such as
  • e-Commerce,
  • Business-to-Business,
  • Healthcare, Banking and Financial, and
  • Retail and Social Networking applications,
can experience rapid changes in their workloads.

Provisioning and capacity planning is a challenging task for complex multi-tier applications, since each class of application has different deployment configurations with web servers, application servers and database servers.

For performance monitoring, various types of tests can be performed, such as
• load tests (which evaluate the performance of the system with multiple users and workload levels) and
• stress tests.

Big data systems can be used to analyze the data generated by such tests, to predict application performance under heavy workloads and to identify bottlenecks in the system so that failures can be prevented.
• Ad Targeting & Analytics:

Search and display advertisements are the two most widely used approaches for Internet advertising.

In search advertising, users are shown advertisements ("ads"), along with the search results, as they search for specific keywords on a search engine.

Advertisers can create ads using the advertising networks provided by the search engines or social media networks.

These ads are set up for specific keywords which are related to the product or service being advertised.

Users searching for these keywords are shown ads along with the search results.

Display advertising is another form of Internet advertising, in which the ads are displayed within websites, videos and mobile applications that participate in the advertising network.

The ad network matches these ads against the content of the website, video or mobile application and places the ads.

The most commonly used compensation method for Internet ads is pay-per-click (PPC), in which advertisers pay each time a user clicks on an advertisement.

Advertising networks use big data systems for matching and placing advertisements and for generating advertisement statistics reports.
• Advertisers can use big data tools for tracking the performance of advertisements, optimizing the bids for pay-per-click advertising, and tracking which keywords lead the most to the advertising landing page.

• Content Recommendation: Content delivery applications that serve content (such as music and video streaming applications) collect various types of data, such as user search patterns and browsing history, history of content consumed, and user ratings.

Such applications can leverage big data systems for recommending new content to the users based on the users' preferences and interests.
Financial

• Credit Risk Modeling: Banking and financial institutions use credit risk modeling to score credit applications and predict whether a borrower will default in the future.

Credit risk models are created from customer data that includes credit scores obtained from credit bureaus, credit history, account balance data, account transactions data and spending patterns of the customer.

Big data systems can help in computing credit risk scores for a large number of customers on a regular basis.

These frameworks can be used to build credit risk models by analysis of customer data.

• Fraud Detection: Banking and financial institutions can leverage big data systems for detecting frauds such as credit card fraud, money laundering and insurance claim fraud.

Real-time big data analytics frameworks can help in analyzing data from disparate sources and labeling transactions in real-time.
Healthcare

▪ The healthcare ecosystem consists of numerous entities, including healthcare providers (primary care physicians, specialists, or hospitals), payers (government, private health insurance companies, employers), pharmaceutical, device and medical service companies, IT solutions and services firms, and patients.

▪ The process of provisioning healthcare involves massive healthcare data that exists in different forms (structured or unstructured), is stored in disparate data sources (such as relational databases or file servers) and in many different formats.

▪ To promote more coordination of care across the multiple providers involved with patients, their clinical information is increasingly aggregated from diverse sources into Electronic Health Record (EHR) systems.

▪ EHRs capture and store information on patient health and provider actions, including individual-level laboratory results, diagnostic, treatment and demographic data.

▪ Though the primary use of EHRs is to maintain all medical data for an individual patient and to provide efficient access to the stored data at the point of care, EHRs can be the source of valuable aggregated information about overall patient populations.

▪ Big data systems can be used for data collection from different stakeholders (patients, doctors, payers, physicians, specialists, etc.) and disparate data sources.

▪ Big data analytics systems allow massive-scale clinical data analytics, facilitate the development of more efficient healthcare applications, improve the accuracy of predictions and help in timely decision making.
Internet of Things

▪ Internet of Things (IoT) refers to things that have unique identities and are connected to the Internet.

▪ The "things" in IoT are devices which can perform remote sensing, actuating and monitoring.

▪ IoT devices can exchange data with other connected devices and applications (directly or indirectly), or collect data from other devices and process the data.

▪ IoT systems can leverage big data technologies for storage and analysis of data.

IoT applications that can benefit from big data systems:
• Intrusion Detection
• Smart Parking
• Smart Roads
• Structural Health Monitoring
• Smart Irrigation
▪ Intrusion Detection: Intrusion detection systems use security cameras and sensors (such as PIR sensors and door sensors) to detect intrusions and raise alerts.

▪ Smart Parking: Smart parking makes the search for a parking space easier and more convenient for drivers. Smart parking is powered by IoT systems that detect the number of empty parking slots and send the information over the Internet to smart parking application back-ends.

▪ Smart Roads: Smart roads equipped with sensors can provide information on driving conditions and travel time estimates, and raise alerts in case of poor driving conditions, traffic congestion and accidents.

▪ Structural Health Monitoring: Structural health monitoring systems use a network of sensors to monitor the vibration levels in structures such as bridges and buildings. The data collected from these sensors is analyzed to assess the health of the structures.

▪ Smart Irrigation: Smart irrigation systems can improve crop yields while saving water. Smart irrigation systems use IoT devices with soil moisture sensors to determine the amount of moisture in the soil, and release the flow of water when the moisture levels go below a predefined threshold.
Environment

▪ Environment monitoring systems generate high-velocity and high-volume data.

▪ Accurate and timely analysis of such data can help in understanding the current status of the environment and also in predicting environmental trends.

▪ Weather Monitoring: Weather monitoring systems can collect data from a number of attached sensors. This data can then be analyzed and visualized for monitoring weather and generating weather alerts.

▪ Air Pollution Monitoring: Air pollution monitoring systems can monitor the emission of harmful gases (CO2, CO, NO, or NO2) by factories and automobiles using gaseous and meteorological sensors.

▪ Noise Pollution Monitoring: Due to growing urban development, noise levels in cities have increased and even become alarmingly high in some cities. Urban noise maps can help policy makers in urban planning and in making policies to control noise levels near residential areas, schools and parks.

▪ Forest Fire Detection: There can be different causes of forest fires, including lightning, human negligence, volcanic eruptions and sparks from rock falls. Forest fire detection systems use a number of monitoring nodes deployed at different locations in a forest.

▪ River Flood Detection: River flood monitoring systems use a number of sensor nodes that monitor the water level (using ultrasonic sensors) and flow rate (using flow velocity sensors).

▪ Big data systems can be used to collect and analyze data from a number of such sensor nodes and raise alerts when a rapid increase in water level and flow rate is detected.
Logistics & Transportation

▪ Real-time Fleet Tracking: Vehicle fleet tracking systems use GPS technology to track the locations of vehicles in real-time. Big data systems can be used to aggregate and analyze vehicle location and route data, for detecting bottlenecks in the supply chain (such as traffic congestion on routes) and for the assignment and generation of alternative routes.

▪ Shipment Monitoring: Monitoring the conditions inside shipping containers, for example to detect food spoilage.

▪ Remote Vehicle Diagnostics: Remote vehicle diagnostic systems can detect faults in vehicles or warn of impending faults.

▪ Route Generation & Scheduling: Modern transportation systems are driven by data collected from multiple sources, which is processed to provide new services to the stakeholders, such as advanced route guidance, dynamic vehicle routing, and anticipating customer demands for pickup and delivery problems.

▪ Hyper-local Delivery: Hyper-local delivery platforms are being increasingly used by businesses such as restaurants and grocery stores to expand their reach. These platforms allow customers to order products (such as grocery and food items) using web and mobile applications, and the products are sourced from local stores.

▪ Cab/Taxi Aggregators: On-demand transport technology aggregators (or cab/taxi aggregators) allow customers to book cabs using web or mobile applications, and the requests are routed to the nearest available cabs.
Industry

▪ Machine Diagnosis & Prognosis: Machine prognosis refers to predicting the performance of a machine by analyzing data on the current operating conditions and the deviations from normal operating conditions. Machine diagnosis refers to determining the cause of a machine fault. Industrial machines have a large number of components that must function correctly for the machine to perform its operations. Sensors in machines can monitor the operating conditions.

▪ Risk Analysis of Industrial Operations: In many industries, there are strict requirements on environmental conditions and equipment working conditions. Harmful and toxic gases such as carbon monoxide (CO), nitrogen monoxide (NO) and nitrogen dioxide (NO2) can cause serious health problems. Gas monitoring systems can help in monitoring indoor air quality using various gas sensors.

▪ Production Planning and Control: Production planning and control systems measure various parameters of production processes and control the entire production process in real-time.
Retail

▪ Retailers can use big data systems for boosting sales, increasing profitability and improving customer satisfaction.

▪ Inventory Management: Inventory management has become increasingly important in recent years with growing competition. While over-stocking of products can result in additional storage expenses and risk, under-stocking can lead to loss of revenue. Tags attached to products allow them to be tracked in real-time, so that inventory levels can be determined accurately and products which are low in stock can be replenished.

▪ Customer Recommendations: Big data systems can be used to analyze customer data (such as demographic data, shopping history, or customer feedback) and predict customer preferences.

▪ Store Layout Optimization: Big data systems can help in analyzing data on customer shopping patterns and customer feedback to optimize store layouts.

▪ Forecasting Demand: Due to the large number of products, seasonal variations in demand, and changing trends and customer preferences, retailers find it difficult to forecast demand and sales volumes. Big data systems can be used to analyze customer purchase patterns and predict demand and sales volumes.
Case Study: Weather Data Analysis

▪ Using the big data stack for the analysis of weather data.

▪ To come up with a selection of tools and frameworks from the big data stack that can be used for weather data analysis, let us first work out the analytics flow for the application.
Data Collection

▪ Let us assume we have multiple weather monitoring stations or end-nodes equipped with temperature, humidity, wind and pressure sensors.

▪ To collect and ingest the streaming sensor data generated by the weather monitoring stations, we can use a publish-subscribe messaging framework to ingest data for real-time analysis within the big data stack, and

▪ a source-sink connector to ingest data into a distributed filesystem for batch analysis.
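A minimal sketch of a station publishing its readings, assuming the kafka-python client and a broker at localhost:9092 (the topic name and reading fields are invented):

import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each weather station publishes its sensor readings to a topic.
reading = {
    "station_id": "station-42",   # hypothetical station id
    "timestamp": time.time(),
    "temperature_c": 24.6,
    "humidity_pct": 61,
    "wind_kmph": 12.3,
    "pressure_hpa": 1012.4,
}
producer.send("weather-readings", value=reading)
producer.flush()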
Data Preparation

▪ Since the weather data received from different monitoring stations can have missing values, use different units and have different formats, we may need to prepare the data for analysis by cleaning, wrangling, normalizing and filtering it.
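A small sketch of such preparation with pandas (the column names and the Fahrenheit-reporting station are assumptions): drop incomplete rows, normalize units, and filter implausible values.

import pandas as pd

# Hypothetical raw readings: one station reports Fahrenheit, one row is incomplete.
raw = pd.DataFrame({
    "station": ["s1", "s2", "s1", "s3"],
    "temperature": [24.6, 76.1, None, 19.8],
    "unit": ["C", "F", "C", "C"],
})

df = raw.dropna(subset=["temperature"]).copy()  # cleaning: drop missing readings

# Normalizing: convert Fahrenheit readings to Celsius.
f = df["unit"] == "F"
df.loc[f, "temperature"] = (df.loc[f, "temperature"] - 32) * 5 / 9
df["unit"] = "C"

# Filtering: discard physically implausible values.
df = df[df["temperature"].between(-60, 60)]
print(df)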
Analysis Types

▪ The choice of analysis types is driven by the requirements of the application.

Let us say we want our weather analysis application:

• to aggregate data on various timescales (minute, hourly, daily or monthly);

• to determine the mean, maximum and minimum readings for temperature, humidity, wind and pressure;

• to support interactive querying for exploring the data, for example, queries such as: finding the day with the lowest temperature in each month of a year, or finding the top-10 wettest days in the year.

▪ These types of analysis come under the basic statistics category.

▪ Next, we also want the application to make predictions of certain weather events, for example, to predict the occurrence of fog or haze. For such an analysis, we would require a classification model.

▪ Additionally, if we want to predict values (such as the amount of rainfall), we would require a regression model.
Analysis Modes

▪ Based on the analysis types determined in the previous step, we know that the analysis modes required for the application will be batch, real-time and interactive.

Visualizations

▪ The front-end application for visualizing the analysis results would be dynamic and interactive.
Mapping Analysis Flow to Big Data Stack

▪ Now that we have the analytics flow for the application, let us map the selections at each step of the flow to the big data stack.

▪ The figure shows a subset of the components of the big data stack, based on the analytics flow.

▪ To collect and ingest the streaming sensor data generated by the weather monitoring stations, we can use a publish-subscribe messaging framework such as Apache Kafka (for real-time analysis within the big data stack).

▪ Each weather station publishes its sensor data to Kafka.

▪ Real-time analysis frameworks such as Storm and Spark Streaming can receive data from Kafka for processing.
▪ For batch analysis, we can use a source-sink connector such as Flume to move the data to HDFS.

▪ Once the data is in HDFS, we can use batch processing frameworks such as Hadoop MapReduce.
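To illustrate the MapReduce pattern itself (a pure-Python simulation over made-up records, not a full Hadoop job), here is a computation of the maximum daily temperature:

from collections import defaultdict

# Made-up (date, temperature) records as they might arrive from HDFS.
records = [
    ("2015-03-12", 24.6), ("2015-03-12", 27.1),
    ("2015-03-13", 22.0), ("2015-03-13", 25.4), ("2015-03-12", 26.2),
]

# Map phase: emit (key, value) pairs; here the date is the key.
mapped = [(date, temp) for date, temp in records]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group; here, the daily maximum.
daily_max = {date: max(temps) for date, temps in groups.items()}
print(daily_max)  # {'2015-03-12': 27.1, '2015-03-13': 25.4}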

▪ While the batch and real-time processing frameworks are useful when the analysis requirements and goals are known upfront, interactive querying tools are useful for exploring the data.
▪ We can use an interactive querying framework such as Spark SQL, which can query the data in HDFS, for interactive queries.
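A hedged sketch of such an interactive query with PySpark (the HDFS path and column names are assumptions): finding the top-10 wettest days of the year.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weather-exploration").getOrCreate()

# Assumed layout: CSV files in HDFS with columns date, temperature, rainfall.
df = spark.read.csv("hdfs:///weather/2015/*.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("weather")

# Interactive query: top-10 wettest days of the year.
spark.sql("""
    SELECT date, SUM(rainfall) AS total_rainfall
    FROM weather
    GROUP BY date
    ORDER BY total_rainfall DESC
    LIMIT 10
""").show()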

▪ For presenting the results of batch and real-time analysis, a NoSQL database such as DynamoDB can be used as the serving database.

▪ For developing web applications and displaying the analysis results, we can use a web framework such as Django.
