Cloud computing
1. Unstructured data
▪ Data that does not conform to a data model, or is not in a form
that a computer program can use easily.
▪ 80-90% of data is in this form
▪ PPT, images, videos, body of an email, etc.
OLTP Systems
▪ Input / Update / Delete
▪ Security
▪ Scalability
▪ Transaction Processing
▪ ACID
Atomicity - A transaction happens in its entirety or not at all
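The atomicity property can be illustrated with a small sketch. It assumes an in-memory SQLite database and a hypothetical `accounts` table with a non-negative balance constraint; none of these names come from the slides. A transfer either applies both updates or neither.

```python
import sqlite3

# Hypothetical two-account ledger held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # CHECK fails, credit to bob is undone
final = balances(conn)
```

In the second call the credit to `bob` has already run when the debit violates the constraint, and the rollback undoes it, so no partial transfer is ever visible.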
Semi-structured data
Features
▪ Inconsistent structure
▪ Self-describing (label/value pairs)
▪ Schema information is often blended with data values
▪ Data objects may have different attributes not known beforehand
Sources
▪ XML (Extensible Markup Language)
▪ JSON (JavaScript Object Notation) - used to transmit data between a
server and a web application
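A minimal sketch of these features, using Python's standard `json` module. The records and field names are made up for illustration: each object carries its own labels, and the two objects do not share the same attributes.

```python
import json

# Two hypothetical records: each carries its own labels, and the attribute sets differ.
payload = '''
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]}
]
'''

records = json.loads(payload)

# No schema is declared up front, so the attribute names are discovered from the data.
all_keys = sorted({key for record in records for key in record})

# Missing attributes are handled per record rather than assumed to exist.
phones = [record.get("phone") for record in records]
```

Because the labels travel with the values, a consumer has to discover the attributes at read time and tolerate their absence.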
Unstructured data
This is data that does not conform to a data model or is not in
a form that can be used easily by a computer program.
Examples
▪ Images
▪ Free-form text
▪ Audio
▪ Videos
▪ Body of an email
▪ Text messages
▪ Chats
▪ Social media data
▪ Word documents
Issues with terminology – Unstructured Data
Data with some structure may still be labeled unstructured if the
structure doesn’t help with the processing task at hand.
Data Mining
First, we deal with large data sets.
Second, we use methods at the intersection of AI, machine learning,
statistics and databases to unearth consistent patterns in large data sets
and systematic relationships between variables.
Association rule mining - Market basket analysis: what goes with what
(for example, bread and cheese)
Regression analysis - Predicts the relationship between two variables, the
dependent and the independent variable; the value to be predicted is the
dependent variable
Collaborative filtering - Predicts a user's preferences based on the
preferences of a group of users
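For the market basket case, the two usual measures of an association rule are support and confidence. The sketch below computes both for the rule "bread implies cheese" over a handful of made-up baskets; the data and function names are illustrative only.

```python
# Hypothetical market baskets; each set is one customer transaction.
baskets = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "cheese"}, baskets)          # 2 of the 4 baskets
rule_confidence = confidence({"bread"}, {"cheese"}, baskets)  # 2 of the 3 bread baskets
```

Rules with high support and high confidence are the "what goes with what" candidates a retailer would look at first.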
What is Big Data?
Specialized tools and frameworks are required for big data analysis when:
(1) the volume of data involved is so large that it is difficult to store, process
and analyze data on a single machine
(2) the velocity of data is very high and the data needs to be analyzed in
real-time,
(4) various types of analytics need to be performed to extract value from the
data such as descriptive, diagnostic, predictive and prescriptive analytics
Big data analytics is enabled by several technologies, such as cloud computing,
distributed and parallel processing frameworks, non-relational databases
and in-memory computing.
Some examples of big data:
• Data generated by social networks, including text, images, audio and video
data
• Click-stream data generated by web applications such as e-Commerce, used to
analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and
energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock markets data
Characteristics of Big Data
Volume
Big data is a form of data whose volume is so large that it would not fit on a
single machine; therefore, specialized tools and frameworks are required to
store, process and analyze such data.
Bits-Bytes-Kilobytes-Megabytes-Gigabytes-Terabytes-Petabytes-Exabytes-Zettabytes-Yottabytes
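The unit ladder can be made concrete with a short conversion sketch. It assumes decimal (SI) prefixes, where each step is a factor of 1000; binary prefixes (factor 1024) are also in common use. The function name is illustrative.

```python
# Decimal (SI) prefixes are assumed here: each step up is a factor of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value at 1 or more."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

one_petabyte = humanize(10**15)
bits_in_a_kilobyte = 8 * 1000  # 8 bits per byte
```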
A Mountain of Data
▪ Multitude of sources
▪ XLS, DOC, YouTube videos, chat conversations, customer feedback, CCTV
coverage
Velocity
Data generated by certain sources can arrive at very high velocities, for example,
social media data or sensor data.
Velocity is another important characteristic of big data and the primary reason
for the exponential growth of data.
High velocity of data causes the volume of accumulated data to become very
large in a short span of time.
Some applications can have strict deadlines for data analysis (such as trading or
online fraud detection) and the data needs to be analyzed in real-time.
Variety
Big data systems need to be flexible enough to handle a variety of data, such as
structured, semi-structured and unstructured data.
Veracity
To extract value from the data, the data needs to be cleaned to remove
noise.
Data-driven applications can reap the benefits of big data only when the data
is meaningful and accurate.
Therefore, cleansing of data is important so that incorrect and faulty data can
be filtered out.
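A minimal cleansing sketch, assuming a temperature sensor whose missing samples appear as `None` and whose faults show up as values outside a plausible range. The readings, the error code and the range are all made up for illustration.

```python
# Hypothetical temperature readings in degrees Celsius from one sensor.
# None marks a dropped sample; -999.0 is an assumed sensor error code.
raw_readings = [21.4, 21.6, None, -999.0, 22.1, 140.0, 21.9]

# Assumed plausible range for this sensor; a real system would take it from the spec.
VALID_RANGE = (-40.0, 60.0)

def clean(readings, valid_range=VALID_RANGE):
    """Drop missing values and values outside the plausible range."""
    low, high = valid_range
    return [r for r in readings if r is not None and low <= r <= high]

cleaned = clean(raw_readings)
rejected = len(raw_readings) - len(cleaned)
```

Counting the rejected samples alongside the cleaned data is a cheap way to track how noisy a source is over time.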
Value
Value of data refers to the usefulness of data for the intended purpose.
The end goal of any big data analytics system is to extract value from
the data.
The value of the data is also related to the veracity or accuracy of the
data.
For some applications value also depends on how fast we are able to
process the data.
Analytics is a broad term that encompasses the processes,
technologies, frameworks and algorithms to extract meaningful
insights from data.
(2) to find patterns in the data (for example, finding the top 10
coldest days in the year, finding which pages are visited the most
on a particular website, or finding the most searched celebrity in a
particular year),
These statistics help in describing patterns in the data and present the
data in a summarized form.
For example, computing the total number of likes for a particular post,
computing the average monthly rainfall or finding the average number
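Descriptive statistics of this kind can be sketched in a few lines. The temperature data below is made up, and the helper name is illustrative; it ranks the coldest days and computes an average.

```python
from statistics import mean

# Hypothetical daily minimum temperatures (date -> degrees Celsius).
daily_min_temp = {
    "2023-01-03": -4.0,
    "2023-01-04": -7.5,
    "2023-01-05": -2.0,
    "2023-02-10": -6.0,
    "2023-07-01": 18.0,
}

def coldest_days(temps, n):
    """Return the n dates with the lowest temperature, coldest first."""
    return sorted(temps, key=temps.get)[:n]

top3 = coldest_days(daily_min_temp, 3)
average_min = mean(daily_min_temp.values())
```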
Diagnostic Analytics
Predictive Analytics
For example, predictive analytics can be used for predicting when a fault
will occur in a machine, predicting whether a tumor is benign or
malignant, predicting the occurrence of natural emergencies (events
such as forest fires or river floods) or forecasting pollution levels.
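As one simple illustration of forecasting, the sketch below fits a straight line to a few made-up pollution readings with ordinary least squares and extrapolates it. A linear trend is only an assumption chosen for brevity; real predictive models are usually richer.

```python
# Hypothetical hourly pollution index readings.
hours = [0, 1, 2, 3, 4]
levels = [40.0, 42.0, 44.0, 46.0, 48.0]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(hours, levels)
forecast_hour_6 = slope * 6 + intercept  # extrapolate beyond the observed hours
```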
Web
• Web Analytics: Web analytics deals with collection and analysis of data on
the user visits on websites and cloud applications.
Analysis of this data can give insights about the user engagement and tracking
the performance of online advertisement campaigns.
The second approach, called page tagging, uses JavaScript code embedded
in the web page.
The benefit of the page tagging approach is that it facilitates real-time data
collection and analysis.
This approach allows third party services, which do not have access to the web
server (serving the website) to collect and process the data.
These specialized analytics service providers (such as Google Analytics) offer
advanced analytics and summarized reports on user sessions, page visits, top
entry and exit pages, bounce rate and most visited pages.
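The report metrics listed above can be derived from raw session data. The sketch below assumes each session is simply the ordered list of pages one visitor viewed, and treats a bounce as a single-page session; the data and names are illustrative and do not reflect any provider's API.

```python
from collections import Counter

# Hypothetical sessions as collected by a page-tagging script:
# each session is the ordered list of pages one visitor viewed.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home"],
    ["/blog/post-1"],
    ["/products", "/home"],
]

def bounce_rate(sessions):
    """Share of sessions that viewed exactly one page."""
    return sum(len(s) == 1 for s in sessions) / len(sessions)

page_visits = Counter(page for s in sessions for page in s)
most_visited = page_visits.most_common(1)[0]   # (page, visit count)
entry_pages = Counter(s[0] for s in sessions)  # first page of each session
exit_pages = Counter(s[-1] for s in sessions)  # last page of each session
rate = bounce_rate(sessions)
```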
Performance Monitoring: Multi-tier web and cloud applications such as
• e-Commerce,
• Business-to-Business,
• Health care, Banking and Financial,
• Retail and Social Networking applications
• Load tests (which evaluate the performance of the system with multiple users
and workload levels)
• Stress tests, etc.
Big data systems can be used to analyze the data generated by such
tests, to predict application performance under heavy workloads
and identify bottlenecks in the system so that failures can be
prevented.
• Ad Targeting & Analytics:
Search and display advertisements are the two most widely used approaches
for Internet advertising.
Advertisers can create ads using the advertising networks provided by the
search engines or social media networks.
These ads are set up for specific keywords which are related to the product or
service being advertised.
Users searching for these keywords are shown ads along with the search
results.
Display advertising is another form of Internet advertising, in which ads are
displayed within websites, videos and mobile applications that participate in
the advertising network.
The ad-network matches these ads against the content on the website, video or
mobile application and places the ads.
Advertising networks use big data systems for matching and placing
advertisements and generating advertisement statistics reports.
• Advertisers can use big data tools for tracking the performance of
advertisements, optimizing the bids for pay-per-click advertising, and tracking
which keywords link the most to the advertising landing page.
Such applications can leverage big data systems for recommending new
content to the users based on the user preferences and interests.
Financial
• Credit Risk Modeling: Banking and financial institutions use credit risk
modeling to score credit applications and predict whether a borrower will
default in the future.
Credit risk models are created from customer data that includes credit
scores obtained from credit bureaus, credit history, account balance data,
account transactions data and spending patterns of the customer.
Big data systems can help in computing credit risk scores of a large number of
customers on a regular basis.
Real-time big data analytics frameworks can help in analyzing data from
disparate sources and label transactions in real-time
• Healthcare
Though the primary use of EHRs is to maintain all medical data for an individual
patient and to provide efficient access to the stored data at the point of care,
EHRs can be the source for valuable aggregated information about overall
patient populations.
Big data systems can be used for data collection from different stakeholders
(patients, doctors, payers, physicians, specialists, etc) and disparate data sources.
Big data analytics systems allow massive scale clinical data analytics and
facilitate development of more efficient healthcare applications, improve the
accuracy of predictions and help in timely decision making
• Internet of Things
Internet of Things (IoT) refers to things that have unique identities and are
connected to the Internet.
The "Things" in IoT are the devices which can perform remote sensing, actuating
and monitoring.
IoT devices can exchange data with other connected devices and applications
(directly or indirectly), or collect data from other devices and process the data.
IoT systems can leverage big data technologies for storage and analysis of data.
IoT applications that can benefit from big data systems:
• Intrusion Detection
• Smart Parking
• Smart Roads
• Structural Health Monitoring
• Smart Irrigation
Intrusion Detection: Intrusion detection systems use security cameras and
sensors (such as PIR sensors and door sensors) to detect intrusions and raise
alerts.
Smart Parking: Smart parking makes the search for a parking space easier and
more convenient for drivers. Smart parking is powered by IoT systems that detect
the number of empty parking slots and send the information over the
Internet to smart parking application back-ends.
Smart Roads: Smart roads equipped with sensors can provide information on
driving conditions, travel time estimates and alerts in case of poor driving
conditions, traffic congestion and accidents.
Structural Health Monitoring: Structural Health Monitoring systems use a
network of sensors to monitor the vibration levels in the structures such as
bridges and buildings. The data collected from these sensors is analyzed to
assess the health of the structures
Smart Irrigation: Smart irrigation systems can improve crop yields while
saving water. Smart irrigation systems use IoT devices with soil moisture
sensors to determine the amount of moisture in the soil and release the flow
of water when the moisture level falls below a predefined threshold.
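The threshold rule for smart irrigation is simple enough to sketch directly. The threshold value, field names and readings below are assumptions for illustration; an appropriate threshold depends on the crop and soil.

```python
# Assumed threshold as volumetric soil moisture in percent; the right value
# depends on the crop and soil and is not given in the source.
MOISTURE_THRESHOLD = 30.0

def valve_should_open(moisture_percent, threshold=MOISTURE_THRESHOLD):
    """Release water only when soil moisture falls below the threshold."""
    return moisture_percent < threshold

# Hypothetical readings from three field sensors.
readings = {"field-a": 42.5, "field-b": 27.0, "field-c": 30.0}
to_irrigate = [name for name, m in readings.items() if valve_should_open(m)]
```

A reading exactly at the threshold does not trigger watering here, since the rule is "below" the threshold.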
Environment
Environment monitoring systems generate high velocity and high volume data.
Accurate and timely analysis of such data can help in understanding the current
status of the environment and also predicting environmental trends.
Air Pollution Monitoring: Air pollution monitoring systems can monitor the emission
of harmful gases (CO2, CO, NO, or NO2) by factories and automobiles using
gaseous and meteorological sensors.
Noise Pollution Monitoring: Due to growing urban development, noise levels in
cities have increased and even become alarmingly high in some cities
Urban noise maps can help the policy makers in urban planning and making
policies to control noise levels near residential areas, schools and parks
Forest Fire Detection: There can be different causes of forest fires, including
lightning, human negligence, volcanic eruptions and sparks from rock falls.
Forest fire detection systems use a number of monitoring nodes deployed at
different locations in a forest.
River Floods Detection: River flood monitoring systems use a number of sensor
nodes that monitor the water level (using ultrasonic sensors) and flow rate
(using flow velocity sensors).
Big data systems can be used to collect and analyze data from a number of
such sensor nodes and raise alerts when a rapid increase in water level and
flow rate is detected.
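The alerting rule can be sketched as a comparison between consecutive samples from one sensor node. The thresholds, field names and readings are assumptions for illustration; real values would depend on the river being monitored.

```python
# Assumed alert thresholds on the change between consecutive samples;
# real values would come from the river's hydrological profile.
MAX_LEVEL_RISE_M = 0.5   # metres per sampling interval
MAX_FLOW_RISE_MS = 0.8   # metres/second per sampling interval

def flood_alerts(samples):
    """Yield the timestamp of every sample where both water level and
    flow rate rose faster than the thresholds since the previous sample."""
    for prev, curr in zip(samples, samples[1:]):
        level_rise = curr["level_m"] - prev["level_m"]
        flow_rise = curr["flow_ms"] - prev["flow_ms"]
        if level_rise > MAX_LEVEL_RISE_M and flow_rise > MAX_FLOW_RISE_MS:
            yield curr["time"]

# Hypothetical readings from one sensor node.
samples = [
    {"time": "10:00", "level_m": 2.0, "flow_ms": 1.0},
    {"time": "10:10", "level_m": 2.1, "flow_ms": 1.1},
    {"time": "10:20", "level_m": 3.0, "flow_ms": 2.5},
    {"time": "10:30", "level_m": 3.2, "flow_ms": 2.6},
]
alerts = list(flood_alerts(samples))
```

Only the jump between 10:10 and 10:20 exceeds both thresholds, so that is the single alert raised.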
Logistics & Transportation
• Real-time Fleet Tracking: Vehicle fleet tracking systems use GPS
technology to track the locations of the vehicles in real-time.
Big data systems can be used to aggregate and analyze vehicle location
and route data for detecting bottlenecks in the supply chain, such as
traffic congestion on routes, and for assigning and generating alternative
routes.
Store Layout Optimization: Big data systems can help in analyzing the data on
customer shopping patterns and customer feedback to optimize the store
layouts
Visualizations
The front end application for visualizing the analysis results would
be dynamic and interactive.
Mapping Analysis Flow to Big Data Stack
Now that we have the analytics flow for the application, let us map the selections at each step of the flow to the big data
stack
The figure shows a subset of the components of the big data stack based
on the analytics flow.