Introduction to Big Data Platform

Big data is a term that is used to describe data that is high volume, high velocity, and/or
high variety; requires new technologies and techniques to capture, store, and analyze it; and
is used to enhance decision making, provide insight and discovery, and support and optimize
processes.
• For example, every customer e-mail, customer-service chat, and social media comment may be captured, stored, and analyzed to better understand customers’ sentiments. Web browsing data may capture every mouse movement in order to better understand customers’ shopping behaviors.
• Radio frequency identification (RFID) tags may be placed on every single piece of merchandise in order to assess the condition and location of every item.
Big data is usually characterized along three dimensions: volume, velocity, and variety.
• Volume: Machine-generated data is produced in much larger quantities than traditional data.
a. A projected 44x increase in data volume from 2009 to 2020
b. From 0.8 zettabytes (ZB) to 35 ZB
c. Data volume is increasing exponentially
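
The growth figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the growth from 0.8 zettabytes in 2009 to 35 zettabytes in 2020 happened at a constant yearly rate (an assumption made for illustration; the notes only give the two endpoints).

```python
# Figures quoted in the notes: 0.8 ZB in 2009, 35 ZB projected for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009  # 11 yearly growth steps

# Overall multiple between the two endpoints (roughly the "44x" quoted above).
growth_factor = end_zb / start_zb

# Implied compound annual growth rate, ASSUMING a constant yearly rate.
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption, the data volume would grow by a bit more than 40 percent every year, which is what "increasing exponentially" means in practice.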
• Velocity: This refers to the speed at which data is generated and must be processed.
Data is being generated fast and needs to be processed fast.
Online data analytics
Late decisions → missed opportunities
Examples
  • E-promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitor your activities and body → any abnormal measurement requires an immediate reaction

• Variety: This refers to the wide variety of input data, which in turn generates a large amount of data as output.
a. Various formats, types, and structures
b. Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
c. Static data vs. streaming data

A single application can generate or collect many types of data.

BIG DATA SOURCES


Big data has many sources:
• Every mouse click on a web site can be captured in web log files and analyzed in order to better understand shoppers’ buying behaviors and to influence their shopping by dynamically recommending products.
• Social media sources such as Facebook and Twitter generate tremendous amounts of comments and tweets. This data can be captured and analyzed to understand, for example, what people think about new product introductions.
• Machines, such as smart meters, generate data. These meters continuously stream data about electricity, water, or gas consumption that can be shared with customers and combined with pricing plans to motivate customers to move some of their energy consumption, such as for washing clothes, to non-peak hours.
• There is a tremendous amount of geospatial (e.g., GPS) data, such as that created by cell phones, that can be used by applications like Foursquare to help you know the locations of friends and to receive offers from nearby stores and restaurants.
• Image, voice, and audio data can be analyzed for applications such as facial recognition in security systems.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.

• Black Box Data: The black box is a component of helicopters, airplanes, and jets. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
• Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
• Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
• Transport Data: Transport data includes the model, capacity, distance, and availability of a vehicle.
• Search Engine Data: Search engines retrieve lots of data from different databases.

Thus big data involves huge volume, high velocity, and an extensible variety of data. The data falls into three types:

o Structured data: relational data.
o Semi-structured data: XML data.
o Unstructured data: Word, PDF, text, media logs.
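
The difference between the three types shows up in how a program has to read them. The sketch below uses only the Python standard library; the sample records (names, amounts, and the review sentence) are made up purely for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table (hypothetical sample).
structured = io.StringIO("id,name,amount\n1,Asha,250\n2,Ravi,100\n")
rows = list(csv.DictReader(structured))
total = sum(int(r["amount"]) for r in rows)

# Semi-structured: tags describe the data, but fields may vary per record.
# The second order has no <amount> element, so a default is supplied.
semi = "<orders><order id='1'><amount>250</amount></order><order id='2'/></orders>"
root = ET.fromstring(semi)
amounts = [int(o.findtext("amount", default="0")) for o in root.iter("order")]

# Unstructured: free text with no schema; here we can only count words.
unstructured = "Delivery was late but the support team was helpful."
word_count = len(unstructured.split())

print(total, amounts, word_count)
```

Structured data can be aggregated directly, semi-structured data needs handling for missing fields, and unstructured text needs further processing before any field-level question can be asked of it.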

What made Big Data needed?


Key Computing Resources for Big Data

• Processing capability: CPU, processor, or node
• Memory
• Storage
• Network

Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

Why Big Data now?

• More data is being collected and stored
• Open source code
• Commodity hardware

What’s driving Big Data

Optimizations and predictive analytics

- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time

Ad-hoc querying and reporting

- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Benefits of Big Data

Big data is critical to modern life, and it is emerging as one of the most important technologies in the modern world. The following are just a few of its widely known benefits:

• Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising media.
• Using social media information such as the preferences and product perceptions of their consumers, product companies and retail organizations are planning their production.
• Using data on patients’ previous medical history, hospitals are providing better and quicker service.

BIG DATA ANALYTICS

• The accumulation of raw data captured from various sources (e.g., discussion boards, emails, exam logs, and chat logs in e-learning systems) can be used to identify fruitful patterns and relationships.
• By itself, stored data does not generate business value, and this is true of traditional databases, data warehouses, and the new technologies such as Hadoop for storing big data.
• Once the data is appropriately stored, however, it can be analyzed, which can create tremendous value. A variety of analysis technologies, approaches, and products have emerged that are especially applicable to big data, such as in-memory analytics, in-database analytics, and appliances.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to more
concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology:

o Operational Big Data

o Analytical Big Data

Operational Big Data

• These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
• NoSQL big data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement.
• Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

• These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
• MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
• These two classes of technology are complementary and frequently deployed together.

Big Data Challenges

The major challenges associated with big data are as follows:

o Capturing data
o Curation
o Storage
o Searching
o Sharing
o Transfer
o Analysis
o Presentation

To meet the above challenges, organizations normally rely on enterprise servers.

Evolution of Analytic scalability

• The amount of data organizations process continues to increase
• The old methods for handling data won’t work anymore
• Important technologies make taming the big data tidal wave possible: MPP, the cloud, grid computing, and MapReduce

The Convergence of the Analytic and Data Environment
Traditional Analytic Architecture
• We had to pull all data together into a separate analytics environment to do analysis
[Figure: Databases 1-4 each feed data into a separate analytic server or PC]

Modern In-Database Architecture
• The processing stays in the database where the data has been consolidated
[Figure: Databases 1-4 are consolidated in one environment; the user’s machine just submits the request]

Operational vs. Analytical Systems

                 Operational          Analytical
Latency          1 ms - 100 ms        1 min - 100 min
Concurrency      1,000 - 100,000      1 - 10
Access Pattern   Writes and reads     Reads
Queries          Selective            Unselective
Data Scope       Operational          Retrospective
End User         Customer             Data scientist
Technology       NoSQL                MapReduce, MPP database
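
The contrast between selective operational queries and unselective analytical reads can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the generated rows are hypothetical, and SQLite stands in for both kinds of system only to show the shape of the two queries; it is neither a NoSQL store nor an MPP database.

```python
import sqlite3

# In-memory database with a hypothetical orders table of 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(i, f"customer-{i % 100}", float(i % 50)) for i in range(1, 10001)],
)

# Operational style: a selective lookup that touches one row by its key.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (42,)
).fetchone()

# Analytical style: an unselective, read-only aggregate over every row.
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()

print(one_order, row_count, total_amount)
conn.close()
```

The first query is the kind a customer-facing application issues thousands of times concurrently; the second is the kind a data scientist runs occasionally and that has to read most or all of the data.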

What is an MPP Database?

• An MPP database breaks the data into independent chunks with independent disk and CPU
• Shared nothing: a single overloaded server is replaced by multiple lightly loaded servers
[Figure: a one-terabyte table split into ten 100-gigabyte chunks. A traditional database queries the one-terabyte table one row at a time; an MPP database runs 10 simultaneous 100-gigabyte queries]

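
The idea of splitting one large table into independent chunks that are queried at the same time can be sketched as follows. This is only an illustration of the shape of the approach, not how a real MPP database is built: the list of numbers stands in for the one-terabyte table, the helper names `split_into_chunks` and `query_chunk` are hypothetical, and Python threads stand in for independent nodes (they do not give true CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Partition rows into n_chunks roughly equal, independent slices."""
    size = -(-len(rows) // n_chunks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def query_chunk(chunk):
    """The per-node query: each worker scans only its own chunk."""
    return sum(amount for amount in chunk if amount > 500)


table = list(range(1, 1001))            # stand-in for the one-terabyte table
chunks = split_into_chunks(table, 10)   # stand-in for ten 100-gigabyte chunks

# Run the same query on every chunk at once, one worker per chunk.
with ThreadPoolExecutor(max_workers=10) as pool:
    partial_results = list(pool.map(query_chunk, chunks))

answer = sum(partial_results)  # combine the per-chunk answers
print(answer)
```

Because no chunk depends on another, each worker needs only its own data, which is the "shared nothing" property; the final answer is just the combination of the partial results.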
• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

Massively Parallel Processing

Concurrent Processing

• An MPP system allows the different sets of CPU and disk to run the process concurrently.
[Figure: an MPP system breaks the job into pieces; single-threaded process vs. parallel process]
• MPP systems build in redundancy to make recovery easy
• MPP systems have resource management tools
o Manage the CPU and disk space
o Query optimizer

Cloud Computing

• Mask the underlying infrastructure from the user
• Be elastic to scale on demand
• On a pay-per-use basis
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

Two Types of Cloud Environment

• Public Cloud
o The services and infrastructure are provided off-site over the internet
o Greatest level of efficiency in shared resources
o Less secure and more vulnerable than private clouds

• Private Cloud
o Infrastructure operated solely for a single organization
o Offers the same features as a public cloud
o Offers the greatest level of security and control
o Requires purchasing and owning the entire cloud infrastructure

What is MapReduce

• A parallel programming framework
[Figure: the MapReduce library provides parallelization, fault tolerance, data distribution, and load balancing]

Map function
Processes a key/value pair to generate a set of intermediate key/value pairs

Reduce function
Merges all intermediate values associated with the same intermediate key

How MapReduce Works

• Let’s assume there are 20 terabytes of data and 20 MapReduce server nodes for a project

o Distribute a terabyte to each of the 20 nodes using a simple file copy process
o Submit two programs (Map and Reduce) to the scheduler
o The map program finds the data on disk and executes the logic it contains
o The results of the map step are then passed to the reduce process to summarize and aggregate the final answers
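
The map and reduce steps described above can be illustrated with a single-machine word count. This is a minimal sketch and not a real MapReduce framework: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, the two short documents are made-up input, and the grouping that a real system performs across many nodes is done here with one in-memory dictionary.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group intermediate values by key. A real framework does this across
    # nodes; here it is a single in-memory dict.
    grouped = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Apply the reduce function once per intermediate key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())


documents = [("doc1", "big data big insight"), ("doc2", "data beats opinion")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only the two small functions; distribution, scheduling, and fault tolerance are what the framework would add around them.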

[Figure: scheduler, map function, map, results]
• Good for
o Lots of input, intermediate, and output data
o Batch-oriented datasets (ETL: Extract, Transform, Load)
o Cheap to get up and running because it runs on commodity hardware

• Bad for
o Fast response time
o Large amounts of shared data
o CPU-intensive operations (as opposed to data-intensive)
o NOT a database!
  - No built-in security
  - No indexing, no query or process optimizer
  - No knowledge of other data that exists

Big Data Technology


Reporting and analysis

Living in the era of digital technology and big data has made organizations dependent on the wealth of information that data can bring. You might have seen reporting and analysis used interchangeably, especially in the way outsourcing companies market their services. While both areas are part of web analytics (note that analytics is not the same as analysis), there’s a vast difference between them, and it’s more than just spelling.

It’s important that we differentiate the two, because some organizations might be selling themselves short in one area and not reaping the benefits that web analytics can bring to the table. The first core component of web analytics, reporting, is merely organizing data into summaries. Analysis, on the other hand, is the process of inspecting, cleaning, transforming, and modeling these summaries (reports) with the goal of highlighting useful information.

Simply put, reporting translates data into information, while analysis turns information into insights. Also, reporting should enable users to ask “What?” questions about the information, whereas analysis should answer “Why?” and “What can we do about it?”

Here are five differences between reporting and analysis:

1. Purpose

Reporting has helped companies monitor their data since before digital technology boomed. Various organizations have depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link data across channels, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.

2. Tasks

Because only a very fine line divides reporting and analysis, it’s easy to mistake a task labeled as analysis for one that is really just reporting. Hence, ensure that your analytics team has a healthy balance of both.

Here’s a useful way to tell whether what you’re doing is reporting or analysis: Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It’s very similar to the activities mentioned above, such as turning data into charts and graphs and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.

3. Outputs

Reporting and analysis have push and pull effects on their users through their outputs. Reporting takes a push approach: it pushes information to users, and its outputs come in the form of canned reports, dashboards, and alerts.

Analysis takes a pull approach, where a data analyst draws out information to probe further and to answer business questions. Its outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations comprise insights, recommended actions, and a forecast of their impact on the company, all in language that’s easy to understand at the level of the user who’ll be reading and deciding on it.

This matters for organizations that want to truly realize the value of data: a standard report is not the same as meaningful analytics.

4. Delivery

Because reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It’s not surprising that data entry services are the first thing to be outsourced, since outsourcing companies are perceived as data reporting experts.

Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders and business executives make decisions about their businesses.

5. Value

This isn’t about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Together, they should help businesses grow, expand, move forward, and make more profit or increase their value.

This Path to Value diagram illustrates how data is converted into value through reporting and analysis, such that neither can achieve it without the other.

Data — Reporting — Analysis — Decision-making — Action — VALUE


Data alone is useless, and action without data is baseless. Both reporting and analysis are vital to
bringing value to your data and operations.

Reporting and Analysis are Valuable

Not to undermine the role of reporting in web analytics, but organizations need to understand
that reporting itself is just numbers. Without drawing insights and getting reports aligned with
your organization’s big picture, you can’t make decisions based on reports alone.

Data analysis is the most powerful tool to bring into your business. Employing the powers of
analysis can be comparable to finding gold in your reports, which allows your business to
increase profits and further develop.


Reporting                      Analysis
Provides data                  Provides answers
Provides what is asked for     Provides what is needed
Is typically standardized      Is typically customized
Does not involve a person      Involves a person
Is fairly inflexible           Is extremely flexible

In my consulting career, I spent a great deal of time on reporting. I created:

• Several thousand Crystal Reports
• More Microsoft Access databases, SQL statements, and ad hoc queries than I could count
• Many, many dashboards

Still, I rarely believed that even my most complicated reports really explained why something was happening, or had already happened. To me, that’s the very essence of analytics: they go beyond the mere “what” and “where”. Ideally, they explain why and suggest a potentially measurable course of action.

Reading a static, standard report is not the same as doing true data exploration. For instance, how many customers visited our site and never made a purchase? Let’s say that number is 60 percent. That’s great, but why did they not make a purchase? Potential answers include:

• the product’s price was too high
• the site’s navigation was confusing
• they became distracted
• their computers crashed
• a combination of a few different things

In an era of Big Data, organizations of all sizes can theoretically explain more of the unknown and, dare I say, even potentially predict a few things. To truly realize the value of data (be it Big, Small, whatever), people need to rid themselves of the notion that a standard report is the same as meaningful analytics, let alone true data discovery.
