Introduction To Big Data Platform
Big data is a term that is used to describe data that is high volume, high velocity, and/or
high variety; requires new technologies and techniques to capture, store, and analyze it; and
is used to enhance decision making, provide insight and discovery, and support and optimize
processes.
For example, every customer e-mail, customer-service chat, and social media
comment may be captured, stored, and analyzed to better understand customers’
sentiments. Web browsing data may capture every mouse movement in order to
better understand customers’ shopping behaviors.
Radio frequency identification (RFID) tags may be placed on every single piece of
merchandise in order to assess the condition and location of every item.
Big data is usually characterized along three dimensions: volume, velocity, and variety.
Volume: Machine-generated data is produced in larger quantities than non-traditional
data.
a. Data Volume
b. 44x increase from 2009-2020
c. From 0.8 zettabytes (ZB) to 35 ZB
d. Data volume is increasing exponentially
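As a quick check of the figures above, the sketch below computes the growth factor implied by going from 0.8 ZB to 35 ZB, along with the equivalent compound annual growth rate. It assumes, purely for illustration, that growth was spread evenly over the 11 years from 2009 to 2020; the variable names are hypothetical.

```python
# Figures taken from the list above.
start_zb = 0.8    # data volume in 2009, in zettabytes
end_zb = 35.0     # data volume in 2020, in zettabytes
years = 2020 - 2009

# Overall growth factor (the "44x" above is this value rounded up).
growth_factor = end_zb / start_zb

# Assumption: constant year-over-year growth across the whole period.
annual_rate = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that even-growth assumption, the volume would have to rise by roughly two fifths every year, which is one way to read "increasing exponentially".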
Velocity: This refers to the speed of data processing.
Data is being generated fast and needs to be processed fast.
Online Data Analytics
Late decisions → missing opportunities
Examples
• E-Promotions: Based on your current location, your purchase history, and what you
like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body → any
abnormal measurements require immediate reaction
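The healthcare example can be sketched as a check that runs on each reading as it arrives, rather than after all the data has been collected. This is only an illustration: the safe range, the sample readings, and the function name `abnormal_readings` are assumptions, not part of any real monitoring system.

```python
from typing import Iterable, Iterator

# Hypothetical safe range for a heart-rate reading, in beats per minute.
LOW_BPM, HIGH_BPM = 50, 120

def abnormal_readings(stream: Iterable[int]) -> Iterator[tuple[int, int]]:
    """Yield (position, value) for each reading outside the safe range.

    Readings are checked one at a time as they arrive, so an alert can be
    raised without waiting for the full data set.
    """
    for position, bpm in enumerate(stream):
        if bpm < LOW_BPM or bpm > HIGH_BPM:
            yield position, bpm

readings = [72, 75, 130, 80, 45]  # made-up sample values
alerts = list(abnormal_readings(readings))
print(alerts)
```

Because the function is a generator, the same code works on a finite list or on a source that keeps producing values.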
Variety: This refers to the large variety of input data, which in turn generates a large amount of
data as output.
a. Various formats, types, and structures
b. Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
c. Static data vs. streaming data
Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures
voices of the flight crew, recordings of microphones and earphones, and the performance
information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and the
views posted by millions of people across the globe.
Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and
‘sell’ decisions that customers make on shares of different companies.
Power Grid Data: The power grid data holds information about the power consumed by a particular node
with respect to a base station.
Transport Data: Transport data includes model, capacity, distance and availability of a
vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
Thus, Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it
will be of three types.
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
• Commodity Hardware
What’s driving Big Data
Big data is critical to our lives, and it is emerging as one of the most important technologies in the
modern world. The following are just a few of its well-known benefits:
Using the information kept in social networks like Facebook, marketing agencies
are learning about the response to their campaigns, promotions, and other advertising
mediums.
Using information from social media, such as the preferences and product perceptions of their
consumers, product companies and retail organizations are planning their production.
Using data on the previous medical history of patients, hospitals are providing
better and quicker service.
Accumulation of raw data captured from various sources (e.g., discussion boards, emails,
exam logs, chat logs in e-learning systems) can be used to identify fruitful patterns and
relationships.
By itself, stored data does not generate business value, and this is true of traditional
databases, data warehouses, and the new technologies such as Hadoop for storing big
data. Once the data is appropriately stored, however, it can be analyzed, which can create
tremendous value. A variety of analysis technologies, approaches, and products have emerged
that are especially applicable to big data, such as in-memory analytics, in-database
analytics, and appliances.
Big data technologies are important in providing more accurate analysis, which may lead to more
concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology:
The first class, operational big data, includes systems like MongoDB that provide operational
capabilities for real-time, interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be
run inexpensively and efficiently. This makes operational big data workloads much easier
to manage, cheaper, and faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time
data with minimal coding and without the need for data scientists and additional
infrastructure.
The second class, analytical big data, includes systems like Massively Parallel Processing (MPP)
database systems and MapReduce that provide analytical capabilities for retrospective and
complex analysis that may touch most or all of the data.
MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL, and a system based on MapReduce can be scaled up
from a single server to thousands of high- and low-end machines.
These two classes of technology are complementary and frequently deployed together.
The major challenges associated with big data include:
o Capturing data
o Curation
o Storage
o Searching
o Sharing
o Transfer
o Analysis
o Presentation
To meet the above challenges, organizations normally take the help of enterprise servers.
[Figure: an analytic server or PC alongside Database 1, Database 2, Database 3, and Database 4]
An MPP database breaks the data into independent chunks, each with its own disk
and CPU.
Concurrent Processing
An MPP system allows the different sets of CPU and disk to run the process
concurrently.
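The idea of independent chunks processed concurrently can be sketched in a few lines. This is a toy illustration under stated assumptions, not how an MPP database is implemented: the four "units" are threads in one process, the helper names `split_into_chunks` and `process_chunk` are hypothetical, and a real system would give each unit its own disk and CPU.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(rows, n_chunks):
    """Split rows into n_chunks independent, roughly equal slices."""
    return [rows[i::n_chunks] for i in range(n_chunks)]

def process_chunk(chunk):
    """Work done by one unit on its own chunk (here, a partial sum)."""
    return sum(chunk)

rows = list(range(1, 101))           # stand-in for a table column
chunks = split_into_chunks(rows, 4)  # 4 hypothetical CPU/disk units

# Each chunk is handed to its own worker; partial results are then combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

total = sum(partial_sums)
print(partial_sums, total)
```

The point of the sketch is the shape of the computation: no worker needs another worker's chunk, so the pieces can run at the same time and only the small partial results have to be merged.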
An MPP system breaks the job into pieces.
[Figure: single-threaded process compared with parallel process]
MPP systems build in redundancy to make recovery easy.
o Query optimizer
Cloud Computing
Public Cloud
o The services and infrastructure are provided off-site over the internet
Private Cloud
What is MapReduce
A parallel programming framework¹
[Figure: the MapReduce library, with parallelization, fault tolerance, data distribution, and load balancing]
Map function
Processing a key/value pair to generate a set of intermediate key/value pairs
Reduce function
Merging all intermediate values associated with the same intermediate key
How MapReduce Works
Let’s assume there are 20 terabytes of data and 20 MapReduce server nodes for a project:
o Distribute a terabyte to each of the 20 nodes using a simple file copy process
o Submit two programs (Map, Reduce) to the scheduler
o The map program finds the data on disk and executes the logic it contains
o The results of the map step are then passed to the reduce process to summarize
and aggregate the final answers
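The steps above can be sketched on a single machine with a word count, a common teaching example for MapReduce. Everything here is a scaled-down assumption: three short lines stand in for the 20 terabytes, two list slices stand in for the server nodes, and `map_fn`, `reduce_fn`, and `run_job` are hypothetical names rather than the API of Hadoop or any other framework.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: turn one input record into intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return word, sum(counts)

def run_job(lines, n_nodes):
    # Distribute the input across the nodes (round-robin slices here).
    shards = [lines[i::n_nodes] for i in range(n_nodes)]

    # Each node runs the map program on its local shard.
    intermediate = []
    for node_id, shard in enumerate(shards):
        for offset, line in enumerate(shard):
            intermediate.extend(map_fn((node_id, offset), line))

    # Group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # The reduce program summarizes each key's values into the final answer.
    return dict(reduce_fn(key, values) for key, values in grouped.items())

lines = ["big data big value", "data moves fast", "big insight"]
result = run_job(lines, n_nodes=2)
print(result)
```

In a real cluster the scheduler would run the map calls on different machines and move the grouped values over the network; the sketch runs them in a loop only to keep the data flow visible.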
[Figure: the scheduler, the map function, and the map results]
Good for
Bad for
o NOT a database!
o No built-in security
Living in the era of digital technology and big data has made organizations dependent on the wealth of
information data can bring. You might have seen how reporting and analysis are used interchangeably,
especially in the way outsourcing companies market their services. While both areas are part of
web analytics (note that analytics is not the same as analysis), there’s a vast difference between them, and
it’s more than just spelling.
It’s important that we differentiate the two because some organizations might be selling
themselves short in one area and not reaping the benefits that web analytics can bring to the
table. The first core component of web analytics, reporting, is merely organizing data into
summaries. On the other hand, analysis is the process of inspecting, cleaning, transforming, and
modeling these summaries (reports) with the goal of highlighting useful information.
Simply put, reporting translates data into information while analysis turns information into
insights. Also, reporting should enable users to ask “What?” questions about the information,
whereas analysis should answer “Why?” and “What can we do about it?”
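The difference can be made concrete with a small sketch. The visit records, the channel names, and the idea of breaking the figure down by channel are all made up for illustration; real analysis would involve far more than a single breakdown.

```python
from collections import Counter

# Made-up visit records: (traffic channel, whether the visit ended in a purchase).
visits = [
    ("email", True), ("email", True), ("email", False),
    ("social", False), ("social", False), ("social", False),
    ("search", True), ("search", False),
]

# Reporting: summarize the raw data into a headline figure ("What?").
purchases = sum(1 for _, bought in visits if bought)
overall_rate = purchases / len(visits)

# Analysis (a first step): break the figure down by channel to look for a "Why?".
totals = Counter(channel for channel, _ in visits)
converted = Counter(channel for channel, bought in visits if bought)
rate_by_channel = {ch: converted[ch] / totals[ch] for ch in totals}
weakest = min(rate_by_channel, key=rate_by_channel.get)

print(f"overall conversion: {overall_rate:.0%}")
print(f"weakest channel: {weakest}")
```

The first figure is a report: it states what happened. The breakdown starts to suggest where to look for a cause and what might be done about it.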
1. Purpose
Reporting has helped companies monitor their data since before digital technology boomed. Various
organizations have been dependent on the information it brings to their business, as reporting
extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link data across channels,
provide comparisons, and make information easier to understand (think of dashboards, charts,
and graphs, which are reporting tools and not analysis reports), analysis interprets this
information and provides recommendations on actions.
2. Tasks
As reporting and analysis have a very fine line dividing them, it’s sometimes easy to mistake
a task labeled as analysis for the real thing when all it does is reporting. Hence, ensure that
your analytics team has a healthy balance of both.
Here’s a useful differentiator for telling whether what you’re doing is reporting or analysis:
Reporting includes building, configuring, consolidating, organizing, formatting, and
summarizing. This covers the tasks mentioned above, such as turning data into charts and graphs and
linking data across multiple channels.
Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big
data, predicting is possible as well.
3. Outputs
Reporting and analysis have a push and pull effect on their users through their outputs.
Reporting has a push approach, as it pushes information to users, and its outputs come in the form
of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst draws information to further probe and to
answer business questions. Outputs from such can be in the form of ad hoc responses and
analysis presentations. Analysis presentations are comprised of insights, recommended actions,
and a forecast of its impact on the company—all in a language that’s easy to understand at the
level of the user who’ll be reading and deciding on it.
This distinction is important for organizations to truly realize the value of data: a standard report is
not the same as meaningful analytics.
4. Delivery
Considering that reporting involves repetitive tasks, often with truckloads of data, automation
has been a lifesaver, especially now with big data. It’s not surprising that data entry services
are among the first things outsourced, since outsourcing companies are perceived as data reporting
experts.
Analysis requires a more custom approach, with human minds doing superior reasoning and
analytical thinking to extract insights, and technical skills to provide efficient steps towards
accomplishing a specific goal. This is why data analysts and scientists are in demand these days,
as organizations depend on them to come up with recommendations that help leaders or business
executives make decisions about their businesses.
5. Value
This isn’t about identifying which one brings more value, but rather understanding that both are
indispensable when looking at the big picture. Together, they should help businesses grow, expand, move
forward, and make more profit or increase their value.
This Path to Value diagram illustrates how data converts into value through reporting and analysis,
such that one cannot deliver value without the other.
Not to undermine the role of reporting in web analytics, but organizations need to understand
that reporting itself is just numbers. Without drawing insights and getting reports aligned with
your organization’s big picture, you can’t make decisions based on reports alone.
Data analysis is the most powerful tool to bring into your business. Employing the powers of
analysis can be comparable to finding gold in your reports, which allows your business to
increase profits and further develop.
Reporting Analysis
In my consulting career, I spent a great deal of time on reporting. I created:
Still, I rarely believed that even my most complicated reports really explained why something was
happening—or had already happened. To me, that’s the very essence of analytics: they go
beyond the mere “what” and “where”. Ideally, they explain why and suggest a potentially
measurable course of action.
Reading a static, standard report is not the same as doing true data exploration.
For instance, how many customers visited our site and never made a purchase? Let’s say that that
number is 60 percent. That’s great, but why did they not make a purchase? Potential answers
include:
In an era of Big Data, organizations of all sizes can theoretically explain more of the unknown
and, dare I say, even potentially predict a few things. To truly realize the value of data—be it
Big, Small, whatever—people need to rid themselves of the notion that a standard report is the
same as meaningful analytics, let alone true data discovery.