BIGDATA ANALYTICS
UNIT-I
INTRODUCTION TO BIGDATA
1. INTRODUCTION TO BIGDATA PLATFORM
2. CHALLENGES OF CONVENTIONAL SYSTEMS
3. INTELLIGENT DATA ANALYSIS
4. NATURE OF DATA
5. ANALYTIC PROCESSES AND TOOLS
6. ANALYSIS Vs REPORTING
7. MODERN DATA ANALYTICS TOOLS
Technical Terms
Definition:
Big Data:
Big Data consists of very large volumes of heterogeneous data that are being generated,
often at high speed.
It cannot be managed and processed using traditional data management tools and
applications at hand.
Characteristics of Big Data:
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value or variability
6. Complexity
1. Volume
Volume refers to the huge amount of data, i.e., the size of the data we are working
with.
This data is spread across different places, in different formats, in large volumes
ranging from Gigabytes to Terabytes, Petabytes, up to Yottabytes and even more.
The data is not only generated by humans; a large amount of data is generated by
machines, and it surpasses human-generated data.
Example: In the year 2016, the estimated global mobile traffic was 6.2 Exabytes (6.2
billion GB) per month. It was also estimated that by the year 2020 there would be
almost 40,000 Exabytes of data.
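To make these scales concrete, here is a minimal Python sketch (decimal SI units
assumed) that checks the figures quoted above:

    GB = 10**9    # gigabyte, in bytes
    EB = 10**18   # exabyte
    YB = 10**24   # yottabyte

    monthly_mobile_traffic = 6.2 * EB      # the 2016 estimate quoted above
    print(monthly_mobile_traffic / GB)     # 6200000000.0, i.e. 6.2 billion GB
    print(40000 * EB / YB)                 # 0.04 -- 40,000 EB is 4% of a yottabyte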
2. Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, data flows in at high velocity from sources like machines, networks,
social media, mobile phones, etc.
3. Variety:
It refers to the nature of data, that is, structured, semi-structured and unstructured data.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources that are both inside and
outside of an enterprise.
It can be structured, semi-structured and unstructured.
o Structured data: This is basically organized data. It generally refers to data
that has a defined length and format.
o Semi-structured data: This is basically semi-organized data. It is generally a
form of data that does not conform to the formal structure of data. Log files
are examples of this type of data.
o Unstructured data: This basically refers to unorganized data. It generally
refers to data that doesn't fit neatly into the traditional row and column
structure of a relational database. Texts, pictures, videos, etc. are examples
of unstructured data, which can't be stored in the form of rows and columns.
Variety may include text, web logs, sensor data, legacy documents, images, audio and
video, as illustrated below.
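The following minimal Python sketch (with made-up sample values) illustrates the three
forms of data side by side:

    import json, re

    # Structured: organized data with a defined length and format, like a table row.
    structured_row = {"id": 101, "name": "Asha", "amount": 2500.00}

    # Semi-structured: self-describing but not rigidly organized, e.g. a log line.
    log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36] "GET /index.html HTTP/1.1" 200'
    ip, ts, request, status = re.match(
        r'(\S+) \S+ \S+ \[(.*?)\] "(.*?)" (\d{3})', log_line).groups()

    # JSON is another common semi-structured format.
    semi_structured = json.loads('{"user": "asha", "tags": ["a", "b"]}')

    # Unstructured: free text (or images, audio, video) with no rows and columns.
    unstructured = "Great product, but delivery took two weeks. Still recommended!"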
4. Veracity:
Veracity refers to the quality of the data being captured, which can vary greatly.
Accuracy of analysis depends on the veracity of the source data.
5. Value or variability:
Value refers to the usefulness of the data, i.e., the worth that can be extracted
from it.
6. Complexity:
Data management becomes a very complex process, especially when large volumes of
data come from multiple sources.
These data need to be linked, connected and correlated in order to grasp the
information that they are supposed to convey. This is termed the “complexity” of
big data.
Sources of Big Data:
2. Transactional Data:
Every enterprise has some kind of applications which involve performing different
kinds of transactions, like web applications, mobile applications, CRM systems and
many more.
To support the transactions in these applications, there are usually one or more
relational databases as a backend infrastructure.
This is mostly structured data and is referred to as Transactional Data.
3. Social Media:
A large amount of data is generated on social media platforms such as Facebook,
Twitter and YouTube, in the form of posts, comments, likes, shares, images and
videos.
4. Activity Generated:
There is a large amount of data being generated by machines which surpasses
the data volume generated by humans.
These include data from medical devices, sensor data, surveillance videos,
satellites, cell phone towers, industrial machinery, and other data generated mostly
by machines.
5. Public data:
Data published by governments, research data published by research institutes, data
from weather and meteorological departments, and Wikipedia are available to the
public.
6. Archives:
Archives include scanned documents, scanned copies of agreements, records of
ex-employees/completed projects, and banking transactions older than what the
compliance regulations require.
This type of data, which is less frequently accessed, is referred to as Archive
Data.
Organizations archive a lot of data which is either not required anymore or is
very rarely required.
Definition:
Big Data Analytics:
Big Data Analytics is the process of examining large data sets containing a variety
of data types, i.e. Big Data, to uncover hidden patterns, unknown correlations,
market trends, customer preferences and other useful business information.
The analytical findings can lead to more effective marketing, new revenue
opportunities, better customer service, improved operational efficiency, competitive
advantages and other business benefits.
2. CHALLENGES OF CONVENTIONAL SYSTEMS
The main challenges are:
1. Data challenges
2. Process challenges
3. Management challenges
1. Data Challenges:
Volume:
The volume of data, especially machine-generated data, is exploding, and it is
growing faster every year as new sources of data emerge.
The challenge is how to deal with the large size of data.
Ex: According to the latest estimates, 402.74 million terabytes of data are created each day.
Variety:
More than 80% of today's information is unstructured, and it is typically too big to
manage effectively. What does this mean?
A lot of data is unstructured, or has a complex structure that is hard to represent
in rows and columns.
Organizations want to be able to combine all this data and analyse it together in
new ways.
Ex: More than one customer in different industries runs applications that combine
geospatial vessel-location data with weather and news data to make real-time,
mission-critical decisions.
Data comes from sensors, smart devices and social collaboration technologies.
Data is not only structured, but also raw, semi-structured and unstructured data
from web pages, web log files, search indexes, e-mails, documents, sensor data, etc.
Use cases such as A/B testing, sessionization, bot detection and path analysis all
require powerful analytics on many petabytes of semi-structured web data.
The challenge is how to handle multiple types, sources and formats?
Velocity:
How to react to the flood of information in the time required by the application?
Veracity:
If data is of high quality in one country and poor in another, does the aid response
skew 'unfairly' toward the well-surveyed country, or toward the educated guesses
being made for the poorly surveyed one?
Several challenges:
1. How can we cope with uncertainty, imprecision, missing values, misstatements or
untruths?
2. How good is the data? How broad is the coverage?
3. How fine is the sampling resolution? How timely are the readings?
4. How well understood are the sampling biases?
5. Is there data available, at all?
Data comprehensiveness:
Are there areas without coverage? What are the implications?
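A minimal pandas sketch (with hypothetical survey readings) showing how some of these
veracity questions -- missing values, coverage, sampling resolution -- can be checked
in practice:

    import numpy as np
    import pandas as pd

    # Hypothetical readings: country A is well surveyed, country B is not.
    df = pd.DataFrame({
        "country": ["A", "A", "A", "A", "B", "B"],
        "reading": [1.2, 1.4, np.nan, 1.3, np.nan, 2.1],
    })

    # How good is the data? Missing-value rate per country.
    print(df.groupby("country")["reading"].apply(lambda s: s.isna().mean()))

    # How broad is the coverage? Sample counts expose sampling-resolution gaps.
    print(df["country"].value_counts())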
Scalability:
Techniques like social graph analysis, for instance leveraging the influencers in a
social network to create a better user experience, are hard problems to solve at
scale.
All of these problems combined create a perfect storm of challenges and
opportunities to build faster, cheaper and better solutions for big data analytics
than traditional approaches can offer.
2. Process Challenges:
Capturing data.
Aligning data from different sources.
Transforming the data into a form suitable for analysis.
Modeling it, whether mathematically or through some form of simulation.
Understanding the output, visualizing and sharing the results, and working out how
to display complex analytics on an iPhone or other mobile device.
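A minimal end-to-end sketch of these process steps in Python/pandas (the file names
and columns are hypothetical):

    import pandas as pd

    # 1. Capture: read raw data from two different sources.
    orders = pd.read_csv("orders.csv")           # columns: order_id, cust, amount
    customers = pd.read_json("customers.json")   # columns: cust, region

    # 2. Align: join the sources on a shared key.
    data = orders.merge(customers, on="cust", how="left")

    # 3. Transform: clean the data into a form suitable for analysis.
    data["amount"] = data["amount"].fillna(0)

    # 4. Model: a simple aggregate standing in for a mathematical model.
    summary = data.groupby("region")["amount"].agg(["count", "mean", "sum"])

    # 5. Understand and share: export the output for visualization or a dashboard.
    summary.to_csv("region_summary.csv")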
3. Management challenges:
Main challenges are:
Data privacy
Security
Governance
Ethical
The core challenge is ensuring that data are used correctly.
Security is another very important challenge with Big Data. It has sensitive,
conceptual, technical as well as legal significance.
Most organizations are unable to maintain regular checks due to the large amounts of
data generated. However, security checks and observation should be performed in real
time, because that is most beneficial.
Some information about a person, when combined with external large data sets, may
reveal facts about that person which are private and which he might not want anyone
to know.
Some organizations collect information about people in order to add value to their
business, by deriving insights into people's lives that they themselves are unaware
of.
3. INTELLIGENT DATA ANALYSIS
Data analysis is a process that combines extracting data from a data set with
analyzing, classifying, organizing and reasoning about it, and so on.
It is a process in which the analyst moves laterally and recursively between three
modes: data preparation, pattern/rule finding, and result validation and
interpretation.
Importance of IDA:
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence
and information science.
Intelligent data analysis discloses hidden facts that are not known previously and
provides potentially important information or facts from large quantities of data
(White, 2008).
It also helps in making a decision.
Based mainly on statistics, machine learning, artificial intelligence, pattern
recognition, and database and visualization technology, IDA helps to obtain useful
information, necessary data and interesting models from large quantities of data
available online, in order to make the right choices.
Intelligent data analysis helps to solve a problem that is already solved as a matter of
routine. If the data is collected for the past cases together with the result that was
finally achieved, such data can be used to revise and optimize the presently used
strategy to arrive at a conclusion.
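A minimal sketch of this idea with scikit-learn (the past-case features and outcomes
are invented for illustration):

    from sklearn.tree import DecisionTreeClassifier

    # Past cases: features of earlier problems plus the result finally achieved.
    past_cases = [[25, 0], [47, 1], [33, 1], [52, 0], [29, 0], [61, 1]]
    outcomes   = [0, 1, 1, 1, 0, 1]     # e.g. 1 = the chosen strategy succeeded

    model = DecisionTreeClassifier(max_depth=2).fit(past_cases, outcomes)

    # A new case: the model suggests a decision based on the related past cases.
    print(model.predict([[40, 1]]))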
In certain cases, when some questions arise for the first time and only a little
knowledge about them is available, data from related situations helps us to solve
the new problem, or unknown relationships can be discovered from the data to gain
knowledge in an unfamiliar area.
The preparation of data involves opting for the required data from the related data
source and incorporating it into a data set that can be used for data mining.
The main goal of intelligent data analysis is to obtain knowledge.
It is challenging to choose suitable methods to resolve the complexity of the process.
Regarding the term visualization, we have moved away from visualization toward the
term charting. The term analysis is used for the method of incorporating,
influencing, filtering and scrubbing the data, which certainly includes, but is not
limited to, interacting with the data through charts.
4. NATURE OF DATA
Data should have specific items (values or facts), which must be identified.
Specific items of data must be organized into a meaningful form.
Data should have the functions to perform.
The nature of data can be understood on the basis of the class to which it belongs.
There is a large measure of cross-classification, e.g., all quantitative data are
numerical data, and most data are quantitative data.
With reference to the types of data; their nature in sciences is as follows:
1. Numerical data:
All data in sciences are derived by measurement and stated in numerical values.
Most of the time their nature is numerical. Even in semi-quantitative data,
affirmative and negative answers are coded as '1' and '0' for obtaining numerical
data.
Thus, except in the three cases of qualitative, graphic and symbolic data, the
remaining yield numerical data.
2. Descriptive data:
Sciences are not known for descriptive data.
However, qualitative data in sciences are expressed in terms of definitive
statements concerning objects.
These may be viewed as descriptive data.
Here, the nature of data is descriptive.
3. Graphic and symbolic data:
Graphic and symbolic data are modes of presentation.
5. ANALYTIC PROCESSES AND TOOLS
2. Iteration:
The nature of iteration is that it sometimes leads you down a path that turns out to be a
dead end.
Many analysts and industry experts suggest that you start with small, well-defined
projects, learn from each iteration, and gradually move on to the next idea or field of
inquiry.
3. Flexible Capacity:
Because of the iterative nature of big data analysis, be prepared to spend more time
and utilize more resources to solve problems.
As you mine the data to discover patterns and relationships, predictive analytics can
yield the insights that you seek.
5. Decision Management:
Consider the transaction volume and velocity.
If you are using big data analytics to drive many operational decisions, then you
need to consider how to automate and optimise the implementation of all those
actions.
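A minimal sketch of such automation (the rule, threshold and score are invented; in
production the score would come from a trained predictive model):

    def decide(transaction):
        """Apply a business rule plus a predictive score to automate one decision."""
        score = 0.9 if transaction["amount"] > 10_000 else 0.1   # stand-in model
        if transaction["country_mismatch"] and score > 0.5:
            return "hold_for_review"
        return "approve"

    # At high transaction velocity this would run inside the event pipeline, so
    # every operational decision is applied automatically.
    print(decide({"amount": 15_000, "country_mismatch": True}))  # hold_for_review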
1. Discovery tools:
These are useful throughout the information lifecycle for rapid, intuitive exploration
and analysis of information from any combination of structured and unstructured
sources.
These tools permit analysis alongside traditional BI source systems.
Because there is no need for up-front modelling, users can draw new insights, come to
meaningful conclusions and make informed decisions quickly.
2. BI tools:
These are important for reporting, analysis and performance management, primarily
with transactional data from data warehouses and production information systems.
BI tools provide comprehensive capabilities for business intelligence and performance
management, including enterprise reporting, dashboards, ad-hoc analysis, scorecards,
and what-if scenario analysis on an integrated, enterprise scale platform.
3. In-database analytics:
These are techniques for applying analytics directly within the database. Because
the techniques are applied directly within the database, you eliminate data movement
to and from other analytical servers, which accelerates information cycle times and
reduces total cost of ownership.
5. Decision management:
It includes predictive modelling, business rules, and self-learning to take informed
action based on the current context.
7. MODERN DATA ANALYTICS TOOLS
Hadoop:
Hadoop quickly replicates data onto several nodes in a cluster in order to provide
reliable, fast performance. The main tools in the Hadoop ecosystem are:
HBase:
HBase is the non-relational data store for Hadoop.
Its data model is similar to Google's BigTable. It is an open-source, distributed
database developed by the Apache Software Foundation and written in Java.
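A minimal sketch using the third-party happybase client (hypothetical host and
table; assumes HBase's Thrift server is running and the table exists with an 'info'
column family):

    import happybase

    connection = happybase.Connection("hbase-host")
    table = connection.table("users")

    # Column-family:qualifier keys and byte values -- the BigTable-style model.
    table.put(b"user-001", {b"info:name": b"Asha", b"info:city": b"Chennai"})

    row = table.row(b"user-001")
    print(row[b"info:name"])   # b'Asha'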
Hive:
Hive is a data warehouse system which is used to analyze structured data.
It is built on top of Hadoop.
It runs SQL-like queries, called HQL (Hive Query Language), which get internally
converted into MapReduce jobs.
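A minimal sketch of running HQL from Python with the third-party PyHive client
(hypothetical host and table; assumes a running HiveServer2):

    from pyhive import hive

    conn = hive.Connection(host="hive-host", port=10000)
    cursor = conn.cursor()

    # An HQL query: Hive compiles this into MapReduce jobs behind the scenes.
    cursor.execute("""
        SELECT region, COUNT(*) AS orders
        FROM sales
        GROUP BY region
    """)
    for region, orders in cursor.fetchall():
        print(region, orders)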
Sqoop:
A tool designed for efficiently transferring bulk data between Apache Hadoop and
structured data stores such as relational databases.
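A typical Sqoop import invoked from Python (the JDBC URL, credentials, table and
target directory are hypothetical):

    import subprocess

    # Pull a relational table into HDFS in parallel map tasks.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/shop",
        "--username", "etl",
        "--table", "orders",
        "--target-dir", "/data/orders",
    ], check=True)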
Flume:
A distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
It has a simple and very flexible architecture based on streaming data flows.
It is quite robust and fault tolerant, and it is highly tunable to enhance the
reliability mechanisms, failover, recovery, and all the other mechanisms that keep
the cluster safe and reliable.
It uses a simple, extensible data model that allows us to build all kinds of online
analytic applications.
Oozie:
A workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are directed acyclic graphs (DAGs) of actions, and Oozie
Coordinator jobs are recurrent Workflow jobs triggered by time and data
availability.
ZooKeeper:
ZooKeeper is a highly reliable distributed coordination kernel, which can be used for
distributed locking, configuration management, leadership election, and work queues.
Zookeeper is a replicated service that holds the metadata of distributed applications.
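A minimal distributed-locking sketch with the third-party kazoo client (hypothetical
ensemble address; assumes a running ZooKeeper):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk-host:2181")
    zk.start()

    # Distributed locking: only one process across the cluster enters at a time.
    lock = zk.Lock("/locks/nightly-job", "worker-1")
    with lock:
        print("critical section: safe to run the job exactly once")

    zk.stop()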
Pig:
Pig provides high-level programming on top of Hadoop MapReduce.
Its language, Pig Latin, expresses data analysis problems as data flows.
It was originally developed at Yahoo! in 2006.
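What a few lines of Pig Latin express as a data flow -- load, tokenize, group,
count -- is sketched below as the equivalent steps in plain Python (sample lines
invented), to show the kind of MapReduce-style flow Pig generates:

    from collections import Counter

    lines = ["big data tools", "big data flows"]          # LOAD
    words = [w for line in lines for w in line.split()]   # FOREACH ... TOKENIZE
    counts = Counter(words)                               # GROUP ... COUNT
    print(counts)   # Counter({'big': 2, 'data': 2, 'tools': 1, 'flows': 1})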
6. ANALYSIS Vs REPORTING
(from "Web Analytics Action Hero")
Analysis:
Analysis means to translate information into insights.
It is the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.
Insight:
Insight refers to an analyst or business user discovering a pattern in data or a
relationship between variables that they didn't previously know existed.
Reporting:
Reporting means to translate raw data into information.
The process of organizing data into informational summaries in order to monitor
how different areas of a business are performing.
Reporting and analysis differ in terms of their purpose, tasks, outputs, delivery
and value.
The ultimate goal of both reporting and analysis is the same: to increase sales and
reduce costs.
Both reporting and analysis play roles in influencing and driving the actions which
lead to greater value in organizations.
Purpose:
Analysis                        Reporting
Provides answers                Provides data
Provides what is needed         Provides what is asked for
Is typically customized         Is typically standardized
Involves a person               Does not involve a person
Is extremely flexible           Is fairly inflexible
Tasks:
Reporting involves activities such as building, configuring, consolidating,
organizing, formatting, and summarizing.
Analysis tasks include questioning, examining, interpreting, comparing and
confirming.
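A minimal pandas sketch (invented numbers) contrasting the two kinds of task: the
report summarizes what happened, while the analysis compares the numbers to extract
an insight the summary alone does not state:

    import pandas as pd

    df = pd.DataFrame({
        "channel": ["email", "email", "ads", "ads"],
        "visits":  [120, 150, 300, 280],
        "sales":   [12, 18, 15, 13],
    })

    # Reporting: organize, format, summarize -- performance per area.
    report = df.groupby("channel")[["visits", "sales"]].sum()
    print(report)

    # Analysis: examine and compare -- email converts far better per visit.
    print(report["sales"] / report["visits"])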
Outputs:
On the surface, reporting and analysis deliverables may look similar with lots of
charts, graphs, trend lines, tables, and stats.
The first difference is the overall approach: reporting generally follows a push
approach, where reports are passively pushed to users, who are then expected to
extract meaningful insights and take appropriate actions for themselves (think
self-serve).
In the case of analysis with actual deliverables, there are two main types:
1. Ad-hoc responses:
Analysts receive requests to answer a variety of business questions, which may be
spurred by questions the reporting raised.
Typically, these urgent requests are time sensitive and demand a quick turnaround.
The analytics team may have to juggle multiple requests at the same time.
As a result, the analyses cannot go as deep or wide as the analysts may like, and the
deliverable is a short and concise report, which may or may not include any specific
recommendations.
2. Analysis presentations:
Some business questions are more complex in nature and require more time to
perform a comprehensive, deep-dive analysis.
These analysis projects result in a more formal deliverable, which includes two
important sections:
1. Key findings: The key findings highlight the most meaningful and actionable insights
gleaned from the analyses performed.
2. Recommendations: The recommendations provide guidance on what actions to take based
on the analysis findings.
Delivery:
Through the push model of reporting, recipients can access reports through an
analytics tool, intranet site, Microsoft Excel® spreadsheet, or mobile app.
They can also have them scheduled for delivery into their mailbox, mobile device
(SMS), or FTP site.
Because of the demands of having to provide data to multiple individuals and groups
at regular intervals, the building, refreshing, and delivering of reports is often
automated. It's a job for robots or computers, not human beings.
On the other hand, analysis is all about human beings using their superior reasoning
and analytical skills to extract key insights from the data and form actionable
recommendations for their organizations.
Although analysis can be "submitted" to decision makers, it is more effectively
presented person-to-person. In their book Competing on Analytics (Harvard Business
School Press, 2007), Thomas Davenport and Jeanne Harris emphasize the importance
of trust and credibility between the analyst and decision maker.
Decision makers typically don't have the time or ability to perform analyses
themselves. With a "close, trusting relationship" in place, the executives will frame
their needs correctly, the analysts will ask the right questions, and the executives will
be more likely to take action on analysis they trust.
Value:
Finally, you need to keep in mind the relationship between reporting and analysis in
driving value. Think of the data-driven decision-making stages (data > reporting >
analysis > decision > action > value)
Figure 2.5: The stages form a chain of dominoes; if you remove one of these
dominoes, you won't be able to achieve the desired value.
Figure: Reporting and Analysis Comparison across Purpose, Tasks, Outputs, Delivery
and Value (reporting tasks: organize, format, summarize; analysis tasks: examine,
compare, confirm).
UNIT-I COMPLETED
Reference Book: "Web Analytics Action Hero"
Reference Links: Tutorialspoint and GeeksforGeeks