Big Data Fundamentals
This E-Book
EPUB is an open, industry-standard format for e-books. However, support for EPUB
and its many features varies across reading devices and applications. Use your device or
app settings to customize the presentation to your liking. Settings that you can customize
often include font, font size, single or double column, landscape or portrait mode, and
figures that you can click or tap to enlarge. For additional information about the settings
and features on your reading device or app, visit the device manufacturer’s Web site.
Many titles include programming code or configuration examples. To optimize the
presentation of these elements, view the e-book in single-column, landscape mode and
adjust the font size to the smallest setting. In addition to presenting code and
configurations in the reflowable text format, we have included images of the code that
mimic the presentation found in the print book; therefore, where the reflowable format
may compromise the presentation of the code listing, you will see a “Click here to view
code image” link. Click the link to view the print-fidelity code image. To return to the
previous page viewed, click the Back button on your device or app.
Big Data Fundamentals
Concepts, Drivers & Techniques
Thomas Erl,
Wajid Khattak,
and Paul Buhler
Acknowledgments
Reader Services
PART I: THE FUNDAMENTALS OF BIG DATA
CHAPTER 1: Understanding Big Data
Concepts and Terminology
Datasets
Data Analysis
Data Analytics
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Business Intelligence (BI)
Key Performance Indicators (KPI)
Big Data Characteristics
Volume
Velocity
Variety
Veracity
Value
Different Types of Data
Structured Data
Unstructured Data
Semi-structured Data
Metadata
Case Study Background
History
Technical Infrastructure and Automation Environment
Business Goals and Obstacles
Case Study Example
Identifying Data Characteristics
Volume
Velocity
Variety
Veracity
Value
Identifying Types of Data
CHAPTER 2: Business Motivations and Drivers for Big Data Adoption
Marketplace Dynamics
Business Architecture
Business Process Management
Information and Communications Technology
Data Analytics and Data Science
Digitization
Affordable Technology and Commodity Hardware
Social Media
Hyper-Connected Communities and Devices
Cloud Computing
Internet of Everything (IoE)
Case Study Example
CHAPTER 3: Big Data Adoption and Planning Considerations
Organization Prerequisites
Data Procurement
Privacy
Security
Provenance
Limited Realtime Support
Distinct Performance Challenges
Distinct Governance Requirements
Distinct Methodology
Clouds
Big Data Analytics Lifecycle
Business Case Evaluation
Data Identification
Data Acquisition and Filtering
Data Extraction
Data Validation and Cleansing
Data Aggregation and Representation
Data Analysis
Data Visualization
Utilization of Analysis Results
Case Study Example
Big Data Analytics Lifecycle
Business Case Evaluation
Data Identification
Data Acquisition and Filtering
Data Extraction
Data Validation and Cleansing
Data Aggregation and Representation
Data Analysis
Data Visualization
Utilization of Analysis Results
CHAPTER 4: Enterprise Technologies and Big Data Business Intelligence
Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP)
Extract Transform Load (ETL)
Data Warehouses
Data Marts
Traditional BI
Ad-hoc Reports
Dashboards
Big Data BI
Traditional Data Visualization
Data Visualization for Big Data
Case Study Example
Enterprise Technology
Big Data Business Intelligence
PART II: STORING AND ANALYZING BIG DATA
CHAPTER 5: Big Data Storage Concepts
Clusters
File Systems and Distributed File Systems
NoSQL
Sharding
Replication
Master-Slave
Peer-to-Peer
Sharding and Replication
Combining Sharding and Master-Slave Replication
Combining Sharding and Peer-to-Peer Replication
CAP Theorem
ACID
BASE
Case Study Example
CHAPTER 6: Big Data Processing Concepts
Parallel Data Processing
Distributed Data Processing
Hadoop
Processing Workloads
Batch
Transactional
Cluster
Processing in Batch Mode
Batch Processing with MapReduce
Map and Reduce Tasks
Map
Combine
Partition
Shuffle and Sort
Reduce
A Simple MapReduce Example
Understanding MapReduce Algorithms
Processing in Realtime Mode
Speed Consistency Volume (SCV)
Event Stream Processing
Complex Event Processing
Realtime Big Data Processing and SCV
Realtime Big Data Processing and MapReduce
Case Study Example
Processing Workloads
Processing in Batch Mode
Processing in Realtime
CHAPTER 7: Big Data Storage Technology
On-Disk Storage Devices
Distributed File Systems
RDBMS Databases
NoSQL Databases
Characteristics
Rationale
Types
Key-Value
Document
Column-Family
Graph
NewSQL Databases
In-Memory Storage Devices
In-Memory Data Grids
Read-through
Write-through
Write-behind
Refresh-ahead
In-Memory Databases
Case Study Example
CHAPTER 8: Big Data Analysis Techniques
Quantitative Analysis
Qualitative Analysis
Data Mining
Statistical Analysis
A/B Testing
Correlation
Regression
Machine Learning
Classification (Supervised Machine Learning)
Clustering (Unsupervised Machine Learning)
Outlier Detection
Filtering
Semantic Analysis
Natural Language Processing
Text Analytics
Sentiment Analysis
Visual Analysis
Heat Maps
Time Series Plots
Network Graphs
Spatial Data Mapping
Case Study Example
Correlation
Regression
Time Series Plot
Clustering
Classification
APPENDIX A: Case Study Conclusion
About the Authors
Thomas Erl
Wajid Khattak
Paul Buhler
Index
Acknowledgments
Register your copy of Big Data Fundamentals at informit.com for convenient access to
downloads, updates, and corrections as they become available. To start the registration
process, go to informit.com/register and log in or create an account.* Enter the product
ISBN, 9780134291079, and click Submit. Once the process is complete, you will find any
available bonus content under “Registered Products.”
*Be sure to check the box that you would like to hear from us in order to receive exclusive
discounts on future editions of this product.
Part I: The Fundamentals of Big Data
Datasets
Collections or groups of related data are generally referred to as datasets. Each group or
dataset member (datum) shares the same set of attributes or properties as others in the
same dataset. Some examples of datasets are:
• tweets stored in a flat file
• a collection of image files in a directory
• an extract of rows from a database table stored in a CSV formatted file
• historical weather observations that are stored as XML files
Figure 1.1 shows three datasets based on three different data formats.
Data Analysis
Data analysis is the process of examining data to find facts, relationships, patterns,
insights and/or trends. The overall goal of data analysis is to support better decision-
making. A simple data analysis example is the analysis of ice cream sales data in order to
determine how the number of ice cream cones sold is related to the daily temperature. The
results of such an analysis would support decisions related to how much ice cream a store
should order in relation to weather forecast information. Carrying out data analysis helps
establish patterns and relationships among the data being analyzed. Figure 1.2 shows the
symbol used to represent data analysis.
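To make the ice cream example concrete, the following minimal Python sketch (all figures are hypothetical) computes the correlation between daily temperature and the number of cones sold:

# Hypothetical daily observations: temperature in Celsius and cones sold.
temperatures = [18, 21, 24, 27, 30, 33]
cones_sold = [110, 145, 180, 240, 310, 380]

def correlation(xs, ys):
    # Pearson correlation coefficient computed from first principles.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

print(correlation(temperatures, cones_sold))  # close to 1.0, a strong positive relationship

A coefficient near 1.0 would support ordering more stock ahead of a warm forecast, which is exactly the kind of decision data analysis is meant to inform.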
Data Analytics
Data analytics is a broader term that encompasses data analysis. Data analytics is a
discipline that includes the management of the complete data lifecycle, which
encompasses collecting, cleansing, organizing, storing, analyzing and governing data. The
term includes the development of analysis methods, scientific techniques and automated
tools. In Big Data environments, data analytics has developed methods that allow data
analysis to occur through the use of highly scalable distributed technologies and
frameworks that are capable of analyzing large volumes of data from different sources.
Figure 1.3 shows the symbol used to represent analytics.
Figure 1.3 The symbol used to represent data analytics.
The Big Data analytics lifecycle generally involves identifying, procuring, preparing and
analyzing large amounts of raw, unstructured data to extract meaningful information that
can serve as an input for identifying patterns, enriching existing enterprise data and
performing large-scale searches.
Different kinds of organizations use data analytics tools and techniques in different ways.
Take, for example, these three sectors:
• In business-oriented environments, data analytics results can lower operational costs
and facilitate strategic decision-making.
• In the scientific domain, data analytics can help identify the cause of a phenomenon
to improve the accuracy of predictions.
• In service-based environments like public sector organizations, data analytics can
help strengthen the focus on delivering high-quality services by driving down costs.
Data analytics enables data-driven decision-making with scientific backing so that decisions can be based on factual data rather than on past experience or intuition alone. There are four general categories of analytics that are distinguished by the results they produce:
• descriptive analytics
• diagnostic analytics
• predictive analytics
• prescriptive analytics
The different analytics types leverage different techniques and analysis algorithms. This
implies that there may be varying data, storage and processing requirements to facilitate
the delivery of multiple types of analytic results. Figure 1.4 depicts the reality that the
generation of high value analytic results increases the complexity and cost of the analytic
environment.
Figure 1.4 Value and complexity increase from descriptive to prescriptive analytics.
Descriptive Analytics
Descriptive analytics are carried out to answer questions about events that have already
occurred. This form of analytics contextualizes data to generate information.
Sample questions can include:
• What was the sales volume over the past 12 months?
• What is the number of support calls received as categorized by severity and
geographic location?
• What is the monthly commission earned by each sales agent?
It is estimated that 80% of generated analytics results are descriptive in nature. Value-
wise, descriptive analytics provide the least worth and require a relatively basic skillset.
Descriptive analytics are often carried out via ad-hoc reporting or dashboards, as shown in
Figure 1.5. The reports are generally static in nature and display historical data that is
presented in the form of data grids or charts. Queries are executed on operational data
stores from within an enterprise, for example a Customer Relationship Management
system (CRM) or Enterprise Resource Planning (ERP) system.
Figure 1.5 The operational systems, pictured left, are queried via descriptive analytics
tools to generate reports or dashboards, pictured right.
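As a minimal illustration of such a report-style query, the following Python sketch (assuming the pandas library; the table and column names are hypothetical) aggregates sales volume by month and region:

import pandas as pd

# Hypothetical operational extract with one row per sale.
sales = pd.DataFrame({
    "month": ["2015-01", "2015-01", "2015-02", "2015-02", "2015-03"],
    "region": ["East", "West", "East", "West", "East"],
    "amount": [1200.0, 950.0, 1100.0, 1300.0, 990.0],
})

# Descriptive question: what was the sales volume per month and region?
report = sales.groupby(["month", "region"])["amount"].sum()
print(report)

In practice such a query would run against an operational data store such as a CRM or ERP system rather than an in-memory table.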
Diagnostic Analytics
Diagnostic analytics aim to determine the cause of a phenomenon that occurred in the past
using questions that focus on the reason behind the event. The goal of this type of
analytics is to determine what information is related to the phenomenon in order to enable
answering questions that seek to determine why something has occurred.
Such questions include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the Eastern region than
from the Western region?
• Why was there an increase in patient re-admission rates over the past three months?
Diagnostic analytics provide more value than descriptive analytics but require a more
advanced skillset. Diagnostic analytics usually require collecting data from multiple
sources and storing it in a structure that lends itself to performing drill-down and roll-up
analysis, as shown in Figure 1.6. Diagnostic analytics results are viewed via interactive
visualization tools that enable users to identify trends and patterns. The executed queries
are more complex compared to those of descriptive analytics and are performed on multi-
dimensional data held in analytic processing systems.
Figure 1.6 Diagnostic analytics can result in data that is suitable for performing drill-
down and roll-up analysis.
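The drill-down and roll-up operations described above can be sketched with a simple pivot table in Python (pandas assumed; the support-call data is hypothetical):

import pandas as pd

# Hypothetical support-call records drawn from several source systems.
calls = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q2"],
    "region": ["East", "West", "East", "East", "West"],
    "severity": ["high", "low", "high", "low", "high"],
    "count": [120, 80, 200, 90, 60],
})

# Roll-up: total calls per quarter.
print(calls.pivot_table(values="count", index="quarter", aggfunc="sum"))

# Drill-down: split each quarter by region and severity to look for the cause.
print(calls.pivot_table(values="count", index="quarter",
                        columns=["region", "severity"], aggfunc="sum"))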
Predictive Analytics
Predictive analytics are carried out in an attempt to determine the outcome of an event that
might occur in the future. With predictive analytics, information is enhanced with meaning
to generate knowledge that conveys how that information is related. The strength and
magnitude of the associations form the basis of models that are used to generate future
predictions based upon past events. It is important to understand that the models used for
predictive analytics have implicit dependencies on the conditions under which the past
events occurred. If these underlying conditions change, then the models that make
predictions need to be updated.
Questions are usually formulated using a what-if rationale, such as the following:
• What are the chances that a customer will default on a loan if they have missed a
monthly payment?
• What will be the patient survival rate if Drug B is administered instead of Drug A?
• If a customer has purchased Products A and B, what are the chances that they will
also purchase Product C?
Predictive analytics try to predict the outcomes of events, and predictions are made based
on patterns, trends and exceptions found in historical and current data. This can lead to the
identification of both risks and opportunities.
This kind of analytics involves the use of large datasets comprised of internal and external
data and various data analysis techniques. It provides greater value and requires a more
advanced skillset than both descriptive and diagnostic analytics. The tools used generally
abstract underlying statistical intricacies by providing user-friendly front-end interfaces, as
shown in Figure 1.7.
Figure 1.7 Predictive analytics tools can provide user-friendly front-end interfaces.
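A minimal sketch of the loan-default question above, expressed as a logistic regression model in Python (scikit-learn assumed; the training records are hypothetical and far too few for real use):

from sklearn.linear_model import LogisticRegression

# Hypothetical historical records: [missed_payments, credit_utilization]
X = [[0, 0.2], [0, 0.4], [1, 0.5], [2, 0.7], [3, 0.9], [1, 0.3]]
y = [0, 0, 0, 1, 1, 0]  # 1 = the customer defaulted on the loan

model = LogisticRegression()
model.fit(X, y)

# Estimated probability of default for a customer with one missed payment
# and 60% credit utilization.
print(model.predict_proba([[1, 0.6]])[0][1])

If the conditions behind the historical data change, the model must be retrained, which reflects the dependency on past conditions noted above.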
Prescriptive Analytics
Prescriptive analytics build upon the results of predictive analytics by prescribing actions
that should be taken. The focus is not only on which prescribed option is best to follow,
but why. In other words, prescriptive analytics provide results that can be reasoned about
because they embed elements of situational understanding. Thus, this kind of analytics can
be used to gain an advantage or mitigate a risk.
Sample questions may include:
• Among three drugs, which one provides the best results?
• When is the best time to trade a particular stock?
Prescriptive analytics provide more value than any other type of analytics and
correspondingly require the most advanced skillset, as well as specialized software and
tools. Various outcomes are calculated, and the best course of action for each outcome is
suggested. The approach shifts from explanatory to advisory and can include the
simulation of various scenarios.
This sort of analytics incorporates internal data with external data. Internal data might
include current and historical sales data, customer information, product data and business
rules. External data may include social media data, weather forecasts and government-
produced demographic data. Prescriptive analytics involve the use of business rules and
large amounts of internal and external data to simulate outcomes and prescribe the best
course of action, as shown in Figure 1.8.
Figure 1.8 Prescriptive analytics involves the use of business rules and internal and/or
external data to perform an in-depth analysis.
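The following hypothetical Python sketch illustrates the prescriptive pattern of simulating several candidate actions and recommending the one with the best expected outcome (the actions, response rates and profit figures are invented for illustration):

import random

# Hypothetical expected response rates and per-response profit for three actions.
actions = {
    "discount_offer": {"response_rate": 0.12, "profit_per_response": 40.0},
    "loyalty_points": {"response_rate": 0.08, "profit_per_response": 55.0},
    "no_action":      {"response_rate": 0.02, "profit_per_response": 60.0},
}

def simulate(action, customers=1000, runs=200):
    # Monte Carlo simulation of total profit for one prescribed action.
    params = actions[action]
    totals = []
    for _ in range(runs):
        responses = sum(random.random() < params["response_rate"] for _ in range(customers))
        totals.append(responses * params["profit_per_response"])
    return sum(totals) / runs

best = max(actions, key=simulate)
print("Prescribed action:", best)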
Figure 1.10 A KPI dashboard acts as a central reference point for gauging business
performance.
Big Data Characteristics
For a dataset to be considered Big Data, it must possess one or more characteristics that
require accommodation in the solution design and architecture of the analytic
environment. Most of these data characteristics were initially identified by Doug Laney in
early 2001 when he published an article describing the impact of the volume, velocity and
variety of e-commerce data on enterprise data warehouses. To this list, veracity has been
added to account for the lower signal-to-noise ratio of unstructured data as compared to
structured data sources. Ultimately, the goal is to conduct analysis of the data in such a
manner that high-quality results are delivered in a timely manner, which provides optimal
value to the enterprise.
This section explores the five Big Data characteristics that can be used to help differentiate
data categorized as “Big” from other forms of data. The five Big Data traits shown in
Figure 1.11 are commonly referred to as the Five Vs:
• volume
• velocity
• variety
• veracity
• value
Volume
The anticipated volume of data that is processed by Big Data solutions is substantial and
ever-growing. High data volumes impose distinct data storage and processing demands, as
well as additional data preparation, curation and management processes. Figure 1.12
provides a visual representation of the large volume of data being created daily by
organizations and users world-wide.
Figure 1.12 Organizations and users world-wide create over 2.5 EBs of data a day. As a
point of comparison, the Library of Congress currently holds more than 300 TBs of
data.
Typical data sources that are responsible for generating high data volumes can include:
• online transactions, such as point-of-sale and banking
• scientific and research experiments, such as the Large Hadron Collider and Atacama
Large Millimeter/Submillimeter Array telescope
• sensors, such as GPS sensors, RFIDs, smart meters and telematics
• social media, such as Facebook and Twitter
Velocity
In Big Data environments, data can arrive at fast speeds, and enormous datasets can
accumulate within very short periods of time. From an enterprise’s point of view, the
velocity of data translates into the amount of time it takes for the data to be processed once
it enters the enterprise’s perimeter. Coping with the fast inflow of data requires the
enterprise to design highly elastic and available data processing solutions and
corresponding data storage capabilities.
Depending on the data source, velocity may not always be high. For example, MRI scan
images are not generated as frequently as log entries from a high-traffic webserver. As
illustrated in Figure 1.13, data velocity is put into perspective when considering that the
following data volume can easily be generated in a given minute: 350,000 tweets, 300
hours of video footage uploaded to YouTube, 171 million emails and 330 GBs of sensor
data from a jet engine.
Figure 1.13 Examples of high-velocity Big Data datasets produced every minute
include tweets, video, emails and GBs generated from a jet engine.
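Scaling the per-minute figures cited above to a full day gives a sense of how quickly such data accumulates (illustrative arithmetic only):

# Scale the per-minute figures cited above to a single day.
minutes_per_day = 60 * 24

tweets_per_day = 350_000 * minutes_per_day        # 504,000,000 tweets
emails_per_day = 171_000_000 * minutes_per_day    # roughly 246 billion emails
sensor_gb_per_day = 330 * minutes_per_day         # 475,200 GB, roughly 464 TB

print(tweets_per_day, emails_per_day, sensor_gb_per_day)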
Variety
Data variety refers to the multiple formats and types of data that need to be supported by
Big Data solutions. Data variety brings challenges for enterprises in terms of data
integration, transformation, processing, and storage. Figure 1.14 provides a visual
representation of data variety, which includes structured data in the form of financial
transactions, semi-structured data in the form of emails and unstructured data in the form
of images.
Figure 1.14 Examples of high-variety Big Data datasets include structured, textual,
image, video, audio, XML, JSON, sensor data and metadata.
Veracity
Veracity refers to the quality or fidelity of data. Data that enters Big Data environments
needs to be assessed for quality, which can lead to data processing activities to resolve
invalid data and remove noise. In relation to veracity, data can be part of the signal or
noise of a dataset. Noise is data that cannot be converted into information and thus has no
value, whereas signals have value and lead to meaningful information. Data with a high
signal-to-noise ratio has more veracity than data with a lower ratio. Data that is acquired
in a controlled manner, for example via online customer registrations, usually contains less
noise than data acquired via uncontrolled sources, such as blog postings. Thus the signal-
to-noise ratio of data is dependent upon the source of the data and its type.
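A minimal, hypothetical Python sketch of a veracity check that separates signal from noise by applying simple validation rules to incoming records:

# Hypothetical raw records; the second one fails every quality rule.
raw_records = [
    {"customer_id": "C001", "age": 34, "email": "a@example.com"},
    {"customer_id": "",     "age": -5, "email": "not-an-email"},
    {"customer_id": "C002", "age": 51, "email": "b@example.com"},
]

def is_signal(record):
    # Simple quality rules; real veracity checks are domain-specific.
    return (bool(record["customer_id"])
            and 0 < record["age"] < 120
            and "@" in record["email"])

signal = [r for r in raw_records if is_signal(r)]
noise_ratio = 1 - len(signal) / len(raw_records)
print(signal, noise_ratio)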
Value
Value is defined as the usefulness of data for an enterprise. The value characteristic is
intuitively related to the veracity characteristic in that the higher the data fidelity, the more
value it holds for the business. Value is also dependent on how long data processing takes
because analytics results have a shelf-life; for example, a stock quote that is 20 minutes old has little to no value for making a trade compared to a quote that is 20 milliseconds old.
As demonstrated, value and time are inversely related. The longer it takes for data to be
turned into meaningful information, the less value it has for a business. Stale results
inhibit the quality and speed of informed decision-making. Figure 1.15 provides two
illustrations of how value is impacted by the veracity of data and the timeliness of
generated analytic results.
Figure 1.15 Data that has high veracity and can be analyzed quickly has more value to
a business.
Apart from veracity and time, value is also impacted by the following lifecycle-related
concerns:
• How well has the data been stored?
• Were valuable attributes of the data removed during data cleansing?
• Are the right types of questions being asked during data analysis?
• Are the results of the analysis being accurately communicated to the appropriate
decision-makers?
Figure 1.17 Examples of machine-generated data include web logs, sensor data,
telemetry data, smart meter data and appliance usage data.
As demonstrated, human-generated and machine-generated data can come from a variety
of sources and be represented in various formats or types. This section examines the
variety of data types that are processed by Big Data solutions. The primary types of data
are:
• structured data
• unstructured data
• semi-structured data
These data types refer to the internal organization of data and are sometimes called data
formats. Apart from these three fundamental data types, another important type of data in
Big Data environments is metadata. Each will be explored in turn.
Structured Data
Structured data conforms to a data model or schema and is often stored in tabular form. It
is used to capture relationships between different entities and is therefore most often
stored in a relational database. Structured data is frequently generated by enterprise
applications and information systems like ERP and CRM systems. Due to the abundance
of tools and databases that natively support structured data, it rarely requires special
consideration in regards to processing or storage. Examples of this type of data include
banking transactions, invoices, and customer records. Figure 1.18 shows the symbol used
to represent structured data.
Figure 1.18 The symbol used to represent structured data stored in a tabular form.
Unstructured Data
Data that does not conform to a data model or data schema is known as unstructured data.
It is estimated that unstructured data makes up 80% of the data within any given
enterprise. Unstructured data has a faster growth rate than structured data. Figure 1.19
illustrates some common types of unstructured data. This form of data is either textual or
binary and often conveyed via files that are self-contained and non-relational. A text file
may contain the contents of various tweets or blog postings. Binary files are often media
files that contain image, audio or video data. Technically, both text and binary files have a structure defined by the file format, but this aspect is disregarded; the notion of being unstructured relates to the format of the data contained within the file.
Figure 1.19 Video, image and audio files are all types of unstructured data.
Special purpose logic is usually required to process and store unstructured data. For
example, to play a video file, it is essential that the correct codec (coder-decoder) is
available. Unstructured data cannot be directly processed or queried using SQL. If it is
required to be stored within a relational database, it is stored in a table as a Binary Large
Object (BLOB). Alternatively, a Not-only SQL (NoSQL) database is a non-relational
database that can be used to store unstructured data alongside structured data.
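As a minimal sketch of the BLOB approach mentioned above, the following Python example (using the standard sqlite3 module; the table and file name are hypothetical) stores binary content in a relational table:

import sqlite3

# Stand-in bytes for real JPEG data.
image_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 1024

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (name TEXT, content BLOB)")
conn.execute("INSERT INTO media VALUES (?, ?)", ("product_photo.jpg", image_bytes))
conn.commit()

# The BLOB can be stored and retrieved, but SQL cannot query what the image depicts.
stored = conn.execute("SELECT content FROM media WHERE name = ?",
                      ("product_photo.jpg",)).fetchone()[0]
print(len(stored), "bytes stored")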
Semi-structured Data
Semi-structured data has a defined level of structure and consistency, but is not relational
in nature. Instead, semi-structured data is hierarchical or graph-based. This kind of data is
commonly stored in files that contain text. For instance, Figure 1.20 shows that XML and
JSON files are common forms of semi-structured data. Due to the textual nature of this
data and its conformance to some level of structure, it is more easily processed than
unstructured data.
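A brief Python sketch shows why semi-structured data is comparatively easy to process: a hypothetical JSON customer record carries its own labels and hierarchy, so it can be navigated directly:

import json

# Hypothetical semi-structured customer record: hierarchical, not tabular.
record = json.loads("""
{
  "customer": {
    "id": "C001",
    "name": "John",
    "policies": [
      {"type": "building", "premium": 420.0},
      {"type": "marine", "premium": 1150.0}
    ]
  }
}
""")

# The nested structure can be navigated because it describes itself.
total_premium = sum(p["premium"] for p in record["customer"]["policies"])
print(total_premium)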
Metadata
Metadata provides information about a dataset’s characteristics and structure. This type of
data is mostly machine-generated and can be appended to data. The tracking of metadata
is crucial to Big Data processing, storage and analysis because it provides information
about the pedigree of the data and its provenance during processing. Examples of
metadata include:
• XML tags providing the author and creation date of a document
• attributes providing the file size and resolution of a digital photograph
Big Data solutions rely on metadata, particularly when processing semi-structured and
unstructured data. Figure 1.21 shows the symbol used to represent metadata.
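As a small, hypothetical example of the XML-based metadata described above, the following Python sketch reads a document's author and creation date:

import xml.etree.ElementTree as ET

# Hypothetical XML metadata describing a document's author and creation date.
xml_metadata = """
<document>
  <meta>
    <author>Claims Department</author>
    <created>2015-06-01</created>
  </meta>
  <body>...</body>
</document>
"""

meta = ET.fromstring(xml_metadata).find("meta")
print(meta.findtext("author"), meta.findtext("created"))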
History
ETI started its life as an exclusive health insurance provider 50 years ago. As a result of
multiple acquisitions over the past 30 years, ETI has extended its services to include
property and casualty insurance plans in the building, marine and aviation sectors. Each of
its four sectors is comprised of a core team of specialized and experienced agents,
actuaries, underwriters and claim adjusters.
The agents generate the company’s revenue by selling policies while the actuaries are
responsible for risk assessment, coming up with new insurance plans and revising existing
plans. The actuaries also perform what-if analyses and make use of dashboards and
scorecards for scenario evaluation. The underwriters evaluate new insurance applications
and decide on the premium amount. The claim adjusters deal with investigating claims
made against a policy and arrive at a settlement amount for the policyholder.
Some of the key departments within ETI include the underwriting, claims settlement,
customer care, legal, marketing, human resource, accounts and IT departments. Both
prospective and existing customers generally contact ETI’s customer care department via
telephone, although contact via email and social media has increased exponentially over
the past few years.
ETI strives to distinguish itself by providing competitive policies and premium customer
service that does not end once a policy has been sold. Its management believes that doing
so helps to achieve increased levels of customer acquisition and retention. ETI relies
heavily on its actuaries to create insurance plans that reflect the needs of its customers.
Technical Infrastructure and Automation Environment
ETI’s IT environment consists of a combination of client-server and mainframe platforms
that support the execution of a number of systems, including policy quotation, policy
administration, claims management, risk assessment, document management, billing,
enterprise resource planning (ERP) and customer relationship management (CRM).
The policy quotation system is used to create new insurance plans and to provide quotes to
prospective customers. It is integrated with the website and customer care portal to
provide website visitors and customer care agents the ability to obtain insurance quotes.
The policy administration system handles all aspects of policy lifecycle management,
including issuance, update, renewal and cancellation of policies. The claims management
system deals with claim processing activities.
A claim is registered when a policyholder makes a report, which is then assigned to a
claim adjuster who analyzes the claim in light of the available information that was
submitted when the claim was made, as well as other background information obtained from
different internal and external sources. Based on the analyzed information, the claim is
settled following a certain set of business rules. The risk assessment system is used by the
actuaries to assess any potential risk, such as a storm or a flood that could result in
policyholders making claims. The risk assessment system enables probability-based risk
evaluation that involves executing various mathematical and statistical models.
The document management system serves as a central repository for all kinds of
documents, including policies, claims, scanned documents and customer correspondence.
The billing system keeps track of premium collection from customers and also generates reminders, sent via email and postal mail, for customers who have missed their payments. The ERP system is used for the day-to-day running of ETI, including human resource
management and accounts. The CRM system records all aspects of customer
communication via phone, email and postal mail and also provides a portal for call center
agents for dealing with customer enquiries. Furthermore, it enables the marketing team to
create, run and manage marketing campaigns. Data from these operational systems is
exported to an Enterprise Data Warehouse (EDW) that is used to generate reports for
financial and performance analysis. The EDW is also used to generate reports for different
regulatory authorities to ensure continuous regulatory compliance.
Marketplace Dynamics
Business Architecture
Business Process Management
Information and Communications Technology
Internet of Everything (IoE)
In many organizations it is now acceptable for a business to be architected in much the
same way as its technology. This shift in perspective is reflected in the expanding domain
of enterprise architecture, which used to be closely aligned with technology architecture
but now includes business architecture as well. Although businesses still view themselves
from a mechanistic system’s point of view, with command and control being passed from
executives to managers to front-line employees, feedback loops based upon linked and
aligned measurements are providing greater insight into the effectiveness of management
decision-making.
This cycle from decision to action to measurement and assessment of results creates
opportunities for businesses to optimize their operations continuously. In fact, the
mechanistic management view is being supplanted by one that is more organic and that
drives the business based upon its ability to convert data into knowledge and insight. One
problem with this perspective is that, traditionally, businesses were driven almost
exclusively by internal data held in their information systems. However, companies are learning that this is not sufficient to execute their business models in a marketplace that more closely resembles an ecological system. As such, organizations need to
consume data from the outside to sense directly the factors that influence their
profitability. The use of such external data most often results in “Big Data” datasets.
This chapter explores the business motivations and drivers behind the adoption of Big
Data solutions and technologies. The adoption of Big Data represents the confluence of several forces, including marketplace dynamics, an appreciation and formalization of
Business Architecture (BA), the realization that a business’ ability to deliver value is
directly tied to Business Process Management (BPM), innovation in Information and
Communications Technology (ICT) and finally the Internet of Everything (IoE). Each of
these topics will be explored in turn.
Marketplace Dynamics
There has been a fundamental shift in the way businesses view themselves and the
marketplace. In the past 15 years, two large stock market corrections have taken place—
the first was the dot-com bubble burst in 2000, and the second was the global recession
that began in 2008. In each case, businesses entrenched and worked to improve their
efficiency and effectiveness to stabilize their profitability by reducing costs. This of course
is normal. When customers are scarce, cost-cutting often ensues to maintain the corporate
bottom line. In this environment, companies conduct transformation projects to improve
their corporate processes to achieve savings.
As the global economies began to emerge from recession, companies began to focus
outward, looking to find new customers and keep existing customers from defecting to
marketplace competitors. This was accomplished by offering new products and services
and delivering increased value propositions to customers. It is a very different market cycle from the one that focuses on cost-cutting, for it is not about transformation but instead
innovation. Innovation brings hope to a company that it will find new ways to achieve a
competitive advantage in the marketplace and a consequent increase in top line revenue.
The global economy can experience periods of uncertainty due to various factors. We
generally accept that the economies of the major developed countries in the world are now
inextricably intertwined; in other words, they form a system of systems. Likewise, the
world’s businesses are shifting their perspective about their identity and independence as
they recognize that they are also intertwined in intricate product and service networks.
For this reason, companies need to expand their Business Intelligence activities beyond
retrospective reflection on internal information extracted from their corporate information
systems. They need to open themselves to external data sources as a means of sensing the
marketplace and their position within it. Recognizing that external data brings additional
context to their internal data allows a corporation to move up the analytic value chain from
hindsight to insight with greater ease. With appropriate tooling, which often supports
sophisticated simulation capabilities, a company can develop analytic results that provide
foresight. In this case, the tooling assists in bridging the gap between knowledge and
wisdom as well as provides advisory analytic results. This is the power of Big Data—
enriching corporate perspective beyond introspection, from which a business can only
infer information about marketplace sentiment, to sensing the marketplace itself.
The transition from hindsight to foresight can be understood through the lens of the DIKW
pyramid depicted in Figure 2.1. Note that in this figure, at the top of the triangle, wisdom
is shown as an outline to indicate that it exists but is not typically generated via ICT
systems. Instead, knowledge workers provide the insight and experience to frame the
available knowledge so that it can be integrated to form wisdom. Wisdom generation by
technological means quickly devolves into a philosophical discussion that is not within the
scope of this book. Within business environments, technology is used to support
knowledge management, and personnel are responsible for applying their competency and
wisdom to act accordingly.
Figure 2.1 The DIKW pyramid shows how data can be enriched with context to create
information, information can be supplied with meaning to create knowledge and
knowledge can be integrated to form wisdom.
Business Architecture
Within the past decade, there has been a realization that too often a corporation’s
enterprise architecture is simply a myopic view of its technology architecture. In an effort
to wrest power from the stronghold of IT, business architecture has emerged as a
complementary discipline. In the future, the goal is that enterprise architecture will present
a balanced view between business and technology architectures. Business architecture
provides a means of blueprinting or concretely expressing the design of the business. A
business architecture helps an organization align its strategic vision with its underlying
execution, whether they be technical resources or human capital. Thus, a business
architecture includes linkages from abstract concepts like business mission, vision,
strategy and goals to more concrete ones like business services, organizational structure,
key performance indicators and application services.
These linkages are important because they provide guidance as to how to align the
business and its information technology. It is an accepted view that a business operates as
a layered system—the top layer is the strategic layer occupied by C-level executives and
advisory groups; the middle layer is the tactical or managerial layer that seeks to steer the
organization in alignment with the strategy; and the bottom layer is the operations layer
where a business executes its core processes and delivers value to its customers. These
three layers often exhibit a degree of independence from one another, but each layer’s
goals and objectives are influenced by and often defined by the layer above, in other
words top-down. From a monitoring perspective, communication flows upstream, or
bottom-up via the collection of metrics. Business activity monitoring at the operations
layer generates Performance Indicators (PIs) and metrics, for both services and processes.
They are aggregated to create Key Performance Indicators (KPIs) used at the tactical
layer. These KPIs can be aligned with Critical Success Factors (CSFs) at the strategic
layer, which in turn help measure progress being made toward the achievement of
strategic goals and objectives.
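A minimal, hypothetical Python sketch of this aggregation path: operational PIs are rolled up into a tactical KPI, which is then compared against a target associated with a strategic CSF:

# Hypothetical operational performance indicators (PIs) collected per settled claim.
claim_processing_times_hours = [30, 42, 36, 50, 28, 44]

# Aggregate the PIs into a tactical-level KPI: average time to settle a claim.
kpi_avg_settlement_time = sum(claim_processing_times_hours) / len(claim_processing_times_hours)

# Compare the KPI against a target tied to a critical success factor (CSF).
csf_target_hours = 40
status = "on target" if kpi_avg_settlement_time <= csf_target_hours else "off target"
print(round(kpi_avg_settlement_time, 1), status)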
Big Data has ties to business architecture at each of the organizational layers, as depicted
in Figure 2.2. Big Data enhances value as it provides additional context through the
integration of external perspectives to help convert data into information and provide
meaning to generate knowledge from information. For instance, at the operational level,
metrics are generated that simply report on what is happening in the business. In essence,
we are converting data through business concepts and context to generate information. At
the managerial level, this information can be examined through the lens of corporate
performance to answer questions regarding how the business is performing. In other
words, give meaning to the information. This information may be further enriched to
answer questions regarding why the business is performing at the level it is. When armed
with this knowledge, the strategic layer can provide further insight to help answer
questions of which strategy needs to change or be adopted in order to correct or enhance
the performance.
Figure 2.2 The DIKW pyramid illustrates alignment with Strategic, Tactical and
Operational corporate levels.
As with any layered system, the layers do not all change at the same speed. In the case of a
business enterprise, the strategic layer is the slowest moving layer, and the operational
layer is the fastest moving layer. The slower moving layers provide stability and direction
to the faster moving layers. In traditional organizational hierarchies, the management layer
is responsible for directing the operational layer in alignment with the strategy created by
the executive team. Because of this variation in regard to speed of change, it is possible to
envision the three layers as being responsible for strategy execution, business execution
and process execution respectively. Each of these layers relies upon different metrics and
measures, presented through different visualization and reporting functions. For example,
the strategy layer may rely upon balanced scorecards, the management layer upon an
interactive visualization of KPIs and corporate performance and the operational layer on
visualizations of executing business processes and their statuses.
Figure 2.3, a variant of a diagram produced by Joe Gollner in his blog post “The Anatomy
of Knowledge,” shows how an organization can relate and align its organizational layers
by creating a virtuous cycle via a feedback loop. On the right side of the figure, the
strategic layer drives response via the application of judgment by making decisions
regarding corporate strategy, policy, goals and objectives that are communicated as
constraints to the tactical layer. The tactical layer in turn leverages this knowledge to
generate priorities and actions that conform to corporate direction. These actions adjust the
execution of business at the operational layer. This in turn should generate measurable
change in the experience of internal stakeholders and external customers as they deliver
and consume business services. This change, or result, should surface and be visible in the
data in the form of changed PIs that are then aggregated into KPIs. Recall that KPIs are
metrics that can be associated with critical success factors that inform the executive team
as to whether or not their strategies are working. Over time, the strategic and management layers' injection of judgment and action into the loop will serve to refine the delivery of
business services.
Figure 2.3 The creation of a virtuous cycle to align an organization across layers via a
feedback loop.
Business Process Management
Businesses deliver value to customers and other stakeholders via the execution of their
business processes. A business process is a description of how work is performed in an
organization. It describes all work-related activities and their relationships, aligned with
the organizational actors and resources responsible for conducting them. The relationships
between activities may be temporal; for example, activity A is executed before activity B.
The relationships can also describe whether the execution of activities is conditional,
based upon the outputs or conditions generated by other activities or by sensing events
generated outside of the business process itself.
Business process management applies process excellence techniques to improve corporate
execution. Business Process Management Systems (BPMS) provide software developers with a model-driven platform that is becoming the Business Application Development
Environment (BADE) of choice. A business application needs to: mediate between
humans and other technology-hosted resources, execute in alignment with corporate
policies and ensure the fair distribution of work to employees. As a BADE, models of a
business process are joined with: models of organizational roles and structure, business
entities and their relationships, business rules and the user interface. The development environment integrates these models to create a business application that manages screenflow and workflow and provides workload management. This is accomplished in an
execution environment that enforces corporate policy and security and provides state
management for long-running business processes. The state of an individual process, or all
processes, can be interrogated via Business Activity Monitoring (BAM) and visualized.
When BPM is combined with BPMSs that are intelligent, processes can be executed in a
goal-driven manner. Goals are connected to process fragments that are dynamically
chosen and assembled at run-time in alignment with the evaluation of the goals. When the
combination of Big Data analytic results and goal-driven behavior are used together,
process execution can become adaptive to the marketplace and responsive to
environmental conditions. As a simple example, a customer contact process has process
fragments that enable communication with customers via a voice call, email, text message
and traditional postal mail. In the beginning, the choice of these contact methods is
unweighted, and they are chosen at random. However, behind-the-scenes analysis is being
done to measure the effectiveness of the contact method via statistical analysis of
customer responsiveness.
The results of this analysis are tied to a goal responsible for selecting the contact method,
and when a clear preference is determined, the weighting is changed to favor the contact
method that achieves the best response. A more detailed analysis could leverage customer
clustering, which would assign individual customers to groups where one of the cluster
dimensions is the contact method. In this case, customers can be contacted with even
greater refinement, which provides a pathway to one-to-one targeted marketing.
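The weighted selection of a contact method described above can be sketched in a few lines of Python (the response rates are hypothetical; a production system would keep updating them from the measured analysis results):

import random

# Hypothetical measured response rates per contact method.
response_rates = {"voice": 0.18, "email": 0.07, "text": 0.22, "postal": 0.03}

def choose_contact_method(rates):
    # Weight the random choice in favor of methods with the best measured response.
    methods = list(rates)
    weights = [rates[m] for m in methods]
    return random.choices(methods, weights=weights, k=1)[0]

# Initially all weights could be equal; as evidence accumulates, the weights shift
# and "text" is selected most often in this example.
print(choose_contact_method(response_rates))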
Digitization
For many businesses, digital mediums have replaced physical mediums as the de facto
communications and delivery mechanism. The use of digital artifacts saves both time and
cost as distribution is supported by the vast pre-existing infrastructure of the Internet. As
consumers connect to a business through their interaction with these digital substitutes, it
leads to an opportunity to collect further “secondary” data; for example, asking a customer to provide feedback or complete a survey, or simply providing a hook to display a relevant advertisement and track its click-through rate. Collecting secondary data can
be important for businesses because mining this data can allow for customized marketing,
automated recommendations and the development of optimized product features. Figure
2.4 provides a visual representation of examples of digitization.
Figure 2.4 Examples of digitization include online banking, on-demand television and
streaming video.
Social Media
The emergence of social media has empowered customers to provide feedback in near-
realtime via open and public mediums. This shift has forced businesses to consider
customer feedback on their service and product offerings in their strategic planning. As a
result, businesses are storing increasing amounts of data on customer interactions within their customer relationship management (CRM) systems and are harvesting customer reviews, complaints and praise from social media sites. This information feeds Big Data
analysis algorithms that surface the voice of the customer in an attempt to provide better
levels of service, increase sales, enable targeted marketing and even create new products
and services. Businesses have realized that branding activity is no longer completely
managed by internal marketing activities. Instead, product brands and corporate reputation
are co-created by the company and its customers. For this reason, businesses are
increasingly interested in incorporating publicly available datasets from social media and
other external data sources.
Hyper-Connected Communities and Devices
The broadening coverage of the Internet and the proliferation of cellular and Wi-Fi networks have enabled more people and their devices to be continuously active in virtual communities. Coupled with the proliferation of Internet-connected sensors, the underpinnings of the Internet of Things (IoT), a vast collection of smart Internet-connected devices, are being formed. As shown in Figure 2.6, this in turn has resulted in a
massive increase in the number of available data streams. While some streams are public,
other streams are channeled directly to corporations for analysis. As an example, the
performance-based management contracts associated with heavy equipment used in the
mining industry incentivize the optimal performance of preventive and predictive maintenance in an effort to reduce the need for, and avoid the downtime associated with, unplanned corrective maintenance. This requires detailed analysis of sensor readings
emitted by the equipment for the early detection of issues that can be resolved via the
proactive scheduling of maintenance activities.
Cloud Computing
Cloud computing advancements have led to the creation of environments that are capable
of providing highly scalable, on-demand IT resources that can be leased via pay-as-you-go
models. Businesses have the opportunity to leverage the infrastructure, storage and
processing capabilities provided by these environments in order to build-out scalable Big
Data solutions that can carry out large-scale processing tasks. Although clouds are traditionally thought of as off-premise environments, typically depicted with a cloud symbol, businesses are also leveraging cloud management software to create on-premise clouds that more effectively utilize their existing infrastructure via virtualization. In either case, the ability
of a cloud to dynamically scale based upon load allows for the creation of resilient
analytic environments that maximize efficient utilization of ICT resources.
Figure 2.7 displays an example of how a cloud environment can be leveraged for its
scaling capabilities to perform Big Data processing tasks. The fact that off-premise cloud-
based IT resources can be leased dramatically reduces the required up-front investment of
Big Data projects.
Figure 2.7 The cloud can be used to complete on-demand data analysis at the end of
each month or enable the scaling out of systems with an increase in load.
It makes sense for enterprises already using cloud computing to reuse the cloud for their
Big Data initiatives because:
• personnel already possess the required cloud computing skills
• the input data already exists in the cloud
Migrating to the cloud is logical for enterprises planning to run analytics on datasets that
are available via data markets, as many data markets make their datasets available in a
cloud environment, such as Amazon S3.
In short, cloud computing can provide three essential ingredients required for a Big Data
solution: external datasets, scalable processing capabilities and vast amounts of storage.
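As a minimal sketch of pulling a data-market dataset from cloud storage, the following Python example assumes the boto3 library, already-configured AWS credentials, and a hypothetical bucket and object name:

import boto3

# Hypothetical bucket and key; the dataset would be offered by a data market.
s3 = boto3.client("s3")
response = s3.get_object(Bucket="example-data-market", Key="demographics/2015.csv")
dataset_bytes = response["Body"].read()

print(len(dataset_bytes), "bytes downloaded for analysis")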
Internet of Everything (IoE)
The convergence of advancements in information and communications technology,
marketplace dynamics, business architecture and business process management all
contribute to the opportunity of what is now known as the Internet of Everything or IoE.
The IoE combines the services provided by smart connected devices of the Internet of
Things into meaningful business processes that possess the ability to provide unique and
differentiating value propositions. It is a platform for innovation enabling the creation of
new products and services and new sources of revenue for businesses. Big Data is the
heart of the IoE. Hyper-connected communities and devices running on affordable
technology and commodity hardware stream digitized data that is subject to analytic
processes hosted in elastic cloud computing environments. The results of the analysis can
provide insight as to how much value is generated by the current process and whether or
not the process should proactively seek opportunities to further optimize itself.
IoE-specific companies can leverage Big Data to establish and optimize workflows and
offer them to third parties as outsourced business processes. As established in the Business
Process Manifesto edited by Roger Burlton (2011), an organization’s business processes
are the source for generating outcomes of value for customers and other stakeholders. In
combination with the analysis of streaming data and customer context, being able to adapt
the execution of these processes to align with the customer’s goals will be a key corporate
differentiator in the future.
One example of an area that has benefited from the IoE is precision agriculture, with
traditional farming equipment manufacturers leading the way. When joined together as a
system of systems, GPS-controlled tractors, in-field moisture and fertilization sensors, on-
demand watering, fertilization, pesticide application systems and variable rate seeding
equipment can maximize field productivity while minimizing cost. Precision agriculture
enables alternative farming approaches that challenge industrial monoculture farms. With
the aid of the IoE, smaller farms are able to compete by leveraging crop diversity and
environmentally sensitive practices. Besides having smart connected farming equipment,
the Big Data analysis of equipment and in-field sensor data can drive a decision support
system that can guide farmers and their machines to optimum yields.
Organization Prerequisites
Data Procurement
Privacy
Security
Provenance
Limited Realtime Support
Distinct Performance Challenges
Distinct Governance Requirements
Distinct Methodology
Clouds
Big Data Analytics Lifecycle
Big Data initiatives are strategic in nature and should be business-driven. The adoption of
Big Data can be transformative but is more often innovative. Transformation activities are
typically low-risk endeavors designed to deliver increased efficiency and effectiveness.
Innovation requires a shift in mindset because it will fundamentally alter the structure of a
business either in its products, services or organization. This is the power of Big Data
adoption; it can enable this sort of change. Innovation management requires care—too
many controlling forces can stifle the initiative and dampen the results, and too little
oversight can turn a best-intentioned project into a science experiment that never delivers
promised results. It is against this backdrop that Chapter 3 addresses Big Data adoption
and planning considerations.
Given the nature of Big Data and its analytic power, there are many issues that need to be
considered and planned for in the beginning. For example, with the adoption of any new
technology, the means to secure it in a way that conforms to existing corporate standards
needs to be addressed. Tracking the provenance of a dataset from its procurement to its utilization is often a new requirement for organizations. Managing the
privacy of constituents whose data is being handled or whose identity is revealed by
analytic processes must be planned for. Big Data even opens up additional opportunities to
consider moving beyond on-premise environments and into remotely-provisioned,
scalable environments that are hosted in a cloud. In fact, all of the above considerations
require an organization to recognize and establish a set of distinct governance processes
and decision frameworks to ensure that responsible parties understand Big Data’s nature,
implications and management requirements.
Organizationally, the adoption of Big Data changes the approach to performing business
analytics. For this reason, a Big Data analytics lifecycle is introduced in this chapter. The
lifecycle begins with the establishment of a business case for the Big Data project and
ends with ensuring that the analytic results are deployed to the organization to generate
maximal value. There are a number of stages in between that organize the steps of
identifying, procuring, filtering, extracting, cleansing and aggregating of data. This is all
required before the analysis even occurs. The execution of this lifecycle requires new
competencies to be developed or hired into the organization.
As demonstrated, there are many things to consider and account for when adopting Big
Data. This chapter explains the primary potential issues and considerations.
Organization Prerequisites
Big Data frameworks are not turn-key solutions. In order for data analysis and analytics to
offer value, enterprises need to have data management and Big Data governance
frameworks. Sound processes and sufficient skillsets for those who will be responsible for
implementing, customizing, populating and using Big Data solutions are also necessary.
Additionally, the quality of the data targeted for processing by Big Data solutions needs to
be assessed.
Outdated, invalid, or poorly identified data will result in low-quality input which,
regardless of how good the Big Data solution is, will continue to produce low-quality
results. The longevity of the Big Data environment also needs to be planned for. A
roadmap needs to be defined to ensure that any necessary expansion or augmentation of
the environment is planned out to stay in sync with the requirements of the enterprise.
Data Procurement
The acquisition of Big Data solutions themselves can be economical, due to the
availability of open-source platforms and tools and opportunities to leverage commodity
hardware. However, a substantial budget may still be required to obtain external data. The
nature of the business may make external data very valuable. The greater the volume and
variety of data that can be supplied, the higher the chances are of finding hidden insights
from patterns.
External data sources include government data sources and commercial data markets.
Government-provided data, such as geo-spatial data, may be free. However, most
commercially relevant data will need to be purchased and may involve ongoing subscription costs to ensure the delivery of updates to procured datasets.
Privacy
Performing analytics on datasets can reveal confidential information about organizations
or individuals. Even analyzing separate datasets that contain seemingly benign data can
reveal private information when the datasets are analyzed jointly. This can lead to
intentional or inadvertent breaches of privacy.
Addressing these privacy concerns requires an understanding of the nature of data being
accumulated and relevant data privacy regulations, as well as special techniques for data
tagging and anonymization. For example, telemetry data, such as a car’s GPS log or smart
meter data readings, collected over an extended period of time can reveal an individual’s
location and behavior, as shown in Figure 3.1.
Figure 3.1 Information gathered from running analytics on image files, relational data
and textual data is used to create John’s profile.
Security
Some of the components of Big Data solutions lack the robustness of traditional enterprise
solution environments when it comes to access control and data security. Securing Big
Data involves ensuring that the data networks and repositories are sufficiently secured via
authentication and authorization mechanisms.
Big Data security further involves establishing data access levels for different categories
of users. For example, unlike traditional relational database management systems, NoSQL
databases generally do not provide robust built-in security mechanisms. They instead rely
on simple HTTP-based APIs where data is exchanged in plaintext, making the data prone
to network-based attacks, as shown in Figure 3.2.
Figure 3.2 NoSQL databases can be susceptible to network-based attacks.
Provenance
Provenance refers to information about the source of the data and how it has been
processed. Provenance information helps determine the authenticity and quality of data,
and it can be used for auditing purposes. Maintaining provenance as large volumes of data
are acquired, combined and put through multiple processing stages can be a complex task.
At different stages in the analytics lifecycle, data will be in different states because it may
be in transit, in use or in storage. These states correspond to the notions of data-in-motion,
data-in-use and data-at-rest. Importantly, whenever Big Data changes
state, it should trigger the capture of provenance information that is recorded as metadata.
As data enters the analytic environment, its provenance record can be initialized with the
recording of information that captures the pedigree of the data. Ultimately, the goal of
capturing provenance is to be able to reason over the generated analytic results with the
knowledge of the origin of the data and what steps or algorithms were used to process the
data that led to the result. Provenance information is essential to being able to realize the
value of the analytic result. Much like scientific research, if results cannot be justified and
repeated, they lack credibility. When provenance information is captured on the way to
generating analytic results as in Figure 3.3, the results can be more easily trusted and
thereby used with confidence.
Figure 3.3 Data may also need to be annotated with source dataset attributes and
processing step details as it passes through the data transformation steps.
Distinct Methodology
A methodology will be required to control how data flows into and out of Big Data
solutions. It will need to consider how feedback loops can be established to enable the
processed data to undergo repeated refinement, as shown in Figure 3.5. For example, an
iterative approach may be used to enable business personnel to provide IT personnel with
feedback on a periodic basis. Each feedback cycle provides opportunities for system
refinement by modifying data preparation or data analysis steps.
Figure 3.5 Each repetition can help fine-tune processing steps, algorithms and data
models to improve the accuracy of results and deliver greater value to the business.
Clouds
As mentioned in Chapter 2, clouds provide remote environments that can host IT
infrastructure for large-scale storage and processing, among other things. Regardless of
whether an organization is already cloud-enabled, the adoption of a Big Data environment
may necessitate that some or all of that environment be hosted within a cloud. For
example, an enterprise that runs its CRM system in a cloud may decide to add a Big Data
solution in the same cloud environment in order to run analytics on its CRM data. This
data can then be shared with its primary Big Data environment that resides within the
enterprise boundaries.
Common justifications for incorporating a cloud environment in support of a Big Data
solution include:
• inadequate in-house hardware resources
• upfront capital investment for system procurement is not available
• the project is to be isolated from the rest of the business so that existing business
processes are not impacted
• the Big Data initiative is a proof of concept
• datasets that need to be processed are already cloud resident
• the limits of available computing and storage resources used by an in-house Big
Data solution are being reached
Big Data Analytics Lifecycle
Big Data analysis differs from traditional data analysis primarily due to the volume,
velocity and variety characteristics of the data being processed. To address the distinct
requirements for performing analysis on Big Data, a step-by-step methodology is needed
to organize the activities and tasks involved with acquiring, processing, analyzing and
repurposing data. The upcoming sections explore a specific data analytics lifecycle that
organizes and manages the tasks and activities associated with the analysis of Big Data.
From a Big Data adoption and planning perspective, it is important that in addition to the
lifecycle, consideration be made for issues of training, education, tooling and staffing of a
data analytics team.
The Big Data analytics lifecycle can be divided into the following nine stages, as shown in
Figure 3.6:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results
Figure 3.6 The nine stages of the Big Data analytics lifecycle.
Data Identification
The Data Identification stage shown in Figure 3.8 is dedicated to identifying the datasets
required for the analysis project and their sources.
Figure 3.8 Data Identification is stage 2 of the Big Data analytics lifecycle.
Identifying a wider variety of data sources may increase the probability of finding hidden
patterns and correlations. For example, to provide insight, it can be beneficial to identify
as many types of related data sources as possible, especially when it is unclear exactly
what to look for.
Depending on the business scope of the analysis project and nature of the business
problems being addressed, the required datasets and their sources can be internal and/or
external to the enterprise.
In the case of internal datasets, a list of available datasets from internal sources, such as
data marts and operational systems, is typically compiled and matched against a pre-defined
dataset specification.
In the case of external datasets, a list of possible third-party data providers, such as data
markets and publicly available datasets, is compiled. Some forms of external data may be
embedded within blogs or other types of content-based web sites, in which case they may
need to be harvested via automated tools.
Figure 3.10 Metadata is added to data from internal and external sources.
Data Extraction
Some of the data identified as input for the analysis may arrive in a format incompatible
with the Big Data solution. The need to address disparate types of data is more likely with
data from external sources. The Data Extraction lifecycle stage, shown in Figure 3.11, is
dedicated to extracting disparate data and transforming it into a format that the underlying
Big Data solution can use for the purpose of the data analysis.
Figure 3.11 Stage 4 of the Big Data analytics lifecycle.
The extent of extraction and transformation required depends on the types of analytics and
capabilities of the Big Data solution. For example, extracting the required fields from
delimited textual data, such as with webserver log files, may not be necessary if the
underlying Big Data solution can already directly process those files.
Similarly, extracting text for text analytics, which requires scans of whole documents, is
simplified if the underlying Big Data solution can directly read the document in its native
format.
Figure 3.12 illustrates the extraction of comments and a user ID embedded within an XML
document without the need for further transformation.
Figure 3.12 Comments and user IDs are extracted from an XML document.
Figure 3.13 demonstrates the extraction of the latitude and longitude coordinates of a user
from a single JSON field.
Figure 3.13 The user ID and coordinates of a user are extracted from a single JSON
field.
Further transformation is needed in order to separate the data into two separate fields as
required by the Big Data solution.
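To make the idea concrete, the following minimal Python sketch performs the kind of extraction and transformation described above: it parses a JSON record and splits a combined coordinate field into separate latitude and longitude fields. The record layout and field names (user_id, coordinates) are assumed for illustration and are not taken from the figures.

import json

# Hypothetical source record: the coordinates arrive as a single "lat,long" string.
raw_record = '{"user_id": "u1001", "coordinates": "43.6532,-79.3832"}'

def extract_coordinates(json_text):
    """Parse a JSON record and split the combined coordinate field
    into separate latitude and longitude fields."""
    record = json.loads(json_text)
    lat, lon = record["coordinates"].split(",")
    return {
        "user_id": record["user_id"],
        "latitude": float(lat),
        "longitude": float(lon),
    }

print(extract_coordinates(raw_record))
# {'user_id': 'u1001', 'latitude': 43.6532, 'longitude': -79.3832}

In a real solution, this logic would typically run inside the Big Data platform's own extraction or ETL tooling rather than as standalone application code.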
Figure 3.15 Data validation can be used to examine interconnected datasets in order to
fill in missing valid data.
For batch analytics, data validation and cleansing can be achieved via an offline ETL
operation. For realtime analytics, a more complex in-memory system is required to
validate and cleanse the data as it arrives from the source. Provenance can play an
important role in determining the accuracy and quality of questionable data. Data that
appears to be invalid may still be valuable in that it may possess hidden patterns and
trends, as shown in Figure 3.16.
Figure 3.16 The presence of invalid data is resulting in spikes. Although the data
appears abnormal, it may be indicative of a new pattern.
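As a rough sketch of the validation logic described above, the following Python fragment tags records with missing or out-of-range values instead of discarding them, so that apparently invalid data remains available for later pattern analysis. The field names and the plausibility threshold are assumptions made for illustration.

# Hypothetical sensor readings; very large values look invalid but may hide a pattern.
readings = [{"sensor": "s1", "value": 21.5},
            {"sensor": "s1", "value": 9999.0},   # suspicious spike
            {"sensor": "s2", "value": None}]     # missing value

MAX_PLAUSIBLE_VALUE = 1000.0

def validate(record):
    """Tag each record rather than dropping it, so that suspicious
    values remain available for later analysis."""
    value = record["value"]
    if value is None:
        record["status"] = "missing"
    elif value > MAX_PLAUSIBLE_VALUE:
        record["status"] = "suspect"
    else:
        record["status"] = "valid"
    return record

cleansed = [validate(r) for r in readings]
print(cleansed)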
Figure 3.18 A simple example of data aggregation where two datasets are aggregated
together using the Id field.
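A minimal sketch of this kind of aggregation, assuming two small in-memory datasets that share an Id field (the field names and record contents are illustrative only):

# Two hypothetical datasets that share an "id" field.
dataset_a = [{"id": 1, "name": "John"}, {"id": 2, "name": "Jane"}]
dataset_b = [{"id": 1, "policy": "B1321"}, {"id": 2, "policy": "A9223"}]

def aggregate_by_id(left, right):
    """Join two datasets on their shared "id" field."""
    right_index = {record["id"]: record for record in right}
    aggregated = []
    for record in left:
        merged = dict(record)
        merged.update(right_index.get(record["id"], {}))
        aggregated.append(merged)
    return aggregated

for row in aggregate_by_id(dataset_a, dataset_b):
    print(row)

Within an actual Big Data solution, the same join would normally be expressed through the platform's query or processing engine rather than in application code.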
Figure 3.19 shows the same piece of data stored in two different formats. Dataset A
contains the desired piece of data, but it is part of a BLOB that is not readily accessible for
querying. Dataset B contains the same piece of data organized in column-based storage,
enabling each field to be queried individually.
Figure 3.19 Dataset A and B can be combined to create a standardized data structure
with a Big Data solution.
Data Analysis
The Data Analysis stage shown in Figure 3.20 is dedicated to carrying out the actual
analysis task, which typically involves one or more types of analytics. This stage can be
iterative in nature, especially if the data analysis is exploratory, in which case analysis is
repeated until the appropriate pattern or correlation is uncovered. The exploratory analysis
approach will be explained shortly, along with confirmatory analysis.
Figure 3.21 Data analysis can be carried out as confirmatory or exploratory analysis.
Confirmatory data analysis is a deductive approach where the cause of the phenomenon
being investigated is proposed beforehand. The proposed cause or assumption is called a
hypothesis. The data is then analyzed to prove or disprove the hypothesis and provide
definitive answers to specific questions. Data sampling techniques are typically used.
Unexpected findings or anomalies are usually ignored since a predetermined cause was
assumed.
Exploratory data analysis is an inductive approach that is closely associated with data
mining. No hypothesis or predetermined assumptions are generated. Instead, the data is
explored through analysis to develop an understanding of the cause of the phenomenon.
Although it may not provide definitive answers, this method provides a general direction
that can facilitate the discovery of patterns or anomalies.
Data Visualization
The ability to analyze massive amounts of data and find useful insights carries little value
if the only ones that can interpret the results are the analysts.
The Data Visualization stage, shown in Figure 3.22, is dedicated to using data
visualization techniques and tools to graphically communicate the analysis results for
effective interpretation by business users.
Figure 3.22 Stage 8 of the Big Data analytics lifecycle.
Business users need to be able to understand the results in order to obtain value from the
analysis and subsequently have the ability to provide feedback, as indicated by the dashed
line leading from stage 8 back to stage 7.
The results of completing the Data Visualization stage provide users with the ability to
perform visual analysis, allowing for the discovery of answers to questions that users have
not yet even formulated. Visual analysis techniques are covered later in this book.
The same results may be presented in a number of different ways, which can influence the
interpretation of the results. Consequently, it is important to use the most suitable
visualization technique by keeping the business domain in context.
Another aspect to keep in mind is that providing users with a method of drilling down to
comparatively simple statistics is crucial for understanding how the rolled-up or
aggregated results were generated.
Figure 4.1 OLTP systems perform simple database operations to provide sub-second
response times.
The queries supported by OLTP systems are comprised of simple insert, delete and update
operations with sub-second response times. Examples include ticket reservation systems,
banking and point of sale systems.
Online Analytical Processing (OLAP)
Online analytical processing (OLAP) systems are used for processing data analysis
queries. OLAP systems form an integral part of business intelligence, data mining and machine
learning processes. They are relevant to Big Data in that they can serve as both a data
source as well as a data sink that is capable of receiving data. They are used in diagnostic,
predictive and prescriptive analytics. As shown in Figure 4.2, OLAP systems perform
long-running, complex queries against a multidimensional database whose structure is
optimized for performing advanced analytics.
Data Warehouses
A data warehouse is a central, enterprise-wide repository consisting of historical and
current data. Data warehouses are heavily used by BI to run various analytical queries, and
they usually interface with an OLAP system to support multi-dimensional analytical
queries, as shown in Figure 4.4.
Figure 4.4 Batch jobs periodically load data into a data warehouse from operational
systems like ERP, CRM and SCM.
Data pertaining to multiple business entities from different operational systems is
periodically extracted, validated, transformed and consolidated into a single denormalized
database. With periodic data imports from across the enterprise, the amount of data
contained in a given data warehouse will continue to increase. Over time this leads to
slower query response times for data analysis tasks. To resolve this shortcoming, data
warehouses usually contain optimized databases, called analytical databases, to handle
reporting and data analysis tasks. An analytical database can exist as a separate DBMS, as
in the case of an OLAP database.
Data Marts
A data mart is a subset of the data stored in a data warehouse that typically belongs to a
department, division, or specific line of business. Data warehouses can have multiple data
marts. As shown in Figure 4.5, enterprise-wide data is collected and business entities are
then extracted. Domain-specific entities are persisted into the data warehouse via an ETL
process.
Figure 4.5 A data warehouse’s single version of “truth” is based on cleansed data,
which is a prerequisite for accurate and error-free reports, as per the output shown on
the right.
Traditional BI
Traditional BI primarily utilizes descriptive and diagnostic analytics to provide
information on historical and current events. It is not “intelligent” because it only provides
answers to correctly formulated questions. Correctly formulating questions requires an
understanding of business problems and issues and of the data itself. BI reports on
different KPIs through:
• ad-hoc reports
• dashboards
Ad-hoc Reports
Ad-hoc reporting is a process that involves manually processing data to produce custom-
made reports, as shown in Figure 4.6. The focus of an ad-hoc report is usually on a
specific area of the business, such as its marketing or supply chain management. The
generated custom reports are detailed and often tabular in nature.
Figure 4.6 OLAP and OLTP data sources can be used by BI tools for both ad-hoc
reporting and dashboards.
Dashboards
Dashboards provide a holistic view of key business areas. The information displayed on
dashboards is generated at periodic intervals, or in realtime or near-realtime. The presentation
of data on dashboards is graphical in nature, using bar charts, pie charts and gauges, as
shown in Figure 4.7.
Figure 4.7 BI tools use both OLAP and OLTP to display the information on
dashboards.
As previously explained, data warehouses and data marts contain consolidated and
validated information about enterprise-wide business entities. Traditional BI cannot
function effectively without data marts because they contain the optimized and segregated
data that BI requires for reporting purposes. Without data marts, data needs to be extracted
from the data warehouse via an ETL process on an ad-hoc basis whenever a query needs
to be run. This increases the time and effort to execute queries and generate reports.
Traditional BI uses data warehouses and data marts for reporting and data analysis
because they allow complex data analysis queries with multiple joins and aggregations to
be issued, as shown in Figure 4.8.
Big Data BI
Big Data BI builds upon traditional BI by acting on the cleansed, consolidated enterprise-
wide data in the data warehouse and combining it with semi-structured and unstructured
data sources. It comprises both predictive and prescriptive analytics to facilitate the
development of an enterprise-wide understanding of business performance.
While traditional BI analyses generally focus on individual business processes, Big Data
BI analyses focus on multiple business processes simultaneously. This helps reveal
patterns and anomalies across a broader scope within the enterprise. It also leads to data
discovery by identifying insights and information that may have been previously absent or
unknown.
Big Data BI requires the analysis of unstructured, semi-structured and structured data
residing in the enterprise data warehouse. This requires a “next-generation” data
warehouse that uses new features and technologies to store cleansed data originating from
a variety of sources in a single uniform data format. The coupling of a traditional data
warehouse with these new technologies results in a hybrid data warehouse. This
warehouse acts as a uniform and central repository of structured, semi-structured and
unstructured data that can provide Big Data BI tools with all of the required data. This
eliminates the need for Big Data BI tools to have to connect to multiple data sources to
retrieve or access data. In Figure 4.9, a next-generation data warehouse establishes a
standardized data access layer across a range of data sources.
Clusters
File Systems and Distributed File Systems
NoSQL
Sharding
Replication
Sharding and Replication
CAP Theorem
ACID
BASE
Data acquired from external sources is often not in a format or structure that can be
directly processed. To overcome these incompatibilities and prepare data for storage and
processing, data wrangling is necessary. Data wrangling includes steps to filter, cleanse
and otherwise prepare the data for downstream analysis. From a storage perspective, a
copy of the data is first stored in its acquired format, and, after wrangling, the prepared
data needs to be stored again. Typically, storage is required whenever the following
occurs:
• external datasets are acquired, or internal data will be used in a Big Data
environment
• data is manipulated to be made amenable for data analysis
• data is processed via an ETL activity, or output is generated as a result of an
analytical operation
Due to the need to store Big Data datasets, often in multiple copies, innovative storage
strategies and technologies have been created to achieve cost-effective and highly scalable
storage solutions. In order to understand the underlying mechanisms behind Big Data
storage technology, the following topics are introduced in this chapter:
• clusters
• file systems and distributed file systems
• NoSQL
• sharding
• replication
• CAP theorem
• ACID
• BASE
Clusters
In computing, a cluster is a tightly coupled collection of servers, or nodes. These servers
usually have the same hardware specifications and are connected together via a network to
work as a single unit, as shown in Figure 5.1. Each node in the cluster has its own
dedicated resources, such as memory, a processor, and a hard drive. A cluster can execute
a task by splitting it into small pieces and distributing their execution onto different
computers that belong to the cluster.
Figure 5.4 A NoSQL database can provide an API- or SQL-like query interface.
Sharding
Sharding is the process of horizontally partitioning a large dataset into a collection of
smaller, more manageable datasets called shards. The shards are distributed across
multiple nodes, where a node is a server or a machine (Figure 5.5). Each shard is stored on
a separate node and each node is responsible for only the data stored on it. Each shard
shares the same schema, and all shards collectively represent the complete dataset.
Figure 5.5 An example of sharding where a dataset is spread across Node A and Node
B, resulting in Shard A and Shard B, respectively.
Sharding is often transparent to the client, but this is not a requirement. Sharding allows
the distribution of processing loads across multiple nodes to achieve horizontal scalability.
Horizontal scaling is a method for increasing a system’s capacity by adding similar or
higher capacity resources alongside existing resources. Since each node is responsible for
only a part of the whole dataset, read/write times are greatly improved.
Figure 5.6 presents an illustration of how sharding works in practice:
1. Each shard can independently service reads and writes for the specific subset of data
that it is responsible for.
2. Depending on the query, data may need to be fetched from both shards.
Figure 5.6 A sharding example where data is fetched from both Node A and Node B.
A benefit of sharding is that it provides partial tolerance toward failures. In case of a node
failure, only data stored on that node is affected.
With regards to data partitioning, query patterns need to be taken into account so that
shards themselves do not become performance bottlenecks. For example, queries requiring
data from multiple shards will impose performance penalties. Data locality keeps
commonly accessed data co-located on a single shard and helps counter such performance
issues.
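The routing logic behind sharding can be sketched in a few lines of Python. The two-node setup and the hash-based key-to-shard function below are assumptions made for illustration and do not represent any particular product's behavior.

# Hypothetical two-node cluster; each node holds one shard as an in-memory store.
nodes = {0: {}, 1: {}}   # shard index -> key/value store

def shard_for(key, shard_count=len(nodes)):
    """Route a key to a shard using a simple hash function."""
    return hash(key) % shard_count

def write(key, value):
    nodes[shard_for(key)][key] = value

def read(key):
    return nodes[shard_for(key)].get(key)

write("quote:1001", {"sector": "marine", "premium": 1800})
write("quote:1002", {"sector": "aviation", "premium": 7400})
print(read("quote:1001"))

Note how a query touching keys that live on different nodes would require fetching data from both shards, which is the performance penalty described above.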
Replication
Replication stores multiple copies of a dataset, known as replicas, on multiple nodes
(Figure 5.7). Replication provides scalability and availability due to the fact that the same
data is replicated on various nodes. Fault tolerance is also achieved since data redundancy
ensures that data is not lost when an individual node fails. There are two different methods
that are used to implement replication:
• master-slave
• peer-to-peer
Figure 5.7 An example of replication where a dataset is replicated to Node A and Node
B, resulting in Replica A and Replica B.
Master-Slave
During master-slave replication, nodes are arranged in a master-slave configuration, and
all data is written to a master node. Once saved, the data is replicated over to multiple
slave nodes. All external write requests, including insert, update and delete, occur on the
master node, whereas read requests can be fulfilled by any slave node. In Figure 5.8,
writes are managed by the master node and data can be read from either Slave A or Slave
B.
Figure 5.8 An example of master-slave replication where Master A is the single point
of contact for all writes, and data can be read from Slave A and Slave B.
Master-slave replication is ideal for read intensive loads rather than write intensive loads
since growing read demands can be managed by horizontal scaling to add more slave
nodes. Writes are consistent, as all writes are coordinated by the master node. The
implication is that write performance will suffer as the amount of writes increases. If the
master node fails, reads are still possible via any of the slave nodes.
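The write/read split of master-slave replication can be sketched as follows; the in-memory Master and Slave classes are purely illustrative and assume that the master copies each write to every slave synchronously.

import random

class Slave:
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)

class Master:
    """All writes go to the master, which copies them to every slave."""
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves
    def write(self, key, value):
        self.data[key] = value
        for slave in self.slaves:        # replication step
            slave.data[key] = value

slaves = [Slave(), Slave()]
master = Master(slaves)
master.write("policy:42", "active")
print(random.choice(slaves).read("policy:42"))   # reads can be served by any slave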
A slave node can be configured as a backup node for the master node. In the event that the
master node fails, writes are not supported until a master node is reestablished. The master
node is either resurrected from a backup of the master node, or a new master node is
chosen from the slave nodes.
One concern with master-slave replication is read inconsistency, which can be an issue if a
slave node is read prior to an update to the master being copied to it. To ensure read
consistency, a voting system can be implemented where a read is declared consistent if the
majority of the slaves contain the same version of the record. Implementation of such a
voting system requires a reliable and fast communication mechanism between the slaves.
Figure 5.9 illustrates a scenario where read inconsistency occurs.
1. User A updates data.
2. The data is copied over to Slave A by the Master.
3. Before the data is copied over to Slave B, User B tries to read the data from Slave
B, which results in an inconsistent read.
4. The data will eventually become consistent when Slave B is updated by the Master.
Figure 5.9 An example of master-slave replication where read inconsistency occurs.
Peer-to-Peer
With peer-to-peer replication, all nodes operate at the same level. In other words, there is
not a master-slave relationship between the nodes. Each node, known as a peer, is equally
capable of handling reads and writes. Each write is copied to all peers, as illustrated in
Figure 5.10.
Figure 5.10 Writes are copied to Peers A, B and C simultaneously. Data is read from
Peer A, but it can also be read from Peers B or C.
Peer-to-peer replication is prone to write inconsistencies that occur as a result of a
simultaneous update of the same data across multiple peers. This can be addressed by
implementing either a pessimistic or optimistic concurrency strategy.
• Pessimistic concurrency is a proactive strategy that prevents inconsistency. It uses
locking to ensure that only one update to a record can occur at a time. However, this
is detrimental to availability since the database record being updated remains
unavailable until all locks are released.
• Optimistic concurrency is a reactive strategy that does not use locking. Instead, it
allows inconsistency to occur with the knowledge that consistency will eventually be
achieved after all updates have propagated.
With optimistic concurrency, peers may remain inconsistent for some period of time
before attaining consistency. However, the database remains available as no locking is
involved. Like master-slave replication, reads can be inconsistent during the time period
when some of the peers have completed their updates while others perform their updates.
However, reads eventually become consistent when the updates have been executed on all
peers.
To ensure read consistency, a voting system can be implemented where a read is declared
consistent if the majority of the peers contain the same version of the record. As
previously indicated, implementation of such a voting system requires a reliable and fast
communication mechanism between the peers.
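A voting read of this kind can be sketched in a few lines of Python; the three-peer setup and the simple majority rule below are assumptions made for illustration.

from collections import Counter

# Hypothetical replicas of the same record held by three peers; peer_c is stale.
peer_values = {"peer_a": "v2", "peer_b": "v2", "peer_c": "v1"}

def voting_read(replicas):
    """Declare a read consistent only if a majority of peers agree
    on the same version of the record."""
    counts = Counter(replicas.values())
    value, votes = counts.most_common(1)[0]
    if votes > len(replicas) // 2:
        return value      # consistent read
    return None           # no majority: the read cannot be declared consistent

print(voting_read(peer_values))   # "v2"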
Figure 5.11 demonstrates a scenario where an inconsistent read occurs.
1. User A updates data.
2. a. The data is copied over to Peer A.
b. The data is copied over to Peer B.
3. Before the data is copied over to Peer C, User B tries to read the data from Peer C,
resulting in an inconsistent read.
4. The data will eventually be updated on Peer C, and the database will once again
become consistent.
Figure 5.13 An example that shows the combination of sharding and master-slave
replication.
Combining Sharding and Peer-to-Peer Replication
When combining sharding with peer-to-peer replication, each shard is replicated to
multiple peers, and each peer is only responsible for a subset of the overall dataset.
Collectively, this helps achieve increased scalability and fault tolerance. As there is no
master involved, there is no single point of failure and fault-tolerance for both read and
write operations is supported.
In Figure 5.14:
• Each node contains replicas of two different shards.
• Writes (id = 3) are replicated to both Node A and Node C (Peers) as they are
responsible for Shard C.
• Reads (id = 6) can be served by either Node B or Node C as they each contain Shard
B.
CAP Theorem
The Consistency, Availability, and Partition tolerance (CAP) theorem, also known as
Brewer’s theorem, expresses a triple constraint related to distributed database systems. It
states that a distributed database system, running on a cluster, can only provide two of the
following three properties:
• Consistency – A read from any node results in the same data across multiple nodes
(Figure 5.15).
Figure 5.15 Consistency: all three users get the same value for the amount column even
though three different nodes are serving the record.
• Availability – A read/write request will always be acknowledged in the form of a
success or a failure (Figure 5.16).
Figure 5.16 Availability and partition tolerance: in the event of a communication
failure, requests from both users are still serviced (1, 2). However, with User B, the
update fails as the record with id = 3 has not been copied over to Peer C. The user is
duly notified (3) that the update has failed.
• Partition tolerance – The database system can tolerate communication outages that
split the cluster into multiple silos and can still service read/write requests (Figure
5.16).
The following scenarios demonstrate why only two of the three properties of the CAP
theorem are simultaneously supportable. To aid this discussion, Figure 5.17 provides a
Venn diagram showing the areas of overlap between consistency, availability and partition
tolerance.
Figure 5.17 A Venn diagram summarizing the CAP theorem.
If consistency (C) and availability (A) are required, available nodes need to communicate
to ensure consistency (C). Therefore, partition tolerance (P) is not possible.
If consistency (C) and partition tolerance (P) are required, nodes cannot remain available
(A) as the nodes will become unavailable while achieving a state of consistency (C).
If availability (A) and partition tolerance (P) are required, then consistency (C) is not
possible because of the data communication requirement between the nodes. So, the
database can remain available (A) but with inconsistent results.
In a distributed database, scalability and fault tolerance can be improved through
additional nodes, although this challenges consistency (C). The addition of nodes can also
cause availability (A) to suffer due to the latency caused by increased communication
between nodes.
A distributed database system cannot avoid network partitions entirely. Although
communication outages are rare and temporary, partition tolerance (P) must always be
supported by a distributed database; therefore, CAP is generally a choice between
C+P and A+P. The requirements of the system will dictate which is chosen.
ACID
ACID is a database design principle related to transaction management. It is an acronym
that stands for:
• atomicity
• consistency
• isolation
• durability
ACID is a transaction management style that leverages pessimistic concurrency controls to
ensure consistency is maintained through the application of record locks. ACID is the
traditional approach to database transaction management as it is leveraged by relational
database management systems.
Atomicity ensures that all operations will always succeed or fail completely. In other
words, there are no partial transactions.
The following steps are illustrated in Figure 5.18:
1. A user attempts to update three records as a part of a transaction.
2. Two records are successfully updated before the occurrence of an error.
3. As a result, the database rolls back any partial effects of the transaction and puts the
system back to its prior state.
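Atomic, all-or-nothing behavior can be observed directly with any ACID-compliant relational database. The following sketch uses Python's built-in sqlite3 module; the accounts table and the deliberate constraint violation are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 100), (3, 100)")
conn.commit()

try:
    with conn:   # one transaction: all three updates succeed or none do
        conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 2")
        conn.execute("UPDATE accounts SET balance = NULL WHERE id = 3")  # violates NOT NULL
except sqlite3.Error:
    pass  # the partial updates to rows 1 and 2 have been rolled back

print(conn.execute("SELECT balance FROM accounts").fetchall())
# [(100,), (100,), (100,)] -- the database is back in its prior state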
BASE
BASE is a database design principle based on the CAP theorem and leveraged by database
systems that use distributed technology. BASE stands for:
• basically available
• soft state
• eventual consistency
When a database supports BASE, it favors availability over consistency. In other words,
the database is A+P from a CAP perspective. In essence, BASE leverages optimistic
concurrency by relaxing the strong consistency constraints mandated by the ACID
properties.
If a database is “basically available,” that database will always acknowledge a client’s
request, either in the form of the requested data or a success/failure notification.
In Figure 5.23, the database is basically available, even though it has been partitioned as a
result of a network failure.
Figure 5.23 User A and User B receive data despite the database being partitioned by a
network failure.
Soft state means that a database may be in an inconsistent state when data is read; thus, the
results may change if the same data is requested again. This is because the data could be
updated for consistency, even though no user has written to the database between the two
reads. This property is closely related to eventual consistency.
In Figure 5.24:
1. User A updates a record on Peer A.
2. Before the other peers are updated, User B requests the same record from Peer C.
3. The database is now in a soft state, and stale data is returned to User B.
Figure 5.24 An example of the soft state property of BASE is shown here.
Eventual consistency is the state in which reads by different clients, immediately
following a write to the database, may not return consistent results. The database only
attains consistency once the changes have been propagated to all nodes. While the
database is in the process of attaining the state of eventual consistency, it will be in a soft
state.
In Figure 5.25:
1. User A updates a record.
2. The record only gets updated at Peer A, but before the other peers can be updated,
User B requests the same record.
3. The database is now in a soft state. Stale data is returned to User B from Peer C.
4. However, the consistency is eventually attained, and User C gets the correct value.
Figure 5.25 An example of the eventual consistency property of BASE.
BASE emphasizes availability over immediate consistency, in contrast to ACID, which
ensures immediate consistency at the expense of availability due to record locking. This
soft approach toward consistency allows BASE-compliant databases to serve multiple
clients with low latency, albeit while possibly serving inconsistent results. However,
BASE-compliant databases are not well suited to transactional systems where a lack of
consistency is a concern.
Case Study Example
ETI’s IT environment currently utilizes both Linux and Windows operating
systems. Consequently, both ext and NTFS file systems are in use. The webservers
and some of the application servers employ ext, while the rest of the application
servers, the database servers and the end-users’ PCs are configured to use NTFS.
Network-attached storage (NAS) configured with RAID 5 is also used for fault-
tolerant document storage. Although the IT team is conversant with file systems, the
concepts of cluster, distributed file system and NoSQL are new to the group.
Nevertheless, after a discussion with the trained IT team members, the entire group
is able to understand these concepts and technologies.
ETI’s current IT landscape consists entirely of relational databases that employ
the ACID database design principle. The IT team has no understanding of the
BASE principle and is having trouble comprehending the CAP theorem. Some of
the team members are unsure about the need and importance of these concepts with
regards to Big Data dataset storage. Seeing this, the IT-trained employees try to
ease their fellow team members’ confusion by explaining that these concepts are
only applicable to the storage of enormous amounts of data in a distributed fashion
on a cluster. Clusters have become the obvious choice for storing very large volume
of data due to their ability to support linear scalability by scaling out.
Since clusters are comprised of nodes connected via a network, communication
failures that create silos or partitions of a cluster are inevitable. To address the
partition issue, the BASE principle and CAP theorem are introduced. They further
explain that any database following the BASE principle becomes more responsive
to its clients, albeit the data being read may be inconsistent when compared to a
database that follows the ACID principle. Having understood the BASE principle,
the IT team more easily comprehends why a database implemented in a cluster has
to choose between consistency and availability.
Although none of the existing relational databases use sharding, almost all
relational databases are replicated for disaster recovery and operational reporting.
To better understand the concepts of sharding and replication, the IT team goes
through an exercise of how these concepts can be applied to the insurance quote
data as a large number of quotes are created and accessed quickly. For sharding, the
team believes that using the type of the insurance quote (the insurance sector, namely
health, building, marine or aviation) as the sharding criterion will create a balanced set of
data across multiple nodes, since queries are mostly executed within the same
insurance sector and inter-sector queries are rare. With regards to replication, the
team is in favor of choosing a NoSQL database that implements the peer-to-peer
replication strategy. The reason behind their decision is that the insurance quotes
are created and retrieved quite frequently but seldom updated. Hence the chances of
getting an inconsistent record are low. Considering this, the team favors read/write
performance over consistency by choosing peer-to-peer replication.
Chapter 6. Big Data Processing Concepts
Figure 6.1 A task can be divided into three sub-tasks that are executed in parallel on
three different processors within the same machine.
Hadoop
Hadoop is an open-source framework for large-scale data storage and data processing that
is compatible with commodity hardware. The Hadoop framework has established itself as
a de facto industry platform for contemporary Big Data solutions. It can be used as an
ETL engine or as an analytics engine for processing large amounts of structured, semi-
structured and unstructured data. From an analysis perspective, Hadoop implements the
MapReduce processing framework. Figure 6.3 illustrates some of Hadoop’s features.
Figure 6.3 Hadoop is a versatile framework that provides both processing and storage
capabilities.
Processing Workloads
A processing workload in Big Data is defined as the amount and nature of data that is
processed within a certain amount of time. Workloads are usually divided into two types:
• batch
• transactional
Batch
Batch processing, also known as offline processing, involves processing data in batches
and usually imposes delays, which in turn results in high-latency responses. Batch
workloads typically involve large quantities of data with sequential read/writes and
comprise groups of read or write queries.
Queries can be complex and involve multiple joins. OLAP systems commonly process
workloads in batches. Strategic BI and analytics are batch-oriented as they are highly
read-intensive tasks involving large volumes of data. As shown in Figure 6.4, a batch
workload comprises grouped read/writes that have a large data footprint and may contain
complex joins and provide high-latency responses.
Figure 6.4 A batch workload can include grouped read/writes to INSERT, SELECT,
UPDATE and DELETE.
Transactional
Transactional processing is also known as online processing. Transactional workload
processing follows an approach whereby data is processed interactively without delay,
resulting in low-latency responses. Transaction workloads involve small amounts of data
with random reads and writes.
OLTP and operational systems, which are generally write-intensive, fall within this
category. Although these workloads contain a mix of read/write queries, they are generally
more write-intensive than read-intensive.
Transactional workloads comprise random reads/writes that involve fewer joins than
business intelligence and reporting workloads. Given their online nature and operational
significance to the enterprise, they require low-latency responses with a smaller data
footprint, as shown in Figure 6.5.
Figure 6.5 Transactional workloads have few joins and lower latency responses than
batch workloads.
Cluster
In the same manner that clusters provide necessary support to create horizontally scalable
storage solutions, clusters also provide the mechanism to enable distributed data
processing with linear scalability. Since clusters are highly scalable, they provide an ideal
environment for Big Data processing as large datasets can be divided into smaller datasets
and then processed in parallel in a distributed manner. When leveraging a cluster, Big Data
datasets can either be processed in batch mode or realtime mode (Figure 6.6). Ideally, a
cluster will be comprised of low-cost commodity nodes that collectively provide increased
processing capacity.
Figure 6.6 A cluster can be utilized to support batch processing of bulk data and
realtime processing of streaming data.
An additional benefit of clusters is that they provide inherent redundancy and fault
tolerance, as they consist of physically separate nodes. Redundancy and fault tolerance
allow resilient processing and analysis to occur if a network or node failure occurs. Due to
fluctuations in the processing demands placed upon a Big Data environment, leveraging
cloud-hosted infrastructure services, or ready-made analytical environments, as the backbone
of a cluster, is sensible due to their elasticity and pay-for-use model of utility-based
computing.
Figure 6.8 An illustration of a MapReduce job with the map stage highlighted.
Map tasks
• map
• combine (optional)
• partition
Reduce tasks
• shuffle and sort
• reduce
Map
The first stage of MapReduce is known as map, during which the dataset file is divided
into multiple smaller splits. Each split is parsed into its constituent records as a key-value
pair. The key is usually the ordinal position of the record, and the value is the actual
record.
The parsed key-value pairs for each split are then sent to a map function or mapper, with
one mapper function per split. The map function executes user-defined logic. Each split
generally contains multiple key-value pairs, and the mapper is run once for each key-value
pair in the split.
The mapper processes each key-value pair as per the user-defined logic and further
generates a key-value pair as its output. The output key can either be the same as the input
key or a substring value from the input value, or another serializable user-defined object.
Similarly, the output value can either be the same as the input value or a substring value
from the input value, or another serializable user-defined object.
When all records of the split have been processed, the output is a list of key-value pairs
where multiple key-value pairs can exist for the same key. It should be noted that for an
input key-value pair, a mapper may not produce any output key-value pair (filtering) or
can generate multiple key-value pairs (demultiplexing). The map stage can be summarized
by the equation shown in Figure 6.9.
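The map stage can be grounded with a small word-count style mapper written in plain Python. This is only a conceptual sketch, not Hadoop's actual API; the input split is invented for illustration.

def mapper(key, value):
    """Input key: ordinal position of the record (ignored here).
    Input value: one line of text.
    Output: zero or more (word, 1) key-value pairs."""
    for word in value.split():
        yield (word.lower(), 1)

# A hypothetical split parsed into (position, line) key-value pairs.
split = [(0, "Big Data Fundamentals"), (1, "Big Data storage")]
map_output = [pair for key, value in split for pair in mapper(key, value)]
print(map_output)
# [('big', 1), ('data', 1), ('fundamentals', 1), ('big', 1), ('data', 1), ('storage', 1)]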
Combine
Generally, the output of the map function is handled directly by the reduce function.
However, map tasks and reduce tasks are mostly run over different nodes. This requires
moving data between mappers and reducers. This data movement can consume a lot of
valuable bandwidth and directly contributes to processing latency.
With larger datasets, the time taken to move the data between map and reduce stages can
exceed the actual processing undertaken by the map and reduce tasks. For this reason, the
MapReduce engine provides an optional combine function (combiner) that summarizes a
mapper’s output before it gets processed by the reducer. Figure 6.10 illustrates the
consolidation of the output from the map stage by the combine stage.
Figure 6.10 The combine stage groups the output from the map stage.
A combiner is essentially a reducer function that locally groups a mapper’s output on the
same node as the mapper. A reducer function can be used as a combiner function, or a
custom user-defined function can be used.
The MapReduce engine combines all values for a given key from the mapper output,
creating multiple key-value pairs as input to the combiner where the key is not repeated
and the value exists as a list of all corresponding values for that key. The combiner stage is
only an optimization stage, and may therefore not even be called by the MapReduce
engine.
For example, a combiner function will work for finding the largest or the smallest number,
but will not work for finding the average of all numbers since it only works with a subset
of the data. The combine stage can be summarized by the equation shown in Figure 6.11.
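Continuing the hypothetical word-count sketch, a combiner locally sums the counts produced by a single mapper before anything is sent across the network:

# Output from one mapper (see the preceding sketch).
map_output = [("big", 1), ("data", 1), ("fundamentals", 1),
              ("big", 1), ("data", 1), ("storage", 1)]

def combiner(key, values):
    """Locally reduce one mapper's output for a single key."""
    yield (key, sum(values))

# Group the mapper's output by key before calling the combiner.
grouped = {}
for key, value in map_output:
    grouped.setdefault(key, []).append(value)

combined = [pair for key, values in grouped.items() for pair in combiner(key, values)]
print(combined)   # [('big', 2), ('data', 2), ('fundamentals', 1), ('storage', 1)]

As noted above, summation is safe to combine locally, whereas an average computed over such a local subset of the data would be incorrect.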
Figure 6.12 The partition stage assigns output from the map task to reducers.
Although each partition contains multiple key-value pairs, all records for a particular key
are assigned to the same partition. The MapReduce engine aims for a fair distribution
across reducers while guaranteeing that all records with the same key, across all
mappers, end up at the same reducer instance.
Depending on the nature of the job, certain reducers can sometimes receive a large number
of key-value pairs compared to others. As a result of this uneven workload, some reducers
will finish earlier than others. Overall, this is less efficient and leads to longer job
execution times than if the work was evenly split across reducers. This can be rectified by
customizing the partitioning logic in order to guarantee a fair distribution of key-value
pairs.
The partition function is the last stage of the map task. It returns the index of the reducer
to which a particular partition should be sent. The partition stage can be summarized by
the equation in Figure 6.13.
Figure 6.13 A summary of the partition stage.
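A partition function of this kind can be sketched as a hash of the key modulo the number of reducers. Real MapReduce engines behave similarly by default, but the code below is only an illustration.

def partition(key, reducer_count):
    """Return the index of the reducer that should receive this key.
    All occurrences of the same key map to the same reducer."""
    return hash(key) % reducer_count

REDUCERS = 2
combined = [("big", 2), ("data", 2), ("fundamentals", 1), ("storage", 1)]

partitions = {i: [] for i in range(REDUCERS)}
for key, value in combined:
    partitions[partition(key, REDUCERS)].append((key, value))
print(partitions)   # each key-value pair is assigned to one of the two reducers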
Reduce
Reduce is the final stage of the reduce task. Depending on the user-defined logic specified
in the reduce function (reducer), the reducer will either further summarize its input or will
emit the output without making any changes. In either case, for each key-value pair that a
reducer receives, the list of values stored in the value part of the pair is processed and
another key-value pair is written out.
The output key can either be the same as the input key or a substring value from the input
value, or another serializable user-defined object. The output value can either be the same
as the input value or a substring value from the input value, or another serializable user-
defined object.
Note that just like the mapper, for the input key-value pair, a reducer may not produce any
output key-value pair (filtering) or can generate multiple key-value pairs (demultiplexing).
The output of the reducer, that is the key-value pairs, is then written out as a separate file
—one file per reducer. This is depicted in Figure 6.16, which highlights the reduce stage
of the reduce task. To view the full output from the MapReduce job, all the file parts must
be combined.
Figure 6.16 The reduce stage is the last stage of the reduce task.
The number of reducers can be customized. It is also possible to have a MapReduce job
without a reducer, for example when performing filtering.
Note that the output signature (key-value types) of the map function should match that of
the input signature (key-value types) of the reduce/combine function. The reduce stage can
be summarized by the equation in Figure 6.17.
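A matching reducer for the hypothetical word-count sketch simply sums the list of values received for each key; note that its input key-value types match the mapper's output types, as required.

def reducer(key, values):
    """Input: a key and the list of all values emitted for it.
    Output: a single (key, total) pair per key."""
    yield (key, sum(values))

# Shuffled-and-sorted input for one reducer (illustrative values).
reducer_input = [("big", [2]), ("data", [2]), ("fundamentals", [1]), ("storage", [1])]

results = [pair for key, values in reducer_input for pair in reducer(key, values)]
print(results)   # [('big', 2), ('data', 2), ('fundamentals', 1), ('storage', 1)]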
Figure 7.1 On-disk storage can be implemented with a distributed file system or a
database.
Figure 7.2 A distributed file system accessing data in streaming mode with no random
reads and writes.
A distributed file system storage device is suitable when large datasets of raw data are to
be stored or when archiving of datasets is required. In addition, it provides an inexpensive
storage option for storing large amounts of data over a long period of time that needs to
remain online. This is because more disks can simply be added to the cluster without
needing to offload the data to offline data storage, such as tapes. It should be noted that
distributed file systems do not provide the ability to search the contents of files as a
standard out-of-the-box capability.
RDBMS Databases
Relational database management systems (RDBMSs) are good for handling transactional
workloads involving small amounts of data with random read/write properties. RDBMSs
are ACID-compliant, and, to honor this compliance, they are generally restricted to a
single node. For this reason, RDBMSs do not provide out-of-the-box redundancy and fault
tolerance.
To handle large volumes of data arriving at a fast pace, relational databases generally need
to scale. RDBMSs employ vertical scaling rather than horizontal scaling, and vertical
scaling is a more costly and disruptive strategy. This makes RDBMSs less than ideal for long-term storage
of data that accumulates over time.
Note that some relational databases, for example IBM DB2 pureScale, Sybase ASE
Cluster Edition, Oracle Real Application Clusters (RAC) and Microsoft Parallel Data
Warehouse (PDW), are capable of being run on clusters (Figure 7.3). However, these
database clusters still use shared storage that can act as a single point of failure.
Figure 7.3 A clustered relational database uses a shared storage architecture, which is a
potential single point of failure that affects the availability of the database.
Relational databases need to be manually sharded, mostly using application logic. This
means that the application logic needs to know which shard to query in order to get the
required data. This further complicates data processing when data from multiple shards is
required.
The following steps are shown in Figure 7.4:
1. A user writes a record (id = 2).
2. The application logic determines which shard it should be written to.
3. It is sent to the shard determined by the application logic.
4. The user reads a record (id = 4), and the application logic determines which shard
contains the data.
5. The data is read and returned to the application.
6. The application then returns the record to the user.
NoSQL Databases
Not-only SQL (NoSQL) refers to technologies used to develop next generation non-
relational databases that are highly scalable and fault-tolerant. The symbol used to
represent NoSQL databases is shown in Figure 7.6.
Figure 7.6 The symbol used to represent a NoSQL database.
Characteristics
Below is a list of the principal features of NoSQL storage devices that differentiate them
from traditional RDBMSs. This list should only be considered a general guide, as not all
NoSQL storage devices exhibit all of these features.
• Schema-less data model – Data can exist in its raw form.
• Scale out rather than scale up – More nodes can be added to obtain additional
storage with a NoSQL database, in contrast to having to replace the existing node
with a better, higher performance/capacity one.
• Highly available – This is built on cluster-based technologies that provide fault
tolerance out of the box.
• Lower operational costs – Many NoSQL databases are built on Open Source
platforms with no licensing costs. They can often be deployed on commodity
hardware.
• Eventual consistency – Reads from different nodes may not be consistent
immediately after a write. However, all nodes will eventually be in a consistent state.
• BASE, not ACID – BASE compliance requires a database to maintain high
availability in the event of network/node failure, while not requiring the database to
be in a consistent state whenever an update occurs. The database can be in a
soft/inconsistent state until it eventually attains consistency. As a result, in
consideration of the CAP theorem, NoSQL storage devices are generally AP or CP.
• API driven data access – Data access is generally supported via API based queries,
including RESTful APIs, whereas some implementations may also provide SQL-like
query capability.
• Auto sharding and replication – To support horizontal scaling and provide high
availability, a NoSQL storage device automatically employs sharding and replication
techniques where the dataset is partitioned horizontally and then copied to multiple
nodes.
• Integrated caching – This removes the need for a third-party distributed caching
layer, such as Memcached.
• Distributed query support – NoSQL storage devices maintain consistent query
behavior across multiple shards.
• Polyglot persistence – The use of NoSQL storage does not mandate retiring
traditional RDBMSs. In fact, both can be used at the same time, thereby supporting
polyglot persistence, which is an approach of persisting data using different types of
storage technologies within the same solution architecture. This is good for
developing systems requiring structured as well as semi/unstructured data.
• Aggregate-focused – Unlike relational databases that are most effective with fully
normalized data, NoSQL storage devices store de-normalized aggregated data (an
entity containing merged, often nested, data for an object) thereby eliminating the
need for joins and extensive mapping between application objects and the data
stored in the database. One exception, however, is that graph database storage
devices (introduced shortly) are not aggregate-focused.
Rationale
The emergence of NoSQL storage devices can primarily be attributed to the volume,
velocity and variety characteristics of Big Data datasets.
Volume
The storage requirement of ever increasing data volumes commands the use of databases
that are highly scalable while keeping costs down for the business to remain competitive.
NoSQL storage devices fulfill this requirement by providing scale out capability while
using inexpensive commodity servers.
Velocity
The fast influx of data requires databases with fast data write capability. NoSQL
storage devices enable fast writes by following the schema-on-read rather than the
schema-on-write principle. Being highly available, NoSQL storage devices can ensure
that writes are not delayed by node or network failure.
Variety
A storage device needs to handle different data formats including documents, emails,
images and videos and incomplete data. NoSQL storage devices can store these different
forms of semi-structured and unstructured data formats. At the same time, NoSQL storage
devices are able to store schema-less data and incomplete data with the added ability of
making schema changes as the data model of the datasets evolves. In other words, NoSQL
databases support schema evolution.
Types
NoSQL storage devices can mainly be divided into four types based on the way they store
data, as shown in Figures 7.7–7.10:
• key-value
• document
• column-family
• graph
Figure 7.7 An example of key-value NoSQL storage.
Key-Value
Key-value storage devices store data as key-value pairs and act like hash tables. The table
is a list of values where each value is identified by a key. The value is opaque to the
database and is typically stored as a BLOB. The value stored can be any aggregate,
ranging from sensor data to videos.
Value look-up can only be performed via the keys as the database is oblivious to the
details of the stored aggregate. Partial updates are not possible. An update is either a delete
or an insert operation.
Key-value storage devices generally do not maintain any indexes; therefore, writes are
quite fast. Because of their simple storage model, key-value storage devices are highly scalable.
As keys are the only means of retrieving the data, the key is usually appended with the
type of the value being saved for easy retrieval. An example of this is 123_sensor1.
To provide some structure to the stored data, most key-value storage devices provide
collections or buckets (like tables) into which key-value pairs can be organized. A single
collection can hold multiple data formats, as shown in Figure 7.11. Some implementations
support compressing values for reducing the storage footprint. However, this introduces
latency at read time, as the data needs to be decompressed first before being returned.
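A toy key-value bucket can make these characteristics concrete; the sketch below assumes the 123_sensor1 key convention mentioned above and treats the value as an opaque blob that can only be replaced, never partially updated.

import json

bucket = {}   # a "collection" of key-value pairs

def put(key, value):
    """Store the value as an opaque blob; an update is a full overwrite."""
    bucket[key] = json.dumps(value).encode("utf-8")

def get(key):
    """Look up a value by its key; the database knows nothing about its contents."""
    blob = bucket.get(key)
    return json.loads(blob) if blob is not None else None

# The key carries the type of the value for easier retrieval, e.g. "<id>_<type>".
put("123_sensor1", {"temperature": 21.4, "unit": "C"})
print(get("123_sensor1"))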
Document
Document storage devices also store data as key-value pairs. However, unlike key-value
storage devices, the stored value is a document that can be queried by the database. These
documents can have a complex nested structure, such as an invoice, as shown in Figure
7.12. The documents can be encoded using either a text-based encoding scheme, such as
XML or JSON, or using a binary encoding scheme, such as BSON (Binary JSON).
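As a simple sketch, not specific to any document database product, the example below stores an invoice as a nested Python dictionary that mimics a JSON document and then queries a field inside the stored value, something a key-value store cannot do. The collection name, field names and values are invented.

import json

# A hypothetical "invoices" collection; each document is keyed by an id.
invoices = {}

invoice = {
    "invoice_id": "INV-1001",
    "customer": {"name": "Ann Smith", "city": "Toronto"},
    "lines": [
        {"item": "vanilla", "qty": 10, "price": 2.5},
        {"item": "chocolate", "qty": 4, "price": 3.0},
    ],
}
invoices[invoice["invoice_id"]] = invoice

# Unlike a key-value store, the database can "see" inside the document,
# so queries on nested fields are possible.
toronto_invoices = [doc for doc in invoices.values()
                    if doc["customer"]["city"] == "Toronto"]

print(json.dumps(toronto_invoices, indent=2))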
Column-Family
Column-family storage devices store data much like a traditional RDBMS but group
related columns together in a row, resulting in column-families (Figure 7.13). Each
column can be a collection of related columns itself, referred to as a super-column.
Figure 7.13 The highlighted columns depict the flexible schema feature supported by
the column-family databases, where each row can have a different set of columns.
Each super-column can contain an arbitrary number of related columns that are generally
retrieved or updated as a single unit. Each row consists of multiple column-families and
can have a different set of columns, thereby manifesting flexible schema support. Each
row is identified by a row key.
Column-family storage devices provide fast data access with random read/write capability.
They store different column-families in separate physical files, which improves query
responsiveness as only the required column-families are searched.
Some column-family storage devices provide support for selectively compressing column-
families. Leaving searchable column-families uncompressed can make queries faster
because the target column does not need to be decompressed for lookup. Most
implementations support data versioning while some support specifying an expiry time for
column data. Once the expiry time has passed, the data is automatically removed.
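The sketch below is purely illustrative and not based on any specific product. It models each row as a row key that maps to named column-families, each holding its own columns, and shows that the two rows can carry different columns, reflecting the flexible schema described above. The row keys, families and column names are invented.

# Each row key maps to column-families, each holding its own columns.
customers = {
    "row-001": {
        "identity": {"name": "Ann Smith", "age": 34},
        "contact":  {"email": "ann@example.com", "phone": "555-0100"},
    },
    "row-002": {
        "identity": {"name": "Bob Jones"},               # fewer columns
        "contact":  {"email": "bob@example.com"},
        "loyalty":  {"tier": "gold", "points": 1200},     # extra family
    },
}

def read(row_key, family):
    # Only the requested column-family is touched, mirroring the way
    # column-family stores keep families in separate physical files.
    return customers[row_key].get(family, {})

print(read("row-001", "contact"))
print(read("row-002", "loyalty"))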
A column-family storage device is appropriate when:
• realtime random read/write capability is needed and data being stored has some
defined structure
• data represents a tabular structure, each row consists of a large number of columns
and nested groups of interrelated data exist
• support for schema evolution is required as column families can be added or
removed without any system downtime
• certain fields are mostly accessed together, and searches need to be performed using
field values
• efficient use of storage is required when the data consists of sparsely populated rows
since column-family databases only allocate storage space if a column exists for a
row. If no column is present, no space is allocated.
• query patterns involve insert, select, update and delete operations
A column-family storage device is inappropriate when:
• relational data access is required; for example, joins
• ACID transactional support is required
• binary data needs to be stored
• SQL-compliant queries need to be executed
• query patterns are likely to change frequently because that could initiate a
corresponding restructuring of how column-families are arranged
Examples of column-family storage devices include Cassandra, HBase and Amazon
SimpleDB.
Graph
Graph storage devices are used to persist inter-connected entities. Unlike other NoSQL
storage devices, where the emphasis is on the structure of the entities, graph storage
devices place emphasis on storing the linkages between entities (Figure 7.14).
Figure 7.14 Graph storage devices store entities and their relationships.
Entities are stored as nodes (not to be confused with cluster nodes) and are also called
vertices, while the linkages between entities are stored as edges. In RDBMS parlance,
each node can be thought of as a single row, while an edge denotes a join.
Nodes can have more than one type of link between them through multiple edges. Each
node can have attribute data as key-value pairs, such as a customer node with ID, name
and age attributes.
Each edge can have its own attribute data as key-value pairs, which can be used to further
filter query results. Having multiple edges is similar to defining multiple foreign keys in
an RDBMS; however, not every node is required to have the same edges. Queries
generally involve finding interconnected nodes based on node attributes and/or edge
attributes, commonly referred to as node traversal. Edges can be unidirectional or
bidirectional, setting the node traversal direction. Generally, graph storage devices provide
consistency via ACID compliance.
The usefulness of a graph storage device depends on the number and types of edges
defined between the nodes. The greater the number and the more diverse the edges, the
more diverse the types of queries it can handle. As a result, it is important to
comprehensively capture the types of relations that exist between the nodes. This is not
only true for existing usage scenarios, but also for exploratory analysis of data.
Graph storage devices generally allow adding new types of nodes without making changes
to the database. This also enables defining additional links between nodes as new types of
relationships or nodes appear in the database.
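The following sketch, written in plain Python rather than against any particular graph database, illustrates the basic model: nodes carrying attribute key-value pairs, edges carrying their own attributes, and a single-hop traversal that follows only edges of a given relationship type. All node identifiers, attributes and relationship names are invented.

# Nodes carry attribute data as key-value pairs.
nodes = {
    "c1": {"type": "customer", "name": "Ann", "age": 34},
    "c2": {"type": "customer", "name": "Bob", "age": 41},
    "p1": {"type": "policy",   "product": "marine"},
}

# Edges also carry attributes; each edge is (from, to, attributes).
edges = [
    ("c1", "c2", {"relationship": "referred", "year": 2014}),
    ("c1", "p1", {"relationship": "holds"}),
    ("c2", "p1", {"relationship": "holds"}),
]

def traverse(start, relationship):
    # Follow outgoing edges of the given type (a single-hop node traversal).
    return [nodes[dst] for src, dst, attrs in edges
            if src == start and attrs["relationship"] == relationship]

print(traverse("c1", "holds"))      # policies held by customer c1
print(traverse("c1", "referred"))   # customers referred by c1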
A graph storage device is appropriate when:
• interconnected entities need to be stored
• entities need to be queried based on the types of their relationships with each other
rather than the attributes of the entities
• groups of interconnected entities need to be found
• distances between entities need to be determined in terms of node traversal distance
• data needs to be mined with a view toward finding patterns
A graph storage device is inappropriate when:
• updates are required to a large number of node attributes or edge attributes, as this
involves searching for nodes or edges, which is a costly operation compared to
performing node traversals
• entities have a large number of attributes or nested data—it is best to store
lightweight entities in a graph storage device while storing the rest of the attribute
data in a separate non-graph NoSQL storage device
• binary storage is required
• queries based on the selection of node/edge attributes dominate node traversal
queries
Examples include Neo4J, Infinite Graph and OrientDB.
NewSQL Databases
NoSQL storage devices are highly scalable, available, fault-tolerant and fast for read/write
operations. However, they do not provide the same transaction and consistency support as
exhibited by ACID compliant RDBMSs. Following the BASE model, NoSQL storage
devices provide eventual consistency rather than immediate consistency. They will
therefore be in a soft state while reaching eventual consistency. As a result, they are
not appropriate for use when implementing large scale transactional systems.
NewSQL storage devices combine the ACID properties of RDBMS with the scalability
and fault tolerance offered by NoSQL storage devices. NewSQL databases generally
support SQL compliant syntax for data definition and data manipulation operations, and
they often use a logical relational data model for data storage.
NewSQL databases can be used for developing OLTP systems with very high volumes of
transactions, for example a banking system. They can also be used for realtime analytics,
for example operational analytics, as some implementations leverage in-memory storage.
As compared to a NoSQL storage device, a NewSQL storage device provides an easier
transition from a traditional RDBMS to a highly scalable database due to its support for
SQL.
Examples of NewSQL databases include VoltDB and NuoDB.
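From the client's perspective, the key point is that data definition, manipulation and transactions use familiar SQL. The sketch below demonstrates this with Python's built-in sqlite3 module purely as a stand-in (SQLite is an ordinary embedded RDBMS, not a NewSQL product); a NewSQL database would be accessed in essentially the same way through its own driver. The table, account identifiers and amounts are invented.

import sqlite3

conn = sqlite3.connect("bank.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
cur.execute("INSERT OR IGNORE INTO accounts VALUES ('A-100', 500.0), ('B-200', 250.0)")
conn.commit()

try:
    # Both updates succeed or fail together, preserving ACID guarantees.
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'A-100'")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'B-200'")
    conn.commit()
except sqlite3.Error:
    conn.rollback()     # nothing is applied if either update fails
    raise
finally:
    conn.close()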
Write-through
Any write (insert/update/delete) to the IMDG is written synchronously in a transactional
manner to the backend on-disk storage device, such as a database. If the write to the
backend on-disk storage device fails, the IMDG’s update is rolled back. Due to this
transactional nature, data consistency is achieved immediately between the two data
stores. However, this transactional support is provided at the expense of write latency as
any write operation is considered complete only when feedback (write success/failure)
from the backend storage is received (Figure 7.22).
Figure 7.22 A client inserts a new key-value pair (K3,V3) which is inserted into both
the IMDG (1a) and the backend storage (1b) in a transactional manner. Upon successful
insertion of data into the IMDG (2a) and the backend storage (2b), the client is
informed that data has been successfully inserted (3).
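The sketch below illustrates the write-through behavior in plain Python. The BackendStore class is only a stand-in for an on-disk database, and the class, key and value names are invented; the point is simply that the in-memory update is kept only if the backend write succeeds and is rolled back otherwise.

class BackendStore:
    """Stand-in for an on-disk database; fails when asked to store None."""
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        if value is None:
            raise IOError("simulated backend failure")
        self.data[key] = value

class WriteThroughGrid:
    def __init__(self, backend):
        self.cache = {}         # the in-memory data grid
        self.backend = backend
    def put(self, key, value):
        old = self.cache.get(key)
        self.cache[key] = value
        try:
            self.backend.write(key, value)   # synchronous, transactional write
        except Exception:
            # Roll back the IMDG update if the backend write fails.
            if old is None:
                self.cache.pop(key, None)
            else:
                self.cache[key] = old
            raise

grid = WriteThroughGrid(BackendStore())
grid.put("K3", "V3")
print(grid.cache["K3"], grid.backend.data["K3"])   # both stores hold V3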
Write-behind
Any write to the IMDG is written asynchronously in a batch manner to the backend on-
disk storage device, such as a database.
A queue is generally placed between the IMDG and the backend storage for keeping track
of the required changes to the backend storage. This queue can be configured to write data
to the backend storage at different intervals.
The asynchronous nature increases write performance (the write operation is considered
complete as soon as it is written to the IMDG), read performance (data can be read from
the IMDG as soon as it is written to it) and scalability and availability in general.
However, the asynchronous nature introduces inconsistency until the backend storage is
updated at the specified interval.
In Figure 7.23:
1. Client A updates value of K3, which is updated in the IMDG (a) and is also sent to a
queue (b).
2. However, before the backend storage is updated, Client B makes a request for the
same key.
3. The old value is sent.
4. After the configured interval…
5. … the backend storage is eventually updated.
6. Client C makes a request for the same key.
7. This time, the updated value is sent.
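A simplified sketch of the write-behind behavior follows. Real implementations flush the queue in the background on a timer or when a batch size is reached; here the flush is triggered manually to keep the example short, and all class, key and value names are invented.

from collections import deque

class WriteBehindGrid:
    def __init__(self, backend):
        self.cache = {}            # the in-memory data grid
        self.backend = backend     # e.g. a dict standing in for a database
        self.queue = deque()       # pending changes awaiting the backend

    def put(self, key, value):
        # The write is considered complete as soon as the IMDG is updated.
        self.cache[key] = value
        self.queue.append((key, value))

    def flush(self):
        # Normally run in the background at a configured interval.
        while self.queue:
            key, value = self.queue.popleft()
            self.backend[key] = value

backend = {"K3": "old"}
grid = WriteBehindGrid(backend)
grid.put("K3", "new")
print(grid.cache["K3"], backend["K3"])   # new vs old: temporary inconsistency
grid.flush()
print(grid.cache["K3"], backend["K3"])   # both now hold the new value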
In-Memory Databases
IMDBs are in-memory storage devices that employ database technology and leverage the
performance of RAM to overcome runtime latency issues that plague on-disk storage
devices. The symbol for an IMDB is shown in Figure 7.25.
Figure 7.25 The symbol used to represent an IMDB.
In Figure 7.26:
1. A relational dataset is stored into an IMDB.
2. A client requests a customer record (id = 2) via SQL.
3. The relevant customer record is then returned by the IMDB, which is directly
manipulated by the client without the need for any deserialization.
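A minimal way to experiment with the idea is shown below, using Python's built-in sqlite3 module in its in-memory mode. This is only a stand-in to illustrate SQL-based access to relational data held entirely in RAM, not a representation of any particular IMDB product, and the customer records are invented.

import sqlite3

# ":memory:" keeps the entire database in RAM rather than on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Ann", "Toronto"), (2, "Bob", "Boston")])

# The client queries via SQL and receives the record directly as a tuple.
cur.execute("SELECT id, name, city FROM customers WHERE id = ?", (2,))
print(cur.fetchone())
conn.close()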
Quantitative Analysis
Qualitative Analysis
Data Mining
Statistical Analysis
Machine Learning
Semantic Analysis
Visual Analysis
Big Data analysis blends traditional statistical data analysis approaches with
computational ones. Statistical sampling from a population is ideal when the entire dataset
is available, and this condition is typical of traditional batch processing scenarios.
However, Big Data can shift batch processing to realtime processing due to the need to
make sense of streaming data. With streaming data, the dataset accumulates over time, and
the data is time-ordered. Streaming data places an emphasis on timely processing, for
analytic results have a shelf-life. Whether it is the recognition of an upsell opportunity that
presents itself due to the current context of a customer, or the detection of anomalous
conditions in an industrial setting that require intervention to protect equipment or ensure
product quality, time is of the essence, and freshness of the analytic result is essential.
In 2003, William Agresti recognized the shift toward computational
approaches and argued for the creation of a new computational discipline
named Discovery Informatics. Agresti’s view of this field was one that
embraced composition. In other words, he believed that discovery informatics
was a synthesis of the following fields: pattern recognition (data mining);
artificial intelligence (machine learning); document and text processing
(semantic processing); database management and information storage and
retrieval. Agresti’s insight into the importance and breadth of computational
approaches to data analysis was forward-thinking at the time, and his
perspective on the matter has only been reinforced by the passage of time and
the emergence of data science as a discipline.
In any fast moving field like Big Data, there are always opportunities for innovation. An
example of this is the question of how to best blend statistical and computational
approaches for a given analytical problem. Statistical techniques are commonly preferred
for exploratory data analysis, after which computational techniques that leverage the
insight gleaned from the statistical study of a dataset can be applied. The shift from batch
to realtime presents other challenges as realtime techniques need to leverage
computationally-efficient algorithms.
One challenge concerns the best way of balancing the accuracy of an analytic result
against the run-time of the algorithm. In many cases, an approximation may be sufficient
and affordable. From a storage perspective, multi-tiered storage solutions which leverage
RAM, solid-state drives and hard-disk drives will provide near-term flexibility and
realtime analytic capability with long-term, cost-effective persistent storage. In the long
run, an organization will operate its Big Data analysis engine at two speeds: processing
streaming data as it arrives and performing batch analysis of this data as it accumulates to
look for patterns and trends. (The symbol used to represent data analysis is shown in
Figure 8.1.)
Quantitative Analysis
Quantitative analysis is a data analysis technique that focuses on quantifying the patterns
and correlations found in the data. Based on statistical practices, this technique involves
analyzing a large number of observations from a dataset. Since the sample size is large,
the results can be applied in a generalized manner to the entire dataset. Figure 8.2 depicts
the fact that quantitative analysis produces numerical results.
Qualitative Analysis
Qualitative analysis is a data analysis technique that focuses on describing various data
qualities using words. It involves analyzing a smaller sample in greater depth compared to
quantitative data analysis. These analysis results cannot be generalized to an entire dataset
due to the small sample size. They also cannot be measured numerically or used for
numerical comparisons. For example, an analysis of ice cream sales may reveal that May’s
sales figures were not as high as June’s. The analysis results state only that the figures
were “not as high as,” and do not provide a numerical difference. The output of qualitative
analysis is a description of the relationship using words as shown in Figure 8.3.
Figure 8.3 Qualitative results are descriptive in nature and not generalizable to the
entire dataset.
Data Mining
Data mining, also known as data discovery, is a specialized form of data analysis that
targets large datasets. In relation to Big Data analysis, data mining generally refers to
automated, software-based techniques that sift through massive datasets to identify
patterns and trends.
Specifically, it involves extracting hidden patterns from the data with the intention of
identifying relationships that were previously unknown. Data mining forms the basis for
predictive analytics and business intelligence (BI). The symbol used to represent data
mining is shown in Figure 8.4.
Statistical Analysis
Statistical analysis uses statistical methods based on mathematical formulas as a means for
analyzing data. Statistical analysis is most often quantitative, but can also be qualitative.
This type of analysis is commonly used to describe datasets via summarization, such as
providing the mean, median or mode of the values in a dataset. It can also
be used to infer patterns and relationships within the dataset, such as regression and
correlation.
This section describes the following types of statistical analysis:
• A/B Testing
• Correlation
• Regression
A/B Testing
A/B testing, also known as split or bucket testing, compares two versions of an element to
determine which version is superior based on a pre-defined metric. The element can be a
range of things. For example, it can be content, such as a Web page, or an offer for a
product or service, such as deals on electronic items. The current version of the element is
called the control version, whereas the modified version is called the treatment. Both
versions are subjected to an experiment simultaneously. The observations are recorded to
determine which version is more successful.
Although A/B testing can be implemented in almost any domain, it is most often used in
marketing. Generally, the objective is to gauge human behavior with the goal of increasing
sales. For example, in order to determine the best possible layout for an ice cream ad on
Company A’s Web site, two different versions of the ad are used. Version A is an existing
ad (the control) while Version B has had its layout slightly altered (the treatment). Both
versions are then simultaneously shown to different users:
• Version A to Group A
• Version B to Group B
The analysis of the results reveals that Version B of the ad resulted in more sales as
compared to Version A.
In other areas such as the scientific domains, the objective may simply be to observe
which version works better in order to improve a process or product. Figure 8.5 provides
an example of A/B testing on two different email versions sent simultaneously.
Figure 8.5 Two different email versions are sent out simultaneously as part of a
marketing campaign to see which version brings in more prospective customers.
Sample questions can include:
• Is the new version of a drug better than the old one?
• Do customers respond better to advertisements delivered by email or postal mail?
• Is the newly designed homepage of the Web site generating more user traffic?
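To make the comparison concrete, the sketch below computes the conversion rate of a control and a treatment version together with a simple two-proportion z-statistic. The visitor and sale counts are invented, and in practice the significance test would usually come from a statistics library rather than being hand-rolled.

from math import sqrt

# Hypothetical results of showing two ad versions to two user groups.
visitors_a, sales_a = 5000, 150      # Version A (control)
visitors_b, sales_b = 5000, 195      # Version B (treatment)

rate_a = sales_a / visitors_a
rate_b = sales_b / visitors_b

# Pooled two-proportion z-statistic for the difference in conversion rates.
pooled = (sales_a + sales_b) / (visitors_a + visitors_b)
se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
z = (rate_b - rate_a) / se

print(f"control rate   = {rate_a:.3f}")
print(f"treatment rate = {rate_b:.3f}")
print(f"z-statistic    = {z:.2f}")   # |z| > 1.96 suggests a real difference at ~95%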
Correlation
Correlation is an analysis technique used to determine whether two variables are related to
each other. If they are found to be related, the next step is to determine what their
relationship is. For example, the value of Variable A increases whenever the value of
Variable B increases. We may be further interested in discovering how closely Variables A
and B are related, which means we may also want to analyze the extent to which Variable
B increases in relation to Variable A’s increase.
The use of correlation helps to develop an understanding of a dataset and find
relationships that can assist in explaining a phenomenon. Correlation is therefore
commonly used for data mining where the identification of relationships between
variables in a dataset leads to the discovery of patterns and anomalies. This can reveal the
nature of the dataset or the cause of a phenomenon.
When two variables are considered to be correlated, they are aligned based on a linear
relationship. This means that when one variable changes, the other variable also changes
in constant proportion.
Correlation is expressed as a decimal number between –1 and +1, which is known as the
correlation coefficient. The relationship weakens as the coefficient moves from –1 or +1
toward 0.
Figure 8.6 shows a correlation of +1, which suggests that there is a strong positive
relationship between the two variables.
Figure 8.6 When one variable increases, the other also increases and vice versa.
Figure 8.7 shows a correlation of 0, which suggests that there is no relationship at all
between the two variables.
Figure 8.7 When one variable increases, the other may stay the same, or increase or
decrease arbitrarily.
In Figure 8.8, a correlation of –1 suggests that there is a strong negative relationship
between the two variables.
Figure 8.8 When one variable increases, the other decreases and vice versa.
For example, managers believe that ice cream stores need to stock more ice cream for hot
days, but don’t know how much extra to stock. To determine if a relationship actually
exists between temperature and ice cream sales, the analysts first apply correlation to the
number of ice creams sold and the recorded temperature readings. A value of +0.75
suggests that there exists a strong relationship between the two. This relationship indicates
that as temperature increases, more ice creams are sold.
Further sample questions addressed by correlation can include:
• Does distance from the sea affect the temperature of a city?
• Do students who perform well at elementary school perform equally well at high
school?
• To what extent is obesity linked with overeating?
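The correlation coefficient can be computed directly from its definition, as in the short Python sketch below. The temperature readings and ice cream sales figures are invented purely for illustration.

from math import sqrt

# Hypothetical daily temperature readings and ice creams sold.
temperature = [18, 21, 24, 27, 30, 33, 36]
sales       = [110, 135, 155, 190, 220, 260, 300]

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r = pearson(temperature, sales)
print(f"correlation coefficient = {r:.2f}")   # close to +1: strong positive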
Regression
The analysis technique of regression explores how a dependent variable is related to an
independent variable within a dataset. As a sample scenario, regression could help
determine the type of relationship that exists between temperature, the independent
variable, and crop yield, the dependent variable.
Applying this technique helps determine how the value of the dependent variable changes
in relation to changes in the value of the independent variable. When the independent
variable increases, for example, does the dependent variable also increase? If yes, is the
increase in a linear or non-linear proportion?
For example, in order to determine how much extra stock each ice cream store needs to
have, the analysts apply regression, using forecast temperature readings as the
independent variable and the number of ice creams sold as the dependent variable. What
the analysts discover is that 15% of
additional stock is required for every 5-degree increase in temperature.
More than one independent variable can be tested at the same time. However, in such
cases, the effect of each independent variable is examined while the others are held constant.
Regression can help enable a better understanding of what a phenomenon is and why it
occurred. It can also be used to make predictions about the values of the dependent
variable.
Linear regression represents a constant rate of change, as shown in Figure 8.9.
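A simple linear regression can be fitted with the ordinary least-squares formulas, as sketched below; the slope then expresses how many additional ice creams are expected per degree of temperature increase. The temperature and sales values are invented for illustration.

# Hypothetical forecast temperatures (independent variable)
# and ice creams sold (dependent variable).
temperature = [18, 21, 24, 27, 30, 33, 36]
sales       = [110, 135, 155, 190, 220, 260, 300]

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(sales) / n

# Ordinary least-squares estimates of slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, sales))
         / sum((x - mean_x) ** 2 for x in temperature))
intercept = mean_y - slope * mean_x

print(f"estimated model: sales = {intercept:.1f} + {slope:.1f} * temperature")
print(f"predicted sales at 32 degrees: {intercept + slope * 32:.0f}")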
Machine Learning
Humans are good at spotting patterns and relationships within data. Unfortunately, we
cannot process large amounts of data very quickly. Machines, on the other hand, are very
adept at processing large amounts of data quickly, but only if they know how.
If human knowledge can be combined with the processing speed of machines, machines
will be able to process large amounts of data without requiring much human intervention.
This is the basic concept of machine learning.
In this section, machine learning and its relationship to data mining are explored through
coverage of the following types of machine learning techniques:
• Classification
• Clustering
• Outlier Detection
• Filtering
Outlier Detection
Outlier detection is the process of finding data that is significantly different from or
inconsistent with the rest of the data within a given dataset. This machine learning
technique is used to identify anomalies, abnormalities and deviations that can be
advantageous, such as opportunities, or unfavorable, such as risks.
Outlier detection is closely related to the concept of classification and clustering, although
its algorithms focus on finding abnormal values. It can be based on either supervised or
unsupervised learning. Applications for outlier detection include fraud detection, medical
diagnosis, network data analysis and sensor data analysis. A scatter graph visually
highlights data points that are outliers, as shown in Figure 8.13.
Figure 8.13 A scatter graph highlights an outlier.
For example, in order to find out whether or not a transaction is likely to be fraudulent, the
bank’s IT team builds a system employing an outlier detection technique that is based on
supervised learning. A set of known fraudulent transactions is first fed into the outlier
detection algorithm. After training the system, unknown transactions are then fed into the
outlier detection algorithm to predict if they are fraudulent or not.
Sample questions can include:
• Is an athlete using performance enhancing drugs?
• Are there any wrongly identified fruits and vegetables in the training dataset used
for a classification task?
• Is there a particular strain of virus that does not respond to medication?
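As a minimal unsupervised illustration of the idea (the banking example above uses supervised learning instead), the sketch below flags transaction amounts that lie more than two standard deviations from the mean. The amounts are invented.

from statistics import mean, stdev

# Hypothetical transaction amounts; the last one is suspicious.
amounts = [120, 95, 130, 110, 105, 98, 140, 125, 5000]

mu = mean(amounts)
sigma = stdev(amounts)

# Flag values that lie more than two standard deviations from the mean.
outliers = [a for a in amounts if abs(a - mu) > 2 * sigma]
print(outliers)   # prints [5000]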
Filtering
Filtering is the automated process of finding relevant items from a pool of items. Items
can be filtered either based on a user’s own behavior or by matching the behavior of
multiple users. Filtering is generally applied via the following two approaches:
• collaborative filtering
• content-based filtering
A common medium by which filtering is implemented is via the use of a recommender
system. Collaborative filtering is an item filtering technique based on the collaboration, or
merging, of a user’s past behavior with the behaviors of others. A target user’s past
behavior, including their likes, ratings, purchase history and more, is combined with the
behavior of similar users. Based on the similarity of the users’ behavior, items are filtered
for the target user.
Collaborative filtering is solely based on the similarity between users’ behavior. It requires
a large amount of user behavior data in order to accurately filter items. It is an example of
the application of the law of large numbers.
Content-based filtering is an item filtering technique focused on the similarity between
users and items. A user profile is created based on that user’s past behavior, for example,
their likes, ratings and purchase history. The similarities identified between the user
profile and the attributes of various items lead to items being filtered for the user. Contrary
to collaborative filtering, content-based filtering is solely dedicated to individual user
preferences and does not require data about other users.
A recommender system predicts user preferences and generates suggestions for the user
accordingly. Suggestions commonly pertain to recommending items, such as movies,
books, Web pages and people. A recommender system typically uses either collaborative
filtering or content-based filtering to generate suggestions. It may also be based on a
hybrid of both collaborative filtering and content-based filtering to fine-tune the accuracy
and effectiveness of generated suggestions.
For example, in order to realize cross-selling opportunities, the bank builds a
recommender system that uses content-based filtering. Based on matches found between
financial products purchased by customers and the properties of similar financial products,
the recommender system automates suggestions for potential financial products that
customers may also be interested in.
Sample questions can include:
• How can only the news articles that a user is interested in be displayed?
• Which holiday destinations can be recommended based on the travel history of a
vacationer?
• Which other new users can be suggested as friends based on the current profile of a
person?
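The sketch below illustrates content-based filtering in its simplest form: each item and the user profile are described by a set of attribute tags, unseen items are scored by how much they overlap with the profile (Jaccard similarity), and the highest-scoring items are recommended. All product names and tags are invented, and real recommender systems use far richer features.

# Hypothetical financial products described by attribute tags.
items = {
    "travel_insurance":  {"insurance", "travel", "short-term"},
    "home_insurance":    {"insurance", "property", "long-term"},
    "savings_account":   {"banking", "savings", "long-term"},
    "boat_insurance":    {"insurance", "marine", "long-term"},
}

# Profile built from the user's past purchases and ratings.
user_profile = {"insurance", "property", "marine", "long-term"}
already_owned = {"home_insurance"}

def jaccard(a, b):
    return len(a & b) / len(a | b)

scores = {name: jaccard(user_profile, tags)
          for name, tags in items.items() if name not in already_owned}

# Recommend the unseen items that best match the user's profile.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")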
Semantic Analysis
A fragment of text or speech data can carry different meanings in different contexts,
whereas a complete sentence may retain its meaning, even if structured in different ways.
In order for machines to extract valuable information, text and speech data needs to be
understood by machines in a way similar to how humans understand it. Semantic analysis
represents practices for extracting meaningful information from textual and speech data.
This section describes the following types of semantic analysis:
• Natural Language Processing
• Text Analytics
• Sentiment Analysis
Text Analytics
Unstructured text is generally much more difficult to analyze and search in comparison to
structured text. Text analytics is the specialized analysis of text through the application of
data mining, machine learning and natural language processing techniques to extract value
out of unstructured text. Text analytics essentially provides the ability to discover text
rather than just search it.
Text analytics helps businesses gain useful insights from text-based data by developing an
understanding of the information that is contained within a large body of text. As a
continuation of the preceding NLP example, the transcribed textual data is further
analyzed using text analytics to extract meaningful information about the common reasons
behind customer discontent.
The basic tenet of text analytics is to turn unstructured text into data that can be searched
and analyzed. As the amount of digitized documents, emails, social media posts and log
files increases, businesses have an increasing need to leverage any value that can be
extracted from these forms of semi-structured and unstructured data. Solely analyzing
operational (structured) data may cause businesses to miss out on cost-saving or business
expansion opportunities, especially those that are customer-focused.
Applications include document classification and search, as well as building a 360-degree
view of a customer by extracting information from a CRM system.
Text analytics generally involves two steps:
1. Parsing text within documents to extract:
• Named Entities – person, group, place, company
• Pattern-Based Entities – social security number, zip code
• Concepts – an abstract representation of an entity
• Facts – relationship between entities
2. Categorization of documents using these extracted entities and facts.
The extracted information can be used to perform a context-specific search on entities,
based on the type of relationship that exists between the entities. Figure 8.14 shows a
simplified representation of text analysis.
Figure 8.14 Entities are extracted from text files using semantic rules and structured so
that they can be searched.
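Full text analytics relies on NLP libraries, but the pattern-based entity extraction mentioned in step 1 can be illustrated with nothing more than regular expressions, as in the sketch below. The sample claim text and the two patterns (a US-style ZIP code and a simple date) are invented and far simpler than production extraction rules.

import re

text = ("Claim filed on 2015-06-12 by a customer in Charleston, "
        "SC 29401, regarding water damage at the insured property.")

# Simple illustrative patterns for pattern-based entities.
patterns = {
    "zip_code": r"\b\d{5}(?:-\d{4})?\b",
    "date":     r"\b\d{4}-\d{2}-\d{2}\b",
}

entities = {name: re.findall(regex, text) for name, regex in patterns.items()}
print(entities)   # {'zip_code': ['29401'], 'date': ['2015-06-12']}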
Sample questions can include:
• How can I categorize Web sites based on the content of their Web pages?
• How can I find the books that contain content that is relevant to the topic that I am
studying?
• How can I identify contracts that contain confidential company information?
Sentiment Analysis
Sentiment analysis is a specialized form of text analysis that focuses on determining the
bias or emotions of individuals. This form of analysis determines the attitude of the author
of the text by analyzing the text within the context of the natural language. Sentiment
analysis not only provides information about how individuals feel, but also the intensity of
their feeling. This information can then be integrated into the decision-making process.
Common applications for sentiment analysis include identifying customer satisfaction or
dissatisfaction early, gauging product success or failure, and spotting new trends.
For example, an ice cream company would like to learn about which of its ice cream
flavors are most liked by children. Sales data alone does not provide this information
because the children that consume the ice cream are not necessarily the purchasers of the
ice cream. Sentiment analysis is applied to archived customer feedback left on the ice
cream company’s Web site to extract information specifically regarding children’s
preferences for certain ice cream flavors over other flavors.
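Production sentiment analysis uses trained NLP models, but the core idea can be illustrated with a tiny hand-made lexicon that scores each piece of feedback by the positive and negative words it contains, as in the sketch below. The lexicon and the feedback comments are invented, and the approach ignores negation and context.

# A tiny, invented sentiment lexicon.
lexicon = {"love": 2, "great": 1, "like": 1, "bland": -1, "hate": -2, "awful": -2}

feedback = [
    "My kids love the strawberry flavor, it is great",
    "The vanilla tasted bland and watery",
    "We hate the new mint flavor, awful aftertaste",
]

def score(text):
    # Sum the sentiment weights of the known words in the comment.
    return sum(lexicon.get(word.strip(",.").lower(), 0) for word in text.split())

for comment in feedback:
    s = score(comment)
    label = "positive" if s > 0 else "negative" if s < 0 else "neutral"
    print(f"{label:8} ({s:+d})  {comment}")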
Sample questions can include:
• How can customer reactions to the new packaging of the product be gauged?
• Which contestant is a likely winner of a singing contest?
• Can customer churn be measured by social media comments?
Visual Analysis
Visual analysis is a form of data analysis that involves the graphic representation of data to
enable or enhance its visual perception. Based on the premise that humans can understand
and draw conclusions from graphics more quickly than from text, visual analysis acts as a
discovery tool in the field of Big Data.
The objective is to use graphic representations to develop a deeper understanding of the
data being analyzed. Specifically, it helps identify and highlight hidden patterns,
correlations and anomalies. Visual analysis is also directly related to exploratory data
analysis as it encourages the formulation of questions from different angles.
This section describes the following types of visual analysis:
• Heat Maps
• Time Series Plots
• Network Graphs
• Spatial Data Mapping
Heat Maps
Heat maps are an effective visual analysis technique for expressing patterns, data
compositions via part-whole relations and geographic distributions of data. They also
facilitate the identification of areas of interest and the discovery of extreme (high/low)
values within a dataset.
For example, in order to identify the best- and worst-selling regions for ice cream, the ice
cream sales data is plotted using a heat map. Green is used to highlight the best
performing regions, while red is used to highlight the worst performing regions.
The heat map itself is a visual, color-coded representation of data values. Each value is
given a color according to its type or the range that it falls under. For example, a heat map
may assign the values of 0–3 to the color red, 4–6 to amber and 7–10 to green.
A heat map can be in the form of a chart or a map. A chart represents a matrix of values in
which each cell is color-coded according to the value, as shown in Figure 8.15. It can also
represent hierarchical values by using color-coded nested rectangles.
Figure 8.15 This chart heat map depicts the sales of three divisions within a company
over a period of six months.
In Figure 8.16, a map represents a geographic measure by which different regions are
color-coded or shaded according to a certain theme. Instead of coloring or shading the
whole region, the map may be superimposed by a layer made up of collections of
colored/shaded points relating to various regions, or colored/shaded shapes representing
various regions.
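A chart-style heat map like the one in Figure 8.15 can be produced with a few lines using the widely available numpy and matplotlib libraries, as sketched below. The division names and monthly sales figures are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

divisions = ["Division A", "Division B", "Division C"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

# Invented sales figures: one row per division, one column per month.
sales = np.array([
    [3, 5, 6, 8, 9, 9],
    [2, 2, 4, 5, 7, 8],
    [6, 4, 3, 2, 2, 1],
])

fig, ax = plt.subplots()
im = ax.imshow(sales, cmap="RdYlGn")   # red = low values, green = high values
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(divisions)))
ax.set_yticklabels(divisions)
fig.colorbar(im, ax=ax, label="Sales")
plt.show()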
Time Series Plots
Figure 8.17 A line chart depicts a sales time series from 1990 to 1996.
The time series presented in Figure 8.17 spans seven years. The evenly spaced peaks
toward the end of each year show seasonal periodic patterns, for example Christmas sales.
The dotted red circles represent short-term irregular variations. The blue line shows an
upward trend, indicating an increase in sales.
Sample questions can include:
• How much yield should the farmer expect based on historical yield data?
• What is the expected increase in population in the next 5 years?
• Is the current decrease in sales a one-off occurrence or does it occur regularly?
Network Graphs
Within the context of visual analysis, a network graph depicts an interconnected collection
of entities. An entity can be a person, a group, or some other business domain object such
as a product. Entities may be connected with one another directly or indirectly. Some
connections may only be one-way, so that traversal in the reverse direction is not possible.
Network analysis is a technique that focuses on analyzing relationships between entities
within the network. It involves plotting entities as nodes and connections as edges
between nodes. There are specialized variations of network analysis, including:
• route optimization
• social network analysis
• spread prediction, such as the spread of a contagious disease
The following is a simple example based on ice cream sales for the application of network
analysis for route optimization.
Some ice cream store managers are complaining about the time it takes for delivery trucks
to drive between the central warehouse and stores in remote areas. On hotter days, ice
cream delivered from the central warehouse to the remote stores melts and cannot be sold.
Network analysis is used to find the shortest routes between the central warehouse and the
remote stores in order to minimize the durations of deliveries.
Consider the social network graph in Figure 8.18 for a simple example of social network
analysis:
• John has many friends, whereas Alice only has one friend.
• The results of a social network analysis reveal that Alice will most likely befriend
John and Katie, since they have a common friend named Oliver.
Figure 8.18 An example of a social network graph.
Sample questions may include:
• How can I identify influencers within a large group of users?
• Are two individuals related to each other via a long chain of ancestry?
• How can I identify interaction patterns among a very large number of protein-to-
protein interactions?
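Returning to the route-optimization example above, the shortest-route idea can be illustrated with a plain breadth-first search over an adjacency list when every road segment counts equally; real route optimization would weight edges by distance or travel time. The warehouse, junction and store names in the sketch below are invented.

from collections import deque

# Invented road network: each node lists the locations reachable from it.
roads = {
    "warehouse": ["store_A", "junction_1"],
    "junction_1": ["junction_2", "store_B"],
    "junction_2": ["store_C"],
    "store_A": ["junction_1"],
    "store_B": ["store_C"],
    "store_C": [],
}

def shortest_route(start, goal):
    # Breadth-first search finds the route with the fewest segments.
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in roads.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_route("warehouse", "store_C"))
# ['warehouse', 'junction_1', 'junction_2', 'store_C']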
Spatial Data Mapping
Figure 8.19 Spatial data analysis can be used for targeted marketing.
Sample questions can include:
• How many houses will be affected due to a road widening project?
• How far do customers have to commute in order to get to a supermarket?
• Where are the high and low concentrations of a particular mineral based on
readings taken from a number of sample locations within an area?
ETI has successfully developed the “fraudulent claim detection” solution, which has
provided the IT team experience and confidence in the realm of Big Data storage and
analysis. More importantly, they see that they have achieved only a part of one of the key
objectives established by the senior management. Still left are projects that are intended
to: improve risk assessment for applications for new policies, perform catastrophe
management to decrease the number of claims related to a calamity, decrease customer
churn by providing more efficient claims settlement and personalized policies and, finally,
achieve full regulatory compliance.
Knowing that “success breeds success,” the corporate innovation manager, working from
a prioritized backlog of projects, informs the IT team that they will next tackle current
efficiency problems that have resulted in slow claims processing. While the IT team was
busy learning enough Big Data to implement a solution for fraud detection, the innovation
manager had deployed a team of business analysts to document and analyze the claims
processing business process. These process models will be used to drive an automation
activity that will be implemented with a BPMS. The innovation manager selected this as
the next target because they want to generate maximal value from the fraud detection
model, which will be achieved when the model is called from within the process
automation framework. This will allow the further collection of training data that can
drive incremental refinement of the supervised machine learning algorithm that drives the
classification of claims as either legitimate or fraudulent.
Another advantage of implementing process automation is the standardization of work
itself. If claims examiners are all forced to follow the same claims processing procedures,
variation in customer service should decline, and this should help ETI’s customers achieve
a greater level of confidence that their claims are being processed correctly. Although this
is an indirect benefit, it is one that recognizes the fact that it is through the execution of
ETI’s business processes that customers will perceive the value of their relationship with
ETI. Although the BPMS itself is not a Big Data initiative, it will generate an enormous
amount of data related to things like end-to-end process time, dwell time of individual
activities and the throughput of individual employees that process claims. This data can be
collected and mined for interesting relationships, especially when combined with customer
data. It would be valuable to know whether or not customer defection rates are correlated
with claims processing times for defecting customers. If they are, a regression model
could be developed to predict which customers are at risk for defection, and they can be
proactively contacted by customer care personnel.
ETI is seeing improvement in its daily operations through the creation of a virtuous cycle
of management action followed by the measurement and analysis of organizational
response. The executive team is finding it useful to view the organization not as a machine
but as an organism. This perspective has allowed a paradigm shift that encourages not
only deeper analytics of internal data but also a realization of the need to incorporate
external data. ETI once had to admit, somewhat embarrassingly, that they were primarily
running their business on descriptive analytics from OLTP systems. Now, broader perspectives on
analytics and business intelligence are enabling more efficient use of their EDW and
OLAP capabilities. In fact, ETI’s ability to examine its customer base across the Marine,
Aviation and Property lines of business has allowed the organization to identify that there
are many customers that have separate policies for boats, planes and high-end luxury
properties. This insight alone has opened up new marketing strategies and customer
upselling opportunities.
Furthermore, the future of ETI is looking brighter as the company embraces data-driven
decision-making. Now that its business has experienced benefit from diagnostic and
predictive analytics, the organization is considering ways to use prescriptive analytics to
achieve risk-avoidance goals. ETI’s ability to incrementally adopt Big Data and use it as a
means of improving the alignment between business and IT has brought considerable
benefits. ETI’s executive team has agreed that Big Data is a big deal, and they expect that
their shareholders will feel the same way as ETI returns to profitability.
About the Authors
Thomas Erl
Thomas Erl is a top-selling IT author, founder of Arcitura Education and series editor of
the Prentice Hall Service Technology Series from Thomas Erl. With more than 200,000
copies in print worldwide, his books have become international bestsellers and have been
formally endorsed by senior members of major IT organizations, such as IBM, Microsoft,
Oracle, Intel, Accenture, IEEE, HL7, MITRE, SAP, CISCO, HP and many others. As CEO
of Arcitura Education Inc., Thomas has led the development of curricula for the
internationally recognized Big Data Science Certified Professional (BDSCP), Cloud
Certified Professional (CCP) and SOA Certified Professional (SOACP) accreditation
programs, which have established a series of formal, vendor-neutral industry certifications
obtained by thousands of IT professionals around the world. Thomas has toured more than
20 countries as a speaker and instructor. More than 100 articles and interviews by Thomas
have been published in numerous publications, including The Wall Street Journal and CIO
Magazine.
Wajid Khattak
Wajid Khattak is a Big Data researcher and trainer at Arcitura Education Inc. His areas of
interest include Big Data engineering and architecture, data science, machine learning,
analytics and SOA. He has extensive .NET software development experience in the
domains of business intelligence reporting solutions and GIS.
Wajid completed his MSc in Software Engineering and Security with distinction from
Birmingham City University in 2008. Prior to that, in 2003, he earned his BSc (Hons)
degree in Software Engineering from Birmingham City University with first-class
recognition. He holds MCAD & MCTS (Microsoft), SOA Architect, Big Data Scientist,
Big Data Engineer and Big Data Consultant (Arcitura) certifications.
Paul Buhler
Dr. Paul Buhler is a seasoned professional who has worked in commercial, government
and academic environments. He is a respected researcher, practitioner and educator of
service-oriented computing concepts, technologies and implementation methodologies.
His work in XaaS naturally extends to cloud, Big Data and IoE areas. Dr. Buhler’s more
recent work has been focused on closing the gap between business strategy and process
execution by leveraging responsive design principles and goal-based execution.
As Chief Scientist at Modus21, Dr. Buhler is responsible for aligning corporate strategy
with emerging trends in business architecture and process execution frameworks. He also
holds an Affiliate Professorship at the College of Charleston, where he teaches both
graduate and undergraduate computer science courses. Dr. Buhler earned his Ph.D. in
Computer Engineering at the University of South Carolina. He also holds an MS degree in
Computer Science from Johns Hopkins University and a BS in Computer Science from
The Citadel.
Index
A
A/B testing, 185-186
ACID database design, 108-112
acquisition of data (Big Data analytics lifecycle), 58-60
case study, 74
active querying, 168
ad-hoc reporting, 82
affordable technology, as business motivation for Big Data, 38-39
aggregation of data (Big Data analytics lifecycle), 64-66
case study, 75
in data visualization tools, 86
algorithm design in MapReduce, 135-137
analysis. See data analysis
analytics. See data analytics
architecture. See business architecture
atomicity in ACID database design, 109
availability in CAP theorem, 106
B
BADE (Business Application Development Environment), 36
BASE database design, 113-116
basically available in BASE database design, 114
batch processing, 123-125
case study, 143-144
data analysis and, 182-183
with MapReduce, 125-126
algorithm design, 135-137
combine stage, 127-128
divide-and-conquer principle, 134-135
example, 133
map stage, 127
partition stage, 129-130
reduce stage, 131-132
shuffle and sort stage, 130-131
terminology, 126
BI (Business Intelligence)
Big Data BI, 84-85
case study, 87-88
data visualization tools, 84-86
case study, 25
defined, 12
marketplace dynamics and, 31
traditional BI, 82
ad-hoc reporting, 82
dashboards, 82-83
Big Data
analytics lifecycle, 55
Business Case Evaluation stage, 56-57
case study, 73-76
Data Acquisition and Filtering stage, 58-60
Data Aggregation and Representation stage, 64-66
Data Analysis stage, 66-67
Data Extraction stage, 60-62
Data Identification stage, 57-58
Data Validation and Cleansing stage, 62-64
Data Visualization stage, 68
Utilization of Analysis Results stage, 69-70
characteristics, 13
case study, 26-27
value, 16-17
variety, 15, 154
velocity, 14-15, 137, 154
veracity, 16
volume, 14, 154
defined, 4-5
processing. See data processing
terminology, 5-13
case study, 24-25
types of data (data formats), 17-18
case study, 27
metadata, 20
semi-structured data, 19-20
structured data, 18
unstructured data, 19
Big Data BI (Business Intelligence), 84-85
case study, 87-88
data visualization tools, 84-86
BPM (business process management)
as business motivation for Big Data, 36-37
case study, 43-44
BPMS (Business Process Management Systems), 36
bucket testing, 185-186
Business Application Development Environment (BADE), 36
business architecture
as business motivation for Big Data, 33-35, 78
case study, 44
Business Case Evaluation stage (Big Data analytics lifecycle), 56-57
case study, 73-74
Business Intelligence (BI). See BI (Business Intelligence)
business motivation and drivers
business architecture, 33-35, 78
business process management (BPM), 36-37
case study, 43-45
information and communications technology (ICT), 37
affordable technology, 38-39
cloud computing, 40-42
data analytics and data science, 37
digitization, 38
hyper-connection, 40
social media, 39
Internet of Everything (IoE), 42-43
marketplace dynamics, 30-32
business process management (BPM)
as business motivation for Big Data, 36-37
case study, 43-44
Business Process Management Systems (BPMS), 36
C
CAP (Consistency, Availability, and Partition tolerance) theorem, 106-108
case studies, ETI (Ensure to Insure)
background, 20-24
Big Data analytics lifecycle, 73-76
Big Data BI (Business Intelligence), 87-88
Big Data characteristics, 26-27
business motivation and drivers, 43-45
conclusion, 208-209
data analysis, 204-205
data formats, 27
data processing, 143-144
enterprise technologies, 86-87
planning considerations, 71-73
storage devices, 179
storage technologies, 117-118
terminology, 24-25
types of analytics, 25
CEP (complex event processing), 141
classification, 190-191
case study, 205
cleansing data (Big Data analytics lifecycle), 62-64
case study, 75
cloud computing
as business motivation for Big Data, 40-42
planning considerations for, 54
clustering, 191-192
case study, 205
clusters, 93
in data processing, 124
in RDBMSs, 149
collaborative filtering, 194
column-family NoSQL storage, 155, 159-160
combine stage (MapReduce), 127-128
commodity hardware, as business motivation for Big Data, 38-39
complex event processing (CEP), 141
computational analysis, statistical analysis versus, 182-183
confirmatory analysis, 66-67
confounding factor, 189
consistency
in ACID database design, 110
in BASE database design, 115-116
in CAP theorem, 106
in SCV principle, 138
Consistency, Availability, and Partition tolerance (CAP) theorem, 106-108
content-based filtering, 194
continuous querying, 168
correlation, 186-188
case study, 204
regression versus, 189-190
Critical Success Factors (CSFs) in business architecture, 33
D
dashboards, 82-83
data
defined, 31
in DIKW pyramid, 32
Data Acquisition and Filtering stage (Big Data analytics lifecycle), 58-60
case study, 74
Data Aggregation and Representation stage (Big Data analytics lifecycle), 64-66
case study, 75
data analysis
case study, 204-205
data mining, 184
defined, 6
machine learning, 190
classification, 190-191
clustering, 191-192
filtering, 193-194
outlier detection, 192-193
qualitative analysis, 184
quantitative analysis, 183
realtime support, 52
semantic analysis techniques
natural language processing, 195
sentiment analysis, 197
text analytics, 196-197
statistical analysis techniques, 184
A/B testing, 185-186
computational analysis versus, 182-183
correlation, 186-188
regression, 188-190
visual analysis techniques, 198
heat maps, 198-200
network graphs, 201-202
spatial data mapping, 202-204
time series plots, 200-201
Data Analysis stage (Big Data analytics lifecycle), 66-67
case study, 75
data analytics
as business motivation for Big Data, 37
case study, 25
defined, 6-8
descriptive analytics, defined, 8
diagnostic analytics, defined, 9-10
enterprise technologies
case study, 86-87
data marts, 81
data warehouses, 80
ETL (Extract Transform Load), 79
OLAP (online analytical processing), 79
OLTP (online transaction processing), 78
lifecycle, 55
Business Case Evaluation stage, 56-57
case study, 73-76
Data Acquisition and Filtering stage, 58-60
Data Aggregation and Representation stage, 64-66
Data Analysis stage, 66-67
Data Extraction stage, 60-62
Data Identification stage, 57-58
Data Validation and Cleansing stage, 62-64
Data Visualization stage, 68
Utilization of Analysis Results stage, 69-70
predictive analytics, defined, 10-11
prescriptive analytics, defined, 11-12
databases
IMDBs, 175-178
NewSQL, 163
NoSQL, 152
characteristics, 152-153
rationale for, 153-154
types of devices, 154-162
RDBMSs, 149-152
data discovery, 184
Data Extraction stage (Big Data analytics lifecycle), 60-62
case study, 74
data formats, 17-18
case study, 27
metadata, 20
semi-structured data, 19-20
structured data, 18
unstructured data, 19
Data Identification stage (Big Data analytics lifecycle), 57-58
case study, 74
data marts, 81
traditional BI and, 82
data mining, 184
data parallelism, 135
data processing, 120
batch processing, 123-125
with MapReduce, 125-137
case study, 143-144
clusters, 124
distributed data processing, 121
Hadoop, 122
parallel data processing, 120-121
realtime mode, 137
CEP (complex event processing), 141
ESP (event stream processing), 140
MapReduce, 142-143
SCV (speed consistency volume) principle, 137-142
transactional processing, 123-124
workloads, 122
data procurement, cost of, 49
data provenance, tracking, 51-52
data science, as business motivation for Big Data, 37
datasets, defined, 5-6
Data Validation and Cleansing stage (Big Data analytics lifecycle), 62-64
case study, 75
Data Visualization stage (Big Data analytics lifecycle), 68
in Big Data BI, 84-86
case study, 76
data warehouses, 80
Big Data BI and, 84-85
traditional BI and, 82
data wrangling, 92
descriptive analytics
case study, 25
defined, 8
diagnostic analytics
case study, 25
defined, 9-10
digitization, as business motivation for Big Data, 38
DIKW pyramid, 32
alignment with business architecture, 34, 78
discovery informatics, 182
distributed data processing, 121
distributed file systems, 93-94, 147-148
divide-and-conquer principle (MapReduce), 134-135
document NoSQL storage, 155-158
drill-down in data visualization tools, 86
durability in ACID database design, 111-112
E
edges (in graph NoSQL storage), 161-162
Ensure to Insure (ETI) case study. See case studies, ETI (Ensure to Insure)
enterprise technologies for analytics
case study, 86-87
data marts, 81
data warehouses, 80
ETL (Extract Transform Load), 79
OLAP (online analytical processing), 79
OLTP (online transaction processing), 78
ESP (event stream processing), 140
ETI (Ensure to Insure) case study. See case studies, ETI (Ensure to Insure)
ETL (Extract Transform Load), 79
evaluation of business case (Big Data analytics lifecycle), 56-57
case study, 73-74
event processing. See realtime mode
event stream processing (ESP), 140
eventual consistency in BASE database design, 115-116
exploratory analysis, 66-67
extraction of data (Big Data analytics lifecycle), 60-62
case study, 74
Extract Transform Load (ETL), 79
F
fault tolerance in clusters, 125
feedback loops
in business architecture, 35
methodology, 53-54
files, 93
file systems, 93
filtering of data (Big Data analytics lifecycle), 58-60, 193-194
case study, 74
in data visualization tools, 86
G-H
Geographic Information System (GIS), 202
governance framework, 53
graphic data representations. See visual analysis techniques
graph NoSQL storage, 155, 160-162
Hadoop, 122
heat maps, 198-200
horizontal scaling, 95
in-memory storage, 165
human-generated data, 17
hyper-connection as business motivation for Big Data, 40
I
ICT (information and communications technology)
as business motivation for Big Data, 37
affordable technology, 38-39
cloud computing, 40-42
data analytics and data science, 37
digitization, 38
hyper-connection, 40
social media, 39
case study, 44-45
identification of data (Big Data analytics lifecycle), 57-58
case study, 74
IMDBs (in-memory databases), 175-178
IMDGs (in-memory data grids), 166-175
read-through approach, 170-171
refresh-ahead approach, 172-174
write-behind approach, 172-173
write-through approach, 170-171
information
defined, 31
in DIKW pyramid, 32
information and communications technology (ICT). See ICT (information and
communications technology)
in-memory storage devices, 163-166
IMDBs, 175-178
IMDGs, 166-175
innovation, transformation versus, 48
interactive mode, 137
Internet of Things (IoT), 42-43
Internet of Everything (IoE), as business motivation for Big Data, 42-43
isolation in ACID database design, 110-111
J-K
jobs (MapReduce), 126
key-value NoSQL storage, 155-157
knowledge
defined, 31
in DIKW pyramid, 32
KPIs (key performance indicators)
in business architecture, 33, 78
case study, 25
defined, 12
L-M
latency in RDBMSs, 152
linear regression, 188
machine-generated data, 17-18
machine learning, 190
classification, 190-191
clustering, 191-192
filtering, 193-194
outlier detection, 192-193
managerial level, 33-35, 78
MapReduce, 125-126
algorithm design, 135-137
case study, 143-144
combine stage, 127-128
divide-and-conquer principle, 134-135
example, 133
map stage, 127
partition stage, 129-130
realtime processing, 142-143
reduce stage, 131-132
shuffle and sort stage, 130-131
terminology, 126
map stage (MapReduce), 127
map tasks (MapReduce), 126
marketplace dynamics, as business motivation for Big Data, 30-32
master-slave replication, 98-100
combining with sharding, 104
mechanistic management view, organic management view versus, 30
memory. See in-memory storage devices
metadata
case study, 27
in Data Acquisition and Filtering stage (Big Data analytics lifecycle), 60
defined, 20
methodologies for feedback loops, 53-54
N
natural language processing, 195
network graphs, 201-202
NewSQL, 163
nodes (in graph NoSQL storage), 161-162
noise, defined, 16
non-linear regression, 188
NoSQL, 94, 152
characteristics, 152-153
rationale for, 153-154
types of devices, 154-162
column-family, 159-160
document, 157-158
graph, 160-162
key-value, 156-157
O
offline processing. See batch processing
OLAP (online analytical processing), 79
OLTP (online transaction processing), 78
on-disk storage devices, 147
databases
NewSQL, 163
NoSQL, 152-162
RDBMSs, 149-152
distributed file systems, 147-148
online analytical processing (OLAP), 79
online processing, 123-124
online transaction processing (OLTP), 78
operational level of business, 33-35, 78
optimistic concurrency, 101
organic management view, mechanistic management view versus, 30
organization prerequisites for Big Data adoption, 49
outlier detection, 192-193
P
parallel data processing, 120-121
partition stage (MapReduce), 129-130
partition tolerance in CAP theorem, 106
peer-to-peer replication, 100-102
combining with sharding, 105
performance
considerations, 53
KPIs. See KPIs (key performance indicators)
sharding and, 96
Performance Indicators (PIs) in business architecture, 33
pessimistic concurrency, 101
planning considerations, 48
Big Data analytics lifecycle, 55
Business Case Evaluation stage, 56-57
case study, 73-76
Data Acquisition and Filtering stage, 58-60
Data Aggregation and Representation stage, 64-66
Data Analysis stage, 66-67
Data Extraction stage, 60-62
Data Identification stage, 57-58
Data Validation and Cleansing stage, 62-64
Data Visualization stage, 68
Utilization of Analysis Results stage, 69-70
case study, 71-73
cloud computing, 54
data procurement, cost of, 49
feedback loop methodology, 53-54
governance framework, 53
organization prerequisites, 49
performance, 53
privacy concerns, 49-50
provenance, 51-52
realtime support in data analysis, 52
security concerns, 50-51
predictive analytics
case study, 25
defined, 10-11
prerequisites for Big Data adoption, 49
prescriptive analytics
case study, 25
defined, 11-12
privacy concerns, addressing, 49-50
processing. See data processing
procurement of data, cost of, 49
provenance, tracking, 51-52
Q-R
qualitative analysis, 184
quantitative analysis, 183
RDBMSs (relational database management systems), 149-152
read-through approach (IMDGs), 170-171
realtime mode, 137
case study, 144
CEP (complex event processing), 141
data analysis and, 182-183
ESP (event stream processing), 140
MapReduce, 142-143
SCV (speed consistency volume) principle, 137-142
realtime support in data analysis, 52
reconciling data (Big Data analytics lifecycle), 64-66
case study, 75
reduce stage (MapReduce), 131-132
reduce tasks (MapReduce), 126
redundancy in clusters, 125
refresh-ahead approach (IMDGs), 172-174
regression, 188-190
case study, 204
correlation versus, 189-190
relational database management systems (RDBMSs), 149-152
replication, 97
combining with sharding, 103
master-slave replication, 104
peer-to-peer replication, 105
master-slave, 98-100
peer-to-peer, 100-102
results of analysis, utilizing (Big Data analytics lifecycle), 69-70
case study, 76
roll-up in data visualization tools, 86
S
schemas in RDBMSs, 152
SCV (speed consistency volume) principle, 137-142
security concerns, addressing, 50-51
semantic analysis techniques
natural language processing, 195
sentiment analysis, 197
text analytics, 196-197
semi-structured data
case study, 27
defined, 19-20
sentiment analysis, 197
sharding, 95-96
combining with replication, 103
master-slave replication, 104
peer-to-peer replication, 105
in RDBMSs, 150-151
shuffle and sort stage (MapReduce), 130-131
signal-to-noise ratio, defined, 16
signals, defined, 16
social media, as business motivation for Big Data, 39
soft state in BASE database design, 114-115
spatial data mapping, 202-204
speed in SCV principle, 137
split testing, 185-186
statistical analysis, 184
A/B testing, 185-186
computational analysis versus, 182-183
correlation, 186-188
regression, 188-190
storage devices, 146
case study, 179
in-memory storage, 163-166
IMDBs, 175-178
IMDGs, 166-175
on-disk storage, 147
databases, 149-163
distributed file systems, 147-148
storage technologies
ACID database design, 108-112
BASE database design, 113-116
CAP theorem, 106-108
case study, 117-118
clusters, 93
distributed file systems, 93-94
file systems, 93
NoSQL databases, 94
replication, 97
combining with sharding, 103-105
master-slave, 98-100
peer-to-peer, 100-102
sharding, 95-96
combining with replication, 103-105
strategic level of business, 33-35, 78
stream processing. See realtime mode
structured data
case study, 27
defined, 18
supervised machine learning, 190-191
T
tactical level of business, 33-35, 78
task parallelism, 134
text analytics, 196-197
time series plots, 200-201
case study, 205
traditional BI (Business Intelligence), 82
ad-hoc reporting, 82
dashboards, 82-83
transactional processing, 123-124
transformation, innovation versus, 48
U-V
unstructured data
case study, 27
defined, 19
unsupervised machine learning, 191-192
Utilization of Analysis Results stage (Big Data analytics lifecycle), 69-70
case study, 76
validation of data (Big Data analytics lifecycle), 62-64
case study, 75
value
case study, 27
defined, 16-17
variety
case study, 26
defined, 15
in NoSQL, 154
velocity
case study, 26
defined, 14-15
in-memory storage, 165
in NoSQL, 154
realtime mode, 137
veracity
case study, 26
defined, 16
vertical scaling, 149
virtuous cycles in business architecture, 35
visual analysis techniques, 198
heat maps, 198-200
network graphs, 201-202
spatial data mapping, 202-204
time series plots, 200-201
visualization of data (Big Data analytics lifecycle), 68
in Big Data BI, 84-86
case study, 76
volume
case study, 26
defined, 14
in NoSQL, 154
in SCV principle, 138
W-X-Y-Z
what-if analysis in data visualization tools, 86
wisdom in DIKW pyramid, 32
Working Knowledge (Davenport and Prusak), 31
workloads (data processing), 122
batch processing, 123-125
with MapReduce, 125-137
case study, 143
transactional processing, 123-124
write-behind approach (IMDGs), 172-173
write-through approach (IMDGs), 170-171