Harnessing The Value of Big Data Analytics
Leading organizations are exploring alternative solutions that use the MapReduce software framework, such as Apache Hadoop. While Hadoop can cost-effectively load, store, and refine multi-structured data, it is not well suited for low-latency, iterative data discovery or classic enterprise business intelligence (BI). These applications require a strong ecosystem of tools that provide ANSI SQL support as well as high performance and interactivity.

The more complete solution is to implement a data discovery platform that can integrate Hadoop with a relational integrated data warehouse. New data discovery platforms like the Teradata Aster MapReduce Platform combine the power of the MapReduce analytic framework with SQL-based BI tools that are familiar to analysts. The result is a unified solution that helps companies gain valuable business insight from new and existing data, using existing BI tools and skill sets as well as enhanced MapReduce analytic capabilities.

But which analytic workloads are best suited for Hadoop, the data discovery platform, and an integrated data warehouse? How can these specialized systems best work together? What are the schema requirements for different data types? Which system provides an optimized processing environment that delivers maximum business value with the lowest total cost of ownership? This paper answers these questions and shows you how to use MapReduce, Hadoop, and a unified big data architecture to support big data analytics.

The Challenges of Converting Big Data Volumes into Insight

What business value does data bring to your organization? If your company is like most, you wouldn't think of shifting production schedules, developing a marketing campaign, or forging a product strategy without insight gleaned from business analytics tools. Using data from transactional systems, your team reviews historical purchase patterns, tracks sales, balances the books, and seeks to understand transactional trends and behaviors. If your analytics practice is advanced, you may even predict the likely outcomes of events.

But it's not enough. Despite the value delivered by your current data warehouse and analytics practices, you are only skimming the surface of the deep pool of business value that data can deliver. Today there are huge volumes of interactional and observational data being created by businesses and consumers around the world. Generated by web logs, sensors, social media sites, and call centers, for example, these so-called big data volumes are difficult to process, store, and analyze.

According to industry analyst Gartner,1 any effort to tackle the big data challenge must address multiple factors, including:

> Volume: The amount of data generated by companies and their customers, competitors, and partners continues to grow exponentially. According to industry analyst IDC, the digital universe created and replicated 1.8 trillion gigabytes in 2011.2 That's the equivalent of 57.5 billion 32GB Apple iPads.

> Velocity: Data changes at an ever-increasing rate, making it difficult for companies to capture and analyze. For example, machine-generated data from sensors and web logs is being ingested in real time by many applications. Without real-time analytics to decipher these dynamic data streams, companies cannot make sense of the information in time to take meaningful action.

> Variety: It's no longer enough to collect just transactional data such as sales, inventory details, or procurement information. Analysts are increasingly interested in new data types, such as sentiments expressed in product reviews, unstructured text from call records and service reports, online behavior such as click streams, images and videos, and geospatial and temporal details. These data types add richness that supports more detailed analyses.

> Complexity: With more details and sources, the data is more complex and difficult to analyze. In the past, banks used just transactional data to predict the probability of a customer closing an account. Now, these companies want to understand the last mile of the customer's decision process. By gaining visibility into common consumer behavior patterns across the web site, social networks, call centers, and branches, banks can address issues impacting customer loyalty before consumers decide to defect. Analyzing and detecting patterns on the fly across all customer records is time-consuming and costly. Replicating that effort over time can be even more challenging.

1 Source: Big Data is Only the Beginning of Extreme Information Management, Gartner, April 2011
2 Source: Extracting Value from Chaos, John Gantz and David Reinsel, IDC, June 2011
Addressing the multiple challenges posed by big data volumes is not easy. Unlike transactional data, which can be stored in a stable schema that changes infrequently, interactional data types are more dynamic. They require an evolving schema, which is defined dynamically, often on the fly at query runtime. The ability to load data quickly, and to evolve the schema over time if needed, is a tremendous advantage for analysts who want to reduce the time to valuable insights.

Some data formats may not fit well into a schema without heavy pre-processing, or may need to be loaded and stored in their native format. Dealing with this variety of data types efficiently can be difficult. As a result, many organizations simply delete this data or never bother to capture it at all.

Clear Path to New Value

Companies that recognize the opportunities inherent in big data analytics can take steps to unlock the value of these new data flows. According to Gartner, "CIOs face significant challenges in addressing the issues surrounding big data ... New technologies and applications are emerging and should be investigated to understand their potential value."3

Data scientists, business analysts, enterprise architects, developers, and IT managers are looking for alternative methods to collect and analyze big data streams. What's needed is a unified big data architecture that lets them refine raw data into valuable analytical assets. (See Figure 1.) Specifically, they need to:

> Capture, store, and refine raw, multi-structured data in a data refinery platform. This platform extends existing architectures that have traditionally been used to store data from structured information sources, such as transactional systems.

> Explore and uncover value and new insights, quickly and iteratively, in a data discovery platform
Figure 1. Architecture for Refining Big Data Volumes into Analytical Assets. (The top layer of the figure shows the analyst tool set: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, visualization, etc.)
3 Source: CEO Advisory: Big Data Equals Big Opportunity, Gartner, March 31, 2011
> Provide IT and business users with a variety of analytic tools and techniques to discover and explore patterns

> Store valuable data and metadata in an integrated data warehouse so analysts and business applications can operationalize new insights from multi-structured data

Choosing the Ideal Big Data Analytics Solution

To maximize the value of traditional and multi-structured data assets, companies need to deploy technologies that integrate Hadoop and relational database systems. Although the two worlds were separate not long ago, vendors are beginning to introduce solutions that effectively combine the technologies. For example, market leaders like Teradata and Hortonworks are partnering to deliver reference architectures and innovative product integration that unify Hadoop with data discovery platforms and integrated data warehouses.

MapReduce and Hadoop: A Primer

How do technologies such as MapReduce and Hadoop help organizations harness the value of unstructured and semi-structured data?

MapReduce supports distributed processing of the common map and reduce operations. In the map step, a master node divides a query or request into smaller problems, distributing each sub-problem to map tasks scheduled on worker nodes within a cluster of execution nodes. The output of the map step is sent to nodes that combine, or reduce, that output and create a response to the query. Because both the map and reduce functions can be distributed across clusters of commodity hardware and performed in parallel, MapReduce techniques are appropriate for very large datasets.

Apache Hadoop consists of two components: Hadoop MapReduce for parallel data processing and the Hadoop Distributed File System (HDFS) for low-cost, reliable data storage. Hadoop, the most popular open-source implementation of the MapReduce framework, can be used to refine unstructured and semi-structured data into structured formats that can be analyzed or loaded into other analytic platforms.
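To make the primer concrete, here is a minimal Hadoop MapReduce job in Java, sketched under the assumption of a tab-delimited web log (ip, timestamp, url); the class names and log layout are illustrative, not taken from any product described in this paper. The mappers parse raw log lines in parallel and emit (url, 1) pairs, and the reducers sum those pairs into per-URL hit counts, a structured result that could then be loaded into a warehouse.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageHits {
      // Map step: each worker parses raw log lines and emits (url, 1).
      public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Hypothetical tab-delimited log layout: ip, timestamp, url
          String[] fields = line.toString().split("\t");
          if (fields.length >= 3) {
            url.set(fields[2]);
            ctx.write(url, ONE);
          }
        }
      }

      // Reduce step: combine the mappers' output into a total per URL.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page hits");
        job.setJarByClass(PageHits.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // structured output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Because the same class is registered as a combiner, partial sums are computed on the map side before the shuffle, which is safe here only because integer addition is associative and commutative.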
What should companies look for to get the most value from Hadoop? Most importantly, you need a unified big data architecture that tightly integrates the Hadoop/MapReduce programming model with traditional SQL-based enterprise data warehousing. (See Figure 2.)

Figure 2. A unified big data architecture. (The top layer of the figure shows the analyst tool set: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, visualization, etc.)

The unified big data architecture is based on a system that can capture and store a wide range of multi-structured raw data sources. It uses MapReduce to refine this data into usable formats, helping to fuel new insights for the business. In this respect, Hadoop is an ideal choice for capturing and refining many multi-structured data types with unknown initial value. It also serves as a cost-effective platform for retaining large volumes of data and files for long periods of time.

The unified big data architecture also preserves the declarative and storage-independence benefits of SQL, without compromising MapReduce's ability to extend SQL's analytic capabilities. By offering the intuitiveness of SQL, the solution helps less-experienced users exploit the analytical capabilities of existing and packaged MapReduce functions, without needing to understand the programming behind them. With this architecture, enterprise architects can easily and cost-effectively combine Hadoop's storage and batch processing strengths with the relational database system.

A critical part of the unified big data architecture is a discovery platform that leverages the strengths of Hadoop for scale and processing while bridging the gaps around BI tool support, SQL access, and interactive analytical workloads. SQL-MapReduce helps bridge these gaps by providing a distinct execution engine within the discovery platform.
This engine allows advanced analytical functions to execute automatically, in parallel across the nodes of the machine cluster, while providing a standard SQL interface that can be leveraged by BI tools.

Some products include a library of prebuilt analytic functions, such as path, pattern, statistical, graph, text and cluster analysis, and data transformation, that help speed the deployment of analytic applications. Users should be able to write custom functions as needed, in a variety of languages, for use in both batch and interactive environments.

Finally, an interactive development tool can reduce the effort required to build and test custom-developed functions. Such tools can also be used to import existing Java MapReduce programs.

To ensure that the platform delivers relevant insights, it must also offer enough scalability to support entire data sets, not just data samples. The more data you can analyze, the more accurate your results will be. As data science expert Anand Rajaraman wrote on the Datawocky blog, "Adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set."4

To support rapid iteration in the data discovery process, the solution must also offer high performance and ease of analytic iteration. Look for standard SQL and BI tools that can leverage both SQL and MapReduce natively. By using relational technology as the data store, analysts gain the performance benefits of a query optimizer, indexes, data partitioning, and simple SQL statements that execute with interactive response times.

In sum, a unified big data architecture blends the best of Hadoop and SQL, allowing users to:

> Capture and refine data from a wide variety of sources
4 More data usually beats better algorithms, Anand Rajaraman, Datawocky, March 24, 2008,
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html.
Choosing the Right Big type of data and schema exist in your includes data captured by machines
Data Analytics Solution environment. Possibilities include: with a well-defined format, but no
> Data that uses a stable schema (struc- semantics, such as images, videos, web
As big data challenges become more
tured) This can include data from pages, and PDF documents. Semantics
pressing, vendors are introducing products
packaged business processes with can be extracted from the raw data by
designed to help companies effectively
well-defined and known attributes, interpreting the format and pulling out
handle the huge volumes of data and
such as ERP data, inventory records, required data. This is often done with
perform insight-enhancing analytics.
and supply chain records. shapes from a video, face recognition in
But selecting the appropriate solution for
images, and logo detection. Sometimes
your requirements need not be difficult. > Data that has an evolving schema
formatted data is accompanied by meta-
(semi-structured) Examples include
With the inherent technical differences data that can have a stable schema or
data generated by machine processes,
in data types, schema requirements, and an evolving schema, which needs to be
with known but changing sets of
analytical workloads, its no surprise that classified and treated separately.
attributes, such as web logs, call detail
certain solutions lend themselves to records, sensor logs, JSON (JavaScript Each of these three schema types may
optimal performance in different parts Object Notation), social profiles, and include a wide spectrum of workloads that
of the unified big data architecture. The Twitter feeds. must be performed on the data. Table 1
first criteria to consider should be what
> Data that has a format, but no schema lists several common data tasks and
(unstructured) Unstructured data workload considerations.
Table 1. Common data tasks and workload considerations.

Low-cost storage and retention: Retains raw data in a manner that provides a low TCO-per-terabyte storage cost. Requires access in deep storage, but not at the same speed as a front-line system.

Loading: Brings data into the system from the source system.

Pre-processing/prep/cleansing/constraint validation: Prepares data for downstream processing by, for example, fetching dimension data, recording a new incoming batch, or archiving an old window batch.

Transformation: Converts one structure of data into another. This may mean going from third-normal form in a relational database to a star or snowflake schema, from text to a relational format, or from relational technology to a graph, as with structural transformations.

Reporting: Queries historical data, such as what happened, where it happened, how much happened, and who did it (e.g., sales of a given product by region).

Analytics (including user-driven, interactive, or ad-hoc): Performs relationship modeling via declarative SQL (e.g., scoring or basic statistics) or via procedural MapReduce (e.g., model building or time series analysis).
Table 2. Recommended platforms by schema type and workload.

Evolving schema: Hadoop for low-cost storage and retention; Aster/Hadoop for loading and transformation; Aster for reporting (joining with structured data); Aster for analytics (SQL + MapReduce analytics).

Format, no schema: Hadoop for low-cost storage and retention, loading, transformation, and reporting; Aster for analytics (MapReduce analytics).
Stable Schema

Workloads on stable-schema data call for mature management of extract, transform, and load (ETL) jobs, data lineage, and metadata throughout the data pipeline, from storage to refining through reporting and analytics.

Recommended Approach: Leverage the strength of the relational model and SQL. You may also want to use Hadoop to support low-cost, scale-out storage and retention for some transactional data that requires less rigor in security and metadata management.

Suggested Products: Teradata provides multiple solutions to handle low-cost storage and retention applications as well as loading and transformation tasks. With this architectural flexibility, Teradata products help customers meet varying cost, data latency, and performance requirements. For example:

> Customers that want to store large data volumes and perform light transformations can use the Teradata Extreme Data Appliance. This platform offers low-cost data storage with high compression rates at a highly affordable price.

> For CPU-intensive transformations, the Teradata Data Warehouse Appliance supports mid-level data storage with built-in automatic compression engines.

> Customers that want to minimize data movement and complexity, and that execute transformations requiring reference data, can use the Teradata Active Enterprise Data Warehouse. This appliance provides a hybrid, multi-temperature architecture that places cold data on hard disks and hot data on solid-state storage devices. With Teradata Database, customers can dynamically and automatically compress cold data, driving higher volumes of data into the cold tier.

Evolving Schema

Sample applications: Interactive data discovery, including web click streams, social feeds, set-top box analysis, sensor logs, and JSON.

Characteristics: Data generated by machine processes typically requires a schema that changes or evolves rapidly. The schema itself may be structured, but the changes occur too quickly for most data models, ETL steps, and reports to keep pace. Company e-commerce sites, social media, and other fast-changing systems are good examples of evolving schema. In many cases, an evolving schema has two components, one fixed and one variable.
For example, web logs generate an IP address, time stamp, and cookie ID, which are fixed; the URL string, which is rich with information such as referral URLs and the search terms used to find a page, varies far more.

Recommended Approach: The design of web sites, applications, third-party sites, search engine marketing, and search engine optimization strategies changes dynamically over time. Look for a solution that eases the management of evolving-schema data by providing features that:

> Leverage the back end of the relational database management system (RDBMS), so you can easily add or remove columns

> Make it easy for queries to do late binding of the structure

> Optimize queries dynamically by collecting relevant statistics on the variable part of the data

> Support encoding and enforcement of constraints on the variable part of the data

Suggested Products: Teradata Aster is an ideal platform for ingesting and analyzing data with an evolving schema. The product provides a discovery platform that allows evolving data to be stored natively, without pre-defining how the variable part of the data should be broken up.

Teradata Aster also allows the fixed part of the data to be stored in a schema and indexed for performance. With this feature, analysts can define the structure of the variable component at query run time. This task happens as part of the SQL-MapReduce analytic workflow, in a process called late data binding or schema on read. The system handles this processing behind the scenes, allowing the analyst to interpret and model data on the fly, based on different analytic requirements. Analysts never need to change data models or build new ETL scripts in order to break out the variable data. This reduces cost and saves time, giving analysts the freedom to explore data without being constrained by a rigid schema.
Hadoop can also ingest files and store them without structure, providing a scalable data landing and staging area for huge volumes of machine-generated data. Because Hadoop uses the HDFS file system for storage instead of a relational database, it requires additional processing steps to create schema on the fly for analysis. Hadoop can therefore slow an iterative, interactive data discovery process.

However, if your process includes known batch data transformation steps that require limited interactivity, Hadoop MapReduce can be a good choice. Hadoop MapReduce enables large-scale data refining, so you can extract higher-value data from raw files for downstream data discovery and analytics. For evolving-schema data, Hadoop and Teradata Aster are thus a natural complement for ingesting, refining, and discovering valuable insights from big data volumes.

No Schema

Sample applications: Image processing; audio and video storage and refining; long-term storage; and batch transformation and extraction.

Characteristics: With data that has a format but no schema, the data structure is typically a well-defined file format. The content, however, appears more non-relational than relational: it lacks semantics and does not fit easily into the traditional RDBMS notion of rows and columns. There is often a need to store these data types in their native file formats.

Recommended Approach: Hadoop MapReduce provides a large-scale processing framework for workloads that need to extract semantics from raw file data. By interpreting the format and pulling out the required data, Hadoop jobs can discern and categorize shapes in video and perform face recognition on images. Sometimes formatted data is accompanied by metadata, which can be extracted, classified, and treated separately.

Suggested Products: When running batch jobs to extract metadata from images or text, Hadoop is an ideal platform. You can then analyze this metadata or join it with other dimensional data to provide additional value. Once you've used Hadoop to prepare the refined data, load it into Teradata Aster to quickly and easily join it with other evolving- or stable-schema data.
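As a small, hypothetical example of such a batch job, the map-only Hadoop mapper below reads a text file listing HDFS paths, stats each file, and emits a delimited metadata record that could later be loaded into the warehouse and joined with dimensional data. All names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only refining pass: the input is a text file of HDFS paths, one per
    // line; the output is (path, "size <tab> last-modified") metadata records.
    public class FileMetadataMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text pathLine, Context ctx)
          throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(ctx.getConfiguration());
        FileStatus status = fs.getFileStatus(new Path(pathLine.toString().trim()));
        String record = status.getLen() + "\t" + status.getModificationTime();
        ctx.write(pathLine, new Text(record));
      }
    }
    // In the driver, job.setNumReduceTasks(0) makes this a pure map-only job:
    // no shuffle or reduce step is needed for record-at-a-time refining.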
Big Data Analytics in Action

Since its release, the Teradata Aster discovery platform has helped dozens of customers realize dramatic business benefits through enhanced insight. The following examples illustrate how a company can use the Teradata Aster discovery platform, Teradata integrated data warehouse technology, and Hadoop to deliver new business insight from big data analytics.
Figure 4. Customer retention data flow. Multi-structured raw data (call center voice records and check images) is captured, retained, and transformed in Hadoop, then analyzed in the Teradata Aster discovery platform. Traditional data sources are loaded through ETL tools into the Teradata integrated data warehouse, which exchanges dimensional data (customer and campaign) and analytic results with the discovery platform to drive call center and marketing automation.
Customer Retention and Profitability

Banks and other companies with retail operations know that keeping a customer satisfied is far less costly than replacing a dissatisfied one. A unified big data architecture can help companies better understand customer communications and take action to prevent unhappy consumers from defecting to a competitor. (See Figure 4.)

For example, assume that a customer, Mr. Jones, calls a bank's contact center to complain about an account fee. The bank collects interactive voice response information from the call center, storing this unstructured data in the data discovery platform. The bank also uses its Teradata integrated data warehouse to store and analyze high-resolution check images.

Using Hadoop, analysts can efficiently capture these huge volumes of image and call data. Then they can use the Aster-Hadoop adaptor, or the Aster SQL-H method for on-the-fly access to Hadoop data at query runtime, to merge the unhappy-customer data from call center records with the check data.

By using Aster nPath, one of the SQL-MapReduce-enabled functions in the Teradata Aster MapReduce Platform, an analyst can quickly determine whether Mr. Jones may be about to switch over to the competing financial institution. The analyst identifies the unhappy sentiment in Mr. Jones's call to the contact center. In addition, the analyst notes that one of the customer's deposited checks is drawn on the account of another bank, with the note "brokerage account opening bonus." The analyst can recommend that a customer support agent reach out to Mr. Jones with an offer designed to prevent him from leaving.
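nPath itself is invoked from SQL; purely to illustrate the kind of ordered pattern matching it performs, the hypothetical Java sketch below scans one customer's time-ordered events for a fee complaint that is later followed by the deposit of a competitor's check.

    import java.util.List;

    // Illustrative only: the kind of ordered-path test that Aster nPath
    // expresses declaratively in SQL. Events are assumed pre-sorted by
    // timestamp and partitioned by customer.
    public class ChurnPath {
      public enum EventType { FEE_COMPLAINT, COMPETITOR_CHECK_DEPOSIT, OTHER }

      public record CustomerEvent(String customerId, long timestamp, EventType type) {}

      // True if a fee complaint is later followed by a competitor-check deposit.
      public static boolean atRiskOfDefecting(List<CustomerEvent> orderedEvents) {
        boolean complained = false;
        for (CustomerEvent e : orderedEvents) {
          if (e.type() == EventType.FEE_COMPLAINT) {
            complained = true;
          } else if (complained && e.type() == EventType.COMPETITOR_CHECK_DEPOSIT) {
            return true;
          }
        }
        return false;
      }
    }

In the platform itself, this test would be written declaratively as an nPath pattern over rows partitioned by customer and ordered by time, executing in parallel inside the database rather than in application code.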
Furthermore, the analyst can use these tools to reveal customers with behavior similar to that of Mr. Jones. Marketing and sales personnel can proactively approach these dissatisfied customers, making offers that save those relationships, too.
Figure: Parallel text extraction and analysis in the Aster Database. Files from a file server (MS Office 97-2010 doc, ppt, xls, and msg; PDF, HTML, ePub, text, and jpg) pass through parallel SQL-MapReduce extraction functions (a document parser, an email parser, image metadata extraction, and custom text processing such as parsing a case number from a subject line) that feed tokenization, sentiment analysis, form/table/mail ID handling, graph visualization, and SQL analysis. All MapReduce code executes automatically, in parallel, in the database.
Text Extraction and Analysis

Applications such as e-discovery, sentiment analysis, and search rely on the ability to store, process, and analyze massive numbers of documents, text files, and emails. In their native formats, these data types are very difficult to analyze, and huge data volumes further complicate the effort.

The Teradata Aster MapReduce Platform includes features that support text extraction and analysis applications. Hadoop's HDFS is ideal for quickly loading and storing any type of file in its native format. Once stored, these files can be processed to extract the relevant data and structure it for analysis.

Next, analysts can use SQL-MapReduce functions for tokenization, email parsing, sentiment analysis, and other types of processing. These features allow businesses to identify positive or negative consumer sentiment, or to look for trends and correlations in email communications. New insights can be combined with other information about the customer in the integrated data warehouse. Analyzing this data can help companies identify customers likely to churn, or brand advocates who might be open to a marketing affiliation program that helps drive awareness and sales.
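The packaged functions are far more sophisticated, but the underlying technique can be illustrated with a toy Java example (all names hypothetical): tokenize the free text, then score the tokens against a sentiment lexicon.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    // Toy illustration of lexicon-based sentiment scoring: tokenize free text,
    // then sum per-word polarities. Production SQL-MapReduce functions do far
    // more than this, and run in parallel inside the database.
    public class SentimentSketch {
      private static final Map<String, Integer> LEXICON = Map.of(
          "great", 1, "helpful", 1, "thanks", 1,
          "fee", -1, "unhappy", -1, "cancel", -2, "complaint", -2);

      static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z']+"));
      }

      static int score(String text) {
        int total = 0;
        for (String token : tokenize(text)) {
          total += LEXICON.getOrDefault(token, 0);
        }
        return total;  // negative totals flag messages worth escalating
      }

      public static void main(String[] args) {
        System.out.println(score("Unhappy about this fee, I want to cancel"));  // -4
      }
    }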
For More Information

For more information about how you can bring more value to the business through a unified big data architecture, contact your Teradata or Teradata Aster representative, or visit us on the web at Teradata.com or Asterdata.com.
The Best Decision Possible is a trademark, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S.
or worldwide. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change
specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata
representative or Teradata.com for more information.
Copyright 2012 by Teradata Corporation All Rights Reserved. Produced in U.S.A.