Architecting A Platform For Big Data Analytics
WHITE PAPER
By Mike Ferguson
Intelligent Business Strategies
March 2016
Table of Contents
Introduction
Integrating Big Data Into Your DW/BI Environment – The Logical Data Warehouse
Vendor Example: IBM's End-to-End Platform For Big Data Analytics
    IBM BigInsights and Open Data Platform with Apache Hadoop
    Apache Spark
        IBM Technology Integration With Apache Spark
        Spark-as-a-Service on IBM Bluemix
    IBM PureData System for Analytics
    IBM dashDB
    IBM Data Integration For The Big Data Enterprise
        IBM BigInsights BigIntegrate and BigInsights BigQuality
    IBM Streams – Real-time Optimisation Using Big Data
    Accessing The Logical Data Warehouse Using IBM Big SQL And IBM Fluid Query
        IBM Big SQL
        IBM Fluid Query
        Analytical Tools
Conclusions
INTRODUCTION
In this digital age, ‘disruption’ is a term we hear a lot about. It can be defined as:

“A disturbance that interrupts an event, activity, or process”

In the context of this paper it means competition appearing from unexpected quarters that rapidly takes business away from traditional suppliers. It is made possible because of data that, when collected, cleaned, integrated and analysed, provides sufficient insights to identify new market opportunities and prospective customers. Those that have the data can see the opportunities and cause disruption. Those that don’t have data, or only a subset of it, cannot.
The speed at which disruption can occur is accelerating. People with mobile devices can easily search for products and services, find available suppliers, compare products and services, read reviews, rate products and collaborate with others to tell them about what they like and dislike, all while on the move. This puts prospective customers in a very powerful position when making buying decisions because they are increasingly informed before they buy and, armed with this information, they can switch suppliers at the click of a mouse if a more personalised, better quality product or service is available.

The result is that, with this information at hand, loyalty is becoming cheap. People spread the word on social networks and disruption takes hold in the market. In this kind of market, no one is safe. Companies have to fight harder to retain existing customers while also trying to grow their customer base.
Given this backdrop, it is not surprising that many companies are focusing on improving quality and customer service to try to retain their customers. Process bottlenecks are being removed and process errors that impact the customer experience are being fixed to make sure that everything runs smoothly. In addition, many companies are trying to improve customer engagement and ensure the same customer experience across all physical and on-line channels. That requires all customer-facing employees and systems to have access to deeper customer insight and to know about all customer interactions. The objective (often referred to as an Omni-channel initiative) is to create a smart front office with personalised customer insight and personalised customer marketing recommendations available across all channels. This is shown in Figure 1, where personalisation is made possible by analysing enriched customer data using predictive and prescriptive analytics.

Until recently, the way in which we produced customer insight and customer recommendations was to simply analyse transaction activity in a data warehouse. However, the limitation is that only transaction activity is analysed. It does not include analysis of other high-value data that, when pieced together, offers a much more complete understanding of a customer’s “DNA”. Therein lies the problem: analysis of traditional transaction data in a data warehouse is insufficient for a company to gain disruptive insight. More data is needed. Therefore new data requirements and new technical requirements need to be defined to identify, process and analyse all the data needed to enable companies to become ‘disrupters’. Let’s take a look at what those requirements are.
Figure 1: The omni-channel front office. Customers interact through e-commerce, m-commerce and social commerce applications, mobile apps, sales force automation apps, customer-facing bricks & mortar apps (e.g. in-store and in-branch apps) and customer service apps. These digital channels generate big data, and insights produced by analysing transactional and non-transactional data are now needed across all traditional and digital channels to improve customer engagement.
Social networks provide new data, sentiment and an understanding of who influencers are. Sentiment can also come from in-bound email and CRM system notes. Identifying influencers in social networks is important because it allows marketing departments to run ‘target the influencer’ marketing campaigns to see if they can recruit new customers, cause revenue uplifts and therefore improve marketing effectiveness.
SENSOR DATA
Sensors can also be used to gain a deeper understanding of customers in that they can be embedded in products those customers own. As a result, data can be collected to understand how products are used and to capture data about people’s health. The use of sensor data is particularly relevant to customer location, which can be revealed using smart phone GPS sensors. This, combined with clickstream data, allows telecommunications companies, for example, to monitor what people are browsing on-line and also where they are while browsing. Telecommunications companies can thereby disrupt the advertising industry by offering location-based mobile advertising – a totally new line of business. Sensors can also be used to monitor customer movement and product usage.
Beyond the customer, sensor data is frequently used to help optimise operations, reduce risk and provide insights that may lead to new products. Sensors allow companies to monitor live operations, prevent problems from happening, keep processes optimised and avoid unplanned costs. Typical use cases include:

o Supply/distribution chain optimisation
o Asset management and field service optimisation
o Manufacturing production line optimisation
o Location-based advertising (mobile phones)
o Grid health monitoring, e.g. electricity, water, mobile phone cell networks
o Oil and gas drilling activity monitoring, well integrity and asset management
o Usage/consumption monitoring via smart metering
o Healthcare
o Traffic optimisation
• Support new kinds of analytical workloads to enable disruption, including:

o Real-time analysis of data in motion (e.g. analyse clickstream data while the visitor is on your website, or sensor data to predict asset failure)
o Exploratory analysis of un-modeled, multi-structured data, e.g. social network text, open government data, sensor data
o Graph analysis, e.g. community analysis, social network influencer analysis
o Machine learning (a minimal sketch appears after Figure 2 below) to:
§ Develop predictive models on large volumes of structured data, e.g. clickstream data
§ Score and predict propensity to buy, trends in behaviour and customer churn using a more comprehensive set of data
§ Score and predict asset or network failure that would impact negatively on revenue, customer experience, risk or operational cost
Figure 2
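To make the machine learning requirement concrete, the sketch below shows the general pattern of training a churn-propensity model on structured customer data using Apache Spark's MLlib (Spark is discussed later in this paper). The HDFS paths, column names and features are hypothetical, purely for illustration; they are not taken from any specific IBM product.

# A minimal PySpark sketch: train a churn-propensity model on structured
# customer data. All paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

# Historical customer records with a 0/1 'churned' label.
history = spark.read.parquet("hdfs:///data/customers_history")

# Combine behavioural attributes into a single feature vector.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls"],
    outputCol="features")
train = assembler.transform(history)

# Fit the model, then score current customers with a churn probability.
model = LogisticRegression(labelCol="churned").fit(train)
current = assembler.transform(spark.read.parquet("hdfs:///data/customers_current"))
model.transform(current).select("customer_id", "probability").show(5)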
• Irrespective of whether development of data capture, data preparation, data integration and analysis jobs is done by IT professionals or business analysts, it should be possible to develop once and run on the platform best suited to do the work, whether that is on-premises or in the cloud
• Support the ability to flatten semi-structured data, e.g. JSON, XML (a minimal sketch appears after this list)
• Support shared-nothing parallel execution of each and every data cleansing, data transformation and data integration task defined in a data preparation and integration job. This is needed to be able to exploit the full power of the underlying hardware
• Support multiple types of parallelism across nodes, including parallel execution of each transformation (as described above) across partitioned data, and also pipeline parallelism, whereby the output of one task (executing in parallel) can flow to the next task while the first task is still executing
• Push data cleansing, transformation and integration tasks to execute where the data is located, rather than having to take the data to a central point where the cleansing, transformation and integration tasks are located
• Process more data simply by adding more hardware
• Undertake probabilistic matching (fuzzy matching) of sentiment with customer master data at scale to understand customer sentiment
• Archive and manage data movement between data stores (e.g. data warehouse and Hadoop), ensuring all governance rules associated with sensitivity, quality and retention are upheld irrespective of data location
• Control access to data by applications, tools and users to protect data
• Have governance-aware execution engines that enforce defined data governance policies anywhere in the analytical ecosystem shown in Figure 2
• Map new insights produced by analysing new data into a shared business vocabulary in a business glossary so that the meaning of this data can be understood before it is shared
• Publish new datasets, data integration workflows, analytical workflows, and insights, as part of a data curation process, to an information catalog for other users and applications to consume and use
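To illustrate the flattening requirement above, here is a minimal PySpark sketch that flattens JSON records containing a nested array. The HDFS paths and field names are invented for illustration only.

# A minimal PySpark sketch of flattening semi-structured JSON data.
# The HDFS paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Each record looks like: {"customer_id": ..., "orders": [{"sku": ..., "qty": ...}]}
raw = spark.read.json("hdfs:///data/raw/orders.json")

# Explode the nested array so each order line becomes its own row,
# then promote the nested struct fields to flat columns.
flat = (raw
        .select(col("customer_id"), explode(col("orders")).alias("order"))
        .select("customer_id",
                col("order.sku").alias("sku"),
                col("order.qty").alias("qty")))

flat.write.mode("overwrite").parquet("hdfs:///data/prepared/orders_flat")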
It should be possible for data scientists and business analysts to:
• Access an information catalog to see what data exists, where it is located, what state it is in, where it came from, how it has been transformed, whether they can trust it and if it is available for use
• Easily search the information catalog to find new datasets and insights so that they can quickly see what exists
• Subscribe to receive new data and insights, published in an information catalog, for delivery to wherever they require it, in whatever format they require it, subject to any governance rules being applied to enforce national, regional, or other jurisdictional policies that restrict its use
• Access authorised data and insights that may exist in multiple analytical data stores and data streaming platforms through a common SQL interface from self-service tools to simplify access to data
• Federate queries across Hadoop, traditional data warehouses and live data streams to produce disruptive, actionable insights
• Access Hadoop data using SQL from traditional data warehouse data stores as well as via SQL-on-Hadoop initiatives
• Query and analyse data in a logical data warehouse (across multiple analytical data stores and real-time streaming platforms) using traditional and cognitive analytical tools, irrespective of whether the data in the logical data warehouse is on-premises, in the cloud or both
Figure 3: A hybrid multi-platform analytical ecosystem. Streaming data from sensors, web logs and clickstream, together with XML/JSON web services, RDBMS feeds, social data, cloud sources, files and office documents, flows through stream processing (producing filtered data and actions) and an enterprise information management tool suite into Hadoop and traditional data warehouses. In this way, data in Hadoop can be integrated with data in traditional data warehouses to produce disruptive insights.
Furthermore, Apache Spark can sit on top of all of these data stores as a general-purpose, massively parallel, in-memory analytics layer primarily aimed at data science. It could also be embedded in database products for in-database, in-memory analytics.
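As a hedged illustration of Spark acting as an analytics layer across multiple data stores, the sketch below joins warehouse data (read over JDBC) with Hadoop data in Spark's distributed memory. The JDBC URL, credentials, table and file names are hypothetical, and the appropriate JDBC driver jar must be on the Spark classpath.

# A minimal sketch: Spark as an in-memory layer over two data stores.
# All connection details, tables and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-store-analytics").getOrCreate()

# Dimension data held in a relational data warehouse, read over JDBC.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:db2://warehouse-host:50000/SALES")
             .option("driver", "com.ibm.db2.jcc.DB2Driver")
             .option("dbtable", "DIM_CUSTOMER")
             .option("user", "analyst").option("password", "secret")
             .load())

# High-volume clickstream events landed in Hadoop as Parquet files.
clicks = spark.read.parquet("hdfs:///data/clickstream")

# Join and aggregate entirely in Spark's distributed memory.
clicks.join(customers, "customer_id").groupBy("segment").count().show()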
Figure 3 also shows an enterprise information management (EIM) tool suite to manage data flows into and between data stores in a hybrid multi-platform analytical ecosystem. The EIM suite includes tooling to support:
• An information governance catalog and business glossary
• Data modeling
• Data and metadata relationship discovery
• Data quality profiling and monitoring
“How can you enable development of big data analytics to produce disruptive insight quickly while also dealing with a deluge of new and complex data?”

There are several ways to introduce agility into big data development processes:

• Use the cloud as a low-cost way to get started in creating new insights to add to what you already know
• Make use of data virtualisation and federated query processing to create a ‘logical data warehouse’ across multiple analytical data stores
A publish-and-subscribe production line approach, together with an information catalog, encourages reuse and can significantly reduce the time needed to produce disruptive insights. Data is acquired from source, cleansed, transformed and enriched, and published to an information catalog as trusted data as a service. Downstream steps subscribe to and consume it: data integration publishes trusted, integrated data as a service; analysis (e.g. scoring) publishes new predictive analytic pipelines as a service; and the visualise, decide and act steps publish new prescriptive analytic pipelines and new analytic applications, which can be embedded in other applications and catalogued as solutions.

Figure 4: Accelerating delivery using a publish and subscribe approach
IBM BIGINSIGHTS AND OPEN DATA PLATFORM WITH APACHE HADOOP

IBM Open Platform with Apache Hadoop is a 100% open source common Open Data Platform (ODP) core of Apache Hadoop (inclusive of HDFS, YARN, and MapReduce) and Apache Ambari software. The following components ship as part of this:
Component      Description
Ambari         Hadoop cluster management
Apache Kafka   Scalable message handling for inbound streaming data
Flume          Web log data ingestion into Hadoop HDFS
HBase          Column family NoSQL database for high-velocity data ingest and operational reporting
HDFS           Hadoop Distributed File System that partitions & distributes data across a cluster
Hive           SQL access to HDFS and HBase data
Knox           An API gateway for protection of REST APIs
Lucene         Java-based indexing and search technology
Oozie          Scheduling
Parquet        Columnar storage format
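To illustrate how Apache Kafka, listed in the table above, handles inbound streaming data, here is a minimal sketch using the kafka-python client library. The broker address, topic name and message format are hypothetical.

# A minimal sketch of publishing inbound sensor readings to a Kafka topic
# for downstream ingestion into Hadoop. Broker, topic and message format
# are hypothetical placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Each reading becomes one message on the 'sensor-events' topic.
reading = {"device_id": "pump-17", "temp_c": 81.4, "ts": "2016-03-01T10:15:00Z"}
producer.send("sensor-events", reading)
producer.flush()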
IBM BigInsights for Apache Hadoop is a collection of value-added services that can be installed on top of the IBM® Open Platform with Apache Hadoop or any incumbent Hadoop distribution (e.g. Cloudera, MapR etc.). It offers analytic and enterprise capabilities for Hadoop and includes a set of packaged modules aimed at different types of user.
APACHE SPARK
IBM Technology Integration With Apache Spark
A number of IBM software products now integrate with Apache Spark. Some of these are shown in the table below, along with a description of how they integrate.
Figure 5
Spark-as-a-Service on IBM Bluemix
IBM has also made Spark available as a service on IBM Bluemix. Analytics for
Apache Spark works with commonly used tools available in IBM Bluemix so
that you can quickly start tapping into the full power of Apache Spark. The tools
include the following:
• Jupyter Notebooks for interactive and reproducible data analysis and
visualization
• SWIFT Object Storage for storage and management of data files
• Apache Spark for data processing at scale
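As a hedged sketch of the kind of interactive, reproducible analysis a Jupyter notebook backed by the Spark service supports, consider the snippet below. The object-storage container and file are hypothetical; in practice the storage path and credentials come from the provisioned service, and the notebook environment typically pre-creates the Spark entry point.

# A minimal notebook sketch: read a file from object storage and analyse it
# with Spark. The swift:// path is a hypothetical placeholder.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # usually pre-created in the notebook

df = spark.read.csv("swift://mycontainer.keystone/transactions.csv",
                    header=True, inferSchema=True)

# Top 10 product categories by revenue, computed at scale.
(df.groupBy("category")
   .agg(F.sum("amount").alias("revenue"))
   .orderBy(F.desc("revenue"))
   .show(10))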
IBM PUREDATA SYSTEM FOR ANALYTICS

The IBM PureData System for Analytics integrates database, server, storage and advanced analytic capabilities into a single system. It scales from 1 TB to 1.5 petabytes and includes special processors that filter data as it comes off disk so that only data relevant to a query is processed in the RDBMS. The IBM Netezza Analytic RDBMS requires no indexing or tuning, which makes it easier to manage. It is designed to interface with traditional BI tools, including IBM Cognos Analytics, and also runs IBM SPSS-developed advanced analytical models deployed in the database on large volumes of data.
Complementing the IBM PureData System for Analytics is an advanced analytics framework that provides free in-database analytics capabilities. In addition to providing a large library of parallelised advanced and predictive algorithms, it allows creation of custom analytics in a number of different programming languages (including C, C++, Java, Perl, Python, Lua, R, and even Fortran) and it allows integration of leading third-party analytic software offerings from companies like SAS, SPSS, Revolution Analytics, and Fuzzy Logix.

IBM PureData System for Analytics allows you to create, test and apply models to score data right inside the appliance, eliminating the need to move data and giving you access to more of the data and more attributes than you might otherwise be able to use if you needed to extract the data to another computer.
IBM DASHDB
IBM dashDB is a fully managed, cloud-based MPP DBMS enabling IBM to offer data warehouse-as-a-service. It provides in-database analytics, in-memory columnar computing and connectivity to a wide range of analytical toolsets, including Watson Analytics and many third-party BI tools.

A second deployment option of IBM dashDB, currently in early access preview, is also available for fast deployment into private or virtual private clouds via a Docker container.
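As a hedged illustration of the data warehouse-as-a-service model, the sketch below connects to a dashDB instance from Python using the ibm_db client library and runs SQL that executes inside the database, so only results travel back. The hostname, credentials and table are hypothetical placeholders.

# A minimal sketch of querying dashDB from Python with ibm_db.
# Connection details and table names are hypothetical.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=dashdb-host.example.com;PORT=50000;"
    "PROTOCOL=TCPIP;UID=analyst;PWD=secret", "", "")

# The aggregation runs inside dashDB's MPP engine.
stmt = ibm_db.exec_immediate(
    conn, "SELECT region, SUM(sales) AS total FROM SALES_FACT GROUP BY region")
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["REGION"], row["TOTAL"])
    row = ibm_db.fetch_assoc(stmt)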
IBM DATA INTEGRATION FOR THE BIG DATA ENTERPRISE

IBM's data integration software for the big data enterprise also provides:

• Integration with business glossary and data modeling software in the same EIM platform via shared metadata
• An information catalog and the ability to publish data integration jobs as data services in InfoSphere Information Governance Catalog, so that information consumers can see what data services are available for them to shop for and order via InfoSphere Data Click
IBM STREAMS – REAL-TIME OPTIMISATION USING BIG DATA

To help expedite real-time analytic application development, IBM Streams ships with pre-built analytical toolkits and connectors for popular data sources. Third-party analytic libraries are also available from IBM partners. In addition, an Eclipse-based integrated development environment (IDE) is included to allow organisations to build their own custom real-time analytic applications for stream processing. It is also possible to embed IBM SPSS predictive models or analytic decision management models in IBM Streams analytic application workflows to predict the business impact of event patterns.

Scalability is provided by deploying IBM Streams applications on multi-core, multi-processor hardware clusters optimised for real-time analytics, and via integration with Apache Spark. Events of interest to the business can also be filtered out and pumped to other IBM analytical data stores for further analysis and/or replay. IBM Streams can therefore be used to continually ingest data of interest into IBM BigInsights for analysis. It is also possible to summarise high-volume data streams and route these to IBM Cognos Analytics for visualisation in a dashboard for further human analysis.
ACCESSING THE LOGICAL DATA WAREHOUSE USING IBM BIG SQL AND IBM FLUID QUERY
Users and analytic applications need access to data in a variety of data repositories and platforms without concern for the data's location or access method, and without the need to rewrite queries. To make this possible, IBM provides Big SQL and Fluid Query.
IBM Big SQL

IBM Big SQL is IBM's flagship multi-platform SQL query engine for accessing Hadoop data, non-Hadoop data, or both. It therefore creates a logical data warehouse layer over the top of multiple underlying analytical data stores and can federate queries to make those platforms work locally on the necessary data. Business analysts can connect directly to Big SQL from self-service BI tools that generate SQL. Data scientists and IT developers who want to access Hadoop and non-Hadoop data using SQL from within their analytic applications can also use it.
When processing in-bound SQL, IBM Big SQL bypasses the Hadoop MapReduce, Tez and Spark execution environments. Instead it runs natively under YARN on a Hadoop cluster with direct access to HDFS and HBase data. Big SQL is fully integrated with the Hive metastore and can therefore see Hive tables, Hive SerDes, Hive partitioning and Hive statistics. It is also fully compliant with Spark but does not require Spark. By that we mean that IBM Big SQL can be used by Spark analytic applications written in Python, Java, Scala and R to access data as an alternative to Spark SQL. This works because Spark applications can access data via Big SQL and get back Spark RDDs to analyse data in memory across a cluster. The difference is that there is more functionality in Big SQL: it is fully ANSI SQL 2011 compliant and has the optimisation capability to perform query rewrites to improve performance. It supports aggregate, scalar and OLAP functions, virtual tables, JAQL UDFs for analysis of unstructured data, and data types such as STRUCT, ARRAY, MAP and BINARY to handle more complex data. In addition, it supports Hadoop columnar file formats such as ORC, Parquet, and RCFile, and has no proprietary storage format of its own. In terms of security, IBM Big SQL offers role-based access plus column and row security. IBM Big SQL can also potentially 'push down' query functionality into Spark to execute if it deems it necessary (e.g. to make use of GraphX functions).
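To illustrate the access pattern described above, here is a hedged sketch of a Spark application querying Big SQL over JDBC using IBM's DB2 client driver. The host, port, credentials and table names are hypothetical, and the exact mechanism IBM provides for handing Spark RDDs back from Big SQL may differ from this generic JDBC read.

# A hedged sketch of a Spark application reading data through Big SQL.
# Connection details and table names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigsql-access").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:db2://bigsql-head:32051/BIGSQL")
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .option("dbtable", "SALES.ORDERS")   # a Hive-catalogued table
          .option("user", "analyst").option("password", "secret")
          .load())

# The result is a distributed DataFrame; its underlying RDD can be used
# for further in-memory analysis across the cluster.
high_value = orders.filter(orders.order_total > 10000)
print(high_value.rdd.count())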
Analytical Tools
A wide range of third-party and IBM analytical tools, such as IBM Cognos Analytics, IBM Watson Analytics, IBM SPSS, IBM BigR and IBM BigSheets, can all make use of Big SQL and Fluid Query to access data and invoke in-database, in-memory, in-Hadoop and in-stream scalable analytics running in the IBM Analytics Platform.
CONCLUSIONS
With technology advancing rapidly, companies need to work out how to piece together and integrate the necessary components to maximise their ability to produce disruptive insight. They need to be able to define what they need to produce to cause disruption, understand their data and analytical requirements and then select the technology components necessary to be able to get started. It should be possible to get started quickly in the cloud and then move on-premises if needs be.

In addition, they also need to be able to produce disruptive insight in a productive manner without the need for major re-training to use new technologies like Hadoop, Spark and streaming analytics. To that end, if companies can make use of the existing tools used in traditional data warehousing to also clean, integrate and analyse data in big data environments, then the time to value will come down significantly. Also, tools should be able to exploit the scalability of the underlying hardware, when data volumes and data velocity are high, without users needing to know how that is done.

It should also be possible to simplify access to data in multiple data stores and to join data across multiple data stores so that complexity is kept to a minimum. This is true irrespective of whether a business analyst needs to access data from a self-service analytics tool or an IT developer or data scientist needs to access data from a custom-built analytical application. As such, a common federated SQL layer is needed to create a 'Logical Data Warehouse'.

IBM is pursuing all of this both on the cloud and on-premises, with the ability to deploy BigInsights, Apache Spark and the IBM Open Platform with Apache Hadoop both on the cloud and on-premises. In addition, it is creating a common scalable analytical RDBMS code base that works across dashDB, Big SQL, DB2 and PureData System for Analytics appliances, and as a software version for private clouds with dashDB for software-defined environments. Also, Spark is being integrated everywhere to push down analytics so that they run as close to the data as possible. In addition, Big SQL and Fluid Query simplify access to traditional data warehouses and Hadoop, helping to create a logical data warehouse layer. And there is more to come.

Today we are beyond the point where big data is in the prototype stage. We are entering an era where automation, integration and end-to-end solutions need to be built rapidly to facilitate disruption. Companies need to architect a platform for big data (and traditional data) analytics. Given this requirement, IBM would have to be a shortlist contender to help any organisation become a disrupter, whether on the cloud or on-premises.
Author
Mike Ferguson is Managing Director of Intelligent Business Strategies
Limited. As an analyst and consultant he specialises in business intelligence
and enterprise business integration. With over 34 years of IT experience, Mike
has consulted for dozens of companies on business intelligence strategy, big
data, data governance, master data management and enterprise architecture.
He has spoken at events all over the world and has written many articles and blogs providing insights on the industry.
Formerly he was a principal and co-founder of Codd and Date Europe Limited
– the inventors of the Relational Model, a Chief Architect at Teradata on the
Teradata DBMS and European Managing Director of Database Associates, an
independent analyst organisation. He teaches popular master classes in Big
Data Analytics, New Technologies for Business Intelligence and Data
Warehousing, Enterprise Data Governance, Master Data Management, and
Enterprise Business Integration.