BDA UNIT-I
–Analytic Accelerators
–Application and industry accelerators
–Visualization
Big Data Platform - Data Warehousing:
Workload optimized systems
–Deep analytics appliance
–Configurable operational analytics appliance
–Data warehousing software
Capabilities
•Massive parallel processing engine
•High performance OLAP
•Mixed operational and analytic workloads
Big Data Platform - Information Integration and Governance
Integrate any type of data into the big data platform
–Structured
–Unstructured
–Streaming
Governance and trust for big data
–Secure sensitive data
–Lineage and metadata of new big data sources
–Lifecycle management to control data growth
–Master data to establish single version of the truth
Leverage purpose-built connectors for multiple data sources:
Developers
•Similarity in tooling and languages
•Mature open source tools with enterprise capabilities
•Integration among environments
Administrators
•Consoles to aid in systems management
Big Data Platform - Accelerators:
Analytic accelerators
–Analytics, operators, rule sets
Industry and Horizontal Application Accelerators
–Analytics
–Models
–Visualization / user interfaces
–Adapters
Big Data Platform - Analytic Applications:
Big Data Platform is designed for analytic application development and integration.
BI/Reporting – Cognos BI, Attivio
Predictive Analytics – SPSS, G2, SAS
Exploration/Visualization – BigSheets, Datameer
Instrumentation Analytics – Brocade, IBM GBS
Content Analytics – IBM Content Analytics
Functional Applications – Algorithmics, Cognos Consumer Insights, Clickfox, i2, IBM GBS
Industry Applications – TerraEchos, Cisco, IBM GBS
• Disk latency (speed of reads and writes) – not much improvement in the last 7–10 years,
currently around 70–80 MB/sec
How long will it take to read 1 TB of data?
• 1 TB (at 80 MB/sec):
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
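These figures follow directly from dividing the data volume by the aggregate disk bandwidth. The short sketch below reproduces the arithmetic; the 80 MB/sec per-disk rate is the same assumption used above.

```python
# Back-of-the-envelope read times for 1 TB at 80 MB/sec per disk.
TB = 10**12          # bytes
RATE = 80 * 10**6    # bytes per second per disk

for disks in (1, 10, 100, 1000):
    seconds = TB / (RATE * disks)   # reads happen in parallel across disks
    print(f"{disks:>4} disk(s): {seconds/3600:6.2f} h = {seconds/60:8.1f} min = {seconds:9.0f} s")
```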
What do we care about when we process data?
• Handle partial hardware failures without going down:
– If a machine fails, we should be able to switch over to a standby machine
– If a disk fails – use RAID or a mirrored disk
• Ability to recover from major failures:
– Regular backups
– Logging
– Mirror database at a different site
• Scalability:
– Increase capacity without restarting the whole system
– More computing power should translate into faster processing
• Result consistency:
– The answer should be consistent (independent of whether something has failed) and returned
in a reasonable amount of time
4. Nature of Data
Big data is a term thrown around in a lot of articles; for those who understand what
big data means, that is fine, but for those struggling to understand exactly what it is, it
can get frustrating. There are several definitions of big data, as it is frequently used as an all-
encompassing term for everything from actual data sets to big data technology and big data
analytics. However, this article will focus on the actual types of data that are contributing to
the ever-growing collection of data referred to as big data. Specifically, we focus on the data
created outside of an organization, which can be grouped into two broad categories: structured
and unstructured.
Structured Data
1. Created
Created data is just that: data businesses purposely create, generally for market
research. This may consist of customer surveys or focus groups. It also includes more modern
methods of research, such as creating a loyalty program that collects consumer information or
asking users to create an account and log in while they are shopping online.
2. Provoked
A Forbes article defined provoked data as "giving people the opportunity to express
their views." Every time a customer rates a restaurant, an employee, a purchasing experience
or a product, they are creating provoked data. Rating sites, such as Yelp, also generate this type
of data.
3. Transacted
Transactional data is also fairly self-explanatory. Businesses collect data on every
transaction completed, whether the purchase is completed through an online shopping cart or
in-store at the cash register. Businesses also collect data on the steps that lead to a purchase
online. For example, a customer may click on a banner ad that leads them to the product pages
which then spurs a purchase. As explained by the Forbes article, “Transacted data is a powerful
way to understand exactly what was bought, where it was bought, and when. Matching this
type of data with other information, such as weather, can yield even more insights."
4. Compiled
Compiled data consists of giant databases of data collected on every U.S. household. Companies
like Acxiom collect information on things like credit scores, location, demographics, purchases
and registered cars, which marketing companies can then access for supplemental consumer data.
5. Experimental
Experimental data is created when businesses experiment with different marketing
pieces and messages to see which are most effective with consumers. You can also look at
experimental data as a combination of created and transactional data.
Unstructured Data
People in the business world are generally very familiar with the types of structured
data mentioned above. However, unstructured data is a little less familiar, not because there is less
of it, but because, before technologies like NoSQL and Hadoop came along, harnessing unstructured
data wasn't possible. In fact, most data being created today is unstructured. Unstructured data,
as the name suggests, lacks structure. It can't be gathered based on clicks, purchases or a
barcode, so what is it exactly?
6. Captured
Captured data is created passively as a result of a person's behaviour. Every time someone
enters a search term on Google, that is data that can be captured for future benefit. The GPS
info on our smartphones is another example of passive data that can be captured with big data
technologies.
7. User-generated
User-generated data consists of all of the data individuals are putting on the Internet
every day. From tweets, to Facebook posts, to comments on news stories, to videos put up on
YouTube, individuals are creating a huge amount of data that businesses can use to better target
consumers and get feedback on products.
Big data is made up of many different types of data. The seven listed above comprise
types of external data included in the big data spectrum. There are, of course, many types of
internal data that contribute to big data as well, but hopefully breaking down the types of data
helps you to better see why combining all of this data into big data is so powerful for business.
Sources of Big Data:
Medical records
Data produced by businesses
Commercial transactions
Banking/stock records
E-commerce
Credit cards
3. Internet of Things (machine-generated data): derived from the phenomenal growth in the
number of sensors and machines used to measure and record events and situations in the
physical world. The output of these sensors is machine-generated data, and from simple sensor
records to complex computer logs, it is well structured. As sensors proliferate and data volumes
grow, it is becoming an increasingly important component of the information stored and
processed by many businesses. Its well-structured nature is suitable for computer processing,
but its size and speed are beyond traditional approaches.
Data from sensors
Fixed sensors
Home automation
Weather/pollution sensors
Traffic sensors/webcam
Scientific sensors
Security/surveillance videos/images
Mobile sensors (tracking)
Mobile phone location
Cars
Satellite images
Data from computer systems
Logs
Web logs
5. Analytic Processes and Tools
Open Source Big Data Tools
Based on popularity and usability, we have listed the following ten open source tools
as the best open source big data tools.
1. Hadoop
Apache Hadoop is the most prominent and widely used tool in the big data industry, with its
enormous capability for large-scale data processing. It is a 100% open source framework and
runs on commodity hardware in an existing data center. Furthermore, it can run on a cloud
infrastructure. Hadoop consists of four parts:
Hadoop Distributed File System: Commonly known as HDFS, it is a distributed file
system that provides very high aggregate bandwidth across the cluster.
MapReduce: A programming model for processing big data (see the sketch below).
YARN: A platform for managing and scheduling Hadoop's resources across the cluster.
Libraries (Hadoop Common): Utilities that help the other Hadoop modules work together.
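To make the MapReduce programming model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets the mapper and reducer be supplied as ordinary scripts. The file name, jar path and input/output directories shown in the comments are illustrative assumptions, not fixed Hadoop names.

```python
#!/usr/bin/env python3
# wordcount.py - minimal word count for Hadoop Streaming (illustrative sketch).
# A typical (assumed) invocation looks like:
#   hadoop jar hadoop-streaming.jar -files wordcount.py \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -input /data/in -output /data/out
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```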
2. Apache Spark
Apache Spark is the next big thing in the industry among big data tools. The key point
of this open source big data tool is that it fills the gaps of Apache Hadoop concerning data
processing. Interestingly, Spark can handle both batch data and real-time data. As Spark does
in-memory data processing, it processes data much faster than traditional disk-based processing.
This is indeed a plus point for data analysts handling certain types of data to achieve a faster
outcome.
Apache Spark is flexible enough to work with HDFS as well as with other data stores, for
example OpenStack Swift or Apache Cassandra. It is also quite easy to run Spark on a
single local system to make development and testing easier. Spark Core is the heart of the
project, and it facilitates many things like
distributed task transmission
scheduling
I/O functionality
Spark is an alternative to Hadoop’s MapReduce. Spark can run jobs 100 times faster
than Hadoop’s MapReduce.
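For comparison with the Hadoop Streaming sketch above, the same word count can be expressed much more compactly with the PySpark API. The input path here is an assumption; any local or HDFS text file would do.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster the master is set by the launcher.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Read text, split it into words, and count - these transformations are lazy
# and only execute when an action such as take() is called.
lines = spark.sparkContext.textFile("hdfs:///data/in")   # assumed input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):   # action: triggers the computation
    print(word, count)

spark.stop()
```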
3. Apache Storm
Apache Storm is a distributed real-time framework for reliably processing
unbounded data streams. The framework supports any programming language. The unique
features of Apache Storm are:
Massive scalability
Fault tolerance
A "fail fast, auto restart" approach
Guaranteed processing of every tuple
Written in Clojure
Runs on the JVM
Supports directed acyclic graph (DAG) topologies
Supports multiple languages
Supports protocols like JSON
Storm topologies can be considered similar to a MapReduce job. However, in the case of
Storm, it performs real-time stream processing instead of batch processing. Based on the
topology configuration, the Storm scheduler distributes the workloads to nodes. Storm can
interoperate with Hadoop's HDFS through adapters if needed, which is another point that makes
it useful as an open source big data tool.
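Storm itself runs on the JVM, but the spout-and-bolt topology idea can be illustrated with a tiny, library-free Python simulation. The names below are purely illustrative and are not the Storm API; in a real topology the spout and bolts run in parallel on different worker processes.

```python
# Library-free illustration of Storm's spout -> bolt pipeline (not the Storm API).
import random
import time

def sentence_spout():
    """Spout: an unbounded source that keeps emitting tuples."""
    sentences = ["the cow jumped over the moon", "the man went to the store"]
    while True:
        yield random.choice(sentences)
        time.sleep(0.1)

def split_bolt(sentence):
    """Bolt: splits each sentence tuple into word tuples."""
    return sentence.split()

def count_bolt(counts, word):
    """Bolt: keeps a running count per word (rolling state)."""
    counts[word] = counts.get(word, 0) + 1
    return word, counts[word]

counts = {}
for i, sentence in enumerate(sentence_spout()):
    for word in split_bolt(sentence):
        print(count_bolt(counts, word))
    if i >= 4:          # stop the demo; a real topology runs until it is killed
        break
```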
4. Cassandra
Apache Cassandra is a distributed database designed to manage large data sets across many
servers. This is one of the best big data tools and mainly processes structured data sets. It provides
a highly available service with no single point of failure. Additionally, it has certain capabilities
which no other relational or NoSQL database can provide. These capabilities
are:
Continuous availability as a data source
Linear scalable performance
Simple operations
Easy distribution of data across data centers
Cloud availability points
Scalability
Performance
The Apache Cassandra architecture does not follow a master-slave model; all nodes
play the same role. It can handle numerous concurrent users across data centers. Hence, adding
a new node to an existing cluster is straightforward, even while the cluster is running.
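A short sketch using the DataStax cassandra-driver Python package shows the basic interaction. The contact point, keyspace and table below are illustrative assumptions.

```python
# Minimal sketch using the cassandra-driver package (pip install cassandra-driver).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # contact point(s) of the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

# Writes and reads go to whichever node owns the partition - there is no master.
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Asha"))
for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)

cluster.shutdown()
```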
5. RapidMiner
RapidMiner is a software platform for data science activities and provides an integrated
environment for:
Preparing data
Machine learning
Text mining
Predictive analytics
Deep learning
Application development
Prototyping
This is one of the useful big data tools that supports different steps of machine learning, such
as:
Data preparation
Visualization
Predictive analytics
Model validation
Optimization
Statistical modelling
Evaluation
Deployment
RapidMiner follows a client/server model where the server can be located on premises
or in a cloud infrastructure. It is written in Java and provides a GUI to design and execute
workflows. It can provide 99% of an advanced analytical solution.
6. MongoDB
MongoDB is an open source, cross-platform NoSQL database with many built-in features.
It is ideal for businesses that need fast, real-time data for instant
decisions, and for users who want data-driven experiences. It works with the MEAN software
stack, .NET applications and the Java platform.
Some notable features of MongoDB are:
It can store any type of data like integer, string, array, object, Boolean, date etc.
It provides flexibility in cloud-based infrastructure.
It is flexible and easily partitions data across the servers in a cloud structure.
MongoDB uses dynamic schemas. Hence, you can prepare data on the fly and
quickly. This is another way of cost saving.
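A minimal sketch with the pymongo driver illustrates the dynamic-schema point: documents in the same collection can carry different fields. The connection string, database and collection names are illustrative assumptions.

```python
# Minimal sketch with the pymongo driver (pip install pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]          # database "shop", collection "orders"

# Dynamic schema: documents in the same collection can have different fields.
orders.insert_one({"customer": "Asha", "total": 42.5, "items": ["book", "pen"]})
orders.insert_one({"customer": "Ravi", "total": 10.0, "coupon": "WELCOME10"})

# Query and iterate over matching documents.
for doc in orders.find({"total": {"$gt": 20}}):
    print(doc["customer"], doc["total"])

client.close()
```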
7. R Programming Tool
This is one of the most widely used open source big data tools in the big data industry for
statistical analysis of data. The most positive part of this big data tool is that, although used for
statistical analysis, as a user you don't have to be a statistics expert. R has its own public
library, CRAN (the Comprehensive R Archive Network), which consists of more than 9000
modules and algorithms for statistical analysis of data.
R can run on Windows and Linux servers as well as inside SQL Server. It also supports
Hadoop and Spark. Using the R tool, one can work on discrete data and try out a new analytical
algorithm for analysis. It is a portable language. Hence, an R model built and tested on a local
data source can easily be implemented on other servers or even against a Hadoop data lake.
8. Neo4j
Hadoop may not be a wise choice for all big data related problems. For example, when
you need to deal with a large volume of network data or a graph-related problem like social networking
or demographic patterns, a graph database may be a perfect choice. Neo4j is one of the big data
tools that is widely used as a graph database in the big data industry. It follows the fundamental
structure of a graph database, which is interconnected node-relationship data. It maintains a
key-value pattern in data storage.
Notable features of Neo4j are:
It supports ACID transactions
High availability
Scalable and reliable
Flexible, as it does not need a schema or data type to store data
It can integrate with other databases
Supports a query language for graphs, commonly known as Cypher
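The official neo4j Python driver exposes the graph through Cypher queries; the sketch below creates two nodes and a relationship and then traverses it. The bolt URL, credentials and labels are illustrative assumptions.

```python
# Minimal sketch using the official neo4j Python driver (pip install neo4j).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes and relationships are created with the Cypher query language.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Asha", b="Ravi",
    )
    # Traverse the graph: who are Asha's friends?
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
        a="Asha",
    )
    for record in result:
        print(record["friend"])

driver.close()
```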
9. Apache SAMOA
Apache SAMOA is among the well-known big data tools used for distributed streaming
algorithms for big data mining. It is used not only for data mining but also for other machine learning
tasks such as:
Classification
Clustering
Regression
Programming abstractions for new algorithms
It runs on top of distributed stream processing engines (DSPEs). Apache SAMOA has a
pluggable architecture that allows it to run on multiple DSPEs, which include:
Apache Storm
Apache S4
Apache Samza
Apache Flink
SAMOA has gained immense importance as an open source big data tool in
the industry for the following reasons:
You can program once and run it everywhere
Its existing infrastructure is reusable, so you can avoid deployment cycles
No system downtime
No need for complex backup or update process
10. HPCC
High-Performance Computing Cluster (HPCC) is another of the best big data tools. It
is a competitor of Hadoop in the big data market. It is one of the open source big data tools under
the Apache 2.0 license. Some of the core features of HPCC are:
Helps in parallel data processing
Open Source distributed data computing platform
Follows shared nothing architecture
Runs on commodity hardware
Comes with binary packages supported for Linux distributions
Supports end-to-end big data workflow management
The platform includes:
Thor: for batch-oriented data manipulation, linking, and analytics
Roxie: for real-time data delivery and analytics
Implicitly a parallel engine
Maintains code and data encapsulation
Extensible
Highly optimized
Helps to build graphical execution plans
It compiles into C++ and native machine code
6. Analysis Vs Reporting
Reporting
Reporting is the first step of working with data when it comes to marketing. Reporting
is really about the collection and organization of data points to start the storytelling process
(more on storytelling later). Yet, to plant a seed, storytelling is really the core of reporting
when it's done well. The data should come together into an organized visual format, allowing
you to see changes over time or against other relevant variables to show what has happened.
Good reporting should be organized with clear time parameters and have a clear visual
presentation, so you can start to gain understanding of where things are as they pertain to your
marketing efforts.
Analysis
Analysis is the step that should happen after the reports have been created. Analysis is
the process of searching the reports and data to start to tell a more complex story. Analysis
would look for the interactions between various data points to see how they influence each
other. This search for correlation, or for the cause-and-effect relationships that exist inside of
the data, is the basis of good analysis. To find, test, and confirm a true cause-and-effect
relationship within the data would mark a successful analysis of the data.
Sometimes there’s not enough data to truly do analysis in your existing data set. This
would mean that to do true analysis you would have to gather data from outside of your data
set. For example, if you were doing some analysis on your web data, you might have to gather
reports on your social media channels or referral channels to see a bigger picture of the data
and get an idea of how it’s influenced by outside sources.
Data capabilities make them important, and one can analyze and visualize data better than with any
other data visualization software in the market.
3. Python
Python is an object-oriented scripting language which is easy to read, write and maintain,
and is a free open source tool. It was developed by Guido van Rossum in the late 1980s and
supports both functional and structured programming methods. Python is easy to learn as it is
very similar to JavaScript, Ruby, and PHP. Also, Python has very good machine learning
libraries, viz. scikit-learn, Theano, TensorFlow and Keras. Another important feature of Python
is that it can work with data from almost any platform, such as a SQL server, a MongoDB database or JSON.
Python can also handle text data very well.
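As a small illustration of Python's machine learning libraries, here is a self-contained scikit-learn sketch; it uses the bundled Iris data set, so no external files or servers are assumed.

```python
# A small, self-contained scikit-learn example (pip install scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a classifier and evaluate it on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```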
4. SAS:
SAS is a programming environment and language for data manipulation and a leader in
analytics, developed by the SAS Institute in 1966 and further developed in the 1980s and 1990s.
SAS is easily accessible and manageable and can analyze data from any source. SAS introduced
a large set of products in 2011 for customer intelligence, and numerous SAS modules for web,
social media and marketing analytics are widely used for profiling customers and prospects.
It can also predict their behaviours, and manage and optimize communications.
5. Apache Spark
The University of California, Berkeley's AMPLab developed Apache Spark in 2009. Apache
Spark is a fast, large-scale data processing engine that executes applications in Hadoop clusters
100 times faster in memory and 10 times faster on disk. Spark is built with data science in mind,
and its concept makes data science effortless. Spark is also popular for data pipelines and machine
learning model development.
Spark also includes a library, MLlib, that provides a progressive set of machine learning
algorithms for repetitive data science techniques like classification, regression, collaborative
filtering, clustering, etc.
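To show what MLlib usage looks like, here is a minimal classification sketch with the pyspark.ml API; the tiny inline data set and column names are illustrative assumptions.

```python
# A minimal sketch of Spark MLlib (the pyspark.ml API) for classification.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy data: label = 1 if the purchase is "large", 0 otherwise.
df = spark.createDataFrame(
    [(0.5, 1.0, 0), (1.5, 0.2, 0), (3.0, 2.5, 1), (4.0, 3.5, 1)],
    ["amount", "visits", "label"],
)
features = VectorAssembler(inputCols=["amount", "visits"], outputCol="features")
train = features.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()
```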
6. Excel
Excel is a basic, popular and widely used analytical tool in almost all industries.
Whether you are an expert in SAS, R or Tableau, you will still need to use Excel. Excel becomes
important when there is a requirement for analytics on a client's internal data. It handles the
complex task of summarizing data with a preview of pivot tables, which helps in filtering the
data as per client requirements. Excel has an advanced business analytics option which helps with
modelling capabilities through prebuilt options like automatic relationship detection, creation
of DAX (Data Analysis Expressions) measures and time grouping.
7. RapidMiner:
RapidMiner is a powerful integrated data science platform, developed by the company of the
same name, that performs predictive analysis and other advanced analytics like data mining, text
analytics, machine learning and visual analytics without any programming. RapidMiner can
incorporate any data source type, including Access, Excel, Microsoft SQL, Teradata,
Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase, etc. The tool is very powerful
and can generate analytics based on real-life data transformation settings, i.e. you can control
the formats and data sets for predictive analysis.
8. KNIME
KNIME was developed in January 2004 by a team of software engineers at the University of
Konstanz. KNIME is a leading open source reporting and integrated analytics tool that allows
you to analyze and model data through visual programming; it integrates various
components for data mining and machine learning via its modular data-pipelining concept.
9. QlikView
QlikView has many unique features, like patented technology and in-memory data
processing, which delivers results to end users very quickly and stores the data in the report
itself. Data associations in QlikView are maintained automatically, and data can be compressed to
almost 10% of its original size. Data relationships are visualized using colours: a specific
colour is given to related data and another colour to non-related data.
10. Splunk:
Splunk is a tool that analyzes and searches machine-generated data. Splunk pulls in all
text-based log data and provides a simple way to search through it; a user can pull in all kinds
of data, perform all sorts of interesting statistical analyses on it, and present it in different
formats.