Big Data Tools for Data Analysis V1
Data Analysis
Why Big Data Analytics?
Stages in Big Data Analytics
Types of Big Data Analytics
Tools Used in Big Data Analytics
Apache Hadoop
• Software framework employed for clustered file system and
handling of big data.
• Processes datasets of big data by means of the MapReduce
programming model.
• Open-source framework written in Java and it provides
cross-platform support.
Pros
• The core strength of Hadoop is its HDFS (Hadoop Distributed File
System), which has the ability to hold all types of data (video,
images, JSON, XML, and plain text) over the same file system.
• Highly useful for R&D purposes.
• Provides quick access to data.
• Highly scalable
• Highly-available service resting on a cluster of computers
Cons
• Sometimes disk space issues can be faced due to its 3x data
redundancy.
• I/O operations could have been optimized for better performance.
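The MapReduce programming model mentioned for Hadoop can be sketched in plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as a framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map -> shuffle -> reduce over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


print(word_count(["big data tools", "big data analytics"]))
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.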
CDH (Cloudera Distribution for Hadoop)
• Aims at enterprise-class deployments of Hadoop.
• It is totally open source and has a free platform distribution
that encompasses Apache Hadoop, Apache Spark, Apache
Impala, and many more.
• It allows you to collect, process, administer, manage,
discover, model, and distribute unlimited data.
Pros
• Comprehensive distribution
• Cloudera Manager administers the Hadoop cluster very well.
• Easy implementation.
• Less complex administration.
• High security and governance
Cons
• A few complicated UI features, such as charts on the CM service.
• Multiple recommended approaches for installation can be
confusing.
Cassandra
• Free of cost and open-source distributed NoSQL DBMS
constructed to manage huge volumes of data spread across
numerous commodity servers, delivering high availability.
• It employs CQL (Cassandra Query Language) to interact
with the database.
• Some of the high-profile companies using Cassandra include
Accenture, American Express, Facebook, General Electric,
Honeywell, Yahoo, etc.
Pros
• No single point of failure.
• Handles massive data very quickly.
• Log-structured storage
• Automated replication
• Linear scalability
• Simple Ring architecture
Cons
• Requires some extra efforts in troubleshooting and maintenance.
• Clustering could have b…
• Row-level locking feature is not there.
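The ring architecture and automated replication listed for Cassandra can be illustrated with a toy token ring. This is a simplified sketch under stated assumptions: the Ring class is hypothetical, MD5 is only a stand-in hash, and each node owns a single token. Cassandra's own partitioner and token assignment are more involved.

```python
import bisect
import hashlib
from typing import List


class Ring:
    """Toy token ring: a key is stored on the next nodes clockwise from its token."""

    def __init__(self, nodes: List[str], replication_factor: int = 3) -> None:
        # Never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort (token, node) pairs so the ring can be searched with bisect.
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value: str) -> int:
        # Assumption: MD5 is used here only as a convenient, stable hash.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key: str) -> List[str]:
        """Return the nodes holding copies of `key`, walking clockwise."""
        start = bisect.bisect_right(self._tokens, self._token(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

Because every key lives on several distinct nodes and no node plays a special coordinating role in the ring, losing one node still leaves other copies reachable, which is the idea behind "no single point of failure".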
Knime
• KNIME stands for Konstanz Information Miner which is an open
source tool that is used for Enterprise reporting, integration,
research, CRM, data mining, data analytics, text mining, and
business intelligence.
• It supports Linux, OS X, and Windows operating systems.
• Some of the top companies using Knime include Comcast,
Johnson & Johnson, Canadian Tire, etc.
Pros
• Simple ETL operations
• Integrates very well with other technologies and languages.
• Rich algorithm set.
• Highly usable and organized workflows.
• Automates a lot of manual work.
• No stability issues.
• Easy to set up.
Cons
• Data handling capacity can be improved.
• Occupies almost the entire RAM.
• Could have allowed integration with graph databases.
Datawrapper
• Open source platform for data visualization that aids its users
to generate simple, precise and embeddable charts very
quickly.
• Its major customers are newsrooms that are spread all over the
world.
• Some of the names include The Times, Fortune, Mother Jones,
Bloomberg, Twitter etc.
Pros
• Device friendly. Works very well on all types of devices:
mobile, tablet or desktop.
• Fully responsive
• Fast
• Interactive
• Brings all the charts in one place.
• Great customization and export options.
• Requires zero coding.
Cons
• Limited color palettes
MongoDB
• MongoDB is a NoSQL, document-oriented database written in C,
C++, and JavaScript.
• It is free to use and is an open source tool that supports multiple
operating systems.
• Its main features include aggregation, ad hoc queries, BSON
format, sharding, indexing, replication, server-side execution of
JavaScript, schemaless storage, capped collections, MongoDB
management service (MMS), load balancing and file storage.
• Some of the major customers include Facebook, eBay, MetLife,
Google, etc.
Pros
• Easy to learn.
• Provides support for multiple technologies and platforms.
• No hiccups in installation and maintenance.
• Reliable and low cost.
Cons
• Limited analytics.
• Slow for certain use cases.
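The document-oriented, schemaless model with ad hoc queries and aggregation can be sketched with plain Python dictionaries. This is not the MongoDB driver API; the helper names (find, total_by) and the sample orders are hypothetical and exist only to show the idea.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Documents in one collection need not share the same fields (schemaless).
orders: List[Dict[str, Any]] = [
    {"_id": 1, "customer": "alice", "total": 30.0, "tags": ["books"]},
    {"_id": 2, "customer": "bob", "total": 12.5},
    {"_id": 3, "customer": "alice", "total": 20.0, "coupon": "SPRING"},
]


def find(collection: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Ad hoc query: keep documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def total_by(collection: List[Dict[str, Any]], group_field: str,
             sum_field: str) -> Dict[Any, float]:
    """Aggregation: sum `sum_field` per distinct value of `group_field`."""
    totals: Dict[Any, float] = defaultdict(float)
    for doc in collection:
        totals[doc[group_field]] += doc.get(sum_field, 0.0)
    return dict(totals)


print([doc["_id"] for doc in find(orders, customer="alice")])
print(total_by(orders, "customer", "total"))
```

Note that the third document carries a "coupon" field the others lack, and neither helper needs to know about it in advance.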
Lumify
• It is a free and open source tool for big data fusion/integration,
analytics, and visualization.
• Its primary features include full-text search, 2D and 3D graph
visualizations, automatic layouts, link analysis between graph
entities, integration with mapping systems, geospatial analysis,
multimedia analysis, real-time collaboration through a set of
projects or workspaces.
Pros
• Scalable
• Secure
• Supported by a dedicated full-time development team.
• Supports the cloud-based environment. Works well with
Amazon’s AWS.
Cons
• None listed.
HPCC
• HPCC stands for High-Performance Computing Cluster.
• It is a complete big data solution over a highly scalable
supercomputing platform.
• HPCC is also referred to as DAS (Data Analytics Supercomputer).
• It is based on a Thor architecture that supports data parallelism,
pipeline parallelism, and system parallelism.
• It is an open-source tool and is a good substitute for Hadoop and
some other Big data platforms.
Pros
• It is based on commodity computing clusters which provide high
performance.
• Parallel data processing.
• Fast, powerful and highly scalable.
• Supports high-performance online query applications.
• Cost-effective and comprehensive.
Cons
• None listed.
Storm
• Apache Storm is a cross-platform, distributed stream processing,
and fault-tolerant real-time computational framework.
• It is free and open-source.
• Its architecture is based on customized spouts and bolts to
describe sources of information and manipulations in order to
permit batch, distributed processing of unbounded streams of
data.
• Groupon, Yahoo, Alibaba, and The Weather Channel are some of
the famous organizations that use Apache Storm.
Pros
• Reliable at scale.
• Very fast and fault-tolerant.
• Guarantees the processing of data.
• It has multiple use cases: real-time analytics, log processing,
ETL, continuous computation, distributed RPC, machine learning.
Cons
• Difficult to learn and use.
• Difficulties with debugging.
• Use of Native Scheduler and Nimbus become bottlenecks.
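The spout and bolt idea behind Storm can be sketched with Python generators. This is a single-process illustration only, not Storm's API: the names reading_spout, valid_bolt, and running_mean_bolt are hypothetical, and a real topology distributes these stages across a cluster.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Reading = Tuple[str, Optional[float]]


def reading_spout(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Spout: the source of the stream; emits one (sensor, value) tuple at a time."""
    for reading in readings:
        yield reading


def valid_bolt(stream: Iterable[Reading]) -> Iterator[Tuple[str, float]]:
    """Bolt: drops tuples whose value is missing."""
    for sensor, value in stream:
        if value is not None:
            yield sensor, value


def running_mean_bolt(stream: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Bolt: emits the running mean per sensor after every incoming tuple."""
    count: Dict[str, int] = {}
    total: Dict[str, float] = {}
    for sensor, value in stream:
        count[sensor] = count.get(sensor, 0) + 1
        total[sensor] = total.get(sensor, 0.0) + value
        yield sensor, total[sensor] / count[sensor]


# Wire the stages together: spout -> filter bolt -> running-mean bolt.
readings = [("a", 10.0), ("b", None), ("a", 20.0), ("b", 4.0)]
for sensor, mean in running_mean_bolt(valid_bolt(reading_spout(readings))):
    print(sensor, mean)
```

Each stage pulls one tuple at a time, so the same wiring would also work on an unbounded source; nothing in the pipeline needs to see the whole stream.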
Apache SAMOA
• SAMOA stands for Scalable Advanced Massive Online Analysis.
• It is an open-source platform for big data stream mining and
machine learning.
• It allows you to create distributed streaming machine learning
(ML) algorithms and run them on multiple DSPEs (distributed
stream processing engines).
• Apache SAMOA’s closest alternative is BigML tool.
Pros
• Simple and fun to use.
• Fast and scalable.
• True real-time streaming.
• Write Once Run Anywhere (WORA) architecture.
Cons
• None listed.
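Streaming machine learning, the kind of workload SAMOA targets, updates a model one example at a time instead of training on a stored dataset. The sketch below is not SAMOA code: OnlinePerceptron is a hypothetical, single-machine classifier used only to show the predict-then-update loop on a stream.

```python
from typing import List


class OnlinePerceptron:
    """Binary classifier (labels +1 / -1) that learns from one example at a time."""

    def __init__(self, n_features: int, learning_rate: float = 1.0) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> int:
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score >= 0 else -1

    def learn_one(self, x: List[float], y: int) -> bool:
        """Predict first, then update on a mistake. Returns True if the
        prediction made before the update was already correct."""
        correct = self.predict(x) == y
        if not correct:
            for i, xi in enumerate(x):
                self.weights[i] += self.learning_rate * y * xi
            self.bias += self.learning_rate * y
        return correct


# Toy stream: the label is the sign of the single feature.
model = OnlinePerceptron(n_features=1)
stream = [([2.0], 1), ([-1.0], -1), ([3.0], 1), ([-2.0], -1)]
mistakes = sum(0 if model.learn_one(x, y) else 1 for x, y in stream)
print("mistakes on the stream:", mistakes)
```

The model never stores past examples, so memory use stays constant no matter how long the stream runs.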
Talend
• Open studio for Big data
  • It comes under a free and open source license.
  • Its components and connectors are Hadoop and NoSQL.
  • It provides community support only.
• Big data platform
  • It comes with a user-based subscription license.
  • Its components and connectors are MapReduce and Spark.
• Real-time big data platform
  • It comes under a user-based subscription license.
  • Its components and connectors include Spark streaming, Machine
  learning, and IoT.
Pros
• Streamlines ETL and ELT for Big data.
• Achieves the speed and scale of Spark.
• Accelerates your move to real-time.
• Handles multiple data sources.
Cons
• Community support could have been better.
• Could have an improved and easy to use interface
• Difficult to add a custom component to the palette.
Rapidminer
• Rapidminer is a cross-platform tool which offers an integrated
environment for data science, machine learning and predictive
analytics.
• It comes under various licenses that offer small, medium and large
proprietary editions as well as a free edition that allows for 1 logical
processor and up to 10,000 data rows.
• Hitachi, BMW, Samsung, Airbus, etc have been using RapidMiner.
Pros
• Open-source Java core.
• The convenience of front-line data science tools and algorithms.
• Facility of code-optional GUI.
• Integrates well with APIs and cloud.
• Superb customer service and technical support.
Cons
• Online data services should be improved.
Qubole
• Qubole data service is an independent and all-inclusive Big data
platform that manages, learns and optimizes on its own from your
usage.
• This lets the data team concentrate on business outcomes instead of
managing the platform.
• Users of Qubole include Warner music group, Adobe, and Gannett.
Pros
• Faster time to value.
• Increased flexibility and scale.
• Optimized spending
• Enhanced adoption of Big data analytics.
• Easy to use.
• Eliminates vendor and technology lock-in.
Cons
• None listed.
Tableau
• Tableau is a software solution for business intelligence and analytics
which presents a variety of integrated products:
  • Tableau Desktop (for the analyst)
  • Tableau Server (for the enterprise)
  • Tableau Online (for the cloud)
• Tableau is capable of handling all data sizes, is easy to use for both
technical and non-technical customers, and gives you real-time
customized dashboards.
• It is a great tool for data visualization and exploration.
Pros
• Great flexibility to create the type of visualizations
• Offers a bouquet of smart features and is razor sharp in terms of
its speed.
• Out of the box support for connection with most of the databases.
• No-code data queries.
• Mobile-ready, interactive and shareable dashboards.
Cons
• Formatting controls could be improved.
• Could have a built-in tool for deployment and migration amongst
the various Tableau servers and environments.
R
• R is one of the most comprehensive statistical analysis packages.
• It is an open-source, free, multi-paradigm and dynamic software
environment.
• It is broadly used by statisticians and data miners.
• Its use cases include data analysis, data manipulation, calculation, and
graphical display.
Pros
• Vastness of the package ecosystem.
• Unmatched graphics and charting benefits.
Cons
• Its shortcomings include memory management, speed and security.
Python
Pros Cons
Big Data Analytics
Revolutionizing Different Domains
Use Cases
Any Questions???