Big Data Tools for Data Analysis V1

Various tools used for data analytics

Uploaded by

M Hemalatha
Copyright
© All Rights Reserved
Big Data Tools for Data Analysis
Why Big Data Analytics?
Stages in Big Data Analytics
Types of Big Data Analytics
Tools Used in Big Data Analytics
Apache Hadoop
• Software framework for clustered file systems and big data handling.
• Processes big datasets using the MapReduce programming model.
• Open-source framework written in Java, with cross-platform support.
Pros
• Its core strength is HDFS (Hadoop Distributed File System), which can hold all types of data (video, images, JSON, XML, and plain text) on the same file system.
• Highly useful for R&D purposes.
• Provides quick access to data.
• Highly scalable.
• Highly available service resting on a cluster of computers.
Cons
• Disk space issues can sometimes arise because of its 3x data redundancy.
• I/O operations could have been optimized for better performance.
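The MapReduce programming model mentioned for Hadoop can be illustrated with a word count. The sketch below is plain, single-process Python, not Hadoop's own API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are assumptions chosen for illustration, and a real cluster would run the map and reduce steps on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


sample_lines = ["big data tools", "big data analytics"]  # made-up input
counts = word_count(sample_lines)
print(counts)
```

On the storage side, the 3x redundancy listed under Cons means that, for example, 10 TB of data occupies roughly 30 TB of raw disk.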
CDH (Cloudera Distribution for Hadoop)
• Aims at enterprise-class deployments of Hadoop.
• It is totally open source and has a free platform distribution that encompasses Apache Hadoop, Apache Spark, Apache Impala, and many more.
• It allows you to collect, process, administer, manage, discover, model, and distribute unlimited data.
Pros
• Comprehensive distribution.
• Cloudera Manager administers the Hadoop cluster very well.
• Easy implementation.
• Less complex administration.
• High security and governance.
Cons
• A few complicated UI features, such as the charts on the CM service.
• Multiple recommended approaches for installation can be confusing.
Cassandra
• Free, open-source distributed NoSQL DBMS built to manage huge volumes of data spread across numerous commodity servers, delivering high availability.
• It employs CQL (Cassandra Query Language) to interact with the database.
• Some of the high-profile companies using Cassandra include Accenture, American Express, Facebook, General Electric, Honeywell, Yahoo, etc.
Pros
• No single point of failure.
• Handles massive data very quickly.
• Log-structured storage.
• Automated replication.
• Linear scalability.
• Simple ring architecture.
Cons
• Requires some extra effort in troubleshooting and maintenance.
• Clustering could have been improved.
• No row-level locking feature.
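The ring architecture and automated replication listed for Cassandra can be pictured with a small consistent-hashing sketch. This is a simplification under stated assumptions, not Cassandra's implementation: the `Ring` class, the node names, and the use of MD5 as the hash are all hypothetical, and each node owns a single position on the ring.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(value: str) -> int:
    """Hash a string to a ring position (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes: List[str]) -> None:
        # Each node owns one position on the ring, kept sorted by token.
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, replication_factor: int = 3) -> List[str]:
        """Return the first node at or after the key's token, then its successors."""
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])  # hypothetical node names
owners = ring.replicas("user:42")
print(owners)
```

Because every node can compute the owners of a key from the ring alone, no coordinator node is needed, which is one way to read the "no single point of failure" claim.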
KNIME
• KNIME stands for Konstanz Information Miner, an open source tool used for enterprise reporting, integration, research, CRM, data mining, data analytics, text mining, and business intelligence.
• It supports Linux, OS X, and Windows operating systems.
• Some of the top companies using KNIME include Comcast, Johnson & Johnson, Canadian Tire, etc.
Pros
• Simple ETL operations.
• Integrates very well with other technologies and languages.
• Rich algorithm set.
• Highly usable and organized workflows.
• Automates a lot of manual work.
• No stability issues.
• Easy to set up.
Cons
• Data handling capacity can be improved.
• Occupies almost the entire RAM.
• Could have allowed integration with graph databases.
Datawrapper
• Open source platform for data visualization that helps its users generate simple, precise, and embeddable charts very quickly.
• Its major customers are newsrooms spread all over the world.
• Some of the names include The Times, Fortune, Mother Jones, Bloomberg, Twitter, etc.
Pros
• Device friendly. Works very well on all types of devices: mobile, tablet, or desktop.
• Fully responsive.
• Fast.
• Interactive.
• Brings all the charts in one place.
• Great customization and export options.
• Requires zero coding.
Cons
• Limited color palettes.
MongoDB
• MongoDB is a NoSQL, document-oriented database written in C, C++, and JavaScript.
• It is free to use and is an open source tool that supports multiple operating systems.
• Its main features include aggregation, ad hoc queries, the BSON format, sharding, indexing, replication, server-side execution of JavaScript, schemaless design, capped collections, MongoDB Management Service (MMS), load balancing, and file storage.
• Some of the major customers include Facebook, eBay, MetLife, Google, etc.
Pros
• Easy to learn.
• Provides support for multiple technologies and platforms.
• No hiccups in installation and maintenance.
• Reliable and low cost.
Cons
• Limited analytics.
• Slow for certain use cases.
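Two of the MongoDB features named above, schemaless documents and aggregation, can be sketched without a running database. The example below uses plain Python dictionaries rather than the MongoDB driver; the `orders` documents and the `total_by_customer` helper are hypothetical. In MongoDB itself a comparable grouping would typically be written as an aggregation pipeline with a `$group` stage and a `$sum` accumulator.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical documents; note that they do not all share the same fields.
orders: List[Dict[str, Any]] = [
    {"customer": "alice", "total": 30, "coupon": "SPRING"},
    {"customer": "bob", "total": 20},
    {"customer": "alice", "total": 15, "items": ["pen", "ink"]},
]


def total_by_customer(docs: List[Dict[str, Any]]) -> Dict[str, int]:
    """Group documents by customer and sum their totals."""
    sums: Dict[str, int] = defaultdict(int)
    for doc in docs:
        # Missing fields are tolerated, as they would be in a schemaless store.
        sums[doc["customer"]] += doc.get("total", 0)
    return dict(sums)


totals = total_by_customer(orders)
print(totals)
```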
Lumify
• It is a free and open source tool for big data fusion/integration, analytics, and visualization.
• Its primary features include full-text search, 2D and 3D graph visualizations, automatic layouts, link analysis between graph entities, integration with mapping systems, geospatial analysis, multimedia analysis, and real-time collaboration through a set of projects or workspaces.
Pros
• Scalable.
• Secure.
• Supported by a dedicated full-time development team.
• Supports the cloud-based environment. Works well with Amazon's AWS.
HPCC
• HPCC stands for High-Performance Computing Cluster.
• It is a complete big data solution over a highly scalable supercomputing platform.
• HPCC is also referred to as DAS (Data Analytics Supercomputer).
• It is based on a Thor architecture that supports data parallelism, pipeline parallelism, and system parallelism.
• It is an open-source tool and is a good substitute for Hadoop and some other big data platforms.
Pros
• It is based on commodity computing clusters, which provide high performance.
• Parallel data processing.
• Fast, powerful, and highly scalable.
• Supports high-performance online query applications.
• Cost-effective and comprehensive.
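Of the kinds of parallelism named for HPCC, data parallelism is the easiest to show in a few lines: the data is split into partitions, the same operation runs on each partition, and the partial results are combined. The sketch below uses Python's standard `concurrent.futures` on one machine and has nothing to do with HPCC's own tooling; the function names are assumptions, and threads in CPython show the structure rather than a real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition(data: List[int], parts: int) -> List[List[int]]:
    """Split data into at most `parts` slices of roughly equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk: List[int]) -> int:
    """The operation applied independently to every partition."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: List[int], workers: int = 4) -> int:
    """Process the partitions concurrently, then combine the partial results."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


result = parallel_sum_of_squares(list(range(1, 11)))
print(result)
```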
Storm
• Apache Storm is a cross-platform, distributed, fault-tolerant stream processing framework for real-time computation.
• It is free and open-source.
• Its architecture is based on customized spouts and bolts that describe sources of information and manipulations, in order to permit batch, distributed processing of unbounded streams of data.
• Groupon, Yahoo, Alibaba, and The Weather Channel are some of the famous organizations that use Apache Storm.
Pros
• Reliable at scale.
• Very fast and fault-tolerant.
• Guarantees the processing of data.
• It has multiple use cases: real-time analytics, log processing, ETL, continuous computation, distributed RPC, and machine learning.
Cons
• Difficult to learn and use.
• Difficulties with debugging.
• The native scheduler and Nimbus can become bottlenecks.
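The spout-and-bolt idea behind Storm can be sketched with Python generators: a spout produces the stream, and each bolt consumes one stream and emits another. This is a single-process illustration, not the Storm API; the function names and sample sentences are made up, and a real topology would distribute spouts and bolts across a cluster.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: the source of the stream (a fixed list stands in for a live feed)."""
    yield from ["storm processes streams", "streams of tuples", "storm is fast"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: turn each incoming sentence into zero or more outgoing words."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Bolt: keep running state and emit an updated snapshot per word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield dict(counts)


# Wire the topology: spout -> split bolt -> count bolt.
final_counts: Dict[str, int] = {}
for final_counts in count_bolt(split_bolt(sentence_spout())):
    pass  # each iteration is one processed tuple
print(final_counts)
```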
Apache SAMOA
• SAMOA stands for Scalable Advanced Massive Online Analysis.
• It is an open-source platform for big data stream mining and machine learning.
• It allows you to create distributed streaming machine learning (ML) algorithms and run them on multiple DSPEs (distributed stream processing engines).
• Apache SAMOA's closest alternative is the BigML tool.
Pros
• Simple and fun to use.
• Fast and scalable.
• True real-time streaming.
• Write Once Run Anywhere (WORA) architecture.
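What distinguishes a streaming ML algorithm of the kind SAMOA targets is that the model is updated one example at a time instead of being fitted to a stored dataset. The sketch below shows that pattern with a simple online perceptron in plain Python. It does not use SAMOA, it runs in one process, and the class name, the `learn_one` method, and the toy stream (the logical OR function) are all assumptions for illustration.

```python
from typing import Iterator, List, Tuple


class OnlinePerceptron:
    """A linear classifier that is updated one example at a time."""

    def __init__(self, n_features: int, learning_rate: float = 1.0) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> int:
        score = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return 1 if score > 0 else 0

    def learn_one(self, x: List[float], y: int) -> None:
        """Adjust the weights only when the current prediction is wrong."""
        error = y - self.predict(x)
        if error:
            step = self.learning_rate * error
            self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
            self.bias += step


def stream(passes: int = 10) -> Iterator[Tuple[List[float], int]]:
    """Toy stream: the four logical-OR examples, repeated a few times."""
    examples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
    for _ in range(passes):
        yield from examples


model = OnlinePerceptron(n_features=2)
for features, label in stream():
    model.learn_one(features, label)  # the example can be discarded afterwards

predictions = [model.predict(x) for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0])]
print(predictions)
```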
Talend
• Open Studio for Big Data: comes under a free and open source license; its components and connectors are Hadoop and NoSQL; it provides community support only.
• Big Data Platform: comes with a user-based subscription license; its components and connectors are MapReduce and Spark.
• Real-Time Big Data Platform: comes under a user-based subscription license; its components and connectors include Spark Streaming, machine learning, and IoT.
Pros
• Streamlines ETL and ELT for big data.
• Achieves the speed and scale of Spark.
• Accelerates your move to real-time.
• Handles multiple data sources.
Cons
• Community support could have been better.
• Could have an improved, easier-to-use interface.
• Difficult to add a custom component to the palette.
RapidMiner
• RapidMiner is a cross-platform tool that offers an integrated environment for data science, machine learning, and predictive analytics.
• It comes under various licenses that offer small, medium, and large proprietary editions as well as a free edition that allows for 1 logical processor and up to 10,000 data rows.
• Hitachi, BMW, Samsung, Airbus, etc. have been using RapidMiner.
Pros
• Open-source Java core.
• The convenience of front-line data science tools and algorithms.
• Facility of code-optional GUI.
• Integrates well with APIs and cloud.
• Superb customer service and technical support.
Cons
• Online data services should be improved.
Qubole
• Qubole Data Service is an independent and all-inclusive big data platform that manages, learns, and optimizes on its own from your usage.
• This lets the data team concentrate on business outcomes instead of managing the platform.
• Users of Qubole include Warner Music Group, Adobe, and Gannett.
Pros
• Faster time to value.
• Increased flexibility and scale.
• Optimized spending.
• Enhanced adoption of big data analytics.
• Easy to use.
• Eliminates vendor and technology lock-in.
Tableau
• Tableau is a software solution for business intelligence and analytics that offers a variety of integrated products:
• Tableau Desktop (for the analyst)
• Tableau Server (for the enterprise)
• Tableau Online (for the cloud)
• Tableau can handle data of all sizes, is accessible to both technical and non-technical users, and provides real-time customized dashboards.
• It is a great tool for data visualization and exploration.
Pros
• Great flexibility to create the type of visualizations you want.
• Offers a bouquet of smart features and is razor sharp in terms of its speed.
• Out-of-the-box support for connection with most databases.
• No-code data queries.
• Mobile-ready, interactive, and shareable dashboards.
Cons
• Formatting controls could be improved.
• Could have a built-in tool for deployment and migration amongst the various Tableau servers and environments.
R
• R is one of the most comprehensive statistical analysis packages.
• It is an open-source, free, multi-paradigm, and dynamic software environment.
• It is broadly used by statisticians and data miners.
• Its use cases include data analysis, data manipulation, calculation, and graphical display.
Pros
• Vast package ecosystem.
• Unmatched graphics and charting benefits.
Cons
• Its shortcomings include memory management, speed, and security.
Python

• Python is a high-level programming language and provides a large standard library.
• It supports object-oriented, functional, and procedural programming, with dynamic typing and automatic memory management.
Pros
• It is used by data scientists as it provides a good number of useful packages to download for free.
• Python is extensible.
• It provides free data analysis libraries.
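The language features listed for Python (procedural, functional, and object-oriented styles, plus dynamic typing) can be seen side by side in a short sketch. All names below are made up for illustration; the three implementations compute the same total in different styles.

```python
from functools import reduce


# Procedural: a plain function with a loop and a local variable.
def total_procedural(values):
    total = 0
    for value in values:
        total += value
    return total


# Functional: the same computation expressed with reduce and a lambda.
def total_functional(values):
    return reduce(lambda acc, value: acc + value, values, 0)


# Object-oriented: state and behaviour bundled together in a class.
class RunningTotal:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self  # returning self allows chained calls


# Dynamic typing: the same name can refer to values of different types.
item = 3
item_types = [type(item).__name__]
item = "three"
item_types.append(type(item).__name__)

print(total_procedural([1, 2, 3]), total_functional([1, 2, 3]))
print(RunningTotal().add(1).add(2).add(3).total)
print(item_types)
```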
Big Data Analytics Revolutionizing Different Domains
Use Cases
Any Questions???
