Course Material
PERAMBALUR-621212
COURSE MATERIAL
Semester : IV
ELECTIVE COURSE V
Objective:
Unit I
Introduction to big data: Data, Characteristics of data and Types of digital data:
Unstructured, Semi-structured and Structured, Sources of data, Working with
unstructured data, Evolution and Definition of big data, Characteristics and Need of
big data, Challenges of big data, Data environment versus big data environment
Unit II
Big data analytics: Overview of business intelligence, Data science and Analytics,
Meaning and Characteristics of big data analytics, Need of big data analytics,
Classification of analytics, Challenges to big data analytics, Importance of big data
analytics, Basic terminologies in big data environment
Unit III
Big data technologies and Databases: Introduction to NoSQL, Uses, Features and
Types, Need, Advantages, Disadvantages and Application of NoSQL, Overview of
NewSQL, Comparing SQL, NoSQL and NewSQL, Introduction to MongoDB and its
needs, Characteristics of MongoDB, Introduction of apache cassandra and its needs,
Characteristics of Cassandra
Unit IV
Hadoop foundation for analytics: History, Needs, Features, Key advantage and
Versions of Hadoop, Essential of Hadoop ecosystems, RDBMS versus Hadoop, Key
aspects and Components of Hadoop, Hadoop architectures
Unit V
Text Book
Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt. Ltd., 2016.
Reference Books
1. “Big Data” by Judith Hurwitz, Alan Nugent, Dr. Fern Halper and Marcia Kaufman, Wiley Publications, 2014.
2. “Big Data Imperatives: Enterprise Big Data Warehouse, BI Implementations and Analytics” by Soumendra Mohanty, Madhu Jagadeesh and Harsha Srivatsa, Apress Media, Springer Science + Business Media New York, 2013.
3. “Mining of Massive Datasets” by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman, Springer, July 2013.
4. “Hadoop: The Definitive Guide” by Tom White, O'Reilly Media, 2010.
BIG DATA ANALYTICS
UNIT-I
Big data:
What is Big Data?
Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. For example, the volume of data that Facebook or YouTube needs to collect and manage on a daily basis can fall under the category of Big Data. However, Big Data is not only about scale and volume; it also involves one or more of the following aspects − Velocity, Variety, Volume, and Complexity.
1) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets at varying speeds, and activity bursts.
3) Volume
We already know that Big Data indicates huge ‘volumes’ of data that are being generated on a daily basis from various sources like social media platforms, business processes, machines, networks, human interactions, etc. Such a large amount of data is stored in data warehouses.
Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It
refers to highly organized information that can be readily and seamlessly stored and accessed from a
database by simple search engine algorithms. For instance, the employee table in a company database will be structured, as the employee details, their job positions, their salaries, etc. will be present in an organized manner.
Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This makes
it very difficult and time-consuming to process and analyze unstructured data. Email is an example of
unstructured data.
Semi-structured
Semi-structured data pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), nevertheless contains vital information or tags that separate individual elements within the data.
The process of extracting and analyzing data amongst extensive big data sources is a complex
process and can be frustrating and time-consuming. These complications can be resolved if organizations
encompass all the necessary considerations of big data, take into account relevant data sources, and deploy
them in a manner which is well tuned to their organizational goals.
Before the modern day ubiquity of online and mobile applications, databases processed
straightforward, structured data. Data models were relatively simple and described a set of relationships
between different data types in the database.
Unstructured data, in contrast, refers to data that doesn’t fit neatly into the traditional row and
column structure of relational databases. Examples of unstructured data include: emails, videos, audio files,
web pages, and social media messages. In today’s world of Big Data, most of the data that is created is
unstructured with some estimates of it being more than 95% of all data generated.
As a result, enterprises are looking to this new generation of databases, known as NoSQL, to address
unstructured data. MongoDB stands as a leader in this movement with over 10 million downloads and
hundreds of thousands of deployments. As a document database with flexible schema, MongoDB was built
specifically to handle unstructured data. MongoDB’s flexible data model allows for development without a
predefined schema which resonates particularly when most of the data in your system is unstructured.
With the 1880 census, just counting people was not enough information for the U.S. government to
work with—particular elements, such as age, sex, occupation, education level, and even the “number of
insane people in household,” had to be accounted for. That information had intrinsic value to the process,
but only if it could be tallied, tabulated, analyzed, and presented. New methods of relating the data to other
data collected came into being, such as associating occupations with geographic areas, birth rates with
education levels, and countries of origin with skill sets.
The 1880 census truly yielded a mountain of data to deal with, yet only severely limited technology
was available to do any of the analytics. The problem of Big Data could not be solved for the 1880 census,
so it took over seven years to manually tabulate and report on the data.
Challenges of Big Data
Data Growth
Data growing at such a quick rate makes it a challenge to find insights from it. More and more data is generated every second, and the data that is actually relevant and useful has to be picked out for further analysis.
Storage
Such a large amount of data is difficult to store and manage by organizations without appropriate tools and technologies.
Syncing Across Data Sources
This implies that when organisations import data from different sources the data from one source might not
be up to date as compared to the data from another source.
Security
Huge amount of data in organisations can easily become a target for advanced persistent threats, so here lies
another challenge for organisations to keep their data secure by proper authentication, data encryption, etc.
Unreliable Data
Big data cannot be guaranteed to be 100 percent accurate. It might contain redundant or incomplete data, along with contradictions.
Miscellaneous Challenges
These are some other challenges that come forward while dealing with big data, like the integration of
data, skill and talent availability, solution expenses and processing a large amount of data in time and
with accuracy so that the data is available for data consumers whenever they need it.
Small Data versus Big Data
Definition
Small Data: Data that is ‘small’ enough for human comprehension, in a volume and format that makes it accessible, informative and actionable.
Big Data: Data sets that are so large or complex that traditional data processing applications cannot deal with them.
Data Source
Small Data: Data from traditional enterprise systems such as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM), financial data like general ledger data, and payment transaction data from websites.
Big Data: Purchase data from point-of-sale, clickstream data from websites, GPS stream data (mobility data sent to a server), and social media (Facebook, Twitter).
Velocity
Small Data: Controlled and steady data flow; data accumulation is slow.
Big Data: Data can arrive at very fast speeds; enormous data can accumulate within very short periods of time.
Time Variance
Small Data: Historical data is equally valid, as data represent solid business interactions.
Big Data: In some cases, data gets old soon (e.g. fraud detection).
Data Science
Data Science is the combination of statistics, mathematics, programming, problem-solving, capturing data in
ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning
the data.
In simple terms, it is the umbrella of techniques used when trying to extract insights and information from
data.
Applications of Data Science
Internet Search
Search engines make use of data science algorithms to deliver the best results for search queries in a
fraction of seconds.
Digital Advertisements
The entire digital marketing spectrum uses data science algorithms, from display banners to digital billboards. This is the main reason that digital ads get a higher click-through rate (CTR) than traditional advertisements.
Recommender Systems
Recommender systems not only make it easy to find relevant products from the billions of products available but also add a lot to the user experience. A lot of companies use this system to promote their
products and suggestions in accordance with the user’s demands and relevance of information. The
recommendations are based on the user’s previous search results.
Need of Big Data Analytics
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. In his
report Big Data in Big Companies, IIA Director of Research Tom Davenport interviewed more than 50
businesses to understand how they used big data. He found they got value in the following ways:
1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant
cost advantages when it comes to storing large amounts of data – plus they can identify more
efficient ways of doing business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined
with the ability to analyze new sources of data, businesses are able to analyze information
immediately – and make decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with big data
analytics, more companies are creating new products to meet customers’ needs.
Types of data analytics
There are 4 different types of analytics.
Descriptive analytics
Descriptive analytics answers the question of what happened. Let us bring an example from ScienceSoft’s
practice: having analyzed monthly revenue and income per product group, and the total quantity of metal
parts produced per month, a manufacturer was able to answer a series of ‘what happened’ questions and
decide on focus product categories.
Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the past.
However, these findings simply signal that something is wrong or right, without explaining why. For this
reason, our data consultants don’t recommend highly data-driven companies to settle for descriptive
analytics only, they’d rather combine it with other types of data analytics.
Diagnostic analytics
At this stage, historical data can be measured against other data to answer the question of why something
happened. For example, you can check ScienceSoft’s BI demo to see how a retailer can drill the sales and
gross profit down to categories to find out why they missed their net profit target. Another flashback to our
data analytics projects: in the healthcare industry, customer segmentation coupled with several filters
applied (like diagnoses and prescribed medications) allowed identifying the influence of medications.
Diagnostic analytics gives in-depth insights into a particular problem. At the same time, a company should have detailed information at its disposal; otherwise, data collection may turn out to be issue-specific and time-consuming.
Predictive analytics
Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic analytics
to detect clusters and exceptions, and to predict future trends, which makes it a valuable tool for forecasting.
Check ScienceSoft’s case study to get details on how advanced data analytics allowed a leading FMCG
company to predict what they could expect after changing brand positioning.
Predictive analytics belongs to advanced analytics types and brings many advantages like sophisticated
analysis based on machine or deep learning and proactive approach that predictions enable. However, our
data consultants state it clearly: forecasting is just an estimate, the accuracy of which highly depends on data
quality and stability of the situation, so it requires careful treatment and continuous optimization.
Prescriptive analytics
The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future
problem or take full advantage of a promising trend. An example of prescriptive analytics from our project
portfolio: a multinational company was able to identify opportunities for repeat purchases based on
customer analytics and sales history.
Prescriptive analytics uses advanced tools and technologies, like machine learning, business rules and
algorithms, which makes it sophisticated to implement and manage. Besides, this state-of-the-art type of
data analytics requires not only historical internal data but also external information due to the nature of
algorithms it’s based on. That is why, before deciding to adopt prescriptive analytics, ScienceSoft strongly
recommends weighing the required efforts against an expected added value.
Challenges to Big Data Analytics
As data sets become bigger and more diverse, there is a big challenge to incorporate them into an analytical platform. If this is overlooked, it will create gaps and lead to wrong messages and insights.
Analysis is what makes the voluminous amount of data being produced every minute useful. With the exponential rise of data, a huge demand for big data scientists and Big Data analysts has been created in the market. It is important for business organizations to hire data scientists with varied skills, as the job of a data scientist is multidisciplinary. Another major challenge faced by businesses is the shortage of professionals who understand Big Data analysis. There is a sharp shortage of data scientists in comparison to the massive amount of data being produced.
It is imperative for business organizations to gain important insights from Big Data analytics, and it is also important that only the relevant department has access to this information. A big challenge faced by companies in Big Data analytics is bridging this wide gap in an effective manner.
It is hardly surprising that data is growing with every passing day. This simply indicates that business
organizations need to handle a large amount of data on daily basis. The amount and variety of data available
these days can overwhelm any data engineer and that is why it is considered vital to make data accessibility
easy and convenient for brand owners and managers.
With the rise of Big Data, new technologies and companies are being developed every day. However, a big
challenge faced by the companies in the Big Data analytics is to find out which technology will be best
suited to them without the introduction of new problems and potential risks.
Business organizations are growing at a rapid pace. As companies and large business organizations grow, the amount of data produced increases. Storing this massive amount of data is becoming a real challenge for everyone. Popular data storage options like data lakes and warehouses are commonly used to gather and store large quantities of unstructured and structured data in its native format. The real problem arises when a data lake or warehouse tries to combine unstructured and inconsistent data from diverse sources; it then encounters errors. Missing data, inconsistent data, logic conflicts, and duplicate data all result in data quality challenges.
Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and opportunities. However, it also involves the potential risks associated with big data when it comes to the privacy and security of the data. The Big Data tools used for analysis and storage utilize data from disparate sources. This eventually leads to a high risk of exposure of the data, making it vulnerable. Thus, the rise of voluminous amounts of data increases privacy and security concerns.
The Importance of Big Data Analytics
Driven by specialized analytics systems and software, as well as high-powered computing systems, big data
analytics offers various business benefits, including:
Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and
other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of
data that are often left untapped by conventional BI and analytics programs. This encompasses a mix of
semi-structured and unstructured data -- for example, internet clickstream data, web server logs, social
media content, text from customer emails and survey responses, mobile phone records, and machine data
captured by sensors connected to the internet of things (IoT).
Basic Terminologies in the Big Data Environment
We will discuss the terminology related to the Big Data ecosystem. This will give you a complete understanding of Big Data and its terms.
Over time, Hadoop has become the nucleus of the Big Data ecosystem, where many new technologies have emerged and have been integrated with Hadoop. So it is important that, first, we understand and appreciate the nucleus of modern Big Data architecture.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data
sets across clusters of computers, using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and storage.
Components of the Hadoop Ecosystem
Let’s begin by looking at some of the components of the Hadoop ecosystem:
MapReduce:
MapReduce is a programming model specifically implemented for processing large data sets on a Hadoop cluster. This is the core component of the Hadoop framework, and it is the only execution engine available for Hadoop 1.0. The model consists of two functions:
1. A function called ‘Map’, which allows different points in the distributed cluster to distribute their work.
2. A function called ‘Reduce’, which is designed to reduce the final form of the clusters’ results into one output.
The main advantage of the MapReduce framework is its fault tolerance, where periodic reports from each
node in the cluster are expected as soon as the work is completed.
The MapReduce framework is inspired by the ‘Map’ and ‘Reduce’ functions used in functional
programming. The computational processing occurs on data stored in a file system or within a database,
which takes a set of input key values and produces a set of output key values.
Each day, numerous MapReduce programs and MapReduce jobs are executed on Google’s clusters.
Programs are automatically parallelized and executed on a large cluster of commodity machines.
MapReduce is used in distributed grep, distributed sort, Web link-graph reversal, Web access log stats, document clustering, machine learning and statistical machine translation.
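To make the Map and Reduce steps concrete, here is a minimal word-count sketch written against the classic Hadoop MapReduce Java API. It is only an illustration: the input and output HDFS paths are passed as program arguments and are assumptions, and the class must be packaged into a jar and submitted to a cluster (or run in local mode) to execute.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // 'Map' step: emit (word, 1) for every word in this node's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // 'Reduce' step: sum the counts emitted for each word into one output value.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (assumed)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (assumed)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}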
Pig:
Pig is a data flow language that allows users to write complex MapReduce operations in a simple scripting language. Pig then transforms those scripts into MapReduce jobs.
Hive:
Apache Hive data warehouse software facilitates querying and managing large datasets residing in
distributed storage. Hive provides a mechanism for querying the data using a SQL-like language called
HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their
custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
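As an illustration of HiveQL, the sketch below submits a query from Java through HiveServer2's JDBC interface. It is only a sketch under assumptions: the connection URL points at a local HiveServer2 with no authentication, the employees table is hypothetical, and the Hive JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, database and empty credentials are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL; Hive turns it into jobs that run over data in HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) AS emp_count "
                  + "FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString("department") + " : " + rs.getLong("emp_count"));
            }
        }
    }
}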
Sqoop:
Enterprises that use Hadoop often find it necessary to transfer some of their data from traditional relational
database management systems (RDBMSs) to the Hadoop ecosystem.
Sqoop, an integral part of Hadoop, can perform this transfer in an automated fashion. Moreover, the data
imported into Hadoop can be transformed with MapReduce before exporting them back to the RDBMS.
Sqoop can also generate Java classes for programmatically interacting with imported data.
Sqoop uses a connector-based architecture that allows it to use plugins to connect with external databases.
Flume:
Flume is a service for streaming logs into Hadoop. Apache Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop
Distributed File System (HDFS).
Storm:
Storm is a distributed, real-time computation system for processing large volumes of high-velocity data.
Storm is extremely fast and can process over a million records per second per node on a cluster of modest
size. Enterprises harness this speed and combine it with other data-access applications in Hadoop to prevent
undesirable events or to optimize positive outcomes.
Kafka:
Apache Kafka supports a wide range of use cases such as a general-purpose messaging system for scenarios
where high throughput, reliable delivery, and horizontal scalability are important. Apache Storm and
Apache HBase both work very well in combination with Kafka.
Oozie:
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. The Oozie Workflow jobs are
Directed Acyclical Graphs (DAGs) of actions, whereas the Oozie Coordinator jobs are recurrent Oozie
Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the
box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-
specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system.
Spark:
Apache Spark is a fast, in-memory data processing engine for distributed computing clusters like Hadoop. It
runs on top of existing Hadoop clusters and accesses the Hadoop data store (HDFS).
Spark can be integrated with Hadoop 2’s YARN architecture, but it cannot be used with Hadoop 1.0.
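For illustration, here is a minimal Spark word-count job written with the Java RDD API. It is a sketch under assumptions: the HDFS input and output paths are hypothetical, and the class is expected to be packaged and launched with spark-submit against a cluster that can reach HDFS.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read a text file stored in HDFS (path is an assumption).
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

            // Split lines into words, pair each word with 1, and sum the counts in memory.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile("hdfs:///user/demo/wordcount-output");
        }
    }
}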
Apache Solr:
Apache Solr is a fast, open-source Java search server. Solr enables you to easily create search engines that
search websites, databases, and files for Big Data
Apache Yarn:
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software
Foundation’s open-source distributed processing framework. Originally described by Apache as a
redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for
big data applications.
Tez:
Tez is an execution engine for Hadoop that allows jobs to meet the demands for fast response times and
extreme throughput at petabyte scale. Tez represents computations as dataflow graphs and can be used with Hadoop 2 YARN.
Apache Drill:
Apache Drill is an open-source, low-latency query engine for Hadoop that delivers secure, interactive SQL
analytics at petabyte scale. With the ability to discover schemas on the go, Drill is a pioneer in delivering
self-service data exploration capabilities on data stored in multiple formats in files or NoSQL databases. By
adhering to ANSI SQL standards, Drill does not require a learning curve and integrates seamlessly with
visualization tools.
Apache Phoenix:
Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and co-ordinates the
running of those scans to produce a regular JDBC result set. Apache Phoenix enables OLTP and operational
analytics in Hadoop for low-latency applications by combining the best of both worlds. Apache Phoenix is
fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and Map Reduce.
Cloud Computing:
Cloud Computing is a type of computing that relies on sharing computing resources rather than having local
servers or personal devices to handle applications. Cloud Computing is comparable to grid computing, a
type of computing where the unused processing cycles of all computers in a network are harnessed to solve
problems that are too processor-intensive for any single machine.
In Cloud Computing, the word cloud (also phrased as “the cloud”) is used as a metaphor for the Internet,
hence the phrase cloud computing means “a type of Internet-based computing” in which different services
such as servers, storage and applications are delivered to an organization’s computers and devices via the
Internet.
NoSQL:
The NoSQL database, also called Not Only SQL, is an approach to data management and database design
that’s useful for very large sets of distributed data. This database system is non-relational, distributed, open-
source and horizontally scalable. NoSQL seeks to solve the scalability and big-data performance issues that
relational databases weren’t designed to address.
Apache Cassandra:
Apache Cassandra is an open-source distributed database system designed for storing and managing large
amounts of data across commodity servers. Cassandra can serve as both a real-time operational data store
for online transactional applications and a read-intensive database for large-scale business intelligence (BI)
systems.
SimpleDB:
Amazon Simple Database Service (SimpleDB), also known as a key value data store, is a highly available
and flexible non-relational database that allows developers to request and store data, with minimal database
management and administrative responsibility.
This service offers simplified access to a data store and query functions that let users instantly add data and
effortlessly recover or edit that data.
SimpleDB is best used by customers who have a relatively simple data requirement, like data storage. For
example, a business might use cookies to track visitors that visit its company website. Some applications
might read the cookies to get the visitor’s identifier and look up the feeds they’re interested in. Amazon
SimpleDB gives users the option to store tens of attributes for a million customers, but not thousands of
attributes for a single customer.
Google BigTable:
Google’s BigTable is a distributed, column-oriented data store created by Google Inc. to handle very large
amounts of structured data associated with the company’s Internet search and Web services operations.
BigTable was designed to support applications requiring massive scalability; from its first iteration, the
technology was intended to be used with petabytes of data. The database was designed to be deployed on
clustered systems and uses a simple data model that Google has described as “a sparse, distributed,
persistent multidimensional sorted map.” Data is assembled in order by row key, and indexing of the map is
arranged according to row, column keys, and timestamps. Here, compression algorithms help achieve high
capacity.
MongoDB:
MongoDB is developed by MongoDB Inc. and is published as free and open-source software under a
combination of the GNU Affero General Public License and the Apache License. As of July 2015,
MongoDB is the fourth most popular type of database management system, and the most popular for
document stores.
HBase:
Apache HBase (Hadoop DataBase) is an open-source NoSQL database that runs on the top of the database
and provides real-time read/write access to those large data sets.
HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily
combines data sources that use a wide variety of different structures and schema. HBase is natively
integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
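A small sketch of the HBase Java client API performing a random write and read by row key. The details are assumptions: an employee table with an info column family is presumed to exist already, and hbase-site.xml (pointing at the cluster) is presumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Carlos Smith"));
            table.put(put);

            // Real-time random read of the same row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}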
Neo4j:
Neo4j is a graph database management system developed by Neo Technology, Inc. Neo4j is described by its
developers as an ACID-compliant transactional database with native graph storage and processing.
According to db-engines.com, Neo4j is the most popular graph database.
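As a hedged illustration of the graph model, the sketch below uses the Neo4j Java driver and the Cypher query language to create and then traverse a relationship. The bolt URL and the neo4j/password credentials are assumptions for a local instance.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import static org.neo4j.driver.Values.parameters;

public class Neo4jSketch {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Nodes and the relationship between them are created directly; no join table is needed.
            session.run("CREATE (a:Person {name: $a})-[:FRIEND_OF]->(b:Person {name: $b})",
                        parameters("a", "Alice", "b", "Bob"));

            // Traversing the relationship is a single pattern match.
            Result result = session.run(
                    "MATCH (a:Person {name: $a})-[:FRIEND_OF]->(b) RETURN b.name AS friend",
                    parameters("a", "Alice"));
            while (result.hasNext()) {
                System.out.println(result.next().get("friend").asString());
            }
        }
    }
}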
Couch DB:
CouchDB is a database that completely embraces the web. It stores your data with JSON documents. It
accesses your documents and queries your indexes with your web browser, via HTTP. It indexes, combines,
and transforms your documents with JavaScript.
CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of
CouchDB. You can distribute your data, or your apps, efficiently using CouchDB’s incremental replication.
CouchDB supports master-master setups with automatic conflict detection.
UNIT-III
Big Data Technologies and Databases:
Introduction to NoSQL:
What are NoSQL Databases?
NoSQL is commonly referred to as not only SQL, non-SQL, or non-relational databases. These non-SQL databases are built for extremely high throughput using key-value pairs, in contrast to relational databases with their rigid dependencies. Using loose dependencies and quick indexes, NoSQL databases are well suited for streaming analytics and IoT applications because data can quickly be stored and referenced.
Uses of NoSQL
Storing large volumes of data without structure: a NoSQL database doesn't limit storable data types.
Using cloud computing and storage: cloud-based storage is a great solution, but it requires data to be easily spread across multiple servers for scaling.
Rapid development.
Features of NoSQL
Non-relational
Schema-free
Simple API
Offers easy-to-use interfaces for storage and querying of data
APIs allow low-level data manipulation and selection methods
Text-based protocols mostly used, with HTTP REST and JSON
Mostly no standards-based query language
Web-enabled databases running as internet-facing services
Distributed
Types of NoSQL Databases
There are mainly four categories of NoSQL databases. Each of these categories has its unique attributes and limitations. No single database is better for solving all problems. You should select a database based on your product needs.
Let us see all of them:
Key-Value Pair Based
Data is stored in key/value pairs. It is designed in such a way as to handle lots of data and a heavy load.
Key-value pair storage databases store data as a hash table where each key is unique, and the value can be a
JSON, BLOB(Binary Large Objects), string, etc.
For example, a key-value pair may contain a key like "Website" associated with a value like "Guru99".
It is one of the most basic types of NoSQL databases. This kind of NoSQL database is used as a collection,
dictionaries, associative arrays, etc. Key value stores help the developer to store schema-less data. They
work best for shopping cart contents.
Redis, Dynamo, Riak are some examples of key-value store DataBases. They are all based on Amazon's
Dynamo paper.
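As a small illustration of the key-value model, the sketch below stores and reads back the "Website"/"Guru99" pair from above using Redis through the Jedis Java client; the localhost connection is an assumption.

import redis.clients.jedis.Jedis;

public class KeyValueSketch {
    public static void main(String[] args) {
        // Connect to a Redis server assumed to be running locally on the default port.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Each key is unique; the value here is a plain string.
            jedis.set("Website", "Guru99");
            System.out.println(jedis.get("Website")); // prints Guru99
        }
    }
}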
Column-based
Column-oriented databases work on columns and are based on BigTable paper by Google. Every column is
treated separately. Values of single column databases are stored contiguously.
Column based NoSQL database
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc. as the data is
readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, library card catalogs, etc.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part is stored as a
document. The document is stored in JSON or XML formats. The value is understood by the DB and can be
queried.
In a relational database, data is organized into rows and columns, whereas a document database stores data in a structure similar to JSON. For a relational database, you have to know in advance what columns you have, and so on. For a document database, however, you store data as JSON-like objects, and you do not need to define the structure beforehand, which makes it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications. It should not be used for complex transactions which require multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMS systems.
Graph-Based
A graph-type database stores entities as well as the relations amongst those entities. The entity is stored as a node with the relationship as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier.
Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast, as they are already captured in the database, and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics and spatial data.
Advantages of NoSQL
Can be used as Primary or Analytic Data Source
Big Data Capability
No Single Point of Failure
Easy Replication
No Need for Separate Caching Layer
It provides fast performance and horizontal scalability.
Can handle structured, semi-structured, and unstructured data with equal effect
Object-oriented programming which is easy to use and flexible
NoSQL databases don't need a dedicated high-performance server
Support Key Developer Languages and Platforms
Simpler to implement than RDBMS
It can serve as the primary data source for online applications.
Handles big data which manages data velocity, variety, volume, and complexity
Excels at distributed database and multi-data center operations
Eliminates the need for a specific caching layer to store data
Offers a flexible schema design which can easily be altered without downtime or service disruption
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like consistency when multiple transactions are
performed simultaneously.
When the volume of data increases it is difficult to maintain unique values as keys become difficult
Doesn't work as well with relational data
The learning curve is steep for new developers
Open-source options are not so popular for enterprises
NoSQL Applications
Facebook messaging platform
Apache Cassandra was created by Facebook to power their Inbox. It did this for a number of years.
Cassandra worked by doing the following:
Cassandra indexed users’ messages and the terms (words, and so on) in the messages and drove a
search over all the content in those messages. The user ID was the primary key. Each term became a
super column, and the message IDs were the column names.
Amazon DynamoDB
Amazon originally published the Dynamo paper, thereby launching the concept of NoSQL key-value stores.
Since then, Amazon has created a separate database called DynamoDB as a service offered on the Amazon
Web Services marketplace site.
Google Mail
Google’s Bigtable was created to provide wide-column storage for a range of Google’s applications,
including Orkut, Google Earth, web indexing, Google Maps, Google Books, YouTube, blogger.com,
Google Code and Google Mail.
LinkedIn
LinkedIn has used Hadoop to churn information about relationships overnight and to push the latest graph information to the Voldemort key-value NoSQL store for querying the next day. In this way, LinkedIn maintained a rolling view of all data in the service.
HealthCare.gov
Healthcare.gov has been called the most complex IT system implementation of all time. Building it
required several systems, with the most visible one being the HealthCare.gov marketplace.
Citizen engagement
Governments use NoSQL databases to empower citizens with information about how their country is
governed. A good example is Fairfax County in Virginia, which uses MarkLogic Server to provide
geospatial information through an online browse and search interface to government agencies and residents.
The service covers a range of information — for example, geographic points in the county and police-
related events.
Overview of NewSQL
While various variants of the NoSQL database continue to be used, there is another paradigm arising in
parallel to NoSQL — NewSQL. NewSQL promises to combine benefits from RDBMS (strong consistency)
with benefits from NoSQL (scalability); it mainly achieves this through new architecture patterns and more efficient distributed engine designs.
Most current NewSQL databases are based on Google’s Spanner database and the theories in academic
papers such as Calvin: Fast Distributed Transactions for Partitioned Database Systems from Yale. Spanner is
Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It was the first
system to distribute data at global scale and support externally-consistent distributed transactions.
The Calvin academic paper from Yale was on leveraging determinism to guarantee active replication and full ACID compliance for distributed transactions.
Comparing SQL, NoSQL and NewSQL
Scaling: SQL - vertical scaling; NoSQL - horizontal scaling; NewSQL - vertical + horizontal scaling
Distributed databases: SQL - No; NoSQL - Yes; NewSQL - Yes
What Is MongoDB?
Instead of storing data in tables of rows and columns like SQL databases, MongoDB stores each record as a document described in JSON, a lightweight data format. For example:
{
"name" : "Carlos Smith",
"title" : "Product Manager",
"location" : "New York, NY",
"twitter" : "@MongoDB",
"facebook" : "@MongoDB"
}
Document databases are extremely flexible, allowing variations in the structure of documents and allowing
storage of documents that are partially complete. One document can have others embedded in it.
Fields in a document play the role of columns in a SQL database, and like columns, they can be indexed to
increase search performance.
Best of all for many developers, the programmer can change the structure of the database easily as needs
change. Some say this turns data into code.
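A minimal sketch with the MongoDB Java (sync) driver that inserts the document shown above and queries it back, without declaring any schema first. The connection string and the database and collection names are assumptions for a local mongod instance.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class MongoQuickstart {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("company");
            MongoCollection<Document> people = db.getCollection("people");

            // Insert a document mirroring the JSON example above; no schema is defined first.
            Document doc = new Document("name", "Carlos Smith")
                    .append("title", "Product Manager")
                    .append("location", "New York, NY")
                    .append("twitter", "@MongoDB")
                    .append("facebook", "@MongoDB");
            people.insertOne(doc);

            // Query it back by a field value.
            Document found = people.find(eq("name", "Carlos Smith")).first();
            System.out.println(found.toJson());
        }
    }
}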
From its founding, MongoDB was built on a scale-out architecture, a structure that allows many small
machines to work together to create systems that are fast and handle huge amounts of data.
MongoDB has always focused on providing developers an excellent user experience, which, in addition to
all its other properties, has made MongoDB a favorite of developers worldwide for a huge variety of
applications.
MongoDB is a document database built on a scale-out architecture that has become popular with developers
of all kinds who are building scalable applications using agile methodologies.
MongoDB was built for people who are building internet and business applications who need to evolve
quickly and scale elegantly. If you are doing that, you should consider MongoDB.
The document data model is a powerful way to store and retrieve data that allows developers to
move fast.
MongoDB’s horizontal, scale-out architecture can support huge volumes of both data and traffic.
MongoDB has a great user experience for developers who can install MongoDB and start writing
code immediately.
MongoDB has developed a large and mature platform ecosystem.
Why use MongoDB? Simply to go further and faster when developing software applications that have to
handle data of all sorts in a scalable way.
Thousands of companies like Bosch, Barclays, and Morgan Stanley run their businesses on MongoDB, and
use it to handle their most demanding apps in areas like IoT, Gaming, Logistics, Banking, e-Commerce, and
Content Management.
MongoDB is a good fit for applications that need to:
Represent data with natural clusters and variability over time or in its structure.
Store, manage, and search data with text, geospatial, or time series dimensions.
MongoDB as a company has grown because the number of use cases with these characteristics keeps growing.
Characteristics of MongoDB
High Performance
Support for embedded data models reduces I/O activity on database system.
Indexes support faster queries and can include keys from embedded documents and arrays.
Rich Query Language
MongoDB supports a rich query language to support read and write operations (CRUD) as well as:
Data Aggregation
Text Search and Geospatial Queries.
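For example, a simple aggregation with the Java driver that groups documents by their title field and counts them. It is a sketch under the same assumed local connection and hypothetical collection as the earlier MongoDB example.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import java.util.Arrays;
import org.bson.Document;

public class MongoAggregationSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> people =
                    client.getDatabase("company").getCollection("people");

            // Group documents by "title" and count how many share each title.
            people.aggregate(Arrays.asList(
                    Aggregates.group("$title", Accumulators.sum("count", 1))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}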
High Availability
MongoDB’s replication facility, called a replica set, provides:
automatic failover
data redundancy
A replica set is a group of MongoDB servers that maintain the same data set, providing redundancy and increasing data availability.
Horizontal Scalability
Sharding distributes data across a cluster of machines.
In addition, MongoDB provides a pluggable storage engine API that allows third parties to develop storage engines for MongoDB.
Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below are some of
the features of Cassandra:
Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.
Always on architecture − Cassandra has no single point of failure and it is continuously available
for business-critical applications that cannot afford a failure.
Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as
you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage − Cassandra accommodates all possible data formats including: structured,
semi-structured, and unstructured. It can dynamically accommodate changes to your data structures
according to your need.
Easy data distribution − Cassandra provides the flexibility to distribute data where you need by
replicating data across multiple data centers.
Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and
Durability (ACID).
Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly
fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.
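A hedged sketch using the DataStax Java driver and CQL against a single local Cassandra node. The contact point defaults, the local data center name "datacenter1", and the demo keyspace and table are assumptions for illustration only.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraSketch {
    public static void main(String[] args) {
        // Connects to 127.0.0.1:9042 by default; the data center name is an assumption.
        try (CqlSession session = CqlSession.builder()
                .withLocalDatacenter("datacenter1")
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");

            // Writes in Cassandra are fast appends to the commit log and memtable.
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Carlos Smith')");

            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
            Row row = rs.one();
            System.out.println(row.getString("name"));
        }
    }
}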
History of Cassandra
Cassandra was originally developed at Facebook to power its inbox search feature. It was open-sourced in 2008 and became a top-level Apache project in 2010.
UNIT-IV
HADOOP
Hadoop is an Apache open source framework written in java that allows distributed processing of large
datasets across clusters of computers using simple programming models. The Hadoop framework
application works in an environment that provides distributed storage and computation across clusters of
computers. Hadoop is designed to scale up from single server to thousands of machines, each offering local
computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers, namely:
Processing/Computation layer (MapReduce)
Storage layer (Hadoop Distributed File System, HDFS)
Hadoop is an open source framework overseen by Apache Software Foundation which is written in Java for
storing and processing of huge datasets with the cluster of commodity hardware. There are mainly two
problems with the big data. First one is to store such a huge amount of data and the second one is to process
that stored data. The traditional approach like RDBMS is not sufficient due to the heterogeneity of the data.
So Hadoop comes as the solution to the problem of big data i.e. storing and processing the big data with
some extra capabilities. There are mainly two components of Hadoop which are Hadoop Distributed File
System (HDFS) and Yet Another Resource Negotiator(YARN).
Hadoop History
Hadoop was started by Doug Cutting and Mike Cafarella in the year 2002, when they both started to work on the Apache Nutch project. The Apache Nutch project aimed to build a search engine system that could index 1 billion pages. After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. So, they realized that their project architecture would not be capable of handling billions of pages on the web. They were therefore looking for a feasible solution which could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
In 2003, they came across a paper that described the architecture of Google’s distributed file system,
called GFS (Google File System) which was published by Google, for storing the large data sets. Now they
realize that this paper can solve their problem of storing very large files which were being generated
because of web crawling and indexing processes. But this paper was just the half solution to their problem.
In 2004, Google published one more paper on the technique MapReduce, which was the solution of
processing those large datasets. Now this paper was another half solution for Doug Cutting and Mike
Cafarella for their Nutch project. These both techniques (GFS & MapReduce) were just on white paper at
Google. Google didn’t implement these two techniques. Doug Cutting knew from his work on Apache
Lucene ( It is a free and open-source information retrieval software library, originally written in Java by
Doug Cutting in 1999) that open-source is a great way to spread the technology to more people. So, together
with Mike Cafarella, he started implementing Google’s techniques (GFS & MapReduce) as open-source in
the Apache Nutch project.
In 2005, Cutting found that Nutch is limited to only 20-to-40 node clusters. He soon realized two problems:
(a) Nutch wouldn’t achieve its potential until it ran reliably on the larger clusters
(b) And that was looking impossible with just two people (Doug Cutting & Mike Cafarella).
The engineering task in the Nutch project was much bigger than he realized. So he started to look for a company that was interested in investing in their efforts, and he found Yahoo!. Yahoo had a large team of engineers that was eager to work on this project.
So in 2006, Doug Cutting joined Yahoo along with the Nutch project. He wanted to provide the world with an open-source, reliable, scalable computing framework, with the help of Yahoo. So at Yahoo he first separated the distributed computing parts from Nutch and formed a new project, Hadoop. (He gave it the name Hadoop after a yellow toy elephant owned by Doug Cutting’s son; it was easy to pronounce and was a unique word.) He then wanted to make Hadoop work well on thousands of nodes, so with GFS and MapReduce he started to work on Hadoop.
In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
In January of 2008, Yahoo released Hadoop as an open source project to ASF(Apache Software
Foundation). And in July of 2008, Apache Software Foundation successfully tested a 4000 node cluster
with Hadoop.
In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. Doug Cutting then left Yahoo and joined Cloudera to take up the challenge of spreading Hadoop to other industries.
In December of 2011, the Apache Software Foundation released Apache Hadoop version 1.0.
Later, in August 2013, version 2.0.6 was available.
And currently we have Apache Hadoop version 3.0, which was released in December 2017.
Advantages of Hadoop:
1. Scalable
Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across
hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database systems
(RDBMS) that can’t scale to process large amounts of data, Hadoop enables businesses to run applications
on thousands of nodes involving many thousands of terabytes of data.
2. Cost effective
Hadoop also offers a cost effective storage solution for businesses’ exploding data sets. The problem with
traditional relational database management systems is that it is extremely cost prohibitive to scale to such a
degree in order to process such massive volumes of data. In an effort to reduce costs, many companies in the
past would have had to down-sample data and classify it based on certain assumptions as to which data was
the most valuable. The raw data would be deleted, as it would be too cost-prohibitive to keep. While this
approach may have worked in the short term, this meant that when business priorities changed, the complete
raw data set was not available, as it was too expensive to store.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data (both
structured and unstructured) to generate value from that data. This means businesses can use Hadoop to
derive valuable business insights from data sources such as social media, email conversations. Hadoop can
be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing,
market campaign analysis and fraud detection.
4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data wherever it
is located on a cluster. The tools for data processing are often on the same servers where the data is located,
resulting in much faster data processing. If you’re dealing with large volumes of unstructured data, Hadoop
is able to efficiently process terabytes of data in just minutes, and petabytes in hours.
5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is
also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy
available for use.
Disadvantages of Hadoop:
As the backbone of so many implementations, Hadoop is almost synonymous with big data.
1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever is managing the platform lacks the know-how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, even though encryption is a major selling point for government agencies and others that prefer to keep their data under wraps.
2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is
written almost entirely in Java, one of the most widely used yet controversial programming languages in
existence. Java has been heavily exploited by cybercriminals and as a result, implicated in numerous
security breaches.
5. General Limitations
The article introduces Apache Flume, MillWheel, and Google’s own Cloud Dataflow as possible solutions. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point the article stresses is that companies could be missing out on big benefits by using Hadoop alone.
Hadoop Versions:
Hadoop is an open-source software framework that stores data across a distributed network rather than on a centralized one, and thereby processes the data in parallel. This enables Hadoop to act as one of the most reliable batch processing engines, with layered storage and resource management. As the data being stored and processed increases in complexity, so does Hadoop, and the developers bring out various versions to address issues (bug fixes) and simplify the complex data processes. The updates are automatically implemented as Hadoop development follows the trunk (base code) – branch (fix) model. Hadoop has two major versions: a) Hadoop 1.x (Version 1) and b) Hadoop 2 (Version 2).
Below are the two Hadoop Versions:
Hadoop 1.x (Version 1)
Hadoop 2 (Version 2)
1. Hadoop 1.x
Below are the Components of Hadoop 1.x
1. The Hadoop Common Module is a jar file which acts as the base API on top of which all the other
components work.
2. Version one, being the first to come into existence, is rock solid and gets no new updates.
3. It has a limitation on scaling nodes, with a maximum of just 4000 nodes per cluster.
4. The functionality is limited by the slot concept, i.e., the slots are capable of running either a map task or a reduce task.
5. The next component is the Hadoop Distributed File System, commonly known as HDFS, which plays the role of a distributed storage system designed to cater to large data, with a block size of 64 MegaBytes (64 MB) supporting the architecture. It is further divided into two components:
Name Node, which is used to store metadata about the Data Nodes and is placed with the Master Node. It contains details about the slave nodes, such as their indexing and respective locations.
Data Nodes, used for storage of data related to the applications in use, placed in the Slave Nodes.
6. Hadoop 1 uses the MapReduce (MR) data processing model. It is not capable of supporting other, non-MR tools.
The Job Tracker is used to assign or reassign MapReduce tasks (in case a task fails or is shut down) to an application called the Task Tracker, which is located in the node clusters. It additionally maintains a record of the jobs in progress. The Task Tracker is responsible for executing the tasks which have been allocated by the Job Tracker and for sending status reports of those tasks back to the Job Tracker.
7. The network of the cluster is formed by organizing the master node and slave nodes. The cluster is further divided into racks, which contain a set of commodity computers or nodes.
8. Whenever a large storage operation for a big data set is received by the Hadoop system, the data is divided into decipherable and organized blocks that are distributed to different nodes.
2. Hadoop Version 2
Version 2 of Hadoop was released to provide improvements over the limitations which users faced with version 1. Let’s throw some light on the improvements that the new version provides:
HDFS Federation has been introduced to provide horizontal scalability for the Name Node. Moreover, where the Name Node used to be a single point of failure, it is now available on multiple points. The Hadoop stack has also been extended to include components such as Hive and Pig, which work well on top of it.
YARN, which stands for Yet Another Resource Negotiator, has been introduced, with the ability to process data at the larger scale of petabytes and terabytes and to make it available in HDFS for applications which are not MapReduce based. These include applications like MPI and GIRAPH.
Version 2.7.x – Released on 31st May 2018: The update focused on providing two major functionalities, namely per-application management and a global resource manager, thereby improving overall utility and versatility and increasing scalability up to 10,000 nodes per cluster.
Version 2.8.x – Released in September 2018: The update provided improvements including the capacity scheduler, which is designed to provide multi-tenancy support for processing data over Hadoop, and it has been made accessible for Windows users so that there is an increase in the rate of adoption of the software across the industry for dealing with problems related to big data.
Version 3
Version 3.1.x – Released on 21 October 2019: This update enables Hadoop to be utilized as a platform to serve a large set of data analytics functions and utilities, including event processing. It also improves the container concept, which enables Hadoop to run generic workloads, and addresses earlier limitations such as restricted multi-tenancy support across data nodes, support for only MapReduce processing, and the need for alternate data storage for real-time processing and graphical analysis.
With the ever-increasing avalanche of data, and with big data analytics for business alone standing at an estimated 169 billion US dollars and predicted to grow to 274 billion dollars by 2022, the market is growing rapidly.
This all the more calls for a system that can keep pace with the abundant data that is growing day by day. Hadoop is a great solution that works to store, process and access this heterogeneous set of data, which can be unstructured or structured, in an organized manner.
With constant updates that rectify the bugs developers encounter while using Hadoop, and with improved versions that widen the scope of application and increase the flexibility of using Hadoop, it stands a strong chance of becoming the leading tool for all functions related to big data.
Hadoop Ecosystem
Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. For those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets that reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems. It includes Apache projects and various commercial tools and solutions. There are four
major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or
solutions are used to supplement or support these major elements. All these tools work collectively to
provide services such as the ingestion, analysis, storage and maintenance of data.
The components that collectively form the Hadoop ecosystem are HDFS, YARN, MapReduce, Spark, Pig, Hive, HBase, Mahout, Solr/Lucene, Zookeeper and Oozie; each is described below.
Note: Apart from the above-mentioned components, there are many other components too that are part of
the Hadoop ecosystem.
All these toolkits or components revolve around one thing, i.e. data. That is the beauty of Hadoop: it revolves around data, and hence makes its processing easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large
data sets of structured or unstructured data across various nodes and thereby maintaining the metadata
in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
Name Node is the prime node and contains metadata (data about data), requiring comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of
the system.
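To make the NameNode/DataNode division concrete, the following is a minimal, assumed Java sketch (not part of the textbook material) of a client that writes a small file to HDFS and reads it back through Hadoop's FileSystem API; the cluster URI and the file path are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; set here only for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");   // hypothetical cluster URI

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");            // hypothetical path

        // Write: the NameNode records the file's metadata, the DataNodes store its blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from HDFS");
        }

        // Read the same file back: the NameNode is consulted to locate the blocks.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}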
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers work on the allocation of resources such as CPU, memory and bandwidth per machine and later report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry out the processing logic and helps to write applications that transform big data sets into manageable ones.
MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map() generates a key-value-pair-based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
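As an illustration of the Map() and Reduce() roles described above, here is a minimal WordCount-style sketch using the standard Hadoop MapReduce Java API. It is an assumed example, not taken from the textbook: the mapper emits (word, 1) pairs and the reducer sums the counts for each word. Both classes are assumed to live in one source file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): splits each input line into words and emits (word, 1) for every token.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce(): receives all the 1s grouped under a word key and sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}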
PIG:
Pig was originally developed by Yahoo. It works with the Pig Latin language, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of MapReduce are
taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way
Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable as it allows real-time processing and batch processing both. Also, all the SQL
datatypes are supported by Hive thus, making the query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with the ODBC drivers, works on establishing the data storage permissions and the connection, whereas the HIVE command line helps in the processing of queries.
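Since HIVE exposes JDBC drivers as mentioned above, a hedged sketch of querying Hive from Java over JDBC might look as follows; the host name, port, database, table and the HQL query are all hypothetical, and the example assumes the Hive JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Explicit registration of the HiveServer2 driver; newer drivers register automatically.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 typically listens on port 10000; host and database are placeholders.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             // Hypothetical employees table; the HQL is standard SQL-style aggregation.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}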
Mahout:
Mahout provides machine learnability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environment interaction or on the basis of algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking algorithms as
per our need with the help of its own libraries.
Apache Spark:
It is a platform that handles all the process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions and visualization.
It consumes in-memory resources and is hence faster than the earlier, disk-based processing in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data or batch processing; hence both are used in most companies, interchangeably, depending on the task.
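For a feel of how Spark's in-memory processing is driven from code, here is a small, assumed sketch using Spark's Java API that counts the lines of a hypothetical HDFS file containing the word ERROR; it runs in local mode purely for illustration.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkLineCountSketch {
    public static void main(String[] args) {
        // local[*] runs Spark in-process; on a cluster the master would be YARN or similar.
        SparkConf conf = new SparkConf().setAppName("line-count-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input path; the RDD is held in memory across operations.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("lines containing ERROR: " + errors);
        }
    }
}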
Apache HBase:
It is a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides capabilities similar to Google's BigTable and is therefore able to work on big data sets effectively.
At times we need to search for or retrieve a few small occurrences within a huge database, and such a request must be processed within a short span of time. At such times HBase comes in handy, as it gives us a fast, fault-tolerant way of storing and looking up this kind of sparse data.
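The following is a minimal, assumed sketch of the HBase Java client performing the kind of quick row lookup described above; the table name "users" and column family "info" are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml if present
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Random read by row key -- the fast lookup pattern HBase is designed for.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}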
Other Components: Apart from all of these, there are some other components too that carry out a huge task
in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene in particular is a Java library that also provides a spell-check mechanism; Solr is built on top of Lucene.
Zookeeper: There was a huge issue of management of coordination and synchronization among the
resources or the components of Hadoop which resulted in inconsistency, often. Zookeeper overcame
all the problems by performing synchronization, inter-component based communication, grouping, and
maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
Hadoop vs RDBMS
The Hadoop software framework works very well with structured, semi-structured and unstructured data. It also supports a variety of data formats in real time, such as XML, JSON and text-based flat-file formats. An RDBMS works efficiently when there is a perfectly defined entity-relationship flow; otherwise the database schema or structure can grow unmanageable. In other words, an RDBMS works well with structured data. Hadoop is a good choice in environments that require big data processing where the data being processed does not have dependable relationships.
Hadoop Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed data processing using MapReduce
and HDFS methods.
NameNode:
The NameNode represents every file and directory used in the namespace.
DataNode:
A DataNode helps you to manage the state of an HDFS storage node and allows you to interact with its blocks.
MasterNode:
The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster that allow you to store data and to conduct complex calculations. Moreover, every slave node comes with a TaskTracker and a DataNode, which synchronize their processes with the JobTracker and the NameNode respectively.
As big data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for the analysis of big data. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This is called the data locality concept, and it helps increase the efficiency of Hadoop-based applications.
• Scalability
HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes and thus allows for
the growth of Big Data. Also, scaling does not require modifications to application logic.
• Fault Tolerance
The HADOOP ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the
event of a cluster node failure, data processing can still proceed by using data stored on another cluster
node.
UNIT –V
Introduction to MapReduce:
What is MapReduce?
MapReduce is a programming model for writing applications that can process Big Data in parallel on
multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes of complex data. Traditional systems, which funnel data through a single centralized server, create a bottleneck when processing such volumes; Google solved this bottleneck using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.
An Introduction to YARN
This section gives a top-down introduction to YARN. YARN allows the integration of frameworks such as Spark and HAMA into Hadoop, expanding the popular big data tool beyond MapReduce.
YARN/Hadoop 2.x has a completely different architecture compared to Hadoop 1.x, separating two functions that were previously combined:
1. Resource management
2. Job scheduling/Job monitoring
Recall that in Hadoop 1.x, there was a single JobTracker per Hadoop cluster serving these functions in
which scaling can overwhelm the JobTracker. Also, having a single JobTracker makes it a single point of
failure; if the JobTracker goes down, the entire cluster goes down with all the current jobs.
Before YARN, Hadoop was designed to support MapReduce jobs only. As time went by, people
encountered Big Data computation problems that cannot be addressed by MapReduce and thus came up
with different frameworks which work on top of HDFS to address their problems. Some of these were:
Apache Spark
Apache HAMA
Apache Giraph.
YARN provides a way for these new frameworks to be integrated into Hadoop framework, sharing the same
underlying HDFS. YARN enables Hadoop to handle jobs beyond MapReduce.
YARN components
YARN divides the responsibilities of JobTracker into separate components, each having a specified task
to perform. In Hadoop-1, the JobTracker takes care of resource management, job scheduling, and job
monitoring. YARN divides these responsibilities of JobTracker into ResourceManager and
ApplicationMaster. Instead of TaskTracker, it uses NodeManager as the worker daemon for execution of
map-reduce tasks. The ResourceManager and the NodeManager form the computation framework for
YARN, and ApplicationMaster is an application-specific framework for application management.
ResourceManager
A ResourceManager is a per cluster service that manages the scheduling of compute resources to
applications. It optimizes cluster utilization in terms of memory, CPU cores, fairness, and SLAs. To allow
different policy constraints, it supports pluggable schedulers, such as the Capacity Scheduler and the Fair Scheduler, that allow resources to be allocated in a particular way.
ApplicationsManager (AsM): This is a service used to manage ApplicationMasters across the cluster. It is responsible for accepting application submissions, providing the resources for the ApplicationMaster to start, monitoring the application's progress, and restarting the ApplicationMaster in case of application failure.
NodeManager
The NodeManager is a per node worker service that is responsible for the execution of containers based
on the node capacity. Node capacity is calculated based on the installed memory and the number of CPU
cores. The NodeManager service sends a heartbeat signal to the ResourceManager to update its health
status. The NodeManager service is similar to the TaskTracker service in MapReduce v1. NodeManager
also sends the status to ResourceManager, which could be the status of the node on which it is running or
the status of tasks executing on it.
ApplicationMaster
The ApplicationMaster is a per-application service: it negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the application's tasks.
Container
A container is a logical bundle of resources in terms of memory, CPU, disk, and so on that is bound to a
particular node. In the first version of YARN, a container is equivalent to a block of memory. The
ResourceManager scheduler service dynamically allocates resources as containers. A container grants
rights to an ApplicationMaster to use a specific amount of resources of a specific host. An
ApplicationMaster is considered as the first container of an application and it manages the execution of
the application logic on allocated containers.
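To see the ResourceManager and NodeManager roles from a client's point of view, here is a small, assumed sketch using the YARN client API; it asks the ResourceManager for the applications it knows about and for the NodeManagers currently running, and it presumes the cluster configuration (yarn-site.xml) is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfoSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Applications known to the ResourceManager; each has its own ApplicationMaster.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " : " + app.getYarnApplicationState());
        }

        // NodeManagers currently reporting as running, with their advertised capacity.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }
        yarnClient.stop();
    }
}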
Advantages of YARN
The architecture of YARN ensures that the Hadoop cluster can be enhanced in the following ways:
Multi-tenancy
YARN lets various proprietary and open-source engines use Hadoop as a standard for real-time, interactive and batch processing tasks that are able to access and parse the same dataset.
Cluster Utilization
YARN lets you use the Hadoop cluster in a dynamic way, rather than in a static manner by which
MapReduce applications were using it, and this is a better and optimized way of utilizing the cluster.
Scalability
YARN gives the power of scalability to the Hadoop cluster. YARN ResourceManager (RM) service is the
central controlling authority for resource management and it makes allocation decisions.
Compatibility
YARN is highly compatible with existing Hadoop MapReduce applications, so projects that work with MapReduce in Hadoop 1.0 can easily move to Hadoop 2.0 with YARN without any difficulty, ensuring complete compatibility.
MapReduce Applications
Here are a few examples of big data problems that can be solved with the MapReduce framework:
1. Given a repository of text files, find the frequency of each word. This is called the WordCount problem (a driver sketch for this problem appears after this list).
2. Given a repository of text files, find the number of words of each word length.
5. Given a symmetric graph whose nodes represent people and edges represent friendship, compile a
list of common friends.
6. Given a symmetric graph whose nodes represent people and edges represent friendship, compute
the average number of friends by age.
7. Given a repository of weather records, find the annual global minima and maxima by year.
8. Sort a large list. Note that in most implementations of the MapReduce framework, this problem is
trivial, because the framework automatically sorts the output from the map() function.
9. Reverse a graph.
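For problem 1 above, the mapper and reducer sketched in the MapReduce section of the Hadoop ecosystem can be wired together by a driver class like the following assumed sketch; the input and output paths come from the command line, and the names TokenizerMapper and IntSumReducer refer to that earlier sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // Map(): emit (word, 1)
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on map side
        job.setReducerClass(IntSumReducer.class);    // Reduce(): sum the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (argument)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (argument)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}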
Data serialization
Data serialization is the process of converting data objects present in complex data structures into a
byte stream for storage, transfer and distribution purposes on physical devices
XML (Extensible Markup Language) - Nested textual format. Human-readable and editable. Schema
based validation. Used in metadata applications, web services data transfer, web publishing.
CSV (Comma-Separated Values) - Table structure with delimiters. Human-readable textual data. Opens as
spreadsheet or plaintext. Used as plaintext Database.
JSON (JavaScript Object Notation) - Short syntax textual format with limited data types. Human-
readable. Derived from JavaScript data formats. No need of a separate parser (like XML) since they map to
JavaScript objects. Can be fetched with an XMLHttpRequest call. No direct support for DATE data type.
All data is dynamically processed. Popular format for web API parameter passing. Mobile apps use this
extensively for user interaction and database services.
YAML (YAML Ain't Markup Language) - Lightweight text format. Human-readable. Supports
comments and thus easily editable. Superset of JSON. Supports complex data types. Maps easily to native
data structures. Used in configuration settings, document headers, Apps with need for MySQL style self-
references in relational data.
BSON (Binary JSON) - Created and internally used by MongoDB. Binary format, not human-readable.
Deals with attribute-value pairs like JSON. Includes datetime, bytearray and other data types not present in
JSON. Used in web apps with rich media data types such as live video. Primary use is storage, not network
communication.
MessagePack - Designed for data to be transparently converted from/to JSON. Compressed binary format,
not human-readable. Supports static typing. Supports RPC. Better JSON compatibility than BSON. Primary
use is network communication, not storage. Used in apps with distributed file systems.
protobuf (Protocol Buffers) - Created by Google. Binary message format that allows programmers to
specify a schema for the data. Also includes a set of rules and tools to define and exchange these messages.
Transparent data compression. Used in multi-platform applications due to easy interoperability between
languages. Universal RPC framework. Used in performance-critical distributed applications.
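To illustrate serialization and deserialization with one of the formats above, here is a small, assumed Java sketch that converts a hypothetical Employee object to JSON and back; it presumes the Jackson databind library is available on the classpath.

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSerializationSketch {
    // Simple example type; public fields keep the sketch short.
    public static class Employee {
        public String name;
        public int age;
        public Employee() { }                                  // needed for deserialization
        public Employee(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Serialize: object -> JSON text (a byte stream once encoded, e.g. as UTF-8).
        String json = mapper.writeValueAsString(new Employee("Asha", 30));
        System.out.println(json);                              // {"name":"Asha","age":30}

        // Deserialize: JSON text -> object, independent of the producing system's architecture.
        Employee back = mapper.readValue(json, Employee.class);
        System.out.println(back.name + " / " + back.age);
    }
}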
What are data serialization storage formats?
Storage formats define how information is stored in a file. Most of the time, this information can be inferred from the extension of the data file. Both structured and unstructured data can be stored on Hadoop-enabled systems. Common HDFS file formats include plain text files, SequenceFiles, Avro, Parquet and ORC.
o Data serialization is a process to format structured data in such a way that it can be converted back to its original form.
o Serialization translates data structures into a stream of bytes. This stream of data can be transmitted over the network or stored in a database regardless of the system architecture.
o Isn't storing information in binary form, or as a stream of bytes, already a reasonable approach? Serialization does the same, but in a way that is not dependent on the architecture of any particular machine.
Consider a CSV file that contains a comma (,) inside a data value: during deserialization, wrong outputs may occur. If, instead, the metadata is stored in XML form, which is a self-describing form of data storage, the data can easily be deserialized in the future.
To maintain the proper format, a data serialization system must have the following four properties: it should be compact, fast, extensible and interoperable.
Remote procedure call (RPC): when a client calls a function or subroutine on another computer in the network (a server), that procedure of calling is known as a remote procedure call; the call's arguments and results are serialized for transmission.
Persistent storage: serialized data can be written to disk or a database and read back later.
Hadoop's own serialization mechanism, the Writable interface, is specific to Java and is better than Java's built-in serialization because Java serialization is not compact. Serialization and deserialization of data thus help in maintaining and managing the data available in a data warehouse or any other database for effective use of resources.
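As a concrete illustration of the Writable mechanism mentioned above, here is a minimal, assumed sketch of a custom Hadoop Writable; the EmployeeWritable type is hypothetical and simply writes its fields to a compact binary stream and reads them back in the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class EmployeeWritable implements Writable {
    private String name;
    private int age;

    public EmployeeWritable() { }                               // required by the framework
    public EmployeeWritable(String name, int age) { this.name = name; this.age = age; }

    @Override
    public void write(DataOutput out) throws IOException {      // serialize to a byte stream
        out.writeUTF(name);
        out.writeInt(age);
    }

    @Override
    public void readFields(DataInput in) throws IOException {   // deserialize in the same field order
        name = in.readUTF();
        age = in.readInt();
    }

    @Override
    public String toString() {
        return name + "," + age;
    }
}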
*******