Big Data Analytics Digital Notes
[R17A0528]
LECTURE NOTES
MALLA REDDY
COLLEGE OF ENGINEERING & TECHNOLOGY
(Autonomous Institution – UGC, Govt. of India)
Recognized under 2(f) and 12 (B) of UGC ACT 1956
(Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified)
Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, India
(R17A0528) BIG DATA ANALYTICS
UNIT I
INTRODUCTION TO BIG DATA AND ANALYTICS
Classification of Digital Data, Structured and Unstructured Data –
Introduction to Big Data: Characteristics – Evolution – Definition - Challenges with Big Data
- Other Characteristics of Data - Why Big Data - Traditional Business Intelligence versus Big
Data - Data Warehouse and Hadoop Environment Big Data Analytics: Classification of
Analytics – Challenges - Big Data Analytics important - Data Science - Data Scientist -
Terminologies used in Big Data Environments - Basically Available Soft State Eventual
Consistency - Top Analytics Tools
UNIT II
INTRODUCTION TO TECHNOLOGY LANDSCAPE
NoSQL, Comparison of SQL and NoSQL, Hadoop -RDBMS Versus Hadoop - Distributed
Computing Challenges – Hadoop Overview - Hadoop Distributed File System - Processing
Data with Hadoop - Managing Resources and Applications with Hadoop YARN -
Interacting with Hadoop Ecosystem
UNIT III
INTRODUCTION TO MONGODB AND MAPREDUCE PROGRAMMING
MongoDB: Why Mongo DB - Terms used in RDBMS and Mongo DB - Data Types -
MongoDB Query Language
MapReduce: Mapper – Reducer – Combiner – Partitioner – Searching – Sorting – Compression
UNIT IV
INTRODUCTION TO HIVE AND PIG
Hive: Introduction – Architecture - Data Types - File Formats - Hive Query Language
Statements – Partitions – Bucketing – Views - Sub- Query – Joins – Aggregations - Group by
and Having - RCFile Implementation - Hive User Defined Function - Serialization and
Deserialization. Pig: Introduction - Anatomy – Features – Philosophy - Use Case for Pig - Pig
Latin Overview - Pig Primitive Data Types - Running Pig - Execution Modes of Pig - HDFS
Commands - Relational Operators - Eval Function - Complex Data Types - Piggy Bank -
User-Defined Functions - Parameter Substitution - Diagnostic Operator - Word Count
Example using Pig - Pig at Yahoo! - Pig Versus Hive
UNIT V
INTRODUCTION TO DATA ANALYTICS WITH R
Machine Learning: Introduction, Supervised Learning, Unsupervised Learning, Machine
Learning Algorithms: Regression Model, Clustering, Collaborative Filtering, Association Rule
Mining, Decision Tree, Big Data Analytics with BigR.
Reference Book:
1. Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, “Big Data For
Dummies”, John Wiley & Sons, Inc., 2013
2. Tom White, “Hadoop The Definitive Guide”, O’Reilly Publications, Fourth
Edition,2015
3. Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, Rafael Coss,
“Hadoop For Dummies”, Wiley Publications, 2014
4. Robert D.Schneider, “Hadoop For Dummies”, John Wiley & Sons, Inc.(2012)
5. Paul Zikopoulos, “Understanding Big Data: Analytics for Enterprise Class
Hadoop and Streaming Data”, McGraw Hill, 2012
6. Chuck Lam, “Hadoop in Action”, Dreamtech Publications, 2010
Text Book:
1. Seema Acharya, Subhashini Chellappan, “Big Data and Analytics”, Wiley
Publications, First Edition,2015
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and
large data sets that have to be processed and analyzed to uncover valuable information that can
benefit businesses and organizations.
However, a few basic tenets of Big Data make it even simpler to answer what Big Data is:
It refers to a massive amount of data that keeps on growing exponentially with time.
It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
It includes data mining, data storage, data analysis, data sharing, and data visualization.
The term is an all-comprehensive one including data, data frameworks, along with the tools
and techniques used to process and analyze the data.
Although the concept of big data itself is relatively new, the origins of large data sets go back to
the 1960s and '70s when the world of data was just getting started with the first data centers and
the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook,
YouTube, and other online services. Hadoop (an open-source framework created specifically to
store and analyze big data sets) was developed that same year. NoSQL also began to gain
popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark) was
essential for the growth of big data because they make big data easier to work with and cheaper to
store. In the years since then, the volume of big data has skyrocketed. Users are still generating
huge amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence of
machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has expanded
big data possibilities even further. The cloud offers truly elastic scalability, where developers can
simply spin up ad hoc clusters to test a subset of data.
Big data makes it possible for you to gain more complete answers because you have more
information.
More complete answers mean more confidence in the data—which means a completely
different approach to tackling problems.
a) Structured
Structured data is the first type of big data. By structured data, we mean data that can be
processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily stored in, and accessed from, a database by simple search algorithms. For
instance, the employee table in a company database is structured: the employee details, job
positions, salaries, and so on are all present in an organized manner.
b) Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.
c) Semi-structured
Semi-structured data is the third type of big data. It contains both of the formats mentioned
above, that is, structured and unstructured data. To be precise, it refers to data that, although not
classified under a particular repository (database), nevertheless contains vital information or tags
that segregate individual elements within the data. XML and JSON files are typical examples.
This concludes the types of data.
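The difference between semi-structured and structured data can be made concrete with a short sketch: the JSON records below (field names invented for illustration) are semi-structured, and flattening them into fixed columns turns them into structured, table-ready rows.

```python
import json

# Semi-structured input: records share some keys but not a rigid schema
# (field names here are illustrative, not from any real dataset).
raw = '''[
  {"name": "Asha", "role": "Engineer", "salary": 50000},
  {"name": "Ravi", "role": "Analyst"}
]'''

records = json.loads(raw)

# Flatten into a structured, fixed-format table: every row has the same
# columns, with None where the source record lacked a field.
columns = ["name", "role", "salary"]
table = [[rec.get(col) for col in columns] for rec in records]

print(table)  # [['Asha', 'Engineer', 50000], ['Ravi', 'Analyst', None]]
```

Note how the second record is still valid input despite the missing field; a fixed relational schema would have rejected it or required an explicit NULL.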
Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data: Variety, Velocity, and
Volume. These characteristics, taken together, are enough to know what big data is. Let’s look
at them in depth:
a) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered
from multiple sources. While in the past data could only be collected from spreadsheets and
databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio,
social media posts, and much more. Variety is one of the important characteristics of big data.
b) Velocity
Velocity refers to the speed at which data is generated and must be processed. Sources such as
social media feeds, clickstreams, and sensors produce a continuous, high-speed stream of data.
c) Volume
Volume refers to the sheer scale of the data, measured in terabytes and petabytes rather than
the gigabytes a conventional database typically handles.
1. Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based Analytics can
bring cost advantages to business when large amounts of data are to be stored and these
tools also help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can
easily identify new sources of data, which helps businesses analyze data immediately
and make quick decisions based on the findings.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you
can get feedback about who is saying what about your company. If you want to monitor
and improve the online presence of your business, then, big data tools can help in all
this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no single
business that can claim success without first having to establish a solid customer base.
However, even with a customer base, a business cannot afford to disregard the high
competition it faces. If a business is slow to learn what customers are looking for, then
it is very easy to begin offering poor quality products. In the end, loss of clientele will
result, and this creates an adverse overall effect on business success. The use of big data
allows businesses to observe various customer related patterns and trends. Observing
customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers Problem and Offer Marketing
Insights
Although Big Data and Business Intelligence are two technologies used to analyze data to help
companies in the decision-making process, there are differences between both of them. They differ
in the way they work as much as in the type of data they analyze.
Traditional BI methodology is based on the principle of grouping all business data into a central
server. Typically, this data is analyzed in offline mode, after storing the information in an
environment called Data Warehouse. The data is structured in a conventional relational database
with an additional set of indexes and forms of access to the tables (multidimensional cubes).
A Big Data solution differs from BI in many respects. These are the main differences between
Big Data and Business Intelligence:
1. In a Big Data environment, information is stored on a distributed file system, rather than
on a central server. It is a much safer and more flexible space.
2. Big Data solutions carry the processing functions to the data, rather than the data to the
functions. As the analysis is centered on the data, it's easier to handle larger amounts of
information in a more agile way.
3. Big Data can analyze data in different formats, both structured and unstructured. The
volume of unstructured data (those not stored in a traditional database) is growing at levels
much higher than the structured data. Nevertheless, its analysis carries different challenges.
Big Data solutions solve them by allowing a global analysis of various sources of
information.
4. Data processed by Big Data solutions can be historical or come from real-time sources.
Thus, companies can make decisions that affect their business in an agile and efficient way.
5. Big Data technology uses massively parallel processing (MPP) concepts, which improves
the speed of analysis. With MPP, many instructions are executed simultaneously; since each
job is divided into several parts that execute in parallel, the partial results are combined and
presented at the end. This allows large volumes of information to be analyzed quickly.
Big Data has become the reality of doing business for organizations today. There is a boom in the
amount of structured as well as raw data that floods every organization daily. If this data is
managed well, it can lead to powerful insights and quality decision making.
Big data analytics is the process of examining large data sets containing a variety of data types to
discover some knowledge in databases, to identify interesting patterns and establish relationships
to solve problems, market trends, customer preferences, and other useful information. Companies
and businesses that implement Big Data Analytics often reap several business benefits. Companies
implement Big Data Analytics because they want to make more informed business decisions.
A data warehouse (DW) is a collection of corporate information and data derived from operational
systems and external data sources. A data warehouse is designed to support business decisions by
allowing data consolidation, analysis and reporting at different aggregate levels. Data is populated
into the Data Warehouse through the processes of extraction, transformation and loading (ETL
tools). Data analysis tools, such as business intelligence software, access the data within the
warehouse.
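The ETL flow described above can be sketched in a few lines (all table and field names invented for illustration): extract rows from an operational source, transform them by consolidating to a coarser aggregate level, and load the result into the warehouse.

```python
from collections import defaultdict

def extract():
    # Pretend these rows come from an operational system.
    return [
        {"region": "north", "amount": 120.0},
        {"region": "north", "amount": 80.0},
        {"region": "south", "amount": 200.0},
    ]

def transform(rows):
    # Consolidate: aggregate sales per region (a coarser level of detail).
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

def load(warehouse, totals):
    # Load the aggregated result into the warehouse store.
    warehouse["sales_by_region"] = totals
    return warehouse

warehouse = load({}, transform(extract()))
print(warehouse)  # {'sales_by_region': {'north': 200.0, 'south': 200.0}}
```

A real ETL tool adds scheduling, error handling, and incremental loads, but the three-stage shape is the same.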
Hadoop is changing the perception of handling Big Data, especially unstructured data. Let's see
how the Apache Hadoop software library, which is a framework, plays a vital role in handling
Big Data. Apache Hadoop enables large data sets to be processed across clusters of computers
using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely on hardware to provide
high availability, the library itself is designed to detect and handle failures at the application
layer, delivering a highly available service on top of a cluster of computers, each of which may
be prone to failure.
Classification of analytics
Descriptive analytics
Descriptive analytics is a statistical method that is used to search and summarize historical data in
order to identify patterns or meaning.
Data aggregation and data mining are two techniques used in descriptive analytics to discover
historical data. Data is first gathered and sorted by data aggregation in order to make the datasets
more manageable by analysts.
Data mining describes the next step of the analysis and involves a search of the data to identify
patterns and meaning. Identified patterns are analyzed to discover the specific ways that learners
interacted with the learning content and within the learning environment.
Advantages:
Quickly and easily report on the Return on Investment (ROI) by showing how performance
achieved business or target goals.
Identify gaps and performance issues early - before they become problems.
Identify specific learners who require additional support, regardless of how many students
or employees there are.
Analyze the value and impact of course design and learning resources.
Predictive analytics
Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviors.
The software for predictive analytics has moved beyond the realm of statisticians and is becoming
more affordable and accessible for different markets and industries, including the field of learning
& development.
For the learner, predictive forecasting could be as simple as a dashboard located on the main screen
after logging in to access a course. Analyzing data from past and current progress, visual indicators
in the dashboard could be provided to signal whether the employee was on track with training
requirements.
Advantages:
Personalize the training needs of employees by identifying their gaps, strengths, and
weaknesses; specific learning resources and training can be offered to support individual
needs.
Retain Talent by tracking and understanding employee career progression and forecasting
what skills and learning resources would best benefit their career paths. Knowing what skills
employees need also benefits the design of future training.
Support employees who may be falling behind or not reaching their potential by offering
intervention support before their performance puts them at risk.
Simplified reporting and visuals that keep everyone updated when predictive forecasting
is required.
Prescriptive analytics
Prescriptive analytics is a statistical method used to generate recommendations and make decisions
based on the computational findings of algorithmic models.
Example
A Training Manager uses predictive analysis to discover that most learners without a particular
skill will not complete the newly launched course. What could be done? Now prescriptive analytics
can be of assistance on the matter and help determine options for action. Perhaps an algorithm can
detect the learners who require that new course, but lack that particular skill, and send an automated
recommendation that they take an additional training resource to acquire the missing skill.
You can think of Predictive Analytics as using this historical data to develop statistical models
that forecast future possibilities. Prescriptive Analytics takes Predictive Analytics a step further:
it takes the forecasted outcomes and predicts the consequences of those outcomes, recommending
actions accordingly.
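The three classes of analytics can be contrasted on a toy example (all numbers invented): descriptive summarizes history, predictive extrapolates it, and prescriptive turns the forecast into a recommended action.

```python
# Toy series of monthly course completions, oldest first (invented data).
completions = [40, 45, 50, 55]

# Descriptive: summarize what happened.
average = sum(completions) / len(completions)

# Predictive: a naive linear forecast from the average month-on-month change.
step = (completions[-1] - completions[0]) / (len(completions) - 1)
forecast = completions[-1] + step

# Prescriptive: recommend an action based on the forecast.
target = 70
action = "launch extra training" if forecast < target else "keep current plan"

print(average, forecast, action)  # 47.5 60.0 launch extra training
```

Real prescriptive systems weigh many candidate actions against modeled consequences; the point here is only the progression from summary to forecast to recommendation.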
As data sets are becoming bigger and more diverse, there is a big challenge to incorporate them
into an analytical platform. If this is overlooked, it will create gaps and lead to wrong messages
and insights.
Analysis is what makes the voluminous amount of data produced every minute useful. With the
exponential rise of data, a huge demand for big data scientists and Big Data analysts has been
created in the market. It is important for business organizations to hire data scientists with varied
skills, as the job of a data scientist is multidisciplinary. Another major challenge faced by
businesses is the shortage of professionals who understand Big Data analysis. There is a sharp
shortage of data scientists in comparison to the massive amount of data being produced.
It is imperative for business organizations to gain important insights from Big Data analytics,
and it is also important that only the relevant department has access to this information. A big
challenge faced by companies in Big Data analytics is bridging this wide gap in an effective manner.
It is hardly surprising that data is growing with every passing day. This simply indicates that
business organizations need to handle a large amount of data on a daily basis. The amount and
variety of data available these days can overwhelm any data engineer, which is why it is
considered vital to make data accessibility easy and convenient for brand owners and managers.
With the rise of Big Data, new technologies and companies are being developed every day.
However, a big challenge faced by the companies in the Big Data analytics is to find out which
technology will be best suited to them without the introduction of new problems and potential
risks.
Business organizations are growing at a rapid pace. With the tremendous growth of companies
and large business organizations, the amount of data produced increases. Storing this massive
amount of data is becoming a real challenge for everyone. Popular data storage options like data
lakes and warehouses are commonly used to gather and store large quantities of unstructured and
structured data in its native format. The real problem arises when a data lake or warehouse tries
to combine unstructured and inconsistent data from diverse sources: it encounters errors. Missing
data, inconsistent data, logic conflicts, and duplicate data all result in data quality challenges.
Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities
and opportunities. However, it also involves potential risks when it comes to the privacy and the
security of the data. The Big Data tools used for analysis and storage draw on data from disparate
sources. This eventually leads to a high risk of exposure of the data, making it vulnerable. Thus,
the rise in the volume of data increases privacy and security concerns.
Data science
Data science is the professional field that deals with turning data into value such as new insights
or predictive models. It brings together expertise from fields including statistics, mathematics,
computer science, and communication, as well as domain expertise such as business knowledge.
Data scientist has recently been voted the No. 1 job in the U.S., based on current demand, salary,
and career opportunities.
Data mining
Data mining is the process of discovering insights from data. In terms of Big Data, because it is so
large, this is generally done by computational methods in an automated way using methods such
as decision trees, clustering analysis and, most recently, machine learning. This can be thought of
as using the brute mathematical power of computers to spot patterns in data which would not be
visible to the human eye due to the complexity of the dataset.
Hadoop
Hadoop is a framework for Big Data computing which has been released into the public domain
as open source software, and so can freely be used by anyone. It consists of a number of modules
all tailored for a different vital step of the Big Data process – from file storage (Hadoop File System
– HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce – see below).
It has become so popular due to its power and flexibility that it has developed its own industry of
retailers (selling tailored versions), support service providers and consultants.
Predictive modelling
At its simplest, this is predicting what will happen next based on data about what has happened
previously. In the Big Data age, because there is more data around than ever before, predictions
are becoming more and more accurate. Predictive modelling is a core component of most Big Data
initiatives, which are formulated to help us choose the course of action which will lead to the most
desirable outcome. The speed of modern computers and the volume of data available means that
predictions can be made based on a huge number of variables, allowing an ever-increasing number
of variables to be assessed for the probability that it will lead to success.
MapReduce
MapReduce is a computing procedure for working with large datasets, devised because of the
difficulty of reading and analysing really big data using conventional computing methodologies.
As its name suggests, it consists of two procedures: mapping (sorting information into the format
needed for analysis, e.g. sorting a list of people according to their age) and reducing (performing
an operation, such as checking the age of everyone in the dataset to see who is over 21).
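A minimal sketch of these two procedures, using the over-21 example above (names and ages invented; a real Hadoop job would distribute the map, shuffle/sort, and reduce phases across a cluster):

```python
from itertools import groupby
from operator import itemgetter

# Invented sample data: (name, age) pairs.
people = [("ann", 34), ("bob", 19), ("cal", 25), ("dee", 17)]

def mapper(name, age):
    # Map phase: emit (key, value) pairs keyed by the over-21 test.
    yield ("over_21" if age > 21 else "under_21", name)

# Run the mapper over every record.
pairs = [kv for person in people for kv in mapper(*person)]

# Shuffle/sort phase: order pairs by key so equal keys are adjacent.
pairs.sort(key=itemgetter(0))

# Reduce phase: count the names collected under each key.
result = {key: len([v for _, v in group])
          for key, group in groupby(pairs, key=itemgetter(0))}

print(result)  # {'over_21': 2, 'under_21': 2}
```

The framework's contribution in a real cluster is exactly the middle step: routing every pair with the same key to the same reducer, at scale.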
Basically Available Soft State Eventual Consistency (BASE) is an alternative to the strict ACID
guarantees of relational databases. Basic availability implies continuous system availability
despite network failures, with tolerance of temporary inconsistency. Soft state means the state of
the system may change over time even without input, as replicas exchange updates. Eventual
consistency means that, once updates stop arriving, all replicas converge to the same value.
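A toy simulation of these properties, assuming two replicas and a delayed anti-entropy sync (not any real database's protocol):

```python
# Two replicas of the same record start out consistent.
replica_a = {"x": 1}
replica_b = {"x": 1}

# A write lands on replica A only (say, a partition toward B); the
# system stays available and accepts it.
replica_a["x"] = 2
inconsistent = replica_a != replica_b   # soft state: replicas disagree

# Later, an anti-entropy pass propagates the newer value to B.
replica_b.update(replica_a)
consistent = replica_a == replica_b     # eventual consistency restored

print(inconsistent, consistent)  # True True
```

The window between the write and the sync is exactly the "temporary inconsistency" BASE tolerates in exchange for availability.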
* R is a language for statistical computing and graphics. It is also used for big data analysis and
provides a wide variety of statistical tests.
* Apache Spark is a powerful open source big data analytics tool. It offers over 80 high-level
operators that make it easy to build parallel apps. It is used at a wide range of organizations to
process large datasets.
Features:
It helps run an application in a Hadoop cluster up to 100 times faster in memory, and ten
times faster on disk
It offers lightning-fast processing
Support for Sophisticated Analytics
Ability to Integrate with Hadoop and Existing Hadoop Data
* Plotly is an analytics tool that lets users create charts and dashboards to share online.
* Lumify is a big data fusion, analysis, and visualization platform. It helps users to discover
connections and explore relationships in their data via a suite of analytic options.
* IBM SPSS Modeler is a predictive big data analytics platform. It offers predictive models and
delivers to individuals, groups, systems and the enterprise. It has a range of advanced algorithms
and analysis techniques.
Features:
Discover insights and solve problems faster by analyzing structured and unstructured data
Use an intuitive interface for everyone to learn
You can select from on-premises, cloud and hybrid deployment options
Quickly choose the best performing algorithm based on model performance
* MongoDB is a NoSQL, document-oriented database. Its main features include aggregation,
ad hoc queries, use of the BSON format, sharding, indexing, replication, server-side execution
of JavaScript, schemaless design, capped collections, MongoDB Management Service (MMS),
load balancing, and file storage.
Features:
Easy to learn.
Provides support for multiple technologies and platforms.
No hiccups in installation and maintenance.
Reliable and low cost.
NoSQL
NoSQL is a non-relational DBMS that does not require a fixed schema, avoids joins, and is easy
to scale. NoSQL databases are used for distributed data stores with humongous data storage
needs. NoSQL is used for Big Data and real-time web apps by companies such as Twitter,
Facebook, and Google, which collect terabytes of user data every single day.
SQL
SQL programming can be effectively used to insert, search, update, and delete database records.
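The contrast can be sketched with Python's built-in sqlite3 module standing in for a SQL database and a plain list of dictionaries standing in for a schema-less document store (table and field names invented for illustration):

```python
import sqlite3

# SQL side: a fixed schema must be declared before records go in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()

# NoSQL side: documents need no declared schema and may differ in shape.
documents = []
documents.append({"name": "alice"})
documents.append({"name": "bob", "followers": 10})  # extra field is fine
found = [d for d in documents if d["name"] == "alice"]

print(row[0], found)  # alice [{'name': 'alice'}]
```

Adding a `followers` column on the SQL side would require an ALTER TABLE; on the document side a new field costs nothing, which is the flexibility (and the discipline trade-off) NoSQL offers.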
Designing a distributed system is neither easy nor straightforward. A number of challenges need
to be overcome in order to get an ideal system. The major challenges in distributed systems are
listed below:
1. Heterogeneity:
The Internet enables users to access services and run applications over a heterogeneous collection
of computers and networks. Heterogeneity (that is, variety and difference) applies to all of the
following: networks, computer hardware, operating systems, programming languages, and
implementations by different developers.
4. Concurrency
Both services and applications provide resources that can be shared by clients in a distributed
system. There is therefore a possibility that several clients will attempt to access a shared resource
at the same time. For example, a data structure that records bids for an auction may be accessed
very frequently when it gets close to the deadline time. For an object to be safe in a concurrent
environment, its operations must be synchronized in such a way that its data remains consistent.
This can be achieved by standard techniques such as semaphores, which are used in most operating
systems.
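A minimal version of the auction example, using a lock (a binary semaphore) from Python's threading module to keep the shared highest-bid record consistent under concurrent access:

```python
import threading

# Shared state: the current highest bid, accessed by many clients.
highest_bid = 0
lock = threading.Lock()

def place_bid(amount):
    global highest_bid
    with lock:  # synchronize: only one thread updates the record at a time
        if amount > highest_bid:
            highest_bid = amount

# Four concurrent "clients" bidding at once (amounts invented).
threads = [threading.Thread(target=place_bid, args=(amt,))
           for amt in (100, 250, 175, 300)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(highest_bid)  # 300
```

Without the lock, two threads could read the same old value and both write, losing one bid; the compare-and-update must be atomic.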
5. Security
Many of the information resources that are made available and maintained in distributed systems
have a high intrinsic value to their users. Their security is therefore of considerable importance.
Security for information resources has three components:
confidentiality (protection against disclosure to unauthorized individuals)
integrity (protection against alteration or corruption),
availability for the authorized (protection against interference with the means to access the
resources).
6. Scalability
Distributed systems must be scalable as the number of users increases. Scalability is defined by
B. Clifford Neuman as follows: a system is said to be scalable if it can handle the addition of
users and resources without suffering a noticeable loss of performance or increase in
administrative complexity.
Scalability has 3 dimensions:
o Size – number of users and resources to be processed. The associated problem is overloading.
o Geography – distance between users and resources. The associated problem is communication
reliability.
o Administration – as the size of a distributed system increases, many of its components need to
be controlled. The associated problem is administrative mess.
7. Failure Handling
Computer systems sometimes fail. When faults occur in hardware or software, programs may
produce incorrect results or may stop before they have completed the intended computation. The
handling of failures is particularly difficult.
Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale up
from single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers: a processing/computation layer (MapReduce) and a
storage layer (the Hadoop Distributed File System). The Hadoop Distributed File System (HDFS)
is based on the Google File System (GFS) and provides a distributed file system that is designed
to run on commodity hardware. It has many similarities with existing distributed file systems,
but it is highly fault-tolerant and designed to be deployed on low-cost hardware.
It is quite expensive to build bigger servers with heavy configurations that handle large-scale
processing. As an alternative, you can tie together many commodity computers, each with a
single CPU, into a single functional distributed system; practically, the clustered machines can
read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than
one high-end server. So the first motivational factor behind using Hadoop is that it runs across
clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
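The splitting and replication steps above can be sketched as follows (block size shrunk from 128 MB to 8 bytes, and node names invented, so the example stays readable):

```python
BLOCK_SIZE = 8      # stand-in for 128 MB
REPLICATION = 3     # HDFS's default replication factor
nodes = ["node1", "node2", "node3", "node4"]

data = b"abcdefghijklmnopqrst"  # 20 bytes -> 3 blocks (8, 8, 4)
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Place each block on REPLICATION distinct nodes, round-robin, so the
# loss of any one node leaves two copies of every block.
placement = {}
for i, block in enumerate(blocks):
    placement[i] = [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]

print(len(blocks), placement[0])  # 3 ['node1', 'node2', 'node3']
```

Real HDFS placement is rack-aware rather than round-robin, but the principle is the same: every block lives on several machines so hardware failure costs nothing but a re-replication.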
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines, in turn
utilizing the underlying parallelism of the CPU cores.
YARN divides the work of resource management and job scheduling/monitoring into separate
daemons: there is one ResourceManager and a per-application ApplicationMaster. An application
can be either a job or a DAG of jobs.
The ResourceManager has two components: the Scheduler and the ApplicationsManager.
The Scheduler is a pure scheduler, i.e. it does not track the status of running applications. It only
allocates resources to the various competing applications. Also, it does not restart jobs that fail
due to hardware or application failure. The Scheduler allocates resources based on an abstract
notion of a container. A container is simply a fraction of resources such as CPU, memory, disk,
and network.
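A toy version of this container model (resource figures invented; real YARN additionally tracks queues, locality, and fairness): the scheduler only carves containers out of available resources and refuses requests it cannot satisfy.

```python
# One node's available resources (invented figures).
node = {"cpu": 8, "memory_mb": 16384}

def allocate(node, cpu, memory_mb):
    """Carve a container out of the node if enough resources remain."""
    if node["cpu"] >= cpu and node["memory_mb"] >= memory_mb:
        node["cpu"] -= cpu
        node["memory_mb"] -= memory_mb
        return {"cpu": cpu, "memory_mb": memory_mb}
    return None  # request must wait for resources to free up

c1 = allocate(node, 4, 8192)   # granted
c2 = allocate(node, 4, 8192)   # granted, node now exhausted
c3 = allocate(node, 1, 1024)   # refused -> None

print(c1, c2, c3)
```

Note what the sketch deliberately omits: nothing here monitors or restarts the applications holding the containers; consistent with the text, that is the ApplicationMaster's job, not the Scheduler's.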
YARN can scale beyond a few thousand nodes via YARN Federation, which allows multiple
sub-clusters to be wired into a single massive cluster. Many independent clusters can thus be
used together for a single large job, achieving a large-scale system.
So what stores data in HDFS? It is HBase that stores data in HDFS.
HBase
HBase is a NoSQL (non-relational) database.
HBase is important and mainly used when you need random, real-time read or write
access to your Big Data.
It provides support for a high volume of data and high throughput.
In HBase, a table can have thousands of columns.
Database
Database is a physical container for collections. Each database gets its own set of files on the file
system. A single MongoDB server typically has multiple databases.
Collection
A collection is a group of MongoDB documents; it is the equivalent of an RDBMS table. A
collection exists within a single database and does not enforce a schema.
Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema means
that documents in the same collection do not need to have the same set of fields or structure, and
common fields in a collection's documents may hold different types of data.
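A short sketch of what dynamic schema means in practice (documents and field names invented): documents in one collection may omit fields or store different types in the same field, so queries must tolerate both.

```python
# A "collection" modeled as a list of documents with dynamic schema.
collection = [
    {"name": "post1", "likes": 10},
    {"name": "post2", "likes": "ten", "tags": ["db"]},  # different type
    {"name": "post3"},                                  # missing field
]

# A query must tolerate missing fields; dict.get returns None for them.
numeric_likes = [doc["name"] for doc in collection
                 if isinstance(doc.get("likes"), int)]

print(numeric_likes)  # ['post1']
```

In a real MongoDB deployment the server performs this filtering via query documents (e.g. type-checking operators), but the schema flexibility being exercised is the same.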
The following table shows the relationship of RDBMS terminology with MongoDB.
RDBMS              MongoDB
Database           Database
Table              Collection
Tuple/Row          Document
Column             Field
mysqld/Oracle      mongod
mysql/sqlplus      mongo
Sample Document
The following example shows the document structure of a blog site, which is simply a
comma-separated set of key-value pairs:
{
   _id: ObjectId(7df78ad8902c),
   url: 'http://www.tutorialspoint.com',
   likes: 100,
   comments: [
      {
         user: 'user1',
         like: 0
      },
      {
         user: 'user2',
         like: 5
      }
   ]
}
_id is a 12-byte hexadecimal number that guarantees the uniqueness of every document. You can
provide _id while inserting the document; if you don't, MongoDB generates a unique id for every
document. Of these 12 bytes, the first 4 bytes hold the current timestamp, the next 3 bytes the
machine id, the next 2 bytes the process id of the MongoDB server, and the remaining 3 bytes a
simple incremental value.
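The 12-byte layout described above can be sketched by unpacking an id by hand; this is a minimal illustration following that legacy layout (timestamp / machine id / process id / counter), using the sample id from the pretty() output later in these notes:

```python
# A sketch of decoding a 12-byte ObjectId into its four parts, assuming the
# legacy layout described in the text: 4-byte timestamp, 3-byte machine id,
# 2-byte process id, 3-byte counter.
def decode_object_id(hex_str):
    raw = bytes.fromhex(hex_str)
    assert len(raw) == 12, "an ObjectId is exactly 12 bytes"
    timestamp = int.from_bytes(raw[0:4], "big")   # seconds since the Unix epoch
    machine   = int.from_bytes(raw[4:7], "big")
    pid       = int.from_bytes(raw[7:9], "big")
    counter   = int.from_bytes(raw[9:12], "big")
    return timestamp, machine, pid, counter

parts = decode_object_id("5dd4e2cc0821d3b44607534c")
print(parts)
```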
Any relational database has a typical schema design that shows the number of tables and the
relationships between those tables. In MongoDB, there is no concept of relationship.
Advantages of MongoDB over RDBMS
Schema less − MongoDB is a document database in which one collection holds different
documents. Number of fields, content and size of the document can differ from one
document to another.
Structure of a single object is clear.
No complex joins.
Deep query-ability. MongoDB supports dynamic queries on documents using a document-
based query language that's nearly as powerful as SQL.
Tuning.
Ease of scale-out − MongoDB is easy to scale.
Conversion/mapping of application objects to database objects not needed.
Uses internal memory for storing the (windowed) working set, enabling faster access of
data.
Why Use MongoDB?
Document Oriented Storage − Data is stored in the form of JSON-style documents.
Index on any attribute
Replication and high availability
Auto-Sharding
Rich queries
Fast in-place updates
Professional support by MongoDB
Where to Use MongoDB?
Big Data
Content Management and Delivery
Mobile and Social Infrastructure
User Data Management
Data Hub
MongoDB supports many datatypes. Some of them are −
String − This is the most commonly used datatype to store the data. String in MongoDB
must be UTF-8 valid.
Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit
depending upon your server.
Boolean − This type is used to store a boolean (true/ false) value.
Double − This type is used to store floating point values.
Min/ Max keys − This type is used to compare a value against the lowest and highest
BSON elements.
Arrays − This type is used to store arrays or list or multiple values into one key.
Timestamp − A BSON timestamp. This can be handy for recording when a document has been
modified or added.
Object − This datatype is used for embedded documents.
Null − This type is used to store a Null value.
Symbol − This datatype is used identically to a string; however, it's generally reserved for
languages that use a specific symbol type.
Date − This datatype is used to store the current date or time in UNIX time format. You
can specify your own date time by creating object of Date and passing day, month, year
into it.
Object ID − This datatype is used to store the document’s ID.
To query data from a MongoDB collection, you need to use MongoDB's find() method.
Syntax
>db.COLLECTION_NAME.find()
Example
>db.mycol.find()
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.COLLECTION_NAME.find().pretty()
Example
Following example retrieves all the documents from the collection named mycol and arranges
them in an easy-to-read format.
> db.mycol.find().pretty()
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
Apart from the find() method, there is the findOne() method, which returns only one document.
Syntax
>db.COLLECTION_NAME.findOne()
Example
>db.mycol.findOne({title: "MongoDB Overview"})
To query the document on the basis of some condition, you can use following operations.
AND in MongoDB
To query documents based on the AND condition, you need to use the $and keyword. Following is
the basic syntax of AND −
Syntax
>db.mycol.find({ $and: [ {<key1>:<value1>}, { <key2>:<value2>} ] })
Example
Following example will show all the tutorials written by 'tutorials point' and whose title is
'MongoDB Overview'.
> db.mycol.find({$and:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
>
For the above example, the equivalent SQL where clause is ' where by = 'tutorials point'
AND title = 'MongoDB Overview' '. You can pass any number of key-value pairs in the find clause.
OR in MongoDB
To query documents based on the OR condition, you need to use the $or keyword. Following is
the basic syntax of OR −
Syntax
>db.mycol.find({ $or: [ {<key1>:<value1>}, { <key2>:<value2>} ] })
Example
Following example will show all the tutorials written by 'tutorials point' or whose title is
'MongoDB Overview'.
>db.mycol.find({$or:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
"_id": ObjectId(7df78ad8902c),
"title": "MongoDB Overview",
"description": "MongoDB is no sql database",
"by": "tutorials point",
"url": "http://www.tutorialspoint.com",
"tags": ["mongodb", "database", "NoSQL"],
"likes": 100
}
>
Using AND and OR Together
Example
The following example will show the documents that have likes greater than 10 and whose title
is either 'MongoDB Overview' or by is 'tutorials point'. Equivalent SQL where clause is 'where
likes>10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')'
>db.mycol.find({"likes": {$gt:10}, $or: [{"by": "tutorials point"},
{"title": "MongoDB Overview"}]}).pretty()
{
"_id": ObjectId(7df78ad8902c),
"title": "MongoDB Overview",
"description": "MongoDB is no sql database",
"by": "tutorials point",
"url": "http://www.tutorialspoint.com",
"tags": ["mongodb", "database", "NoSQL"],
"likes": 100
}
>
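The semantics of these operators can be sketched in plain Python (this is not real MongoDB code; documents are plain dicts and the matcher below handles only $and, $or, $gt and equality, which is enough to mirror the example above):

```python
# Hypothetical sketch of MongoDB query semantics: top-level conditions are
# ANDed together, $and requires every clause to match, $or requires at least
# one, and $gt compares a field against a threshold.
docs = [
    {"title": "MongoDB Overview", "by": "tutorials point", "likes": 100},
    {"title": "NoSQL Basics", "by": "someone else", "likes": 5},
]

def matches(doc, query):
    for key, cond in query.items():
        if key == "$and":
            if not all(matches(doc, c) for c in cond):
                return False
        elif key == "$or":
            if not any(matches(doc, c) for c in cond):
                return False
        elif isinstance(cond, dict) and "$gt" in cond:
            if not (key in doc and doc[key] > cond["$gt"]):
                return False
        elif doc.get(key) != cond:          # plain equality match
            return False
    return True

# likes > 10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')
query = {"likes": {"$gt": 10},
         "$or": [{"by": "tutorials point"}, {"title": "MongoDB Overview"}]}
print([d["title"] for d in docs if matches(d, query)])
```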
NOT in MongoDB
To query documents based on the NOT condition, you need to use the $not keyword. Note that
$not is applied to the condition on a field, not to a list of clauses. Following is the basic
syntax of NOT −
Syntax
>db.COLLECTION_NAME.find(
{
   <key>: { $not: <operator-expression> }
}
).pretty()
Example
Following example will retrieve the document(s) whose age is not greater than 25.
> db.empDetails.find( { "Age": { $not: { $gt: "25" } } } )
{
"_id" : ObjectId("5dd6636870fb13eec3963bf7"),
"First_Name" : "Fathima",
"Last_Name" : "Sheik",
"Age" : "24",
"e_mail" : "[email protected]",
"phone" : "9000054321"
}
Why is MapReduce important? In practical terms, it provides a very
effective tool for tackling large-data problems. But beyond that, MapReduce is important in how
it has changed the way we organize computations at a massive scale. MapReduce represents the
first widely-adopted step away from the von Neumann model that has served as the foundation of
computer science over the last half plus century. Valiant called this a bridging model [148], a
conceptual bridge between the physical implementation of a machine and the software that is to
be executed on that machine. Until recently, the von Neumann model has served us well: Hardware
designers focused on efficient implementations of the von Neumann model and didn’t have to
think much about the actual software that would run on the machines. Similarly, the software
industry developed software targeted at the model without worrying about the hardware details.
The result was extraordinary growth: chip designers churned out successive generations of
increasingly powerful processors, and software engineers were able to develop applications in
high-level languages that exploited those processors.
MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow us
to organize computations, not over individual machines, but over entire clusters. As Barroso puts
it, the datacenter is the computer. MapReduce is certainly not the first model of parallel
computation that has been proposed. The most prevalent model in theoretical computer science,
which dates back several decades, is the PRAM.
MAPPERS AND REDUCERS
Key-value pairs
form the basic data structure in MapReduce. Keys and values may be primitives such as integers,
floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists,
tuples, associative arrays, etc.). Programmers typically need to define their own custom data types,
although a number of libraries such as Protocol Buffers, Thrift, and Avro simplify the task.
Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary
datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML
content. For a graph, keys may represent node ids and values may contain the adjacency lists of
those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly
meaningful and are simply ignored during processing. Consider the canonical word count example
(computing the distribution over words in a collection). Input key-value pairs take the form of
(docid, doc) pairs
stored on the distributed file system, where the former is a unique identifier for the document, and
the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes
the document, and emits an intermediate key-value pair for every word: the word itself serves as
the key, and the integer one serves as the value (denoting that we’ve seen the word once). The
MapReduce execution framework guarantees that all values associated with the same key are
brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum
up all counts (ones) associated with each word. The reducer does exactly this, and emits final
keyvalue pairs with the word as the key, and the count as the value. Final output is written to the
distributed file system, one file per reducer. Words within each file will be sorted by alphabetical
order, and each file will contain roughly the same number of words. The partitioner, which we
discuss later in Section 2.4, controls the assignment of words to reducers. The output can be
examined by the programmer or used as input to another MapReduce program.
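The word count flow just described can be sketched as an in-memory Python simulation (not actual Hadoop code; the document ids and contents below are made up):

```python
from collections import defaultdict

# Mapper: tokenize the document and emit (word, 1) for every occurrence.
def mapper(docid, doc):
    for word in doc.split():
        yield word, 1

# Reducer: all values for the same key arrive together; sum the ones.
def reducer(word, counts):
    yield word, sum(counts)

docs = {"d1": "big data big ideas", "d2": "big clusters"}

# Simulated shuffle and sort: group all values associated with the same key.
groups = defaultdict(list)
for docid, doc in docs.items():
    for word, one in mapper(docid, doc):
        groups[word].append(one)

result = dict(kv for word, counts in sorted(groups.items())
                 for kv in reducer(word, counts))
print(result)  # {'big': 3, 'clusters': 1, 'data': 1, 'ideas': 1}
```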
There are some differences between the Hadoop implementation of MapReduce and Google's
implementation. In Hadoop, the reducer is presented with a key and an iterator over all values
associated with the particular key. The values are arbitrarily ordered. Google’s implementation
allows the programmer to specify a secondary sort key for ordering the values (if desired)—in
which case values associated with each key would be presented to the developer’s reduce code in
sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform
secondary sorting. Another difference: in Google’s implementation the programmer is not allowed
to change the key in the reducer. That is, the reducer output key must be exactly the same as the
reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary
number of output key-value pairs (with different keys).
To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors
how MapReduce programs are written in Hadoop. Mappers and reducers are objects that
implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized
for each map task (associated with a particular sequence of key-value pairs called an input split)
and the Map method is called on each key-value pair by the execution framework. In configuring
a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the
execution framework (see next section) makes the final determination based on the physical layout
of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce
phase: a reducer object is initialized for each reduce task, and the Reduce method is called once
per intermediate key. In contrast with the number of map tasks, the programmer can precisely
specify the number of reduce tasks. We will return to discuss the details of Hadoop job execution
in Section 2.6, which is dependent on an understanding of the distributed file system (covered in
Section 2.5). To reiterate: although the presentation of algorithms in this book closely mirrors the
way they would be implemented in Hadoop, our focus is on algorithm design and conceptual
understanding, not actual Hadoop programming.
In addition to the “canonical” MapReduce processing flow, other variations are also possible.
MapReduce programs can contain no reducers, in which case mapper output is directly written to
disk (one file per mapper). For embarrassingly parallel problems, e.g., parse a large text collection
or independently analyze a large number of images, this would be a common pattern. The
converse—a MapReduce program with no mappers—is not possible, although in some cases it is
useful for the mapper to implement the identity function and simply pass input key-value pairs to
the reducers. This has the effect of sorting and regrouping the input for reduce-side processing.
Similarly, in some cases it is useful for the reducer to implement the identity function, in which
case the program simply sorts and groups mapper output. Finally, running identity mappers and
reducers has the effect of regrouping and resorting the input data (which is sometimes useful).
Although in the most common case, input to a MapReduce job comes from data stored on the
distributed file system and output is written back to the distributed file system, any other system
that satisfies the proper abstractions can serve as a data source or sink. With Google’s MapReduce
implementation, BigTable [34], a sparse, distributed, persistent multidimensional sorted map, is
frequently used as a source of input and as a store of MapReduce output. HBase is an open-source
BigTable clone and has similar capabilities. Also, Hadoop has been integrated with existing MPP
(massively parallel processing) relational databases, which allows a programmer to write
MapReduce jobs over database rows and dump output into a new database table.
We have thus far presented a simplified view of MapReduce. There are two additional elements
that complete the programming model: partitioners and combiners. Partitioners are responsible for
dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In
other words, the partitioner specifies the task to which an intermediate key-value pair must be
copied. Within each reducer, keys are processed in sorted order (which is how the “group by” is
implemented). The simplest partitioner involves computing the hash value of the key and then
taking the mod of that value with the number of reducers. This assigns approximately the same
number of keys to each reducer (dependent on the quality of the hash function). Note, however,
that the partitioner only considers the key and ignores the value—therefore, a roughly-even
partitioning of the key space may nevertheless yield large differences in the number of key-values
pairs sent to each reducer (since different keys may have different numbers of associated values).
This imbalance in the amount of data associated with each key is relatively common in many text
processing applications due to the Zipfian distribution of word occurrences.
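The simplest hash-based partitioner described above can be sketched as follows; a stable checksum (zlib.crc32) stands in for the hash function, since Python's built-in hash() of strings is randomized per process:

```python
import zlib

# Partitioner sketch: hash the key, then take the result modulo the number
# of reducers. The same key always lands on the same reducer.
def partition(key, num_reducers):
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "cherry", "apple"]
assignments = {k: partition(k, 3) for k in keys}
print(assignments)  # every value is a reducer index in [0, 3)
```

Note that this balances the number of *keys*, not the number of key-value pairs: a key with many values still sends all of them to one reducer.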
Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle
and sort phase. We can motivate the need for combiners by considering the word count algorithm
in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore, all these
key-value pairs need to be copied across the network, and so the amount of intermediate data will
be larger than the input collection itself. This is clearly inefficient. One solution is to perform local
aggregation on the output of each mapper, i.e., to compute a local count for a word over all the
documents processed by the mapper. With this modification (assuming the maximum amount of
local aggregation possible), the number of intermediate key-value pairs will be at most the number
of unique words in the collection times the number of mappers (and typically far smaller because
each mapper may not encounter every word).
The combiner in MapReduce
supports such an optimization. One can think of combiners as “mini-reducers” that take place on
the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation
and therefore does not have access to intermediate output from other mappers. The combiner is
provided keys and values associated with each key (the same types as the mapper output keys and
values). Critically, one cannot assume that a combiner will have the opportunity to process all
values associated with the same key. The combiner can emit any number of key-value pairs, but
the keys and values must be of the same type as the mapper output (same as the reducer input).
In cases where an operation is both associative and commutative (e.g., addition or multiplication),
reducers can directly serve as combiners. In general, however, reducers and combiners are not
interchangeable.
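The effect of local aggregation can be sketched for word count: each mapper pre-sums the counts for the words it sees (the role a combiner plays), so it emits at most one pair per distinct word instead of one pair per occurrence. This is an in-memory illustration, not Hadoop code:

```python
from collections import Counter, defaultdict

# Mapper with local aggregation: count words locally, then emit
# (word, partial_count) pairs instead of one (word, 1) per occurrence.
def mapper_with_combining(doc):
    local = Counter(doc.split())
    return list(local.items())

docs = ["big data big ideas", "big clusters"]
shuffled = defaultdict(list)
for doc in docs:
    for word, partial in mapper_with_combining(doc):
        shuffled[word].append(partial)

# Reducer: summing partial counts is correct because addition is both
# associative and commutative.
totals = {w: sum(parts) for w, parts in shuffled.items()}
print(totals["big"])  # 3, assembled from the partial counts 2 and 1
```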
SECONDARY SORTING
MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort phase, which
is very convenient if computations inside the reducer rely on sort order (e.g., the order inversion
design pattern described in the previous section). However, what if in addition to sorting by key,
we also need to sort by value? Google's MapReduce implementation provides built-in support for
(optional) secondary sorting; Hadoop does not.
Consider the example of sensor data from a scientific experiment: there are m sensors each taking
readings on continuous basis, where m is potentially a large number. A dump of the sensor data
might look something like the following, where rx after each timestamp represents the actual
sensor readings (unimportant for this discussion, but may be a series of values, one or more
complex records, or even raw bytes of images).
m1 → (t1, r80521)
Suppose we wish to reconstruct the activity at each individual sensor over time. A mapper could
emit the sensor id as the intermediate key, with the rest of the record as the value. This would
bring all readings from the same sensor together in the reducer. However, since
MapReduce makes no guarantees about the ordering of values associated with the same key, the
sensor readings will not likely be in temporal order. The most obvious solution is to buffer all the
readings in memory and then sort by timestamp before additional processing. However, it should
be apparent by now that any in-memory buffering of data introduces a potential scalability
bottleneck. What if we are working with a high frequency sensor or sensor readings over a long
period of time? What if the sensor readings themselves are large complex objects? This approach
may not scale in these cases—the reducer would run out of memory trying to buffer all values
associated with the same key.
This is a common problem, since in many applications we wish to first group together data one
way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time). Fortunately,
there is a general purpose solution, which we call the “value-to-key conversion” design pattern.
The basic idea is to move part of the value into the intermediate key to form a composite key, and
let the MapReduce execution framework handle the sorting. In the above example, instead of
emitting the sensor id as the key, we would emit the sensor id and the timestamp as a composite
key:
(m1, t1) → (r80521)
However, note that sensor readings are now split across multiple keys. The reducer will need to
preserve state and keep track of when readings associated with the current sensor end and the next
sensor begin. The basic tradeoff between the two approaches discussed above (buffer and
in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly
implement secondary sorting in the reducer, which is likely to be faster but suffers from a
scalability bottleneck. With value-to-key conversion, sorting is offloaded to the MapReduce
execution framework. Note that this approach can be arbitrarily extended to tertiary, quaternary,
etc. sorting. This pattern results in many more keys for the framework to sort, but distributed
sorting is a task that the MapReduce runtime excels at since it lies at the heart of the programming
model.
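The value-to-key conversion pattern for the sensor example can be sketched in a few lines of Python; the sensor ids, timestamps, and reading labels below are made up, and a plain sort stands in for the framework's shuffle-and-sort:

```python
# Value-to-key conversion sketch: the timestamp moves from the value into a
# composite (sensor_id, timestamp) key, so sorting the intermediate keys
# (as the framework would) delivers each sensor's readings in temporal order.
readings = [("m1", 3, "r_c"), ("m2", 1, "r_x"), ("m1", 1, "r_a"), ("m1", 2, "r_b")]

# Mapper: emit composite key (sensor, t) with the reading as the value.
intermediate = [((sensor, t), r) for sensor, t, r in readings]

# Framework sort on the full composite key: readings arrive grouped by
# sensor and, within each sensor, ordered by timestamp.
intermediate.sort(key=lambda kv: kv[0])
print([r for (sensor, t), r in intermediate if sensor == "m1"])
# ['r_a', 'r_b', 'r_c']
```

No reducer ever needs to buffer a sensor's readings in memory; it simply consumes them in the order the framework delivers them.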
INDEX COMPRESSION
We return to the question of how postings are actually compressed and stored on disk. This chapter
devotes a substantial amount of space to this topic because index compression is one of the main
differences between a “toy” indexer and one that works on real-world collections. Otherwise,
MapReduce inverted indexing algorithms are pretty straightforward.
Let us consider the canonical case where each posting consists of a document id and the term
frequency. A naïve implementation might represent the first as a 32-bit integer and the second
as a 16-bit integer. Thus, a postings list might be encoded as follows:
[(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), . . .]
where each posting is represented by a pair in parentheses. Note that all brackets, parentheses, and
commas are only included to enhance readability; in reality the postings would be represented as
a long stream of integers. This naïve implementation would require six bytes per posting. Using
this scheme, the entire inverted index would be about as large as the collection itself. Fortunately,
we can do significantly better. The first trick is to encode differences between document ids as
opposed to the document ids themselves. Since the postings are sorted by document ids, the
differences (called d-gaps) must be positive integers greater than zero. The above postings list,
represented with d-gaps, would be:
[(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .]
Of course, we must actually encode the first document id. We haven’t lost any information, since
the original document ids can be easily reconstructed from the d-gaps. However, it’s not obvious
that we've reduced the space requirements either, since the largest possible d-gap is one less than
the number of documents in the collection.
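The d-gap trick is easy to sketch: store the first document id, then each difference from the previous id; decoding is a running sum, so no information is lost. Using the postings list from the text (ids 5, 7, 12, 49, 51):

```python
# Encode a sorted list of document ids as d-gaps: first id, then differences.
def to_dgaps(doc_ids):
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

# Decode by accumulating a running sum over the gaps.
def from_dgaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

doc_ids = [5, 7, 12, 49, 51]
gaps = to_dgaps(doc_ids)
print(gaps)  # [5, 2, 5, 37, 2], matching the d-gap list in the text
assert from_dgaps(gaps) == doc_ids
```

By itself this saves nothing; the payoff comes from the integer codes discussed next, which spend fewer bits on the small gaps that now dominate.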
Compression, in general, can be characterized as either lossless or lossy: it's fairly obvious that
lossless compression is required in this context. To start, it is important to understand that all
compression techniques represent a time–space tradeoff. That is, we reduce the amount of space
on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding
and decoding data. Therefore, it is possible that compression reduces size but also slows
processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up
with disk bandwidth), we can achieve the best of both worlds: smaller and faster.
POSTINGS COMPRESSION
Having completed our slight detour into integer compression techniques, we can now return to the
scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be
properly compressed. As we can see from the previous section, there is a wide range of choices
that represent different tradeoffs between compression ratio and decoding speed. Actual
performance also depends on characteristics of the collection, which, among other factors,
determine the distribution of d-gaps. Büttcher et al. [30] recently compared the performance of
various compression techniques on coding document ids. In terms of the amount of compression
that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best,
followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of raw
decoding speed, the order was almost the reverse: group varInt was the fastest, followed by
varInt. Simple-9 was substantially slower, and the bit-aligned codes were even slower than that.
Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes being
the slowest (about ten times slower than group varInt).
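To make the byte-aligned family concrete, here is a sketch of basic varInt-style coding (not the group varInt variant): each byte carries 7 bits of payload, and the high bit marks whether more bytes follow, so small d-gaps fit in a single byte:

```python
# Variable-length integer (varInt-style) encoding sketch: 7 payload bits per
# byte, high bit set on all bytes except the last.
def varint_encode(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data):
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return n

print(len(varint_encode(5)), len(varint_encode(300)))  # 1 2
```

Decoding only inspects whole bytes, which is why byte-aligned codes trade some compression ratio for much faster decoding than bit-aligned codes such as Golomb or γ.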
Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to
adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that this
represents a space-efficient encoding, at the cost of slower decoding compared to alternatives.
Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of
Golomb codes serves a pedagogical purpose, to illustrate how one might set compression
parameters.
Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps
with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the
document collection and the number of postings for a particular postings list (i.e., the document
frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The
df of a term, however, is not known until all the postings have been processed; unfortunately,
the df is needed to set the compression parameter before the postings can be encoded.
To get around this problem, we need to somehow inform the reducer of a term’s df before any of
its postings arrive. This can be solved with the order inversion design pattern introduced in Section
3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the
form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition
to emitting the normal intermediate key-value pairs (the postings), we also need
to keep track of document frequencies associated with each term. In practice, we can accomplish
this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an
in-memory associative array that keeps track of how many documents a term has been observed
in (i.e., the local document frequency of the term for the subset of documents processed by the
mapper). Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are
emitted with the partial df as the value.
To ensure that these special keys arrive first, we define the sort order of the tuple so that the special
symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each term,
the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial
df values originating from each mapper. Summing all these partial contributions will yield the
term’s df, which can then be used to set the Golomb compression parameter b. This allows the
postings to be incrementally compressed as they are encountered in the reducer—memory
bottlenecks are eliminated since we do not need to buffer postings in memory.
Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is useful
when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before it
encounters the data necessary to produce that computation. For computing relative frequencies,
that bit of information was the marginal. In this case, it’s the document frequency.
One of the most common and well-studied problems in graph theory is the single-source shortest
path problem, where the task is to find shortest paths from a source node to all other nodes in the
graph (or alternatively, edges can be associated with costs or weights, in which case the task is to
compute lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate
algorithm courses.
1: Dijkstra(G, w, s)
2: d[s] ← 0
3: for all vertex v ∈ V do
4:    d[v] ← ∞
5: Q ← {V}
6: while Q ≠ ∅ do
7:    u ← ExtractMin(Q)
8:    for all vertex v ∈ u.AdjacencyList do
9:       if d[v] > d[u] + w(u, v) then
10:         d[v] ← d[u] + w(u, v)
Figure 5.2: Pseudo-code for Dijkstra’s algorithm, which is based on maintaining a global priority
queue of nodes with priorities equal to their distances from the source node. At each iteration, the
algorithm expands the node with the shortest distance and updates distances to all reachable nodes.
As a refresher and also to serve as a point of comparison, Dijkstra’s algorithm is shown in Figure
5.2, adapted from Cormen, Leiserson, and Rivest’s classic algorithms textbook [41] (often simply
known as CLR). The input to the algorithm is a directed, connected graph G = (V, E) represented
with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the source node s. The
algorithm begins by first setting distances to all vertices d[v], v ∈ V to ∞, except for the source
node, whose distance to itself is zero. The algorithm maintains Q, a global priority queue of
vertices with priorities equal to their distance values d
Dijkstra’s algorithm operates by iteratively selecting the node with the lowest current distance
from the priority queue (initially, this is the source node). At each iteration, the algorithm
“expands” that node by traversing the adjacency list of the selected node to see if any of those
nodes can be reached with a path of a shorter distance. The algorithm terminates when the priority
queue Q is empty, or equivalently, when all nodes have been considered. Note that the algorithm
as presented in Figure 5.2 only computes the shortest distances. The actual paths can be recovered
by storing “backpointers” for every node indicating a fragment of the shortest path.
A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also
adapted from CLR). We start out in (a) with n1 having a distance of zero (since it’s the source)
and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to
expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be
reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected
for expansion. Nodes we have already considered for expansion are shown in black. Expanding
n3, we see in (c) that the distance to n2 has decreased because we’ve found a shorter path. The
nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with the
end state shown in (f), where we’ve discovered the shortest distance to all nodes.
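As a compact executable counterpart to Figure 5.2, here is a Python sketch of Dijkstra's algorithm using heapq as the priority queue; the example graph and its weights are made up, loosely echoing the n1…n5 trace discussed above:

```python
import heapq

# Dijkstra sketch: a heap holds (distance, node) pairs and plays the role of
# the global priority queue; stale heap entries are skipped on pop.
def dijkstra(graph, source):
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)          # ExtractMin
        if d > dist[u]:
            continue                      # stale entry; a shorter path was found
        for v, w in graph[u]:             # relax edges on u's adjacency list
            if dist[v] > d + w:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

graph = {
    "n1": [("n2", 10), ("n3", 5)],
    "n2": [("n4", 1)],
    "n3": [("n2", 3), ("n5", 2)],
    "n4": [],
    "n5": [("n4", 6)],
}
print(dijkstra(graph, "n1"))
# {'n1': 0, 'n2': 8, 'n3': 5, 'n4': 9, 'n5': 7}
```

Note how the distance to n2 ends up as 8 (via n3), not 10 (directly from n1): relaxation found the shorter path, just as in step (c) of the trace.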
Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided in
Figure 5.4. As with Dijkstra’s algorithm, we assume a connected, directed graph represented as
adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node,
and initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote
the node id (an integer) and N to denote the node’s corresponding data structure (adjacency list
and current distance). The algorithm works by mapping over all nodes and emitting a key-value
pair for each neighbor on the node’s adjacency list. The key contains the node id of the neighbor,
and the value is the current distance to the node plus one. This says: if we can reach node n with a
distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1.
Each iteration corresponds to a MapReduce job. The first time we run the algorithm, we “discover”
all nodes that are connected to the source. The second iteration, we discover all nodes connected
to those, and so on. Each iteration of the algorithm expands the “search frontier” by one hop, and,
eventually, all nodes will be discovered with their shortest distances (assuming a fully-connected
graph). Before we discuss termination of the algorithm, there is one more detail required to make
the parallel breadth-first search algorithm work. We need to “pass along” the graph structure from
one iteration to the next. This is accomplished by emitting the node data structure itself, with the
node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node
data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum
distance in the node data structure before emitting it as the final value. The final output is now
ready to serve as input to the next iteration.
So how many iterations are necessary to compute the shortest distance to all nodes? The answer is
the diameter of the graph, or the greatest distance between any pair of nodes. This number is
surprisingly small for many real-world problems: the saying “six degrees of separation” suggests
that everyone on the planet is connected to everyone else by at most six steps (the people a person
knows are one step away, people that they know are two steps away, etc.). If this is indeed true,
then parallel breadth-first search on the global social network would take at most six MapReduce
iterations.
1: class Mapper
2:   method Map(nid n, node N)
3:     d ← N.Distance
4:     Emit(nid n, N)                      ▷ Pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, d + 1)                ▷ Emit distances to reachable nodes

1: class Reducer
2:   method Reduce(nid m, [d1, d2, . . .])
3:     dmin ← ∞
4:     M ← ∅
5:     for all d ∈ [d1, d2, . . .] do
6:       if IsNode(d) then
7:         M ← d                           ▷ Recover graph structure
8:       else if d < dmin then             ▷ Look for shorter distance
9:         dmin ← d
10:    M.Distance ← dmin                   ▷ Update shortest distance
11:    Emit(nid m, node M)
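The pseudo-code above can be imitated in plain Python by simulating one MapReduce iteration in memory. The graph encoding (node id mapped to a (distance, adjacency-list) pair) and the function names map_node, reduce_node and bfs_iteration are illustrative assumptions, not Hadoop APIs.

```python
INF = float("inf")

def map_node(nid, node):
    """Emit the node structure plus tentative distances to its neighbors."""
    dist, adj = node
    pairs = [(nid, node)]                # pass along the graph structure
    if dist != INF:                      # only reached nodes extend the frontier
        for m in adj:
            pairs.append((m, dist + 1))  # distance to each reachable neighbor
    return pairs

def reduce_node(nid, values):
    """Recover the node structure and keep the minimum distance seen."""
    dmin, node = INF, None
    for v in values:
        if isinstance(v, tuple):         # the node data structure
            node = v
        elif v < dmin:                   # a candidate distance
            dmin = v
    dist, adj = node
    return (nid, (min(dist, dmin), adj))

def bfs_iteration(graph):
    """One MapReduce job: shuffle mapper output by key, then reduce."""
    shuffled = {}
    for nid, node in graph.items():
        for k, v in map_node(nid, node):
            shuffled.setdefault(k, []).append(v)
    return dict(reduce_node(k, vs) for k, vs in shuffled.items())

# source 'a' at distance 0; all other distances start at infinity
graph = {"a": (0, ["b"]), "b": (INF, ["c"]), "c": (INF, [])}
for _ in range(2):                       # two hops reach every node here
    graph = bfs_iteration(graph)
print({n: d for n, (d, _) in graph.items()})   # {'a': 0, 'b': 1, 'c': 2}
```

Each call to bfs_iteration plays the role of one MapReduce job, so the loop count corresponds to the number of iterations (the graph diameter) discussed above.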
Finally, as with Dijkstra’s algorithm in the form presented earlier, the parallel breadth-first search
algorithm only finds the shortest distances, not the actual shortest paths. However, the path can be
straightforwardly recovered. Storing “backpointers” at each node, as with Dijkstra’s algorithm,
will work, but may not be efficient since the graph needs to be traversed again to reconstruct the
path segments. A simpler approach is to emit paths along with distances in the mapper, so that
each node will have its shortest path easily accessible at all times. The additional space
requirements for shuffling these data from mappers to reducers are relatively modest, since for the
most part paths (i.e., sequence of node ids) are relatively short.
Up until now, we have been assuming that all edges are unit distance. Let us relax that restriction
and see what changes are required in the parallel breadth-first search algorithm. The adjacency
lists, which were previously lists of node ids, must now encode the edge distances as well.
The term ‘Big Data’ is used for collections of large datasets that include huge volume, high
velocity, and a variety of data that is increasing day by day. Using traditional data management
systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation introduced
a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed environment.
It contains two core modules: MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce: It is a parallel programming model for processing large amounts of
structured, semi-structured, and unstructured data on large clusters of commodity
hardware.
HDFS: The Hadoop Distributed File System is part of the Hadoop framework, used to store
datasets. It provides a fault-tolerant file system that runs on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that
are used to help Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop scripts for MapReduce
operations.
Hive: It is a platform used to develop SQL-type scripts to perform MapReduce operations.
Note: There are various ways to execute MapReduce operations:
The traditional approach using Java MapReduce program for structured, semi-structured,
and unstructured data.
The scripting approach for MapReduce to process structured and semi-structured data using
Pig.
The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and
developed it further as an open-source project under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
Architecture of Hive
This component diagram contains different units. The following table describes each unit:
HiveQL Process Engine: HiveQL is similar to SQL and is used for querying schema information in
the Metastore. It is one of the replacements of the traditional approach for MapReduce
programs: instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.
HDFS or HBASE: The Hadoop Distributed File System or HBASE is the data storage technique
used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
Step Operation
No.
1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and
build the query plan.
3 Get Metadata
4 Send Metadata
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of the query is complete.
6 Execute Plan
7 Execute Job
Internally, the execution process is a MapReduce job. The execution engine sends the job to
the JobTracker (which resides on the NameNode), and the JobTracker assigns the job to a
TaskTracker (which resides on a DataNode). There, the query runs as a MapReduce job.
8 Fetch Result
9 Send Results
10 Send Results
1. TextFile format
Suitable for sharing data with other tools
Can be viewed/edited manually
2. SequenceFile
Flat files that store binary key-value pairs
SequenceFile offers Reader, Writer, and Sorter classes for reading, writing, and sorting
respectively
Supports Uncompressed, Record-compressed (only the value is compressed) and Block-
compressed (both key and value compressed) formats
3. RCFile
RCFile stores columns of a table in a record columnar way
4. ORC
5. AVRO
Hive supports Data definition Language(DDL), Data Manipulation Language(DML) and User
defined functions.
create database
drop database
create table
drop table
alter table
create index
create view
Select
Where
Group By
Order By
Load Data
Join:
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
Drop database
Create a table called Sonoo with two columns, the first being an integer and the other a string.
Create a table called HIVE_TABLE with two columns and a partition column called ds. The
partition column is a virtual column: it is not part of the data itself but is derived from the partition
that a particular dataset is loaded into. By default, tables are assumed to be in text input format
and the delimiters are assumed to be ^A (Ctrl-A).
hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING) PARTITIONED BY (ds STRING);
To understand the Hive DML commands, let's see the employee and employee_department table
first.
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;
GROUP BY
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
hive> ALTER TABLE employee
    > ADD PARTITION (year='2012')
    > location '/2012/part2012';
Renaming a Partition
Dropping a Partition
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement with
WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a
condition: it filters the data using the condition and gives you a finite result. The built-in operators
and functions generate an expression which fulfils the condition.
Syntax
Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.
JDBC Program
The JDBC program to apply where clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   public static void main(String[] args) throws SQLException {
      // register Hive JDBC driver
      try {
         Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         return;
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary > 30000;");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " +
            res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands to compile
and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:
The ORDER BY clause is used to retrieve the details based on one column and sort the result set
in ascending or descending order.
Syntax
Let us take an example for the SELECT...ORDER BY clause. Assume the employee table as given
below, with the fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details ordered by department name.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
JDBC Program
Here is the JDBC program to apply Order By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLOrderBy {
   public static void main(String[] args) throws SQLException {
      // register Hive JDBC driver
      try {
         Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         return;
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY DEPT;");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " +
            res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLOrderBy.java. Use the following commands to compile
and execute this program.
$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy
Output:
The GROUP BY clause is used to group all the records in a result set using a particular collection
column. It is used to query a group of records.
Syntax
Example
JDBC Program
Given below is the JDBC program to apply the Group By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   public static void main(String[] args) throws SQLException {
      // register Hive JDBC driver
      try {
         Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         return;
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) " + "FROM employee GROUP BY DEPT;");
      System.out.println(" Dept \t count(*)");

      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2));
      }
      con.close();
   }
}
Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile
and execute this program.
$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy
Output:
Dept Count(*)
Admin 1
PR 2
TP 3
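The effect of GROUP BY with count(*) can be sketched in a few lines of Python; the department values below are hypothetical sample rows chosen to produce the counts shown above.

```python
from collections import Counter

# hypothetical Dept column values from the employee table
depts = ["TP", "TP", "TP", "PR", "PR", "Admin"]

# GROUP BY Dept with count(*): one output row per distinct key
counts = Counter(depts)
for dept, n in sorted(counts.items()):
    print(dept, n)
# Admin 1
# PR 2
# TP 3
```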
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database.
Syntax
join_table:
Example
We will use the following two tables in this chapter. Consider the following table named
CUSTOMERS.
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same as
INNER JOIN in SQL. A JOIN condition is raised using the primary keys and foreign keys of the
tables.
The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no
matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table,
the JOIN still returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still
returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and ORDER
tables.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables
that fulfil the JOIN condition. The joined table contains either all the records from both the tables,
or fills in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
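As a language-neutral sketch, the join variants can be reproduced in Python over the same toy data. The tuples below mirror the CUSTOMERS and ORDERS rows used in the queries above; the helper names are invented for illustration.

```python
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "kaushik"), (4, "Chaitali"),
             (5, "Hardik"), (6, "Komal"), (7, "Muffy")]
orders = [(3, 3000), (3, 1500), (2, 1560), (4, 2060)]  # (customer_id, amount)

def inner_join(left, right):
    """Keep only pairs whose keys match (Hive JOIN)."""
    return [(cid, name, amt) for cid, name in left
            for oid, amt in right if cid == oid]

def left_outer_join(left, right):
    """Keep every left row; unmatched right columns become None (NULL)."""
    out = []
    for cid, name in left:
        matches = [amt for oid, amt in right if oid == cid]
        out += [(cid, name, amt) for amt in matches] or [(cid, name, None)]
    return out

print(inner_join(customers, orders))      # only customers with orders
print(left_outer_join(customers, orders)) # NULL amounts for the rest
```

A RIGHT OUTER JOIN is the mirror image (swap the roles of the two tables), and a FULL OUTER JOIN keeps unmatched rows from both sides.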
Bucketing #
•The bucketing concept is based on (hash function on the bucketed column) mod (total number
of buckets). The hash_function depends on the type of the bucketing column.
•Records with the same bucketed column will always be stored in the same bucket.
•Bucketing can be done along with Partitioning on Hive tables and even without partitioning.
•Bucketed tables will create almost equally distributed data file parts, unless there is skew in data.
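A minimal sketch of the bucket assignment rule above, assuming Python's built-in hash stands in for Hive's type-dependent hash_function and an arbitrary bucket count of 4:

```python
NUM_BUCKETS = 4

def bucket_for(value):
    # bucket = hash_function(bucketed column) mod (total number of buckets)
    return hash(value) % NUM_BUCKETS

# records with the same bucketed-column value always land in the same bucket
buckets = {}
for user_id in [101, 102, 103, 101, 205, 102]:
    buckets.setdefault(bucket_for(user_id), []).append(user_id)

print({b: buckets[b] for b in sorted(buckets)})
# {1: [101, 101, 205], 2: [102, 102], 3: [103]}
```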
Advantages
•Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can
try out queries on a fraction of the data for testing and debugging purposes when the original data
sets are very huge.
•As the data files are equal sized parts, map-side joins will be faster on bucketed tables than non-
bucketed tables.
•Bucketing concept also provides the flexibility to keep the records in each bucket to be sorted by
one or more columns. This makes map-side joins even more efficient, since the join of each bucket
becomes an efficient merge-sort.
Bucketing Vs Partitioning
•Partitioning helps in the elimination of data when used in the WHERE clause, whereas bucketing
helps in organizing the data in each partition into multiple files, so that the same set of data is
always written to the same bucket.
•A Hive bucket is simply another technique for decomposing data into more manageable, roughly
equal parts.
Sampling
•TABLESAMPLE() gives more disordered and random records from a table as compared to LIMIT.
•We can sample using the rand() function, which returns a random number.
•Here rand() refers to any random column.
•The denominator in the bucket clause represents the number of buckets into which data will be
hashed.
•The numerator is the bucket number selected.
•If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY
clause, TABLESAMPLE queries only scan the required hash partitions of the table.
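The numerator/denominator mechanics can be sketched in Python: with BUCKET 1 OUT OF 4, only rows hashing into the first of four buckets are returned. The row data and helper name here are invented for illustration.

```python
def tablesample(rows, key, numerator, denominator):
    """Return rows falling in bucket `numerator` out of `denominator` buckets."""
    return [r for r in rows if hash(key(r)) % denominator == numerator - 1]

rows = [{"id": i} for i in range(12)]

# BUCKET 1 OUT OF 4: roughly a quarter of the table is scanned
sample = tablesample(rows, key=lambda r: r["id"], numerator=1, denominator=4)
print([r["id"] for r in sample])   # ids hashing into bucket 1
```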
Reduce-Side Join
Map-Side Join
•In case one of the datasets is small, a map-side join takes place.
•In a map-side join, a local job runs to create a hash table from the content of the HDFS file and
sends it to every node.
•The data must be bucketed on the keys used in the ON clause, and the number of buckets for one
table must be a multiple of the number of buckets for the other table.
•When these conditions are met, Hive can join individual buckets between tables in the map phase,
because it does not have to fetch the entire content of one table to match against each bucket in
the other table.
•set hive.optimize.bucketmapjoin=true;
•set hive.auto.convert.join=true;
SMBM Join
•SMB joins are used wherever the tables are sorted and bucketed.
•The join boils down to just merging the already sorted tables, allowing this operation to be faster
than an ordinary map-join.
•A left semi-join returns records from the lefthand table if records are found in the righthand table
that satisfy the ON predicates.
•SELECT and WHERE clauses can’t reference columns from the righthand table.
•For a given record in the lefthand table, Hive can stop looking for matching records in the
righthand table as soon as any match is found.
•At that point, the selected columns from the lefthand table record can be projected.
•File format in Hive refers to how records are stored inside a file.
•As we are dealing with structured data, each record has to have its own structure.
•These file formats mainly vary between data encoding, compression rate, usage of space and disk
I/O.
•Hive does not verify whether the data that you are loading matches the schema for the table or
not. •However, it verifies if the file format matches the table definition or not.
SerDe in Hive #
•The SerDe interface allows you to instruct Hive as to how a record should be processed.
•The Deserializer interface takes a string or binary representation of a record, and translates it into
a Java object that Hive can manipulate.
•The Serializer, however, will take a Java object that Hive has been working with, and turn it into
something that Hive can write to HDFS or another supported system.
•Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers
are used when writing data, such as through an INSERT-SELECT statement.
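A minimal sketch of the SerDe idea for a comma-delimited record, assuming a hypothetical Row type; Hive's real Serializer/Deserializer interfaces work through its own object inspectors, not Python tuples.

```python
from collections import namedtuple

Row = namedtuple("Row", ["id", "name", "salary"])

def deserialize(line):
    """Deserializer: raw text record -> object the engine can manipulate."""
    id_, name, salary = line.split(",")
    return Row(int(id_), name, float(salary))

def serialize(row):
    """Serializer: object -> text representation written back to storage."""
    return f"{row.id},{row.name},{row.salary}"

row = deserialize("1201,Gopal,45000.0")
assert serialize(row) == "1201,Gopal,45000.0"   # lossless round trip
```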
CSVSerDe
JSONSerDe
RegexSerDe
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
For Example
•Partitioning a table stores data in sub-directories categorized by table location, which allows Hive
to exclude unnecessary data from queries without reading all the data every time a new query is
made.
•Hive does support Dynamic Partitioning (DP) where column values are only known at
EXECUTION TIME. To enable Dynamic Partitioning :
•Another situation we want to protect against with dynamic partition inserts is the user
accidentally specifying all partitions as dynamic without specifying one static partition, when the
original intention is just to overwrite the sub-partitions of one root partition.
To enable bucketing:
Optimizations in Hive #
•Use denormalisation, filtering and projection as early as possible to reduce the data before a join.
•A join is a costly affair and requires an extra map-reduce phase to accomplish the query job. With
denormalisation, the data is present in the same table, so there is no need for any joins and the
selects are very fast.
TUNE CONFIGURATIONS
•Parallel execution
•Applies to MapReduce jobs that can run in parallel, for example jobs processing different source
tables before a join.
USE ORCFILE
•Hive supports ORCfile , a new table storage format that sports fantastic speed improvements
through techniques like predicate push-down, compression and more.
•Using ORCFile for every HIVE table is extremely beneficial to get fast response times for your
HIVE queries.
USE TEZ
•With Hadoop2 and Tez , the cost of job submission and scheduling is minimized.
•Also Tez does not restrict the job to be only Map followed by Reduce; this implies that all the
query execution can be done in a single job without having to cross job boundaries.
•Each record represents a click event, and we would like to find the latest URL for each sessionID
•In the above query, we build a sub-query to collect the timestamp of the latest event in each
session, and then use an inner join to filter out the rest.
•While the query is a reasonable solution —from a functional point of view— it turns out there’s
a better way to re-write this query as follows:
•Here, we use Hive’s OLAP functionality (OVER and RANK) to achieve the same thing, but
without a Join.
•Clearly, removing an unnecessary join will almost always result in better performance, and when
using big data this is more important than ever.
•Hive has a special syntax for producing multiple aggregations from a single pass through a source
of data, rather than rescanning it for each aggregation.
•This change can save considerable processing time for large input data sets.
•For example, each of the following two queries creates a table from the same source table, history:
Optimizations in Hive
•The following rewrite achieves the same thing, but using a single pass through the source history
table:
FROM history
Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS.
Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the
corresponding results in the Hadoop Distributed File System. Every task that can be achieved
using Pig can also be achieved using Java in MapReduce.
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes
this process easy. In Pig, the queries are converted to MapReduce internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing
the user to focus on semantics rather than efficiency.
3) Extensibility
A user-defined function is written in which the user can write their logic to execute over the data
set.
4) Flexible
5) In-built operators
o Less code - Pig requires fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyze the data in Hadoop. It is a
textual language that abstracts the programming from the Java MapReduce idiom into a notation.
Pig Latin statements are used to process the data. Each statement is an operator that accepts a
relation as input and generates another relation as output.
Convention Description
() The parenthesis can enclose one or more items. It can also be used to indicate
the tuple data type.
Example - (10, xyz, (3,6,9))
[] The straight brackets can enclose one or more items. It can also be used to
indicate the map data type.
Example - [INNER | OUTER]
{} The curly brackets enclose two or more items. It can also be used to indicate
the bag data type
Example - { block | nested_block }
... The horizontal ellipsis points indicate that you can repeat a portion of the code.
Example - cat path [path ...]
Apache Pig supports many data types. A list of Apache Pig Data Types with description and
examples are given below.
Command Function
foreach Applies expressions to each record and outputs one or more records
filter Applies predicate and removes records that do not return true
split Splits data into two or more sets based on filter conditions
Complex Types
Type Description
Operator Description
LOAD To Load the data from the file system (local/HDFS) into a relation.
Filtering
Sorting
Eval Functions
1 AVG()
2 BagToString()
To concatenate the elements of a bag into a string. While concatenating, we can place a
delimiter between these values (optional).
3 CONCAT()
4 COUNT()
To get the number of elements in a bag, while counting the number of tuples in a bag.
5 COUNT_STAR()
6 DIFF()
7 IsEmpty()
8 MAX()
To calculate the highest value for a column (numeric values or chararrays) in a single-column
bag.
9 MIN()
To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-
column bag.
10 PluckTuple()
Using the Pig Latin PluckTuple() function, we can define a string Prefix and filter the
columns in a relation that begin with the given prefix.
11 SIZE()
12 SUBTRACT()
To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples
of the first bag that are not in the second bag.
13 SUM()
14 TOKENIZE()
To split a string (which contains a group of words) in a single tuple and return a bag which
contains the output of the split operation.
While writing UDF’s using Java, we can create and use the following three types of functions −
Filter Functions − The filter functions are used as conditions in filter statements. These
functions accept a Pig value as input and return a Boolean value.
Eval Functions − The Eval functions are used in FOREACH-GENERATE statements.
These functions accept a Pig value as input and return a Pig result.
Algebraic Functions − The Algebraic functions act on inner bags in a
FOREACH-GENERATE statement. These functions are used to perform full MapReduce
operations on an inner bag.
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we
discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have
installed Eclipse and Maven in your system.
Follow the steps given below to write a UDF function −
Open Eclipse and create a new project (say myproject).
Convert the newly created project into a Maven project.
Copy the following content in the pom.xml. This file contains the Maven dependencies for
Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Pig_Udf</groupId>
<artifactId>Pig_Udf</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
   <sourceDirectory>src</sourceDirectory>
</build>
<dependencies>
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>0.15.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
</dependency>
</dependencies>
</project>
Save the file and refresh it. In the Maven Dependencies section, you can find the
downloaded jar files.
Create a new class file with name Sample_Eval and copy the following content in it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String> {
   // Convert the first field of each tuple to upper case
   public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
         return null;
      String str = (String) input.get(0);
      return str.toUpperCase();
   }
}
On clicking export, you will get the following window. Click on JAR file.
After writing the UDF and generating the Jar file, follow the steps given below −
Step 1: Registering the Jar file
After writing UDF (in Java) we have to register the Jar file that contain the UDF using the Register
operator. By registering the Jar file, users can intimate the location of the UDF to Apache Pig.
Syntax
Given below is the syntax of the Register operator.
REGISTER path;
Example
As an example let us register the sample_udf.jar created earlier in this chapter.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd PIG_HOME/bin
$ ./pig -x local
REGISTER '/$PIG_HOME/sample_udf.jar'
Note − Assume the Jar file is in the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining Alias
After registering the UDF we can define an alias to it using the Define operator.
Syntax
Given below is the syntax of the Define operator.
DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
Example
Define the alias for sample_eval as shown below.
DEFINE sample_eval sample_eval();
Step 3: Using the UDF
After defining the alias you can use the UDF same as the built-in functions. Suppose there is a
file named emp_data in the HDFS /Pig_Data/ directory with the following content.
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
Pig Latin has an option called param; using it, we can write dynamic scripts.
12
23
34
12
56
57
12
Dump specificNumber;
Usually we write the above code in a file. Let us assume we have written it in a file called numbers.pig.
pig -f /path/to/numbers.pig
Later, if we want to see only numbers equal to 34, we change the second line accordingly.
Assume we even want to take path at the time of running script, now we write code like below
If you feel this code is missing readability, we can specify all these dynamic values in a file like
below
##Dyna.params (file name)
Path = /data/numbers
dynanumber = 34
Then you can run the script with the -param_file option as shown below.
Dump operator
Describe operator
Explain operator
Illustrate operator
The above Pig script first splits each line into words using the TOKENIZE operator. The
tokenize function creates a bag of words. Using the FLATTEN function, the bag is converted into
individual word tuples, which can then be grouped and counted.
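The TOKENIZE → FLATTEN → GROUP → COUNT pipeline can be imitated in a few lines of Python; the input lines here are hypothetical.

```python
from collections import Counter

lines = ["hello world", "hello pig"]

# TOKENIZE: each line becomes a bag of words
bags = [line.split() for line in lines]

# FLATTEN: turn the bags into individual word tuples
words = [w for bag in bags for w in bag]

# GROUP + COUNT: group identical words and count each group
counts = Counter(words)
print(dict(counts))   # {'hello': 2, 'world': 1, 'pig': 1}
```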
Pig at Yahoo
Pig was initially developed by Yahoo! for its data scientists who were using Hadoop. It
was conceived to focus mainly on the analysis of large datasets rather than on writing mapper
and reducer functions. This allowed users to focus on what they want to do rather than on
how it is done. On top of this, with the Pig language you have the facility to write
commands in other languages like Java and Python. Large applications built on Pig Latin
can be custom built for different companies to serve different data-management tasks. Pig
organizes all the branches of data and relates them so that, when the time comes, filtering
and searching the data is carried out efficiently and quickly.
Here are some basic differences between Hive and Pig which give an idea of which to use
depending on the type of data and the purpose. Hive uses a declarative, SQL-like language
(HiveQL) and suits structured data and analysts already familiar with SQL, whereas Pig uses a
procedural data-flow language (Pig Latin) and suits semi-structured data and programmers
building data pipelines. Hive can be used in places where partitions are necessary and when it is
essential to define and create cross-language services for numerous languages.
Machine learning is a branch of computer science that studies the design of algorithms that can
learn. Typical machine learning tasks are concept learning, function learning or "predictive
modeling", clustering, and finding predictive patterns. These tasks are learned from available
data observed through experience or instruction. The hope is that incorporating this experience
into its tasks will eventually improve the learning. The ultimate goal is to improve the learning
to the point where it becomes automatic, so that humans no longer need to intervene.
In supervised learning (SML), the learning algorithm is presented with labelled example
inputs, where the labels indicate the desired output. SML itself is composed of classification,
where the output is categorical, and regression, where the output is numerical.
In unsupervised learning (UML), no labels are provided, and the learning algorithm focuses
solely on detecting structure in unlabelled input data.
Note that there are also semi-supervised learning approaches that use labelled data to inform
unsupervised learning on the unlabelled data to identify and annotate new classes in the dataset
(also called novelty detection).
In reinforcement learning, the learning algorithm performs a task using feedback from operating
in a real or synthetic environment. To summarize, the three main types of machine learning are:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
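As a concrete illustration of supervised classification, here is a minimal sketch of a one-nearest-neighbour classifier in plain Python. The training points and labels are invented for the example; a real application would use a library such as scikit-learn.

```python
import math

def nn_classify(train, labels, point):
    """Predict the label of `point` as the label of its nearest training point."""
    dists = [math.dist(p, point) for p in train]
    return labels[dists.index(min(dists))]

# Labelled example inputs: the "experience" the algorithm learns from.
train = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
labels = ["small", "small", "large", "large"]

print(nn_classify(train, labels, (2.0, 1.5)))   # near the "small" points -> small
print(nn_classify(train, labels, (8.5, 8.0)))   # near the "large" points -> large
```

Because the labels are categorical, this is a classification task; predicting a numeric value from the same neighbours (e.g. their average) would be regression.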
Here is a list of commonly used machine learning algorithms. These algorithms can be applied to
almost any data problem:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. kNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms
1. GBM
2. XGBoost
3. LightGBM
4. CatBoost
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome
variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function of the x variables.
This equation can then be used to predict the outcome (y) from new values of the predictor variables (x).
Linear regression is the simplest and most popular technique for predicting a continuous variable. It assumes a linear
relationship between the outcome and the predictor variables.
The linear regression equation can be written as y = b0 + b*x + e, where:
b0 is the intercept,
b is the regression weight or coefficient associated with the predictor variable x.
e is the residual error
Technically, the linear regression coefficients are determined so that the error in predicting the outcome value is
minimized. This method of computing the beta coefficients is called the Ordinary Least Squares (OLS) method.
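For a single predictor, the Ordinary Least Squares coefficients have a closed form: b = cov(x, y) / var(x) and b0 = mean(y) - b * mean(x). A minimal sketch in plain Python, with data invented to lie exactly on a line so the fit is exact:

```python
def ols_fit(x, y):
    """Ordinary Least Squares for y = b0 + b*x: minimizes squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # b = covariance(x, y) / variance(x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    b0 = my - b * mx  # intercept
    return b0, b

# Invented data on the line y = 2 + 3x, so the fit recovers b0 = 2, b = 3.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, 8.0, 11.0, 14.0]
b0, b = ols_fit(x, y)
print(b0, b)  # -> 2.0 3.0
```

With noisy real data the residual error e would be non-zero, and the same formulas would return the best-fitting line rather than an exact one.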
When you have multiple predictor variables, say x1 and x2, the regression equation can be written as y = b0 + b1*x1
+ b2*x2 + e. In some situations, there might be an interaction effect between some predictors; for example,
increasing the value of the predictor variable x1 may increase the effectiveness of the predictor x2 in explaining the
variation in the outcome variable.
Note also that, linear regression models can incorporate both continuous and categorical predictor variables.
When you build a linear regression model, you need to diagnose whether a linear model is suitable for your data.
In some cases, the relationship between the outcome and the predictor variables is not linear. In these situations, you
need to build a non-linear regression, such as polynomial and spline regression.
When you have multiple predictors in the regression model, you might want to select the best combination of predictor
variables to build an optimal predictive model. This process, called model selection, consists of comparing multiple
models containing different sets of predictors in order to select the best-performing model, that is, the one that
minimizes the prediction error. Linear model selection approaches include best subsets regression and stepwise regression.
In some situations, such as in genomic fields, you might have a large multivariate data set containing correlated
predictors. In this case, the information in the original data set can be summarized into a few new variables (called
principal components) that are linear combinations of the original variables. These few principal components can be
used to build a linear model, which might perform better for your data. This approach is known as principal
component-based methods, which include principal component regression and partial least squares regression.
Clustering is an unsupervised learning method that groups an unlabelled dataset: no supervision is provided to the
algorithm. It works by finding similar patterns in the data, such as shape, size, color, or behavior, and divides the
data points into groups according to the presence or absence of those patterns.
After applying a clustering technique, each cluster or group is given a cluster ID. The ML system can use this
ID to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit any
mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and
trousers in another; similarly, in the produce section, apples, bananas, mangoes, etc. are grouped separately so that
we can easily find things. The clustering technique works in the same way.
Another example of clustering is grouping documents according to topic.
The clustering technique can be widely used in various tasks. Some most common uses of this technique are:
o Market Segmentation
o Statistical data analysis
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations
based on a user's past product searches. Netflix also uses this technique to recommend movies and
web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are divided into
several groups with similar properties.
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based
method. The most common example of partitioning clustering is the K-Means Clustering algorithm.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped
distributions are formed as long as the dense regions can be connected. The algorithm identifies clusters by
connecting areas of high density, which are separated from each other by sparser areas in the data space.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high
dimensions.
Distribution Model-Based Clustering
In this method, the data is divided based on the probability that a data point belongs to a particular distribution,
most commonly the Gaussian distribution. An example of this type is the Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models (GMM).
Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many clustering algorithms have been
published, but only a few are commonly used. The choice of algorithm depends on the kind of data: some algorithms
require the number of clusters to be specified in advance, whereas others work from the distances between the
observations in the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions
the samples into clusters of equal variance. The number of clusters must be specified in advance. It is fast,
requiring relatively few computations, with linear complexity O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of data
points. It is an example of a centroid-based model, that works on updating the candidates for centroid to be
the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an
example of a density-based model similar to the mean-shift, but with some remarkable advantages. In this
algorithm, the areas of high density are separated by the areas of low density. Because of this, the clusters
can be found in any arbitrary shape.
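The assignment/update loop behind k-means (algorithm 1 above) can be sketched in plain Python. The six two-dimensional points are invented so that two well-separated clusters exist, and the initialisation is deliberately simple rather than random:

```python
import math

def kmeans(points, k, iters=10):
    """Naive k-means: the number of clusters k must be specified in advance."""
    centroids = points[:k]  # simple (non-random) initialisation for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point goes to its nearest centroid.
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

Production implementations (e.g. scikit-learn's KMeans) add random restarts and a convergence test instead of a fixed iteration count.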
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of
cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears based
on the closest object to the search query. It does it by grouping similar data objects in one group that is far
from the other dissimilar objects. The accurate result of a query depends on the quality of the clustering
algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their choice and
preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using the image
recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database.
This can be very useful for deciding the purpose for which a particular piece of land is most
suitable.
Association Rule Mining is used when you want to find associations between different objects
in a set, or frequent patterns in a transaction database, relational database, or any other
information repository. The applications of Association Rule Mining are found in marketing,
Basket Data Analysis (or Market Basket Analysis) in retailing, clustering, and classification. It can
tell you what items customers frequently buy together by generating a set of rules.
You are a data scientist (or becoming one!), and you get a client who runs a retail store. Your client
gives you data for all transactions that consists of items bought in the store by several customers
over a period of time and asks you to use that data to help boost their business. Your client will
use your findings to not only change/update/add items in inventory but also use them to change
the layout of the physical store or rather an online store. To find results that will help your client,
you will use Market Basket Analysis (MBA) which uses Association Rule Mining on the given
transaction data.
Catalogue design is another common application of Association Rule Mining.
Given is a set of transaction data. You can see transactions numbered 1 to 5. Each transaction
shows items bought in that transaction. You can see that Diaper is bought with Beer in three
transactions. Similarly, Bread is bought with milk in three transactions making them both frequent
item sets. Association rules are given in the form as below:
A => B [Support, Confidence]
The part before => is referred to as if (Antecedent) and the part after => is referred to as then
(Consequent), where A and B are sets of items in the transaction data; A and B are disjoint sets.
For example:
Computer => Anti-virus Software [Support = 20%, Confidence = 60%]
In the following section you will learn about the basic concepts of Association Rule Mining:
Basic Concepts of Association Rule Mining
1. Support (s): For a rule A => B, support is the fraction of the N transactions that contain both A and B.
Support(A => B) = frequency(A, B) / N
2. Confidence (c): For a rule A => B, confidence shows the percentage of transactions containing A in which B is also
bought:
Confidence(A => B) = P(A ∩ B) / P(A) = frequency(A, B) / frequency(A)
That is, the number of transactions with both A and B divided by the number of transactions having
A. In the example, Bread and Milk appear together in 3 transactions and Bread appears in 4, so:
Confidence(Bread => Milk) = 3/4 = 0.75 = 75%
Note: Support and Confidence measure how interesting a rule is. The minimum support and minimum confidence
thresholds, set by the client, help to compare rule strength according to your own or the client's needs: the
further a rule exceeds the thresholds, the more useful it is to the client.
1. Frequent Itemsets: Item-sets whose support is greater than or equal to the minimum support
threshold (min_sup). In the above example, min_sup = 3 (an absolute count, set at the user's choice).
2. Strong rules: If a rule A => B [Support, Confidence] satisfies both min_sup and min_confidence,
then it is a strong rule.
3. Lift: Lift gives the correlation between A and B in the rule A => B; correlation shows how
the item-set A affects the item-set B.
Lift(A => B) = Support(A => B) / (Supp(A) × Supp(B))
In the example, support(Bread) = 4/5 = 0.8 and support(Milk) = 4/5 = 0.8, so
Lift(Bread => Milk) = (3/5) / (0.8 × 0.8) ≈ 0.94, a slightly negative correlation.
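These measures can be checked numerically. Since the transaction table itself is not reproduced in these notes, the sketch below uses the standard five-transaction example that is consistent with the counts quoted (Bread in 4 of 5 transactions, Milk in 4 of 5, Bread and Milk together in 3); exact fractions are used so the results match the hand calculations:

```python
from fractions import Fraction

# Five example transactions consistent with the counts quoted in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def support(itemset):
    """Fraction of all transactions containing every item in `itemset`."""
    return Fraction(sum(itemset <= t for t in transactions), N)

def confidence(a, b):
    """Of the transactions containing A, the fraction that also contain B."""
    return support(a | b) / support(a)

def lift(a, b):
    """Correlation between A and B: >1 positive, <1 negative, =1 independent."""
    return support(a | b) / (support(a) * support(b))

print(support({"Bread", "Milk"}))       # -> 3/5
print(confidence({"Bread"}, {"Milk"}))  # -> 3/4
print(lift({"Bread"}, {"Milk"}))        # -> 15/16 (= 0.9375)
```

The lift below 1 matches the observation that buying Bread very slightly reduces the chance of also buying Milk in this data.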
When you apply Association Rule Mining to a given set of transactions T, your goal will be to find
all rules whose support and confidence are greater than or equal to the minimum support and
minimum confidence thresholds.
APRIORI Algorithm
In this part of the tutorial, you will learn about the algorithm that runs behind the
R libraries for Market Basket Analysis. This will help you understand your
clients better and perform the analysis with more attention. If you already know about
the APRIORI algorithm and how it works, you can skip ahead to the coding part.
APRIORI proceeds in two main steps: frequent item-set generation (finding all item-sets whose
support is at least min_sup) and rule generation (deriving, from the frequent item-sets, all strong
rules that satisfy min_confidence). Among these steps, frequent item-set generation is the most
costly in terms of computation.
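A minimal sketch of the frequent item-set generation step in plain Python (not the R libraries the text refers to, and with simplified candidate generation): level by level, candidate k-item-sets are formed from unions of frequent (k-1)-item-sets and pruned against min_sup. The five transactions are invented for illustration, consistent with the counts quoted earlier:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def apriori_frequent(transactions, min_sup):
    """Return every item-set whose absolute support count is >= min_sup."""
    items = sorted(set().union(*transactions))
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_sup]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-item-sets of size k.
        # (Any infrequent subset makes the union infrequent, so this is safe.)
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Support counting and pruning against min_sup.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_sup]
        frequent.extend(current)
        k += 1
    return frequent

freq = apriori_frequent(transactions, min_sup=3)
print(len(freq))  # -> 8: four frequent single items and four frequent pairs
```

Strong rules would then be generated by splitting each frequent item-set into antecedent and consequent and keeping the splits whose confidence meets min_confidence.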
1. Choosing a Variable
Marketing and Sales – Decision Trees play an important role in a decision-oriented sector
like marketing. In order to understand the consequences of marketing activities, organisations
make use of decision trees.
Decision Tree techniques can detect criteria for dividing the individual items of a group
into n predetermined classes.
In the first step, the variable for the root node is chosen. This variable should be selected based on
its ability to separate the classes efficiently. The operation divides this variable into the given
classes, creating subpopulations, and then repeats on each subpopulation until no further
separation can be obtained.
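The root variable's "ability to separate the classes efficiently" is typically scored with an impurity measure such as the Gini index. A minimal sketch in plain Python; the tiny marketing-style data set (two categorical variables and a buy/skip label) is invented for illustration:

```python
def gini(labels):
    """Gini impurity: 0 means the group contains a single class."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(rows, labels, var):
    """Weighted Gini impurity of the subpopulations after splitting on `var`."""
    total = 0.0
    for value in set(r[var] for r in rows):
        group = [l for r, l in zip(rows, labels) if r[var] == value]
        total += len(group) / len(rows) * gini(group)
    return total

# Invented data: does a customer buy, given income and age bracket?
rows = [
    {"income": "high", "age": "young"},
    {"income": "high", "age": "old"},
    {"income": "low",  "age": "young"},
    {"income": "low",  "age": "old"},
]
labels = ["buy", "buy", "skip", "skip"]

# Pick the root variable with the lowest post-split impurity.
root = min(["income", "age"], key=lambda v: split_impurity(rows, labels, v))
print(root)  # -> income (it separates the classes perfectly, impurity 0)
```

The same scoring is then applied recursively to each subpopulation, exactly as the text describes, until no split improves the separation.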
A tree in which each node has not more than two child nodes is a binary tree. The origin node is
referred to as the root node and the terminal nodes are the leaves.
Installing R packages
To use the data file with the format specified earlier, we don't need to install extra R packages; we
just need to use the built-in functions available with R.
Importing the data into R
To perform