BD - Unit - V - Mahout, Sqoop and Case Study

The document covers Apache Mahout and Sqoop. Apache Mahout is a scalable machine learning library implementing techniques such as recommendation, classification, and clustering. Sqoop imports and exports structured data between Hadoop and relational databases such as MySQL and Oracle.


BIG DATA

Syllabus

Unit-I : Introduction to Big Data


Unit-II : Hadoop Frameworks and HDFS
Unit-III :MapReduce
Unit-IV : Hive and Pig
Unit-V : Mahout, Sqoop and Case Study

Unit – V
1. Mahout: Apache Mahout is a highly scalable machine
learning library that enables developers to use optimized
algorithms.

 Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering, and classification.

 Mahout implements popular machine learning techniques such as recommendation, classification, and clustering.

 Mahout provides data science tools to automatically find meaningful patterns in big data sets.
 Mahout supports four main data science use cases:
i. Collaborative filtering – mines user behavior and makes product recommendations (e.g., Amazon recommendations).
ii. Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
iii. Classification – learns from existing categorizations and then assigns unclassified items to the best category.
iv. Frequent itemset mining – analyzes items in a group (e.g., items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
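
 As a concrete sketch of the collaborative filtering use case, the commands below run Mahout's distributed item-based recommender from the command line; the HDFS paths and the ratings file layout (userID,itemID,rating per line) are assumed placeholders, not part of the original material.

# Minimal sketch (Mahout 0.x CLI); assumes ratings.csv lines of the form userID,itemID,rating
mahout recommenditembased \
  --input /user/hadoop/ratings.csv \
  --output /user/hadoop/recommendations \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --numRecommendations 10 \
  --tempDir /tmp/mahout-work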
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Mahout Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software / Tools
11. References (URLs)
12. Case Studies / Whitepapers
13. Conclusion

 1. Company Name: Apache Mahout started as a sub-project of Apache Lucene in 2008. In 2010, Mahout became a top-level project of the Apache Software Foundation.

 2. CEO: As an Apache project, Mahout has no CEO. The project was started by several people involved in the Apache Lucene community with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization.
3. Introduction: Apache Mahout is an open source project
that is primarily used in producing scalable machine
learning algorithms.

 A mahout is one who drives an elephant as its master; the name comes from the project's close association with Apache Hadoop, which uses an elephant as its logo.

 Mahout implements popular machine learning techniques such as:
i. Recommendation
ii. Classification
iii. Clustering
4. Mahout Architecture: The architecture shows the
relationship between various Mahout components in a user-
based recommender.

 An item-based recommender system is similar, except that no neighbourhood algorithms are involved.

 Mahout is designed to be enterprise-ready: it is built for performance, scalability, and flexibility.
 Top-level packages define the Mahout interfaces to these key abstractions:
 DataModel
 UserSimilarity
 ItemSimilarity
 UserNeighborhood
 Recommender
Fig: Mahout Architecture
5. SERVICES
 Metastore – a relational database storing metadata such as tables, partitions, and databases
 File system
 Job Client
 Recommenders service
 Classification service
 Clustering service
 Frequent item set mining service
 Distributed Matrices and Decomposition

6. Features / Benefits
 Recommenders
 Classification features
 Clustering features
 Distributed Matrices and Decomposition
 Fast non-distributed linear mathematics
 Hadoop library to scale effectively in the cloud.
 Mahout offers the coder a ready-to-use framework
 Several MapReduce-enabled clustering implementations
 Supports Distributed Naive Bayes
 Distributed fitness function capabilities
 Matrix and vector libraries.

7. ADVANTAGES
 Scalable

 Community

 Fast-prototyping

 Apache license

 Well tested

 Documentation support and Examples

 Built over existing production quality libraries


8. APPLICATIONS
 Vision processing
 Language processing
 Pattern recognition
 Games
 Data mining
 Expert systems
 Robotics
 Forecasting (e.g., stock market trends)
 Twitter uses Mahout for user interest modeling.
 Yahoo! uses Mahout for pattern mining.
 Foursquare, which helps users find places, food, and entertainment in a particular area, uses Mahout's recommender engine.
9. PICTURES / VIDEOS
 Apache Mahout is a highly scalable machine learning
library that enables developers to use optimized
algorithms.
 Mahout implements popular machine learning techniques
such as recommendation, classification, and clustering.
 https://mahout.apache.org/
10. SOFTWARE / TOOLS
 Apache Mahout is an open source project that is primarily used in producing scalable machine learning algorithms.
 http://www.tutorialspoint.com/mahout/mahout_classification.html
 The latest release, version 0.11.0, includes the Mahout Samsara environment.

Step 1: Check that Java is installed
Step 2: Check that Hadoop is installed
Step 3: Download Mahout or the Maven plug-in
Step 4: Generate example data
Step 5: Create sequence files
Step 6: Convert the sequence files to vectors
Step 7: Train on the vectors
Step 8: Test on the vectors
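
 Steps 5–8 correspond to Mahout's command-line drivers. The sketch below follows the well-known 20-newsgroups classification walkthrough; all directory names are assumed placeholders.

# Step 5: convert a directory of text documents into SequenceFiles
mahout seqdirectory -i /data/docs -o /data/docs-seq
# Step 6: convert the SequenceFiles into TF-IDF vectors
mahout seq2sparse -i /data/docs-seq -o /data/docs-vectors -lnorm -nv -wt tfidf
# Step 7: train a Naive Bayes model on the vectors
mahout trainnb -i /data/docs-vectors/tfidf-vectors -el -o /data/model -li /data/labelindex -ow -c
# Step 8: test the trained model
mahout testnb -i /data/docs-vectors/tfidf-vectors -m /data/model -l /data/labelindex -ow -o /data/test-results -c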
11. References

1. https://mahout.apache.org/
2. http://hortonworks.com/hadoop/mahout/
3. http://www.tutorialspoint.com/mahout/
4. http://www.datametica.com/mahout-an-introduction/
5. http://www.ibm.com/developerworks/library/j-mahout/
6. https://mahout.apache.org/users/basics/quickstart.html
12. Case Studies / White Papers: Companies such as Adobe,
Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.
 www.mahout.apache.org/
 Amazon
 Facebook
 Google
 IBM
 Joost
 New York Times
 Yahoo!
13. Conclusions: Apache Mahout is a highly scalable machine
learning library that enables developers to use optimized
algorithms. Mahout implements popular machine learning
techniques such as recommendation, classification, and
clustering.
2. SQOOP: SQOOP is an open source product of Apache. SQOOP stands for "SQL to Hadoop".

 It is a tool specially designed to transfer data between Hadoop and RDBMSs such as SQL Server, MySQL, and Oracle.

 SQOOP is a tool designed to transfer data between Hadoop and relational databases. You can use SQOOP to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

 SQOOP is basically a command-based interface: the import command transfers RDBMS data to Hadoop, and the export command transfers data back into the RDBMS.
 SQOOP imports data from a relational database system into HDFS. The input to the import process is a database table.

 SQOOP reads the table row by row into HDFS.

 The output of this import process is a set of files containing a copy of the imported table.

 The import process is performed in parallel; for this reason, the output is written to multiple files.

 These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
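
 A minimal sketch of such a parallel import; the connection string, credentials, and table name are assumed placeholders:

# Hypothetical import: copy the "employees" table from MySQL into HDFS
# using 4 parallel map tasks, writing comma-delimited text files
sqoop import \
  --connect jdbc:mysql://dbhost/company \
  --username dbuser -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  --num-mappers 4 \
  --fields-terminated-by ','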
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. SQOOP Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software / Tools
11. References (URLs)
12. Case Studies / Whitepapers
13. Conclusion

 1. Company Name: SQOOP was initially developed and maintained by Cloudera. It was incubated in Apache on 23 July 2011; since then, the Apache committee has managed the releases. SQOOP is an open source tool originally written at Cloudera.

 2. CEO: SQOOP is an Apache project; the CEO is Steven Farris.
– Developer(s): Apache Software Foundation
– Stable release: 1.4.5 / March 17, 2015
– Development status: Active
– Written in: Java
– Operating system: Cross-platform
– Website: www.sqoop.apache.org
Fig: Introduction of SQOOP
3. Introduction: SQOOP is a command line tool used for importing
and exporting data between Hadoop and specified relational
databases.

 SQOOP stands for “SQL to Hadoop and Hadoop to SQL”.

 SQOOP allows users to import data from their relational databases into HDFS and vice versa.

 SQOOP allows us to
 Import one table
 Import complete database
 Import selected tables
 Import selected columns from a particular table
 Filter out certain rows from a particular table, etc.
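
 A sketch of such a selective import; all connection details, table, and column names are assumed placeholders:

# Hypothetical selective import: only two columns, only matching rows
sqoop import \
  --connect jdbc:mysql://dbhost/company \
  --username dbuser -P \
  --table employees \
  --columns "id,name" \
  --where "hire_date >= '2015-01-01'" \
  --target-dir /user/hadoop/employees_subset
# Importing the complete database instead:
sqoop import-all-tables --connect jdbc:mysql://dbhost/company --username dbuser -P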
 SQOOP Import: The import tool imports individual tables from
RDBMS to HDFS. Each row in a table is treated as a record in
HDFS. All records are stored as text data in text files or as binary
data in Avro and Sequence files.

 SQOOP Export: The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to SQOOP contain records, which become rows in the table. They are read and parsed into a set of records, delimited with a user-specified delimiter.
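
 A matching export sketch, with paths and table names again assumed placeholders:

# Hypothetical export: push comma-delimited HDFS files back into an existing RDBMS table
sqoop export \
  --connect jdbc:mysql://dbhost/company \
  --username dbuser -P \
  --table employees_processed \
  --export-dir /user/hadoop/employees_out \
  --input-fields-terminated-by ','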

Fig: Introduction of SQOOP


 4. SQOOP Architecture: The SQOOP job tool creates and saves import and export commands; it specifies parameters to identify and recall a saved job. This re-calling (re-executing) is used in incremental import, which can import just the updated rows from an RDBMS table to HDFS.

 The SQOOP list-databases tool parses and executes the ‘SHOW DATABASES’ query against the database server and then lists the databases present on the server.

 SQOOP is written in Java. Java provides an API called Java Database Connectivity (JDBC), which allows applications to access data stored in an RDBMS and inspect the nature of that data; if JDBC is native to a database platform, SQOOP can work directly with it.
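
 A sketch of a saved incremental-import job and of list-databases; the job name, connection details, and column names are assumed placeholders:

# Hypothetical saved job: on each run, append rows whose id exceeds the last imported value
sqoop job --create nightly_import -- import \
  --connect jdbc:mysql://dbhost/company \
  --username dbuser -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  --incremental append \
  --check-column id \
  --last-value 0
# Re-execute the saved job; the metastore remembers the last value
sqoop job --exec nightly_import
# List the databases on the server (runs SHOW DATABASES on MySQL)
sqoop list-databases --connect jdbc:mysql://dbhost/ --username dbuser -P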
Fig: SQOOP Architecture
5. SERVICES
 Meta store service
 File system service
 Job Client service
 Hive Web Interface
 The Hive Metastore Server
 Disabling Bypass Mode
 Using Hive Gateways
 Connectors are installed / Configured in one place
 Managed by administrator and run by operator
 JDBC drivers are needed in one place
 Database connectivity is needed on the server

6. Features / Benefits
 Ease of use
 CLI and REST compliant
 Seamless integration with Hadoop
 It is designed for OLAP
 It is familiar, fast, scalable, and extensible
 Maintaining parallel operations
 High fault tolerance
 Extensions of RDBMS and NoSQL
 Internally used MapReduce Framework.
 It provides SQL type language
 It stores schema in a database and processed data into HDFS.
7. ADVANTAGES
 It is easily extensible.

 Supports Hive and HBase imports.

 Provides metastore to save jobs.

 Supports incremental imports (RDBMS to HDFS).

 Internally uses JDBC for importing and exporting the data.

 Direct mode of SQOOP enables the use of bulk copy utilities.

 Supports various file formats such as text, SequenceFile, and Avro.
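
 Two of these advantages as a sketch, with placeholder connection details: direct mode (bulk transfer via the database's native utility) and an Avro-format import.

# Hypothetical direct-mode import using MySQL's native bulk utilities
sqoop import --connect jdbc:mysql://dbhost/company --username dbuser -P \
  --table employees --direct

# Hypothetical import stored as binary Avro data files
sqoop import --connect jdbc:mysql://dbhost/company --username dbuser -P \
  --table employees --as-avrodatafile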


8. APPLICATIONS
 Log processing
 Document indexing
 Predictive modeling
 Hypothesis testing
 Customer facing BI
 Data Mining
 Call Center Apps
 Marketing Apps
 Create new Apps
 Website.com Apps
 Enterprise applications

9. PICTURES / VIDEOS
 SQOOP is a command line tool used for importing and
exporting data between Hadoop and specified relational
databases.

 https://sqoop.apache.org/
10. SOFTWARE / TOOLS
 As SQOOP is a sub-project of Hadoop, it works only on the Linux operating system. Follow the steps given below to install SQOOP on your system.
 http://www.tutorialspoint.com/sqoop/sqoop_installation.html

 Step 1: Verifying the Java installation
 Step 2: Verifying the Hadoop installation
 Step 3: Hadoop configuration
 Step 4: Downloading SQOOP
 Step 5: Installing SQOOP
 Step 6: Configuring .bashrc
 Step 7: Configuring SQOOP
 Step 8: Downloading and configuring mysql-connector-java
 Step 9: Verifying SQOOP

 The SQOOP version used is 1.4.3.
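
 A sketch of step 6, assuming SQOOP was unpacked under /usr/lib/sqoop (the path is a placeholder):

# Hypothetical ~/.bashrc additions for step 6
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

# Step 9: verify the installation
sqoop version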


11. References

1. https://sqoop.apache.org/
2. http://hortonworks.com/hadoop/sqoop/
3. http://www.tutorialspoint.com/sqoop/index.html
4. http://www.datametica.com/sqoop-an-introduction/
5. http://hadoopadmin.com/introduction-to-sqoop/
6. http://tutorial.techaltum.com/sqoop-introduction.html
12. Case Studies / White Papers: SQOOP is a collection of
related tools. To use SQOOP, you specify the tool you want to
use and the arguments that control the tool.
 www.sqoop.apache.org/
 Amazon
 Facebook
 Google
 IBM
 Joost
 New York Times
 Yahoo!

 13. Conclusions: SQOOP provides a good general-purpose tool for transferring data between any JDBC database and Hadoop. SQOOP extensions can provide optimizations for specific targets.
