BD - Unit - V - Mahout, Sqoop and Case Study
Syllabus
Unit – V
1. Mahout: Apache Mahout is a highly scalable machine
learning library that enables developers to use optimized
algorithms.
1. Company Name: Apache Mahout is developed under the
Apache Software Foundation. It started in 2008 as a
sub-project of Apache Lucene and became a top-level
Apache project in 2010.
6. Features / Benefits
Recommenders
Classification features
Clustering features
Distributed Matrices and Decomposition
Fast non-distributed linear mathematics
Built on Hadoop libraries to scale effectively in the cloud
A ready-to-use framework for the coder
Several MapReduce-enabled clustering implementations
Distributed Naive Bayes support
Distributed fitness function capabilities for evolutionary programming
Matrix and vector libraries.
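To make the "Recommenders" feature concrete, the sketch below shows the idea behind user-based collaborative filtering in plain Python. It is not Mahout code and does not use Mahout's API; the ratings data, the function names, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not taken from the slides.
RATINGS = {
    "alice": {"book": 5.0, "film": 3.0, "game": 4.0},
    "bob": {"book": 4.0, "film": 3.0, "game": 5.0, "song": 4.0},
    "carol": {"book": 1.0, "film": 5.0, "song": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items that are new to this user
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(
        ((scores[i] / weights[i], i) for i in scores if weights[i] > 0),
        reverse=True,
    )
    return [item for _, item in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("alice", RATINGS))
```

A library such as Mahout performs the same kind of computation, but distributed over a cluster so that it scales to far larger rating sets than an in-memory dictionary can hold.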
7. ADVANTAGES
Scalable
Community
Fast-prototyping
Apache license
Well tested
1. www.mahout.apache.org/
2. http://hortonworks.com/hadoop/mahout/
3. http://www.tutorialspoint.com/mahout/
4. www.hortonworks.com/hadoop/mahout/
5. www.tutorialspoint.com/mahout/index.html
6. www.datametica.com/mahout-an-introduction/
7. http://www.ibm.com/developerworks/library/j-mahout/
8. https://mahout.apache.org/users/basics/quickstart.html
12. Case Studies / White Papers: Companies such as Adobe,
Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.
www.mahout.apache.org/
Amazon
Facebook
Google
IBM
Joost
New York Times
Yahoo!
13. Conclusions: Apache Mahout is a highly scalable machine
learning library that enables developers to use optimized
algorithms. Mahout implements popular machine learning
techniques such as recommendation, classification, and
clustering.
2. SQOOP: SQOOP is an open-source tool from the Apache
Software Foundation. The name stands for SQL to Hadoop.
1. Company Name: SQOOP was initially developed and
maintained by Cloudera. It entered the Apache Incubator on
23 July 2011, and since then the Apache Software Foundation
has managed its releases.
SQOOP allows us to
Import one table
Import complete database
Import selected tables
Import selected columns from a particular table
Filter out certain rows from certain table etc
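The import options listed above map onto command-line arguments of the `sqoop import` and `sqoop import-all-tables` tools. The sketch below only assembles those argument lists in Python so that they can be inspected or passed to `subprocess.run`; the JDBC URL, user name, table names, and helper function names are hypothetical. Importing a selected set of tables is shown here by excluding the unwanted ones, which is one possible approach; calling `sqoop import` once per table is another.

```python
import shlex

# Hypothetical connection details; replace with your own database.
JDBC_URL = "jdbc:mysql://dbhost/sales"
USER = "report"


def sqoop_import(table, columns=None, where=None, target_dir=None):
    """Build the argument list for `sqoop import` on a single table."""
    args = ["sqoop", "import", "--connect", JDBC_URL,
            "--username", USER, "-P",  # -P prompts for the password
            "--table", table]
    if columns:
        args += ["--columns", ",".join(columns)]  # import selected columns only
    if where:
        args += ["--where", where]  # filter rows with a SQL WHERE condition
    if target_dir:
        args += ["--target-dir", target_dir]  # HDFS destination directory
    return args


def sqoop_import_all(exclude=()):
    """Build the argument list for `sqoop import-all-tables`."""
    args = ["sqoop", "import-all-tables", "--connect", JDBC_URL,
            "--username", USER, "-P"]
    if exclude:
        args += ["--exclude-tables", ",".join(exclude)]  # skip these tables
    return args


def as_shell(args):
    """Render an argument list as a single, safely quoted shell command."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(as_shell(sqoop_import("orders")))
    print(as_shell(sqoop_import("orders", columns=["id", "total"])))
    print(as_shell(sqoop_import("orders", where="total > 100")))
    print(as_shell(sqoop_import_all(exclude=["audit_log"])))
```

Building the command as a list rather than a string avoids quoting mistakes when a `--where` condition contains spaces or quotes.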
SQOOP Import: The import tool imports individual tables from
RDBMS to HDFS. Each row in a table is treated as a record in
HDFS. All records are stored as text data in text files or as binary
data in Avro and Sequence files.
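As a rough illustration of the import description above, the sketch below lists the `sqoop import` flags that select the storage format and shows how one table row could be rendered as one delimited text record. The comma field separator, newline record terminator, and the string `null` for SQL NULL follow SQOOP's defaults for text output; the function name and the sample row are hypothetical.

```python
# `sqoop import` flags that choose how records are stored in HDFS.
FORMAT_FLAGS = {
    "text": "--as-textfile",          # delimited text (the default)
    "avro": "--as-avrodatafile",      # binary Avro data files
    "sequence": "--as-sequencefile",  # binary Hadoop SequenceFiles
}


def row_to_text_record(row, field_sep=",", line_sep="\n"):
    """Render one table row as one delimited text record.

    SQL NULL values (None here) are written as the string "null".
    """
    fields = ("null" if value is None else str(value) for value in row)
    return field_sep.join(fields) + line_sep


if __name__ == "__main__":
    # Hypothetical row: (id, name, email) with a missing email.
    print(row_to_text_record((1, "Ann", None)), end="")
```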
SQOOP Export: The export tool exports a set of files from HDFS
back to an RDBMS. The files given as input to SQOOP contain
records, which become rows in the target table. The files are
read and parsed into records using a user-specified delimiter.
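The export direction can be sketched in the same way. The first function below mimics, in plain Python, how delimited lines are split into records; the second assembles a `sqoop export` argument list that names the HDFS input directory and the field delimiter. The connection details, table name, and directory are hypothetical.

```python
# Hypothetical connection details; replace with your own database.
JDBC_URL = "jdbc:mysql://dbhost/sales"
USER = "report"


def parse_records(lines, field_sep=","):
    """Split each non-empty line into a list of field values, one list per row."""
    return [line.rstrip("\n").split(field_sep) for line in lines if line.strip()]


def sqoop_export(table, export_dir, field_sep=","):
    """Build the argument list for `sqoop export` into an existing table."""
    return ["sqoop", "export", "--connect", JDBC_URL,
            "--username", USER, "-P",  # -P prompts for the password
            "--table", table,           # target table in the RDBMS
            "--export-dir", export_dir,  # HDFS directory holding the input files
            "--input-fields-terminated-by", field_sep]


if __name__ == "__main__":
    print(parse_records(["1,Ann\n", "2,Bob\n"]))
    print(sqoop_export("customers", "/user/report/customers"))
```

The target table must already exist in the database, because the export tool inserts rows and does not create the table.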
6. Features / Benefits
Ease of use
Command-line interface; SQOOP 2 adds a REST API
Seamless integration with Hadoop
Imported data can be loaded into Hive or HBase for OLAP-style analysis
It is fast, scalable, and extensible
Transfers data in parallel
Fault tolerance inherited from MapReduce
Connectors for RDBMS and NoSQL stores
Uses the MapReduce framework internally
Accepts free-form SQL queries to select the data to import
Reads the table schema from the source database and writes the data to HDFS
7. ADVANTAGES
It is easily extensible.
9. PICTURES / VIDEOS
SQOOP is a command line tool used for importing and
exporting data between Hadoop and specified relational
databases.
www.sqoop.apache.org/
https://sqoop.apache.org/
10. SOFTWARES / TOOLS
Because SQOOP runs on top of Hadoop, it is normally installed on a Linux
operating system. The installation steps are given at the link below.
http://www./sqoop/sqoop_installation.html
1. www.sqoop.apache.org/
2. https://sqoop.apache.org/
3. www.hortonworks.com/hadoop/sqoop/
4. www.tutorialspoint.com/sqoop/index.html
5. www.datametica.com/sqoop-an-introduction/
6. http://hadoopadmin.com/introduction-to-sqoop/
7. http://tutorial.techaltum.com/sqoop-introduction.html
12. Case Studies / White Papers: SQOOP is a collection of
related tools. To use SQOOP, you specify the tool you want to
use and the arguments that control the tool.
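The "tool plus arguments" pattern described above has the general shape `sqoop <tool> [arguments]`. A minimal sketch, assuming a hypothetical helper named `sqoop_command` and listing only a subset of the tools shipped with SQOOP 1.x:

```python
# A subset of the tools available in SQOOP 1.x.
SQOOP_TOOLS = {
    "codegen", "create-hive-table", "eval", "export", "help", "import",
    "import-all-tables", "job", "list-databases", "list-tables",
    "merge", "metastore", "version",
}


def sqoop_command(tool, *tool_args):
    """Return the argument list for `sqoop <tool> <arguments...>`."""
    if tool not in SQOOP_TOOLS:
        raise ValueError(f"unknown SQOOP tool: {tool}")
    return ["sqoop", tool, *tool_args]


if __name__ == "__main__":
    # Hypothetical JDBC URL, used only for illustration.
    print(sqoop_command("list-tables", "--connect", "jdbc:mysql://dbhost/sales"))
    print(sqoop_command("help", "import"))
```

Running `sqoop help` prints the tools that a given installation actually provides, and `sqoop help <tool>` prints the arguments of one tool.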
www.sqoop.apache.org/
Amazon
Facebook
Google
IBM
Joost
New York Times
Yahoo!