0% found this document useful (0 votes)
8 views

Module 5_Mahout

Uploaded by

sonia
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views

Module 5_Mahout

Uploaded by

sonia
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 20

APACHE MAHOUT

What is Apache Mahout?


• A mahout is one who drives an elephant as its master. The name comes
from its close association with Apache Hadoop which uses an elephant
as its logo.

• Hadoop is an open-source framework from Apache that allows to store


and process big data in a distributed environment across clusters of
computers using simple programming models.

• Apache Mahout is an open source project that is primarily used for


creating scalable machine learning algorithms.
What is Apache Mahout?
• It implements popular machine learning techniques such as:

• Recommendation
• Classification
• Clustering

• Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In


2010, Mahout became a top level project of Apache.
Features of Mahout
• The algorithms of Mahout are written on top of Hadoop, so it works well in
distributed environment. Mahout uses the Apache Hadoop library to scale
effectively in the cloud.
• Mahout offers the coder a ready-to-use framework for doing data mining tasks
on large volumes of data.
• Mahout lets applications to analyze large sets of data effectively and in quick
time.
• Includes several MapReduce enabled clustering implementations such as k-
means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.
• Supports Distributed Naive Bayes and Complementary Naive Bayes classification
implementations.
• Comes with distributed fitness function capabilities for evolutionary
programming.

• Includes matrix and vector libraries.


Applications of Mahout

• Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter,


and Yahoo use Mahout internally.

• Foursquare helps you in finding out places, food, and entertainment


available in a particular area. It uses the recommender engine of
Mahout.

• Twitter uses Mahout for user interest modelling.

• Yahoo! uses Mahout for pattern mining.


What is Machine Learning?
• Machine learning is a branch of science that deals with programming
the systems in such a way that they automatically learn and improve
with experience. Here, learning means recognizing and understanding
the input data and making wise decisions based on the supplied data.

• It is very difficult to cater to all the decisions based on all possible


inputs. To tackle this problem, algorithms are developed. These
algorithms build knowledge from specific data and past experience with
the principles of statistics, probability theory, logic, combinatorial
optimization, search, reinforcement learning, and control theory.
The developed algorithms form the basis of various applications such as:

• Vision processing
• Language processing
• Forecasting (e.g., stock market trends)
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics
Supervised Learning
Supervised learning deals with learning a function from available training
data. A supervised learning algorithm analyzes the training data and
produces an inferred function, which can be used for mapping new
examples. Common examples of supervised learning include:
•classifying e-mails as spam,
•labeling webpages based on their content, and
•voice recognition.
There are many supervised learning algorithms such as neural networks,
Support Vector Machines (SVMs), and Naive Bayes classifiers. Mahout
implements Naive Bayes classifier.
Unsupervised Learning
Unsupervised learning makes sense of unlabeled data without having any predefined
dataset for its training. Unsupervised learning is an extremely powerful tool for
analyzing available data and look for patterns and trends. It is most commonly used for
clustering similar input into logical groups. Common approaches to unsupervised
learning include:
•k-means
•self-organizing maps, and
•hierarchical clustering
Recommendation
Recommendation is a popular technique that provides close recommendations based
on user information such as previous purchases, clicks, and ratings.
•Amazon uses this technique to display a list of recommended items that you might be
interested in, drawing information from your past actions. There are recommender
engines that work behind Amazon to capture user behavior and recommend selected
items based on your earlier actions.
•Facebook uses the recommender technique to identify and recommend the “people
you may know list”.
3Cs of mahout on the machine learning framework for
processing data.

• Collaborative Filtering
• Clustering
• Classification
Collaborative Filtering
•Collaborative filtering is a technique used for building recommendation systems. It
makes personalized recommendations to users based on the preferences and behavior
of similar users.
•In Mahout, collaborative filtering algorithms fall into two main categories: user-based
and item-based. User-based filtering recommends items based on the preferences of
similar users, while item-based filtering recommends items based on the preferences
of similar items.
•Apache Mahout provides the tools to implement and work with these collaborative
filtering techniques. These algorithms use user-item interaction data, often in the form
of a user-item matrix, to make recommendations.
•Collaborative filtering is widely used in applications like e-commerce for suggesting
products, content platforms for recommending articles or videos, and social networks
for suggesting connections or friends.
Clustering
• Clustering is a fundamental data analysis technique that involves grouping similar
data points together. In the context of Mahout, clustering is typically used with large
datasets to discover patterns, associations, and similarities.
• Mahout offers clustering algorithms such as K-Means, Canopy Clustering, and
Mean Shift, which allow you to cluster data points into groups or clusters. K-Means,
for example, partitions data into K clusters based on the similarity of data points.
• Clustering can be applied in various domains, including text analysis (grouping
similar documents), customer segmentation (grouping similar customers based on
behavior), and image analysis (grouping similar images).
Clustering
• Clustering is used to form groups or clusters of similar data based on common
characteristics. Clustering is a form of unsupervised learning.

• Search engines such as Google and Yahoo! use clustering techniques to group data
with similar characteristics.

• Newsgroups use clustering techniques to group various articles based on related


topics.

• The clustering engine goes through the input data completely and based on the
characteristics of the data, it will decide under which cluster it should be grouped.
Take a look at the following example.
Classification
• Classification is a supervised learning task where data is labeled with predefined
categories, and the goal is to predict the category of new, unlabeled data points.
• In Apache Mahout, you can find classification algorithms like Naive Bayes and
Random Forests. These algorithms can be used for tasks such as spam email
classification, sentiment analysis of text, or disease prediction.
• Classification involves training a model on labeled data to learn the relationships
between features and categories. Once trained, the model can classify new data
points into the appropriate categories.
Classification
• Classification, also known as categorization, is a machine learning technique that
uses known data to determine how the new data should be classified into a set of
existing categories. Classification is a form of supervised learning.

• Mail service providers such as Yahoo! and Gmail use this technique to decide
whether a new mail should be classified as a spam. The categorization algorithm
trains itself by analyzing user habits of marking certain mails as spams. Based on
that, the classifier decides whether a future mail should be deposited in your inbox
or in the spams folder.

• iTunes application uses classification to prepare playlists.


Summary
• These 3Cs, collaborative filtering, clustering, and classification, are fundamental
tasks in machine learning and data analysis.
• Apache Mahout provides implementations of various algorithms for these tasks,
making it a valuable tool for working with large datasets and building
recommendation systems, discovering patterns, and making category predictions in
a distributed and scalable manner.
THANK YOU

15

You might also like