
How to use regression machine learning algorithms in Weka

by Jason Brownlee on July 22, 2016 in Weka Machine Learning

Weka has a large number of regression algorithms available on the platform.


The large number of machine learning algorithms supported by Weka is one of the biggest benefits of using
the platform.
In this post, you will discover how to use the best regression machine learning algorithms in Weka.
After reading this post you will know:
About 5 main regression algorithms supported by Weka.
How to use machine learning regression algorithms for predictive modeling in Weka.
About the key configuration options of regression algorithms in Weka.
Discover how to prepare data, fit models, and evaluate your predictions, all without writing a line of code, in my new book, with 18 step-by-step tutorials and 3 projects with Weka.
Let us begin.
How to Use Regression Machine Learning Algorithms in Weka. Photo by solarisgirl, some rights reserved.
Regression Algorithms Overview
Let's take a tour of the 5 best regression algorithms in Weka.
Each algorithm we cover will be briefly described in terms of how it works, key algorithm parameters will be
highlighted, and the algorithm will be demonstrated in the Weka Explorer interface.
The 5 algorithms that we will review are:
Linear regression
k-Nearest neighbors
Decision tree
Support vector machines
Multilayer perceptron
These are 5 algorithms you can try on your regression problem as a starting point.
A standard machine learning regression problem will be used to demonstrate each algorithm.
Specifically, the Boston home price data set. Each instance describes properties in a Boston suburb and the
task is to predict housing prices in thousands of dollars. There are 13 numerical input variables with varying
scales that describe suburban properties. You can learn more about this dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Housing
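As an aside, if you would rather script the experiments in this post than click through the Explorer, the same steps can be driven from the Weka Java API. Here is a minimal sketch, assuming the dataset is saved locally as housing.arff (adjust the path to your copy); it loads the data and marks the last attribute, the house price, as the output variable. The later sketches in this post follow the same pattern.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadHousing {
    public static void main(String[] args) throws Exception {
        // Load the Boston house price dataset from an ARFF file.
        // The path "housing.arff" is an assumption; point it at your copy.
        Instances data = DataSource.read("housing.arff");

        // The class (output) attribute is the last one: the house price.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Loaded " + data.numInstances() + " instances with "
                + (data.numAttributes() - 1) + " input attributes.");
    }
}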
Start Weka Explorer:
Open the Weka GUI Chooser.
Click the "Explorer" button to open the Weka Explorer.
Load the Boston housing price dataset from the housing.arff file.
Click "Classify" to open the Classify tab.
Let's start by looking at the linear regression algorithm.
Need more help with Weka for Machine Learning?
Take my free 14-day email course and discover how to use the platform step by step. Click to register and
also get a free PDF version of the book.
Linear regression
Linear regression only supports regression type problems.
It works by estimating coefficients for a line or hyperplane that best fits the training data. It is a very simple
regression algorithm, quick to train, and can have great performance if the output variable for your data is a
linear combination of its inputs.
It's a good idea to evaluate linear regression for your problem before moving on to more complex algorithms
in case it works well.
Choose linear regression algorithm:
Click the "Choose" button and select "LinearRegression" in the "functions" group.
Click the algorithm name to review the algorithm settings.

Linear Regression Weka Setup


Linear regression performance can be reduced if your training data has input attributes that are highly correlated. Weka can automatically detect and remove highly correlated input attributes by setting eliminateColinearAttributes to True, which is the default.
Additionally, attributes that are unrelated to the output variable can also negatively affect performance. Weka can automatically perform feature selection, keeping only the relevant attributes, via the attributeSelectionMethod parameter. This is enabled by default and can be disabled.
Finally, the Weka implementation uses a ridge regularization technique to reduce the complexity of the learned model. It does this by penalizing the sum of the squared coefficients, which prevents any single coefficient from becoming too large (a sign of complexity in regression models).
Click "OK" to close the algorithm settings.
Click the “Start” button to run the algorithm on the Boston house price data set.
You can see that with the default settings, the linear regression achieves an RMSE of 4.9.
Weka Results for Linear Regression
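For readers comfortable with code, here is a rough equivalent of the steps above as a sketch against the Weka Java API. The setter names mirror the options discussed above, and the ridge value shown is Weka's documented default; the 10-fold cross validation matches what the Explorer does by default.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LinearRegressionExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("housing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        LinearRegression lr = new LinearRegression();
        lr.setEliminateColinearAttributes(true); // drop highly correlated inputs (default)
        lr.setRidge(1.0e-8);                     // ridge penalty on the squared coefficients (default)

        // Evaluate with 10-fold cross validation, as the Explorer does by default.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(lr, data, 10, new Random(1));
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}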
k-Nearest neighbors
The k-nearest neighbors algorithm supports classification and regression. It is also called kNN for short. It
works by storing the entire training data set and querying it to locate the k most similar training patterns
when making a prediction.
As such, there is no model other than the raw training data set and the only calculation performed is the
query of the training data set when a prediction is requested.
It's a simple algorithm, but one that doesn't assume much about the problem other than that the distance
between data instances is meaningful for making predictions. As such, it often achieves very good performance. When making predictions on regression problems, kNN will take the average of the k most similar instances in the training data set.
Choose the kNN algorithm:
Click the "Choose" button and select "IBk" in the "lazy" group.
Click the algorithm name to review the algorithm settings.
In Weka, KNN is called IBk, which means instance-based k.
Weka Nearest Neighbor Settings
The size of the neighborhood is controlled by the parameter k. For example, if set to 1, predictions are made
using the training instance most similar to a given new pattern for which a prediction is requested. Common
values for k are 3, 7, 11, and 21, larger for larger data set sizes. Weka can automatically discover a good
value for k using cross validation within the algorithm by setting the crossValidate parameter to True.
Another important parameter is the distance measurement used. This is configured in the nearest neighbor
search algorithm which controls how the training data is stored and searched. The default value is a
LinearNNSearch. Clicking on the name of this search algorithm will provide another configuration window
where you can choose a distanceFunction parameter. By default, Euclidean distance is used to calculate the
distance between instances, which is good for numerical data with the same scale. Manhattan distance is a good choice if the scales of your attributes differ.
It's a good idea to try a set of different k values and distance measures for your problem and see what works
best.
Click "OK" to close the algorithm settings.
Click the "Start" button to run the algorithm on the house price data set.
Boston.
You can see that with the default settings, the KNN algorithm achieves an RMSE of 4.6.
Weka regression results for k-nearest neighbors algorithm
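The same experiment can be sketched in code. The fragment below sets k and swaps in the Manhattan distance via the nearest neighbor search algorithm, as described above; k = 7 is just one of the common values to try, not a recommendation.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.ManhattanDistance;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class KnnExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("housing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        knn.setKNN(7); // neighborhood size; try 3, 7, 11, 21
        // knn.setCrossValidate(true); // or let Weka pick k via internal cross validation

        // Swap the default Euclidean distance for Manhattan distance,
        // configured on the nearest neighbor search algorithm.
        LinearNNSearch search = new LinearNNSearch();
        search.setDistanceFunction(new ManhattanDistance());
        knn.setNearestNeighbourSearchAlgorithm(search);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}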
Decision tree
Decision trees can support classification and regression problems.
Decision trees are more recently referred to as classification and regression trees, or CART. They work by creating a tree to evaluate an instance of data, starting at the root of the tree and moving down the branches toward the leaves (down because the tree is drawn inverted, with the root at the top) until a prediction can be made. The
process of creating a decision tree works by greedily selecting the best split point to make predictions and
repeating the process until the tree has a fixed depth. After the tree is built, it is pruned to improve the
model's ability to generalize to new data.
Choose the decision tree algorithm:
Click the "Choose" button and select "REPTree" in the "trees" group.
Click the algorithm name to review the algorithm settings.
Weka configuration for decision tree algorithm
The depth of the tree is set automatically, but you can specify a depth in the maxDepth attribute.
You can also choose to disable pruning by setting the noPruning parameter to True, although this may result
in worse performance.
The minNum parameter defines the minimum number of instances supported by the tree at a leaf node when
building the tree from training data.
Click "OK" to close the algorithm settings.
Click the “Start” button to run the algorithm on the Boston house price data set.
You can see that with the default settings, the decision tree algorithm achieves an RMSE of 4.8.
Weka Regression Results for Decision Tree Algorithm
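Here is a code sketch of the same run, with the three parameters discussed above set to their Weka defaults so you can see where to change them:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("housing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        REPTree tree = new REPTree();
        tree.setMaxDepth(-1);     // -1 lets the depth be chosen automatically (default)
        tree.setNoPruning(false); // keep pruning on; disabling it may hurt generalization
        tree.setMinNum(2.0);      // minimum number of instances at a leaf node (default)

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}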
Support Vector Regression
Support vector machines were developed for binary classification problems, although extensions have been
made to the technique to support multi-class classification and regression problems. The adaptation of SVM
for regression is called Support Vector Regression or SVR for short.
SVM was developed for numeric input variables, although it will automatically convert nominal values to
numeric values. The input data is also normalized before being used.
Unlike SVM which finds a line that best separates training data into classes, SVR works by finding a line of
best fit that minimizes the error of a cost function. This is done through an optimization process that only
considers those data instances in the training data set that are closest to the line with the minimum cost.
These instances are called support vectors, hence the name of the technique.
In almost all problems of interest, a line cannot be drawn to best fit the data, so a margin is added around the
line to relax the constraint, allowing some bad predictions to be tolerated but allowing a better overall result.
Finally, few data sets can be fitted with just a straight line. Sometimes a curved line, or even closed polygonal regions, is needed. This is achieved by projecting the data into a higher-dimensional space in which to draw the lines and make predictions. Different kernels can be used to control the projection and the amount of flexibility.
Choose SVR algorithm:
Click the "Choose" button and select "SMOreg" in the "function" group.
Click the algorithm name to review the algorithm settings.
Weka configuration for support vector regression algorithm
The C parameter, called the complexity parameter in Weka, controls how flexible the process can be to draw
the line that fits the data. A value of 0 does not allow margin violations, while the default value is 1.
A key parameter in SVM is the type of kernel to use. The simplest kernel is a linear kernel, which separates the data with a straight line or hyperplane. The default in Weka is a polynomial kernel, which fits the data with a curved or wavy line; the higher the exponent of the polynomial, the more wavy the fit.
The polynomial kernel has a default exponent of 1, making it equivalent to a linear kernel. A popular and
powerful kernel is the RBF kernel or radial basis function kernel which is capable of learning closed
polygons and complex shapes to fit the training data.
It's a good idea to try a set of different kernels and C (complexity) values on your problem and see what
works best.
Click "OK" to close the algorithm settings.
Click the “Start” button to run the algorithm on the Boston house price data set.
You can see that with the default settings, the SVR algorithm achieves an RMSE of 5.1.
Weka Regression Results for Support Vector Regression Algorithm
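And a sketch of the SMOreg setup, showing where the complexity parameter and the kernel are configured. The values shown are the Weka defaults; the RBF alternative is left commented out as one option to try.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMOreg;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvrExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("housing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMOreg svr = new SMOreg();
        svr.setC(1.0); // complexity parameter (default)

        // Default kernel: polynomial with exponent 1, equivalent to a linear kernel.
        PolyKernel poly = new PolyKernel();
        poly.setExponent(1.0);
        svr.setKernel(poly);
        // To try a radial basis function kernel instead:
        // svr.setKernel(new RBFKernel());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svr, data, 10, new Random(1));
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}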
Multilayer perceptron
Multilayer perceptron algorithms support regression and classification problems.
They are also called artificial neural networks, or simply neural networks for short.
Neural networks are a complex algorithm to use for predictive modeling because there are so many
configuration parameters that they can only be adjusted effectively through intuition and a lot of trial and
error.
It is an algorithm inspired by a model of biological neural networks in the brain where small processing units
called neurons are organized into layers that, if configured well, are capable of approximating any function.
In classification we are interested in approximating the underlying function to better discriminate between
classes. In regression problems, we are interested in approximating a function that best fits the actual value
output.
Choose multilayer perceptron algorithm:
Click the "Choose" button and select "Multilayer Perceptron" in the "function" group.
Click the algorithm name to review the algorithm settings.
Weka setup for multilayer perceptron algorithm
You can manually specify the structure of the neural network that the model uses, but this is not recommended
for beginners.
The default will automatically design the network and train it on your data set, creating a network with a single hidden layer. You can specify the number of hidden layers in the hiddenLayers parameter, which is set to "a" (automatic) by default.
You can also use a GUI to design the network structure. This can be fun, but it is recommended that you use the
GUI with a simple training and test split of your training data, otherwise you will be asked to design a network
for each of the 10 cross-validation folds.
Weka GUI Designer for Multilayer Perceptron Algorithm
You can configure the learning process by specifying how much to update the model each epoch by setting
the learning rate. Common values are small, such as values between 0.3 (the default value) and 0.1.
The learning process can be further tuned with momentum (set to 0.2 by default) to continue updating the weights even when no changes need to be made, and decay (set decay to True), which reduces the learning rate over time to do more learning at the beginning of training and less at the end.
Click "OK" to close the algorithm settings.
Click the “Start” button to run the algorithm on the Boston house price data set.
You can see that with the default settings, the Multilayer Perceptron algorithm achieves an RMSE of 4.7.
Weka Regression Results Multilayer Perceptron Algorithm
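Finally, a sketch of the equivalent API calls, with the learning rate, momentum, and decay options discussed above set to their Weka defaults:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("housing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("a"); // "a" = size the single hidden layer automatically
        mlp.setLearningRate(0.3); // default; smaller values such as 0.1 are also common
        mlp.setMomentum(0.2);     // default momentum
        mlp.setDecay(false);      // set to true to reduce the learning rate over time

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}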
Summary
In this post you discovered regression algorithms in Weka.
Specifically you learned:
About 5 top regression algorithms you can use for predictive modeling.
How to run regression algorithms in Weka.
About key configuration options for regression algorithms in Weka.
Do you have any questions about regression algorithms in Weka or this post? Ask your questions in the
comments and I'll do my best to answer.
