Day 2 Presentation
AI & MACHINE
LEARNING
Manoranjan Dash
Professor and Dean
School of Computing and Data Science
FLAME University, Pune
• You can perform tasks ranging from basic visuals to data manipulations,
transformations, and data mining. It consolidates all the functions of the
entire process into a single workflow.
• The best part about Orange, and its key differentiator, is its wonderful
visuals. You can try silhouettes, heat maps, geo maps and many other
visualizations.
2. Setting up your System
• Orange comes bundled with Anaconda if you've previously
installed it. If not, follow these steps to download Orange.
This is what the start-up page of Orange looks like. You have options that allow you to create new projects,
open recent ones or view examples and get started.
• Before we delve into how Orange works, let’s define a few key terms
to help us in our understanding:
• A widget is the basic processing point of any data manipulation. It can do a
number of actions based on what you choose in your widget selector on the
left of the screen.
• A workflow is the sequence of steps or actions that you take in your platform
to accomplish a particular task.
• For now, click on “New” and let’s start building your first workflow.
3. Creating Your First Workflow
• This is the first step towards building a solution to any problem. We
need to first understand what steps to take in order to achieve our
final goal. After you click "New" in the step above, this is what you
should see.
This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging
any widget from the widget menu to your workflow.
4. Familiarising yourself with the basics
• Orange is a platform that can help us solve most problems in Data
Science today, from the most basic visualizations to training models.
You can even evaluate models and perform unsupervised learning on
datasets.
• Problem
• The problem we're looking to solve in this tutorial is the Loan Prediction
practice problem, which can be accessed on DataHack:
Loan Prediction (analyticsvidhya.com)
Importing the data files
- We begin with the first and most necessary step toward understanding our data and making predictions: importing our data
- Step 1: Click on the “Data” tab on the widget selector menu and drag the widget “CSV File Import” to our blank workflow.
Directory for Orange Datasets
C:\Users\mdash\Desktop\CG\ACE_TEACHING\AI_ML_Fundamentals_Dec2023\Orange3-3.36.1\Orange\Lib\site-packages\Orange\datasets
Step 2: Double click the "File" widget and select the file you want to load into the workflow. Import the Iris dataset.
Step 3: Click on the "Data Table" widget.
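As an aside, the same dataset can also be loaded from Orange's Python scripting layer. A minimal sketch, assuming Orange3 is installed as a Python package:

    import Orange

    # Load the bundled Iris dataset; Orange resolves the name against its
    # datasets directory (the path shown above)
    data = Orange.data.Table("iris")
    print(data.domain)   # attributes and class variable
    print(data[:5])      # first five rows, like the Data Table widget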
Understanding our Data
Click on the semicircle in front of the “File” widget and drag it to an empty space in the workflow and
select the “Scatter Plot” widget.
Another way to visualize our distributions would be the “Distributions” widget. Click on the semi-circle again, and
drag to find the widget “Distributions”.
Missing Values and Imputation
Logistic Regression
- Despite its name, logistic regression is used for classification, not regression.
It's called "regression" because it's an extension of linear regression,
adapted for classification purposes through the logistic function.
Logistic Regression Algorithm:
1. Linear Combination:
• The algorithm starts with a linear combination of the input features:
z = b0 + b1·x1 + … + bn·xn, where b0, …, bn are coefficients and
x1, …, xn are the input features.
2. Logistic Function (Sigmoid):
• The linear combination is then passed through a logistic function, also known as
the sigmoid function. The sigmoid function maps any real-valued number to the
range between 0 and 1. The formula for the sigmoid function is
σ(z) = 1 / (1 + e^(−z)).
3. Probability Prediction:
• The output of the logistic function represents the probability that the given input
point belongs to the positive class (class 1). It can be interpreted as the
probability of success in a binary outcome.
4. Decision Threshold:
• A decision threshold is chosen (typically 0.5), and if the predicted probability is
greater than or equal to the threshold, the instance is classified as belonging to
the positive class; otherwise, it is classified as belonging to the negative class
(see the sketch below).
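A minimal NumPy sketch of these four steps, assuming the coefficients have already been fitted (the values below are illustrative, not from a real model):

    import numpy as np

    def sigmoid(z):
        # Step 2: maps any real number into (0, 1): sigma(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def predict(X, b0, b, threshold=0.5):
        z = b0 + X @ b                       # Step 1: linear combination
        p = sigmoid(z)                       # Steps 2-3: probability of class 1
        return (p >= threshold).astype(int)  # Step 4: decision threshold

    X = np.array([[1.0, 2.0], [3.0, 0.5]])   # two hypothetical sample points
    print(predict(X, b0=-1.0, b=np.array([0.8, -0.4])))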
• Step 5: Double click the widget and select the type of
regularization you want to perform.
• Ridge Regression:
• Performs L2 regularization, i.e. adds penalty equivalent to square
of the magnitude of coefficients
• Minimization objective = LS Obj + α * (sum of square of
coefficients)
• Lasso Regression:
• Performs L1 regularization, i.e. adds penalty equivalent to
absolute value of the magnitude of coefficients
• Minimization objective = LS Obj + α * (sum of absolute value of
coefficients)
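The two minimization objectives above can be written down directly. A small NumPy sketch, where b is a hypothetical coefficient vector:

    import numpy as np

    def ridge_objective(X, y, b, alpha):
        ls = np.sum((y - X @ b) ** 2)          # LS Obj: least-squares error
        return ls + alpha * np.sum(b ** 2)     # + alpha * sum of squared coefficients

    def lasso_objective(X, y, b, alpha):
        ls = np.sum((y - X @ b) ** 2)
        return ls + alpha * np.sum(np.abs(b))  # + alpha * sum of absolute coefficients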
Step 6: Next, click on the “File” or the “Logistic Regression” widget and find the
“Test and Score” widget. Make sure you connect both the data and the model
to the testing widget.
Step 7: Click on “Test and Score” widget to see how well your model is doing.
Step 8: To visualize the results better, drag and drop from the "Test and
Score" widget to find "Confusion Matrix".
Step 9: Once you’ve placed it, click on it to visualize your findings!
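For reference, the same evaluate-then-inspect loop can be sketched outside the GUI with scikit-learn (an equivalent illustration, not what the widgets run internally):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    y_pred = cross_val_predict(model, X, y, cv=10)  # "Test and Score" step
    print(confusion_matrix(y, y_pred))              # "Confusion Matrix" step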
Random Forest
• It is an ensemble learning method used for both classification and
regression tasks.
• It operates by constructing a multitude of decision trees during
training and outputs the mode (for classification) or mean prediction
(for regression) of the individual trees as the final prediction.
• Random Forest introduces randomness both in the data used to train
each tree and in the features considered when splitting each node of
the trees.
Random Forest Algorithm:
1. Bootstrapped Sampling (Bagging):
• Random Forest starts by creating multiple bootstrap samples from the original
dataset. Each sample is obtained by randomly sampling with replacement from
the original dataset.
2. Random Feature Selection:
• At each node of each tree, a random subset of features is selected to determine
the best split. This introduces diversity among the trees and helps prevent
overfitting.
3. Decision Tree Construction:
• For each bootstrap sample and at each node of the tree, the algorithm constructs
a decision tree using the selected features. The tree is grown until a stopping
criterion is met (e.g., a maximum depth is reached).
4. Voting (Classification) or Averaging (Regression):
• For classification problems, the final prediction is determined by a majority vote
among the individual trees. For regression problems, it's the average of the
predictions.
Hyperparameters:
Number of Trees (n_estimators): The number of decision
trees in the forest.
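A minimal scikit-learn sketch of the same idea, using the Iris data loaded earlier (illustrative settings, not tuned):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    # n_estimators is the number of trees; bootstrapped sampling and random
    # feature selection at each split happen internally, as described above
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    print(forest.predict(X[:5]))  # majority vote across the 100 trees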
Support Vector Machine (SVM)
Hyperplane Definition: SVM aims to find the hyperplane that best separates the data into two classes. A hyperplane is a
decision boundary that maximizes the margin between the two classes. The margin is defined as the distance between the
hyperplane and the nearest data point from either class.
Optimization Objective: SVM formulates an optimization problem to maximize the margin while minimizing classification
errors. The optimal hyperplane is the one that satisfies this objective.
Soft Margin (C parameter): In some cases, it may not be possible to find a hyperplane that perfectly separates the classes.
SVM introduces a "soft margin" that allows for some misclassification. The parameter C controls the trade-off between
having a smooth decision boundary and classifying training points correctly.
Kernel Trick: SVM can handle non-linear decision boundaries by using the kernel trick. This involves mapping the input
features into a higher-dimensional space where a hyperplane can effectively separate the data. Common kernels include
linear, polynomial, and radial basis function (RBF) kernels.
Hyperparameters:
Kernel Type: The choice of kernel (linear, polynomial,
RBF, etc.) influences the decision boundary shape.
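A short scikit-learn sketch showing where the kernel and C hyperparameters enter (illustrative values):

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    # kernel sets the decision-boundary shape; C controls the soft margin:
    # smaller C allows more misclassification for a smoother boundary
    clf = SVC(kernel='rbf', C=1.0)
    clf.fit(X, y)
    print(clf.predict(X[:5]))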
k-Nearest Neighbors (kNN)
Distance Calculation: When a prediction is needed for a new data point, kNN calculates the
distance between that point and all other points in the training dataset. Common distance
metrics include Euclidean distance, Manhattan distance, or other distance measures.
Finding Neighbors: The algorithm identifies the k-nearest neighbors of the new data point
based on the calculated distances.
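These two steps, plus a majority vote among the neighbors, are all of kNN classification. A minimal NumPy sketch using Euclidean distance:

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=3):
        # Distance calculation: Euclidean distance to every training point
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        # Finding neighbors: indices of the k nearest points
        nearest = np.argsort(dists)[:k]
        # Majority vote among the neighbors' labels
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]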
Neural Networks
Hidden Layers:
Between the input and output layers, there can be one or more hidden layers. Each
neuron in a hidden layer takes input from the previous layer, applies a weighted sum
and an activation function, and produces an output for the next layer.
Output Layer:
The output layer produces the final prediction or classification. The number of neurons
in the output layer depends on the type of task (e.g., binary classification, multi-class
classification, regression).
Neural Network Training (Backpropagation):
Forward Propagation: During training, the input data is fed forward through the network,
and the predictions are computed.
Loss Function: A loss function is used to measure the difference between the predicted
output and the actual target. Common loss functions include mean squared error for
regression and cross-entropy for classification.
Optimization: Optimization algorithms (e.g., stochastic gradient descent) are used to find
the optimal weights that minimize the loss function.
Activation Functions: Neurons typically use activation functions (e.g., sigmoid, tanh, ReLU)
to introduce non-linearity into the model, enabling it to learn complex patterns.
Hyperparameters:
Number of Layers: The choice of the number of hidden
layers and neurons in each layer.
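A bare-bones NumPy sketch of forward propagation through one hidden layer (random, untrained weights; training via backpropagation and SGD is what libraries automate):

    import numpy as np

    def relu(z):
        # Activation function: introduces non-linearity
        return np.maximum(0.0, z)

    def forward(x, W1, b1, W2, b2):
        h = relu(W1 @ x + b1)   # hidden layer: weighted sum + activation
        return W2 @ h + b2      # output layer (task-dependent activation omitted)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden neurons
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # 4 hidden -> 2 outputs
    print(forward(np.array([0.5, -1.2, 0.3]), W1, b1, W2, b2))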
Hierarchical Clustering
The Hierarchical Clustering widget will output a new dataset with a cluster label
attached to each data point. You can then use this cluster label to visualize your data or
to perform other analyses.
In hierarchical clustering, the height ratio and top N are two methods for selecting the
number of clusters to extract from a dendrogram.
a. Height Ratio
• It is a measure of the relative distance between clusters.
• It is calculated by dividing the distance between two merged clusters by the total
height of the dendrogram.
• The height of the dendrogram is the maximum distance between any two clusters.
• A high height ratio indicates that the two clusters are very different, while a low
height ratio indicates that the two clusters are very similar.
• To use the height ratio to determine the optimal number of clusters, you can cut
the dendrogram at a height that corresponds to a desired level of similarity
between clusters. For example, if you want to extract five clusters, you would cut
the dendrogram at a height that corresponds to a height ratio of 0.2. This would
ensure that the five clusters are relatively different from each other.
b. Top N
• It selects the N largest clusters from the dendrogram.
• It is useful if you want to extract a specific number of clusters, regardless of the
similarity between the clusters.
• To use the top N method to determine the optimal number of clusters, you can
specify the desired number of clusters (N) in the Hierarchical Clustering widget.
The widget will then extract the N largest clusters from the dendrogram and assign
them to data points.
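Both selection methods can be sketched with SciPy outside the widget (an illustration of the idea, not the widget's internals):

    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)
    Z = linkage(X, method='ward')  # build the dendrogram
    # Height-ratio style cut: threshold at 20% of the dendrogram's total height
    labels_ratio = fcluster(Z, t=0.2 * Z[:, 2].max(), criterion='distance')
    # Top-N style cut: ask directly for N = 5 clusters
    labels_topn = fcluster(Z, t=5, criterion='maxclust')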
Sieve Diagram
• It is a graphical visualization tool used in data mining to examine the
relationship between two categorical variables
• In the context of data clustering, the Sieve diagram can be employed to
assess the effectiveness of the clustering process by comparing the
observed frequencies of attribute combinations to the expected
frequencies under the assumption of independence
• Interpreting the Sieve diagram:
• It displays a grid of rectangles, where each rectangle represents a combination of
attribute values.
• The area of each rectangle corresponds to the expected frequency under the
assumption of independence, while the number of squares inside each rectangle
indicates the observed frequency.
• Deviations from the expected frequencies suggest potential dependencies
between the attributes.
Rectangles:
Each rectangle represents a combination of values for the two categorical variables. The
size of each rectangle is proportional to the expected frequency of that combination,
assuming the variables are independent
Squares:
The number of squares inside each rectangle represents the observed frequency of that
combination. If the observed frequency is significantly higher than the expected
frequency, it suggests a positive correlation between the variables. Conversely, if the
observed frequency is significantly lower than the expected frequency, it suggests a
negative correlation.
Coloring:
The squares are colored according to the deviation from the expected frequency. Red
squares indicate positive deviations, while blue squares indicate negative deviations.
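The expected frequency under independence is row total × column total / grand total. A small pandas sketch with hypothetical data (the variables below are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'gender':   ['M', 'F', 'M', 'F', 'M', 'F'],
                       'approved': ['Y', 'Y', 'N', 'Y', 'N', 'N']})
    observed = pd.crosstab(df['gender'], df['approved'])
    # Expected frequency under independence: row_total * col_total / grand_total
    expected = (observed.sum(axis=1).values[:, None]
                * observed.sum(axis=0).values[None, :]) / observed.values.sum()
    # observed - expected: positive deviations would be red, negative blue
    print(observed.values - expected)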
(Slide figures: Sieve Diagram and Box Plot examples)
Visualization
• Data Table
• Scatter Plot
• Mosaic Display
• Sieve Diagram
• Rank
• Rad Viz
Mosaic Display: It is a graphical method for visualizing the association between two
categorical variables. It uses a grid of rectangles, with the area of each rectangle
proportional to the joint frequency of the corresponding categories.
Rank: In the context of data analysis and visualization, "rank" typically refers to the ordering
of items based on a particular criterion. For example, you might rank items by their
frequency, importance, or some other measure. Rank visualizations often involve displaying
items in order, highlighting their relative positions in a list or chart.
Rad Viz (Radial Visualization): It is a method of data representation where data points are
arranged in a circular or radial pattern. It's particularly useful for visualizing multivariate
data, as different variables can be represented along the radial axes.
Petal length and petal width give better clusters (Mosaic Display + Scatter Plot, selected points).
Rank
• The Rank widget in Orange Data Mining is used to score variables
according to their correlation with a discrete or numeric target
variable.
• It utilizes various internal scorers, such as information gain, chi-square,
and linear regression, to assess the relevance of each variable to the
target variable.
• Additionally, it can incorporate scores from external models like linear
regression, logistic regression, random forest, and SGD.
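The internal scorers can be approximated with scikit-learn for illustration (mutual information stands in for information gain here, an assumption on my part):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import chi2, mutual_info_classif

    X, y = load_iris(return_X_y=True)
    print(mutual_info_classif(X, y, random_state=0))  # information-gain-style scores
    scores, p_values = chi2(X, y)                     # chi-square scores
    print(scores)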
Radviz: Radial Visualization
• The Radviz widget in Orange Data Mining is a non-linear
multidimensional visualization technique that can display data
defined by three or more variables in a 2-dimensional projection.
• It utilizes a metaphor from physics, where data instances are
represented as points within a circle, and their positions are
determined by springs attached to attribute anchors located on the
circle's perimeter.
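The spring metaphor has a simple closed form: each point lands at the weighted average of the attribute anchors, weighted by its normalized attribute values. A NumPy sketch of this projection:

    import numpy as np

    def radviz_project(X):
        n_attrs = X.shape[1]
        # Attribute anchors evenly spaced on the unit circle
        theta = 2 * np.pi * np.arange(n_attrs) / n_attrs
        anchors = np.stack([np.cos(theta), np.sin(theta)], axis=1)
        # Normalize each attribute to [0, 1] so the spring weights are comparable
        Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
        # Equilibrium position: weighted average of the anchors
        # (small epsilon guards against all-zero rows)
        return (Xn @ anchors) / (Xn.sum(axis=1, keepdims=True) + 1e-12)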
References
1. https://www.analyticsvidhya.com/blog/2017/09/building-machine-learning-model-fun-using-orange/