Machine Learning Classification for Breast Cancer Diagnosis:
A Comparative Study of Ensemble and Nonlinear Classifiers

Name: Karthick R    Roll No: 242SP015

1. Introduction
In machine learning, classification tasks are fundamental to
predictive modeling: the goal is to predict labels for input
samples based on their features. In binary classification, the
task is to assign each sample to one of two classes, a setting
common in fields such as healthcare and finance. Healthcare, for
example, often uses binary classification to distinguish cancerous
from non-cancerous cells.

Ensemble learning methods and nonlinear classifiers have gained
significant attention due to their ability to improve predictive
performance, particularly on complex, high-dimensional datasets.
Ensemble methods combine the results of multiple models to produce
more robust and accurate predictions. This report explores several
classifiers: a single Decision Tree as a baseline, the ensemble
methods Random Forest and AdaBoost built on it, and the nonlinear
models Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN).
All models are evaluated on the well-known Breast Cancer Wisconsin
dataset.

2. Dataset Description
The dataset used for evaluation is the Breast Cancer Wisconsin
(Diagnostic) dataset, which has been widely used for benchmarking
binary classification models.

Number of Samples: 569

Features: 30 numeric attributes, including measurements like
radius, texture, perimeter, area, smoothness, compactness,
concavity, and symmetry.

Target Classes (as encoded in scikit-learn's load_breast_cancer):

0: Malignant (cancerous)

1: Benign (non-cancerous)

The dataset is moderately imbalanced, with 357 benign and 212
malignant samples, and its clean, well-characterized features make
it a standard benchmark for classification models.
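
A minimal sketch of loading the data, assuming the copy bundled
with scikit-learn is used (the 80/20 stratified split and random
seed are illustrative assumptions, reused by the classifier
sketches below):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    import numpy as np

    # Load the Breast Cancer Wisconsin (Diagnostic) dataset.
    data = load_breast_cancer()
    X, y = data.data, data.target

    print(X.shape)  # (569, 30): 569 samples, 30 numeric features
    # In scikit-learn's encoding, 0 = malignant and 1 = benign.
    print(dict(zip(data.target_names, np.bincount(y))))
    # {'malignant': 212, 'benign': 357}

    # Stratified 80/20 train/test split, reused in later sketches.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)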

3. Ensemble Classifiers
Ensemble methods combine multiple individual models to produce a
final prediction, often improving accuracy and robustness. This
section first describes the decision tree, the base learner used by
both ensembles below, and then the Random Forest and AdaBoost
ensembles built on it.

3.1 Decision Tree

A decision tree splits the data into subsets based on the most
significant feature at each node. These splits are applied
recursively until each subset is pure, i.e., contains instances of
only one class, or until a stopping criterion such as a maximum
depth is reached.

Pros:

Easy to interpret and visualize.

Can model both linear and nonlinear relationships.

Cons:

Can overfit if not pruned or limited in depth.

Sensitive to noise in the data.
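
A minimal sketch, reusing the split from Section 2 (the max_depth
value is an assumed example, not a tuned setting); capping the
depth prunes the tree and mitigates the overfitting noted above:

    from sklearn.tree import DecisionTreeClassifier

    # A shallow tree trades some training accuracy for better
    # generalization on unseen data.
    tree = DecisionTreeClassifier(max_depth=4, random_state=42)
    tree.fit(X_train, y_train)
    print("Decision Tree test accuracy:", tree.score(X_test, y_test))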

3.2 Random Forest

Random Forest is an ensemble of decision trees. Each tree is built
on a random bootstrap sample of the data, with a random subset of
features considered at each split, and the trees' predictions are
aggregated (majority voting for classification, averaging for
regression) to produce the final result.

Pros:

Reduces overfitting compared to individual decision trees.

Robust to noise and outliers.

Works well for both classification and regression.

Cons:

Less interpretable than a single decision tree.

Can be slower to train and predict due to the large number of
trees.
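
A minimal sketch, reusing the split from Section 2 (n_estimators=100
is an illustrative default, not a tuned value):

    from sklearn.ensemble import RandomForestClassifier

    # 100 trees, each grown on a bootstrap sample with a random
    # feature subset at every split; the final class is the
    # majority vote across trees.
    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(X_train, y_train)
    print("Random Forest test accuracy:", forest.score(X_test, y_test))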

3.3 AdaBoost

AdaBoost (Adaptive Boosting) works by iteratively training weak
learners, typically shallow decision trees, and focusing on the
errors made by previous learners: after each round, misclassified
samples receive higher weight, so each subsequent model concentrates
on correcting the mistakes of its predecessors.

Pros:

Often achieves high accuracy, even with weak learners.

Can reduce bias and variance.

Cons:

Sensitive to outliers and noisy data.

May overfit if the model complexity is too high.
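
A minimal sketch, reusing the split from Section 2; scikit-learn's
AdaBoostClassifier uses a depth-1 decision stump as its default weak
learner (n_estimators=100 is an illustrative assumption):

    from sklearn.ensemble import AdaBoostClassifier

    # Each boosting round re-weights the training samples so that
    # previously misclassified points get more attention.
    ada = AdaBoostClassifier(n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    print("AdaBoost test accuracy:", ada.score(X_test, y_test))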

4. Nonlinear Classifiers
Nonlinear classifiers are especially useful when the boundary
between classes cannot be described by a straight line or
hyperplane.

4.1 Support Vector Machine (SVM) with RBF Kernel

SVM aims to find a hyperplane that separates the classes with the
maximum margin. The Radial Basis Function (RBF) kernel allows SVM to
operate in a higher-dimensional space, enabling it to model complex
decision boundaries.

Pros:

Effective in high-dimensional spaces.

Performs well in cases where the classes are not linearly
separable.

Cons:

Computationally expensive, especially with large datasets.

Requires careful tuning of hyperparameters (C and gamma).
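
A minimal sketch, reusing the split from Section 2; the RBF kernel
is sensitive to feature scale, so the features are standardized
first, and a small illustrative grid search covers the C and gamma
hyperparameters mentioned above:

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import GridSearchCV

    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

    # Illustrative grid, not the report's actual search space.
    param_grid = {"svc__C": [0.1, 1, 10],
                  "svc__gamma": ["scale", 0.01, 0.1]}
    grid = GridSearchCV(svm, param_grid, cv=5)
    grid.fit(X_train, y_train)
    print("SVM (RBF) test accuracy:", grid.score(X_test, y_test))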

4.2 k-Nearest Neighbors (k-NN)

k-NN classifies a sample by taking the majority class among the k
nearest training samples in the feature space. It is a simple,
intuitive algorithm that makes predictions based on the proximity of
data points.

Pros:

No training phase (instance-based learning).

Simple and easy to understand.

Cons:

Computationally expensive during inference (as it requires
calculating distances to all training points).

Sensitive to the scale of features and the choice of k.

Can be slow with large datasets and noisy data.
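
A minimal sketch, reusing the split from Section 2; since k-NN is
sensitive to feature scale, the features are standardized first
(k=5 is a common default, assumed here rather than tuned):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # An odd k avoids ties in binary majority voting.
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5))
    knn.fit(X_train, y_train)  # "fitting" just stores the samples
    print("k-NN test accuracy:", knn.score(X_test, y_test))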


5. Evaluation Metrics
To assess the performance of each classifier, we use several
standard evaluation metrics:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 * (Precision * Recall) / (Precision + Recall)

Where:

TP: True positives

TN: True negatives

FP: False positives

FN: False negatives
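
As a quick illustration of these formulas (the labels below are
hypothetical, not results from this study), the four counts can be
read off a confusion matrix:

    from sklearn.metrics import confusion_matrix

    # Hypothetical labels for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

    # For binary labels, ravel() yields TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.75
    precision = tp / (tp + fp)                          # 0.75
    recall = tp / (tp + fn)                             # 0.75
    f1 = 2 * precision * recall / (precision + recall)  # 0.75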

6. Code
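
Below is a minimal end-to-end sketch of the experiment, assuming
scikit-learn; the hyperparameter values and the stratified 80/20
split are illustrative assumptions rather than the report's exact
settings:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Scale-sensitive models (SVM, k-NN) get a StandardScaler step.
    models = {
        "Decision Tree": DecisionTreeClassifier(max_depth=4,
                                                random_state=42),
        "Random Forest": RandomForestClassifier(n_estimators=100,
                                                random_state=42),
        "AdaBoost": AdaBoostClassifier(n_estimators=100,
                                       random_state=42),
        "SVM (RBF)": make_pipeline(StandardScaler(),
                                   SVC(kernel="rbf")),
        "k-NN": make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=5)),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(f"{name:13s}"
              f" acc={accuracy_score(y_test, y_pred):.3f}"
              f" prec={precision_score(y_test, y_pred):.3f}"
              f" rec={recall_score(y_test, y_pred):.3f}"
              f" f1={f1_score(y_test, y_pred):.3f}")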

7. Results
The performance of the models on the Breast Cancer Wisconsin dataset
was evaluated using the metrics mentioned above.

Best Models:

Random Forest and AdaBoost achieved the highest accuracy and
robustness. Random Forest's ensemble nature helped reduce
overfitting, while AdaBoost's iterative focus on correcting errors
yielded strong performance even with very simple weak learners.

SVM with the RBF kernel also performed well, particularly because
it can handle data that is not linearly separable. However, it
required more tuning and computational power.

Model Comparison:

Decision Tree performed well but was prone to overfitting and had
lower accuracy compared to the ensemble methods.

k-NN showed good performance on small datasets but struggled with
larger, more complex data due to its sensitivity to feature scaling
and its computational cost.

8. Conclusion
Ensemble learning techniques like Random Forest and AdaBoost offer
significant robustness and accuracy, making them ideal for real-world
applications in domains such as healthcare, where precision is
critical. Nonlinear models like SVM are valuable when the data is
highly complex and non-linearly separable, while k-NN provides a
simple but effective solution for smaller, less complex datasets. The
choice of model depends on the dataset's characteristics, problem
requirements, and computational resources.

End of the Report
