Naïve Bayes classification
The goal is to find the conditional probability of a class given the observed features. When there are many causes (features) and only a single effect (the class), the simple conditional-probability rule cannot be used directly, so the features are assumed to be independent of each other given the class (e.g., the features of a fruit are treated as independent) – this assumption is what makes the method "naïve". Bayes' theorem is then used to combine the individual feature likelihoods into a posterior probability for each class – hence "Bayes". In the worked example, the posterior probability for "yes" is higher than for "no", so "yes" is the predicted class.
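As a rough illustration of this idea, here is a minimal scikit-learn sketch that fits a Gaussian Naïve Bayes classifier to a made-up fruit-style dataset and reads off the posterior probabilities; the feature names, values, and labels are illustrative assumptions, not data from the notes.

# Minimal Naive Bayes sketch; the toy data below is assumed for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical features: [weight in grams, diameter in cm]; labels: 0 = "no", 1 = "yes".
X = np.array([[150, 7.0], [170, 7.5], [140, 6.8],
              [120, 6.0], [110, 5.8], [115, 6.1]])
y = np.array([1, 1, 1, 0, 0, 0])

model = GaussianNB()   # treats the features as conditionally independent given the class
model.fit(X, y)

x_new = np.array([[160, 7.2]])
print(model.predict_proba(x_new))   # posterior probabilities P(no | x) and P(yes | x)
print(model.predict(x_new))         # the class with the higher posterior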
Self-Organizing Feature Maps:
A Self-Organizing Feature Map (SOFM or SOM), also known as a Kohonen Map, is a type of artificial neural network that performs unsupervised learning to create a low-dimensional (usually 2D) map of input data. It organizes data points based on their similarity, grouping similar data together in clusters without needing labeled data. This makes SOMs particularly useful for data visualization and pattern recognition in high-dimensional datasets.

How It Works:
1. Network Structure: The SOM consists of a grid of neurons, typically arranged in 2D, where each neuron has a weight vector of the same dimension as the input data. These weight vectors act as coordinates for data points in the map.
2. Training Process: When an input vector is presented to the SOM, the neuron with the closest matching weight vector is identified (called the best matching unit, or BMU). The weights of the BMU and its neighboring neurons are updated to move closer to the input vector, making the map more representative of the input data distribution. Over time, the map "self-organizes" such that similar input vectors activate neighboring neurons, creating clusters that reveal the data's underlying structure (a minimal training-loop sketch follows this section).
3. Neighborhood Function: During each training step, not only the BMU but also neighboring neurons (within a defined radius) have their weights updated, allowing similar input data points to form clusters in the map. This radius decreases over time to refine the clusters.
4. Dimensionality Reduction: SOMs reduce high-dimensional input data to a 2D or 3D space while preserving the topological relationships, making it easier to visualize clusters and patterns in the data.

Advantages of Self-Organizing Maps:
- Visualization of High-Dimensional Data: SOMs provide an easy way to visualize data patterns, especially for complex datasets.
- Clustering and Data Compression: SOMs automatically cluster data, which can be helpful in identifying groups or categories within unlabeled data.
- Topological Preservation: Similar data points are mapped close to each other, helping maintain the data's structure.
- Unsupervised Learning: SOMs do not require labeled data, so they are ideal for exploratory data analysis.

Disadvantages of Self-Organizing Maps:
- Limited Interpretability: The map might be difficult to interpret without additional analysis, especially in large or complex datasets.
- Choice of Parameters: The results are sensitive to parameters like the learning rate, neighborhood function, and grid size, which may require careful tuning.
- Computationally Intensive: Training SOMs, especially for high-dimensional data, can be computationally demanding.
- Not Ideal for Large Datasets: While SOMs can be applied to larger datasets, they may struggle with extremely large or complex data due to computational limitations.

Applications of Self-Organizing Maps:
- Data Clustering and Classification: Used in exploratory data analysis to uncover natural clusters in data.
- Image and Speech Recognition: SOMs can be applied in the pre-processing stages of pattern recognition systems.
- Customer Segmentation: Useful in marketing for identifying customer segments based on purchasing behavior.
- Genomics: Applied in bioinformatics to organize and visualize genetic data.
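To make the training process above concrete, here is a minimal NumPy sketch of a SOM training loop (BMU search followed by a shrinking Gaussian neighborhood update). The grid size, learning rate, decay schedule, and random data are illustrative assumptions, not values from the notes.

# Minimal SOM training sketch (illustrative parameters, not from the notes).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 3))                    # 200 input vectors, 3 features each

grid_h, grid_w = 10, 10                     # 2D grid of neurons
W = rng.random((grid_h, grid_w, 3))         # one weight vector per neuron

# Precompute each neuron's (row, col) position for grid-distance calculations.
pos = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

n_steps = 2000
lr0, sigma0 = 0.5, max(grid_h, grid_w) / 2  # initial learning rate and neighborhood radius

for t in range(n_steps):
    x = X[rng.integers(len(X))]             # present one random input vector
    # 1. Find the best matching unit (BMU): the neuron whose weights are closest to x.
    dists = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # 2. Decay the learning rate and neighborhood radius over time.
    frac = t / n_steps
    lr = lr0 * (1 - frac)
    sigma = sigma0 * np.exp(-3 * frac)
    # 3. Gaussian neighborhood: neurons near the BMU on the grid are pulled harder.
    grid_dist2 = np.sum((pos - np.array(bmu)) ** 2, axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))
    # 4. Move the BMU and its neighbors toward the input vector.
    W += lr * h[..., None] * (x - W)

# After training, similar inputs activate nearby neurons on the 10x10 grid.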
Model Selection:
Training set – the portion of the data used to fit the model; it should be as clean and error-free as possible, and in practice it is usually the largest split (e.g., about 80% of the data).
Validation set – used to compare the model's outputs against the expected values while tuning and comparing candidate models.
Test set – held-out data the model has not seen, used to evaluate the selected model and estimate how it will behave on new data.
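As a quick illustration of these three roles, here is a minimal scikit-learn sketch that carves one dataset into training, validation, and test portions; the 60/20/20 proportions and toy data are assumptions for the example, not a prescription from the notes.

# Minimal train/validation/test split sketch (60/20/20 is an assumed example split).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)       # toy feature matrix
y = np.arange(100)                      # toy targets

# First hold out 20% as the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20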
Model selection is the process of choosing the most appropriate statistical or machine learning model from a set of candidate models based on their performance on a given dataset. The goal is to find a model that generalizes well to unseen data and balances complexity and accuracy.

Steps in Model Selection:
1. Define the Problem and Goals: Determine the type of problem (e.g., classification, regression) and what you aim to achieve (e.g., high accuracy, low computation cost, interpretability).
2. Prepare the Dataset: Clean and preprocess the dataset, handle missing values, and split the data into training, validation, and test sets.
3. Choose Candidate Models: Identify a set of models that could potentially perform well on the data. This selection could include simple models like linear regression, as well as more complex ones like random forests or neural networks.
4. Set Up a Validation Process: Use techniques like cross-validation (e.g., k-fold cross-validation) to evaluate the performance of each candidate model. This helps estimate how each model will generalize to new data.
5. Evaluate Performance: Compare models based on chosen evaluation metrics (e.g., accuracy, precision, recall, F1 score for classification, or mean squared error for regression).
6. Select the Model with the Best Performance: Choose the model that performs best on the validation data. Avoid simply selecting the model with the highest accuracy; consider other factors like interpretability, training time, and prediction speed.
7. Fine-Tune the Chosen Model: Perform hyperparameter tuning (using grid search or random search) on the chosen model to optimize its performance further.
8. Test the Final Model: Assess the selected model on the test set (data it has not seen during training or validation) to get an unbiased estimate of its performance on new data.

Key Factors in Model Selection:
1. Performance Metrics: Choose a model based on metrics relevant to the problem, like precision and recall in classification or RMSE in regression.
2. Complexity vs. Interpretability: Simple models (e.g., linear models) are more interpretable but may not capture complex patterns, while complex models (e.g., deep neural networks) might have higher accuracy but can be harder to interpret.
3. Generalization and Overfitting: Consider models that balance accuracy and complexity to avoid overfitting (performing well on training data but poorly on new data).
4. Computational Cost: Models vary in computational efficiency. Deep learning models may need more processing power than simpler models like decision trees.

Common Model Selection Techniques:
- Cross-Validation: Splitting the data multiple times to evaluate model performance across different data splits.
- Grid Search and Random Search: Systematically or randomly trying different hyperparameter combinations to improve model performance (see the sketch after this section).
- Information Criteria (e.g., AIC, BIC): Used in statistics to penalize models based on complexity (e.g., in regression models).
- Ensemble Methods: Combining multiple models (e.g., bagging, boosting) can sometimes yield better results than individual models.

Model Selection Strategies:
- Compare baseline models first: Start with simple models (like linear regression or logistic regression) and build up to more complex models only if needed.
- Use automated tools: Tools like AutoML can help streamline model selection by automatically comparing different algorithms and tuning their hyperparameters.
- Consider interpretability needs: If the model needs to be explainable (e.g., in healthcare or finance), simpler models or explainable models like decision trees or linear models may be preferable over black-box models like deep neural networks.

In summary, model selection is about choosing a model that balances performance, interpretability, and efficiency, ensuring that it can generalize well to new data while meeting the requirements of the specific problem.
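To make the cross-validation and grid-search steps above concrete, here is a short scikit-learn sketch that compares two candidate models with k-fold cross-validation and then tunes the better-performing one with GridSearchCV; the candidate models, parameter grid, and synthetic data are illustrative assumptions.

# Model selection sketch: cross-validation to compare candidates, grid search to tune.
# The candidate models and parameter grid below are assumed for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Compare baseline candidates with 5-fold cross-validation.
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, scores.mean())

# 2. Fine-tune the chosen model with a grid search over a few hyperparameters.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)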
Bias-Variance Tradeoff:
Bias – the difference between the true (original) values and the model's predicted values; it shows up as error on the training dataset.
A typical split is 0.8 of the data for training (where bias shows up) and 0.2 for testing (where variance shows up). As the degree of the polynomial increases, the complexity of the model also increases and the shape of the fitted curve changes (see the sketch after this section).
1] High-complexity model: low bias and high variance when exposed to the test dataset, even though the training targets are achieved. Overfitting – the model learns the training dataset well (e.g., 96% accuracy) but cannot reproduce that on the test dataset (e.g., 85% accuracy).
2] Low-complexity model: Underfitting – training accuracy is very low (e.g., 64%) and roughly the same on the test dataset.

The bias-variance tradeoff is a fundamental concept in machine learning and statistics that describes the balance between two types of errors that affect a model's performance on new, unseen data. Understanding this tradeoff is essential for building models that generalize well.

Bias:
Definition: Bias refers to the error introduced by approximating a real-world problem (which may be complex) with a simplified model.
Characteristics of High-Bias Models:
- High-bias models make strong assumptions about the data and are often too simple to capture the underlying patterns in complex datasets.
- These models are prone to underfitting – they fail to capture important relationships in the data, leading to poor performance on both training and test sets.
- Example: Linear regression applied to highly nonlinear data.

Variance:
Definition: Variance refers to the model's sensitivity to fluctuations in the training data.
Characteristics of High-Variance Models:
- High-variance models are often too complex and capture noise or random fluctuations in the training data rather than the actual pattern.
- These models are prone to overfitting – they perform well on the training data but poorly on new data because they have learned irrelevant details.
- Example: A deep neural network that perfectly fits the training data but performs poorly on the test data.

The Tradeoff:
- Goal: Find a model with a balance between bias and variance that minimizes the total error on new data.
- High Bias, Low Variance: Simple models (e.g., linear regression) usually have high bias and low variance, resulting in underfitting.
- Low Bias, High Variance: Complex models (e.g., deep neural networks) often have low bias but high variance, resulting in overfitting.
- Optimal Point: The best model achieves an optimal tradeoff between bias and variance, minimizing the total error (the sum of bias error and variance error).

Illustrating the Bias-Variance Tradeoff:
- Underfitting: High bias and low variance → poor performance on both training and test data.
- Overfitting: Low bias and high variance → good performance on training data but poor performance on test data.
- Generalization: Balancing bias and variance helps the model perform well on new data.

Managing the Bias-Variance Tradeoff:
- Regularization: Techniques like L1 (Lasso) or L2 (Ridge) regularization add penalties for complexity, reducing variance.
- Model Complexity: Start with a simple model and gradually increase complexity to avoid high variance.
- Ensemble Methods: Models like Random Forests and Boosting combine multiple weak learners, often reducing variance while controlling bias.
- Cross-Validation: Helps in tuning model parameters to find the optimal complexity that balances bias and variance.

Summary: The bias-variance tradeoff is about finding the right balance between underfitting and overfitting, ensuring the model is neither too simple to capture the patterns (high bias) nor so complex that it fits noise (high variance). Achieving this balance enables a model to generalize well to unseen data.
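Here is a minimal sketch of the polynomial-degree effect described above, using scikit-learn on noisy synthetic data with the 0.8/0.2 split from the notes; the degrees, sample size, and noise level are illustrative assumptions, and the resulting scores will differ from the 96%/85%/64% figures mentioned earlier.

# Bias-variance sketch: compare low- and high-degree polynomial fits on noisy data.
# Degrees, sample size, and noise level are assumed for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)   # nonlinear target plus noise

# 0.8 training / 0.2 testing split, as in the notes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for degree in (1, 4, 15):   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(model.score(X_train, y_train), 2),   # R^2 on training data
          round(model.score(X_test, y_test), 2))     # R^2 on test data

# Expect degree 1 to score poorly on both sets (high bias, underfitting) and degree 15
# to score well on training but noticeably worse on test data (high variance, overfitting).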