Feature Selection
• Feature Selection is the process of selecting a subset of relevant features from the dataset to be used in a machine-learning model. It is an important step in the feature engineering process, as it can have a significant impact on the model's performance.
• After generating a large set of features, use statistical and machine-learning techniques to identify the most relevant ones. This includes:
1. Correlation Analysis: checking how each feature correlates with the target variable (e.g., user retention).
2. Model-Based Selection: using algorithms like random forest or LASSO regression to identify important features.
3. Cross-Validation: ensuring the selected features improve model performance on unseen data.

Benefits of Feature Selection:
1. Reduces Overfitting: by using only the most relevant features, the model can generalize better to new data.
2. Improves Model Performance: selecting the right features can improve the accuracy, precision, and recall of the model.
3. Decreases Computational Costs: a smaller number of features requires less computation and storage resources.
4. Improves Interpretability: with fewer features, it is easier to understand and interpret the results of the model.

There can be various reasons to perform feature selection:
• Simplification of the model.
• Less computational time.
• To avoid the curse of dimensionality.
• Improved compatibility of the data with models.

Roughly, feature selection techniques can be divided into three parts:
1. Filters
2. Wrappers
3. Embedded methods

Filter Methods
• Filters rank features based on a statistical measure of their relationship with the outcome variable. This is a good initial step, but it does not account for interactions between features.
• The filter method filters out irrelevant features and redundant columns from the model by ranking them with different metrics.
• Advantages:
• Simple and fast to compute; does not overfit the data.
• Provides a preliminary ranking of features.
• Disadvantages:
• Ignores feature interactions and redundancy.
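To make the filter idea concrete, here is a minimal sketch that ranks features by the absolute Pearson correlation with the target. It assumes a pandas DataFrame with numeric feature columns and a numeric target column named "retention"; the function and column names are placeholders for illustration, not part of the original notes.

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Filter method: rank features by |Pearson correlation| with the target."""
    features = df.drop(columns=[target])
    # corrwith computes the correlation of every feature column with the target
    scores = features.corrwith(df[target]).abs()
    return scores.sort_values(ascending=False)

# Hypothetical usage: df has numeric feature columns plus a 'retention' target.
# ranked = rank_by_correlation(df, target="retention")
# top_features = ranked.head(10).index.tolist()
```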
Filter Methods - Example
• Linear Regression Test: for each feature, run a linear regression with only that feature as a predictor.
• Rank features by p-value or R-squared.
• Steps:
1. Compute correlation: measure the correlation between each feature and the target variable (e.g., user retention).
2. Rank features: order features by their p-values or R-squared values.
3. Select top features: choose a subset of top-ranked features for further analysis.

Wrapper Method
• In the wrapper methodology, feature selection is treated as a search problem: different combinations of features are made, evaluated, and compared with one another. The algorithm is trained iteratively on a subset of features; based on the model's output, features are added or removed, and the model is trained again with the new feature set.
• Wrappers consider feature interactions but can be computationally expensive and prone to overfitting.
• Types of Wrappers:
1. Forward Selection (see the sketch after this list):
   a. Start with no features.
   b. Add features one at a time, selecting the one that improves the model the most.
   c. Stop when adding more features does not improve the model.
2. Backward Elimination:
   a. Start with all features.
   b. Remove features one at a time, selecting the one whose removal improves the model the most.
   c. Stop when removing more features degrades the model.
3. Combined Approach: use a hybrid of forward selection and backward elimination to balance feature inclusion and exclusion.
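As a rough illustration of forward selection, the sketch below uses scikit-learn's SequentialFeatureSelector with cross-validation. The synthetic dataset, the linear-regression estimator, and the choice to keep 4 features are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for a real feature matrix (hypothetical).
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Forward selection: start with no features, greedily add the one that
# improves cross-validated performance the most, stop at 4 features.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```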
Steps involved in Wrapper Methods
• Steps:
• Select an algorithm: choose forward selection, backward elimination, or a combined approach.
• Evaluate subsets: use cross-validation to evaluate the performance of different feature subsets.
• Optimize selection: use criteria such as R-squared, p-values, AIC, or BIC to select the best subset.

Selection criteria of features
1. R-squared:
   a. Measures the proportion of variance explained by the model.
   b. A higher R-squared indicates a better fit.
2. P-values:
   a. Assess the significance of individual features.
   b. Lower p-values indicate higher significance.
3. AIC (Akaike Information Criterion):
   a. Balances model fit and complexity.
   b. A lower AIC indicates a better model.
4. BIC (Bayesian Information Criterion):
   a. Similar to AIC but with a stronger penalty for model complexity.
   b. A lower BIC indicates a better model.
5. Entropy:
   a. Calculate entropy: compute the entropy for the entire dataset.
   b. Compute information gain: for each feature, calculate the information gain resulting from splitting the dataset on that feature.
   c. Select features: choose features with the highest information gain, as these contribute the most to reducing uncertainty.

• The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data. AIC is calculated from:
1. the number of independent variables used to build the model, and
2. the maximum likelihood estimate of the model (how well the model reproduces the data).
• The best-fit model according to AIC is the one that explains the greatest amount of variation using the fewest possible independent variables.

Model selection example
In a study of how hours spent studying and test format (multiple choice vs. written answers) affect test scores, you create two models:
1. Final test score in response to hours spent studying
2. Final test score in response to hours spent studying + test format
• You find an R-squared of 0.45 with a p-value less than 0.05 for model 1, and an R-squared of 0.46 with a p-value less than 0.05 for model 2. Model 2 fits the data slightly better, but was it worth adding another parameter just to get this small increase in model fit?
• You run an AIC test to find out, which shows that model 1 has the lower AIC score because it requires less information to predict with almost the same level of precision. Another way to think of this is that the increased precision in model 2 could have happened by chance.
• From the AIC test, you decide that model 1 is the best model for your study.
• The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function, and it is closely related to the Akaike information criterion (AIC).
• When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. The BIC resolves this problem by introducing a penalty term for the number of parameters in the model. The penalty term is larger in BIC than in AIC.
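To show how AIC and BIC are used for model comparison (AIC = 2k - 2 ln(L), BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the sample size, and L the maximized likelihood), the sketch below fits the two models from the study-hours example with statsmodels and prints their criteria. The data are simulated placeholders for illustration, not results from the original notes.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated placeholder data: hours studied, test format (0/1), and scores.
n = 100
hours = rng.uniform(0, 10, n)
test_format = rng.integers(0, 2, n)
score = 50 + 4 * hours + rng.normal(0, 8, n)  # test format has ~no real effect here

# Model 1: score ~ hours
X1 = sm.add_constant(hours)
m1 = sm.OLS(score, X1).fit()

# Model 2: score ~ hours + test format
X2 = sm.add_constant(np.column_stack([hours, test_format]))
m2 = sm.OLS(score, X2).fit()

# Lower AIC/BIC indicates the preferred model; BIC penalizes the extra
# parameter in model 2 more heavily than AIC does.
print(f"Model 1: R2={m1.rsquared:.3f}  AIC={m1.aic:.1f}  BIC={m1.bic:.1f}")
print(f"Model 2: R2={m2.rsquared:.3f}  AIC={m2.aic:.1f}  BIC={m2.bic:.1f}")
```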
Random Forest Algorithm
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
• A greater number of trees in the forest leads to higher accuracy and helps prevent overfitting.
• The working process can be explained in the steps below:
• Step 1: Select K random data points from the training set.
• Step 2: Build the decision trees associated with the selected data points (subsets).
• Step 3: Choose the number N of decision trees that you want to build.
• Step 4: Repeat steps 1 and 2.
• Step 5: For new data points, find the prediction of each decision tree, and assign the new data points to the category that wins the majority of votes.
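Tying this back to model-based feature selection, the sketch below fits a random forest classifier with scikit-learn and ranks features by the impurity-based importances it exposes. The synthetic dataset and the choice of 100 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for a real dataset (hypothetical).
X, y = make_classification(n_samples=500, n_features=12, n_informative=5, random_state=0)

# N = 100 decision trees, each trained on a bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank features by impurity-based importance (higher = more useful to the trees).
ranking = np.argsort(forest.feature_importances_)[::-1]
print("Features ranked by importance:", ranking)
```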