Add classification feature selection support #254
 * @param <R> The evaluation type.
 * @return The descriptive statistics for each metric.
 */
public static <T extends Output<T>, R extends Evaluation<T>> Map<MetricID<T>, DescriptiveStats> summarizeCrossValidation(List<Pair<R, Model<T>>> evaluations) {
This change is separate from the rest of the PR, but I added it to make it easier to aggregate across cross-validation folds, which is currently a little ugly to do, and it didn't feel worth a full PR on its own.
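The aggregation this helper performs can be sketched in plain Java. This is a hypothetical standalone version for illustration, not Tribuo's actual `EvaluationAggregator` code: it collects each metric's per-fold values and then computes a mean and sample standard deviation per metric.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of summarizing metrics across cross-validation folds.
// Each fold is represented as a map from metric name to value.
final class FoldSummary {
    // Mean of one metric's values across folds.
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) { sum += v; }
        return sum / values.size();
    }

    // Sample standard deviation across folds.
    static double stdDev(List<Double> values) {
        double m = mean(values);
        double ss = 0.0;
        for (double v : values) { ss += (v - m) * (v - m); }
        return Math.sqrt(ss / (values.size() - 1));
    }

    // Collect each metric's per-fold values, then summarize them as {mean, stddev}.
    static Map<String, double[]> summarize(List<Map<String, Double>> foldMetrics) {
        Map<String, List<Double>> collected = new HashMap<>();
        for (Map<String, Double> fold : foldMetrics) {
            for (Map.Entry<String, Double> e : fold.entrySet()) {
                collected.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        Map<String, double[]> stats = new HashMap<>();
        for (Map.Entry<String, List<Double>> e : collected.entrySet()) {
            stats.put(e.getKey(), new double[]{mean(e.getValue()), stdDev(e.getValue())});
        }
        return stats;
    }
}
```

The real method works over `MetricID<T>` keys and `DescriptiveStats` values, but the shape of the computation is the same: group by metric, then reduce.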
…ctedFeatureDataset.
…ed values out of a CategoricalInfo, and made BinningTransformer and its constructor public for easy binning.
…e quality of the relevant features in the test, relax the type returned from the binning operation to allow specialisation for categorical features later.
…'s constructor args so they are the same as the rest.
Core/src/main/java/org/tribuo/dataset/SelectedFeatureDataset.java
Classification/FeatureSelection/src/main/java/org/tribuo/classification/fs/FSMatrix.java
JackSullivan
left a comment
Looks good apart from a few minor documentation quibbles
Co-authored-by: Jack Sullivan <[email protected]>
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.
JackSullivan
left a comment
Looks good to me
It would be awesome if you made a tutorial about that. I am planning to create a wrapper model for feature selection using an evolutionary algorithm. Please tell me if that's okay?
Sounds good.
The feature selection machinery will drop the unselected features for you, that's what
Sounds very good, thank you.
Description
Adds support for feature selection algorithms via a `FeatureSelector` interface, a `SelectedFeatureSet` class to represent the output of a `FeatureSelector`, and a `SelectedFeatureDataset` which applies a feature set to another dataset. It also includes implementations of 4 information theoretic feature selection algorithms (MIM, CMIM, JMI, mRMR) for classification tasks. The `FeatureSelector` interface is sufficiently general that we can implement it for regression, anomaly detection, etc., though it does not cover operating on sequence data except via the `SequenceDataset.getFlatDataset` hook, which flattens the sequence data into a series of independent examples. In the future we'll look at adding feature selection algorithms for regression and other output types, and before the next release we'll add a tutorial on the feature selection algorithms added in this PR.

The information theoretic algorithms bin the feature values using equal-width bins to convert the real-valued features into categoricals. We could add extensions to use the categorical values directly when the feature map believes a feature is categorical, or to add other binning schemes. This also densifies the dataset while transposing it from Tribuo's default row view into a columnar view. We could use a sparse implementation of the columnar view which treats all features as binary; this has worked well in the past, but the code is rather complicated, so this initial version doesn't contain it.
Currently the feature selection algorithms are lightly tested, comparing the parallel versions against the serial versions. I've tested them offline against the implementations from FEAST, which is the reference implementation I developed in grad school. I'm considering adding a generated dataset with known outputs from FEAST to compare against in the unit tests.
Motivation
Feature selection algorithms are useful for reducing the complexity of downstream prediction tasks, and for analysing the relevant features in a dataset.
Paper reference
An overview of the feature selection techniques implemented here can be found in:
Brown G., Pocock A., Zhao M-J., Luján M.
"Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection", JMLR 2012.