Conversation

@Craigacp (Member) commented Aug 2, 2022

Description

Adds support for feature selection algorithms via a FeatureSelector interface, a SelectedFeatureSet class which represents the output of a FeatureSelector, and a SelectedFeatureDataset which applies a feature set to another dataset. It also includes implementations of four information theoretic feature selection algorithms (MIM, CMIM, JMI, mRMR) for classification tasks. The FeatureSelector interface is sufficiently general that we can implement it for regression, anomaly detection, etc., though it does not cover operating on sequence data except via the SequenceDataset.getFlatDataset hook, which flattens the sequence data into a series of independent examples. In the future we'll look at adding feature selection algorithms for regression and other output types, and before the next release we'll add a tutorial on the feature selection algorithms added in this PR.
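Of the four criteria, MIM is the simplest: it scores each feature independently by its mutual information with the label and keeps the top k. The following is a minimal self-contained sketch of that idea in plain Java over already-discretised columns; it is not the Tribuo API, and the class and method names are illustrative:

```java
import java.util.*;

public class MimSketch {
    // Mutual information I(X;Y) in bits between two discrete variables,
    // given as parallel arrays of (non-negative) category ids.
    static double mutualInformation(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> cx = new HashMap<>(), cy = new HashMap<>();
        Map<Long, Integer> cxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cx.merge(x[i], 1, Integer::sum);
            cy.merge(y[i], 1, Integer::sum);
            cxy.merge(((long) x[i] << 32) | (y[i] & 0xFFFFFFFFL), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : cxy.entrySet()) {
            long key = e.getKey();
            int xv = (int) (key >>> 32);
            int yv = (int) key;
            double pxy = e.getValue() / (double) n;
            double px = cx.get(xv) / (double) n;
            double py = cy.get(yv) / (double) n;
            mi += pxy * Math.log(pxy / (px * py));
        }
        return mi / Math.log(2);
    }

    // MIM: score each feature column by I(feature; label), return the top-k indices.
    static int[] selectMIM(int[][] features, int[] labels, int k) {
        Integer[] idx = new Integer[features.length];
        double[] scores = new double[features.length];
        for (int i = 0; i < features.length; i++) {
            idx[i] = i;
            scores[i] = mutualInformation(features[i], labels);
        }
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
        int[] out = new int[k];
        for (int i = 0; i < k; i++) out[i] = idx[i];
        return out;
    }
}
```

CMIM, JMI and mRMR extend this scheme by penalising redundancy or rewarding complementarity with already-selected features, which is why they need the joint distributions rather than just the per-feature marginals.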

The information theoretic algorithms bin the feature values using equal width bins to convert the real valued features into categoricals. We could add extensions to use the categorical values directly when the feature map believes a feature is categorical, or to add other binning systems. This also densifies the dataset while transposing it into a columnar view from Tribuo's default row view. We could use a sparse implementation of the columnar view which treats all features as binary; this has worked well in the past, but the code is rather complicated, so this initial version doesn't contain it.
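Equal width binning itself is straightforward: split [min, max] into numBins intervals of equal width and map each value to its interval index. A small sketch, again in plain Java rather than Tribuo's BinningTransformer:

```java
import java.util.Arrays;

public class EqualWidthBinning {
    // Assigns each value to one of numBins equal-width bins spanning [min, max].
    // Returns bin ids in [0, numBins - 1]; the maximum value falls into the last bin.
    static int[] bin(double[] values, int numBins) {
        double min = Arrays.stream(values).min().orElseThrow();
        double max = Arrays.stream(values).max().orElseThrow();
        double width = (max - min) / numBins;
        int[] bins = new int[values.length];
        if (width == 0.0) return bins; // constant feature collapses to a single bin
        for (int i = 0; i < values.length; i++) {
            int b = (int) ((values[i] - min) / width);
            bins[i] = Math.min(b, numBins - 1); // clamp the max value into the top bin
        }
        return bins;
    }
}
```

The clamp on the last line is the usual edge case: the maximum value computes to index numBins, so it has to be folded back into the final bin.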

Currently the feature selection algorithms are lightly tested, comparing the parallel versions against the serial versions. I've tested them offline against the implementations from FEAST, which is the reference implementation I developed in grad school. I'm considering adding a generated dataset with known outputs from FEAST to compare against in the unit tests.

Motivation

Feature selection algorithms are useful for reducing the complexity of downstream prediction tasks, and for analysing the relevant features in a dataset.

Paper reference

An overview of the feature selection techniques implemented here can be found in:

Brown G, Pocock A, Zhao M-J, Lujan M.
Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection, JMLR 2012.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Aug 2, 2022
/**
 * Summarizes the results of cross validation into descriptive statistics for each metric.
 * @param evaluations The evaluation and model pairs produced by cross validation.
 * @param <T> The output type.
 * @param <R> The evaluation type.
 * @return The descriptive statistics for each metric.
 */
public static <T extends Output<T>, R extends Evaluation<T>> Map<MetricID<T>, DescriptiveStats> summarizeCrossValidation(List<Pair<R, Model<T>>> evaluations) {
@Craigacp (Member Author) commented:
This change is separate from the rest, but I added it to make it easier to aggregate across cross validation folds, which is currently a little ugly to do, and it didn't feel worth a full PR.

@Craigacp Craigacp added the squash-commits Squash the commits when merging this PR label Aug 8, 2022
…ed values out of a CategoricalInfo, and made BinningTransformer and its constructor public for easy binning.
…e quality of the relevant features in the test, relax the type returned from the binning operation to allow specialisation for categorical features later.
…'s constructor args so they are the same as the rest.
@JackSullivan (Member) left a comment:

Looks good apart from a few minor documentation quibbles.

@oracle-contributor-agreement oracle-contributor-agreement bot removed the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Oct 3, 2022
@oracle-contributor-agreement bot commented:

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Oct 3, 2022
@JackSullivan (Member) left a comment:

Looks good to me

@Craigacp Craigacp merged commit a4597ae into oracle:main Oct 4, 2022
@Craigacp Craigacp deleted the fs branch October 4, 2022 00:41
@Mohammed-Ryiad-Eiadeh commented Oct 11, 2022 via email

@Craigacp (Member Author) commented:

The summarizeCrossValidation method will convert a list of evaluations from the CrossValidation class into a Map<MetricID<T>, DescriptiveStats> which gives you the mean, min, max & std deviation of each metric across the folds. You could then use that to drive an evolutionary procedure if you wanted. Alternatively, if you're asking how to implement the FeatureSelector interface for an evolutionary algorithm, then you can see how to use it in the feature selection tutorial added in this PR, but we don't have detailed docs on implementing the interface. Probably best to look at the existing implementations and open an issue if you get stuck.
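The per-metric aggregation described above amounts to computing summary statistics over the fold values. A self-contained sketch of that aggregation for a single metric, using only the JDK (the map keys here are illustrative; Tribuo's DescriptiveStats is a richer type):

```java
import java.util.*;

public class FoldSummary {
    // Computes mean, min, max and sample standard deviation over the values
    // a single metric took across the cross validation folds.
    static Map<String, Double> summarize(double[] foldValues) {
        int n = foldValues.length;
        double sum = 0, min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : foldValues) {
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double mean = sum / n;
        double sq = 0;
        for (double v : foldValues) sq += (v - mean) * (v - mean);
        double std = n > 1 ? Math.sqrt(sq / (n - 1)) : 0.0; // sample std deviation
        Map<String, Double> stats = new LinkedHashMap<>();
        stats.put("mean", mean);
        stats.put("min", min);
        stats.put("max", max);
        stats.put("stdDev", std);
        return stats;
    }
}
```

In the Tribuo method this computation runs once per MetricID found in the evaluations, producing the Map<MetricID<T>, DescriptiveStats> return value.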

@Mohammed-Ryiad-Eiadeh commented:

Sounds good.
Cross validation is a suitable evaluation function when working with wrapper feature selection, and implementing the FeatureSelector interface is a very good idea. But how about using Tribuo to read the data and evaluate the generated subsets, using Tablesaw to drop the noisy columns identified by the evolutionary algorithm, and grouping all of that into a new module in the Tribuo library? The module could contain multiple evolutionary algorithms such as Crow Search, Black Hole, etc., with implementations based on streams, parallel streams, Runnable, Callable, and ExecutorService.
Sorry to bother you, Adam.

@Craigacp (Member Author) commented:

The feature selection machinery will drop the unselected features for you; that's what SelectedFeatureDataset does. So you should only need to implement the evolutionary algorithms as FeatureSelector implementations which accept a Trainer and a number of folds (plus any other parameters for the evolutionary part).
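A wrapper selector of this kind boils down to a search over feature subsets driven by a scoring function (for example, cross-validated accuracy obtained from a Trainer). A minimal greedy forward-selection sketch with a pluggable scorer, in plain Java; the names are illustrative, and a real Tribuo FeatureSelector would return a SelectedFeatureSet rather than raw indices:

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

public class WrapperSelection {
    // Greedy forward selection: repeatedly add the feature whose inclusion most
    // improves the scorer, stopping when nothing helps or k features are chosen.
    // An evolutionary search would replace this loop but keep the same scorer hook.
    static Set<Integer> forwardSelect(int numFeatures, int k,
                                      ToDoubleFunction<Set<Integer>> scorer) {
        Set<Integer> selected = new LinkedHashSet<>();
        double best = scorer.applyAsDouble(selected);
        while (selected.size() < k) {
            int bestFeature = -1;
            double bestScore = best;
            for (int f = 0; f < numFeatures; f++) {
                if (selected.contains(f)) continue;
                Set<Integer> candidate = new LinkedHashSet<>(selected);
                candidate.add(f);
                double s = scorer.applyAsDouble(candidate);
                if (s > bestScore) {
                    bestScore = s;
                    bestFeature = f;
                }
            }
            if (bestFeature == -1) break; // no remaining feature improves the score
            selected.add(bestFeature);
            best = bestScore;
        }
        return selected;
    }
}
```

Swapping the greedy loop for a population-based search (Crow Search, Black Hole, and so on) only changes how candidate subsets are generated; the scorer hook and the final hand-off to SelectedFeatureDataset stay the same.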

@Mohammed-Ryiad-Eiadeh commented:

Sounds very good, thank you.
