Conversation

@Craigacp (Member) commented Aug 2, 2022

Description

Adds support for feature selection algorithms via a FeatureSelector interface, a SelectedFeatureSet class which represents the output of a FeatureSelector, and a SelectedFeatureDataset which applies a feature set to another dataset. It also includes implementations of four information theoretic feature selection algorithms (MIM, CMIM, JMI, mRMR) for classification tasks. The FeatureSelector interface is sufficiently general that we can implement it for regression, anomaly detection, etc., though it does not cover operating on sequence data except via the SequenceDataset.getFlatDataset hook, which flattens the sequence data into a series of independent examples. In the future we'll look at adding feature selection algorithms for regression and other output types, and before the next release we'll add a tutorial on the feature selection algorithms added in this PR.
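Of the four criteria, MIM is the simplest: it scores each feature independently by its mutual information with the label and keeps the top k. The following is a minimal self-contained sketch of that idea in plain Java over already-discretised columns; it is not the Tribuo API, and the class and method names are illustrative:

```java
import java.util.*;

public class MimSketch {
    // Mutual information I(X;Y) in bits between two discrete variables,
    // given as parallel arrays of (non-negative) category ids.
    static double mutualInformation(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> cx = new HashMap<>(), cy = new HashMap<>();
        Map<Long, Integer> cxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cx.merge(x[i], 1, Integer::sum);
            cy.merge(y[i], 1, Integer::sum);
            cxy.merge(((long) x[i] << 32) | (y[i] & 0xFFFFFFFFL), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : cxy.entrySet()) {
            long key = e.getKey();
            int xv = (int) (key >>> 32);
            int yv = (int) key;
            double pxy = e.getValue() / (double) n;
            double px = cx.get(xv) / (double) n;
            double py = cy.get(yv) / (double) n;
            mi += pxy * Math.log(pxy / (px * py));
        }
        return mi / Math.log(2);
    }

    // MIM: score each feature column by I(feature; label), return the top-k indices.
    static int[] selectMIM(int[][] features, int[] labels, int k) {
        Integer[] idx = new Integer[features.length];
        double[] scores = new double[features.length];
        for (int i = 0; i < features.length; i++) {
            idx[i] = i;
            scores[i] = mutualInformation(features[i], labels);
        }
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
        int[] out = new int[k];
        for (int i = 0; i < k; i++) out[i] = idx[i];
        return out;
    }
}
```

CMIM, JMI and mRMR extend this scheme by penalising redundancy or rewarding complementarity with already-selected features, which is why they need the joint distributions rather than just the per-feature marginals.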

The information theoretic algorithms bin the feature values using equal width bins to convert the real valued features into categoricals. We could add extensions to use the categorical values directly when the feature map believes a feature is categorical, or to add other binning systems. This also densifies the dataset while transposing it into a columnar view from Tribuo's default row view. We could use a sparse implementation of the columnar view which treats all features as binary; this has worked well in the past, but the code is rather complicated, so this initial version doesn't contain it.
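Equal width binning itself is straightforward: split [min, max] into numBins intervals of equal width and map each value to its interval index. A small sketch, again in plain Java rather than Tribuo's BinningTransformer:

```java
import java.util.Arrays;

public class EqualWidthBinning {
    // Assigns each value to one of numBins equal-width bins spanning [min, max].
    // Returns bin ids in [0, numBins - 1]; the maximum value falls into the last bin.
    static int[] bin(double[] values, int numBins) {
        double min = Arrays.stream(values).min().orElseThrow();
        double max = Arrays.stream(values).max().orElseThrow();
        double width = (max - min) / numBins;
        int[] bins = new int[values.length];
        if (width == 0.0) return bins; // constant feature collapses to a single bin
        for (int i = 0; i < values.length; i++) {
            int b = (int) ((values[i] - min) / width);
            bins[i] = Math.min(b, numBins - 1); // clamp the max value into the top bin
        }
        return bins;
    }
}
```

The clamp on the last line is the usual edge case: the maximum value computes to index numBins, so it has to be folded back into the final bin.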

Currently the feature selection algorithms are lightly tested, comparing the parallel versions against the serial versions. I've tested them offline against the implementations from FEAST, which is the reference implementation I developed in grad school. I'm considering adding a generated dataset with known outputs from FEAST to compare against in the unit tests.

Motivation

Feature selection algorithms are useful for reducing the complexity of downstream prediction tasks, and for analysing the relevant features in a dataset.

Paper reference

An overview of the feature selection techniques implemented here can be found in:

Brown G, Pocock A, Zhao M-J, Lujan M.
Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection, JMLR 2012.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Aug 2, 2022
/**
 * Summarizes the results of cross validation into descriptive statistics for each metric.
 * @param evaluations The evaluation and model pairs produced by cross validation.
 * @param <T> The output type.
 * @param <R> The evaluation type.
 * @return The descriptive statistics for each metric.
 */
public static <T extends Output<T>, R extends Evaluation<T>> Map<MetricID<T>, DescriptiveStats> summarizeCrossValidation(List<Pair<R, Model<T>>> evaluations) {
@Craigacp (Member Author) commented:
This change is separate from the rest, but I added it to make it easier to aggregate across cross validation folds, which is currently a little ugly to do, and it didn't feel worth a full PR.

@Craigacp Craigacp added the squash-commits Squash the commits when merging this PR label Aug 8, 2022
…ed values out of a CategoricalInfo, and made BinningTransformer and its constructor public for easy binning.
…e quality of the relevant features in the test, relax the type returned from the binning operation to allow specialisation for categorical features later.
…'s constructor args so they are the same as the rest.
@JackSullivan (Member) left a comment:

Looks good apart from a few minor documentation quibbles.

@oracle-contributor-agreement oracle-contributor-agreement bot removed the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Oct 3, 2022
@oracle-contributor-agreement bot commented:

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Oct 3, 2022
@JackSullivan (Member) left a comment:

Looks good to me

@Craigacp Craigacp merged commit a4597ae into oracle:main Oct 4, 2022
@Craigacp Craigacp deleted the fs branch October 4, 2022 00:41
@Mohammed-Ryiad-Eiadeh commented Oct 11, 2022 via email

@Craigacp (Member Author) commented:

The summarizeCrossValidation method will convert a list of evaluations from the CrossValidation class into a Map<MetricID<T>, DescriptiveStats> which gives you the mean, min, max & std deviation of each metric across the folds. You could then use that to drive an evolutionary procedure if you wanted. Alternatively, if you're asking how to implement the FeatureSelector interface for an evolutionary algorithm, then you can see how to use it in the feature selection tutorial added in this PR, but we don't have detailed docs on implementing the interface. Probably best to look at the existing implementations and open an issue if you get stuck.
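The per-metric aggregation described above amounts to computing summary statistics over the fold values. A self-contained sketch of that aggregation for a single metric, using only the JDK (the map keys here are illustrative; Tribuo's DescriptiveStats is a richer type):

```java
import java.util.*;

public class FoldSummary {
    // Computes mean, min, max and sample standard deviation over the values
    // a single metric took across the cross validation folds.
    static Map<String, Double> summarize(double[] foldValues) {
        int n = foldValues.length;
        double sum = 0, min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : foldValues) {
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double mean = sum / n;
        double sq = 0;
        for (double v : foldValues) sq += (v - mean) * (v - mean);
        double std = n > 1 ? Math.sqrt(sq / (n - 1)) : 0.0; // sample std deviation
        Map<String, Double> stats = new LinkedHashMap<>();
        stats.put("mean", mean);
        stats.put("min", min);
        stats.put("max", max);
        stats.put("stdDev", std);
        return stats;
    }
}
```

In the Tribuo method this computation runs once per MetricID found in the evaluations, producing the Map<MetricID<T>, DescriptiveStats> return value.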

@Mohammed-Ryiad-Eiadeh commented:

Sounds good.
Cross validation is a suitable evaluation function when working with wrapper feature selection, and implementing the FeatureSelector interface is a very good idea. But how about using Tribuo to read the data and evaluate the generated subsets, using Tablesaw to drop the noisy columns identified by the evolutionary algorithm, and grouping all of that into a new module in the Tribuo library? The module could contain multiple evolutionary algorithms such as Crow Search, Black Hole, etc., with implementations based on streams, parallel streams, Runnable, Callable, and ExecutorService.
Sorry to bother you, Adam.

@Craigacp (Member Author) commented:

The feature selection machinery will drop the unselected features for you; that's what SelectedFeatureDataset does. So you should only need to implement the evolutionary algorithms as FeatureSelector implementations which accept a Trainer and a number of folds (plus any other parameters for the evolutionary part).
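A wrapper selector of this kind boils down to a search over feature subsets driven by a scoring function (for example, cross-validated accuracy obtained from a Trainer). A minimal greedy forward-selection sketch with a pluggable scorer, in plain Java; the names are illustrative, and a real Tribuo FeatureSelector would return a SelectedFeatureSet rather than raw indices:

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

public class WrapperSelection {
    // Greedy forward selection: repeatedly add the feature whose inclusion most
    // improves the scorer, stopping when nothing helps or k features are chosen.
    // An evolutionary search would replace this loop but keep the same scorer hook.
    static Set<Integer> forwardSelect(int numFeatures, int k,
                                      ToDoubleFunction<Set<Integer>> scorer) {
        Set<Integer> selected = new LinkedHashSet<>();
        double best = scorer.applyAsDouble(selected);
        while (selected.size() < k) {
            int bestFeature = -1;
            double bestScore = best;
            for (int f = 0; f < numFeatures; f++) {
                if (selected.contains(f)) continue;
                Set<Integer> candidate = new LinkedHashSet<>(selected);
                candidate.add(f);
                double s = scorer.applyAsDouble(candidate);
                if (s > bestScore) {
                    bestScore = s;
                    bestFeature = f;
                }
            }
            if (bestFeature == -1) break; // no remaining feature improves the score
            selected.add(bestFeature);
            best = bestScore;
        }
        return selected;
    }
}
```

Swapping the greedy loop for a population-based search (Crow Search, Black Hole, and so on) only changes how candidate subsets are generated; the scorer hook and the final hand-off to SelectedFeatureDataset stay the same.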

@Mohammed-Ryiad-Eiadeh commented:

Sounds very good, thank you.
