Fixes for HashSet iteration order problems #225

Craigacp · 2022-03-22T21:55:42Z

Description

When looking at the fix for #220 we found several other issues where Tribuo's behaviour depended on the iteration order of HashSet. These fell into a few categories:

- Issues where a test failed because the sorting functions are stable but the presentation order is slightly non-deterministic (MNB, XGBoost, LinearSGD). These were left unfixed, as fixing them would require changes in dependencies or intrusive changes to how predictions are generated.
- Issues where the output of an evaluation toString depended on the label indices. These were fixed by adding/exposing methods to fix the label order, then updating the tests to use those methods.
- WeightedEnsembleModel assumed that the output indices of the ensemble members were the same, which in practice only affects ONNX export of ensemble models. This is fixed by tightening up the validation during ensemble creation, and adding ONNX gather ops to ensure that the output indices are consistent across models.
- CategoricalInfo.uniformSample and CategoricalInfo.frequencyBasedSample used methods which iterate the key or entry set of a HashMap to construct the sampling tables. While the outputs were always a valid sample from the distribution, the sample depended on the iteration order of the HashMap and so was not necessarily portable between machines or runs of the JVM. This has been fixed by sorting the entries using double's natural ordering before building the tables. This produces different samples to Tribuo 4.2.0 and earlier, but it is now self-consistent and will remain so.
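As a rough illustration of the CategoricalInfo point, the sketch below builds a cumulative sampling table from a map of value counts after sorting the keys, so the table layout no longer depends on map iteration order. The class `SortedSamplingTable` and its fields are hypothetical names chosen for this example and are not Tribuo's implementation; it also assumes every count is positive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration only; this is not Tribuo's CategoricalInfo code.
final class SortedSamplingTable {
    final double[] values; // distinct values in ascending order
    final double[] cdf;    // cumulative probability aligned with values

    // Assumes counts is non-empty and every count is positive.
    SortedSamplingTable(Map<Double, Long> counts) {
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("Need at least one observed value.");
        }
        // Sort the keys so the table layout is independent of HashMap iteration order.
        List<Double> keys = new ArrayList<>(counts.keySet());
        Collections.sort(keys);
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        values = new double[keys.size()];
        cdf = new double[keys.size()];
        long running = 0;
        for (int i = 0; i < keys.size(); i++) {
            Double key = keys.get(i);
            running += counts.get(key);
            values[i] = key;
            cdf[i] = running / (double) total;
        }
    }

    // Inverse-CDF sampling: return the first value whose cumulative probability exceeds u.
    double sample(Random rng) {
        double u = rng.nextDouble();
        for (int i = 0; i < cdf.length; i++) {
            if (u < cdf[i]) {
                return values[i];
            }
        }
        // Not expected to be reached: the final cdf entry is total/total = 1.0 and u < 1.0.
        return values[values.length - 1];
    }

    public static void main(String[] args) {
        Map<Double, Long> counts = new HashMap<>();
        counts.put(3.0, 1L);
        counts.put(1.0, 2L);
        counts.put(2.0, 1L);
        SortedSamplingTable table = new SortedSamplingTable(counts);
        System.out.println(table.sample(new Random(42L)));
    }
}
```

With the keys sorted, two maps holding the same counts yield identical tables regardless of insertion order or map implementation, so the same seed gives the same sample.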

Motivation

Non-determinism is bad for a reproducible ML library.

Craigacp · 2022-03-22T21:57:06Z

FYI @kaiyaok2 these are the other issues I found when running down the iteration order determinism. Thanks for bringing that to our attention; in particular, the ONNX and CategoricalInfo fixes address definite bugs in Tribuo's correctness and reproducibility.

JackSullivan · 2022-03-25T13:28:19Z

...sification/Core/src/main/java/org/tribuo/classification/evaluation/LabelConfusionMatrix.java

-        this.labelOrder = labelOrder;
+    @Override
+    public void setLabelOrder(List<Label> newLabelOrder) {
+        if (newLabelOrder == null || newLabelOrder.isEmpty()) {


Do we know what size the labelset should be at this point? Can we easily check for that too while we're at it?

This actually allows you to reduce the set of labels that you print (and was designed that way many years ago), if you only care about showing a subset of them. However that's not properly documented, nor do I have a test for it, so I should add one.
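To make the subsetting behaviour concrete, here is a minimal sketch of a label-order setter that rejects null or empty lists but deliberately does not check the size against the full label set. `LabelOrderSketch` and `render` are hypothetical names, plain strings stand in for Tribuo's `Label` type, and throwing `IllegalArgumentException` on invalid input is an assumption, since the diff above does not show the body of the check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Tribuo's LabelConfusionMatrix.
final class LabelOrderSketch {
    private final Map<String, Integer> counts;
    private List<String> labelOrder;

    LabelOrderSketch(Map<String, Integer> observedCounts) {
        this.counts = new HashMap<>(observedCounts);
        // Default to a sorted order so output does not depend on map iteration order.
        List<String> sorted = new ArrayList<>(counts.keySet());
        Collections.sort(sorted);
        this.labelOrder = Collections.unmodifiableList(sorted);
    }

    void setLabelOrder(List<String> newLabelOrder) {
        // Assumption: invalid input is rejected with an exception.
        if (newLabelOrder == null || newLabelOrder.isEmpty()) {
            throw new IllegalArgumentException("Label order must be non-null and non-empty.");
        }
        // No size check on purpose: a shorter list restricts which labels are printed.
        this.labelOrder = Collections.unmodifiableList(new ArrayList<>(newLabelOrder));
    }

    // Prints one "label=count" line per label, in the configured order.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (String label : labelOrder) {
            sb.append(label).append('=').append(counts.getOrDefault(label, 0)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("b", 2);
        counts.put("a", 1);
        counts.put("c", 3);
        LabelOrderSketch matrix = new LabelOrderSketch(counts);
        matrix.setLabelOrder(Arrays.asList("c", "a")); // show only a subset, in this order
        System.out.print(matrix.render());
    }
}
```

A size check would rule out the subsetting use case, which is why the sketch validates only for null and empty input.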

JackSullivan · 2022-03-25T13:38:39Z

Data/src/main/java/org/tribuo/data/csv/CSVSaver.java

 import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Path;
+import java.util.ArrayList;


Presumably this is an unused import

Yep, there were a few others in there too, I've fixed it.

MultiLabel/Core/src/main/java/org/tribuo/multilabel/ImmutableMultiLabelInfo.java

JackSullivan · 2022-03-25T13:41:28Z

MultiLabel/Core/src/main/java/org/tribuo/multilabel/evaluation/MultiLabelConfusionMatrix.java

+     */
+    @Override
+    public void setLabelOrder(List<MultiLabel> labelOrder) {
+        if (labelOrder == null || labelOrder.isEmpty()) {


Same question as before - can we easily check size here as well?

Same answer, it's intentional to allow subsetting, but not documented or tested so I'll do that.

Regression/Core/src/main/java/org/tribuo/regression/ImmutableRegressionInfo.java

JackSullivan

A few very minor points, but otherwise looks good

JackSullivan

Looks good to me

* Fixing iteration order issues in org.tribuo.classification.evaluation, org.tribuo.multilabel.evaluation, CSVSaver and WeightedEnsembleModel.
* Fix flaky tests in MultiLabelConfusionMatrixTest.
* Adding ImmutableOutputInfo.domainAndIDEquals, FeatureMap.domainEquals, and refactoring WeightedEnsembleModel to tighten the creation check and improve the ONNX export.
* Fix uniform sampling method in CategoricalInfo so it uses a consistent iteration order, and thus produces reproducible samples. This is a behaviour change from 4.2.0 where the order was undefined.
* Adding default implementations of the new methods in ConfusionMatrix and ImmutableOutputInfo.
* Updating copyright years.
* Adding more docs for setLabelOrder.
* Adding ConfusionMatrix.observed to allow the evaluation tostrings to remove labels that they haven't seen, avoiding a crash.
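The ensemble fix in this commit message relies on two ideas: checking whether two models agree on both the output domain and the output ids, and reordering a member's outputs with a gather when they do not. The sketch below illustrates both under stated assumptions. All names (`OutputIndexAlignment`, `sameDomainAndIds`, `gatherIndices`, `gather`) are hypothetical and are not Tribuo's or ONNX's API, labels are plain strings, and ids are assumed to be dense in the range 0 to n-1.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Tribuo's WeightedEnsembleModel or its ONNX export.
final class OutputIndexAlignment {

    // True when both models map the same labels to the same output ids.
    static boolean sameDomainAndIds(Map<String, Integer> a, Map<String, Integer> b) {
        return a.equals(b);
    }

    // gather[i] is the position in the member's output holding the label with reference id i.
    // Assumes ids are dense in [0, n) in both maps.
    static int[] gatherIndices(Map<String, Integer> reference, Map<String, Integer> member) {
        if (!reference.keySet().equals(member.keySet())) {
            throw new IllegalArgumentException("Output domains differ.");
        }
        int[] gather = new int[reference.size()];
        // Writes are indexed by id, so the map's iteration order does not affect the result.
        for (Map.Entry<String, Integer> e : reference.entrySet()) {
            gather[e.getValue()] = member.get(e.getKey());
        }
        return gather;
    }

    // Reorders scores so that position i holds the score for reference id i.
    static double[] gather(double[] scores, int[] indices) {
        double[] out = new double[indices.length];
        for (int i = 0; i < indices.length; i++) {
            out[i] = scores[indices[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> reference = new HashMap<>();
        reference.put("a", 0);
        reference.put("b", 1);
        reference.put("c", 2);
        Map<String, Integer> member = new HashMap<>();
        member.put("c", 0);
        member.put("a", 1);
        member.put("b", 2);
        double[] memberScores = {0.7, 0.1, 0.2}; // in the member's own id order: c, a, b
        double[] aligned = gather(memberScores, gatherIndices(reference, member));
        System.out.println(java.util.Arrays.toString(aligned));
    }
}
```

When the id maps already match, the gather is the identity and can be skipped; when only the domains match, the gather aligns the member's scores before they are combined.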

Craigacp added 6 commits March 22, 2022 12:06

Fixing iteration order issues in org.tribuo.classification.evaluation, org.tribuo.multilabel.evaluation, CSVSaver and WeightedEnsembleModel.

a34dce1

Fix flaky tests in MultiLabelConfusionMatrixTest.

9891ef0

Adding ImmutableOutputInfo.domainAndIDEquals, FeatureMap.domainEquals, and refactoring WeightedEnsembleModel to tighten the creation check and improve the ONNX export.

ba7252a

Fix uniform sampling method in CategoricalInfo so it uses a consistent iteration order, and thus produces reproducible samples. This is a behaviour change from 4.2.0 where the order was undefined.

fb3d2f8

Adding default implementations of the new methods in ConfusionMatrix and ImmutableOutputInfo.

5567bff

Updating copyright years.

2ccc0c8

Craigacp added the "Oracle employee" (This PR is from an Oracle employee) and "squash-commits" (Squash the commits when merging this PR) labels Mar 22, 2022

Craigacp mentioned this pull request Mar 24, 2022

shutdown thread pool when kmeans train done #224

Merged

JackSullivan reviewed Mar 25, 2022

MultiLabel/Core/src/main/java/org/tribuo/multilabel/ImmutableMultiLabelInfo.java

JackSullivan reviewed Mar 25, 2022

Regression/Core/src/main/java/org/tribuo/regression/ImmutableRegressionInfo.java

JackSullivan reviewed Mar 25, 2022

Craigacp added 2 commits March 25, 2022 11:18

Adding more docs for setLabelOrder.

487644a

Adding ConfusionMatrix.observed to allow the evaluation tostrings to remove labels that they haven't seen, avoiding a crash.

e9c52e3

JackSullivan approved these changes Mar 25, 2022

Craigacp merged commit 14051d5 into oracle:main Mar 25, 2022

Craigacp deleted the iteration-order branch March 25, 2022 15:45
