-
Notifications
You must be signed in to change notification settings - Fork 194
Fixes for HashSet iteration order problems #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…, org.tribuo.multilabel.evaluation, CSVSaver and WeightedEnsembleModel.
…, and refactoring WeightedEnsembleModel to tighten the creation check and improve the ONNX export.
…t iteration order, and thus produces reproducible samples. This is a behaviour change from 4.2.0 where the order was undefined.
…and ImmutableOutputInfo.
|
FYI @kaiyaok2 these are the other issues I found when running down the iteration order determinism. Thanks for bringing that to our attention, in particular the ONNX and |
| this.labelOrder = labelOrder; | ||
| @Override | ||
| public void setLabelOrder(List<Label> newLabelOrder) { | ||
| if (newLabelOrder == null || newLabelOrder.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know what size the labelset should be at this point? Can we easily check for that too while we're at it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually allows you to reduce the set of labels that you print (and was designed that way many years ago), if you only care about showing a subset of them. However that's not properly documented, nor do I have a test for it, so I should add one.
| import java.nio.charset.StandardCharsets; | ||
| import java.nio.file.Files; | ||
| import java.nio.file.Path; | ||
| import java.util.ArrayList; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably this is an unused import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, there were a few others in there too, I've fixed it.
MultiLabel/Core/src/main/java/org/tribuo/multilabel/ImmutableMultiLabelInfo.java
Show resolved
Hide resolved
| */ | ||
| @Override | ||
| public void setLabelOrder(List<MultiLabel> labelOrder) { | ||
| if (labelOrder == null || labelOrder.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as before - can we easily check size here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer, it's intentional to allow subsetting, but not documented or tested so I'll do that.
Regression/Core/src/main/java/org/tribuo/regression/ImmutableRegressionInfo.java
Show resolved
Hide resolved
JackSullivan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few very minor points, but otherwise looks good
…remove labels that they haven't seen, avoiding a crash.
JackSullivan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
* Fixing iteration order issues in org.tribuo.classification.evaluation, org.tribuo.multilabel.evaluation, CSVSaver and WeightedEnsembleModel. * Fix flaky tests in MultiLabelConfusionMatrixTest. * Adding ImmutableOutputInfo.domainAndIDEquals, FeatureMap.domainEquals, and refactoring WeightedEnsembleModel to tighten the creation check and improve the ONNX export. * Fix uniform sampling method in CategoricalInfo so it uses a consistent iteration order, and thus produces reproducible samples. This is a behaviour change from 4.2.0 where the order was undefined. * Adding default implementations of the new methods in ConfusionMatrix and ImmutableOutputInfo. * Updating copyright years. * Adding more docs for setLabelOrder. * Adding ConfusionMatrix.observed to allow the evaluation tostrings to remove labels that they haven't seen, avoiding a crash.
Description
When looking at the fix for #220 we found several other issues where Tribuo's behaviour depended on the iteration order of
HashSet. These fell into a few categories:toStringdepended on the label indices - these were fixed by adding/exposing methods to fix the label order, then updating the tests to use those methods.WeightedEnsembleModelassumed that the output indices of the ensemble members were the same, which in practice only affects ONNX export of ensemble models - this is fixed by tightening up the validation during ensemble creation, and adding ONNX gather ops to ensure that the output indices are consistent across models.CategoricalInfo.uniformSampleandCategoricalInfo.frequencyBasedSampleused methods which iterate the key or entry set of aHashMapto construct the sampling tables. While the outputs were always a valid sample from the distribution, this sample depended on the iteration order of theHashMapand so was not necessarily portable between machines or runs of the JVM. This has been fixed by sorting the entries using double's natural ordering before building the tables. This produces different samples to Tribuo 4.2.0 and earlier, but it is now self-consistent and will remain so.Motivation
Non-determinism is bad for a reproducible ML library.