Implement deduplication of ImmutableFeatureMap and ImmutableOutputInfo on deserialization #426
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Adds a deserialization cache which canonicalises
ImmutableFeatureMapandImmutableOutputInfoinstances during deserialization. This modifies the main deserialization method so it has an extra argument, all implementations in Tribuo are updated, and it falls back to the old one if it's not found (to maintain compatibility with any libraries built on Tribuo).It's plumbed through almost everywhere, but some of the interfaces which have a static deserialization helper don't participate in the cache. This won't matter too much as they don't actually contain objects which could be deduplicated, but it does have a slight overhead as empty caches are created and GC'd repeatedly. We'll fix this by plumbing it through absolutely everywhere at a later date.
DatasetDataCarrierandModelDataCarrierare converted into records as those classes had to be modified to allow the deduplication. The actual canonicalisation call happens inFeatureMap.deserializeandOutputInfo.deserialize, and all calls to deserialize those types have been routed through there.The cache may be extended in the future to allow for deduplicating other objects.
Motivation
Moving away from Java serialization to protobuf means the serialized object graph isn't deduped, so feature maps and output infos get duplicated when serializing. This is particularly bad for ensembles which independently serialize the feature domain inside each ensemble member. Adding deduplication reduces memory pressure when working with ensembles, and allows future optimizations similar to #417 once the SGDVectors contain a reference to the feature map that created them. Fixing the serialization format to remove the duplication would require revising the serialization format to explicitly allow for backlinks which is one of the things that makes Java serialization tricky to implement, so the model files on disk will continue to contain duplicate entries.