@Craigacp (Member) commented Dec 24, 2025

Description

Adds a deserialization cache which canonicalises ImmutableFeatureMap and ImmutableOutputInfo instances during deserialization. The main deserialization method now takes an extra argument, and all implementations in Tribuo are updated to match. If the new method isn't found on a class, deserialization falls back to the old one, which maintains compatibility with any libraries built on Tribuo.
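To illustrate the fallback, here is a minimal sketch of looking up a deserialization method with the extra argument and falling back to the older signature when it is missing. Every name here (`SketchCache`, `NewStyle`, `OldStyle`, `FallbackDispatch`) and the `String` payload are hypothetical stand-ins for illustration, not Tribuo's actual classes or method signatures.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for the deserialization cache; not Tribuo's class.
final class SketchCache { }

// A class that has been updated to accept the cache argument.
final class NewStyle {
    static String deserialize(String payload, SketchCache cache) {
        return "new:" + payload;
    }
}

// A class (e.g. from a third-party library) that only has the old signature.
final class OldStyle {
    static String deserialize(String payload) {
        return "old:" + payload;
    }
}

final class FallbackDispatch {
    // Prefers the cache-aware method, falling back to the old one if absent.
    static Object deserialize(Class<?> target, String payload, SketchCache cache) {
        try {
            Method method;
            Object[] args;
            try {
                // Look for the new signature with the extra argument first.
                method = target.getDeclaredMethod("deserialize", String.class, SketchCache.class);
                args = new Object[] {payload, cache};
            } catch (NoSuchMethodException e) {
                // Not found, so use the old signature without the cache.
                method = target.getDeclaredMethod("deserialize", String.class);
                args = new Object[] {payload};
            }
            method.setAccessible(true);
            // Static method, so the receiver is null.
            return method.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Failed to deserialize " + target.getName(), e);
        }
    }
}
```

Only the method lookup sits inside the inner `try`, so a `NoSuchMethodException` raised by the lookup triggers the fallback, while failures inside the target method are reported rather than silently retried with the old signature.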

The cache is plumbed through almost everywhere, but some of the interfaces which have a static deserialization helper don't participate in it. This won't matter too much, as they don't actually contain objects which could be deduplicated, but it does add a slight overhead because empty caches are created and GC'd repeatedly. We'll fix this at a later date by plumbing the cache through absolutely everywhere.

DatasetDataCarrier and ModelDataCarrier are converted into records as those classes had to be modified to allow the deduplication. The actual canonicalisation call happens in FeatureMap.deserialize and OutputInfo.deserialize, and all calls to deserialize those types have been routed through there.
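As a rough sketch of what canonicalisation at a single deserialization entry point can look like: the first instance seen for a given value is kept, and later equal instances are replaced by it. The names below (`FeatureDomainSketch`, `CanonicalisingCache`, `FeatureDomainReader`) and the comma-separated payload are assumptions made for illustration. They are not the classes added in this PR, and the sketch assumes the cached type has value-based `equals` and `hashCode`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical immutable feature domain; the record supplies value-based equals/hashCode.
record FeatureDomainSketch(List<String> featureNames) {
    FeatureDomainSketch {
        // Defensive copy so the record is genuinely immutable.
        featureNames = List.copyOf(featureNames);
    }
}

// Hypothetical cache which returns the first instance seen for each distinct value.
final class CanonicalisingCache {
    private final Map<Object, Object> canonical = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T canonicalise(T candidate) {
        Object existing = canonical.putIfAbsent(candidate, candidate);
        // The cast is safe as long as equals() only matches objects of the same type,
        // which holds for records.
        return existing == null ? candidate : (T) existing;
    }

    int size() {
        return canonical.size();
    }
}

// Hypothetical single entry point, so every deserialization of the type hits the cache.
final class FeatureDomainReader {
    static FeatureDomainSketch deserialize(String commaSeparated, CanonicalisingCache cache) {
        FeatureDomainSketch fresh = new FeatureDomainSketch(List.of(commaSeparated.split(",")));
        return cache.canonicalise(fresh);
    }
}
```

Routing every call through one method is what makes the deduplication reliable in this sketch: an instance built elsewhere would bypass the cache and remain a duplicate.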

The cache may be extended in the future to allow for deduplicating other objects.

Motivation

Moving away from Java serialization to protobuf means the serialized object graph isn't deduplicated, so feature maps and output infos are duplicated when serializing. This is particularly bad for ensembles, which independently serialize the feature domain inside each ensemble member. Adding deduplication during deserialization reduces memory pressure when working with ensembles, and allows future optimizations similar to #417 once the SGDVectors contain a reference to the feature map that created them. Removing the duplication from the files themselves would require revising the serialization format to explicitly allow for backlinks, which is one of the things that makes Java serialization tricky to implement, so the model files on disk will continue to contain duplicate entries.

@Craigacp added the "Oracle employee" label (This PR is from an Oracle employee) on Dec 24, 2025
@oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Dec 24, 2025