Implement Custom Types -- AttributeType #1289
Open
dimitri-yatsenko wants to merge 55 commits into pre/v2.0 from claude/upgrade-adapted-type-1W3ap
+6,193
−1,894
This commit introduces a modern, extensible custom type system for DataJoint:

**New Features:**
- AttributeType base class with encode()/decode() methods
- Global type registry with @register_type decorator
- Entry point discovery for third-party type packages (datajoint.types)
- Type chaining: dtype can reference another custom type
- Automatic validation via validate() method before encoding
- resolve_dtype() for resolving chained types

**API Changes:**
- New: dj.AttributeType, dj.register_type, dj.list_types
- AttributeAdapter is now deprecated (backward-compatible wrapper)
- Feature flag DJ_SUPPORT_ADAPTED_TYPES is no longer required

**Entry Point Specification:**
Third-party packages can declare types in pyproject.toml:

    [project.entry-points."datajoint.types"]
    zarr_array = "dj_zarr:ZarrArrayType"

**Migration Path:**
Old AttributeAdapter subclasses continue to work but emit DeprecationWarning. Migrate to AttributeType with encode/decode.
- Rewrite customtype.md with comprehensive documentation:
  - Overview of encode/decode pattern
  - Required components (type_name, dtype, encode, decode)
  - Type registration with @dj.register_type decorator
  - Validation with validate() method
  - Storage types (dtype options)
  - Type chaining for composable types
  - Key parameter for context-aware encoding
  - Entry point packages for distribution
  - Complete neuroscience example
  - Migration guide from AttributeAdapter
  - Best practices
- Update attributes.md to reference custom types
Introduces `<djblob>` as an explicit AttributeType for DataJoint's native blob serialization, allowing users to be explicit about serialization behavior in table definitions.

Key changes:
- Add DJBlobType class with `serializes=True` flag to indicate it handles its own serialization (avoiding double pack/unpack)
- Update table.py and fetch.py to respect the `serializes` flag, skipping blob.pack/unpack when adapter handles serialization
- Add `dj.migrate` module with utilities for migrating existing schemas to use explicit `<djblob>` type declarations
- Add tests for DJBlobType functionality
- Document `<djblob>` type and migration procedure

The migration is metadata-only - blob data format is unchanged. Existing `longblob` columns continue to work with implicit serialization for backward compatibility.
Simplified design:
- Plain longblob columns store/return raw bytes (no serialization)
- <djblob> type handles serialization via encode/decode
- Legacy AttributeAdapter handles blob pack/unpack internally for backward compatibility

This eliminates the need for the serializes flag by making blob serialization the responsibility of the adapter/type, not the framework. Migration to <djblob> is now required for existing schemas that rely on implicit serialization.
…adapted-type-1W3ap
…p' into claude/upgrade-adapted-type-1W3ap
…t' into claude/upgrade-adapted-type-1W3ap
…t' into claude/upgrade-adapted-type-1W3ap
Base automatically changed from claude/add-file-column-type-LtXQt to pre/v2.0 December 24, 2025 20:09
Design document for reimplementing blob, attach, filepath, and object types as a coherent AttributeType system. Separates storage location (@store) from encoding behavior.
Layer 1: Native database types (FLOAT, TINYINT, etc.) - backend-specific, discouraged
Layer 2: Core DataJoint types (float32, uint8, bool, json) - standardized, scientist-friendly
Layer 3: AttributeTypes (object, content, <djblob>, etc.) - encode/decode, composable

Core types provide:
- Consistent interface across MySQL and PostgreSQL
- Scientist-friendly names (float32 vs FLOAT, uint8 vs TINYINT UNSIGNED)
- Automatic backend translation

Co-authored-by: dimitri-yatsenko <[email protected]>
All AttributeTypes (Layer 3) now use angle bracket syntax in table definitions:
- Core types (Layer 2): int32, float64, varchar(255) - no brackets
- AttributeTypes (Layer 3): <object>, <djblob>, <filepath@main> - angle brackets

This clear visual distinction helps users immediately identify:
- Core types: direct database mapping
- AttributeTypes: encode/decode transformation

Co-authored-by: dimitri-yatsenko <[email protected]>
Seven-phase implementation plan covering:
- Phase 1: Core type system foundation (type mappings, store parameters)
- Phase 2: Content-addressed storage (<content> type, ContentRegistry)
- Phase 3: User-defined AttributeTypes (<xblob>, <attach>, <xattach>, <filepath>)
- Phase 4: Insert and fetch integration (type composition)
- Phase 5: Garbage collection (project-wide GC scanner)
- Phase 6: Migration utilities (legacy external stores)
- Phase 7: Documentation and testing

Estimated effort: 24-32 days across all phases

Co-authored-by: dimitri-yatsenko <[email protected]>
Phase 1.1 - Core type mappings already complete in declare.py

Phase 1.2 - Enhanced AttributeType with store parameter support:
- Added parse_type_spec() to parse "<type@store>" into (type_name, store_name)
- Updated get_type() to handle parameterized types
- Updated is_type_registered() to ignore store parameters
- Updated resolve_dtype() to propagate store through type chains
  - Returns (final_dtype, type_chain, store_name) tuple
  - Store from outer type overrides inner type's store

Phase 1.3 - Updated heading and declaration parsing:
- Updated get_adapter() to return (adapter, store_name) tuple
- Updated substitute_special_type() to capture store from ADAPTED types
- Store parameter is now properly passed through type resolution

Co-authored-by: dimitri-yatsenko <[email protected]>
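As a rough illustration of the store-parameter handling in Phase 1.2, the sketch below parses a `<type@store>` spec and walks a type chain until it reaches a non-bracketed dtype. The function names come from the commit message, but the bodies are assumptions written for this example, and the `DTYPES` table is a hypothetical stand-in for the registry (its two entries follow the dtypes stated elsewhere in this PR for `<xblob>` and `<content>`).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the registry: type name -> declared dtype.
DTYPES: Dict[str, str] = {
    "xblob": "<content>",  # XBlobType composes with ContentType
    "content": "json",     # ContentType stores JSON metadata
}


def parse_type_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split "<type@store>" or "<type>" into (type_name, store_name)."""
    inner = spec.strip()
    if inner.startswith("<") and inner.endswith(">"):
        inner = inner[1:-1]
    name, sep, store = inner.partition("@")
    store_name = store.strip() if sep and store.strip() else None
    return name.strip(), store_name


def resolve_dtype(
    spec: str, dtypes: Dict[str, str] = DTYPES
) -> Tuple[str, List[str], Optional[str]]:
    """Follow the chain of bracketed types down to a final dtype.

    Returns (final_dtype, type_chain, store_name). The first store seen
    (the outermost one) is kept, so it overrides any inner store.
    """
    chain: List[str] = []
    store_name: Optional[str] = None
    current = spec
    while current.strip().startswith("<"):
        name, store = parse_type_spec(current)
        if name not in dtypes:
            raise KeyError(f"unknown type '{name}'")
        if name in chain:
            raise ValueError(f"circular type chain at '{name}'")
        chain.append(name)
        if store_name is None:
            store_name = store
        current = dtypes[name]
    return current, chain, store_name
```

With these assumptions, `resolve_dtype("<xblob@mystore>")` yields `("json", ["xblob", "content"], "mystore")`.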
- Remove AttributeAdapter class and context-based lookup from attribute_adapter.py
- Simplify attribute_adapter.py to compatibility shim that re-exports from attribute_type
- Remove AttributeAdapter from package exports in __init__.py
- Update tests/schema_adapted.py to use @dj.register_type decorator
- Update tests/test_adapted_attributes.py to work with globally registered types
- Remove test_attribute_adapter_deprecated test from test_attribute_type.py

Types are now registered globally via @dj.register_type decorator, eliminating the need for context-based adapter lookup.

Co-authored-by: dimitri-yatsenko <[email protected]>
…ntics

Core types (uuid, json, blob) now map directly to native database types without any implicit serialization. Serialization is handled by AttributeTypes like <djblob> via encode()/decode() methods.

Changes:
- Rename SERIALIZED_TYPES to BINARY_TYPES in declare.py (clearer naming)
- Update check for default values in compile_attribute()
- Clarify in spec that core blob types store raw bytes

Co-authored-by: dimitri-yatsenko <[email protected]>
Major simplification of the type system to two categories:
1. Core DataJoint types (no brackets): float32, uuid, bool, json, blob, etc.
2. AttributeTypes (angle brackets): <djblob>, <object>, <attach>, etc.

Changes:
- declare.py: Remove EXTERNAL_TYPES, BINARY_TYPES; simplify to CORE_TYPE_ALIASES + ADAPTED
- heading.py: Remove is_attachment, is_filepath, is_object, is_external flags
- fetch.py: Simplify _get() to only handle uuid, json, blob, and adapters
- table.py: Simplify __make_placeholder() to only handle uuid, json, blob, numeric
- preview.py: Remove special object field handling (will be AttributeType)
- staged_insert.py: Update object type check to use adapter

All special handling (attach, filepath, object, external storage) will be implemented as built-in AttributeTypes in subsequent phases.

Co-authored-by: dimitri-yatsenko <[email protected]>
Core DataJoint types (fully supported, recorded in :type: comments):
- Numeric: float32, float64, int64, uint64, int32, uint32, int16, uint16, int8, uint8
- Boolean: bool
- UUID: uuid → binary(16)
- JSON: json
- Binary: blob → longblob
- Temporal: date, datetime
- String: char(n), varchar(n)
- Enumeration: enum(...)

Changes:
- declare.py: Define CORE_TYPES with (pattern, sql_mapping) pairs
- declare.py: Add warning for non-standard native type usage
- heading.py: Update to use CORE_TYPE_NAMES
- storage-types-spec.md: Update documentation to reflect core types

Native database types (text, mediumint, etc.) pass through with a warning about non-standard usage.

Co-authored-by: dimitri-yatsenko <[email protected]>
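The (pattern, sql_mapping) idea can be sketched as a small lookup with a pass-through warning. This is only an illustration: the table below covers a handful of the listed types, the regular expressions, the lowercase SQL spellings, and the `to_sql` helper are assumptions, and the real CORE_TYPES in declare.py may be structured differently.

```python
import re
import warnings
from typing import Dict, Optional, Tuple

# Illustrative subset only: name -> (pattern, sql_mapping).
# A mapping of None means "use the declared text unchanged" (an assumption).
CORE_TYPES: Dict[str, Tuple[str, Optional[str]]] = {
    "float32": (r"float32", "float"),
    "uint8": (r"uint8", "tinyint unsigned"),
    "uuid": (r"uuid", "binary(16)"),
    "blob": (r"blob", "longblob"),
    "varchar": (r"varchar\s*\(\s*\d+\s*\)", None),
}


def to_sql(declared: str) -> str:
    """Translate a declared type to SQL, warning on non-core native types."""
    text = declared.strip()
    for pattern, sql in CORE_TYPES.values():
        if re.fullmatch(pattern, text, flags=re.IGNORECASE):
            return sql if sql is not None else text
    # Native types pass through, but the user is told they are non-standard.
    warnings.warn(
        f"Native type '{text}' is non-standard; prefer a core DataJoint type.",
        UserWarning,
        stacklevel=2,
    )
    return text
```

Under these assumptions `to_sql("uuid")` gives `"binary(16)"`, while `to_sql("mediumint")` returns the input unchanged and emits a `UserWarning`.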
Add content-addressed storage with deduplication for the <content> and <xblob> AttributeTypes.

New files:
- content_registry.py: Content storage utilities
  - compute_content_hash(): SHA256 hashing
  - build_content_path(): Hierarchical path generation (_content/xx/yy/hash)
  - put_content(): Store with deduplication
  - get_content(): Retrieve with hash verification
  - content_exists(), delete_content(), get_content_size()

New built-in AttributeTypes in attribute_type.py:
- ContentType (<content>): Content-addressed storage for raw bytes
  - dtype = "json" (stores metadata: hash, store, size)
  - Automatic deduplication via SHA256 hashing
- XBlobType (<xblob>): Serialized blobs with external storage
  - dtype = "<content>" (composition with ContentType)
  - Combines djblob serialization with content-addressed storage

Updated insert/fetch for type chain support:
- table.py: Apply encoder chain from outermost to innermost
- fetch.py: Apply decoder chain from innermost to outermost
- Both pass store_name through the chain for external storage

Example usage:

    data : <content@mystore>   # Raw bytes, deduplicated
    array : <xblob@mystore>    # Serialized objects, deduplicated

Co-authored-by: dimitri-yatsenko <[email protected]>
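A minimal sketch of the content-addressed scheme, assuming that `xx` and `yy` in `_content/xx/yy/hash` are the first two pairs of hex digits of the SHA256 digest. The function names follow the commit message, but the signatures and the in-memory dictionary standing in for a storage backend are hypothetical.

```python
import hashlib
from typing import Dict


def compute_content_hash(data: bytes) -> str:
    """Hex-encoded SHA256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_content_path(content_hash: str) -> str:
    """Hierarchical path; assumes xx/yy are the leading hex digit pairs."""
    return f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"


def put_content(backend: Dict[str, bytes], data: bytes, store_name: str) -> dict:
    """Write the bytes once per unique hash and return JSON-ready metadata."""
    content_hash = compute_content_hash(data)
    path = build_content_path(content_hash)
    if path not in backend:  # deduplication: identical bytes share one path
        backend[path] = data
    return {"hash": content_hash, "store": store_name, "size": len(data)}


def get_content(backend: Dict[str, bytes], content_hash: str) -> bytes:
    """Read the bytes back and verify them against the expected hash."""
    data = backend[build_content_path(content_hash)]
    if compute_content_hash(data) != content_hash:
        raise ValueError("stored content does not match its hash")
    return data
```

Inserting the same bytes twice leaves a single entry in the backend, which is the deduplication property the commit describes; the returned dictionary mirrors the hash, store, and size metadata kept in the JSON column.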
Co-authored-by: dimitri-yatsenko <[email protected]>
…lization

Breaking changes:
- Remove attribute_adapter.py entirely (hard deprecate)
- Remove bypass_serialization flag from blob.py - blobs always serialize now
- Remove unused 'database' field from Attribute in heading.py

Import get_adapter from attribute_type instead of attribute_adapter.

Co-authored-by: dimitri-yatsenko <[email protected]>
- Document function-based content storage (not registry class)
- Add implementation status table
- Explain design decision: functions vs database table
- Update Phase 5 GC design for scanning approach
- Document removed/deprecated items

Co-authored-by: dimitri-yatsenko <[email protected]>

- Create builtin_types.py with DJBlobType, ContentType, XBlobType
- Types serve as examples for users creating custom types
- Module docstring includes example of defining a custom GraphType
- Add get_adapter() function to attribute_type.py for compatibility
- Auto-register built-in types via import at module load

Co-authored-by: dimitri-yatsenko <[email protected]>
Add <object> type for files and folders (Zarr, HDF5, etc.):
- Path derived from primary key: {schema}/{table}/objects/{pk}/{field}_{token}
- Supports bytes, files, and directories
- Returns ObjectRef for lazy fsspec-based access
- No deduplication (unlike <content>)
Update implementation plan with Phase 2b documenting ObjectType.
Co-authored-by: dimitri-yatsenko <[email protected]>
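The path template for `<object>` can be sketched as follows. Only the overall shape `{schema}/{table}/objects/{pk}/{field}_{token}` comes from the commit message; how the primary key is rendered into a path segment and how the token is generated are not stated, so the `name=value` rendering, the `_` separator, and the random hex token below are assumptions, and `build_object_path` is a hypothetical helper.

```python
import secrets
from typing import Mapping, Optional


def build_object_path(
    schema: str,
    table: str,
    primary_key: Mapping[str, object],
    field: str,
    token: Optional[str] = None,
) -> str:
    """Compose {schema}/{table}/objects/{pk}/{field}_{token}."""
    # Assumption: the key is rendered as name=value pairs joined by "_",
    # in the order the primary key attributes are given.
    pk_part = "_".join(f"{name}={value}" for name, value in primary_key.items())
    # Assumption: a short random token distinguishes successive writes.
    if token is None:
        token = secrets.token_hex(4)
    return f"{schema}/{table}/objects/{pk_part}/{field}_{token}"
```

Because the path is derived from the primary key rather than from the content, two rows holding identical data get separate paths, which matches the note that `<object>` does no deduplication.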
Migration utilities are out of scope for now. This is a breaking change version - users will need to recreate tables with new types. Co-authored-by: dimitri-yatsenko <[email protected]>
- Document staged_insert.py for direct object storage writes - Add flow comparison: normal insert vs staged insert - Include staged_insert.py in critical files summary Co-authored-by: dimitri-yatsenko <[email protected]>
Add remaining built-in AttributeTypes:
- <attach>: Internal file attachment stored in longblob
- <xattach>: External file attachment via <content> with deduplication
- <filepath@store>: Reference to existing file (no copy, returns ObjectRef)

Update implementation plan to mark Phase 3 complete.

Co-authored-by: dimitri-yatsenko <[email protected]>
Add garbage collection module (gc.py) for content-addressed storage:
- scan_references() to find content hashes in schemas
- list_stored_content() to enumerate _content/ directory
- scan() for orphan detection without deletion
- collect() for orphan removal with dry_run option
- format_stats() for human-readable output

Add test files:
- test_content_storage.py for content_registry.py functions
- test_type_composition.py for type chain encoding/decoding
- test_gc.py for garbage collection

Update implementation plan to mark all phases complete.

Co-authored-by: dimitri-yatsenko <[email protected]>
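Orphan detection reduces to a set difference between what is stored and what is still referenced. The sketch below shows that idea with plain Python containers; `scan` and `collect` borrow their names from gc.py, but the signatures, the statistics keys, and the dictionary standing in for the `_content/` directory are assumptions made for this example.

```python
from typing import Dict, Set


def scan(referenced: Set[str], stored: Dict[str, int]) -> dict:
    """Report orphans (stored but unreferenced hashes) without deleting.

    `stored` maps content hash -> size in bytes.
    """
    orphaned = sorted(set(stored) - referenced)
    return {
        "referenced": len(referenced),
        "stored": len(stored),
        "orphaned": orphaned,
        "orphaned_bytes": sum(stored[h] for h in orphaned),
    }


def collect(referenced: Set[str], stored: Dict[str, int], dry_run: bool = True) -> dict:
    """Remove orphans unless dry_run is set; always return the statistics."""
    stats = scan(referenced, stored)
    deleted = 0
    if not dry_run:
        for content_hash in stats["orphaned"]:
            del stored[content_hash]
            deleted += 1
    stats["deleted"] = deleted
    return stats
```

Defaulting `dry_run` to `True` in this sketch is a deliberate choice: a first call only reports what would be removed, and deletion requires an explicit opt-in.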
Extend gc.py to handle both storage patterns:
- Content-addressed storage: <content>, <xblob>, <xattach>
- Path-addressed storage: <object>

New functions added:
- _uses_object_storage() - detect object type attributes
- _extract_object_refs() - extract path refs from JSON
- scan_object_references() - scan schemas for object paths
- list_stored_objects() - list all objects in storage
- delete_object() - delete object directory tree

Updated scan() and collect() to handle both storage types, with combined and per-type statistics in the output. Updated tests for new statistics format.

Co-authored-by: dimitri-yatsenko <[email protected]>
External tables are deprecated in favor of the new storage type system. Move the constant to external.py where it's used, keeping declare.py clean. Co-authored-by: dimitri-yatsenko <[email protected]>
External tables (~external_*) are deprecated in favor of the new AttributeType-based storage system. The new types (<xblob>, <content>, <object>) store data directly to storage via StorageBackend without tracking tables.

- Remove src/datajoint/external.py entirely
- Remove ExternalMapping from schemas.py
- Remove external table pre-declaration from table.py

Co-authored-by: dimitri-yatsenko <[email protected]>
Python 3.10+ doesn't have a built-in class property decorator (the @classmethod + @property chaining was deprecated in 3.11). The modern approach is to define properties on the metaclass, which automatically makes them work at the class level.

- Move connection, table_name, full_table_name properties to TableMeta
- Create PartMeta subclass with overridden properties for Part tables
- Remove ClassProperty class from utils.py

Co-authored-by: dimitri-yatsenko <[email protected]>
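The metaclass technique can be shown in isolation. The class names `TableMeta` and `PartMeta` are taken from the commit message; everything else here (the snake_case naming rule, the `database` and `master` attributes, the `__` separator for part tables) is a simplified assumption and does not reproduce DataJoint's actual table naming.

```python
import re


def _snake(name: str) -> str:
    """CamelCase -> snake_case (simplified naming rule for this sketch)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


class TableMeta(type):
    """Properties defined on the metaclass are readable on the class itself."""

    @property
    def table_name(cls) -> str:
        return _snake(cls.__name__)

    @property
    def full_table_name(cls) -> str:
        return f"`{cls.database}`.`{cls.table_name}`"


class PartMeta(TableMeta):
    """Overrides table_name so part tables are prefixed by their master."""

    @property
    def table_name(cls) -> str:
        return f"{cls.master.table_name}__{_snake(cls.__name__)}"


class Table(metaclass=TableMeta):
    database = "lab"

    @property
    def table_name(self) -> str:
        # Metaclass properties are not visible from instances,
        # so instance access delegates to the class.
        return type(self).table_name


class Part(Table, metaclass=PartMeta):
    master = None


class Session(Table):
    pass


class Trial(Part):
    master = Session
```

Here `Session.table_name` and `Session().table_name` both evaluate to `"session"`, and `Trial.full_table_name` resolves through `PartMeta` to `` `lab`.`session__trial` ``. The instance-level property is needed because attribute lookup on an instance never consults the metaclass.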
Replace pytest-managed Docker containers with external docker-compose services. This removes complexity, improves reliability, and allows running tests both from the host machine and inside the devcontainer.

- Remove docker container lifecycle management from conftest.py
- Add pixi tasks for running tests (services-up, test, test-cov)
- Expose MySQL and MinIO ports in docker-compose.yaml for host access
- Simplify devcontainer to extend the main docker-compose.yaml
- Remove docker dependency from test requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Fix Table.table_name property to delegate to metaclass for UserTable subclasses (table_name was returning None instead of computed name)
- Fix heading type loading to preserve database type for core types (uuid, etc.) instead of overwriting with alias from comment
- Add original_type field to Attribute for storing the alias while keeping the actual SQL type in type field
- Fix tests: remove obsolete test_external.py, update resolve_dtype tests to expect 3 return values, update type alias tests to use CORE_TYPE_SQL
- Update pyproject.toml pytest_env to use D: prefix for default-only vars

Test results improved from 174 passed/284 errors to 381 passed/62 errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Type system changes:
- Core type `blob` stores raw bytes without serialization
- Built-in type `<djblob>` handles automatic serialization/deserialization
- Update jobs table to use <djblob> for key and error_stack columns
- Remove enable_python_native_blobs config check (always enabled)

Bug fixes:
- Fix is_blob detection to include NATIVE_BLOB types (longblob, mediumblob, etc.)
- Fix original_type fallback when None
- Fix test_type_aliases to use lowercase keys for CORE_TYPE_SQL lookup
- Allow None context for built-in types in heading initialization
- Update native type warning message wording

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Update settings access tests to check type instead of specific value (safemode is set to False by conftest fixtures)
- Fix config.load() to handle nested JSON dicts in addition to flat dot-notation keys

Test results: 417 passed (was 414)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
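One way to accept both nested JSON and flat dot-notation keys is to flatten nested dictionaries before applying them. The sketch below shows that approach with a hypothetical `flatten` helper; it is not the code in config.load(), and it flattens every nested dictionary, which may not be appropriate for settings whose values are themselves dictionaries.

```python
import json
from typing import Any, Dict


def flatten(nested: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Convert nested dicts to dot-notation keys; flat keys pass through."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat


# A file may mix both styles: a nested "database" block and a flat key.
raw = '{"database": {"host": "localhost", "port": 3306}, "database.user": "root", "safemode": false}'
settings = flatten(json.loads(raw))
```

After this runs, `settings` holds `database.host`, `database.port`, `database.user`, and `safemode` as top-level dot-notation keys.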
- Update GraphType and LayoutToFilepathType to use <djblob> dtype (old filepath@store syntax no longer supported)
- Fix local_schema and schema_virtual_module fixtures to pass connection
- Remove unused imports

Test results: 421 passed, 58 errors, 13 failed (was 417/62/13)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Source code fixes:
- Add download_path setting and squeeze handling in fetch.py
- Add filename collision handling in AttachType and XAttachType
- Fix is_blob detection to check both BLOB and NATIVE_BLOB patterns
- Fix FilepathType.validate to accept Path objects
- Add proper error message for undecorated tables

Test infrastructure updates:
- Update schema_external.py to use new <xblob@store>, <xattach@store>, <filepath@store> syntax
- Update all test tables to use <djblob> instead of longblob for serialization
- Configure object_storage.stores in conftest.py fixtures
- Remove obsolete test_admin.py (set_password was removed)
- Fix connection passing in various tests to avoid credential prompts
- Fix test_query_caching to handle existing directories

README:
- Add Developer Guide section with setup, test, and pre-commit instructions

Test results: 408 passed, 2 skipped (macOS multiprocessing limitation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Description
This PR implements a major redesign of DataJoint's type system and storage architecture for v2.0. The changes establish a clean three-layer type architecture and modernize external storage handling.
Three-Layer Type Architecture
Key Changes
New Type System
- Core types (`int32`, `float64`, `bool`, `uuid`, `json`, `blob`, `enum(...)`) - scientist-friendly, portable across backends
- AttributeTypes (`<djblob>`, `<xblob>`, `<object>`, `<content>`, `<attach>`, `<xattach>`, `<filepath>`) - composable encode/decode with angle bracket syntax
- `blob` now stores raw bytes; use `<djblob>` for serialized Python objects

Built-in AttributeTypes
- `<djblob>`
- `<xblob@store>`
- `<object@store>`
- `<content@store>`
- `<attach>`
- `<xattach@store>`
- `<filepath@store>`

Storage Architecture (OAS - Object-Augmented Schema)
- `{schema}/{table}/{pk}/` - path-addressed, deleted with row
- `_content/{hash}` - content-addressed, deduplicated, garbage collected
- … `~external_*` tables

Configuration System
- … `datajoint.json`)
- `object_storage.stores.*` configuration for external stores
- … `config.save()` methods

Breaking Changes
- `longblob` no longer auto-serializes - use `<djblob>` instead
- `blob@store`, `attach@store`, `filepath@store` syntax replaced with `<xblob@store>`, `<xattach@store>`, `<filepath@store>`
- `AttributeAdapter` - use `AttributeType` with `@dj.register_type`
- `set_password()` function
- `bypass_serialization` context manager
- `external.py` module (deprecated)

New Features

Testing Infrastructure

New Test Modules
- `test_attribute_type.py`
- `test_content_storage.py`
- `test_gc.py`
- `test_object.py`
- `test_type_aliases.py`
- `test_type_composition.py`
- `test_settings.py`

Removed Obsolete Tests
- `test_admin.py` - tested removed `set_password()` function
- `test_bypass_serialization.py` - tested removed context manager
- `test_external.py` - tested legacy external storage

Updated Test Schemas
- `schema_object.py` - new schema for object type tests
- `schema_type_aliases.py` - new schema for type alias tests
- … `<djblob>` instead of `longblob` for serialized data
- `schema_external.py` updated to use new `<xblob@store>`, `<xattach@store>`, `<filepath@store>` syntax

Infrastructure Improvements
- `conftest.py` (714 lines changed, net reduction)
- `object_storage.stores.*` fixture configuration

Migration from Legacy Types
- `longblob` (auto-serialized) → `<djblob>`
- `blob@store` → `<xblob@store>`
- `attach` → `<attach>`
- `attach@store` → `<xattach@store>`
- `filepath@store` → `<filepath@store>`

Test Results