This repository was archived by the owner on Oct 10, 2025. It is now read-only.

Conversation

@royi-luo
Contributor

@royi-luo royi-luo commented Jul 31, 2024

Description

Implements [ALP compression for floating-point values](https://dl.acm.org/doi/pdf/10.1145/3626717). The algorithm consists of the following steps:

  1. Pick values e and f. These can be found by sampling values, running the steps below on the sample, and keeping the pair that gives the best compression ratio.
  2. Encode each floating-point value to an integer via the formula encoded_value = round(float_value * 10^f * 10^(-e)), then test whether decoding the encoded value reproduces the original exactly. Values that round-trip without loss are stored as normally bitpacked integers. Values that would lose data ('exceptions') are stored separately in uncompressed form.
  3. Repeat the above steps for each chunk of values.
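
The encode/round-trip/exception logic in step 2 can be sketched as follows. This is a minimal illustration, not the code in this PR: `POW10`, `tryEncode`, `decodeValue`, `EncodedChunk`, `encodeChunk` and `decodeChunk` are hypothetical names, the exponent table is limited to 0..6, decoding uses a division for simplicity, and each exception keeps a placeholder slot of 0 in the encoded array (an assumption made so that every chunk holds a constant number of encoded values).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Powers of ten as exact double literals (hypothetical table; supports e, f in 0..6).
constexpr double POW10[] = {1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0};

// 2^53: below this magnitude every integer is exactly representable as a double.
constexpr double MAX_EXACT_INT = 9007199254740992.0;

// Inverse of the encoding formula: encoded * 10^e / 10^f.
inline double decodeValue(int64_t encoded, int e, int f) {
    return static_cast<double>(encoded) * POW10[e] / POW10[f];
}

// Returns true and sets `out` only if `value` survives an encode/decode round trip.
inline bool tryEncode(double value, int e, int f, int64_t& out) {
    const double scaled = value * POW10[f] / POW10[e];
    // NaN, infinities and very large values can never round-trip: treat as exceptions.
    if (!std::isfinite(scaled) || std::fabs(scaled) >= MAX_EXACT_INT) {
        return false;
    }
    out = std::llround(scaled);
    return decodeValue(out, e, f) == value;
}

// Hypothetical container for one encoded chunk (not the on-disk layout of this PR).
struct EncodedChunk {
    std::vector<int64_t> encoded;                           // one slot per input value
    std::vector<std::pair<std::size_t, double>> exceptions; // (position, original value)
};

inline EncodedChunk encodeChunk(const std::vector<double>& values, int e, int f) {
    EncodedChunk chunk;
    chunk.encoded.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        int64_t encoded = 0;
        if (!tryEncode(values[i], e, f, encoded)) {
            chunk.exceptions.emplace_back(i, values[i]);
            encoded = 0; // placeholder slot; the real value lives in `exceptions`
        }
        chunk.encoded.push_back(encoded);
    }
    return chunk;
}

inline std::vector<double> decodeChunk(const EncodedChunk& chunk, int e, int f) {
    std::vector<double> values;
    values.reserve(chunk.encoded.size());
    for (int64_t encoded : chunk.encoded) {
        values.push_back(decodeValue(encoded, e, f));
    }
    // Patch the positions that could not be encoded losslessly.
    for (const auto& [pos, original] : chunk.exceptions) {
        values[pos] = original;
    }
    return values;
}
```

With e = 0 and f = 1, the inputs 1.5, 0.1 and 7.0 encode to 15, 1 and 70, while 2.25 scales to 22.5, rounds to 23, decodes to 2.3 and is therefore recorded as an exception.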

Generally, the compression ratio improves when all the values are in a similar range and have similar decimal precision.
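
To illustrate why a narrow value range helps, the sketch below estimates how many bits each encoded integer needs. It assumes a frame-of-reference style bitpacker that stores every value as an offset from the chunk minimum; that is an assumption for illustration, and `bitsNeeded` is a hypothetical helper rather than a function from this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bits per value needed to store each integer as an offset from the chunk minimum.
inline unsigned bitsNeeded(const std::vector<int64_t>& encoded) {
    if (encoded.empty()) {
        return 0;
    }
    const auto minMax = std::minmax_element(encoded.begin(), encoded.end());
    // Unsigned subtraction avoids signed overflow when the range spans most of int64_t.
    uint64_t range = static_cast<uint64_t>(*minMax.second) - static_cast<uint64_t>(*minMax.first);
    unsigned bits = 0;
    while (range != 0) {
        ++bits;
        range >>= 1;
    }
    return bits;
}
```

Under these assumptions, values such as 1.1, 1.2, 1.5 and 1.9 scaled by 10 become 11, 12, 15 and 19 and need 4 bits each, whereas mixing 0.1 and 1000.5 in the same chunk (1 and 10005 after scaling) needs 14 bits per value.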

This implementation deviates from the reference implementation (DuckDB) in a few ways:

  • DuckDB compresses one chunk for each segment. Since we don't have the concept of segments, we compress one chunk for each column chunk, which results in a worse compression ratio because each chunk covers a larger set of values.
  • In the DuckDB implementation, exceptions are appended directly to the end of each ALP chunk. This does not work for us because we rely on there being a constant number of compressed values in each page. Instead, we store all the exceptions for a column chunk as a separate uncompressed chunk, which we cache in memory during each checkpoint, and iterate through it while compressing/decompressing the actual column chunk of floating-point values. Reading/flushing the exception chunk and searching it both carry a performance cost, so we turn off compression if there are too many exceptions in a chunk.
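
A rough sketch of how a separately stored exception list could be applied while decompressing page by page is shown below. It is not the code in this PR: `Exception`, `findFirstException`, `patchPage` and `shouldUseAlp` are hypothetical names, the exceptions are assumed to be sorted by position, and the 10% cutoff is a made-up threshold used only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One entry of the (hypothetical) exception chunk, sorted by position.
struct Exception {
    std::size_t pos; // position of the value within the column chunk
    double value;    // original, uncompressed value
};

// Binary search for the first exception at or after `pos` (for scans that start mid-chunk).
inline std::size_t findFirstException(const std::vector<Exception>& exceptions, std::size_t pos) {
    const auto it = std::lower_bound(exceptions.begin(), exceptions.end(), pos,
        [](const Exception& ex, std::size_t p) { return ex.pos < p; });
    return static_cast<std::size_t>(it - exceptions.begin());
}

// Overwrites decoded values in `page`, which covers positions
// [pageStart, pageStart + page.size()), and returns the advanced cursor
// so the next page can continue from where this one stopped.
inline std::size_t patchPage(std::vector<double>& page, std::size_t pageStart,
    const std::vector<Exception>& exceptions, std::size_t cursor) {
    const std::size_t pageEnd = pageStart + page.size();
    while (cursor < exceptions.size() && exceptions[cursor].pos < pageEnd) {
        if (exceptions[cursor].pos >= pageStart) {
            page[exceptions[cursor].pos - pageStart] = exceptions[cursor].value;
        }
        ++cursor;
    }
    return cursor;
}

// Fall back to no compression when exceptions are too frequent.
// The default cutoff is an illustrative assumption, not the value used in this PR.
inline bool shouldUseAlp(std::size_t numExceptions, std::size_t numValues,
    double maxExceptionFraction = 0.1) {
    return static_cast<double>(numExceptions) <=
           maxExceptionFraction * static_cast<double>(numValues);
}
```

A sequential scan would call `patchPage` once per page and carry the returned cursor forward, while a scan that starts in the middle of the chunk would first call `findFirstException` to position the cursor.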

The deviations generally negatively affect our performance and compression ratio; we should update our implementation once segmentation is done to follow the paper more faithfully.

Contributor agreement

@royi-luo
Contributor Author

Some older benchmarks can be found in this PR

@royi-luo royi-luo changed the base branch from master to royi/alp-float-double-compression July 31, 2024 18:53
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from c85af48 to a328bbd Compare July 31, 2024 19:02
@royi-luo royi-luo changed the base branch from royi/alp-float-double-compression to master July 31, 2024 19:05
@royi-luo royi-luo force-pushed the royi/alp-implementation branch 4 times, most recently from 011d428 to 1ffc72f Compare July 31, 2024 22:05
@codecov

codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 95.02664% with 56 lines in your changes missing coverage. Please review.

Project coverage is 84.03%. Comparing base (a6dc2cb) to head (d4f0ede).
Report is 4 commits behind head on master.

| Files | Patch % | Lines |
| --- | --- | --- |
| test/storage/compress_chunk_test.cpp | 96.08% | 5 Missing and 6 partials ⚠️ |
| src/storage/store/column_reader_writer.cpp | 93.82% | 10 Missing ⚠️ |
| test/storage/column_chunk_metadata_test.cpp | 70.00% | 4 Missing and 5 partials ⚠️ |
| test/storage/compression_test.cpp | 80.43% | 4 Missing and 5 partials ⚠️ |
| src/storage/compression/compression.cpp | 94.57% | 7 Missing ⚠️ |
| src/storage/store/column.cpp | 92.98% | 4 Missing ⚠️ |
| src/storage/compression/float_compression.cpp | 97.75% | 2 Missing ⚠️ |
| ...rc/include/storage/compression/float_compression.h | 80.00% | 1 Missing ⚠️ |
| src/storage/store/column_chunk_metadata.cpp | 98.71% | 1 Missing ⚠️ |
| src/storage/store/compression_flush_buffer.cpp | 98.73% | 1 Missing ⚠️ |

... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3994      +/-   ##
==========================================
+ Coverage   83.80%   84.03%   +0.23%     
==========================================
  Files        1321     1331      +10     
  Lines       52228    53126     +898     
  Branches     7302     7400      +98     
==========================================
+ Hits        43772    44647     +875     
- Misses       8301     8308       +7     
- Partials      155      171      +16     


@royi-luo royi-luo force-pushed the royi/alp-implementation branch from b832548 to 71cc55a Compare August 1, 2024 13:42
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from 71cc55a to 88652bd Compare August 1, 2024 14:43
mxwli previously requested changes Aug 1, 2024
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from 7666ed3 to ca5f9a8 Compare August 1, 2024 17:37
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from ca5f9a8 to d56eaeb Compare August 1, 2024 18:19
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from d56eaeb to 41f0bd2 Compare August 1, 2024 20:35
@ray6080 ray6080 mentioned this pull request Aug 22, 2024
5 tasks
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from a216e89 to 137426e Compare August 22, 2024 23:03
Remove unused exception handling code

Refactor column write so different behaviour can be implemented for floats

Run clang-format

In place update WIP

Implement initial version of in place updates

Run clang-format

Optimize in-place updates

In-place update fixes

Fixes after rebase

Cache exception buffer in memory during transaction

Fix warnings

Compress chunk test cleanup

Code cleanup 1

Fix compile issues on other platforms

Add missing Column::initializeScanState() calls

Fix overflow error in ALP exception count

Add test TODOs

Ignore current exceptions if entire chunk is updated

Only flush used parts of exception buffer

Avoid searching among non-finalized exceptions

CI fix

In place update optimizations

Code Cleanup 1

Update exceptions in place if they are replaced by new exceptions

Fix in-place updates for vector inputs

Tests/modifications for constant/uncompressed compression type

Add tests

Code cleanup + tests

Rust build fix

Enable assertion during in place update

More code cleanup + tests

More code cleanup 2

Fix assertion failure in GetFloatCompressionMetadata when numValues is 0

Bump DB version

Pass std::function by reference

Allow use of constant compression for encoded floats

Update uncompressed test to not break on 32-bit

Bump extension version

Optimize

Pad compression metadata to multiple of 8 bytes

self-review

self-review 2

Add test for column chunk metadata serialize/deserialize

Run clang-format

Update values in compress chunk test to work on 32-bit system

Remove unneeded changes to hash index/disk array

Use existing serializer/deserializer for column chunk metadata

Change ALP_EXCEPTION_* to physical type

Fix test failures

Address misc review comments

Fix test failures again
Fix issues after rebase

Address review comments

Address review comments 2
@royi-luo royi-luo force-pushed the royi/alp-implementation branch 3 times, most recently from a52cdf2 to 408ca5a Compare August 23, 2024 16:04
@royi-luo royi-luo force-pushed the royi/alp-implementation branch from 408ca5a to 08f22f7 Compare August 23, 2024 17:07
Comment on lines +280 to +284
static LogicalType ANY(PhysicalTypeID physicalType) {
    auto ret = LogicalType(LogicalTypeID::ANY);
    ret.physicalType = physicalType;
    return ret;
}
Contributor Author

I added this constructor so that ANY can be treated as an internal logical type that carries a physical type of our choice. This is needed because a logical type is required to construct Column and ColumnChunkData.

KUZU_API static std::vector<LogicalType> copy(const std::vector<LogicalType*>& types);

static LogicalType ANY() { return LogicalType(LogicalTypeID::ANY); }
static LogicalType ANY(PhysicalTypeID physicalType) {
Contributor

Add a comment here saying that this is a temporary hack and that this interface is NOT supposed to be used anywhere else, as we should get rid of it later.

Contributor Author

updated

@royi-luo royi-luo requested a review from mxwli August 23, 2024 20:20
@github-actions

Benchmark Result

Master commit hash: 69b82e2aba5e60e4fc7f10f92fa175405ea82e3f
Branch commit hash: a7aa84e2a6d4c8fd33b6cbc2206744318f7ac692

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 669.42 | 671.39 | -1.97 (-0.29%) |
| aggregation | q28 | 12287.28 | 11369.35 | 917.93 (8.07%) |
| copy | node-Comment | 69145.88 | N/A | N/A |
| copy | node-Forum | 5325.27 | N/A | N/A |
| copy | node-Organisation | 1514.21 | N/A | N/A |
| copy | node-Person | 2730.40 | N/A | N/A |
| copy | node-Place | 1473.38 | N/A | N/A |
| copy | node-Post | 23752.94 | N/A | N/A |
| copy | node-Tag | 1494.68 | N/A | N/A |
| copy | node-Tagclass | 1372.76 | N/A | N/A |
| copy | rel-comment-hasCreator | 60975.20 | N/A | N/A |
| copy | rel-comment-hasTag | 97642.42 | N/A | N/A |
| copy | rel-comment-isLocatedIn | 63489.35 | N/A | N/A |
| copy | rel-containerOf | 18619.15 | N/A | N/A |
| copy | rel-forum-hasTag | 3826.91 | N/A | N/A |
| copy | rel-hasInterest | 2639.86 | N/A | N/A |
| copy | rel-hasMember | 51838.33 | N/A | N/A |
| copy | rel-hasModerator | 2004.77 | N/A | N/A |
| copy | rel-hasType | 469.11 | N/A | N/A |
| copy | rel-isPartOf | 583.73 | N/A | N/A |
| copy | rel-isSubclassOf | 319.27 | N/A | N/A |
| copy | rel-knows | 5844.08 | N/A | N/A |
| copy | rel-likes-comment | 97172.25 | N/A | N/A |
| copy | rel-likes-post | 34779.57 | N/A | N/A |
| copy | rel-organisation-isLocatedIn | 671.44 | N/A | N/A |
| copy | rel-person-isLocatedIn | 739.93 | N/A | N/A |
| copy | rel-post-hasCreator | 16900.63 | N/A | N/A |
| copy | rel-post-hasTag | 23006.94 | N/A | N/A |
| copy | rel-post-isLocatedIn | 16827.76 | N/A | N/A |
| copy | rel-replyOf-comment | 77226.54 | N/A | N/A |
| copy | rel-replyOf-post | 48867.11 | N/A | N/A |
| copy | rel-studyAt | 902.19 | N/A | N/A |
| copy | rel-workAt | 1051.19 | N/A | N/A |
| filter | q14 | 164.72 | 151.59 | 13.13 (8.66%) |
| filter | q15 | 160.10 | 156.58 | 3.52 (2.25%) |
| filter | q16 | 331.37 | 337.04 | -5.67 (-1.68%) |
| filter | q17 | 474.82 | 473.91 | 0.91 (0.19%) |
| filter | q18 | 1971.10 | 1935.24 | 35.86 (1.85%) |
| fixed_size_expr_evaluator | q07 | 575.75 | 569.75 | 6.01 (1.05%) |
| fixed_size_expr_evaluator | q08 | 788.69 | 776.17 | 12.53 (1.61%) |
| fixed_size_expr_evaluator | q09 | 794.55 | 777.82 | 16.72 (2.15%) |
| fixed_size_expr_evaluator | q10 | 272.35 | 266.55 | 5.80 (2.18%) |
| fixed_size_expr_evaluator | q11 | 267.63 | 259.26 | 8.37 (3.23%) |
| fixed_size_expr_evaluator | q12 | 266.16 | 261.22 | 4.94 (1.89%) |
| fixed_size_expr_evaluator | q13 | 1500.51 | 1492.98 | 7.53 (0.50%) |
| fixed_size_seq_scan | q23 | 147.30 | 145.76 | 1.55 (1.06%) |
| join | q31 | 12.67 | 12.98 | -0.31 (-2.36%) |
| ldbc_snb_ic | q35 | 780.86 | 1069.50 | -288.65 (-26.99%) |
| ldbc_snb_ic | q36 | 52.01 | 46.44 | 5.58 (12.01%) |
| ldbc_snb_is | q32 | 9.35 | 9.15 | 0.21 (2.27%) |
| ldbc_snb_is | q33 | 17.43 | 17.81 | -0.38 (-2.12%) |
| ldbc_snb_is | q34 | 8.11 | 8.09 | 0.02 (0.23%) |
| multi-rel | multi-rel-large-scan | 2917.91 | 3396.90 | -478.98 (-14.10%) |
| multi-rel | multi-rel-lookup | 48.10 | 51.28 | -3.19 (-6.21%) |
| multi-rel | multi-rel-small-scan | 47.10 | 53.24 | -6.14 (-11.53%) |
| order_by | q25 | 159.25 | 152.69 | 6.56 (4.30%) |
| order_by | q26 | 472.16 | 476.07 | -3.91 (-0.82%) |
| order_by | q27 | 1436.37 | 1425.10 | 11.27 (0.79%) |
| scan_after_filter | q01 | 201.57 | 201.03 | 0.55 (0.27%) |
| scan_after_filter | q02 | 188.32 | 187.34 | 0.98 (0.52%) |
| shortest_path_ldbc100 | q39 | 87.79 | 92.77 | -4.98 (-5.37%) |
| var_size_expr_evaluator | q03 | 2100.83 | 2113.91 | -13.08 (-0.62%) |
| var_size_expr_evaluator | q04 | 2265.83 | 2223.31 | 42.51 (1.91%) |
| var_size_expr_evaluator | q05 | 2641.38 | 2582.96 | 58.42 (2.26%) |
| var_size_expr_evaluator | q06 | 1412.18 | 1403.52 | 8.66 (0.62%) |
| var_size_seq_scan | q19 | 1497.63 | 1494.62 | 3.01 (0.20%) |
| var_size_seq_scan | q20 | 3243.27 | 3185.51 | 57.76 (1.81%) |
| var_size_seq_scan | q21 | 2471.25 | 2411.05 | 60.21 (2.50%) |
| var_size_seq_scan | q22 | 135.91 | 134.09 | 1.82 (1.36%) |

@royi-luo royi-luo merged commit d3abcd0 into master Aug 24, 2024
@royi-luo royi-luo deleted the royi/alp-implementation branch August 24, 2024 00:07
ted-wq-x pushed a commit to ted-wq-x/kuzu that referenced this pull request Nov 14, 2024
ALP implementation

(cherry picked from commit d3abcd0)
5 participants