ALP implementation #3994

royi-luo · 2024-07-31T17:21:13Z

Description

Implements [ALP compression for floating-point values](https://dl.acm.org/doi/pdf/10.1145/3626717). The general idea of this compression algorithm consists of the following steps:

1. Pick values e, f (these can be found by sampling values, running the steps below, and picking the pair that provides the best compression ratio).
2. Encode each floating-point value to an integer via the formula encoded_value = round(float_value * 10^f * 10^(-e)), then test whether encoding and decoding the value results in a loss of data. Values that round-trip without loss have their encoded integers bitpacked normally. Values that do lose data ('exceptions') are stored separately in uncompressed form.
3. Repeat the above steps for each chunk of values.
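The encode/round-trip/exception steps above can be sketched as follows. This is a minimal illustration and not the code from this PR: `EncodedChunk`, `compressChunk`, `decompressChunk` and the helpers are hypothetical names, e and f follow the convention of the formula in this description, and the bitpacking stage is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container for one compressed chunk; not a type from this PR.
struct EncodedChunk {
    std::vector<int64_t> encoded;                      // integers that would be bitpacked
    std::vector<std::pair<size_t, double>> exceptions; // (position, original value)
};

// 10^n as a double; exact for the small exponents used here (n <= 22).
inline double powerOfTen(uint8_t n) {
    double result = 1.0;
    for (uint8_t i = 0; i < n; ++i) result *= 10.0;
    return result;
}

// Decode one integer: the inverse of the scaling used in compressChunk.
inline double decodeValue(int64_t encoded, uint8_t f, uint8_t e) {
    return static_cast<double>(encoded) * powerOfTen(e) / powerOfTen(f);
}

EncodedChunk compressChunk(const std::vector<double>& values, uint8_t f, uint8_t e) {
    EncodedChunk chunk;
    chunk.encoded.reserve(values.size());
    for (size_t pos = 0; pos < values.size(); ++pos) {
        // Dividing by 10^e is the same as multiplying by 10^(-e).
        const double scaled = values[pos] * powerOfTen(f) / powerOfTen(e);
        // Guard against NaN, infinity and values too large to round safely.
        const bool inRange = std::isfinite(scaled) && std::fabs(scaled) < 9.0e15;
        const int64_t encoded = inRange ? std::llround(scaled) : 0;
        // Note: == treats -0.0 and 0.0 as equal; a stricter check would compare bit patterns.
        if (inRange && decodeValue(encoded, f, e) == values[pos]) {
            chunk.encoded.push_back(encoded);
        } else {
            chunk.encoded.push_back(0); // placeholder keeps positions aligned
            chunk.exceptions.emplace_back(pos, values[pos]);
        }
    }
    return chunk;
}

std::vector<double> decompressChunk(const EncodedChunk& chunk, uint8_t f, uint8_t e) {
    std::vector<double> values;
    values.reserve(chunk.encoded.size());
    for (int64_t encoded : chunk.encoded) values.push_back(decodeValue(encoded, f, e));
    // Exceptions overwrite the placeholder slots with the original values.
    for (const auto& ex : chunk.exceptions) {
        assert(ex.first < values.size());
        values[ex.first] = ex.second;
    }
    return values;
}
```

With f = 2 and e = 0, a value such as 2.25 encodes to 225 and decodes back exactly, while 3.14159265358979 would round to 314 and therefore ends up in the exception list.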

Generally, the compression ratio will improve if all the values are in a similar range and have similar decimal precision.
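To make the sampling step and the range/precision remark more concrete, here is a rough sketch of how a candidate (f, e) pair might be scored on a sample. Everything in it is an assumption for illustration: the names `estimateBits` and `pickBest`, the cost model (frame-of-reference bit width for every slot plus 96 bits per exception), and the search space (e <= f). It is not taken from this PR or from DuckDB.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed cost of one exception: 64-bit value + 32-bit position.
constexpr uint64_t kExceptionBits = 96;

struct Candidate {
    uint8_t f;
    uint8_t e;
    uint64_t estimatedBits;
};

inline double powerOfTen(uint8_t n) {
    double result = 1.0;
    for (uint8_t i = 0; i < n; ++i) result *= 10.0;
    return result;
}

// Number of bits needed to store values in [0, range].
inline uint32_t bitsFor(uint64_t range) {
    uint32_t bits = 0;
    while (range != 0) { ++bits; range >>= 1; }
    return bits;
}

// Estimated compressed size (in bits) of the sample for one (f, e) pair.
uint64_t estimateBits(const std::vector<double>& sample, uint8_t f, uint8_t e) {
    const double up = powerOfTen(f), down = powerOfTen(e);
    int64_t minEnc = std::numeric_limits<int64_t>::max();
    int64_t maxEnc = std::numeric_limits<int64_t>::min();
    uint64_t numExceptions = 0, numEncoded = 0;
    for (double v : sample) {
        const double scaled = v * up / down;
        const bool inRange = std::isfinite(scaled) && std::fabs(scaled) < 9.0e15;
        const int64_t enc = inRange ? std::llround(scaled) : 0;
        if (!inRange || static_cast<double>(enc) * down / up != v) {
            ++numExceptions; // value does not round-trip
            continue;
        }
        minEnc = std::min(minEnc, enc);
        maxEnc = std::max(maxEnc, enc);
        ++numEncoded;
    }
    // A narrower range of encoded integers needs fewer bits per value.
    const uint32_t width =
        numEncoded == 0 ? 0 : bitsFor(static_cast<uint64_t>(maxEnc) - static_cast<uint64_t>(minEnc));
    // Every slot is bitpacked (exceptions keep a placeholder), plus the exception cost.
    return sample.size() * width + numExceptions * kExceptionBits;
}

// Try every pair with e <= f <= maxF and keep the first one with the lowest estimate.
Candidate pickBest(const std::vector<double>& sample, uint8_t maxF) {
    Candidate best{0, 0, std::numeric_limits<uint64_t>::max()};
    for (uint8_t f = 0; f <= maxF; ++f) {
        for (uint8_t e = 0; e <= f; ++e) {
            const uint64_t bits = estimateBits(sample, f, e);
            if (bits < best.estimatedBits) best = Candidate{f, e, bits};
        }
    }
    return best;
}
```

Under this cost model, a sample whose values all have two decimal digits and a small range is cheap to encode, whereas a pair that leaves most values as exceptions is penalised heavily.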

This implementation deviates from the reference implementation (DuckDB) in a few ways:

- DuckDB compresses one chunk for each segment. Since we don't have the concept of segments, we compress one chunk for each column chunk (which results in a worse compression ratio since it is over a larger chunk of values).
- In the DuckDB implementation, exceptions are directly appended to the end of each ALP chunk. This does not work for us as we rely on there being a constant number of compressed values in each page. Thus we store all the exceptions for a column chunk as a separate uncompressed chunk and iterate through the exception chunk (which we cache in memory during each checkpoint) while compressing/uncompressing the actual column chunk of floating-point values. Reading/flushing the exception chunk and searching it both carry a performance cost, so we turn off compression if there are too many exceptions in a chunk.
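The separate exception chunk described above could look roughly like the sketch below. It assumes the exceptions are kept sorted by position so that a page can locate its exceptions with a binary search. `ExceptionEntry`, `patchExceptions`, `shouldUseAlp` and the 5% threshold are hypothetical and chosen only for illustration; the actual layout and cutoff in this PR may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry in the uncompressed exception chunk.
struct ExceptionEntry {
    uint32_t posInChunk; // position of the value within the column chunk
    double value;        // original, uncompressed value
};

// Assumed heuristic: fall back to uncompressed storage when too many values are exceptions.
inline bool shouldUseAlp(size_t numExceptions, size_t numValues, double maxExceptionRatio = 0.05) {
    return static_cast<double>(numExceptions) <= maxExceptionRatio * static_cast<double>(numValues);
}

// Overwrite decoded values for positions [startPos, startPos + decoded.size())
// with any exceptions that fall inside that range.
void patchExceptions(const std::vector<ExceptionEntry>& sortedExceptions, uint32_t startPos,
    std::vector<double>& decoded) {
    assert(std::is_sorted(sortedExceptions.begin(), sortedExceptions.end(),
        [](const ExceptionEntry& a, const ExceptionEntry& b) { return a.posInChunk < b.posInChunk; }));
    // Binary search for the first exception at or after the start of this page.
    auto it = std::lower_bound(sortedExceptions.begin(), sortedExceptions.end(), startPos,
        [](const ExceptionEntry& entry, uint32_t pos) { return entry.posInChunk < pos; });
    const uint64_t endPos = static_cast<uint64_t>(startPos) + decoded.size();
    for (; it != sortedExceptions.end() && it->posInChunk < endPos; ++it) {
        decoded[it->posInChunk - startPos] = it->value;
    }
}
```

Keeping the exceptions out of the compressed pages is what preserves a constant number of values per page; the price is the extra lookup shown here on every read.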

The deviations generally negatively affect our performance and compression ratio; we should update our implementation once segmentation is done to follow the paper more faithfully.

Contributor agreement

I have read and agree to the Contributor Agreement.

royi-luo · 2024-07-31T17:22:15Z

Some older benchmarks can be found in this PR

src/include/storage/compression/compression.h

codecov · 2024-07-31T22:35:30Z

Codecov Report

Attention: Patch coverage is 95.02664% with 56 lines in your changes missing coverage. Please review.

Project coverage is 84.03%. Comparing base (a6dc2cb) to head (d4f0ede).
Report is 4 commits behind head on master.

Files	Patch %	Lines
test/storage/compress_chunk_test.cpp	96.08%	5 Missing and 6 partials ⚠️
src/storage/store/column_reader_writer.cpp	93.82%	10 Missing ⚠️
test/storage/column_chunk_metadata_test.cpp	70.00%	4 Missing and 5 partials ⚠️
test/storage/compression_test.cpp	80.43%	4 Missing and 5 partials ⚠️
src/storage/compression/compression.cpp	94.57%	7 Missing ⚠️
src/storage/store/column.cpp	92.98%	4 Missing ⚠️
src/storage/compression/float_compression.cpp	97.75%	2 Missing ⚠️
...rc/include/storage/compression/float_compression.h	80.00%	1 Missing ⚠️
src/storage/store/column_chunk_metadata.cpp	98.71%	1 Missing ⚠️
src/storage/store/compression_flush_buffer.cpp	98.73%	1 Missing ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3994      +/-   ##
==========================================
+ Coverage   83.80%   84.03%   +0.23%     
==========================================
  Files        1321     1331      +10     
  Lines       52228    53126     +898     
  Branches     7302     7400      +98     
==========================================
+ Hits        43772    44647     +875     
- Misses       8301     8308       +7     
- Partials      155      171      +16


src/include/common/utils.h

src/include/storage/compression/compression_float.h

third_party/alp/include/alp/utils.hpp

src/include/storage/store/column_chunk_data.h

src/include/storage/store/column_chunk_metadata.h

src/include/storage/store/column_read.h

src/storage/compression/compression.cpp

src/storage/compression/compression_float.cpp

src/storage/store/column_chunk_flush.cpp

Remove unused exception handling code
Refactor column write so different behaviour can be implemented for floats
Run clang-format
In place update WIP
Implement initial version of in place updates
Run clang-format
Optimize in-place updates
In-place update fixes
Fixes after rebase
Cache exception buffer in memory during transaction
Fix warnings
Compress chunk test cleanup
Code cleanup 1
Fix compile issues on other platforms
Add missing Column::initializeScanState() calls
Fix overflow error in ALP exception count
Add test TODOs
Ignore current exceptions if entire chunk is updated
Only flush used parts of exception buffer
Avoid searching among non-finalized exceptions
CI fix
In place update optimizations
Code Cleanup 1
Update exceptions in place if they are replaced by new exceptions
Fix in-place updates for vector inputs
Tests/modifications for constant/uncompressed compression type
Add tests
Code cleanup + tests
Rust build fix
Enable assertion during in place update
More code cleanup + tests
More code cleanup 2
Fix assertion failure in GetFloatCompressionMetadata when numValues is 0
Bump DB version
Pass std::function by reference
Allow use of constant compression for encoded floats
Update uncompressed test to not break on 32-bit
Bump extension version
Optimize
Pad compression metadata to multiple of 8 bytes
self-review
self-review 2
Add test for column chunk metadata serialize/deserialize
Run clang-format
Update values in compress chunk test to work on 32-bit system
Remove unneeded changes to hash index/disk array
Use existing serializer/deserializer for column chunk metadata
Change ALP_EXCEPTION_* to physical type
Fix test failures
Address misc review comments
Fix test failures again

Fix issues after rebase
Address review comments
Address review comments 2

royi-luo · 2024-08-23T17:17:33Z

src/include/common/types/types.h

+    static LogicalType ANY(PhysicalTypeID physicalType) {
+        auto ret = LogicalType(LogicalTypeID::ANY);
+        ret.physicalType = physicalType;
+        return ret;
+    }


I added this constructor to treat ANY as an internal logical type that contains a physical type of our choice. This is because logical types are needed to construct Column and ColumnChunkData.

…d (but we want to test the update anyways)

src/storage/store/in_memory_exception_chunk.cpp

ray6080 · 2024-08-23T20:13:29Z

src/include/common/types/types.h

    KUZU_API static std::vector<LogicalType> copy(const std::vector<LogicalType*>& types);

    static LogicalType ANY() { return LogicalType(LogicalTypeID::ANY); }
+    static LogicalType ANY(PhysicalTypeID physicalType) {


Add a comment here saying that this is a temporary hack and that this interface is NOT supposed to be used anywhere else, as we should get rid of it later.

addressed

github-actions · 2024-08-23T20:54:21Z

Benchmark Result

Master commit hash: 69b82e2aba5e60e4fc7f10f92fa175405ea82e3f
Branch commit hash: a7aa84e2a6d4c8fd33b6cbc2206744318f7ac692

Query Group	Query Name	Mean Time - Commit (ms)	Mean Time - Master (ms)	Diff
aggregation	q24	669.42	671.39	-1.97 (-0.29%)
aggregation	q28	12287.28	11369.35	917.93 (8.07%)
copy	node-Comment	69145.88	N/A	N/A
copy	node-Forum	5325.27	N/A	N/A
copy	node-Organisation	1514.21	N/A	N/A
copy	node-Person	2730.40	N/A	N/A
copy	node-Place	1473.38	N/A	N/A
copy	node-Post	23752.94	N/A	N/A
copy	node-Tag	1494.68	N/A	N/A
copy	node-Tagclass	1372.76	N/A	N/A
copy	rel-comment-hasCreator	60975.20	N/A	N/A
copy	rel-comment-hasTag	97642.42	N/A	N/A
copy	rel-comment-isLocatedIn	63489.35	N/A	N/A
copy	rel-containerOf	18619.15	N/A	N/A
copy	rel-forum-hasTag	3826.91	N/A	N/A
copy	rel-hasInterest	2639.86	N/A	N/A
copy	rel-hasMember	51838.33	N/A	N/A
copy	rel-hasModerator	2004.77	N/A	N/A
copy	rel-hasType	469.11	N/A	N/A
copy	rel-isPartOf	583.73	N/A	N/A
copy	rel-isSubclassOf	319.27	N/A	N/A
copy	rel-knows	5844.08	N/A	N/A
copy	rel-likes-comment	97172.25	N/A	N/A
copy	rel-likes-post	34779.57	N/A	N/A
copy	rel-organisation-isLocatedIn	671.44	N/A	N/A
copy	rel-person-isLocatedIn	739.93	N/A	N/A
copy	rel-post-hasCreator	16900.63	N/A	N/A
copy	rel-post-hasTag	23006.94	N/A	N/A
copy	rel-post-isLocatedIn	16827.76	N/A	N/A
copy	rel-replyOf-comment	77226.54	N/A	N/A
copy	rel-replyOf-post	48867.11	N/A	N/A
copy	rel-studyAt	902.19	N/A	N/A
copy	rel-workAt	1051.19	N/A	N/A
filter	q14	164.72	151.59	13.13 (8.66%)
filter	q15	160.10	156.58	3.52 (2.25%)
filter	q16	331.37	337.04	-5.67 (-1.68%)
filter	q17	474.82	473.91	0.91 (0.19%)
filter	q18	1971.10	1935.24	35.86 (1.85%)
fixed_size_expr_evaluator	q07	575.75	569.75	6.01 (1.05%)
fixed_size_expr_evaluator	q08	788.69	776.17	12.53 (1.61%)
fixed_size_expr_evaluator	q09	794.55	777.82	16.72 (2.15%)
fixed_size_expr_evaluator	q10	272.35	266.55	5.80 (2.18%)
fixed_size_expr_evaluator	q11	267.63	259.26	8.37 (3.23%)
fixed_size_expr_evaluator	q12	266.16	261.22	4.94 (1.89%)
fixed_size_expr_evaluator	q13	1500.51	1492.98	7.53 (0.50%)
fixed_size_seq_scan	q23	147.30	145.76	1.55 (1.06%)
join	q31	12.67	12.98	-0.31 (-2.36%)
ldbc_snb_ic	q35	780.86	1069.50	-288.65 (-26.99%)
ldbc_snb_ic	q36	52.01	46.44	5.58 (12.01%)
ldbc_snb_is	q32	9.35	9.15	0.21 (2.27%)
ldbc_snb_is	q33	17.43	17.81	-0.38 (-2.12%)
ldbc_snb_is	q34	8.11	8.09	0.02 (0.23%)
multi-rel	multi-rel-large-scan	2917.91	3396.90	-478.98 (-14.10%)
multi-rel	multi-rel-lookup	48.10	51.28	-3.19 (-6.21%)
multi-rel	multi-rel-small-scan	47.10	53.24	-6.14 (-11.53%)
order_by	q25	159.25	152.69	6.56 (4.30%)
order_by	q26	472.16	476.07	-3.91 (-0.82%)
order_by	q27	1436.37	1425.10	11.27 (0.79%)
scan_after_filter	q01	201.57	201.03	0.55 (0.27%)
scan_after_filter	q02	188.32	187.34	0.98 (0.52%)
shortest_path_ldbc100	q39	87.79	92.77	-4.98 (-5.37%)
var_size_expr_evaluator	q03	2100.83	2113.91	-13.08 (-0.62%)
var_size_expr_evaluator	q04	2265.83	2223.31	42.51 (1.91%)
var_size_expr_evaluator	q05	2641.38	2582.96	58.42 (2.26%)
var_size_expr_evaluator	q06	1412.18	1403.52	8.66 (0.62%)
var_size_seq_scan	q19	1497.63	1494.62	3.01 (0.20%)
var_size_seq_scan	q20	3243.27	3185.51	57.76 (1.81%)
var_size_seq_scan	q21	2471.25	2411.05	60.21 (2.50%)
var_size_seq_scan	q22	135.91	134.09	1.82 (1.36%)

ALP implementation (cherry picked from commit d3abcd0)

royi-luo mentioned this pull request Jul 31, 2024

[DO NOT REVIEW] Add ALP third-party library + Implementation #3712

Closed

1 task

royi-luo commented Jul 31, 2024

src/include/storage/compression/compression.h

royi-luo changed the base branch from master to royi/alp-float-double-compression July 31, 2024 18:53

royi-luo force-pushed the royi/alp-implementation branch from c85af48 to a328bbd Compare July 31, 2024 19:02

royi-luo changed the base branch from royi/alp-float-double-compression to master July 31, 2024 19:05

royi-luo force-pushed the royi/alp-implementation branch 4 times, most recently from 011d428 to 1ffc72f Compare July 31, 2024 22:05

royi-luo force-pushed the royi/alp-implementation branch from b832548 to 71cc55a Compare August 1, 2024 13:42