fix(parquet): write single file if option is set#17009
fix(parquet): write single file if option is set#17009hknlof wants to merge 6 commits intoapache:mainfrom
Conversation
1c05032 to
f1d58c5
Compare
|
I restarted the checks |
|
Sorry for the delay -- this PR seems to have quite. few conflicts. |
00148b2 to
651cfb8
Compare
No worries! My bad, I forgot to sync before rebasing the branch onto main again. Thx! |
Jefffrey
left a comment
There was a problem hiding this comment.
This way of implicitly encoding single fileness via adding an extension to the path does seem quite hacky; not sure if this is the correct approach to follow 🤔
| let path = if file_type.get_ext() != DEFAULT_PARQUET_EXTENSION | ||
| && options.single_file_output | ||
| { | ||
| let mut path = path.to_owned(); | ||
| path.push_str(SINGLE_FILE_EXTENSION); | ||
| path | ||
| } else { | ||
| path.to_owned() | ||
| }; |
There was a problem hiding this comment.
This part confuses me; If I'm not wrong file_type.get_ext() != DEFAULT_PARQUET_EXTENSION will always be true, because:
datafusion/datafusion/datasource-parquet/src/file_format.rs
Lines 157 to 162 in 602475f
| let test_path = std::path::Path::new(&path); | ||
| assert!( | ||
| test_path.is_dir(), | ||
| "No extension and default DataFrameWriteOptons should have yielded a dir." | ||
| ); |
There was a problem hiding this comment.
I think we would need to also check there are indeed multiple parquet files
| let test_path = std::path::Path::new(&path); | ||
| assert!( | ||
| test_path.is_file(), | ||
| "No extension and DataFrameWriteOptons::with_single_file_output(true) should have yielded a single file." | ||
| ); |
There was a problem hiding this comment.
Similarly here we would need to check that only one file is written, not just a file is written
|
@alamb @Jefffrey Sorry for not responding earlier. I will have some head room this week. And will address the comments. @Jefffrey The implicit extension is anything but ideal. Looking into the issue, you will find a discussion around approaches. We probably would have to change the signature to have the more ideal scenario. I will update my findings this evening (CET). |
@Jefffrey Agreed. Where would you suggest to implement it to have a cleaner version? Maybe setting an option in |
I'll try find some time to look into this |
|
Sorry for the long delay, only got around to looking into this issue a bit more recently. Can find my comment here: #13323 (comment) cc @hknlof |
|
Superseded by #19931 |
DataFrameWriteOptions::with_single_file_outputproduces a directory #13323Which issue does this PR close?
DataFrameWriteOptions::with_single_file_outputproduces a directory #13323Rationale for this change
DF.write_parquetwrites multiple files / one directory even ifoptions.single_file_outputis set.What changes are included in this PR?
Introduce an internal
.singleextension.Are these changes tested?
Yes, tests are part of this PR.
Are there any user-facing changes?
Not in this implementation. There might be, if we decide to move to an
FileSinkConfigbased solution.Quoting: #13323 (comment)