Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: add epub and markdown tests + byta column #469

Merged
merged 3 commits into from
Feb 12, 2025

Conversation

Askir
Copy link
Contributor

@Askir Askir commented Feb 11, 2025

I am not sure if doing the filetype guessing in the loader is cleaner or not, but by having the dataclass to pass around we avoid having to inject the config into the parser at least, which I think is definitely better.

"use parsing_auto or parsing_none instead"
)
with io.BytesIO() as file_like:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to close bytesIO objects they are completely contained in memory and simply cleaned up by the garbage collector. The reason other IO objects have to be closed is because they make use of OS resources like file descriptors which are not freed by the GC itself, but BytesIO just "pretends" to be a file so that the interface is consistent.

Copy link
Contributor

@smoya smoya Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the context manager here was to immediately free the in-memory usage, considering the amount of memory we could use in this function. It does not matter if it requires i/o from system or not, it is a matter of releasing the memory as fast as we can; common when handling files and easy to maintain with contexts.

However, It made sense in a prior version (mine) of the code, where there was more code right after the pymupdf parsing. With the current code, there is no benefit on ensuring the file_like gets freed because we are actually referring to that var until the very end of the function.

So 👍 in this particular case.

@Askir Askir changed the title test: add epub and md tests test: add epub and markdown tests Feb 11, 2025
@Askir Askir changed the title test: add epub and markdown tests test: add epub and markdown tests + byta column Feb 11, 2025
@Askir Askir force-pushed the jascha/refactoring-md-epub branch from 90b450a to 4b39966 Compare February 11, 2025 21:26
@Askir Askir marked this pull request as ready for review February 11, 2025 21:26
@Askir Askir requested a review from a team as a code owner February 11, 2025 21:26
else:
raise ValueError("No file extension could be determined")
if payload.file_type is None:
raise ValueError("No file extension could be determined")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("No file extension could be determined")
raise ValueError("No file type could be determined")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We can change it in another PR 👍 )

@smoya
Copy link
Contributor

smoya commented Feb 12, 2025

I am not sure if doing the filetype guessing in the loader is cleaner or not,

Even though we could discuss if this is responsibility of the parser or loader, in practice I actually like this; it makes things way easier. we can always change it in the future 👍

Copy link
Contributor

@smoya smoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀🌔

@smoya smoya merged commit 9af568b into s3-integration-feature-branch Feb 12, 2025
5 checks passed
@smoya smoya deleted the jascha/refactoring-md-epub branch February 12, 2025 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants