-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: add epub and markdown tests + byta column #469
test: add epub and markdown tests + byta column #469
Conversation
"use parsing_auto or parsing_none instead" | ||
) | ||
with io.BytesIO() as file_like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to close bytesIO objects they are completely contained in memory and simply cleaned up by the garbage collector. The reason other IO objects have to be closed is because they make use of OS resources like file descriptors which are not freed by the GC itself, but BytesIO just "pretends" to be a file so that the interface is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the context manager here was to immediately free the in-memory usage, considering the amount of memory we could use in this function. It does not matter if it requires i/o from system or not, it is a matter of releasing the memory as fast as we can; common when handling files and easy to maintain with contexts.
However, It made sense in a prior version (mine) of the code, where there was more code right after the pymupdf parsing. With the current code, there is no benefit on ensuring the file_like
gets freed because we are actually referring to that var until the very end of the function.
So 👍 in this particular case.
90b450a
to
4b39966
Compare
else: | ||
raise ValueError("No file extension could be determined") | ||
if payload.file_type is None: | ||
raise ValueError("No file extension could be determined") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError("No file extension could be determined") | |
raise ValueError("No file type could be determined") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(We can change it in another PR 👍 )
Even though we could discuss if this is responsibility of the parser or loader, in practice I actually like this; it makes things way easier. we can always change it in the future 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚀🌔
I am not sure if doing the filetype guessing in the loader is cleaner or not, but by having the dataclass to pass around we avoid having to inject the config into the parser at least, which I think is definitely better.