test: add epub and markdown tests + byta column #469

Askir · 2025-02-11T21:05:37Z

I am not sure if doing the filetype guessing in the loader is cleaner or not, but having a dataclass to pass around at least avoids injecting the config into the parser, which I think is definitely better.

Askir · 2025-02-11T21:08:47Z

projects/pgai/pgai/vectorizer/parsing.py

                "use parsing_auto or parsing_none instead"
            )
-        with io.BytesIO() as file_like:


You don't need to close BytesIO objects: they are completely contained in memory and are simply cleaned up by the garbage collector. Other IO objects have to be closed because they use OS resources such as file descriptors, which the GC does not free by itself. BytesIO just "pretends" to be a file so that the interface is consistent.

The context manager was used here to free the in-memory buffer immediately, considering how much memory this function could use. Whether the object requires system I/O or not does not matter; the point is to release the memory as fast as we can, which is common when handling files and easy to maintain with contexts.

However, that made sense in a prior version (mine) of the code, where there was more code right after the pymupdf parsing. With the current code, there is no benefit in ensuring that file_like gets freed, because we refer to that variable until the very end of the function.

So 👍 in this particular case.
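
For reference, the trade-off discussed above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from `parsing.py`; the function names are hypothetical. Leaving the `with` block closes the `BytesIO` object and discards its buffer right away, while the plain variant keeps the buffer until the last reference goes away.

```python
import io


def parse_with_context(data: bytes) -> str:
    # Hypothetical helper: the buffer is discarded as soon as the block exits,
    # which only helps if more work follows that no longer needs file_like.
    with io.BytesIO(data) as file_like:
        text = file_like.read().decode("utf-8")
    return text


def parse_without_context(data: bytes) -> str:
    # Hypothetical helper: no explicit close. The buffer is reclaimed once
    # file_like is no longer referenced, here when the function returns.
    file_like = io.BytesIO(data)
    return file_like.read().decode("utf-8")


def read_after_close_fails() -> bool:
    # A closed BytesIO refuses further I/O, so it cannot be used after the block.
    buf = io.BytesIO(b"abc")
    buf.close()
    try:
        buf.read()
    except ValueError:
        return True
    return False
```

When the object is used until the end of the function, as in the current code, both variants hold the buffer for the same duration.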

smoya · 2025-02-12T10:08:53Z

projects/pgai/pgai/vectorizer/parsing.py

-            else:
-                raise ValueError("No file extension could be determined")
+        if payload.file_type is None:
+            raise ValueError("No file extension could be determined")


Suggested change

-            raise ValueError("No file extension could be determined")
+            raise ValueError("No file type could be determined")

(We can change it in another PR 👍 )

smoya · 2025-02-12T10:09:37Z

I am not sure if doing the filetype guessing in the loader is cleaner or not,

Even though we could discuss whether this is the responsibility of the parser or the loader, in practice I actually like this; it makes things way easier. We can always change it in the future 👍
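
A rough sketch of the split being discussed, under stated assumptions: only `payload.file_type` and the error message come from the diff above. The names `LoadedPayload`, `guess_file_type`, `load`, and `parse`, as well as the extension-based guessing, are hypothetical and are not the actual pgai implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class LoadedPayload:
    # Hypothetical dataclass passed from the loader to the parser.
    content: bytes
    file_type: str | None


def guess_file_type(name: str) -> str | None:
    # Assumption: the type is guessed from the file extension only.
    suffix = PurePosixPath(name).suffix.lower().lstrip(".")
    return suffix or None


def load(name: str, content: bytes) -> LoadedPayload:
    # The loader does the guessing, so the parser needs no config injected.
    return LoadedPayload(content=content, file_type=guess_file_type(name))


def parse(payload: LoadedPayload) -> str:
    # The parser only inspects what the loader already determined.
    if payload.file_type is None:
        raise ValueError("No file type could be determined")
    if payload.file_type in ("md", "txt"):
        return payload.content.decode("utf-8")
    raise ValueError(f"Unsupported file type: {payload.file_type}")
```

With this shape, moving the guessing into the parser later would only change where `guess_file_type` is called, not the dataclass itself.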

smoya

LGTM! 🚀🌔

test: add epub and md tests

373f592

Askir commented Feb 11, 2025

fix: typing

0a2716c

Askir changed the title ~~test: add epub and md tests~~ test: add epub and markdown tests Feb 11, 2025

Askir changed the title ~~test: add epub and markdown tests~~ test: add epub and markdown tests + byta column Feb 11, 2025

test: binary docs in column as byta

4b39966

Askir force-pushed the jascha/refactoring-md-epub branch from 90b450a to 4b39966 Compare February 11, 2025 21:26

Askir marked this pull request as ready for review February 11, 2025 21:26

Askir requested a review from a team as a code owner February 11, 2025 21:26

smoya reviewed Feb 12, 2025

smoya approved these changes Feb 12, 2025

smoya merged commit 9af568b into s3-integration-feature-branch Feb 12, 2025
5 checks passed

smoya deleted the jascha/refactoring-md-epub branch February 12, 2025 10:10
