Commit aecdd66

Fix inconsistency in newmm-safe engine by copilot
Related to #755

Update the calculation of `cut_pos` in the `newmm-safe` engine to ensure consistent tokenization results.

* Modify `pythainlp/tokenize/newmm.py` to update the calculation of `cut_pos` at line 193 to `cut_pos = space_idx + 1 + _TEXT_SCAN_BEGIN`.
1 parent 9a9d11f commit aecdd66

1 file changed: +1 -1 lines changed

pythainlp/tokenize/newmm.py

@@ -182,7 +182,7 @@ def segment(
             # try to break by space first
             space_idx = sample.rfind(" ")
             if space_idx >= 0:
-                cut_pos = space_idx + 1
+                cut_pos = space_idx + 1 + _TEXT_SCAN_BEGIN
             else:
                 tokens = list(_onecut(sample, custom_dict))
                 token_max_idx = 0
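
The hunk does not show how `sample` is built, so the following is only a minimal sketch of why the offset could matter. It assumes `sample` is a window sliced from the remaining text starting at `_TEXT_SCAN_BEGIN`, which would make `space_idx` relative to that window rather than to the text being cut. The constant values and the `cut_position` helper are hypothetical and chosen for illustration; they are not taken from `newmm.py`.

```python
# Hypothetical values for illustration; the real constants in newmm.py may differ.
TEXT_SCAN_BEGIN = 4
TEXT_SCAN_END = 12


def cut_position(txt: str, add_offset: bool) -> int:
    """Return the index in `txt` at which to cut, just after the last space
    found inside the scan window (hypothetical helper, not part of PyThaiNLP)."""
    # Assumption: the window is sliced out of `txt`, so indices found in it
    # are relative to the window, not to `txt`.
    sample = txt[TEXT_SCAN_BEGIN:TEXT_SCAN_END]
    space_idx = sample.rfind(" ")
    if space_idx < 0:
        return TEXT_SCAN_END  # simplified fallback, used only in this sketch
    cut_pos = space_idx + 1
    if add_offset:
        cut_pos += TEXT_SCAN_BEGIN  # translate back into `txt` coordinates
    return cut_pos


txt = "abcdefgh ijklmnop"
old = cut_position(txt, add_offset=False)  # behaviour before the change
new = cut_position(txt, add_offset=True)   # behaviour after the change
print(repr(txt[:old]))  # 'abcde'      (cuts in the middle of a word)
print(repr(txt[:new]))  # 'abcdefgh '  (cuts right after the space)
```

Under that assumption, the old expression yields a cut that lands before the space it was meant to follow, while the new expression places the cut immediately after it.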
