-
-
Notifications
You must be signed in to change notification settings - Fork 33.8k
bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
4025110
2a222da
26892d3
27e8122
3762787
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
The purpose of the `unicodedata.is_normalized` function is to answer the question `str == unicodedata.normalized(form, str)` more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15. However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Implement the standard's algorithm. This greatly speeds up `unicodedata.is_normalized` in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO. At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare: $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \ -- 'unicodedata.is_normalized("NFD", s)' 50 loops, best of 5: 4.39 msec per loop With this patch, it gets the answer instantly (58 ns) on the same 1 MB string: $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \ -- 'unicodedata.is_normalized("NFD", s)' 5000000 loops, best of 5: 58.2 nsec per loop
- Loading branch information
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| The implementation of :func:`~unicodedata.is_normalized` has been greatly | ||
| sped up on strings that aren't normalized, by implementing the full | ||
| normalization-quick-check algorithm from the Unicode standard. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -775,29 +775,37 @@ nfc_nfkc(PyObject *self, PyObject *input, int k) | |
| return result; | ||
| } | ||
|
|
||
| typedef enum {YES, NO, MAYBE} NormalMode; | ||
|
|
||
| /* Return YES if the input is certainly normalized, NO or MAYBE if it might not be. */ | ||
| static NormalMode | ||
| is_normalized(PyObject *self, PyObject *input, int nfc, int k) | ||
| // This needs to match the logic in makeunicodedata.py | ||
| // which constructs the quickcheck data. | ||
| typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, good idea! Do you think it'd be cleaner to add that to this PR, or make a small separate one?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could be done latter, I think. |
||
|
|
||
| /* Run the Unicode normalization "quickcheck" algorithm. | ||
| * | ||
| * Return YES or NO if quickcheck determines the input is certainly | ||
| * normalized or certainly not, and MAYBE if quickcheck is unable to | ||
| * tell. */ | ||
| static QuickcheckResult | ||
| is_normalized_quickcheck(PyObject *self, PyObject *input, int nfc, int k) | ||
| { | ||
| Py_ssize_t i, len; | ||
| int kind; | ||
| void *data; | ||
| unsigned char prev_combining = 0, quickcheck_mask; | ||
| /* This is an implementation of the following algorithm: | ||
| https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms | ||
| See there for background. | ||
| */ | ||
gnprice marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| /* An older version of the database is requested, quickchecks must be | ||
| disabled. */ | ||
| if (self && UCD_Check(self)) | ||
| return NO; | ||
|
|
||
| /* This is an implementation of the following algorithm: | ||
| https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms | ||
| See there for background. | ||
| */ | ||
| Py_ssize_t i, len; | ||
| int kind; | ||
| void *data; | ||
| unsigned char prev_combining = 0; | ||
|
|
||
| /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No. */ | ||
| quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0)); | ||
| /* The two quickcheck bits at this shift have type QuickcheckResult. */ | ||
| int quickcheck_shift = (nfc ? 4 : 0) + (k ? 2 : 0); | ||
|
|
||
| QuickcheckResult result = YES; /* certainly normalized, unless we find something */ | ||
|
|
||
| i = 0; | ||
| kind = PyUnicode_KIND(input); | ||
|
|
@@ -806,16 +814,21 @@ is_normalized(PyObject *self, PyObject *input, int nfc, int k) | |
| while (i < len) { | ||
| Py_UCS4 ch = PyUnicode_READ(kind, data, i++); | ||
| const _PyUnicode_DatabaseRecord *record = _getrecord_ex(ch); | ||
| unsigned char combining = record->combining; | ||
| unsigned char quickcheck = record->normalization_quick_check; | ||
|
|
||
| if (quickcheck & quickcheck_mask) | ||
| return MAYBE; /* this string might need normalization */ | ||
| unsigned char combining = record->combining; | ||
| if (combining && prev_combining > combining) | ||
| return NO; /* non-canonical sort order, not normalized */ | ||
| prev_combining = combining; | ||
|
|
||
| unsigned char quickcheck = record->normalization_quick_check; | ||
| switch ((quickcheck >> quickcheck_shift) & 3) { | ||
| case NO: | ||
| return NO; | ||
| case MAYBE: | ||
| result = MAYBE; /* this string might need normalization */ | ||
| } | ||
| } | ||
| return YES; /* certainly normalized */ | ||
| return result; | ||
| } | ||
|
|
||
| /*[clinic input] | ||
|
|
@@ -848,7 +861,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form, | |
| PyObject *result; | ||
| int nfc = 0; | ||
| int k = 0; | ||
| NormalMode m; | ||
| QuickcheckResult m; | ||
|
|
||
| PyObject *cmp; | ||
| int match = 0; | ||
|
|
@@ -871,7 +884,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form, | |
| return NULL; | ||
| } | ||
|
|
||
| m = is_normalized(self, input, nfc, k); | ||
| m = is_normalized_quickcheck(self, input, nfc, k); | ||
|
|
||
| if (m == MAYBE) { | ||
| cmp = (nfc ? nfc_nfkc : nfd_nfkd)(self, input, k); | ||
|
|
@@ -917,28 +930,28 @@ unicodedata_UCD_normalize_impl(PyObject *self, PyObject *form, | |
| } | ||
|
|
||
| if (_PyUnicode_EqualToASCIIId(form, &PyId_NFC)) { | ||
| if (is_normalized(self, input, 1, 0) == YES) { | ||
| if (is_normalized_quickcheck(self, input, 1, 0) == YES) { | ||
| Py_INCREF(input); | ||
| return input; | ||
| } | ||
| return nfc_nfkc(self, input, 0); | ||
| } | ||
| if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKC)) { | ||
| if (is_normalized(self, input, 1, 1) == YES) { | ||
| if (is_normalized_quickcheck(self, input, 1, 1) == YES) { | ||
| Py_INCREF(input); | ||
| return input; | ||
| } | ||
| return nfc_nfkc(self, input, 1); | ||
| } | ||
| if (_PyUnicode_EqualToASCIIId(form, &PyId_NFD)) { | ||
| if (is_normalized(self, input, 0, 0) == YES) { | ||
| if (is_normalized_quickcheck(self, input, 0, 0) == YES) { | ||
| Py_INCREF(input); | ||
| return input; | ||
| } | ||
| return nfd_nfkd(self, input, 0); | ||
| } | ||
| if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKD)) { | ||
| if (is_normalized(self, input, 0, 1) == YES) { | ||
| if (is_normalized_quickcheck(self, input, 0, 1) == YES) { | ||
| Py_INCREF(input); | ||
| return input; | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, I would support merging
test_normalizationinto this file for clarity.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.
A bit of context on someone's thinking in having a separate file can be found above at line 178:
I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)
Happy to send a PR to fold that file in.