bpo-37966: Fully implement the UAX #15 quick-check algorithm.
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard's UAX #15.
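
For instance, the two spellings below always give the same answer; the
point of `is_normalized` is that it can often produce that answer
without building the normalized copy at all:

  import unicodedata

  s = "e\u0301"  # 'e' + combining acute accent; not in NFC form
  assert unicodedata.is_normalized("NFC", s) == (s == unicodedata.normalize("NFC", s))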

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
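
In rough Python terms, the per-character loop the standard describes
(and this patch implements in C) looks like the sketch below, where
`quick_check_property(ch)` is a stand-in for looking up the
NFC_QC/NFD_QC/NFKC_QC/NFKD_QC property in the generated quickcheck
table:

  import unicodedata

  def quickcheck(s, quick_check_property):
      # quick_check_property(ch) returns "YES", "NO" or "MAYBE" for the
      # form being checked; here it's a stand-in for the table lookup.
      result = "YES"
      prev_ccc = 0
      for ch in s:
          ccc = unicodedata.combining(ch)   # canonical combining class
          if ccc != 0 and prev_ccc > ccc:
              return "NO"                   # marks out of canonical order
          prev_ccc = ccc
          qc = quick_check_property(ch)
          if qc == "NO":
              return "NO"                   # char never appears in this form
          if qc == "MAYBE":
              result = "MAYBE"              # only the slow path can tell
      return result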

In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop
gnprice committed Aug 28, 2019
commit 2a222da9638b77a014ba388ebccd1a0e9c9606af
5 changes: 3 additions & 2 deletions Doc/whatsnew/3.8.rst
@@ -1090,8 +1090,9 @@ unicodedata
<http://blog.unicode.org/2019/05/unicode-12-1-en.html>`_ release.

* New function :func:`~unicodedata.is_normalized` can be used to verify a string
-  is in a specific normal form. (Contributed by Max Belanger and David Euresti in
-  :issue:`32285`).
+  is in a specific normal form, often much faster than by actually normalizing
+  the string. (Contributed by Max Belanger, David Euresti, and Greg Price in
+  :issue:`32285` and :issue:`37966`).


unittest
2 changes: 2 additions & 0 deletions Lib/test/test_unicodedata.py
@@ -208,6 +208,8 @@ def test_issue29456(self):
self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

# For tests of unicodedata.is_normalized / self.db.is_normalized ,
# see test_normalization.py .
Contributor

ugh, I would support merging test_normalization into this file for clarity.

Contributor Author

Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.

A bit of context on someone's thinking in having a separate file can be found above at line 178:

        # The rest can be found in test_normalization.py
        # which requires an external file.

I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)

Happy to send a PR to fold that file in.


def test_east_asian_width(self):
eaw = self.db.east_asian_width
@@ -0,0 +1,3 @@
The implementation of :func:`~unicodedata.is_normalized` has been greatly
sped up on strings that aren't normalized, by implementing the full
normalization-quick-check algorithm from the Unicode standard.
65 changes: 39 additions & 26 deletions Modules/unicodedata.c
@@ -775,29 +775,37 @@ nfc_nfkc(PyObject *self, PyObject *input, int k)
return result;
}

typedef enum {YES, NO, MAYBE} NormalMode;

/* Return YES if the input is certainly normalized, NO or MAYBE if it might not be. */
static NormalMode
is_normalized(PyObject *self, PyObject *input, int nfc, int k)
// This needs to match the logic in makeunicodedata.py
// which constructs the quickcheck data.
typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Contributor

Maybe makeunicodedata.py should output this enum (with better name namespacing) then?

Contributor Author

Ah, good idea!

Do you think it'd be cleaner to add that to this PR, or make a small separate one?
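
Roughly I'd picture something like this in makeunicodedata.py (just a
sketch; the helper and enum names here are made up):

  def write_quickcheck_enum(fp):
      # Emit the enum next to the quickcheck table so the two can't
      # drift apart; the C code would then use these names instead of
      # its own YES/MAYBE/NO.
      print("/* Generated by makeunicodedata.py; values match the", file=fp)
      print("   quickcheck bits packed into the database records. */", file=fp)
      print("typedef enum {", file=fp)
      print("    _PyUnicode_QUICKCHECK_YES = 0,", file=fp)
      print("    _PyUnicode_QUICKCHECK_MAYBE = 1,", file=fp)
      print("    _PyUnicode_QUICKCHECK_NO = 2,", file=fp)
      print("} _PyUnicode_QuickcheckResult;", file=fp)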

Contributor

Could be done later, I think.


/* Run the Unicode normalization "quickcheck" algorithm.
*
* Return YES or NO if quickcheck determines the input is certainly
* normalized or certainly not, and MAYBE if quickcheck is unable to
* tell. */
static QuickcheckResult
is_normalized_quickcheck(PyObject *self, PyObject *input, int nfc, int k)
{
Py_ssize_t i, len;
int kind;
void *data;
unsigned char prev_combining = 0, quickcheck_mask;
/* This is an implementation of the following algorithm:
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
See there for background.
*/

/* An older version of the database is requested, quickchecks must be
disabled. */
if (self && UCD_Check(self))
return NO;

/* This is an implementation of the following algorithm:
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
See there for background.
*/
Py_ssize_t i, len;
int kind;
void *data;
unsigned char prev_combining = 0;

/* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No. */
quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
/* The two quickcheck bits at this shift have type QuickcheckResult. */
int quickcheck_shift = (nfc ? 4 : 0) + (k ? 2 : 0);

QuickcheckResult result = YES; /* certainly normalized, unless we find something */

i = 0;
kind = PyUnicode_KIND(input);
@@ -806,16 +814,21 @@ is_normalized(PyObject *self, PyObject *input, int nfc, int k)
while (i < len) {
Py_UCS4 ch = PyUnicode_READ(kind, data, i++);
const _PyUnicode_DatabaseRecord *record = _getrecord_ex(ch);
unsigned char combining = record->combining;
unsigned char quickcheck = record->normalization_quick_check;

if (quickcheck & quickcheck_mask)
return MAYBE; /* this string might need normalization */
unsigned char combining = record->combining;
if (combining && prev_combining > combining)
return NO; /* non-canonical sort order, not normalized */
prev_combining = combining;

unsigned char quickcheck = record->normalization_quick_check;
switch ((quickcheck >> quickcheck_shift) & 3) {
case NO:
return NO;
case MAYBE:
result = MAYBE; /* this string might need normalization */
}
}
return YES; /* certainly normalized */
return result;
}

/*[clinic input]
@@ -848,7 +861,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form,
PyObject *result;
int nfc = 0;
int k = 0;
NormalMode m;
QuickcheckResult m;

PyObject *cmp;
int match = 0;
@@ -871,7 +884,7 @@ unicodedata_UCD_is_normalized_impl(PyObject *self, PyObject *form,
return NULL;
}

m = is_normalized(self, input, nfc, k);
m = is_normalized_quickcheck(self, input, nfc, k);

if (m == MAYBE) {
cmp = (nfc ? nfc_nfkc : nfd_nfkd)(self, input, k);
@@ -917,28 +930,28 @@ unicodedata_UCD_normalize_impl(PyObject *self, PyObject *form,
}

if (_PyUnicode_EqualToASCIIId(form, &PyId_NFC)) {
if (is_normalized(self, input, 1, 0) == YES) {
if (is_normalized_quickcheck(self, input, 1, 0) == YES) {
Py_INCREF(input);
return input;
}
return nfc_nfkc(self, input, 0);
}
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKC)) {
if (is_normalized(self, input, 1, 1) == YES) {
if (is_normalized_quickcheck(self, input, 1, 1) == YES) {
Py_INCREF(input);
return input;
}
return nfc_nfkc(self, input, 1);
}
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFD)) {
if (is_normalized(self, input, 0, 0) == YES) {
if (is_normalized_quickcheck(self, input, 0, 0) == YES) {
Py_INCREF(input);
return input;
}
return nfd_nfkd(self, input, 0);
}
if (_PyUnicode_EqualToASCIIId(form, &PyId_NFKD)) {
if (is_normalized(self, input, 0, 1) == YES) {
if (is_normalized_quickcheck(self, input, 0, 1) == YES) {
Py_INCREF(input);
return input;
}