regexp/syntax: recognize Unicode category aliases #70781

rsc · 2024-12-11T16:32:07Z

The Unicode specification defines aliases for some of the general category names. For example, the category "L" has the alias "Letter".

The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.
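A minimal sketch of the gap described above, using only `regexp.Compile` and `regexp.MustCompile`. The helper names `compiles` and `allLetters` are hypothetical, chosen for illustration. Whether `\p{Letter}` is accepted depends on the Go release in use, so the sketch reports that result instead of assuming it.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether the regexp package accepts the pattern.
// (Hypothetical helper, not part of the standard library.)
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

// allLetters reports whether s is non-empty and consists only of
// code points in the general category L, written with the short name.
func allLetters(s string) bool {
	return regexp.MustCompile(`^\p{L}+$`).MatchString(s)
}

func main() {
	fmt.Println(compiles(`\p{L}`))      // the short name is accepted
	fmt.Println(compiles(`\p{Letter}`)) // depends on the Go release in use
	fmt.Println(allLetters("héllo"))
}
```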

Package regexp would be a permitted implementation for use in JSON-API Schema implementations, except that there are tests that use aliases like \p{Letter} instead of \p{L}.

In #70780 I proposed adding a new CategoryAliases table to package unicode.

If that is accepted, I propose to also recognize the category aliases in regexp/syntax, which will make them work in package regexp.

I also propose to follow https://unicode.org/reports/tr18/#General_Category_Property and add \p{Any}, \p{Assigned}, and \p{ASCII}.

Finally, I propose to make the Unicode names case-insensitive, so that \p{ascii} can be used instead of \p{ASCII}.

gabyhelp · 2024-12-11T16:32:18Z

Related Issues

proposal: unicode: add CategoryAliases #70780

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

rsc · 2025-01-08T16:30:22Z

#70780 has been updated to include adding LC and Cn.
These would automatically start working in regexp as well,
as \p{LC} and \p{Cn}.
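
For readers unfamiliar with LC: Unicode defines it (Cased_Letter) as the union Lu | Ll | Lt. The sketch below spells that union out with short category names the regexp package already accepts, as an approximation of what \p{LC} would match. The names `casedLetter` and `isCasedLetter` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// casedLetter spells out LC as the union of Lu, Ll, and Lt,
// anchored so that it matches exactly one code point.
var casedLetter = regexp.MustCompile(`^[\p{Lu}\p{Ll}\p{Lt}]$`)

// isCasedLetter reports whether s is a single cased letter.
func isCasedLetter(s string) bool {
	return casedLetter.MatchString(s)
}

func main() {
	// U+01C5 is a titlecase letter (Lt); U+02B0 is a modifier letter (Lm).
	for _, s := range []string{"A", "a", "\u01C5", "\u02B0", "中"} {
		fmt.Printf("%q %v\n", s, isCasedLetter(s))
	}
}
```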

gopherbot · 2025-01-08T16:37:37Z

Change https://go.dev/cl/641377 mentions this issue: regexp/syntax: recognize category aliases like \p{Letter}

rsc · 2025-01-08T20:09:29Z

@BurntSushi pointed out that https://unicode.org/reports/tr18/#General_Category_Property also mentions \p{Any}, \p{Assigned}, and \p{ASCII}. Perhaps these should be included in the change as well.

BurntSushi · 2025-01-08T20:12:55Z

\p{any} is nice as a substitute for (?s:.), and \p{ascii} is especially nice in its negation, e.g., \P{ascii} when you're hunting around for things that aren't ASCII.

Obviously neither adds any new capabilities, but I do use them occasionally myself.
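
To illustrate the "no new capabilities" point, here is a rough sketch of the same two searches written in syntax the regexp package already accepts: `(?s:.)` standing in for \p{Any} and `[^\x00-\x7F]` standing in for \P{ASCII}. The helper names `findNonASCII` and `countRunes` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// anyRune matches any single code point, including newline.
	anyRune = regexp.MustCompile(`(?s:.)`)
	// nonASCII matches any single code point outside 00-7F.
	nonASCII = regexp.MustCompile(`[^\x00-\x7F]`)
)

// findNonASCII returns every non-ASCII code point in s, in order.
func findNonASCII(s string) []string {
	return nonASCII.FindAllString(s, -1)
}

// countRunes counts code points by matching each one with anyRune.
func countRunes(s string) int {
	return len(anyRune.FindAllString(s, -1))
}

func main() {
	fmt.Println(findNonASCII("na\u00efve caf\u00e9"))
	fmt.Println(countRunes("a\nb"))
}
```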

rsc · 2025-01-08T20:15:43Z

The other potential change, as Andrew just used in his comment, is treating the names as case-insensitive and also ignoring spaces, hyphens, and underscores. If we're going to be more like TR18, it may be worth hitting that at the same time too.

rsc · 2025-02-05T19:39:10Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

willfaught · 2025-02-07T17:37:14Z

@rsc What is TR18?

BurntSushi · 2025-02-07T17:39:06Z

@willfaught A "Technical Standard" for how regular expressions should behave with respect to Unicode.

rsc · 2025-02-13T15:01:30Z

Have all remaining concerns about this proposal been addressed?

The proposal is to bring the regexp package into further alignment with Unicode TR18 by recognizing the following forms inside \p{...} and \P{...}:

- The new unicode.CategoryAliases, Cn, and LC (unicode: add CategoryAliases, LC, Cn #70780)
- The pseudo-categories Any (all code points), Assigned (= not Cn), and ASCII (00-7F)
- As TR18 says, "be lenient as to spaces, casing, hyphens and underbars", meaning accept \p{lu} for \p{Lu}, \p{ascii} for \p{ASCII}, \p{spacing mark} and even \p{spa_cing-ma rk} for \p{Spacing_Mark} (== \p{Mc})
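
The lenient matching described above could be sketched roughly as follows. This is not the regexp/syntax implementation: the `aliases` map is a small hand-written sample standing in for the proposed unicode.CategoryAliases, and `normalize`, `lookup`, and `inCategory` are hypothetical helpers. Only `unicode.Categories`, `unicode.Is`, and `unicode.ToLower` are existing standard library APIs.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// aliases is a hypothetical sample of long names, keyed by their
// normalized form. It stands in for the proposed alias table.
var aliases = map[string]string{
	"letter":          "L",
	"uppercaseletter": "Lu",
	"spacingmark":     "Mc",
}

// normalize drops spaces, hyphens, and underscores and lowercases the rest.
func normalize(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch r {
		case ' ', '-', '_':
			continue
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// lookup resolves a loosely written name to a category range table.
func lookup(name string) (*unicode.RangeTable, bool) {
	key := normalize(name)
	if short, ok := aliases[key]; ok {
		key = normalize(short)
	}
	// Compare against the short names case-insensitively.
	for cat, tab := range unicode.Categories {
		if normalize(cat) == key {
			return tab, true
		}
	}
	return nil, false
}

// inCategory reports whether r belongs to the category called name.
func inCategory(name string, r rune) bool {
	tab, ok := lookup(name)
	return ok && unicode.Is(tab, r)
}

func main() {
	fmt.Println(normalize("spa_cing-ma rk"))          // spacingmark
	fmt.Println(inCategory("lu", 'A'))                // resolves to Lu
	fmt.Println(inCategory("spa_cing-ma rk", 0x0903)) // resolves to Mc
}
```

A real implementation would also need the pseudo-categories Any, Assigned, and ASCII, which this sketch leaves out.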

aclements · 2025-02-19T19:32:12Z

Based on the discussion above, this proposal seems like a likely accept.
— aclements for the proposal review group

aclements · 2025-02-26T19:32:15Z

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— aclements for the proposal review group

gopherbot added the Proposal label Dec 11, 2024

gopherbot added this to the Proposal milestone Dec 11, 2024

gabyhelp mentioned this issue Dec 11, 2024

unicode: add CategoryAliases, LC, Cn #70780

Open

apparentlymart mentioned this issue Dec 11, 2024

regex and regexall to support Unicode category aliases (upstream Go proposal) opentofu/opentofu#2283

Open

ianlancetaylor added this to Proposals Dec 11, 2024

ianlancetaylor moved this to Incoming in Proposals Dec 11, 2024

rsc moved this from Incoming to Active in Proposals Feb 5, 2025

aclements moved this from Active to Likely Accept in Proposals Feb 19, 2025

aclements added the Proposal-FinalCommentPeriod label Feb 19, 2025

aclements moved this from Likely Accept to Accepted in Proposals Feb 26, 2025

aclements changed the title from "proposal: regexp/syntax: recognize Unicode category aliases" to "regexp/syntax: recognize Unicode category aliases" Feb 26, 2025

aclements modified the milestones: Proposal, Backlog Feb 26, 2025

aclements added Proposal-Accepted and removed Proposal-FinalCommentPeriod labels Feb 26, 2025
