regexp/syntax: recognize Unicode category aliases #70781

Open
rsc opened this issue Dec 11, 2024 · 12 comments

@rsc
Contributor

rsc commented Dec 11, 2024

The Unicode specification defines aliases for some of the general category names. For example the category "L" has alias "Letter".

The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.
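
To make the gap concrete, here is a minimal sketch using only `regexp.MustCompile` and `regexp.Compile` from the standard library. Whether the long alias compiles depends on the Go release in use, so the hypothetical helper `aliasSupported` reports the result instead of assuming one.

```go
package main

import (
	"fmt"
	"regexp"
)

// lettersOnly matches a string made entirely of runes in Unicode category L.
var lettersOnly = regexp.MustCompile(`^\p{L}+$`)

// aliasSupported is a hypothetical helper: it reports whether the regexp
// package in this Go release accepts the long alias \p{Letter}.
func aliasSupported() bool {
	_, err := regexp.Compile(`\p{Letter}`)
	return err == nil
}

func main() {
	fmt.Println(lettersOnly.MatchString("héllo"))  // true
	fmt.Println(lettersOnly.MatchString("héllo1")) // false: '1' is not a letter
	fmt.Println(`\p{Letter} accepted:`, aliasSupported())
}
```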

Package regexp would be a permitted regular expression implementation for JSON-API Schema implementations, except that there are tests that use aliases like \p{Letter} instead of \p{L}.

In #70780 I proposed adding a new CategoryAliases table to package unicode.

If that is accepted, I propose to also recognize the category aliases in regexp/syntax, which will make them work in package regexp.

I also propose to follow https://unicode.org/reports/tr18/#General_Category_Property and add \p{Any}, \p{Assigned}, and \p{ASCII}.

Finally, I propose to make the Unicode names case-insensitive, so that \p{ascii} can be used instead of \p{ASCII}.

gopherbot added this to the Proposal milestone Dec 11, 2024
@rsc
Contributor Author

rsc commented Jan 8, 2025

#70780 has been updated to include adding LC and Cn.
These would automatically start working in regexp as well,
as \p{LC} and \p{Cn}.

@gopherbot
Contributor

Change https://go.dev/cl/641377 mentions this issue: regexp/syntax: recognize category aliases like \p{Letter}

@rsc
Contributor Author

rsc commented Jan 8, 2025

@BurntSushi pointed out that https://unicode.org/reports/tr18/#General_Category_Property also mentions \p{Any}, \p{Assigned}, and \p{ASCII}. Perhaps these should be included in the change as well.

@BurntSushi

\p{any} is nice as a substitute for (?s:.), and \p{ascii} is especially nice in its negation, e.g., \P{ascii} when you're hunting around for things that aren't ASCII.

Obviously neither adds any new capabilities, but I do use them occasionally myself.

@rsc
Contributor Author

rsc commented Jan 8, 2025

The other potential change, as Andrew just used in his comment, is treating the names as case-insensitive and also ignoring spaces, hyphens, and underscores. If we're going to be more like TR18, it may be worth hitting that at the same time too.
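
One way to picture that loose matching is to normalize a name before looking it up. The sketch below is an assumption about how such a comparison could work, not the code in regexp/syntax; `normalizeName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeName is a hypothetical helper, not part of regexp/syntax.
// It lowercases name and drops spaces, hyphens, and underscores, so that
// loosely written property names compare equal.
func normalizeName(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		switch r {
		case ' ', '-', '_':
			continue // skip separators; continue applies to the for loop
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(normalizeName("Spacing_Mark") == normalizeName("spa_cing-ma rk")) // true
	fmt.Println(normalizeName("ASCII") == normalizeName("ascii"))                 // true
}
```

A lookup table keyed by the normalized form would then accept every spelling that normalizes to the same key.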

@rsc
Contributor Author

rsc commented Feb 5, 2025

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

rsc moved this from Incoming to Active in Proposals Feb 5, 2025
@willfaught
Contributor

@rsc What is TR18?

@rsc
Contributor Author

rsc commented Feb 13, 2025

Have all remaining concerns about this proposal been addressed?

The proposal is to bring the regexp package into further alignment with Unicode TR18 by recognizing the following forms inside \p{...} and \P{...}:

  1. The new unicode.CategoryAliases, Cn, and LC (see #70780, "unicode: add CategoryAliases, LC, Cn")
  2. The pseudo-categories Any (all code points), Assigned (= not Cn), and ASCII (U+0000 through U+007F)
  3. As TR18 says, "be lenient as to spaces, casing, hyphens and underbars", meaning accept \p{lu} for \p{Lu}, \p{ascii} for \p{ASCII}, and \p{spacing mark} or even \p{spa_cing-ma rk} for \p{Spacing_Mark} (== \p{Mc})

@aclements
Member

Based on the discussion above, this proposal seems like a likely accept.
— aclements for the proposal review group

aclements moved this from Active to Likely Accept in Proposals Feb 19, 2025
@aclements
Member

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— aclements for the proposal review group

aclements moved this from Likely Accept to Accepted in Proposals Feb 26, 2025
aclements changed the title from "proposal: regexp/syntax: recognize Unicode category aliases" to "regexp/syntax: recognize Unicode category aliases" Feb 26, 2025
aclements changed the milestone from Proposal to Backlog Feb 26, 2025