regexp/syntax: recognize Unicode category aliases #70781
Comments
#70780 has been updated to include adding LC and Cn.
Change https://go.dev/cl/641377 mentions this issue:
@BurntSushi pointed out that https://unicode.org/reports/tr18/#General_Category_Property also mentions \p{Any}, \p{Assigned}, and \p{ASCII}. Perhaps these should be included in the change as well.
Obviously none of these adds any new capabilities, but I do use them occasionally myself.
The other potential change, as Andrew just used in his comment, is treating the names as case-insensitive and also ignoring spaces, hyphens, and underscores. If we're going to be more like TR18, it may be worth hitting that at the same time too.
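For illustration, the loose matching described in this comment could be sketched as a name-folding step applied before table lookup. `looseKey` is a hypothetical helper invented for this sketch, not the function regexp/syntax uses, and it assumes ASCII-style property names.

```go
package main

import (
	"fmt"
	"strings"
)

// looseKey is a hypothetical helper: it folds a property name to lower case
// and drops spaces, hyphens, and underscores, so that differently spelled
// forms of the same name produce the same lookup key.
func looseKey(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		switch r {
		case ' ', '-', '_':
			continue // skip separators; this continues the enclosing for loop
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	names := []string{"Uppercase_Letter", "uppercase letter", "UPPERCASE-LETTER", "ascii"}
	for _, name := range names {
		fmt.Printf("%q -> %q\n", name, looseKey(name))
	}
}
```

A table keyed by `looseKey(name)` would then accept all of these spellings with a single map lookup.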
This proposal has been added to the active column of the proposals project |
@rsc What is TR18? |
Have all remaining concerns about this proposal been addressed? The proposal is to bring the regexp package into further alignment with Unicode TR18 by recognizing the following forms inside \p{...} and \P{...}
Based on the discussion above, this proposal seems like a likely accept. The proposal is to bring the regexp package into further alignment with Unicode TR18 by recognizing the following forms inside \p{...} and \P{...}
No change in consensus, so accepted. 🎉 The proposal is to bring the regexp package into further alignment with Unicode TR18 by recognizing the following forms inside \p{...} and \P{...}
The Unicode specification defines aliases for some of the general category names. For example the category "L" has alias "Letter".
The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.
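A quick way to see which spellings a given Go toolchain accepts is to try compiling them. This is a minimal sketch; `compiles` and `allLetters` are helper names made up for the example, and whether `\p{Letter}` is accepted depends on the Go release in use, so no output is shown for it.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether the regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

// letterRun uses the short category name, which regexp has long supported.
var letterRun = regexp.MustCompile(`^\p{L}+$`)

// allLetters reports whether s is a non-empty run of Unicode letters.
func allLetters(s string) bool {
	return letterRun.MatchString(s)
}

func main() {
	// The result for the alias form varies by Go version.
	for _, p := range []string{`\p{L}`, `\p{Letter}`} {
		fmt.Printf("%-12s compiles: %v\n", p, compiles(p))
	}
	fmt.Println(allLetters("héllo"), allLetters("hello1"))
}
```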
Package regexp would be a permitted implementation for use in JSON-API Schema implementations, except that some tests use aliases like \p{Letter} instead of \p{L}.
In #70780 I proposed adding a new CategoryAliases table to package unicode.
If that is accepted, I propose to also recognize the category aliases in regexp/syntax, which will make them work in package regexp.
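Until a toolchain recognizes the aliases, one possible workaround is to rewrite long names to short ones before compiling. The sketch below is an assumption-laden illustration: the `aliases` map is a small hand-written subset (not the CategoryAliases table proposed in #70780), `expandAliases` and `matchWithAliases` are hypothetical helpers, and the rewrite is deliberately naive (it does not handle an escaped backslash before `p`, or negated forms such as `\p{^Letter}`).

```go
package main

import (
	"fmt"
	"regexp"
)

// aliases is an illustrative subset of the Unicode general category aliases.
var aliases = map[string]string{
	"Letter":           "L",
	"Uppercase_Letter": "Lu",
	"Lowercase_Letter": "Ll",
	"Number":           "N",
	"Decimal_Number":   "Nd",
	"Punctuation":      "P",
}

// classRE matches \p{Name} and \P{Name} where Name is ASCII word characters.
var classRE = regexp.MustCompile(`\\([pP])\{(\w+)\}`)

// expandAliases replaces known long names with their short forms and leaves
// everything else (including script names such as Greek) untouched.
func expandAliases(pattern string) string {
	return classRE.ReplaceAllStringFunc(pattern, func(m string) string {
		sub := classRE.FindStringSubmatch(m)
		if short, ok := aliases[sub[2]]; ok {
			return `\` + sub[1] + `{` + short + `}`
		}
		return m
	})
}

// matchWithAliases compiles the rewritten pattern and matches it against s.
func matchWithAliases(pattern, s string) (bool, error) {
	re, err := regexp.Compile(expandAliases(pattern))
	if err != nil {
		return false, err
	}
	return re.MatchString(s), nil
}

func main() {
	fmt.Println(expandAliases(`^\p{Letter}+\P{Decimal_Number}$`))
	ok, err := matchWithAliases(`^\p{Letter}+$`, "héllo")
	fmt.Println(ok, err)
}
```

Recognizing the aliases inside regexp/syntax, as proposed, would make this kind of preprocessing unnecessary.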
I also propose to follow https://unicode.org/reports/tr18/#General_Category_Property and add \p{Any}, \p{Assigned}, and \p{ASCII}.
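TR18 defines \p{ASCII} as the range U+0000 to U+007F and \p{Any} as U+0000 to U+10FFFF, so both can already be spelled out as explicit character classes; \p{Assigned} is the complement of Cn and is left out of this sketch. The helper names `isASCII` and `isAny` are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Explicit-range equivalent of \p{ASCII}.
	asciiOnly = regexp.MustCompile(`^[\x00-\x7F]*$`)
	// Explicit-range equivalent of \p{Any}.
	anyRunes = regexp.MustCompile(`^[\x{0}-\x{10FFFF}]*$`)
)

// isASCII reports whether every code point in s is in U+0000..U+007F.
func isASCII(s string) bool { return asciiOnly.MatchString(s) }

// isAny reports whether every code point in s is in U+0000..U+10FFFF.
func isAny(s string) bool { return anyRunes.MatchString(s) }

func main() {
	for _, s := range []string{"hello", "héllo", "世界"} {
		fmt.Printf("%q ascii=%v any=%v\n", s, isASCII(s), isAny(s))
	}
}
```

As noted in the discussion, the named forms add no new capability over these classes; they are shorter and match what other TR18-aligned engines accept.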
Finally, I propose to make the Unicode names case-insensitive, so that \p{ascii} can be used instead of \p{ASCII}.