Conjuncts are not selected as a single unit when styling initials #94

Open
r12a opened this issue Feb 5, 2020 · 3 comments
Labels
doc:beng doc:deva gap i:initials Styling initials l:as Assamese l:bn Bengali language & script l:hi Hindi, Devanagari script l:mni Manipuri l:mr Marathi p:basic s:beng Bengali script s:deva Devanagari script x:beng x:blink x:deva x:gecko x:gujr x:webkit

Comments

Contributor

r12a commented Feb 5, 2020

This issue is applicable to most languages that form conjuncts from consonant clusters using an invisible virama.

When the start of a line contains a consonant cluster that is rendered as a conjunct (rather than with a visible virama), ::first-letter should highlight the whole cluster.

Consonant clusters that form conjuncts using an invisible virama between the component letters need to be selected as a unit. This doesn't work well if segmentation relies on Unicode grapheme clusters, since a conjunct with two consonants will be parsed as two grapheme clusters (the first ending after the virama, and the second starting with the second consonant and including any following vowel-signs or other combining characters).

For these situations it is necessary to tailor the segmentation algorithm, so that it recognises the whole consonant cluster plus any attached vowel-signs or combining characters as a single unit.
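
As a rough illustration of the difference, here is a minimal Python sketch. It is not how any browser implements this. `simple_clusters` only approximates grapheme cluster segmentation (a base character plus any following combining marks), and `tailored_clusters` is a hypothetical tailoring that rejoins a cluster ending in a virama with the cluster that follows it. The function names and the small virama set are assumptions made for the example, and cases with a visible virama (for instance, one forced with ZWNJ) are not handled specially.

```python
import unicodedata

# Assumption for this sketch: only these three viramas are considered
# (Devanagari, Bengali, Gujarati).
VIRAMAS = {"\u094D", "\u09CD", "\u0ACD"}


def simple_clusters(text):
    """Approximate grapheme clusters: a base character plus following marks."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me cover viramas and vowel signs.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tailored_clusters(text):
    """Join a cluster ending in a virama with the consonant cluster after it."""
    units = []
    for cluster in simple_clusters(text):
        follows_virama = bool(units) and units[-1][-1] in VIRAMAS
        starts_with_letter = unicodedata.category(cluster[0]) == "Lo"
        if follows_virama and starts_with_letter:
            units[-1] += cluster
        else:
            units.append(cluster)
    return units


# क्षेत्र: KA, VIRAMA, SSA, VOWEL SIGN E, TA, VIRAMA, RA
word = "\u0915\u094D\u0937\u0947\u0924\u094D\u0930"
print(len(simple_clusters(word)))    # 4: KA+VIRAMA | SSA+E | TA+VIRAMA | RA
print(len(tailored_clusters(word)))  # 2: KA+VIRAMA+SSA+E | TA+VIRAMA+RA
```

With the tailored version, the first unit is the whole conjunct plus its vowel sign, which is what ::first-letter would need to select.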

For examples see Typographic character units in complex scripts.

Specs:

css-text-3: CSS uses the concept of 'typographic character unit', rather than grapheme cluster, in its specs. It explains that the cases just described go beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support. The spec doesn't provide details about the support needed for each language.

The Unicode Consortium has made some attempts to address this issue, but they have so far not yielded results. CLDR now flags a few scripts for which conjuncts are common.

Tests & results:
Interactive test, When ::first-letter is applied to Devanagari the browser will select a 2-consonant conjunct as a unit

Interactive test, When ::first-letter is applied to Bengali the browser will select a conjunct as a unit, if the virama is hidden

  • Gecko: ✅❌ Most half-form conjuncts (which are the large majority of all conjuncts in Devanagari) fail: they are broken into an initial consonant with a visible virama and a following consonant.
  • Blink: ✅ All conjuncts are fully selected.
  • Webkit: ✅ All conjuncts are fully selected.

I18n test suite, Devanagari text

Browser bug reports:

Gecko

Priority:
Keeping conjuncts together is a pretty basic requirement. Without a fix for this, authors need to manually mark up text to apply initial letter styling, but that isn't a very useful workaround.

Contributor Author

r12a commented Feb 5, 2020

The first comment in this issue contains text that will automatically appear in one or more gap-analysis documents as a subsection with the same title as this issue. Any edits made to that comment will be immediately available in the document. Proposals for changes or discussion of the content can be made in comments below this point.

Relevant gap analysis documents include:
Bengali, Devanagari, Gujarati

Contributor Author

r12a commented Mar 29, 2021

I rewrote this topic completely, applying the latest template. It introduces one aspect of the conjunct parsing problem, which is a fundamental issue in many Brahmi-derived scripts and also surfaces in other operations, such as letter-spacing and line-breaking.

r12a changed the title from "Incorrect segmentation for styling initials" to "Conjuncts are not selected as a single unit when styling initials" on Mar 30, 2021
Contributor Author

r12a commented Mar 30, 2021

Moved the information about handling clusters with visible viramas into #115, which reduces the complexity here.

Projects
Status: Browser bug raised