Fix CESU8Encoding.leftAdjustCharHead #61

djoooooe · 2023-03-10T11:36:03Z

CESU8Encoding.leftAdjustCharHead currently does not properly handle 6-byte sequences. This pull request aims to fix that.

headius · 2023-03-11T16:23:31Z

Nice, thank you! Will review and merge.

headius · 2023-03-11T16:32:18Z

It looks ok to me but the equivalent code in CRuby for left_adjust_char_head does not have the extra code you added:

static UChar*
left_adjust_char_head(const UChar* start, const UChar* s, const UChar* end, OnigEncoding enc ARG_UNUSED)
{
  const UChar *p;

  if (s <= start) return (UChar* )s;
  p = s;

  while (!utf8_islead(*p) && p > start) p--;
  return (UChar* )p;
}

Does this mean CRuby is not doing this right? Or else why do we need this code?

djoooooe · 2023-03-13T08:41:21Z

I'd say that CRuby is not doing it right:
As far as I understand the JCodings API, left_adjust_char_head should move the cursor left to the start of an entire codepoint.
CESU-8 encodes codepoints greater than 0xffff as a UTF-16 surrogate pair, where each surrogate value is itself encoded as a 3-byte UTF-8 sequence, so the pair looks like two individual UTF-8 codepoints. When left_adjust_char_head only moves to the previous UTF-8 codepoint head, it can land in the "middle" of the surrogate pair instead of at the beginning of the full codepoint.
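
To make the 6-byte layout concrete, here is a minimal sketch of how a supplementary codepoint ends up as two 3-byte sequences in CESU-8. The class and method names (Cesu8EncodeSketch, encode3, encodeSupplementary) are hypothetical and are not part of JCodings; only standard JDK calls are used.

```java
final class Cesu8EncodeSketch {
    // Encode one 16-bit value (here: a single surrogate) as a 3-byte UTF-8-style sequence.
    static void encode3(int unit, byte[] out, int off) {
        out[off]     = (byte) (0xE0 | (unit >> 12));
        out[off + 1] = (byte) (0x80 | ((unit >> 6) & 0x3F));
        out[off + 2] = (byte) (0x80 | (unit & 0x3F));
    }

    // Encode a supplementary codepoint (> 0xFFFF) as 6 CESU-8 bytes.
    // Assumes the caller passes a valid supplementary codepoint.
    static byte[] encodeSupplementary(int codePoint) {
        char[] pair = Character.toChars(codePoint); // [high surrogate, low surrogate]
        byte[] out = new byte[6];
        encode3(pair[0], out, 0);
        encode3(pair[1], out, 3);
        return out;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encodeSupplementary(0x10000)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints "ED A0 80 ED B0 80"
    }
}
```

Bytes 0 and 3 are both lead bytes from a plain UTF-8 point of view, which is why a scan that stops at the nearest lead byte can stop at index 3.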

djoooooe · 2023-03-13T08:46:21Z

The new unit test I included in this pull request demonstrates this: There is already a unit test that verifies that the codepoint length of the surrogate pair "\u00ed\u00a0\u0080\u00ed\u00b0\u0080" is 1. The new test also verifies that left_adjust_char_head moves past the entire sequence. Without the patch, left_adjust_char_head only moves to byte index 3.
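
The behaviour described above can be sketched roughly as follows. This is not the code merged in this pull request; it is an illustration under the assumption that a low surrogate directly preceded by a high surrogate should be treated as one character. All names (Cesu8LeftAdjustSketch, naiveLeftAdjust, leftAdjust, and the surrogate helpers) are hypothetical.

```java
final class Cesu8LeftAdjustSketch {
    // A byte is a lead byte unless it has the 10xxxxxx continuation pattern.
    static boolean isLead(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // Mirrors the C loop quoted earlier in this thread: stop at the nearest lead byte.
    static int naiveLeftAdjust(byte[] bytes, int start, int s, int end) {
        if (s <= start) return s;
        int p = s;
        while (!isLead(bytes[p]) && p > start) p--;
        return p;
    }

    // ED A0..AF xx encodes a high surrogate (U+D800..U+DBFF).
    static boolean isHighSurrogateAt(byte[] bytes, int p, int end) {
        return p + 2 < end
            && (bytes[p] & 0xFF) == 0xED
            && (bytes[p + 1] & 0xF0) == 0xA0
            && (bytes[p + 2] & 0xC0) == 0x80;
    }

    // ED B0..BF xx encodes a low surrogate (U+DC00..U+DFFF).
    static boolean isLowSurrogateAt(byte[] bytes, int p, int end) {
        return p + 2 < end
            && (bytes[p] & 0xFF) == 0xED
            && (bytes[p + 1] & 0xF0) == 0xB0
            && (bytes[p + 2] & 0xC0) == 0x80;
    }

    // Sketch of the fixed behaviour: if we stopped on a low surrogate that is
    // preceded by a high surrogate, step back three more bytes.
    static int leftAdjust(byte[] bytes, int start, int s, int end) {
        int p = naiveLeftAdjust(bytes, start, s, end);
        if (p - start >= 3
                && isLowSurrogateAt(bytes, p, end)
                && isHighSurrogateAt(bytes, p - 3, end)) {
            return p - 3;
        }
        return p;
    }

    public static void main(String[] args) {
        byte[] pair = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                       (byte) 0xED, (byte) 0xB0, (byte) 0x80};
        System.out.println("naive=" + naiveLeftAdjust(pair, 0, 5, pair.length)
                + " fixed=" + leftAdjust(pair, 0, 5, pair.length)); // prints "naive=3 fixed=0"
    }
}
```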

enebo · 2023-03-13T14:17:59Z

src/org/jcodings/specific/CESU8Encoding.java

+  }
+
+  private static int utf8Decode3ByteSequence(byte[] bytes, int p, int c) {
+    return ((c & 0xF) << 12) | ((bytes[p + 1] & 0xff & 0x3f) << 6) | (bytes[p + 2] & 0xff & 0x3f);


It was pointed out to me that 0x3f is not needed here, as 0xff contains the same bit pattern (sans 1 bit). I am sure the compiler will figure this out, so is there some cognitive benefit to seeing it?

djoooooe · 2023-03-13

0xff is not needed, but 0x3f is. I suppose 0xff was introduced during conversion from C to Java to account for signedness. I just kept the & 0xff because it seems to be used everywhere in the JCodings implementation. Should I remove it in this PR?

enebo · 2023-03-13

@djoooooe Ah yeah, I said that backwards. It probably just comes down to whether this is worth leaving in to make it simpler to compare with the C code. I don't think there is harm in leaving it. The only argument for removing it would be Java people studying the code and wondering why we have an extra bitwise AND that we don't need.

headius · 2023-03-13

FWIW this probably isn't too important because all accesses of unsigned byte values should move to the Java 8 Byte.toUnsignedInt anyway. If that means we do the unsigned conversion and then duplicate some work with an additional mask, I think it's still better than making other contributors guess at why one particular byte access doesn't have the unsigned masking.
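
For readers following the mask discussion, a small sketch of why & 0xff is redundant next to & 0x3f, and how the same decode might look with Byte.toUnsignedInt. The class and method names (UnsignedByteMaskSketch, decode3) are made up for illustration; this is not the helper from the diff, just the same arithmetic written another way.

```java
final class UnsignedByteMaskSketch {
    // Same arithmetic as the 3-byte decode in the diff, but using
    // Byte.toUnsignedInt (Java 8+) for the unsigned conversion.
    static int decode3(byte[] bytes, int p) {
        int c = Byte.toUnsignedInt(bytes[p]);
        return ((c & 0xF) << 12)
             | ((Byte.toUnsignedInt(bytes[p + 1]) & 0x3f) << 6)
             | (Byte.toUnsignedInt(bytes[p + 2]) & 0x3f);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xA0;
        System.out.println(b);                                    // prints -96 (Java bytes are signed)
        System.out.println(Byte.toUnsignedInt(b) == (b & 0xff));  // prints true
        System.out.println((b & 0xff & 0x3f) == (b & 0x3f));      // prints true: 0x3f alone clears the sign bits
        byte[] highSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println(Integer.toHexString(decode3(highSurrogate, 0))); // prints d800
    }
}
```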

enebo · 2023-03-13

Another comment I got was:

there is USE_INVALID_CODE_SCHEME as true
wrt:
#ifdef USE_INVALID_CODE_SCHEME
  if (*p > 0xfd) {
    return ((*p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF);
  }
#endif

djoooooe · 2023-03-13T14:28:40Z

Another comment I got was:

there is USE_INVALID_CODE_SCHEME as true wrt: #ifdef USE_INVALID_CODE_SCHEME if (*p > 0xfd) { return ((*p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF); }

I'm sorry, I don't know how this interacts with the implementation of left_adjust_char_head. Could you elaborate?

headius · 2023-03-13T14:36:26Z

I'd say that CRuby is not doing it right

... it will move into the "middle" of the surrogate pair instead of the beginning of the full codepoint.

Ok that makes sense. I think we should bring this up with CRuby folks (https://bugs.ruby-lang.org) and probably eventually (after they accept it as a bug) move the test into spec/ruby so it can be upstreamed to the other implementations.

I don't see any reason not to merge this now. 👍

headius · 2023-03-13T14:36:48Z

Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!

enebo · 2023-03-13T14:37:36Z

@djoooooe Yeah, I just got that from someone as a comment. I think it was mentioned only because it was in the same method you had modified, but in reading this myself I realize the logic is actually there, so it is not even an issue in matching up with the C codebase. It just uses a true constant to look more like the C. So no changes for that.

djoooooe · 2023-03-13T15:54:43Z

Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!

Done:
ruby/ruby#7510

enebo · 2023-03-16T13:45:46Z

@djoooooe whoops, we should have done that on Monday :) Thanks for your contribution.

djoooooe · 2023-03-17T12:29:40Z

@enebo You're welcome :) Thanks for the quick responses!

headius · 2023-03-17T17:48:10Z

We can spin a release of this any time. If you need it in JRuby, that might take a little longer.

Fix CESU8Encoding.leftAdjustCharHead

bf2ce06

enebo reviewed Mar 13, 2023

enebo merged commit 14b093e into jruby:master Mar 16, 2023

eregon mentioned this pull request Nov 6, 2024

New release #63

Closed

headius added this to the 1.0.59 milestone Jan 15, 2025

headius added a commit to headius/jruby that referenced this pull request Jan 15, 2025

Update to jcodings 1.0.59

f0ddc9a

Issues fixed:
* jruby/jcodings#61
* jruby/jcodings#62
* jruby/jcodings#60

headius mentioned this pull request Jan 15, 2025

Update to jcodings 1.0.59 jruby/jruby#8571

Merged
