@djoooooe
Contributor

CESU8Encoding.leftAdjustCharHead currently does not properly handle 6-byte sequences. This pull request aims to fix that.

@headius
Member

headius commented Mar 11, 2023

Nice, thank you! Will review and merge.

@headius
Member

headius commented Mar 11, 2023

It looks ok to me but the equivalent code in CRuby for left_adjust_char_head does not have the extra code you added:

static UChar*
left_adjust_char_head(const UChar* start, const UChar* s, const UChar* end, OnigEncoding enc ARG_UNUSED)
{
  const UChar *p;

  if (s <= start) return (UChar* )s;
  p = s;

  while (!utf8_islead(*p) && p > start) p--;
  return (UChar* )p;
}

Does this mean CRuby is not doing this right? Or else why do we need this code?

@djoooooe
Contributor Author

I'd say that CRuby is not doing it right:
As far as I understand the JCodings API, left_adjust_char_head should move the cursor left to the head of the entire codepoint.
CESU-8 encodes codepoints greater than 0xffff as a UTF-16 surrogate pair in which each surrogate value is itself encoded as a 3-byte UTF-8 sequence, so the pair looks like two individual UTF-8 codepoints. When left_adjust_char_head only moves to the previous UTF-8 codepoint head, it can land in the "middle" of the surrogate pair instead of at the beginning of the full 6-byte codepoint.
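
To make the argument above concrete, here is a minimal sketch of a left adjustment that is aware of surrogate pairs. It is not the code from this pull request: the class and helper names (Cesu8LeftAdjustSketch, isLead, is3ByteLead, decode3) are hypothetical, and it assumes well-formed input.

```java
final class Cesu8LeftAdjustSketch {
    // True for any byte that is not a UTF-8 continuation byte (10xxxxxx).
    static boolean isLead(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // True for the lead byte of a 3-byte UTF-8 sequence (1110xxxx).
    static boolean is3ByteLead(byte b) {
        return (b & 0xF0) == 0xE0;
    }

    // Decodes the 3-byte sequence starting at p (assumes it is well formed).
    static int decode3(byte[] bytes, int p) {
        return ((bytes[p] & 0x0F) << 12)
             | ((bytes[p + 1] & 0x3F) << 6)
             | (bytes[p + 2] & 0x3F);
    }

    // Moves s left to the head of the character that contains it.
    static int leftAdjustCharHead(byte[] bytes, int start, int s, int end) {
        if (s <= start) return s;
        int p = s;
        // Same loop as the C version: back up to the nearest UTF-8 lead byte.
        while (!isLead(bytes[p]) && p > start) p--;
        // Extra step: if p is the head of a low surrogate that directly
        // follows a high surrogate, the real character head is 3 bytes earlier.
        if (p - 3 >= start && p + 2 < end
                && is3ByteLead(bytes[p]) && is3ByteLead(bytes[p - 3])) {
            int low = decode3(bytes, p);
            int high = decode3(bytes, p - 3);
            if (Character.isLowSurrogate((char) low)
                    && Character.isHighSurrogate((char) high)) {
                return p - 3;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // U+10000 in CESU-8: ED A0 80 (high surrogate) ED B0 80 (low surrogate).
        byte[] pair = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                       (byte) 0xED, (byte) 0xB0, (byte) 0x80};
        System.out.println(leftAdjustCharHead(pair, 0, 5, pair.length));
    }
}
```

Without the extra step, the loop alone would stop at the lead byte of the low surrogate, which is the behavior being discussed for the C version.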

@djoooooe
Contributor Author

The new unit test I included in this pull request demonstrates this: There is already a unit test that verifies that the codepoint length of the surrogate pair "\u00ed\u00a0\u0080\u00ed\u00b0\u0080" is 1. The new test also verifies that left_adjust_char_head moves past the entire sequence. Without the patch, left_adjust_char_head only moves to byte index 3.
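
As an illustration of why those six bytes count as one codepoint, the following sketch decodes the two 3-byte halves of the test string and combines them. The class and helper names are hypothetical. The use of ISO-8859-1 to turn the string's chars into raw bytes is an assumption made for this example, not necessarily how the unit test builds its input.

```java
import java.nio.charset.StandardCharsets;

final class Cesu8PairSketch {
    // Decodes the 3-byte UTF-8 sequence starting at p (assumes it is well formed).
    static int decode3(byte[] bytes, int p) {
        return ((bytes[p] & 0x0F) << 12)
             | ((bytes[p + 1] & 0x3F) << 6)
             | (bytes[p + 2] & 0x3F);
    }

    // Each char of the test string maps to one raw byte under ISO-8859-1.
    static byte[] testBytes() {
        return "\u00ed\u00a0\u0080\u00ed\u00b0\u0080".getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = testBytes();
        int high = decode3(bytes, 0); // first half, starts at byte index 0
        int low = decode3(bytes, 3);  // second half, starts at byte index 3
        int codePoint = Character.toCodePoint((char) high, (char) low);
        System.out.printf("high=%04X low=%04X codePoint=%X%n", high, low, codePoint);
    }
}
```

Byte index 3 is the head of the low surrogate, so an adjustment that stops there has not reached the start of the codepoint, which begins at byte index 0.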

}

private static int utf8Decode3ByteSequence(byte[] bytes, int p, int c) {
return ((c & 0xF) << 12) | ((bytes[p + 1] & 0xff & 0x3f) << 6) | (bytes[p + 2] & 0xff & 0x3f);
Member

It was pointed out to me that 0x3f is not needed here as 0xff contains the same bit pattern (sans 1 bit). I am sure the compiler will figure this out, so is there some cognitive benefit to seeing it?

Contributor Author

0xff is not needed, but 0x3f is. I suppose 0xff was introduced during conversion from C to Java to account for signedness. I just kept the & 0xff because it seems to be used everywhere in the JCodings implementation. Should I remove it in this PR?

Member

@djoooooe Ah yeah, I said that backwards. It probably just comes down to whether this is worth leaving in to make it simpler to compare with the C code. I don't think there is harm in leaving it. The only argument for removing it would be Java developers studying the code and wondering why we have an extra bitwise AND that we don't need.

Member

FWIW this probably isn't too important because all accesses of unsigned byte values should move to the Java 8 Byte.toUnsignedInt anyway. If that means we do the unsigned conversion and then duplicate some work with an additional mask, I think it's still better than making other contributors guess at why one particular byte access doesn't have the unsigned masking.
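
The masking question in this thread can be checked directly. The sketch below (hypothetical class and method names) compares the existing double mask, the single 0x3f mask, and the Byte.toUnsignedInt variant mentioned above. It also shows that 0xff alone is not a substitute for 0x3f.

```java
final class ByteMaskSketch {
    // The form used in the patch: unsigned conversion, then keep the low 6 bits.
    static int maskBoth(byte b) {
        return b & 0xff & 0x3f;
    }

    // The 0x3f mask alone already discards the sign-extended high bits.
    static int maskLowSix(byte b) {
        return b & 0x3f;
    }

    // The Java 8 spelling of the unsigned conversion, followed by the same mask.
    static int maskViaUnsigned(byte b) {
        return Byte.toUnsignedInt(b) & 0x3f;
    }

    // 0xff alone keeps all 8 bits, including the 10xxxxxx continuation marker.
    static int maskOnlyFF(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        boolean allEqual = true;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            allEqual &= maskBoth(b) == maskLowSix(b) && maskLowSix(b) == maskViaUnsigned(b);
        }
        System.out.println("three 6-bit forms agree for every byte: " + allEqual);
        System.out.println("0xff only, on continuation byte 0x80: " + maskOnlyFF((byte) 0x80));
    }
}
```

So the choice between these forms is about readability and consistency with the rest of the codebase, not about the computed value.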

Member

@enebo enebo left a comment

Another comment I got was:

there is USE_INVALID_CODE_SCHEME as true
wrt:
#ifdef USE_INVALID_CODE_SCHEME
if (*p > 0xfd) {
return ((*p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF);
}

@djoooooe
Contributor Author

Another comment I got was:

there is USE_INVALID_CODE_SCHEME as true wrt: #ifdef USE_INVALID_CODE_SCHEME if (*p > 0xfd) { return ((*p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF); }

I'm sorry, I don't know how this interacts with the implementation of left_adjust_char_head. Could you elaborate?

@headius
Member

headius commented Mar 13, 2023

I'd say that CRuby is not doing it right

... it will move into the "middle" of the surrogate pair instead of the beginning of the full codepoint.

Ok that makes sense. I think we should bring this up with CRuby folks (https://bugs.ruby-lang.org) and probably eventually (after they accept it as a bug) move the test into spec/ruby so it can be upstreamed to the other implementations.

I don't see any reason not to merge this now. 👍

@headius
Member

headius commented Mar 13, 2023

Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!

@enebo
Member

enebo commented Mar 13, 2023

@djoooooe Yeah, I just got that from someone as a comment. I think it was mentioned only because it was in the same method you had modified, but in reading this myself I realize the logic is actually there and it is not even an issue in matching up with the C codebase. It just uses a constant true to more closely resemble the C code. So no changes for that.

@djoooooe
Contributor Author

djoooooe commented Mar 13, 2023

Oh, for bonus points you could make a PR for CRuby that fixes it in the same way!

Done:
ruby/ruby#7510

@enebo enebo merged commit 14b093e into jruby:master Mar 16, 2023
@enebo
Member

enebo commented Mar 16, 2023

@djoooooe whoops, we should have done that on Monday :) Thanks for your contribution.

@djoooooe
Contributor Author

@enebo You're welcome :) Thanks for the quick responses!

@headius
Member

headius commented Mar 17, 2023

We can spin a release of this any time. If you need it in JRuby, that might take a little longer.

@eregon eregon mentioned this pull request Nov 6, 2024
@headius headius added this to the 1.0.59 milestone Jan 15, 2025
headius added a commit to headius/jruby that referenced this pull request Jan 15, 2025