-
Notifications
You must be signed in to change notification settings - Fork 30
Fix CESU8Encoding.leftAdjustCharHead #61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Nice, thank you! Will review and merge. |
|
It looks ok to me but the equivalent code in CRuby for left_adjust_char_head does not have the extra code you added: Does this mean CRuby is not doing this right? Or else why do we need this code? |
|
I'd say that CRuby is not doing it right: |
|
The new unit test I included in this pull request demonstrates this: There is already a unit test that verifies that the codepoint length of the surrogate pair |
| } | ||
|
|
||
| private static int utf8Decode3ByteSequence(byte[] bytes, int p, int c) { | ||
| return ((c & 0xF) << 12) | ((bytes[p + 1] & 0xff & 0x3f) << 6) | (bytes[p + 2] & 0xff & 0x3f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was pointed out to me that 0x3f is not needed here as 0xff contains same bit pattern (sans 1 bit). I am sure the compiler will figure this out so unless there is some cognitive benefit to seeing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0xff is not needed, but 0x3f is. I suppose 0xff was introduced during conversion from C to Java to account for signedness. I just kept the & 0xff because it seems to be used everywhere in the JCodings implementation. Should I remove it in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@djoooooe Ah yeah I said that backwards. It probably just comes down to whether this is worth leaving in to make it simpler to compare. I don't think there is harm in leaving it. The only argument for removing it would be Java people studying wondering why we have an extra bit AND we don't need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this probably isn't too important because all accesses of unsigned byte values should move to the Java 8 Byte.toUnsignedInt anyway. If that means we do the unsigned conversion and then duplicate some work with an additional mask, I think it's still better than making other contributors guess at why one particular byte access doesn't have the unsigned masking.
enebo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment I got was:
there os USE_INVALID_CODE_SCHEME as truee
wrt:
#ifdef USE_INVALID_CODE_SCHEME
if (*p > 0xfd) {
return ((*p == 0xfe) ? INVALID_CODE_FE : INVALID_CODE_FF);
}
I'm sorry, i don't know how this interacts with the implementation of |
Ok that makes sense. I think we should bring this up with CRuby folks (https://bugs.ruby-lang.org) and probably eventually (after they accept it as a bug) move the test into spec/ruby so it can be upstreamed to the other implementations. I don't see any reason not to merge this now. 👍 |
|
Oh, for bonus points you could make a PR for CRuby that fixes it in the same way! |
|
@djoooooe Yeah I just got that from someone as a comment. I think it was mentioned only because it was in the same method you had modified but in reading this myself I realize the logic is actually there and it is not even an issue in matching up to C codebase. It just uses true constant to look more closely like the C. So no changes for that. |
Done: |
|
@djoooooe whoops we should have did that on Monday :) Thanks for your contribution. |
|
@enebo You're welcome :) Thanks for the quick responses! |
|
We can spin a release of this any time. If you need it in JRuby, that might take a little longer. |
Issues fixed: * jruby/jcodings#61 * jruby/jcodings#62 * jruby/jcodings#60
CESU8Encoding.leftAdjustCharHeadcurrently does not properly handle 6-byte sequences. This pull request aims to fix that.