Json normalization performance #16313
Since the underlying JrJackson now properly (and efficiently) encodes the UTF-8 transcode of whichever strings it is given, we no longer need to pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY because we have alternate behaviour to preserve valid UTF-8 sequences. By emitting a _copy_ of binary-flagged strings that have been re-flagged as UTF-8, we allow the downstream (efficient) encoding operation in jrjackson to produce equivalent behaviour at much lower cost.
@@ -51,7 +51,12 @@ def jruby_dump(o, options = {})
  def normalize_encoding(data)
    case data
    when String
      LogStash::UnicodeNormalizer.normalize_string_encoding(data)
We may be able to delete `LogStash::UnicodeNormalizer`, as I believe it is no longer referenced.
We don't need a normalized copy of the string; we simply need to emit a string that can be normalized by JrJackson. In most cases this is simply the string that we received (no copy necessary), but for strings that are flagged as `BINARY`, we need to return a copy that has been flagged as UTF-8 so that the `RubyString#asJavaString()` that jrjackson uses will preserve any valid UTF-8 sequences that it contains.
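The re-flagging described above can be sketched in plain Ruby. This is a minimal illustration, not the code merged in this PR: the helper name `flag_binary_as_utf8` is hypothetical, and the recursion into arrays and hashes is an assumption about how nested values would be walked.

```ruby
# Hypothetical helper name; a sketch of the re-flagging idea only.
def flag_binary_as_utf8(data)
  case data
  when String
    if data.encoding == Encoding::BINARY
      # Copy the string, then relabel its bytes as UTF-8 without transcoding them.
      data.dup.force_encoding(Encoding::UTF_8)
    else
      data # returned as-is, no copy
    end
  when Array
    data.map { |item| flag_binary_as_utf8(item) }
  when Hash
    # Keys may be binary-flagged too, so both keys and values are walked.
    data.each_with_object({}) do |(key, value), out|
      out[flag_binary_as_utf8(key)] = flag_binary_as_utf8(value)
    end
  else
    data
  end
end

binary = "caf\u00E9".b            # valid UTF-8 bytes, flagged as BINARY
utf8 = flag_binary_as_utf8(binary)
puts utf8.encoding                # UTF-8
puts utf8.valid_encoding?         # true
```

Because `force_encoding` only changes the label on the copy, the original binary-flagged string is left untouched and no bytes are rewritten; any invalid sequences would still be left for the downstream encoder to handle.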
- We are free to remove `LogStash::UnicodeNormalizer`, as it is not referenced anywhere.
- We developed the unicode handling when JrJackson was at v0.4.18; with your `RubyUtils#writeRubyString` changes in 0.4.20, you are right that we no longer need the expensive `String#encode` and `scrub!` operations.

LGTM if you remove `LogStash::UnicodeNormalizer`.
LGTM!
@logstashmachine backport 8.15
* licenses: allow elv2, standard abbreviation for Elastic License version 2
* json-dump: reduce unicode normalization cost

  Since the underlying JrJackson now properly (and efficiently) encodes the UTF-8 transcode of whichever strings it is given, we no longer need to pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY because we have alternate behaviour to preserve valid UTF-8 sequences. By emitting a _copy_ of binary-flagged strings that have been re-flagged as UTF-8, we allow the downstream (efficient) encoding operation in jrjackson to produce equivalent behaviour at much lower cost.
* cleanup: remove orphan unicode normalizer

(cherry picked from commit 66aeeee)
@logstashmachine backport 8.14
Since the underlying JrJackson now properly (and efficiently) encodes the UTF-8 transcode of whichever strings it is given, we no longer need to pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY because we have alternate behaviour to preserve valid UTF-8 sequences. By emitting a _copy_ of binary-flagged strings that have been re-flagged as UTF-8, we allow the downstream (efficient) encoding operation in jrjackson to produce equivalent behaviour at much lower cost. (cherry picked from commit 66aeeee) Co-authored-by: Ry Biesemeyer <[email protected]>
Release notes
What does this PR do?
Why is it important/What is the impact to the user?
The safety added by 8.14.1's unicode normalization came at the cost of requiring all deeply-nested strings to be encoded to utf-8 prior to being passed to jrjackson, but separate upstream fixes in jrjackson eliminated this need for most cases.
This PR restores some of the lost performance while keeping identical behaviour. We can chase down the remaining performance gap (which may require a behavior change) in a separate effort.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files (and/or docker env variables)
[ ] I have added tests that prove my fix is effective or that my feature works
How to test this PR locally
- Export an environment variable containing a deeply-nested JSON structure.
- Use that environment variable as input to a simple Logstash pipeline with a generator input and json codec that invokes `LogStash::Json.dump`.
- Observe throughput.
On my M3 laptop, this translated to: