Json normalization performance #16313

Merged
merged 3 commits into from
Jul 9, 2024

Conversation

yaauie
Member

@yaauie yaauie commented Jul 9, 2024

Release notes

  • Restores performance of JSON-encoding operation by deferring an encoding operation and reducing unnecessary memory copying

What does this PR do?

  • Eliminates wasteful computation when encoding JSON

Why is it important/What is the impact to the user?

The safety added by 8.14.1's unicode normalization came at the cost of requiring all deeply-nested strings to be encoded to utf-8 prior to being passed to jrjackson, but separate upstream fixes in jrjackson eliminated this need for most cases.

This PR restores some of the lost performance while keeping identical behaviour. We can chase down the remaining performance gap (which may require a behavior change) in a separate effort.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files (and/or docker env variables)
  • [ ] I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

  1. export an environment variable containing a deeply-nested JSON structure

    export MESSAGE='{"one":{"two":{"three":{"four":{"five":{"six":{"seven":{"eight":{"nine":{"ten":{"eleven":{"twelve":{"thirteen":{"fourteen":{"fifteen":{}}}}}}}}}}}}}}}}'
    
  2. use that environment variable as input to a simple logstash pipeline with generator and json codec that invokes LogStash::Json.dump:

    bin/logstash -e 'input { generator { threads => 4 message => "${MESSAGE}" codec => json } } filter { ruby { code => "::LogStash::Json.dump(event.to_hash)" } } output { sink { } }'
    
  3. observe throughput:

    watch --color '(curl --silent -XGET http://localhost:9600/_node/stats | jq --color-output "pick(.process, .flow)")'
    

    On my M3 laptop, this translated to:

    scenario valid utf-8 preserved utf-8 throughput cpu
    no normalization (8.13.x) ⚠️ 440k eps 48%
    pre-normalization (8.14.2) 270k eps 62%
    jackson normalization (8.14.2) ⚠️ 440k eps 49%
    filtered pre-normalization (POC) 320k eps 57%
    reduced copy pre-normalization (this PR) 360k eps 55%
    • valid utf-8:
      • ✅ : output is always valid UTF-8, a pre-requisite for being valid JSON
      • ⚠️ : when inputs are non-UTF-8 or invalid-UTF-8, output can be corrupt
    • preserved utf-8:
      • ✅ : valid UTF-8 sequences in binary-flagged strings are preserved
      • ⚠️ : each byte of a valid multibyte UTF-8 sequence in a binary-flagged string is replaced with the unicode replacement character (lossy)
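
The difference between the "preserved utf-8" outcomes in the legend above can be reproduced in plain Ruby, without Logstash or JrJackson. This is a minimal sketch using only core String methods; the variable names are illustrative and it does not claim to mirror the internals of either library.

```ruby
# Illustration only: plain Ruby, no Logstash or JrJackson involved.

# A binary-flagged string that happens to hold valid UTF-8 bytes.
binary = "caf\u00e9".b # encoding: ASCII-8BIT (BINARY)

# Transcoding BINARY -> UTF-8 treats every high byte as unmappable,
# so each byte of the two-byte sequence is replaced with U+FFFD (lossy).
lossy = binary.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

# Re-flagging a copy as UTF-8 leaves the bytes alone, so the valid
# multibyte sequence survives.
preserved = binary.dup.force_encoding(Encoding::UTF_8)

# Re-flagging does not repair invalid bytes; those still need to be
# replaced later (here with String#scrub).
broken = "\xFF".b.force_encoding(Encoding::UTF_8)

puts lossy.inspect
puts preserved.inspect
puts broken.valid_encoding?
puts broken.scrub.inspect
```

The original `binary` string is left untouched in both cases, which matters when the same object is still referenced elsewhere in an event.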

yaauie added 2 commits July 9, 2024 17:57
Since the underlying JrJackson now properly (and efficiently) encodes the
UTF-8 transcode of whichever strings it is given, we no longer need to
pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY
because we have alternate behaviour to preserve valid UTF-8 sequences.

By emitting a _copy_ of binary-flagged strings that have been re-flagged as
UTF-8, we allow the downstream (efficient) encoding operation in jrjackson
to produce equivalent behaviour at much lower cost.
    @@ -51,7 +51,12 @@ def jruby_dump(o, options = {})
          def normalize_encoding(data)
            case data
            when String
              LogStash::UnicodeNormalizer.normalize_string_encoding(data)
Member Author

we may be able to delete LogStash::UnicodeNormalizer, as I believe it is no longer referenced.

We don't need a normalized copy of the string; we simply need to emit a string that can be normalized by JrJackson. In most cases this is simply the string that we received (no copy necessary), but for strings that are flagged as BINARY, we need to return a copy that has been flagged as utf-8 so that the RubyString#asJavaString() that jrjackson uses will preserve any valid utf-8 sequences that it contains.
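
The approach described in this comment might look roughly like the following. This is a sketch under stated assumptions, not the code merged in this PR: the handling of Array and Hash (including normalizing hash keys) is assumed for illustration, and the downstream encoder is assumed to transcode and replace invalid sequences on its own.

```ruby
# Sketch only; not the code merged in this PR.
def normalize_encoding(data)
  case data
  when String
    if data.encoding == Encoding::BINARY
      # Copy, then re-flag: the bytes are untouched, only the encoding tag changes.
      data.dup.force_encoding(Encoding::UTF_8)
    else
      # No copy: the downstream encoder is trusted to handle this string as-is.
      data
    end
  when Array
    data.map { |item| normalize_encoding(item) }
  when Hash
    # Assumption: keys are normalized the same way as values.
    data.each_with_object({}) do |(key, value), out|
      out[normalize_encoding(key)] = normalize_encoding(value)
    end
  else
    data
  end
end

event = { "message" => "caf\u00e9".b, "tags" => ["ok"] }
normalized = normalize_encoding(event)

puts normalized["message"].encoding
puts normalized["tags"].first.equal?(event["tags"].first)
```

Only the binary-flagged string is duplicated; every other string is passed through as the same object, which is where the reduction in copying comes from.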

Contributor

  • we are free to remove LogStash::UnicodeNormalizer as it is not referenced anywhere
  • we developed unicode handling when JrJackson was v0.4.18; with your RubyUtils#writeRubyString changes in 0.4.20, you are right that we no longer need the expensive String#encode and scrub! operations

LGTM if you please remove LogStash::UnicodeNormalizer.
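
To get a rough feel for why the encode-and-scrub path is costly, the two strategies can be compared with Ruby's stdlib Benchmark. This is a hedged sketch: `pre_normalize` and `reflag_if_binary` are hypothetical stand-ins written for this comparison, not the Logstash implementations, and absolute timings will differ between MRI and JRuby.

```ruby
require 'benchmark'

# Hypothetical stand-in for the older approach: always transcode and scrub,
# which allocates new strings even when the input is already clean UTF-8.
def pre_normalize(str)
  str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace).scrub
end

# Hypothetical stand-in for the cheaper approach: copy only binary-flagged strings.
def reflag_if_binary(str)
  str.encoding == Encoding::BINARY ? str.dup.force_encoding(Encoding::UTF_8) : str
end

sample = ("x" * 64).freeze
iterations = 200_000

Benchmark.bm(18) do |bm|
  bm.report("encode + scrub") { iterations.times { pre_normalize(sample) } }
  bm.report("reflag if binary") { iterations.times { reflag_if_binary(sample) } }
end
```

For input that is already UTF-8, the second strategy returns the same object and allocates nothing, so any gap in the timings mostly reflects the allocations avoided.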

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@elasticmachine
Collaborator

💚 Build Succeeded

History

Contributor

@mashhurs mashhurs left a comment

LGTM!

@yaauie yaauie merged commit 66aeeee into elastic:main Jul 9, 2024
6 checks passed
@yaauie yaauie deleted the json-normalization-performance branch July 9, 2024 21:12
@yaauie
Member Author

yaauie commented Jul 9, 2024

@logstashmachine backport 8.15

github-actions bot pushed a commit that referenced this pull request Jul 9, 2024
* licenses: allow elv2, standard abbreviation for Elastic License version 2

* json-dump: reduce unicode normalization cost

Since the underlying JrJackson now properly (and efficiently) encodes the
UTF-8 transcode of whichever strings it is given, we no longer need to
pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY
because we have alternate behaviour to preserve valid UTF-8 sequences.

By emitting a _copy_ of binary-flagged strings that have been re-flagged as
UTF-8, we allow the downstream (efficient) encoding operation in jrjackson
to produce equivalent behaviour at much lower cost.

* cleanup: remove orphan unicode normalizer

(cherry picked from commit 66aeeee)
@yaauie
Member Author

yaauie commented Jul 9, 2024

@logstashmachine backport 8.14

github-actions bot pushed a commit that referenced this pull request Jul 9, 2024
(cherry picked from commit 66aeeee)
jsvd pushed a commit that referenced this pull request Aug 8, 2024
(cherry picked from commit 66aeeee)

Co-authored-by: Ry Biesemeyer <[email protected]>
@jsvd jsvd removed the v8.14.3 label Nov 14, 2024