Json normalization performance #16313

yaauie · 2024-07-09T18:30:47Z

Release notes

Restores performance of JSON-encoding operation by deferring an encoding operation and reducing unnecessary memory copying

What does this PR do?

Eliminates wasteful computation when encoding JSON

Why is it important/What is the impact to the user?

The safety added by 8.14.1's unicode normalization came at the cost of requiring all deeply-nested strings to be encoded to utf-8 prior to being passed to jrjackson, but separate upstream fixes in jrjackson eliminated this need for most cases.

This PR restores some of the lost performance while keeping identical behaviour. We can chase down the remaining performance gap (which may require a behavior change) in a separate effort.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files (and/or docker env variables)~~
~~[ ] I have added tests that prove my fix is effective or that my feature works~~

How to test this PR locally

Export an environment variable containing a deeply-nested JSON structure:

export MESSAGE='{"one":{"two":{"three":{"four":{"five":{"six":{"seven":{"eight":{"nine":{"ten":{"eleven":{"twelve":{"thirteen":{"fourteen":{"fifteen":{}}}}}}}}}}}}}}}}'
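If a different nesting depth is wanted for the test, the payload can be generated rather than typed by hand. The following is a minimal sketch using only Ruby's standard `json` library; the `KEYS` constant and the `nested_message` helper are hypothetical names introduced for illustration and are not part of this PR.

```ruby
require "json"

# Hypothetical helper, not part of Logstash: builds a JSON document in which
# each key wraps the next one, ending in an empty object.
KEYS = %w[one two three four five six seven eight nine ten
          eleven twelve thirteen fourteen fifteen].freeze

def nested_message(keys = KEYS)
  # Start from the innermost empty object and wrap outwards.
  nested = keys.reverse.reduce({}) { |inner, key| { key => inner } }
  JSON.generate(nested)
end

# Print the payload when run as a script, e.g. to capture it in MESSAGE.
puts nested_message if $PROGRAM_NAME == __FILE__
```

The printed string can then be exported as `MESSAGE`, and passing a longer or shorter key list changes the depth.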

use that environment variable as input to a simple logstash pipeline with generator and json codec that invokes LogStash::Json.dump:

bin/logstash -e 'input { generator { threads => 4 message => "${MESSAGE}" codec => json } } filter { ruby { code => "::LogStash::Json.dump(event.to_hash)" } } output { sink { } }'

observe throughput:

watch --color '(curl --silent -XGET http://localhost:9600/_node/stats | jq --color-output "pick(.process, .flow)")'

On my M3 laptop, this translated to:

| scenario | valid utf-8 | preserved utf-8 | throughput | cpu |
| --- | --- | --- | --- | --- |
| no normalization (8.13.x) | ⚠️ | ✅ | 440k eps | 48% |
| pre-normalization (8.14.2) | ✅ | ✅ | 270k eps | 62% |
| jackson normalization (8.14.2) | ✅ | ⚠️ | 440k eps | 49% |
| filtered pre-normalization (POC) | ✅ | ✅ | 320k eps | 57% |
| reduced copy pre-normalization (this PR) | ✅ | ✅ | 360k eps | 55% |

valid utf-8:

✅ : output is always valid UTF-8, a pre-requisite for being valid JSON

⚠️ : when inputs are non-UTF-8 or invalid-UTF-8, output can be corrupt

preserved utf-8:

✅ : valid UTF-8 sequences in binary-flagged strings are preserved

⚠️ : each byte in a valid multibyte UTF-8 sequence in a binary-flagged string is replaced with the unicode replacement character (lossy)
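The difference between the two "preserved utf-8" outcomes can be illustrated in plain Ruby, outside of Logstash and JrJackson. This is only an analogy, assuming that transcoding a BINARY-flagged string behaves like the lossy case and that re-flagging a copy behaves like the preserving case; the module and method names below are hypothetical.

```ruby
# Hypothetical illustration in plain Ruby; it does not call JrJackson.
module PreservationSketch
  module_function

  # Lossy: transcoding from BINARY treats every high byte as undefined,
  # so each byte of a multibyte sequence becomes U+FFFD.
  def transcode(binary_string)
    binary_string.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end

  # Preserving: a copy that is merely re-flagged keeps its bytes, so valid
  # UTF-8 sequences survive intact.
  def reflag(binary_string)
    binary_string.dup.force_encoding(Encoding::UTF_8)
  end
end

binary = "caf\xC3\xA9".b # the UTF-8 bytes of "café", flagged as BINARY

lossy     = PreservationSketch.transcode(binary)
preserved = PreservationSketch.reflag(binary)

puts lossy.codepoints.last(2).inspect # two replacement characters
puts preserved.valid_encoding?        # the sequence is intact
```

Here the two bytes of `é` each turn into a replacement character when transcoded, while the re-flagged copy still compares equal to `"café"`.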

Since the underlying JrJackson now properly (and efficiently) encodes the UTF-8 transcode of whichever strings it is given, we no longer need to pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY because we have alternate behaviour to preserve valid UTF-8 sequences. By emitting a _copy_ of binary-flagged strings that have been re-flagged as UTF-8, we allow the downstream (efficient) encoding operation in jrjackson to produce equivalent behaviour at much lower cost.

yaauie · 2024-07-09T18:37:31Z

logstash-core/lib/logstash/json.rb

@@ -51,7 +51,12 @@ def jruby_dump(o, options = {})
    def normalize_encoding(data)
      case data
      when String
-        LogStash::UnicodeNormalizer.normalize_string_encoding(data)


we may be able to delete LogStash::UnicodeNormalizer, as I believe it is no longer referenced.

We don't need a normalized copy of the string; we simply need to emit a string that can be normalized by JrJackson. In most cases this is simply the string that we received (no copy necessary), but for strings that are flagged as BINARY, we need to return a copy that has been flagged as utf-8 so that the RubyString#asJavaString() that jrjackson uses will preserve any valid utf-8 sequences that it contains.
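The approach described in this comment could look roughly like the sketch below. It is not the code merged in this PR: the `EncodingSketch` module name is hypothetical, and the recursion into `Array` and `Hash` is an assumption about how nested values are walked.

```ruby
# Hypothetical sketch of the idea; not the implementation in json.rb.
module EncodingSketch
  module_function

  def normalize_encoding(data)
    case data
    when String
      if data.encoding == Encoding::BINARY
        # Only BINARY-flagged strings need a copy: re-flag it as UTF-8 so
        # that any valid UTF-8 sequences it contains can be preserved.
        data.dup.force_encoding(Encoding::UTF_8)
      else
        # Every other string is passed through untouched (no copy).
        data
      end
    when Array
      data.map { |item| normalize_encoding(item) }
    when Hash
      data.each_with_object({}) do |(key, value), acc|
        acc[normalize_encoding(key)] = normalize_encoding(value)
      end
    else
      data
    end
  end
end
```

The design point is that the non-binary branch returns the very same object, so the common case costs nothing beyond the encoding check, and the caller's binary string is never mutated because only its copy is re-flagged.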

We are free to remove LogStash::UnicodeNormalizer, as it is not referenced anywhere.

We developed the unicode handling when JrJackson was at v0.4.18. With your RubyUtils#writeRubyString changes in 0.4.20, you are right that we no longer need the expensive String#encode and scrub! operations.

LGTM if you please remove LogStash::UnicodeNormalizer.

elastic-sonarqube · 2024-07-09T20:44:02Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

elasticmachine · 2024-07-09T20:54:46Z

💚 Build Succeeded

Buildkite Build
Commit: 648a4d2

History

💛 Build #1301 was flaky 15d3371

mashhurs

LGTM!

yaauie · 2024-07-09T21:12:39Z

@logstashmachine backport 8.15

* licenses: allow elv2, standard abbreviation for Elastic License version 2
* json-dump: reduce unicode normalization cost

  Since the underlying JrJackson now properly (and efficiently) encodes the UTF-8 transcode of whichever strings it is given, we no longer need to pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY because we have alternate behaviour to preserve valid UTF-8 sequences. By emitting a _copy_ of binary-flagged strings that have been re-flagged as UTF-8, we allow the downstream (efficient) encoding operation in jrjackson to produce equivalent behaviour at much lower cost.
* cleanup: remove orphan unicode normalizer

(cherry picked from commit 66aeeee)

yaauie · 2024-07-09T21:14:31Z

@logstashmachine backport 8.14

Since the underlying JrJackson now properly (and efficiently) encodes the UTF-8 transcode of whichever strings it is given, we no longer need to pre-normalize to UTF-8 in ruby _except_ when the string is flagged as BINARY because we have alternate behaviour to preserve valid UTF-8 sequences. By emitting a _copy_ of binary-flagged strings that have been re-flagged as UTF-8, we allow the downstream (efficient) encoding operation in jrjackson to produce equivalent behaviour at much lower cost.

(cherry picked from commit 66aeeee)

Co-authored-by: Ry Biesemeyer

yaauie added 2 commits July 9, 2024 17:57

licenses: allow elv2, standard abbreviation for Elastic License version 2

f84de17

yaauie added the performance regression label Jul 9, 2024

yaauie requested a review from mashhurs July 9, 2024 18:30

yaauie commented Jul 9, 2024

jsvd added the status:needs-review label Jul 9, 2024

cleanup: remove orphan unicode normalizer

648a4d2

mashhurs approved these changes Jul 9, 2024

jsvd added status:approved and removed status:needs-review labels Jul 9, 2024

yaauie merged commit 66aeeee into elastic:main Jul 9, 2024
6 checks passed

yaauie deleted the json-normalization-performance branch July 9, 2024 21:12

github-actions bot mentioned this pull request Jul 9, 2024

Backport PR #16313 to 8.15: Json normalization performance #16314

Merged

2 tasks

github-actions bot added the v8.15.0 label Jul 9, 2024

github-actions bot mentioned this pull request Jul 9, 2024

Backport PR #16313 to 8.14: Json normalization performance #16315

Draft

2 tasks

github-actions bot added the v8.14.3 label Jul 9, 2024

jsvd removed the status:approved label Jul 9, 2024

kaisecheng mentioned this pull request Sep 11, 2024

EPS degrade between 8.14 and 8.15 on memory queue with 16 workers #16414

Closed

jsvd removed the v8.14.3 label Nov 14, 2024

jsvd added the int-shortlist label Nov 26, 2024
