Why You Should Not Parse Localized Strings
As a programmer, it is natural to think of data as a series of one-to-one relationships between string identifiers and objects in a computer. Many standard libraries in programming languages allow for round-tripping between strings and objects, including JSON, ISO-8601 date strings, and URL encode/decode. So, if you can create a localized string, such as with .toLocaleString() in ECMA-402, why can’t you parse that string back to an object using a function like .fromLocaleString()?
The answer comes down to the fact that i18n is fundamentally different from data interchange. This document lays out why you should not parse localized strings and the best practices for what to do instead.
Parsing Is Not a Science
What is the number represented by the localized string "1,004"? How about the date represented by the localized string "08/07/06"?
If you grew up in the United States, the answer is “one thousand and four” for the number and “August 7, 2006” for the date. However, in different regions and different languages, the answers differ!
- Locale | "1,004" | "08/07/06"
- en-US (US English) | one thousand and four | August 7, 2006
- en-ZA (South African English) | one and four thousandths | July 6, 2008
- es-419 (Latin American Spanish) | one thousand and four | July 8, 2006
- es-ES (Castilian Spanish) | one and four thousandths | July 8, 2006
- zh (Chinese) | one thousand and four | July 6, 2008
- zh-HK (Chinese in Hong Kong) | one thousand and four | July 8, 2006
Table compiled from the Common Locale Data Repository (CLDR): number symbols, date patterns
This table only scratches the surface of the complexity of parsing localized strings.
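To make the ambiguity concrete, here is a minimal sketch that decodes the same two strings under different conventions. The helpers `interpretNumber` and `interpretDate` are hypothetical names invented for this illustration, and the caller has to state the convention explicitly, because nothing in the string itself reveals it. The sketch also assumes every two-digit year falls in 2000-2099, which is a guess in its own right.

```javascript
// Hypothetical helpers: the convention is an input; nothing here detects it.

// Interpret a numeric string given which character is the decimal separator.
function interpretNumber(text, decimalSeparator) {
  const groupSeparator = decimalSeparator === "," ? "." : ",";
  const normalized = text
    .split(groupSeparator).join("")   // drop grouping separators
    .replace(decimalSeparator, ".");  // normalize the decimal separator
  return Number(normalized);
}

// Interpret "NN/NN/NN" given the field order, for example "MDY".
// Assumes years 2000-2099.
function interpretDate(text, order) {
  const parts = text.split("/").map(Number);
  const fields = {};
  [...order].forEach((letter, i) => { fields[letter] = parts[i]; });
  const pad = (n) => String(n).padStart(2, "0");
  return `${2000 + fields.Y}-${pad(fields.M)}-${pad(fields.D)}`;
}

const asThousand = interpretNumber("1,004", ".");    // 1004
const asFraction = interpretNumber("1,004", ",");    // 1.004
const monthFirst = interpretDate("08/07/06", "MDY"); // "2006-08-07"
const dayFirst = interpretDate("08/07/06", "DMY");   // "2006-07-08"
const yearFirst = interpretDate("08/07/06", "YMD");  // "2008-07-06"
```

Every one of these results is a reasonable reading of its input, which is exactly the problem.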
User Language !== Device Language
As smart devices expand around the world, we are observing that users set their device locale to something other than their first language. For example, a Hindi speaker in India may have their device set to US English, even if they are more fluent in Hindi than in English. (My colleague Luke Swartz gave an excellent talk on this subject at the 41st Internationalization and Unicode Conference.)
As a result, it is not safe to simply use the device’s locale to disambiguate user input. For example, a user from Chile may have typed “08/07/05” with the intent to mean July 8, 2005, but if they had their phone set to US English, we would interpret it as August 7, 2005.
There Is No One-to-One Mapping
When producing formatted output (such as with .toLocaleString()), the library picks a single sensible output format to use. However, in reality, there are many ways to express the same thing.
All of the following strings refer to the same date:
- October 21, 2020
- October 21, 2 Reiwa
- 3 Heshvan 5781
- 3 Cheshvan 5781
- 3 Marcheshvan 5781
- 10/21/20
- 21/10/20
- 2020/10/21
- 21.10.2020
- 3.2.5781
This list doesn’t include typos, punctuation differences, or the many other formats and calendar systems in use.
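You can see the one-to-many relationship from the formatting side with a few lines of ECMA-402. The sketch below formats a single date with several locale tags, calendars, and styles. The particular combinations are picked arbitrarily for illustration, and the exact output depends on the locale data bundled with your runtime, so no expected strings are shown.

```javascript
// One instant, pinned to UTC so the host time zone cannot shift the day.
const date = new Date(Date.UTC(2020, 9, 21)); // months are zero-based

// Each entry is [BCP 47 locale tag, Intl.DateTimeFormat options].
const variants = [
  ["en-US", { dateStyle: "long" }],
  ["en-US", { dateStyle: "short" }],
  ["en-GB", { dateStyle: "short" }],
  ["de-DE", { dateStyle: "medium" }],
  ["ja-JP-u-ca-japanese", { dateStyle: "long" }],
  ["en-US-u-ca-hebrew", { dateStyle: "long" }],
];

const renderings = variants.map(([locale, options]) =>
  new Intl.DateTimeFormat(locale, { ...options, timeZone: "UTC" }).format(date)
);

// Print them to see what your environment produces.
for (const text of renderings) console.log(text);
```

A parser would have to map every one of these strings, and many more, back to the same date.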
Locale Data Changes Over Time
Most of the time, there is no official format for numbers and dates in a particular locale. CLDR attempts to reflect common usage based on newspapers, government documents, and user feedback. However, common usage can and does change over time! This means that if "08/07/06" parsed as July 8, 2006 one day, it could parse as July 6, 2008 the next day, after you update your locale data (such as when you update Chrome).
When formatting strings, such as in ECMA-402’s .toLocaleString(), this change in behavior is okay, because you should only be displaying the string to the user and nothing more. However, when parsing strings, you are probably going to be taking the result and passing it into your business logic, which disseminates the uncertainty into many more places.
Computational Complexity
Let’s suppose that we know the user’s actual locale, and that we have the data to parse different variations of localized strings. Even then, we still have the problem of computational complexity.
When strings are passed around as an interchange format, they may have to be serialized and deserialized multiple times in multiple different environments. If every step along the way requires a complex locale-sensitive parsing algorithm, this can significantly increase latency. It is much more efficient to use an industry-standard interchange format (discussed below).
How to Avoid the Need to Parse
When programmers want a parseLocaleString, we find that they usually fall into one of two buckets.
- Parsing User Input: Here, the string originated from a human user, and the programmer wants to use locale information to parse it into a JavaScript object like Date or Number.
- Data Interchange: Here, the string originated from a computer algorithm, such as .toLocaleString(), and the programmer wants to round-trip the string back into a proper JavaScript object.
In both of these situations, the problem of parsing can usually be avoided entirely.
Parsing User Input
Rather than attempting to parse a free-form string, the best practice is to use a proper input mechanism to accept user input directly into a machine-readable format. For example, in HTML, you can use <input type="number">. iOS, Android, and desktop OSes have similar built-in UI widgets you can use. Using built-in widgets means that users have a consistent experience across the platform and don’t need to learn the nuances of input in each app.
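What makes the HTML widgets useful here is that the browser handles the localized display while your code sees a locale-independent value: the `.value` of `<input type="date">` is a `YYYY-MM-DD` string, and the `.value` of `<input type="number">` uses `.` as the decimal separator. The sketch below reads those values. The function names and the `#due-date` element id are made up for this example, and the functions take the value string as an argument so they can be exercised outside a browser.

```javascript
// <input type="date"> exposes its value as "YYYY-MM-DD" (or "" when empty),
// whatever format the browser shows to the user.
function readDateInputValue(value) {
  const match = /^(\d{4,})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return null; // empty, or not a date-input value
  const [, year, month, day] = match.map(Number);
  return { year, month, day };
}

// <input type="number"> exposes its value with "." as the decimal separator
// (or "" when the field is empty or holds something that is not a number).
function readNumberInputValue(value) {
  if (value === "") return null;
  const n = Number(value);
  return Number.isFinite(n) ? n : null;
}

// In a page this might be called as:
//   readDateInputValue(document.querySelector("#due-date").value)
// where "#due-date" is a hypothetical element id.
const exampleDate = readDateInputValue("2020-10-21"); // { year: 2020, month: 10, day: 21 }
const exampleNumber = readNumberInputValue("1004.5"); // 1004.5
```

In a browser you may not need even this much, since both input types also offer a `valueAsNumber` property, and date inputs offer `valueAsDate`.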
If your platform’s built-in input mechanisms are insufficient, such as when you wish to have more control over the user experience, use an open-source, off-the-shelf, i18n-ready input component. These are available in Angular, React, Vue, Android, iOS, and most other industry-standard front-end frameworks. When evaluating your options, make sure the component’s documentation discusses internationalization. If you can’t find it mentioned, you probably shouldn’t use that component; just go to the next search result.
Not only are number and date pickers a better user experience, but they also allow the user to react when the thing they entered is wrong. For example, if the user intended to enter July 6, but the date picker is showing June 7, they can correct it, whereas if you do string parsing behind the scenes, the user has no opportunity to fix a mistake.
For apps with heavy data entry, such as spreadsheets and calendars, developers and designers often want to use date parsing to make the input easier. I argue that this is still fundamentally the same problem space: a properly internationalized free-form date parsing library may be used as an implementation detail of this journey, but only so long as the parse results are presented to the end user for validation.
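One way to keep the user in the loop is to treat the parse result as a set of candidates rather than an answer. The sketch below is a deliberately naive stand-in for a real internationalized parsing library: `candidateDates` and `describeCandidate` are hypothetical names, only three field orders are tried, and years are assumed to fall in 2000-2099. The point is the shape of the flow: enumerate the plausible readings, then show each one in an unambiguous long form so the user can confirm or correct it.

```javascript
// Hypothetical helper: list every valid reading of "NN/NN/NN" as an
// ISO 8601 date string so the UI can ask the user which one they meant.
function candidateDates(text) {
  const parts = text.split("/").map(Number);
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n))) return [];
  const [a, b, c] = parts;
  const readings = [
    { year: 2000 + c, month: a, day: b }, // month/day/year
    { year: 2000 + c, month: b, day: a }, // day/month/year
    { year: 2000 + a, month: b, day: c }, // year/month/day
  ];
  const result = [];
  for (const { year, month, day } of readings) {
    const d = new Date(Date.UTC(year, month - 1, day));
    // Reject readings that overflow, such as month 13 or February 30.
    const valid =
      d.getUTCFullYear() === year &&
      d.getUTCMonth() === month - 1 &&
      d.getUTCDate() === day;
    if (!valid) continue;
    const iso = d.toISOString().slice(0, 10);
    if (!result.includes(iso)) result.push(iso); // drop duplicate readings
  }
  return result;
}

// Render a candidate in a long, unambiguous form for the confirmation UI.
// The exact wording depends on the runtime's locale data.
function describeCandidate(iso, locale) {
  return new Intl.DateTimeFormat(locale, { dateStyle: "full", timeZone: "UTC" })
    .format(new Date(iso)); // date-only ISO strings are read as UTC
}

const ambiguous = candidateDates("08/07/06"); // three different dates
const narrower = candidateDates("25/12/20");  // month 25 is ruled out
const labels = ambiguous.map((iso) => describeCandidate(iso, "en-US"));
```

When more than one candidate survives, the application should ask rather than guess.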
If you still feel the urge to implement your own input component, think again, especially if you haven’t worked in the i18n field before and are unfamiliar with the unique problems that will arise.
Data Interchange
If a machine generated the string you are trying to parse, it means that the original machine-readable form must have also existed at some point in time. Localized strings are not an interchange format. Instead, you should use an industry-standard interchange format, such as ISO 8601 for dates and times (such as "2020-02-21"), or scientific decimal notation for numbers (such as "1.23E4").
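As a minimal sketch of what that looks like in JavaScript, the example below sends a date and a number through JSON using ISO 8601 and plain numeric values, and localizes only at the final display step. The `record` shape and its field names are invented for the illustration.

```javascript
// Machine-readable values are what get stored and transmitted.
const record = {
  createdAt: new Date(Date.UTC(2020, 1, 21)).toISOString(), // ISO 8601
  amount: 12300,                                            // a JSON number
};

const wire = JSON.stringify(record);
// wire === '{"createdAt":"2020-02-21T00:00:00.000Z","amount":12300}'

// The receiving side recovers the same values, with no locale involved.
const received = JSON.parse(wire);
const createdAt = new Date(received.createdAt);
const amount = received.amount;

// Scientific notation also round-trips through Number().
const fromScientific = Number("1.23E4"); // 12300

// Localize only at the very end, for display, and never parse the result.
const label =
  amount.toLocaleString("en-US") +
  " on " +
  createdAt.toLocaleDateString("en-US", { timeZone: "UTC" });
```

The localized `label` is a dead end by design: it goes to the screen, while `wire` is what travels between systems.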
If the localized strings are stored someplace like a database or spreadsheet, consider storing an interchange-valid format alongside each localized string. Once you have done that, reconsider whether you really need to store the localized string in your database at all.
Right Tool for the Job?
Parsing of localized number strings is not completely without valid use cases. For example, in data mining and natural language understanding, you may find yourself with large amounts of unprocessed text, and you need to parse numbers out of it. In this situation, you probably want a “best effort” parser, and accuracy comes not from each individual sample but from the collective dataset.
In situations like this, you should look at making a custom parser specific for your use case. What you come up with depends greatly on what assumptions you can make about your specific dataset. In addition, you may want to consider presenting the parse results to a human for verification.
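For illustration, here is a sketch of what such a purpose-built parser might look like for one hypothetical dataset. It assumes the text uses "," to group thousands and "." for decimals. That assumption is baked into the regular expression and would be wrong for other corpora. The function name `extractNumbers` and the sample sentence are invented for the example.

```javascript
// Best-effort extraction of numbers from running text.
// Assumption about this hypothetical dataset: "," groups thousands
// and "." marks decimals.
function extractNumbers(text) {
  // Either a grouped number like 12,500.75 or a plain one like 12.4 or 2019.
  const pattern = /\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?/g;
  return (text.match(pattern) || []).map((token) =>
    Number(token.replace(/,/g, ""))
  );
}

const sample =
  "Sales rose from 1,004 units to 12,500.75, about 12.4 times the 2019 level.";
const found = extractNumbers(sample);
// found is [1004, 12500.75, 12.4, 2019]
// Note that 2019 is a year, not a quantity: individual results can be
// wrong, which is why accuracy has to come from the dataset as a whole.
```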
Conclusion
I hope you found this blog post useful. If so, please consider following my blog and/or Twitter, @_sffc. Thanks!