Create configuration for specifying units conversions
Closed, ResolvedPublic

Description

To facilitate unit conversion, we need a configuration array whose rows contain the following:

  • Unit Q-ID
  • Primary unit Q-ID
  • Unit name in GNU Units
  • Multiplier from this unit to primary unit (from GNU Units)

Example:

Unit Q-ID   Primary unit Q-ID   GNU Units name   Multiplier
Q253276     Q11573              mile             1609.344
Q11573      Q11573              m                1

Primary units will have the same first two items and 1 as the multiplier.

The full configuration file should be generated by a script from a user-editable config that contains only the first three items.
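A minimal sketch of that generation step, in Python for illustration only (the names `UserRow`, `FullRow`, `build_full_config` and `stub_multiplier` are hypothetical, not taken from any patch on this task). It assumes the multiplier lookup is injected as a function; in practice that function would query GNU Units, while the stub here only knows the mile example above.

```python
from decimal import Decimal
from typing import Callable, Dict, List, NamedTuple, Tuple


class UserRow(NamedTuple):
    """One row of the user-editable config (first three items only)."""
    unit_qid: str
    primary_qid: str
    gnu_name: str


class FullRow(NamedTuple):
    """One row of the generated config, with the multiplier filled in."""
    unit_qid: str
    primary_qid: str
    gnu_name: str
    multiplier: Decimal


def build_full_config(
    rows: List[UserRow],
    get_multiplier: Callable[[str, str], Decimal],
) -> List[FullRow]:
    # GNU Units name of every primary unit, keyed by its Q-ID.
    primary_names: Dict[str, str] = {
        r.unit_qid: r.gnu_name for r in rows if r.unit_qid == r.primary_qid
    }
    full = []
    for r in rows:
        if r.unit_qid == r.primary_qid:
            mult = Decimal(1)  # primary units convert to themselves
        else:
            mult = get_multiplier(r.gnu_name, primary_names[r.primary_qid])
        full.append(FullRow(*r, mult))
    return full


# Stand-in for a GNU Units lookup (assumption: only the example row is known).
KNOWN: Dict[Tuple[str, str], Decimal] = {("mile", "m"): Decimal("1609.344")}


def stub_multiplier(from_name: str, to_name: str) -> Decimal:
    return KNOWN[(from_name, to_name)]


user_config = [
    UserRow("Q253276", "Q11573", "mile"),
    UserRow("Q11573", "Q11573", "m"),
]
full_config = build_full_config(user_config, stub_multiplier)
```

Keeping the lookup injectable means the same script could be pointed at GNU Units or at any other source of multipliers without changing the row format.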

Event Timeline

Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description. (Show Details)
Smalyshev added subscribers: aude, daniel, Aklapper, Smalyshev.

In addition to the multiplier, we need at least an optional offset (for °F to °C conversion) and perhaps also an exponent (for reciprocal conversions like miles/gallon to liter/100km). Both can come from GNU Units.

Perhaps the last column shouldn't be an exponent, but a function name, so we can use it to support more types of conversions in the future. For example, we could support "inv" (inverse) for reciprocal conversion, but also "log" or "exp" in the future, for logarithmic scales.
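One way to picture a row with a multiplier, an optional offset, and a function name is the sketch below. It is an illustration under assumptions, not the proposed format: the `Conversion` tuple, the `FUNCTIONS` table, and the order of operations (apply the named function, then multiply, then add the offset) are all hypothetical choices. The constants used are the standard definitions of °F, the international mile, and the US gallon.

```python
from fractions import Fraction
from typing import Callable, Dict, NamedTuple

# Named functions applied to the value before scaling (hypothetical set).
FUNCTIONS: Dict[str, Callable[[Fraction], Fraction]] = {
    "id": lambda x: x,        # plain linear conversion
    "inv": lambda x: 1 / x,   # reciprocal conversion
}


class Conversion(NamedTuple):
    multiplier: Fraction
    offset: Fraction = Fraction(0)
    function: str = "id"


def to_primary(value: Fraction, conv: Conversion) -> Fraction:
    """Compute function(value) * multiplier + offset with exact fractions."""
    return FUNCTIONS[conv.function](value) * conv.multiplier + conv.offset


# degrees Fahrenheit -> degrees Celsius: C = F * 5/9 - 160/9
F_TO_C = Conversion(multiplier=Fraction(5, 9), offset=Fraction(-160, 9))

# miles per US gallon -> litres per 100 km:
# 100 * (litres per gallon) / (km per mile) / mpg
MPG_TO_L100KM = Conversion(
    multiplier=100 * Fraction("3.785411784") / Fraction("1.609344"),
    function="inv",
)

boiling_c = to_primary(Fraction(212), F_TO_C)
thirty_mpg = to_primary(Fraction(30), MPG_TO_L100KM)
```

A function name rather than a bare exponent leaves room for "log" or "exp" entries later without changing the row layout.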

My current thinking is that we should abstract the specification of unit conversions on two levels:

  • the UnitConverter service interface can have different implementations using information from different sources.
  • the default implementation would be based on a static list (CDB or a PHP file returning an array), with a decorator for caching.
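The two levels above could look roughly like this. The sketch is in Python for illustration only; the class names `StaticListConverter` and `CachingConverter`, the method `to_primary`, and the table layout are hypothetical, and the real service would live in Wikibase rather than in a script like this.

```python
from decimal import Decimal
from typing import Dict, Optional, Protocol, Tuple

Converted = Tuple[str, Decimal]  # (primary unit Q-ID, converted amount)


class UnitConverter(Protocol):
    """Service interface; implementations may use different sources."""

    def to_primary(self, unit_qid: str, amount: Decimal) -> Optional[Converted]:
        ...


class StaticListConverter:
    """Default implementation backed by a static table."""

    def __init__(self, table: Dict[str, Tuple[str, Decimal]]) -> None:
        self._table = table

    def to_primary(self, unit_qid: str, amount: Decimal) -> Optional[Converted]:
        entry = self._table.get(unit_qid)
        if entry is None:
            return None  # unit has no known conversion
        primary_qid, multiplier = entry
        return primary_qid, amount * multiplier


class CachingConverter:
    """Decorator that memoizes results of any UnitConverter."""

    def __init__(self, inner: UnitConverter) -> None:
        self._inner = inner
        self._cache: Dict[Tuple[str, Decimal], Optional[Converted]] = {}

    def to_primary(self, unit_qid: str, amount: Decimal) -> Optional[Converted]:
        key = (unit_qid, amount)
        if key not in self._cache:
            self._cache[key] = self._inner.to_primary(unit_qid, amount)
        return self._cache[key]


table = {
    "Q253276": ("Q11573", Decimal("1609.344")),  # mile -> metre
    "Q11573": ("Q11573", Decimal("1")),          # metre is its own primary unit
}
converter = CachingConverter(StaticListConverter(table))
two_miles = converter.to_primary("Q253276", Decimal("2"))
```

Because callers only see the interface, a statement-backed or GNU Units-backed implementation could later be swapped in behind the same decorator.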

An implementation that relies directly on statements on properties would probably not scale well. It also suffers from the issue that when such a statement is changed, all statements using the respective unit would need to be re-indexed. (Also, P2442 and P2370 define conversion factors, so they only support linear conversion without offset. We will at least need to support offsets, and possibly exponents and even logarithms. This could perhaps be solved using qualifiers on the statements that define the conversion.)

My suggestion is to always go with the file based UnitConverter implementation. We still have freedom to decide how that file is created:

  • it could be maintained manually
  • it could be generated periodically, by a maintenance script based on P2442 and P2370
  • it could be generated using GnuUnits, based on a config file that lists all supported units and their base unit (as Q-id and the unit symbol used by GnuUnits).
    • This list of supported units could in turn be generated based on P2442 and P2370; this might however be confusing, since the statements would only be used to determine the base unit, and the conversion factor given in the statement would be ignored.

One more concern: to actually do the calculations, we may need arbitrary-precision arithmetic, and a fast one if we want to do it inside the dump. We could require bcmath/gmp to calculate normalized values. But some of those will be big, like https://www.wikidata.org/wiki/Q180892, and I'm not even sure RDF tools can deal with such numbers, or that displaying one in an infobox won't blow up the page.
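To illustrate the precision concern, here is a small sketch using Python's `decimal` module as a stand-in for bcmath/gmp. The function name `normalize`, the 100-digit working precision, and the sample amount are assumptions chosen for the example, not values from Wikidata.

```python
from decimal import Decimal, localcontext


def normalize(amount: str, multiplier: str, digits: int = 100) -> Decimal:
    """Multiply two decimal strings without falling back to binary floats."""
    with localcontext() as ctx:
        ctx.prec = digits  # working precision in significant digits (assumed)
        return Decimal(amount) * Decimal(multiplier)


# A made-up 30-digit quantity in miles, converted to metres.
big_amount = "123456789012345678901234567890"
exact = normalize(big_amount, "1609.344")

# The same product through floats keeps only about 15 to 17 significant digits.
approximate = float(big_amount) * 1609.344
```

The result stays exact as long as the working precision covers every digit of the product, which is the trade-off between speed and correctness raised above.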

Change 298407 had a related patch set uploaded (by Smalyshev):
Tools for creating unit conversion config

https://gerrit.wikimedia.org/r/298407

Hi, maybe I'm arriving after the battle, but in case you're not aware of it: Wikidata now has a "conversion to SI unit" property that could be valuable here, since it stores the values that would go into the configuration file.

Maybe instead of this configuration we can make the mechanism more flexible by just specifying such a property on the database. The set of units supported by the export could then be extended with no privileges at all.

@TomT0m yes, we plan to use "conversion to SI unit".

Maybe instead of this configuration we can make the mechanism more flexible by just specifying such a property on the database. The set of units supported by the export could then be extended with no privileges at all.

The problem is that it could then also be changed and deleted with no privileges at all. What should happen when a statement specifying a conversion factor is edited? Shall we re-import the entire RDF dump into the query service?

Changes that have big impact should not be too easy... We'd have to implement per-property permissions, or per-statement protection, before we could allow this. For now, a configuration file is the best solution, I think.

True, these are the issues. This is why I propose the two-stage scheme, where the units config file is generated from Wikidata statements, but this should be done very rarely once the initial version is created. Most updates will only be additions of new units, and changing an existing unit should require a good explanation.

Yes, this means we need to reload the whole data set (or at least the quantity statements, which is not much better) when we change unit conversion tables, except when new conversions/units are added (in that case, only statements using these units need to be reloaded and existing statements do not need to be changed).

Also, this means the conversion config would not really be synchronized with Wikidata data as such. I don't think it is a very big issue.

Until we have per-property permissions, I don't think direct usage of Wikidata properties is a good thing.

BTW, if we really wanted to use GNU units, I think it wouldn't be hard to create a UnitStorage driver that uses it, or to generate a JSON config from it, but I am not sure it's better than having one generated from Wikidata.

To use GNU units, we'll at least need a mapping between item IDs and GNU unit symbols.

Anyway....

@Smalyshev do you think it makes sense to start small, with a manually written config of maybe 100 units, or do you think we should try to cover as much as possible right away, so we don't have to reload the data multiple times?

I'd prefer starting small and then extending on-demand.

We then need a way to introduce new unit conversions into the dataset. Currently the only way is a full reload, but in theory we can add new conversions incrementally. Not changes though; changing a conversion is more complex.
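The distinction between adding and changing conversions can be sketched as a comparison of two config versions. This is a rough illustration only: `plan_reload` and its return shape are hypothetical, and treating a removed unit like a changed one is an assumption that the discussion here does not settle.

```python
from decimal import Decimal
from typing import Dict, Set, Tuple

Config = Dict[str, Tuple[str, Decimal]]  # unit Q-ID -> (primary Q-ID, multiplier)


def plan_reload(old: Config, new: Config) -> Tuple[str, Set[str]]:
    """Return ("full", set()) or ("incremental", units whose statements to reload)."""
    changed = {q for q in old.keys() & new.keys() if old[q] != new[q]}
    removed = old.keys() - new.keys()
    added = new.keys() - old.keys()
    if changed or removed:
        # Existing normalized values may now be wrong everywhere.
        return "full", set()
    # Only statements that use the newly added units need reloading.
    return "incremental", set(added)


old_config: Config = {"Q11573": ("Q11573", Decimal("1"))}
new_config: Config = {
    "Q11573": ("Q11573", Decimal("1")),
    "Q253276": ("Q11573", Decimal("1609.344")),
}
plan = plan_reload(old_config, new_config)
```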

It seems like we need an update strategy before we can deploy this (T145426). The only way around that would be to deploy with ALL conversions right away. Before we can add any new conversions, we would again have to implement some way to update conversions, or re-import the entire data dump.

Like Lydia, I would prefer to start out with a small set of mappings, perhaps even provided manually. That should be feasible for a few dozen units. Perhaps we could start with various units for the length dimension: µm... centimeter, inch, foot, yard, meter, ... kilometer, mile... light second.... AU... light year... parsec...
That would give us an opportunity to gather experience with unit conversions before enabling it for other dimensions.

So I checked our data against GNU units data and found these mismatches, among used units:

DIFF: [stadion(Q1645966)->metre]: GNU 189.738, ours 186
DIFF: [plethron(Q2099374)->metre]: GNU 31.623, ours 29.55
DIFF: [scruple(Q1573593)->kilogram]: GNU 0.0012959782, ours 0.001244
DIFF: [pood(Q923539)->kilogram]: GNU 16.3806872, ours 16.3804964
DIFF: [grain(Q693944)->kilogram]: GNU 6.479891E-5, ours 6.22E-5
DIFF: [pace(Q691543)->metre]: GNU 0.762, ours 1.48
DIFF: [pous(Q7235735)->metre]: GNU 0.31623, ours 0.308

But this does not account for square/cubic/etc. units yet.

More mismatches:

DIFF: [congius(Q3646719)->cubic metre]: GNU 0.0034806123936, ours 0.00323
DIFF: [sextarius(Q14333713)->cubic metre]: GNU 0.0005801020656, ours 0.000546
DIFF: [choinix(Q15794456)->cubic metre]: GNU 0.00108, ours 0.001087
DIFF: [amphora(Q2844434)->cubic metre]: GNU 0.0278448991488, ours 0.02586
DIFF: [chous(Q1076762)->cubic metre]: GNU 0, ours 0.00312 <-- here GNU doesn't seem to think it's volume measure
DIFF: [modius(Q669909)->cubic metre]: GNU 0.0092816330496, ours 0.008736
DIFF: [heredium(Q3785200)->square metre]: GNU 5046.6816, ours 5036
DIFF: [jugerum(Q251545)->square metre]: GNU 2523.3408, ours 2518

Most of these units seem to be rather exotic, and used only in cross-definitions (not sure yet how to filter those). The question is - are these differences important? Which data is better?
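For reference, a comparison like the one that produced these DIFF lines could be sketched as follows. This is not the script that was actually run: the function name `find_mismatches`, the row layout, and the relative tolerance of 1e-6 are assumptions. The stadion and pood values are copied from the lists above; the mile row is an illustrative matching case.

```python
from decimal import Decimal
from typing import List, Tuple

# (unit name, Q-ID, target unit, GNU Units factor, Wikidata factor)
Row = Tuple[str, str, str, Decimal, Decimal]


def find_mismatches(rows: List[Row], rel_tol: Decimal = Decimal("1e-6")) -> List[str]:
    """Report rows whose two factors differ by more than rel_tol (relative)."""
    report = []
    for name, qid, target, gnu, ours in rows:
        scale = max(abs(gnu), abs(ours))
        rel_diff = abs(gnu - ours) / scale if scale else Decimal(0)
        if rel_diff > rel_tol:
            report.append(f"DIFF: [{name}({qid})->{target}]: GNU {gnu}, ours {ours}")
    return report


rows: List[Row] = [
    ("stadion", "Q1645966", "metre", Decimal("189.738"), Decimal("186")),
    ("pood", "Q923539", "kilogram", Decimal("16.3806872"), Decimal("16.3804964")),
    ("mile", "Q253276", "metre", Decimal("1609.344"), Decimal("1609.344")),
]
report = find_mismatches(rows)
```

Raising the tolerance would hide small disagreements such as the pood one, which is one way to separate rounding noise from genuinely different definitions.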

Change 311206 had a related patch set uploaded (by Smalyshev):
Add config for units on Wikidata

https://gerrit.wikimedia.org/r/311206

It seems like we need an update strategy before we can deploy this

I don't think we need it before we deploy it - we'll only need to update once we deploy the second iteration - but I agree we certainly do need one. The easiest way to do it, though, would be to enable the .nt dump (T144103).

Perhaps we could start with various units for the length dimension:

It looks a bit weird to me if conversion works for length units but not for any other unit, but I can do that.

I've also restricted units to ones having at least 10 usages, and it looks like none of these show a disagreement between our data and GNU data.

Initially https://www.wikidata.org/wiki/Property:P2370 (conversion to SI units) wasn't used for units that don't have a reliable conversion to these (e.g. calendar year, month, historic units).

  • What is the plan for these?
  • Should we leave P2370 or use something else?

@Smalyshev @Esc3300: When extracting conversion factors from statements on Wikidata, we should only use factors that are marked +/-0. +/-0 indicates that the factor applies by definition, which is what we want for conversion.
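A minimal sketch of that filter, assuming the quantity values are available in the Wikibase JSON shape with `amount`, `upperBound` and `lowerBound` given as signed decimal strings. The helper name `is_exact` is hypothetical, treating missing bounds as "not exact" is an assumption, and the second sample value is made up to show an uncertain factor.

```python
from decimal import Decimal
from typing import Dict


def is_exact(quantity: Dict[str, str]) -> bool:
    """True when the quantity is marked +/-0, i.e. both bounds equal the amount."""
    if "upperBound" not in quantity or "lowerBound" not in quantity:
        return False  # assumption: unspecified bounds do not count as exact
    amount = Decimal(quantity["amount"])
    return (
        Decimal(quantity["upperBound"]) == amount
        and Decimal(quantity["lowerBound"]) == amount
    )


# Mile -> metre, exact by definition.
mile_factor = {
    "amount": "+1609.344",
    "upperBound": "+1609.344",
    "lowerBound": "+1609.344",
    "unit": "http://www.wikidata.org/entity/Q11573",
}

# Made-up factor with an uncertainty interval, for illustration only.
uncertain_factor = {
    "amount": "+186",
    "upperBound": "+190",
    "lowerBound": "+182",
    "unit": "http://www.wikidata.org/entity/Q11573",
}

usable = [q for q in (mile_factor, uncertain_factor) if is_exact(q)]
```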

For things like calendar year, month, or historical units, the conversion factor should be given with some uncertainty.

Thinking about it, we could even use uncertain conversion factors, but we would need to carry the uncertainty through the conversion mechanism, applying it during conversion. Possible, but I don't think we need that right now, and it would complicate things quite a bit.

If there's no conversion to SI units or no SI units, we shouldn't use P2370. We should use P2442 instead. SI is a well-defined system and we should only use SI properties to signify SI units. But we do plan to support other units, just a bit later, after we see how the system fares with SI ones.

For dates, I'm not sure we should use these - dates are an entirely different can of worms, and usually don't play by the rules the other measures play by. So we'll approach it on a case-by-case basis.

Mentioned in SAL (#wikimedia-operations) [2016-10-06T21:51:28Z] <thcipriani@tin> Synchronized wmf-config/unitConversionConfig.json: SWAT: [[gerrit:311206|Add config for units on Wikidata (T117032)]] PART I (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2016-10-06T21:53:06Z] <thcipriani@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:311206|Add config for units on Wikidata (T117032)]] PART II (duration: 00m 50s)

Smalyshev claimed this task.

The config is deployed in production.

Change 319401 had a related patch set uploaded (by Smalyshev):
Tools for creating unit conversion config

https://gerrit.wikimedia.org/r/319401

Change 319401 abandoned by Smalyshev:
Tools for creating unit conversion config

Reason:
we can do it without backporting

https://gerrit.wikimedia.org/r/319401