Update median_absolute_deviation aggregation. #9453

Merged
merged 4 commits into from
Mar 31, 2025
94 changes: 39 additions & 55 deletions _aggregations/metric/median-absolute-deviation.md
@@ -9,70 +9,41 @@ redirect_from:

# Median absolute deviation aggregations

The `median_absolute_deviation` metric is a single-value metric aggregation that returns a median absolute deviation field. Median absolute deviation is a statistical measure of data variability. Because the median absolute deviation measures dispersion from the median, it provides a more robust measure of variability that is less affected by outliers in a dataset.
The `median_absolute_deviation` aggregation is a single-value metric aggregation. Median absolute deviation is a variability metric that measures dispersion from the median.

Median absolute deviation is calculated as follows:<br>
median_absolute_deviation = median(|X<sub>i</sub> - Median(X<sub>i</sub>)|)
Median absolute deviation is less affected by outliers than standard deviation, which relies on squared error terms. This makes it useful for describing data that is not normally distributed.

The following example calculates the median absolute deviation of the `DistanceMiles` field in the sample dataset `opensearch_dashboards_sample_data_flights`:
Median absolute deviation is computed as follows:


```json
GET opensearch_dashboards_sample_data_flights/_search
{
"size": 0,
"aggs": {
"median_absolute_deviation_DistanceMiles": {
"median_absolute_deviation": {
"field": "DistanceMiles"
}
}
}
}
```
{% include copy-curl.html %}
median_absolute_deviation = median( | x<sub>i</sub> - median(x<sub>i</sub>) | )
```

#### Example response

```json
{
"took": 35,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"median_absolute_deviation_distanceMiles": {
"value": 1829.8993624441966
}
}
}
```
OpenSearch estimates `median_absolute_deviation`, rather than calculating it directly, because of memory limitations. This estimation is computationally expensive. You can adjust the trade-off between estimation accuracy and performance. For more information, see [Adjusting estimation accuracy](#adjusting-estimation-accuracy).
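
To make the definition concrete, the following is a minimal Python sketch of the exact calculation on a small in-memory list. It is not how OpenSearch computes the value (OpenSearch estimates it); the function name `exact_mad` and the sample data are illustrative assumptions.

```python
from statistics import median


def exact_mad(values):
    """Exact median absolute deviation: median(|x_i - median(x)|)."""
    if not values:
        raise ValueError("values must not be empty")
    center = median(values)
    # Median of the absolute distances from the median.
    return median(abs(x - center) for x in values)


# Hypothetical sample with one large outlier (100).
sample = [1, 2, 3, 4, 100]
print(exact_mad(sample))  # 1
```

The median of the sample is `3`, the absolute deviations are `2, 1, 0, 1, 97`, and their median is `1`, so the single outlier has almost no effect on the result.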

## Parameters

The `median_absolute_deviation` aggregation takes the following parameters.

### Missing
| Parameter | Required/Optional | Data type | Description |
| :-- | :-- | :-- | :-- |
| `field` | Required | String | The name of the numeric field for which the median absolute deviation is computed. |
| `missing` | Optional | Numeric | The value to assign to missing instances of the field. If not provided, documents with missing values are omitted from the estimation. |
| `compression` | Optional | Numeric | A parameter that [adjusts the balance between estimate accuracy and performance](#adjusting-estimation-accuracy). The value of `compression` must be greater than `0`. The default value is `1000`. |
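
As a rough sketch of how these parameters fit together, the following Python helper builds a request body containing `field` and, optionally, `missing` and `compression`. The helper name `build_mad_request` and the aggregation name `mad_example` are hypothetical; the sketch only builds and prints the body and does not send it to a cluster.

```python
import json


def build_mad_request(field, missing=None, compression=None,
                      agg_name="mad_example"):
    """Build a search body for a median_absolute_deviation aggregation."""
    # The parameter table states that compression must be greater than 0.
    if compression is not None and compression <= 0:
        raise ValueError("compression must be greater than 0")
    agg = {"field": field}
    if missing is not None:
        agg["missing"] = missing  # value assigned to documents without the field
    if compression is not None:
        agg["compression"] = compression
    return {"size": 0, "aggs": {agg_name: {"median_absolute_deviation": agg}}}


body = build_mad_request("DistanceMiles", missing=1000, compression=100)
print(json.dumps(body, indent=2))
```

Optional parameters are left out of the body when they are not supplied, so the defaults described in the table apply.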

By default, if a field is missing or has a null value in a document, it is ignored during computation. However, you can specify a value to be used for those missing or null fields by using the `missing` parameter, as shown in the following request:
## Example

The following example calculates the median absolute deviation of the `DistanceMiles` field in the `opensearch_dashboards_sample_data_flights` dataset:

```json
GET opensearch_dashboards_sample_data_flights/_search
{
"size": 0,
"aggs": {
"median_absolute_deviation_distanceMiles": {
"median_absolute_deviation_DistanceMiles": {
"median_absolute_deviation": {
"field": "DistanceMiles",
"missing": 1000
"field": "DistanceMiles"
}
}
}
@@ -82,9 +53,11 @@ GET opensearch_dashboards_sample_data_flights/_search

#### Example response

As shown in the following example response, the aggregation returns an estimate of the median absolute deviation in the `median_absolute_deviation_DistanceMiles` variable:

```json
{
"took": 7,
"took": 490,
"timed_out": false,
"_shards": {
"total": 1,
@@ -101,16 +74,22 @@ GET opensearch_dashboards_sample_data_flights/_search
"hits": []
},
"aggregations": {
"median_absolute_deviation_distanceMiles": {
"value": 1829.6443646143355
"median_absolute_deviation_DistanceMiles": {
"value": 1830.917892238693
}
}
}
```

### Compression
## Missing values

OpenSearch ignores missing and null values when computing `median_absolute_deviation`.

The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which balances between performance and estimation accuracy through the `compression` parameter (default value: `1000`). Adjusting the `compression` value affects the trade-off between computational efficiency and precision. Lower `compression` values improve performance but may reduce estimation accuracy, while higher values enhance accuracy at the cost of increased computational overhead, as shown in the following request:
You can assign a value to missing instances of the aggregated field. See [Missing aggregations]({{site.url}}{{site.baseurl}}/aggregations/bucket/missing/) for more information.

## Adjusting estimation accuracy

The median absolute deviation is calculated using the [t-digest](https://github.com/tdunning/t-digest/tree/main) data structure, which takes a `compression` parameter to balance performance and estimation accuracy. Lower values of `compression` improve performance but may reduce estimation accuracy, as shown in the following request:

```json
GET opensearch_dashboards_sample_data_flights/_search
@@ -128,7 +107,12 @@ GET opensearch_dashboards_sample_data_flights/_search
```
{% include copy-curl.html %}

#### Example response
The estimation error depends on the dataset but is usually below 5%, even for `compression` values as low as `100`. (The low example value of `10` is used here to illustrate the trade-off effect and is not recommended.)
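
To put the estimation error in perspective, the following Python sketch computes the relative error of an estimate against a reference value. The helper name `relative_error` is hypothetical. The two numbers are taken from this page: the estimate returned by the default-compression example (`1830.917892238693`) and the best estimate quoted for `DistanceMiles` (`1831.076904296875`).

```python
def relative_error(estimate, reference):
    """Relative error of an estimate with respect to a nonzero reference."""
    if reference == 0:
        raise ValueError("reference must be nonzero")
    return abs(estimate - reference) / abs(reference)


default_estimate = 1830.917892238693  # default compression, from the example response
best_estimate = 1831.076904296875     # best estimate quoted on this page

error = relative_error(default_estimate, best_estimate)
print(f"{error:.4%}")
```

For these two values, the relative error is below 0.01%, far under the 5% bound mentioned above.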

Note the decreased computation time (`took` time) and the slightly less accurate value of the estimated parameter in the following response.

For reference, OpenSearch's best estimate (with `compression` set arbitrarily high) for the median absolute deviation of `DistanceMiles` is `1831.076904296875`:


```json
{