Implement conversion from t-digest to exponential histograms #137575
base: main
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine)
```java
/**
 * Converts t-digest histograms to exponential histograms, trying to do the inverse
```
I think you did a good job describing the concept around the conversion in the PR description. Could you add some of that to the javadoc, too, so it doesn't get lost?
```java
// the conversion loses the width of the original buckets, but the bucket centers
// (arithmetic mean of the boundaries) should be very close
```
With the current algorithm for calculating percentiles, does the bucket width even matter? I suppose that may change if we do interpolation within a bucket. Aside from that, for t-digest histograms, we also don't really know the bucket widths/boundaries, do we? Or is t-digest always dense, so that the width is implicit from the previous and next value?
It doesn't matter much: our algorithm returns the point of least relative error (POLRE) when estimating the percentile for a bucket. This is different from the mean of the bucket boundaries, which we use (for backwards-compatibility reasons) for the exp-histo -> t-digest conversion. So the round trip effectively moves the percentile estimate from the POLRE to the mean of the bucket boundaries. That increases the relative error a little, but not by much.
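A minimal sketch of that difference (my own illustration, not the actual Elasticsearch code): assuming the POLRE for a bucket `(lo, hi]` is the point whose relative error towards both boundaries is equal, it works out to the harmonic mean of the boundaries, while the round trip replaces it with the arithmetic mean. For narrow buckets the two almost coincide:

```java
// Illustration only, not the Elasticsearch implementation.
public class BucketCenterSketch {

    // Arithmetic mean of the bucket boundaries (what the exp-histo -> t-digest
    // conversion uses as the centroid).
    static double arithmeticMean(double lo, double hi) {
        return (lo + hi) / 2.0;
    }

    // Point of least relative error, assuming it is the value v with equal
    // relative error towards both boundaries:
    //   (v - lo) / lo == (hi - v) / hi  =>  v = 2 * lo * hi / (lo + hi)
    // i.e. the harmonic mean of the boundaries.
    static double pointOfLeastRelativeError(double lo, double hi) {
        return 2.0 * lo * hi / (lo + hi);
    }

    public static void main(String[] args) {
        // A narrow bucket: the two notions of "center" are very close.
        double lo = 100.0;
        double hi = 100.5;
        double shift = arithmeticMean(lo, hi) - pointOfLeastRelativeError(lo, hi);
        System.out.println("mean - polre = " + shift); // tiny compared to the value itself
    }
}
```

For a wide bucket like `(1, 4]` the gap is visible (arithmetic mean 2.5 vs harmonic mean 1.6); at high scales, where buckets are narrow, it shrinks towards zero, matching the "increases the relative error a little" observation above.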
> I suppose that may change if we do interpolation within a bucket

Correct.
> Aside from that, for t-digest histograms, we also don't really know the bucket width/boundaries, do we? Or is t-digest always dense so that the width is implicit based on the previous and next value?

No, but the t-digest percentile estimation algorithm interpolates, if I'm not mistaken. So percentiles computed on exponential histograms with few buckets will appear "discrete", while staying smooth for the same t-digest. This is more about user experience than mathematical correctness.
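To illustrate the user-experience point, here is a toy sketch (not t-digest's actual interpolation scheme): with only a handful of discrete points, a nearest-point quantile estimate is a step function of `q`, while interpolating between neighboring points yields a smooth curve:

```java
// Toy illustration only; t-digest's real interpolation is more sophisticated.
public class QuantileSketch {

    // "Discrete" estimate: return the point covering the requested rank.
    static double nearestRankQuantile(double[] sorted, double q) {
        int i = (int) Math.min(sorted.length - 1, Math.floor(q * sorted.length));
        return sorted[i];
    }

    // Smooth estimate: linear interpolation between neighboring points.
    static double interpolatedQuantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int i = (int) Math.floor(pos);
        int j = Math.min(i + 1, sorted.length - 1);
        double frac = pos - i;
        return sorted[i] * (1 - frac) + sorted[j] * frac;
    }

    public static void main(String[] args) {
        double[] centroids = { 1.0, 2.0, 3.0, 4.0 };
        // The discrete estimate jumps from 2.0 to 3.0 as q crosses 0.5,
        // while the interpolated one moves gradually through 2.5.
        System.out.println(nearestRankQuantile(centroids, 0.49) + " -> " + nearestRankQuantile(centroids, 0.5));
        System.out.println(interpolatedQuantile(centroids, 0.49) + " -> " + interpolatedQuantile(centroids, 0.5));
    }
}
```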
```java
assertThat(
    "original center=" + originalCenter + ", converted center=" + convertedCenter + ", relative error=" + relativeError,
    relativeError,
    closeTo(0, 0.0000001)
);
```
Great that we can guarantee such a small error even after a round trip.
Part of #136605.
Adds a conversion algorithm for converting t-digests into exponential histograms, but does not wire it up yet.

This algorithm aims to invert the existing exponential histogram to t-digest conversion, but it can't do so perfectly: the bucket centers are preserved as centroids, while the bucket widths are lost. The conversion therefore simply generates tiny buckets (scale set to MAX_SCALE) containing the t-digest centroids, so the current percentile estimation algorithm will in practice return the centroid closest to the requested percentile.

In addition, this PR fixes an academic bug / edge case in the existing exp-histo -> t-digest conversion algorithm.
Exponential histograms have a higher resolution than doubles, so in theory two buckets can map to the same centroid, which the algorithm did not account for. In practice this does not occur, because the exponential histograms are generated from double values.
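As a rough sketch of the indexing involved (my own simplified formulas, not the PR's code): at scale `s`, an exponential histogram uses buckets `(base^i, base^(i+1)]` with `base = 2^(2^-s)`, so raising the scale makes the buckets arbitrarily narrow around each centroid. Real implementations avoid the naive log/pow arithmetic below for precision reasons:

```java
// Simplified sketch of exponential-histogram indexing; real implementations
// use more careful arithmetic to avoid floating-point error at bucket borders.
public class ExpHistoIndexSketch {

    // Index i such that base^i < value <= base^(i+1), with base = 2^(2^-scale).
    static long bucketIndex(double value, int scale) {
        double scaleFactor = Math.scalb(1.0, scale); // 2^scale
        return (long) Math.ceil(Math.log(value) / Math.log(2) * scaleFactor) - 1;
    }

    // Lower boundary of the bucket with the given index: 2^(index * 2^-scale).
    static double lowerBound(long index, int scale) {
        return Math.pow(2.0, Math.scalb((double) index, -scale));
    }

    public static void main(String[] args) {
        // At scale 0 the buckets are powers of two: 3.0 falls into (2, 4].
        System.out.println(bucketIndex(3.0, 0)); // 1
        System.out.println(lowerBound(1, 0));    // 2.0
        // Each extra scale step halves the relative bucket width, which is how
        // MAX_SCALE yields "tiny buckets" around each centroid.
        System.out.println(lowerBound(1, 4));    // 2^(1/16), just above 1
    }
}
```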