Skip to content

Conversation

@JonasKunz
Copy link
Contributor

Part of #136605.
Adds a conversion algorithm for converting T-Digests into exponential histograms, but does not wire it up yet.
This algorithm aims to invert the existing exponential histogram to T-Digest conversion, but it can't do it perfectly:
The bucket centers are preserved as centroids, however the bucket widths are lost.

Therefore the conversion algorithm simply generates tiny buckets (scale set to MAX_SCALE) containing the T-Digest centroids. Therefore the current percentile estimation algorithm will in practice return the centroid closest to the percentile.

In addition, this PR fixes an academic bug / edge case in the existing exp-histo -> T-Digest conversion algorithm.
Exponential histograms have a higher resolution than doubles. Therefore in theory it can happen that two buckets map to the same centroid, which the algorithm did not account for.
In practice this does not occur, because the exponential histograms are generated from double values.

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v9.3.0 labels Nov 4, 2025
@JonasKunz JonasKunz added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Nov 4, 2025
@JonasKunz JonasKunz marked this pull request as ready for review November 4, 2025 14:28
@JonasKunz JonasKunz requested a review from felixbarny November 4, 2025 14:28
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

}

/**
* Converts t-digest histograms to exponential histograms, trying to do the inverse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you did a good job describing the concept around the conversion in the PR description. Could you add some of that to the javadoc, too, so it doesn't get lost?

Comment on lines +44 to +45
// the conversion looses the width of the original buckets, but the bucket centers (arithmetic mean of boundaries)
// should be very close
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current algorithm for calculating percentiles, does the bucket width even matter? I suppose that may change if we do interpolation within a bucket. Aside from that, for the tidiest histograms, we also don't really know the bucket width/boundaries, do we? Or is tdigest always dense so that the width is implicit based on the previous and next value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter much: Our algorithm returns the point of least relative error when estimating the percentile for a bucket. This different from the mean of boundaries of the bucket, which we use (for backwards compatibility reasons) for the exp-histo -> TDigest conversion. So basically the round trip changes the percentile from the POLRE to the mean of the bucket boundaries. That increases the relative error a little, but not by much.

I suppose that may change if we do interpolation within a bucket

Correct.

Aside from that, for the tidiest histograms, we also don't really know the bucket width/boundaries, do we? Or is tdigest always dense so that the width is implicit based on the previous and next value?

No, but the T-Digest percentile estimation algorithm interpolates IINM. So percentiles computed on exponential histograms with few buckets will appear "discrete", while smooth for the same t digest. This is more about user experience than mathematical correctness.

assertThat(
"original center=" + originalCenter + ", converted center=" + convertedCenter + ", relative error=" + relativeError,
relativeError,
closeTo(0, 0.0000001)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that we can guarantee such as small error even after a roundtrip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants