Implement conversion from t-digest to exponential histograms #137575
base: main
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine)
```java
/**
 * Converts t-digest histograms to exponential histograms, trying to do the inverse
```
I think you did a good job describing the concept around the conversion in the PR description. Could you add some of that to the javadoc, too, so it doesn't get lost?
```java
// the conversion loses the width of the original buckets, but the bucket centers
// (arithmetic mean of the boundaries) should be very close
```
With the current algorithm for calculating percentiles, does the bucket width even matter? I suppose that may change if we do interpolation within a bucket. Aside from that, for t-digest histograms, we also don't really know the bucket widths/boundaries, do we? Or is t-digest always dense, so that the width is implicit from the previous and next value?
It doesn't matter much: our algorithm returns the point of least relative error (POLRE) when estimating the percentile for a bucket. This is different from the mean of the bucket boundaries, which we use (for backwards-compatibility reasons) for the exp-histo -> t-digest conversion. So the round trip effectively moves the percentile estimate from the POLRE to the mean of the bucket boundaries. That increases the relative error a little, but not by much.
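A minimal sketch of that difference (my own illustration, not the actual Elasticsearch code): assuming the POLRE for a bucket `(lo, hi]` is the point whose relative error towards both boundaries is equal, it works out to the harmonic mean of the boundaries, while the round trip replaces it with the arithmetic mean. For narrow buckets the two almost coincide:

```java
// Illustration only, not the Elasticsearch implementation.
public class BucketCenterSketch {

    // Arithmetic mean of the bucket boundaries (what the exp-histo -> t-digest
    // conversion uses as the centroid).
    static double arithmeticMean(double lo, double hi) {
        return (lo + hi) / 2.0;
    }

    // Point of least relative error, assuming it is the value v with equal
    // relative error towards both boundaries:
    //   (v - lo) / lo == (hi - v) / hi  =>  v = 2 * lo * hi / (lo + hi)
    // i.e. the harmonic mean of the boundaries.
    static double pointOfLeastRelativeError(double lo, double hi) {
        return 2.0 * lo * hi / (lo + hi);
    }

    public static void main(String[] args) {
        // A narrow bucket: the two notions of "center" are very close.
        double lo = 100.0;
        double hi = 100.5;
        double shift = arithmeticMean(lo, hi) - pointOfLeastRelativeError(lo, hi);
        System.out.println("mean - polre = " + shift); // tiny compared to the value itself
    }
}
```

For a wide bucket like `(1, 4]` the gap is visible (arithmetic mean 2.5 vs harmonic mean 1.6); at high scales, where buckets are narrow, it shrinks towards zero, matching the "increases the relative error a little" observation above.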
> I suppose that may change if we do interpolation within a bucket

Correct.
> Aside from that, for t-digest histograms, we also don't really know the bucket width/boundaries, do we? Or is t-digest always dense so that the width is implicit based on the previous and next value?

No, but the t-digest percentile estimation algorithm interpolates, if I'm not mistaken. So percentiles computed on exponential histograms with few buckets will appear "discrete", while staying smooth for the same t-digest. This is more about user experience than mathematical correctness.
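To illustrate the user-experience point, here is a toy sketch (not t-digest's actual interpolation scheme): with only a handful of discrete points, a nearest-point quantile estimate is a step function of `q`, while interpolating between neighboring points yields a smooth curve:

```java
// Toy illustration only; t-digest's real interpolation is more sophisticated.
public class QuantileSketch {

    // "Discrete" estimate: return the point covering the requested rank.
    static double nearestRankQuantile(double[] sorted, double q) {
        int i = (int) Math.min(sorted.length - 1, Math.floor(q * sorted.length));
        return sorted[i];
    }

    // Smooth estimate: linear interpolation between neighboring points.
    static double interpolatedQuantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int i = (int) Math.floor(pos);
        int j = Math.min(i + 1, sorted.length - 1);
        double frac = pos - i;
        return sorted[i] * (1 - frac) + sorted[j] * frac;
    }

    public static void main(String[] args) {
        double[] centroids = { 1.0, 2.0, 3.0, 4.0 };
        // The discrete estimate jumps from 2.0 to 3.0 as q crosses 0.5,
        // while the interpolated one moves gradually through 2.5.
        System.out.println(nearestRankQuantile(centroids, 0.49) + " -> " + nearestRankQuantile(centroids, 0.5));
        System.out.println(interpolatedQuantile(centroids, 0.49) + " -> " + interpolatedQuantile(centroids, 0.5));
    }
}
```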
```java
assertThat(
    "original center=" + originalCenter + ", converted center=" + convertedCenter + ", relative error=" + relativeError,
    relativeError,
    closeTo(0, 0.0000001)
);
```
Great that we can guarantee such a small error even after a round trip.
Part of #136605.
Adds a conversion algorithm for converting t-digests into exponential histograms, but does not wire it up yet.

This algorithm aims to invert the existing exponential histogram to t-digest conversion, but it can't do so perfectly: the bucket centers are preserved as centroids, while the bucket widths are lost. The conversion therefore simply generates tiny buckets (scale set to MAX_SCALE) containing the t-digest centroids, so the current percentile estimation algorithm will in practice return the centroid closest to the requested percentile.

In addition, this PR fixes an academic bug / edge case in the existing exp-histo -> t-digest conversion algorithm.
Exponential histograms have a higher resolution than doubles, so in theory two buckets can map to the same centroid, which the algorithm did not account for. In practice this does not occur, because the exponential histograms are generated from double values.
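As a rough sketch of the indexing involved (my own simplified formulas, not the PR's code): at scale `s`, an exponential histogram uses buckets `(base^i, base^(i+1)]` with `base = 2^(2^-s)`, so raising the scale makes the buckets arbitrarily narrow around each centroid. Real implementations avoid the naive log/pow arithmetic below for precision reasons:

```java
// Simplified sketch of exponential-histogram indexing; real implementations
// use more careful arithmetic to avoid floating-point error at bucket borders.
public class ExpHistoIndexSketch {

    // Index i such that base^i < value <= base^(i+1), with base = 2^(2^-scale).
    static long bucketIndex(double value, int scale) {
        double scaleFactor = Math.scalb(1.0, scale); // 2^scale
        return (long) Math.ceil(Math.log(value) / Math.log(2) * scaleFactor) - 1;
    }

    // Lower boundary of the bucket with the given index: 2^(index * 2^-scale).
    static double lowerBound(long index, int scale) {
        return Math.pow(2.0, Math.scalb((double) index, -scale));
    }

    public static void main(String[] args) {
        // At scale 0 the buckets are powers of two: 3.0 falls into (2, 4].
        System.out.println(bucketIndex(3.0, 0)); // 1
        System.out.println(lowerBound(1, 0));    // 2.0
        // Each extra scale step halves the relative bucket width, which is how
        // MAX_SCALE yields "tiny buckets" around each centroid.
        System.out.println(lowerBound(1, 4));    // 2^(1/16), just above 1
    }
}
```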