Skip to content

Conversation

@JonasKunz
Copy link
Contributor

@JonasKunz JonasKunz commented Nov 4, 2025

Part of #137549.
Adds a new HistogramPercentile scalar function, which extracts a given percentile from a single exponential histogram value.

This functions is not intended to be user-facing, but will only be used as a surrogate to implement the existing PERCENTILE aggregation in combination with an upcoming histogram-merge aggregation. For this reason, this function is undocumented (except for javadoc) and not registered in EsqlFunctionRegistry.
We can eventually expose this function if needed, but in that case we need to go through the design discussions first.

Because this function is not user-facing, this PR does not add CSV tests yet.
CSV tests will follow when the exponential histogram PERCENTILE is implemented via the surrogate explained above.

@elasticsearchmachine elasticsearchmachine added v9.3.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Nov 4, 2025
@JonasKunz JonasKunz added :Analytics/ES|QL AKA ESQL v9.3.0 >non-issue and removed external-contributor Pull request authored by a developer outside the Elasticsearch team v9.3.0 labels Nov 4, 2025
@Evaluator(warnExceptions = ArithmeticException.class)
static void process(DoubleBlock.Builder resultBuilder, ExponentialHistogram value, double percentile) {
if (percentile < 0.0 || percentile > 100.0) {
throw new ArithmeticException("Percentile value must be in the range [0, 100], got: " + percentile);
Copy link
Contributor Author

@JonasKunz JonasKunz Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is it a good or a bad practice to include the invalid value in the warning message? E.g. Asin doesn't include the wrong value in its warnings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the percentile is a constant with the integration, we should verify it and reject invalid values instead. But this should be okay now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC you mean that if someone writes a query like STATS PERCENTILE(my_histo, 200) the query shouldn't even get to the execution phase, but we should reject it with an error.

Am I understanding this correctly?
I unfortunately don't know what the correct callback / hook would be to perform this verification. I looked in the PERCENTILE aggregation and the POW function for examples, but they don't seem to be doing this kind of verification either.

If you could provide me with an example on how to do this, I can add it in a follow-up.

@JonasKunz JonasKunz marked this pull request as ready for review November 4, 2025 11:19
@JonasKunz JonasKunz requested a review from dnhatn November 4, 2025 11:19
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if a separate function is required, but the implementation looks good. Let's merge it and move to the next step. Thanks, Jonas

@Evaluator(warnExceptions = ArithmeticException.class)
static void process(DoubleBlock.Builder resultBuilder, ExponentialHistogram value, double percentile) {
if (percentile < 0.0 || percentile > 100.0) {
throw new ArithmeticException("Percentile value must be in the range [0, 100], got: " + percentile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the percentile is a constant with the integration, we should verify it and reject invalid values instead. But this should be okay now.

@JonasKunz
Copy link
Contributor Author

I'm not sure if a separate function is required

Having this as a separate function helps on two fronts:

  • DRY: We will only need to implement one aggregation functions for histograms: "merge". Then if we add more analytical aggregations in addition to percentile (e.g. rank), those can all reuse the existing merge aggregation via surrogates instead of copying the logic over and over or doing fancy stuff with inheritance
  • Performance: If merging and percentile computation was fused in the aggregation and a user queries for e.g. 10 percentiles (p50, p75, p90, p95, ...) we'd do the merging, which is the actual expensive part, 10 times. If instead it is split in merge aggregation + percentile function extraction, the logical planner will to my understanding reuse the merge result across the percentile computations. So we have to do the expensive merge aggregation only once instead of 10 times.

@JonasKunz JonasKunz enabled auto-merge (squash) November 6, 2025 07:35
@JonasKunz JonasKunz merged commit 34e3417 into elastic:main Nov 6, 2025
34 checks passed
@JonasKunz JonasKunz deleted the esql-percentile-function branch November 6, 2025 08:53
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 6, 2025
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json

* upstream/main:
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690
  [LTR] Fix feature display order when using explain. (elastic#137671)
  Remove extra RemoteClusterService instances in unit test (elastic#137647)
  Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669)
  Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233)
  Update bundled JDK to 25.0.1 (elastic#137640)
  resolve indices for prefixed _all expressions (elastic#137330)
  ESQL: Add TopN support for exponential histograms (elastic#137313)
  allows field caps to be cross project (elastic#137530)
  ESQL: Add exponential histogram percentile function (elastic#137553)
  Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636)
  Tighten on when THROTTLE decision can be returned (elastic#136794)
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655
  Add a test for two little known conditional processor paths (elastic#137645)
  Extract a common ORIGIN constant (elastic#137612)
  Remove early phase failure in batched (elastic#136889)
  Returning correct index mode from get data streams api (elastic#137646)
  [ML] Manage AD results indices (elastic#136065)
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 6, 2025
…-json

* upstream/main:
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=false} elastic#137691
  Mute org.elasticsearch.xpack.inference.action.filter.ShardBulkInferenceActionFilterBasicLicenseIT testLicenseInvalidForInference {p0=true} elastic#137690
  [LTR] Fix feature display order when using explain. (elastic#137671)
  Remove extra RemoteClusterService instances in unit test (elastic#137647)
  Fix `ComponentTemplatesFileSettingsIT.testSettingsApplied` (elastic#137669)
  Consolidates troubleshooting content into the "Returning semantic field embeddings in _source" section (elastic#137233)
  Update bundled JDK to 25.0.1 (elastic#137640)
  resolve indices for prefixed _all expressions (elastic#137330)
  ESQL: Add TopN support for exponential histograms (elastic#137313)
  allows field caps to be cross project (elastic#137530)
  ESQL: Add exponential histogram percentile function (elastic#137553)
  Wait for nodes to have downloaded databases in `GeoIpDownloaderIT` (elastic#137636)
  Tighten on when THROTTLE decision can be returned (elastic#136794)
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeMetricsIT test elastic#137655
  Add a test for two little known conditional processor paths (elastic#137645)
  Extract a common ORIGIN constant (elastic#137612)
  Remove early phase failure in batched (elastic#136889)
  Returning correct index mode from get data streams api (elastic#137646)
  [ML] Manage AD results indices (elastic#136065)
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants