Add more metrics for reindex #137597

samxbr · 2025-11-04T23:57:07Z

Adds a few more metrics for reindex to help track usage:

(existing) es.reindex.duration.histogram: operation time for all reindex operations
es.reindex.duration.histogram.remote: operation time for reindex-from-remote operations
es.reindex.completion.success: number of successful reindex operations
es.reindex.completion.success.remote: number of successful reindex-from-remote operations
es.reindex.completion.failure: number of failed reindex operations
es.reindex.completion.failure.remote: number of failed reindex-from-remote operations
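
As a rough illustration of the naming scheme, the sketch below lists which of these metrics a single finished reindex might touch. The class and method names are hypothetical and not taken from the PR, and the code assumes two things the description does not spell out: that the unsuffixed metrics cover every reindex with the `.remote` variants recorded in addition for remote runs, and that duration is recorded regardless of outcome.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not code from this PR.
final class ReindexMetricNames {
    static final String DURATION = "es.reindex.duration.histogram";
    static final String SUCCESS = "es.reindex.completion.success";
    static final String FAILURE = "es.reindex.completion.failure";
    static final String REMOTE_SUFFIX = ".remote";

    private ReindexMetricNames() {}

    // Assumption: unsuffixed metrics cover every reindex, and the ".remote"
    // variants are recorded in addition when the source is remote.
    // Assumption: duration is recorded whether or not the reindex succeeded.
    static List<String> metricsFor(boolean remote, boolean succeeded) {
        String completion = succeeded ? SUCCESS : FAILURE;
        List<String> names = new ArrayList<>();
        names.add(DURATION);
        names.add(completion);
        if (remote) {
            names.add(DURATION + REMOTE_SUFFIX);
            names.add(completion + REMOTE_SUFFIX);
        }
        return names;
    }
}
```

Under those assumptions, `ReindexMetricNames.metricsFor(true, false)` would name four metrics for a failed remote reindex: both duration histograms and both failure counters.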

samxbr · 2025-11-05T00:08:52Z

...dex/src/internalClusterTest/java/org/elasticsearch/index/reindex/ReindexPluginMetricsIT.java

Ideally there would be a test verifying the metrics for reindex from remote, but I couldn't get a reindex from remote to succeed in internal cluster tests; the existing tests for it seem to be REST tests, and with REST tests I didn't find a way to verify the metrics. So the new non-remote metrics are verified in internal cluster tests, whereas the remote metrics are verified in unit tests.

samxbr · 2025-11-05T00:25:08Z

modules/reindex/src/main/java/org/elasticsearch/reindex/Reindexer.java

I think it is a bug that reindex metrics are currently recorded only if the task is a reindex worker and not a leader. If slicing is enabled there will be no metrics, because this listener is not wrapped with metrics when it is used here.
We want to record a single metric for the parent task rather than one per child task. I didn't find an easy way to fix this, since each slice is an independent BulkByScrollAction.

Since we don't support remote slicing, I don't think this would affect the remote metrics, but would probably make the metrics inaccurate for non-remote reindex if slicing is used.

I am not saying we need to fix it for this PR (I think this PR can go without it), just raising this for awareness.
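
To make the "single metric for the parent task" idea concrete, here is a minimal sketch of wrapping a completion listener once so that each outcome is counted exactly one time. Everything in it is a stand-in: `Listener`, `Counters`, and `wrapWithMetrics` are hypothetical names, not Elasticsearch types, and whether the parent listener can actually be wrapped this way in Reindexer is the open question raised above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-ins only; none of these types come from Elasticsearch.
final class ParentListenerMetricsSketch {
    // Minimal callback shape, standing in for the real listener type.
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Stand-in for the success and failure completion counters.
    static final class Counters {
        final AtomicLong success = new AtomicLong();
        final AtomicLong failure = new AtomicLong();
    }

    private ParentListenerMetricsSketch() {}

    // Wraps the delegate so each completion is counted once, then forwarded.
    static <T> Listener<T> wrapWithMetrics(Listener<T> delegate, Counters counters) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                counters.success.incrementAndGet();
                delegate.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                counters.failure.incrementAndGet();
                delegate.onFailure(e);
            }
        };
    }

    // Listener that ignores both outcomes, handy for trying the wrapper out.
    static <T> Listener<T> noop() {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {}

            @Override
            public void onFailure(Exception e) {}
        };
    }
}
```

If the wrap were applied once to the parent's listener, before any slicing decision, a sliced run would still produce one completion count rather than zero or one per slice.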

elasticsearchmachine · 2025-11-05T04:29:15Z

Pinging @elastic/es-data-management (Team:Data Management)

PeteGillinElastic

Thanks @samxbr. I haven't done a full review of this, since you said you were looking for early feedback, but here are some initial thoughts.

We also talked about attempting to get lost operations (due to node restart). Did you look at how the existing chart for that works? Is it grepping the logs? Do you know where that logging is done? I'm wondering whether we want to try to figure out how to distinguish remote vs local in there, too.

PeteGillinElastic · 2025-11-05T13:32:02Z

...dex/src/internalClusterTest/java/org/elasticsearch/index/reindex/ReindexPluginMetricsIT.java

I was able to get a remote reindex (hitting the same cluster, but via the remote mechanism) succeeding in this class with the following changes:

Add MainRestPlugin.class to the list returned by nodePlugins(). Needed because we'll be doing REST calls to the remote.

Enable real HTTP, for the same reason:

    @Override
    protected boolean addMockHttpTransport() {
        return false;
    }

Whitelist everything:

    @Override
    protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
        return Settings.builder()
            .put(super.nodeSettings(nodeOrdinal, otherSettings))
            .put(TransportReindexAction.REMOTE_CLUSTER_WHITELIST.getKey(), "*:*")
            .build();
    }

Set the remote info on the request, using the hostname and port for an arbitrary node from the cluster:

    InetSocketAddress remoteAddress = randomFrom(cluster().httpAddresses());
    RemoteInfo remote = new RemoteInfo(
        "http",
        remoteAddress.getHostName(),
        remoteAddress.getPort(),
        null,
        new BytesArray("{\"match_all\":{}}"),
        null,
        null,
        emptyMap(),
        RemoteInfo.DEFAULT_SOCKET_TIMEOUT,
        RemoteInfo.DEFAULT_CONNECT_TIMEOUT
    );
    reindex().source("source").setRemoteInfo(remote).destination("dest").get();

(This was heavily inspired by RetryTests, which is a Java REST test, but I was able to extract the bits that seemed relevant.)

PeteGillinElastic · 2025-11-05T13:35:08Z

modules/reindex/src/main/java/org/elasticsearch/reindex/Reindexer.java

Yeah, I agree that it is tangential to the work in hand — we only care about remote, where slicing is not available — but it definitely seems like something we should fix at some point. Do you want to create a bug in github?

PeteGillinElastic · 2025-11-05T13:44:44Z

modules/reindex/src/main/java/org/elasticsearch/reindex/ReindexMetrics.java

+    private final LongHistogram reindexSuccessHistogram;
+    private final LongHistogram reindexSuccessHistogramRemote;
+    private final LongHistogram reindexFailureHistogram;
+    private final LongHistogram reindexFailureHistogramRemote;


Do you know the pros and cons of using multiple metrics here, vs using one metric with a couple of attributes, one to indicate local/remote and one to indicate success/failure?

FWIW, OTel seems to recommend setting the error.type attribute to indicate a failure (with the absence of that indicating success). They have advice on the values for that attribute.

For the local/remote one, I guess it'd have to be a domain-specific attribute. I don't know if there are naming conventions for that or anything.
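
For comparison, a minimal sketch of the single-metric alternative: one completion metric whose attributes carry the distinctions. The `error.type` key follows the OTel advice mentioned above (present only on failure); the `es.reindex.source` key and its `local`/`remote` values are made-up placeholders, since no naming convention has been settled here, and using the exception class name as the `error.type` value is just one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not code from this PR.
final class ReindexCompletionAttributes {
    // OTel-style attribute: set only when the operation failed.
    static final String ERROR_TYPE_KEY = "error.type";
    // Placeholder name for a domain-specific attribute; not an agreed convention.
    static final String SOURCE_KEY = "es.reindex.source";

    private ReindexCompletionAttributes() {}

    // Builds the attributes for one completed reindex.
    // Pass failure == null for a successful run.
    static Map<String, Object> attributesFor(boolean remote, Exception failure) {
        Map<String, Object> attributes = new HashMap<>();
        attributes.put(SOURCE_KEY, remote ? "remote" : "local");
        if (failure != null) {
            // The exception class name keeps the value set small and predictable.
            attributes.put(ERROR_TYPE_KEY, failure.getClass().getName());
        }
        return attributes;
    }
}
```

With this shape, the four success/failure counters would collapse into one metric, and the local/remote and success/failure splits would be recovered by filtering or grouping on the attributes at query time.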

Add reindex from remote metrics

f11a9c7

elasticsearchmachine added the v9.3.0 label Nov 4, 2025

Merge branch 'main' into reindex/add-metrics

290fa48

samxbr commented Nov 5, 2025

Merge branch 'main' into reindex/add-metrics

6566a55

samxbr added the :Data Management/Indices APIs and >non-issue labels Nov 5, 2025

samxbr marked this pull request as ready for review November 5, 2025 04:28

elasticsearchmachine added the Team:Data Management label Nov 5, 2025

PeteGillinElastic reviewed Nov 5, 2025

Merge branch 'main' into reindex/add-metrics

70ed46d
