
Conversation

@samxbr samxbr (Contributor) commented Nov 4, 2025

Adds a couple more metrics for reindex to help track usage (a sketch of how such metrics might be registered follows the list):

  • (existing) es.reindex.duration.histogram: operation time for all reindex operations
  • es.reindex.duration.histogram.remote: operation time for reindex from remote
  • es.reindex.completion.success: number of successful reindex operations
  • es.reindex.completion.success.remote: number of successful reindex operations from remote
  • es.reindex.completion.failure: number of failed reindex operations
  • es.reindex.completion.failure.remote: number of failed reindex operations from remote
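
For context, a minimal sketch of how metrics like these might be registered and recorded through Elasticsearch's telemetry MeterRegistry. The class and method names are illustrative, not the PR's actual implementation (note that the diff quoted later in the thread uses LongHistogram fields for the completion metrics, whereas this sketch uses counters for simplicity):

    import java.util.concurrent.TimeUnit;

    import org.elasticsearch.telemetry.metric.LongCounter;
    import org.elasticsearch.telemetry.metric.LongHistogram;
    import org.elasticsearch.telemetry.metric.MeterRegistry;

    // Hypothetical holder for the non-remote metrics listed above.
    public class ReindexMetricsSketch {
        private final LongHistogram durationHistogram;
        private final LongCounter successCounter;
        private final LongCounter failureCounter;

        public ReindexMetricsSketch(MeterRegistry registry) {
            this.durationHistogram = registry.registerLongHistogram(
                "es.reindex.duration.histogram", "Time taken by reindex operations", "s");
            this.successCounter = registry.registerLongCounter(
                "es.reindex.completion.success", "Number of successful reindex operations", "count");
            this.failureCounter = registry.registerLongCounter(
                "es.reindex.completion.failure", "Number of failed reindex operations", "count");
        }

        public void recordSuccess(long tookMillis) {
            durationHistogram.record(TimeUnit.MILLISECONDS.toSeconds(tookMillis));
            successCounter.increment();
        }

        public void recordFailure(long tookMillis) {
            durationHistogram.record(TimeUnit.MILLISECONDS.toSeconds(tookMillis));
            failureCounter.increment();
        }
    }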

@samxbr samxbr (Contributor Author) commented:

Ideally there should be a test to verify the metrics for reindex from remote, but I couldn't get a successful reindex-from-remote run with internal cluster tests; the existing ones seem to be REST tests, and with REST tests I didn't find a way to verify the metrics. So the new non-remote metrics are verified in internal cluster tests, whereas the remote metrics are verified in unit tests.

@PeteGillinElastic PeteGillinElastic (Member) commented:

I was able to get a remote reindex (hitting the same cluster, but via the remote mechanism) to succeed in this class with the following changes:

  • Add MainRestPlugin.class to the list returned by nodePlugins(). Needed because we'll be doing REST calls to the remote.
  • Enable real HTTP, for the same reason:
    @Override
    protected boolean addMockHttpTransport() {
        return false;
    }
  • Whitelist everything:
    @Override
    protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
        return Settings.builder()
            .put(super.nodeSettings(nodeOrdinal, otherSettings))
            .put(TransportReindexAction.REMOTE_CLUSTER_WHITELIST.getKey(), "*:*")
            .build();
    }
  • Set the remote info on the request, using the hostname and port for an arbitrary node from the cluster:
    InetSocketAddress remoteAddress = randomFrom(cluster().httpAddresses());
    RemoteInfo remote = new RemoteInfo(
        "http",
        remoteAddress.getHostName(),
        remoteAddress.getPort(),
        null,
        new BytesArray("{\"match_all\":{}}"),
        null,
        null,
        emptyMap(),
        RemoteInfo.DEFAULT_SOCKET_TIMEOUT,
        RemoteInfo.DEFAULT_CONNECT_TIMEOUT
    );
    reindex().source("source").setRemoteInfo(remote).destination("dest").get();

(This was heavily inspired by RetryTests, which is a Java REST test, but I was able to extract the bits that seemed relevant.)
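
Putting those pieces together, a self-contained version might look like the sketch below. It assumes the test extends ReindexTestCase (which provides reindex()); the class name, index names, and assertion are illustrative, and the import paths are my best guess for the current layout (e.g. MainRestPlugin under org.elasticsearch.rest.root):

    import static java.util.Collections.emptyMap;

    import java.net.InetSocketAddress;
    import java.util.Collection;

    import org.elasticsearch.common.bytes.BytesArray;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.util.CollectionUtils;
    import org.elasticsearch.index.reindex.BulkByScrollResponse;
    import org.elasticsearch.index.reindex.RemoteInfo;
    import org.elasticsearch.plugins.Plugin;
    import org.elasticsearch.reindex.TransportReindexAction;
    import org.elasticsearch.rest.root.MainRestPlugin;

    public class ReindexFromRemoteMetricsIT extends ReindexTestCase {

        @Override
        protected Collection<Class<? extends Plugin>> nodePlugins() {
            // MainRestPlugin is needed because reindex-from-remote makes REST calls to the "remote".
            return CollectionUtils.appendToCopy(super.nodePlugins(), MainRestPlugin.class);
        }

        @Override
        protected boolean addMockHttpTransport() {
            return false; // enable real HTTP, for the same reason
        }

        @Override
        protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
            return Settings.builder()
                .put(super.nodeSettings(nodeOrdinal, otherSettings))
                .put(TransportReindexAction.REMOTE_CLUSTER_WHITELIST.getKey(), "*:*")
                .build();
        }

        public void testReindexFromRemote() throws Exception {
            indexRandom(true, client().prepareIndex("source").setSource("foo", "bar"));

            // The "remote" is just an arbitrary node of this same cluster.
            InetSocketAddress remoteAddress = randomFrom(cluster().httpAddresses());
            RemoteInfo remote = new RemoteInfo(
                "http",
                remoteAddress.getHostName(),
                remoteAddress.getPort(),
                null,
                new BytesArray("{\"match_all\":{}}"),
                null,
                null,
                emptyMap(),
                RemoteInfo.DEFAULT_SOCKET_TIMEOUT,
                RemoteInfo.DEFAULT_CONNECT_TIMEOUT
            );

            BulkByScrollResponse response = reindex().source("source").setRemoteInfo(remote).destination("dest").get();
            assertEquals(1L, response.getCreated());
        }
    }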

@samxbr samxbr (Contributor Author) commented Nov 5, 2025

I think it is a bug that reindex metrics are currently only recorded when the task is a reindex worker and not a leader, meaning that if slicing is enabled there will be no metrics, as this listener is not wrapped with metrics when it is used here. We want to record a single metric for the parent task instead of one per child task, but I didn't find an easy way to fix this, since each slice is an independent BulkByScrollAction.

Since we don't support remote slicing, I don't think this affects the remote metrics, but it would probably make the metrics inaccurate for non-remote reindex when slicing is used.

I am not saying we need to fix it in this PR (I think this PR can go ahead without it), just raising it for awareness.

@PeteGillinElastic PeteGillinElastic (Member) commented:

Yeah, I agree that it is tangential to the work in hand — we only care about remote, where slicing is not available — but it definitely seems like something we should fix at some point. Do you want to create a bug in GitHub?

@samxbr samxbr added the :Data Management/Indices APIs and >non-issue labels Nov 5, 2025
@samxbr samxbr marked this pull request as ready for review November 5, 2025 04:28
@elasticsearchmachine elasticsearchmachine added the Team:Data Management label Nov 5, 2025
@elasticsearchmachine elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-data-management (Team:Data Management)

@PeteGillinElastic PeteGillinElastic (Member) left a comment

Thanks @samxbr. I haven't done a full review of this, since you said you were looking for early feedback, but here are some initial thoughts.

We also talked about trying to capture lost operations (due to node restart). Did you look at how the existing chart for that works? Is it grepping the logs? Do you know where that logging is done? I'm wondering whether we want to try to figure out how to distinguish remote vs local there, too.

private final LongHistogram reindexSuccessHistogram;
private final LongHistogram reindexSuccessHistogramRemote;
private final LongHistogram reindexFailureHistogram;
private final LongHistogram reindexFailureHistogramRemote;

@PeteGillinElastic PeteGillinElastic (Member) commented:

Do you know the pros and cons of using multiple metrics here, vs using one metric with a couple of attributes, one to indicate local/remote and one to indicate success/failure?

FWIW, OTel seems to recommend setting the error.type attribute to indicate a failure (with the absence of that indicating success). They have advice on the values for that attribute.

For the local/remote one, I guess it'd have to be a domain-specific attribute. I don't know if there are naming conventions for that or anything.
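
For comparison, a sketch of the single-metric-with-attributes shape (the "source" attribute name is made up for illustration; error.type follows the OTel convention mentioned above):

    import java.util.Map;

    import org.elasticsearch.telemetry.metric.LongHistogram;
    import org.elasticsearch.telemetry.metric.MeterRegistry;

    public class ReindexMetricsWithAttributes {
        private final LongHistogram duration;

        public ReindexMetricsWithAttributes(MeterRegistry registry) {
            // One histogram instead of four; local/remote and success/failure become attributes.
            this.duration = registry.registerLongHistogram(
                "es.reindex.duration.histogram", "Time taken by reindex operations", "s");
        }

        public void record(long tookSeconds, boolean remote, Exception failure) {
            String source = remote ? "remote" : "local";
            if (failure == null) {
                // Success: omit error.type, so its absence indicates success.
                duration.record(tookSeconds, Map.of("source", source));
            } else {
                // Failure: error.type carries a low-cardinality failure class.
                duration.record(tookSeconds, Map.of("source", source, "error.type", failure.getClass().getSimpleName()));
            }
        }
    }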


Labels

:Data Management/Indices APIs, >non-issue, Team:Data Management, v9.3.0