-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add more metrics for reindex #137597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add more metrics for reindex #137597
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally there should be a test to verify the metrics for reindex from remote, but I can't get a successful reindex from remote run with internal cluster tests, the existing ones seem to be REST tests. But with REST tests I didn't find a way verify the metrics. So the new non-remote metrics are verified in internal cluster tests, whereas the remote metrics are verified in unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to get a remote reindex (hitting the same cluster, but via the remote mechanism) succeeding in this class with the following changes:
- Add
MainRestPlugin.classto the list returned bynodePlugins(). Needed because we'll be doing REST calls to the remote. - Enable real HTTP, for the same reason:
@Override protected boolean addMockHttpTransport() { return false; }
- Whitelist everything:
@Override protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) { return Settings.builder() .put(super.nodeSettings(nodeOrdinal, otherSettings)) .put(TransportReindexAction.REMOTE_CLUSTER_WHITELIST.getKey(), "*:*") .build(); }
- Set the remote info on the request, using the hostname and port for an arbitrary node from the cluster:
InetSocketAddress remoteAddress = randomFrom(cluster().httpAddresses()); RemoteInfo remote = new RemoteInfo( "http", remoteAddress.getHostName(), remoteAddress.getPort(), null, new BytesArray("{\"match_all\":{}}"), null, null, emptyMap(), RemoteInfo.DEFAULT_SOCKET_TIMEOUT, RemoteInfo.DEFAULT_CONNECT_TIMEOUT ); reindex().source("source").setRemoteInfo(remote).destination("dest").get();
(This was heavily inspired by RetryTests, which is a Java REST test, but I was able to extract the bits that seemed relevant.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a bug that currently reindex metrics are only recorded if the task is a reindex worker and not a leader, meaning if slicing is enabled there will be not metrics, as this listener is not wrapped with metrics when being used here.
We want to record a single metric for the parent task instead of each child task, I didn't find an easy way to fix this since each slice is an independent BulkByScrollAction.
Since we don't support remote slicing, I don't think this would affect the remote metrics, but would probably make the metrics inaccurate for non-remote reindex if slicing is used.
I am not saying we need to fix it for this PR (I think this PR can go without it), just raising this for awareness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree that it is tangential to the work in hand — we only care about remote, where slicing is not available — but it definitely seems like something we should fix at some point. Do you want to create a bug in github?
|
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @samxbr . I haven't done a full review on this, since you said you were looking for early feedback, but here are some initial thoughts.
We also talked about attempting to get lost operations (due to node restart). Did you look at how the existing chart for that works? Is it grepping the logs? Do you know where that logging is done? I'm wondering whether we want to try to figure out how to distinguish remote vs local in there, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to get a remote reindex (hitting the same cluster, but via the remote mechanism) succeeding in this class with the following changes:
- Add
MainRestPlugin.classto the list returned bynodePlugins(). Needed because we'll be doing REST calls to the remote. - Enable real HTTP, for the same reason:
@Override protected boolean addMockHttpTransport() { return false; }
- Whitelist everything:
@Override protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) { return Settings.builder() .put(super.nodeSettings(nodeOrdinal, otherSettings)) .put(TransportReindexAction.REMOTE_CLUSTER_WHITELIST.getKey(), "*:*") .build(); }
- Set the remote info on the request, using the hostname and port for an arbitrary node from the cluster:
InetSocketAddress remoteAddress = randomFrom(cluster().httpAddresses()); RemoteInfo remote = new RemoteInfo( "http", remoteAddress.getHostName(), remoteAddress.getPort(), null, new BytesArray("{\"match_all\":{}}"), null, null, emptyMap(), RemoteInfo.DEFAULT_SOCKET_TIMEOUT, RemoteInfo.DEFAULT_CONNECT_TIMEOUT ); reindex().source("source").setRemoteInfo(remote).destination("dest").get();
(This was heavily inspired by RetryTests, which is a Java REST test, but I was able to extract the bits that seemed relevant.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree that it is tangential to the work in hand — we only care about remote, where slicing is not available — but it definitely seems like something we should fix at some point. Do you want to create a bug in github?
| private final LongHistogram reindexSuccessHistogram; | ||
| private final LongHistogram reindexSuccessHistogramRemote; | ||
| private final LongHistogram reindexFailureHistogram; | ||
| private final LongHistogram reindexFailureHistogramRemote; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know the pros and cons of using multiple metrics here, vs using one metric with a couple of attributes, one to indicate local/remote and one to indicate success/failure?
FWIW, OTel seems to recommend setting the error.type attribute to indicate a failure (with the absence of that indicating success). They have advice on the values for that attribute.
For the local/remote one, I guess it'd have to be a domain-specific attribute. I don't know if there are naming conventions for that or anything.
Adds a couple more metrics for reindex that would help tracking usage:
es.reindex.duration.histogram: operation time for all reindexes.reindex.duration.histogram.remote: operation time for reindex from remotees.reindex.completion.success: # of successful reindexes.reindex.completion.success.remote: # of successful reindex from remotees.reindex.completion.failure: # of failed reindexes.reindex.completion.failure.remote: # of failed reindex from remote