Skip to content

Conversation

@jfreden
Copy link
Contributor

@jfreden jfreden commented Nov 4, 2025

This PR adds resilience to the metadata_flattened security migration that was reported to have failed on clusters where concurrent role modifications happened while the migration was running. In the normal case this is not expected to happen, but for a very large number of roles or very frequent role updates a version conflict could occur.

The change adds logic to:

  • Handle version conflicts
  • Handle shard failures
  • Handle timeouts
  • Trigger immediate retries in the framework if a failure occurs
  • Bump the number of retries

Resolves: #110532

@jfreden jfreden added the test-full-bwc Trigger full BWC version matrix tests label Nov 4, 2025
waitForMigrationCompletion(SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION);
// First migration is on a new index, so should skip all migrations. If we reset, it should re-trigger and run all migrations
resetMigration();
// Wait for the first migration to finish
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now the first migration so we don't need this line anymore.

masterNode,
SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION
);
CountDownLatch awaitMigrations = awaitMigrationVersionUpdates(masterNode, SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Order of migrations changed, this should have been lastKey from the start.

project,
migrationsVersion
);
var persistentTaskCustomMetadata = PersistentTasksCustomMetadata.get(project.metadata());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a migration is running, its persistent task will be present in cluster state, when it's not it will not be present in cluster state. When a persistent task completes (failure or success) it's removed from cluster state. We want to make sure that an index state change is triggered when a persistent task fails to make sure it's retried immediately, that's why we need this state here.


public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 1;
public static final Integer CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION = 2;
public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 3;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bumping the version here to make sure this migration runs again with proper error handling, I'm also "removing" the old migration since we don't need it anymore.

if (response.getHits().getTotalHits().value() > 0) {
logger.info("Preparing to migrate [" + response.getHits().getTotalHits().value() + "] roles");
updateRolesByQuery(indexManager, client, filterQuery, listener);
if (response.isTimedOut() == false && response.getFailedShards() == 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added error handling here to make sure we don't mark as migrated if this initial search fails silently for some reason.

@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from 436f987 to 37737b7 Compare November 4, 2025 15:12

public static final IndexVersion REENABLED_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_042_0_00, Version.LUCENE_10_3_1);
public static final IndexVersion SKIPPERS_ENABLED_BY_DEFAULT = def(9_043_0_00, Version.LUCENE_10_3_1);
public static final IndexVersion SECURITY_MIGRATIONS_METADATA = def(9_044_0_00, Version.LUCENE_10_3_1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New index version is needed to make sure we skip migration for brand new index.

@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from 37737b7 to 682da2d Compare November 4, 2025 15:14
public static class Manager {

private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 10;
private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 1000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty significant bump because we have no idea how many times the migration would need to be retried before it's successful. In the extreme case where we have 2M roles and frequent updates 1000 doesn't feel like a crazy number, but it's also very difficult to verify this.

There is no good reason to not allow this to be very large. The point of this is to make sure that security migrations are not retried forever.

@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from 682da2d to e5f599a Compare November 4, 2025 15:25
@jfreden jfreden force-pushed the add_cleanup_metadata_flattened branch from e5f599a to 77ac083 Compare November 4, 2025 15:25
@jfreden jfreden added the :Security/Security Security issues without another label label Nov 4, 2025
@jfreden jfreden marked this pull request as ready for review November 4, 2025 15:27
@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Nov 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine
Copy link
Collaborator

Hi @jfreden, I've created a changelog YAML for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement :Security/Security Security issues without another label Team:Security Meta label for security team test-full-bwc Trigger full BWC version matrix tests v8.19.7 v9.1.7 v9.2.1 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve .security index migration resiliency

2 participants