Improve security migration resilience by handling version conflicts #137558

jfreden · 2025-11-04T09:58:31Z

This PR adds resilience to the metadata_flattened security migration, which was reported to fail on clusters where roles were modified concurrently while the migration was running. This is not expected in the normal case, but with a very large number of roles or very frequent role updates a version conflict can occur.

The change adds logic to:

Handle version conflicts
Handle shard failures
Handle timeouts
Trigger immediate retries in the framework if a failure occurs
Bump the number of retries
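
As a rough illustration of the first three items above, a migration step could treat any version conflict, shard failure, or timeout as a reason to retry instead of marking the migration as done. This is a minimal, self-contained sketch and not the code in this PR: the class, enum, and parameter names are hypothetical, and the real change works on Elasticsearch response objects rather than plain values.

```java
// Hypothetical sketch: none of these names are taken from the PR.
final class MigrationOutcomeSketch {

    enum Outcome { SUCCESS, RETRY }

    // Classify the result of one migration attempt. Any timeout, version
    // conflict, or shard failure means the attempt cannot be trusted as
    // complete, so the caller should retry instead of marking it migrated.
    static Outcome classify(boolean timedOut, long versionConflicts, int shardFailures) {
        if (timedOut || versionConflicts > 0 || shardFailures > 0) {
            return Outcome.RETRY;
        }
        return Outcome.SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(classify(false, 0, 0)); // clean run
        System.out.println(classify(false, 3, 0)); // concurrent role updates caused conflicts
    }
}
```

The point of the sketch is only that a partially applied attempt is never reported as success.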

Resolves: #110532

jfreden · 2025-11-04T12:39:04Z

...t/java/org/elasticsearch/xpack/security/support/CleanupRoleMappingDuplicatesMigrationIT.java

        waitForMigrationCompletion(SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION);
        // First migration is on a new index, so should skip all migrations. If we reset, it should re-trigger and run all migrations
        resetMigration();
-        // Wait for the first migration to finish


This is now the first migration so we don't need this line anymore.

jfreden · 2025-11-04T12:39:38Z

...t/java/org/elasticsearch/xpack/security/support/CleanupRoleMappingDuplicatesMigrationIT.java

-            masterNode,
-            SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION
-        );
+        CountDownLatch awaitMigrations = awaitMigrationVersionUpdates(masterNode, SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey());


The order of migrations changed. This should have used lastKey from the start.

jfreden · 2025-11-04T12:41:43Z

...in/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityIndexManager.java

            project,
            migrationsVersion
        );
+        var persistentTaskCustomMetadata = PersistentTasksCustomMetadata.get(project.metadata());


While a migration is running, its persistent task is present in cluster state; otherwise it is absent. When a persistent task completes (with failure or success), it is removed from cluster state. We want an index state change to be triggered when a persistent task fails so that the migration is retried immediately, which is why this state is needed here.
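
One way to picture the reasoning in this comment: if the task was present in the previous state, is absent in the current one, and the applied migration version did not advance, then the task must have failed and a retry is due. The sketch below is an assumption about how such a check could look, not the logic in SecurityIndexManager; all names and parameters are hypothetical.

```java
// Hypothetical sketch: the real change tracks this inside the index state.
final class MigrationTaskStateSketch {

    // A task that disappears from cluster state has completed. If the applied
    // migration version did not advance at the same time, the task failed and
    // the migration should be retried right away.
    static boolean shouldRetryNow(boolean taskWasPresent, boolean taskIsPresent, int versionBefore, int versionAfter) {
        boolean taskJustFinished = taskWasPresent && taskIsPresent == false;
        boolean versionAdvanced = versionAfter > versionBefore;
        return taskJustFinished && versionAdvanced == false;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetryNow(true, false, 2, 2)); // task gone, version unchanged
        System.out.println(shouldRetryNow(true, false, 2, 3)); // task gone, version advanced
    }
}
```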

jfreden · 2025-11-04T12:43:04Z

...ugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java


-    public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 1;
    public static final Integer CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION = 2;
+    public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 3;


I'm bumping the version here so that this migration runs again with proper error handling. I'm also "removing" the old migration since we don't need it anymore.

jfreden · 2025-11-04T12:44:01Z

...ugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java

-                if (response.getHits().getTotalHits().value() > 0) {
-                    logger.info("Preparing to migrate [" + response.getHits().getTotalHits().value() + "] roles");
-                    updateRolesByQuery(indexManager, client, filterQuery, listener);
+                if (response.isTimedOut() == false && response.getFailedShards() == 0) {


Added error handling here to make sure we don't mark as migrated if this initial search fails silently for some reason.
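
The guard in the diff above can be read as a three-way decision. The sketch below spells that out with plain values; only the `isTimedOut() == false && getFailedShards() == 0` condition comes from the diff. What happens in the failure branch is not shown in this excerpt, so treating it as "fail and retry" is an assumption, and the class and enum names are hypothetical.

```java
// Hypothetical sketch of the decision around the initial search.
final class InitialSearchCheckSketch {

    enum Decision { MIGRATE, NOTHING_TO_MIGRATE, FAIL_AND_RETRY }

    // Only trust the hit count when the search neither timed out nor lost
    // shards. An incomplete search may under-report the roles that still need
    // migrating, so it must not be treated as "nothing to do".
    static Decision decide(boolean timedOut, int failedShards, long totalHits) {
        if (timedOut == false && failedShards == 0) {
            return totalHits > 0 ? Decision.MIGRATE : Decision.NOTHING_TO_MIGRATE;
        }
        return Decision.FAIL_AND_RETRY; // assumption: failure is surfaced so the framework retries
    }

    public static void main(String[] args) {
        System.out.println(decide(false, 0, 42));
        System.out.println(decide(false, 1, 0));
    }
}
```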

jfreden · 2025-11-04T15:13:20Z

server/src/main/java/org/elasticsearch/index/IndexVersions.java


    public static final IndexVersion REENABLED_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_042_0_00, Version.LUCENE_10_3_1);
    public static final IndexVersion SKIPPERS_ENABLED_BY_DEFAULT = def(9_043_0_00, Version.LUCENE_10_3_1);
+    public static final IndexVersion SECURITY_MIGRATIONS_METADATA = def(9_044_0_00, Version.LUCENE_10_3_1);


A new index version is needed to make sure we skip the migration for a brand-new index.
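
A possible shape for that check, assuming the index creation version is compared against the new constant: an index created at or after the new version already has the current format and can skip the migration. Only the numeric id `9_044_0_00` comes from the diff; the method, the class, and the comparison itself are hypothetical and may differ from what the PR does.

```java
// Hypothetical sketch: plain int ids stand in for IndexVersion.
final class NewIndexSkipSketch {

    // Id taken from the diff above; everything else here is an assumption.
    static final int SECURITY_MIGRATIONS_METADATA_ID = 9_044_0_00;

    // An index created on or after the new index version is assumed to be
    // "brand new" and therefore has nothing to migrate.
    static boolean canSkipMigration(int indexCreatedVersionId) {
        return indexCreatedVersionId >= SECURITY_MIGRATIONS_METADATA_ID;
    }

    public static void main(String[] args) {
        System.out.println(canSkipMigration(9_043_0_00)); // created before the new version
        System.out.println(canSkipMigration(9_044_0_00)); // created at the new version
    }
}
```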

jfreden · 2025-11-04T15:23:25Z

...ugin/security/src/main/java/org/elasticsearch/xpack/security/support/SecurityMigrations.java

    public static class Manager {

-        private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 10;
+        private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 1000;


This is a significant bump because we have no idea how many times the migration needs to be retried before it succeeds. In the extreme case of 2M roles and frequent updates, 1000 doesn't feel like a crazy number, but it's also very difficult to verify.

There is no good reason not to allow this to be very large. The point of the limit is to make sure that security migrations are not retried forever.
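
The role of the attempt limit can be shown with a small bounded retry loop. This is a generic sketch and not the migration framework's retry mechanism: `runWithRetries` and its parameters are hypothetical, and the real framework retries across cluster state changes rather than in a tight loop.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a bounded retry loop.
final class BoundedRetrySketch {

    // Runs the attempt until it reports success or maxAttempts is reached.
    // Returns the number of attempts used, or -1 if every attempt failed.
    static int runWithRetries(BooleanSupplier attempt, int maxAttempts) {
        for (int used = 1; used <= maxAttempts; used++) {
            if (attempt.getAsBoolean()) {
                return used;
            }
        }
        return -1; // the cap guarantees we stop eventually
    }

    public static void main(String[] args) {
        int[] calls = { 0 };
        // Succeeds on the third call; a cap of 1000 leaves plenty of headroom.
        System.out.println(runWithRetries(() -> ++calls[0] >= 3, 1000));
    }
}
```

A large cap costs nothing when the migration succeeds early; it only matters as the upper bound in the failing case.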

elasticsearchmachine · 2025-11-04T15:28:09Z

Pinging @elastic/es-security (Team:Security)

elasticsearchmachine · 2025-11-04T16:00:53Z

Hi @jfreden, I've created a changelog YAML for you.

jfreden added the test-full-bwc Trigger full BWC version matrix tests label Nov 4, 2025

elasticsearchmachine added the v9.3.0 label Nov 4, 2025

jfreden force-pushed the add_cleanup_metadata_flattened branch from 436f987 to 37737b7 Compare November 4, 2025 15:12

jfreden force-pushed the add_cleanup_metadata_flattened branch from 37737b7 to 682da2d Compare November 4, 2025 15:14

jfreden force-pushed the add_cleanup_metadata_flattened branch from 682da2d to e5f599a Compare November 4, 2025 15:25

Improve security migration resilience by handling version conflicts

77ac083

jfreden force-pushed the add_cleanup_metadata_flattened branch from e5f599a to 77ac083 Compare November 4, 2025 15:25

jfreden added the :Security/Security Security issues without another label label Nov 4, 2025

jfreden marked this pull request as ready for review November 4, 2025 15:27

elasticsearchmachine added the Team:Security Meta label for security team label Nov 4, 2025

jfreden added branch:9.2 branch:9.1 branch:8.19 and removed Team:Security Meta label for security team labels Nov 4, 2025

jfreden requested a review from a team November 4, 2025 15:28

elasticsearchmachine added Team:Security Meta label for security team v9.2.1 v8.19.7 v9.1.7 and removed branch:9.2 branch:9.1 branch:8.19 labels Nov 4, 2025

jfreden added the >enhancement label Nov 4, 2025

jfreden and others added 5 commits November 4, 2025 17:00

Update docs/changelog/137558.yaml

1e5cb67

Merge branch 'main' into add_cleanup_metadata_flattened

a1287e6

fixup! Simplify and fix name

17d9ebb

Merge remote-tracking branch 'upstream/main' into add_cleanup_metadat…

f371a09

…a_flattened

fixup! Name

49dc7e9
