Improve security migration resilience by handling version conflicts #137558
Conversation
waitForMigrationCompletion(SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION);
// First migration is on a new index, so should skip all migrations. If we reset, it should re-trigger and run all migrations
resetMigration();
// Wait for the first migration to finish
This is now the first migration so we don't need this line anymore.
masterNode,
SecurityMigrations.CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION
);
CountDownLatch awaitMigrations = awaitMigrationVersionUpdates(masterNode, SecurityMigrations.MIGRATIONS_BY_VERSION.lastKey());
The order of migrations changed; this should have been lastKey from the start.
project,
migrationsVersion
);
var persistentTaskCustomMetadata = PersistentTasksCustomMetadata.get(project.metadata());
When a migration is running, its persistent task is present in cluster state; when it's not running, the task is absent. When a persistent task completes (with failure or success), it's removed from cluster state. We want to make sure an index state change is triggered when a persistent task fails, so that the migration is retried immediately; that's why we need this state here.
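For readers less familiar with persistent tasks, here is a minimal sketch of the check this state enables, reusing the PersistentTasksCustomMetadata.get(project.metadata()) lookup from the diff above; the surrounding variables and the use of SecurityMigrationTaskParams.TASK_NAME as the task id are assumptions for illustration, not the PR's exact code.

```java
// Sketch only: the `project` variable and the task id constant are assumptions.
var persistentTaskCustomMetadata = PersistentTasksCustomMetadata.get(project.metadata());
// The persistent task is present in cluster state only while the migration is
// running; it is removed when the task completes, whether it succeeded or failed.
boolean migrationTaskPresent = persistentTaskCustomMetadata != null
    && persistentTaskCustomMetadata.getTask(SecurityMigrationTaskParams.TASK_NAME) != null;
// If the task is gone but the migration version was never bumped, the run failed,
// and the next index state change should re-trigger the migration immediately.
```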
public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 1;
public static final Integer CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION = 2;
public static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 3;
I'm bumping the version here to make sure this migration runs again with proper error handling. I'm also "removing" the old migration since we don't need it anymore.
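To make the effect of the renumbering concrete, here is a hedged sketch assuming MIGRATIONS_BY_VERSION is a sorted map keyed by these version constants (as the lastKey() call earlier suggests); the map contents and types below are illustrative only, not the PR's actual code.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative stand-in for SecurityMigrations.MIGRATIONS_BY_VERSION; Runnable is a
// placeholder for the real migration type, and the registered entries are assumed.
class MigrationVersionSketch {
    static final Integer CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION = 2;
    static final Integer ROLE_METADATA_FLATTENED_MIGRATION_VERSION = 3;

    public static void main(String[] args) {
        NavigableMap<Integer, Runnable> migrationsByVersion = new TreeMap<>();
        migrationsByVersion.put(CLEANUP_ROLE_MAPPING_DUPLICATES_MIGRATION_VERSION, () -> {});
        // Re-registering the flattened-metadata migration under version 3 (and dropping
        // the old version-1 entry) means a cluster that already recorded version 1 or 2
        // still sees a higher pending version, so the migration runs again.
        migrationsByVersion.put(ROLE_METADATA_FLATTENED_MIGRATION_VERSION, () -> {});
        // lastKey() is what the test above waits for via awaitMigrationVersionUpdates.
        System.out.println("latest migration version = " + migrationsByVersion.lastKey()); // 3
    }
}
```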
if (response.getHits().getTotalHits().value() > 0) {
logger.info("Preparing to migrate [" + response.getHits().getTotalHits().value() + "] roles");
updateRolesByQuery(indexManager, client, filterQuery, listener);
if (response.isTimedOut() == false && response.getFailedShards() == 0) {
Added error handling here to make sure we don't mark as migrated if this initial search fails silently for some reason.
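A sketch of one way such a guard could be structured, built around the isTimedOut() / getFailedShards() check shown in the diff; the surrounding method, the listener calls, and the exception message are assumptions rather than the PR's actual code.

```java
// Sketch only: assumes an ActionListener<Void> named `listener` and the variables
// from the diff above; the structure is illustrative, not the PR's exact code.
if (response.isTimedOut() == false && response.getFailedShards() == 0) {
    if (response.getHits().getTotalHits().value() > 0) {
        logger.info("Preparing to migrate [" + response.getHits().getTotalHits().value() + "] roles");
        updateRolesByQuery(indexManager, client, filterQuery, listener);
    } else {
        // No roles need migrating, so the migration can be marked as complete.
        listener.onResponse(null);
    }
} else {
    // A timed-out search or failed shards could silently miss roles that still need
    // migrating, so fail the attempt instead of marking the migration as done.
    listener.onFailure(new IllegalStateException("search for roles to migrate timed out or had shard failures"));
}
```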
public static final IndexVersion REENABLED_TIMESTAMP_DOC_VALUES_SPARSE_INDEX = def(9_042_0_00, Version.LUCENE_10_3_1);
public static final IndexVersion SKIPPERS_ENABLED_BY_DEFAULT = def(9_043_0_00, Version.LUCENE_10_3_1);
public static final IndexVersion SECURITY_MIGRATIONS_METADATA = def(9_044_0_00, Version.LUCENE_10_3_1);
A new index version is needed to make sure we skip the migrations for a brand new index.
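A hedged sketch of how such a version gate is typically applied, assuming the new constant lives alongside the others shown above and is compared against the index's creation version; the helper method is hypothetical.

```java
// Sketch only: the helper name and how the index metadata is obtained are assumptions.
static boolean isCreatedOnOrAfterMigrationsMetadata(IndexMetadata indexMetadata) {
    // A security index created at or after this version is brand new, so the
    // migrations can be skipped and the latest migration version recorded directly.
    return indexMetadata.getCreationVersion().onOrAfter(IndexVersions.SECURITY_MIGRATIONS_METADATA);
}
```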
public static class Manager {

private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 10;
private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 1000;
This is a pretty significant bump because we have no idea how many times the migration would need to be retried before it's successful. In the extreme case where we have 2M roles and frequent updates, 1000 doesn't feel like a crazy number, but it's also very difficult to verify this.
There is no good reason not to allow this to be very large; the point of the limit is only to make sure that security migrations are not retried forever.
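As an illustration of the bound this constant provides, a minimal sketch follows; the counter field and method names are hypothetical.

```java
// Sketch: the attempt counter and method are illustrative, not the PR's actual code.
private static final int MAX_SECURITY_MIGRATION_ATTEMPT_COUNT = 1000;

boolean shouldRetryMigration(int previousAttempts) {
    // Version conflicts from concurrent role updates can fail individual attempts,
    // so the bound is generous; it only exists to stop retrying forever.
    return previousAttempts < MAX_SECURITY_MIGRATION_ATTEMPT_COUNT;
}
```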
Pinging @elastic/es-security (Team:Security)

Hi @jfreden, I've created a changelog YAML for you.
This PR adds resilience to the metadata_flattened security migration, which was reported to have failed on clusters where concurrent role modifications happened while the migration was running. In the normal case this is not expected to happen, but with a very large number of roles or very frequent role updates, a version conflict could occur. The change adds logic to:
Resolves: #110532
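For context on the version conflicts mentioned in the description, a hedged sketch of how an update-by-query over the security index can be made tolerant of concurrent role updates; the index name, the surrounding variables, and whether this is the exact mechanism used in the PR are assumptions.

```java
// Sketch only: illustrates tolerating version conflicts in an update-by-query;
// this is not necessarily the exact fix in this PR.
UpdateByQueryRequest request = new UpdateByQueryRequest(".security-7"); // index name assumed
request.setQuery(filterQuery);
// Proceed past documents whose version changed underneath us (a concurrent role
// update) instead of aborting the whole migration; any skipped documents can be
// picked up by a later retry of the migration.
request.setAbortOnVersionConflict(false);
client.execute(UpdateByQueryAction.INSTANCE, request, listener);
```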