Fix dense fragment domains during global order write with maximum fragment size #5655
Conversation
```cpp
void FragmentMetadata::set_num_tiles(uint64_t num_tiles) {
  if (dense_) {
    const uint64_t dense_tile_num = tile_num();
    iassert(num_tiles <= dense_tile_num);
```
This is the assert which is tripped without the fix. When the global order writer doesn't update the fragment domain, each fragment gets the whole domain but only a fraction of the tiles. `tile_num()` is computed from the fragment domain, so the two do not agree.
```cpp
/* CONSTRUCTORS & DESTRUCTORS */
/* ****************************** */

Query::CoordsInfo::CoordsInfo()
```
I'm seeing a weird "possible use of uninitialized value" error in release builds; I changed this to try to rule something out. No dice, but having a constructor feels better.
Non-const accessors.
```cpp
 * @return `a * b` if it can be represented as a `uint64_t` without undefined
 * behavior, `std::nullopt` otherwise
 */
static std::optional<uint64_t> mul(uint64_t a, uint64_t b) {
```
You can perhaps use/modify `utils::math::safe_mul`.
The scope of this is broader than it looks at first glance: `safe_mul` handles floats, while this new `mul` doesn't. I agree that it would be better to consolidate the functions, but I am reluctant to do so in this already very large pull request.
This pull request is made in support of #5655, which remains in draft; the intent is to pull these pieces out of that in order to have a more focused code review of both pull requests.

We have a container class `IndexedList` which is used to store non-movable data (and thus not usable in `std::vector`) with indexing (and thus not usable in `std::list`). The implementation uses both `std::list` and `std::vector` underneath. This pull request makes some improvements to this class:

- we add the `splice` method, whose semantics are the same as those of the `std::list` `splice` method. This method transfers elements from one `IndexedList` to another. In support of this we add new `iterator` and `const_iterator` types.
- we provide the constructor definition in the header so that implementations do not have to declare their own specialization.
- we update the `resize` method to accept variadic copyable arguments which will be used to initialize each of the new elements of the container.

We also add unit tests for `splice` correctness, and for `resize` avoiding the move constructor.

---
TYPE: NO_HISTORY
DESC: Add `IndexedList::splice`
I moved the definitions used in this file from `test/support/src/array_templates.h` into `test/support/src/array_schema_templates.h`. I did this because the first header had some API stuff which required including `tiledb.h`, which was not compatible with using these structures in a unit test.
The changes to this file follow up on that, applying similar reasoning to our rapidcheck generators.
There is some actual new code in this file, such as the "bound" arguments to `make_extent`, `make_domain`, and `make_dimension`.
I moved the definitions used in this file from `test/support/src/array_templates.h` into `test/support/src/array_schema_templates.h`. I did this because the first header had some API stuff which required including `tiledb.h`, which was not compatible with using these structures in a unit test.
The changes to this file follow up on that, splitting out some `show` definitions so we can pull in definitions for `tiledb::test::templates::Dimension` without anything to do with queries.
I moved the definitions used in this file from `test/support/src/array_templates.h` into `test/support/src/array_schema_templates.h`. I did this because the first header had some API stuff which required including `tiledb.h`, which was not compatible with using these structures in a unit test.
As such there are only trivial changes to this file, except for the relaxation of the `AttributeType` concept, which previously referred to definitions that are not available in this file. Since it's test code only, hopefully we can live with that.
A first pass
This pull request does some refactoring and enhancing of our test data structures `struct Fragment1D` and `struct Fragment2D`, which are used as containers for inputs to write queries and outputs of read queries.

1) we move the methods which generically create arrays using these and write fragments using these to a header file instead of `sparse_global_order_reader.cc`
2) we refactor functionality into a `template <typename DimensionTuple, typename AttributeTuple> struct Fragment` which is a common base class for `Fragment1D`, `Fragment2D`, and a new `Fragment3D`, and can be used with an empty dimension tuple for dense fragments
3) we implement various fixes to enable variable-length dimensions
4) some other incidentals

This pull request has been refactored out of #5646 for two reasons: 1) it is all enhancements to test code, which may be distracting from the functional components of #5646; 2) I anticipate using it to add additional tests to #5655. Specifically, using the `Fragment` with an empty dimension tuple will be very nice for writing tests on dense arrays.

There's no direct testing of this, but since the code was factored out of #5646, all of it was used in testing there. Many components are exercised by the refactor of the existing tests.

---
TYPE: NO_HISTORY
DESC: refactors and enhancements to test data structure `struct Fragment`
I have completed all of my TODOs one way or another; as long as CI passes I believe this is ready to go!
The changes look reasonable to me. I won't say I was able to do a thorough review of 4.5k lines of code in one shot, but the effort put into testing is reassuring and the code quality is good, so I am not too concerned.
One small worry is about sparse global order writes, in case we don't have enough test coverage to catch a potential accidental regression there. I would appreciate it if you could double-check that.
Thanks for solving this complex issue!
```cpp
tile_num = (tiles.empty() ? 0 : tiles.begin()->second.size());

const auto fragments = identify_fragment_tile_boundaries(tiles);

for (uint64_t f = 0; f < fragments.tile_offsets_.size(); f++) {
```
The changes here in the `global_write` function also affect sparse global order writes; is my understanding correct?
If yes, do we have enough test coverage to be sure no regression was introduced for sparse arrays?
I ran `tiledb_unit` with the following code in the `GlobalOrderWriter` constructor:
```diff
diff --git a/tiledb/sm/query/writers/global_order_writer.cc b/tiledb/sm/query/writers/global_order_writer.cc
index 85abf2281..474fe015a 100644
--- a/tiledb/sm/query/writers/global_order_writer.cc
+++ b/tiledb/sm/query/writers/global_order_writer.cc
@@ -165,6 +165,8 @@ class GlobalOrderWriterException : public StatusException {
 /* CONSTRUCTORS & DESTRUCTORS */
 /* ****************************** */

+static std::atomic<uint64_t> num_sparse = 0;
+
 GlobalOrderWriter::GlobalOrderWriter(
     stats::Stats* stats,
     shared_ptr<Logger> logger,
@@ -201,6 +203,13 @@ GlobalOrderWriter::GlobalOrderWriter(
         "Failed to initialize global order writer. Global order writes to "
         "ordered attributes are not yet supported.");
   }
+
+  if (!dense()) {
+    num_sparse++;
+    std::cout << "New sparse writer, max_fragment_size = "
+              << max_fragment_size_.value_or(0) << ", " << num_sparse
+              << " writers created" << std::endl;
+  }
 }

 GlobalOrderWriter::~GlobalOrderWriter() {
```
It looks like we create 1000+ sparse global order writers. The cases I saw mostly do not use a max fragment size, except consolidation, which uses `std::numeric_limits<uint64_t>::max()`, but I did catch one using a smaller value.
So I believe the test coverage we have should be good enough.
Fixes CORE-290.
Report
In CORE-290 a customer reported issues with corrupt arrays after running consolidation. The symptom was memory allocation errors when opening an array. The root cause turned out to be that the consolidation itself was writing new fragments where the fragment domain did not match the number of tiles in the fragment.
Root cause analysis
Stepping through a minimal reproducer revealed that consolidation is not at fault at all; instead it is the global order writer's `max_fragment_size_` field. This field, presumably meant to be used for consolidation only but valid for other writes, instructs the writer to write multiple fragments, each under the requested size, as necessary. When a write splits its data into multiple fragments this way, nothing need be done for sparse arrays. But dense fragment metadata relies on the domain to determine the number of tiles, and the global order writer did not update the fragment metadata domain when splitting its write into multiple fragments. This pull request fixes that issue by ensuring that the global order writer updates the fragment metadata domain to reflect what was actually written into that fragment.
This is easier said than done. In CORE-290 I offered four ways we can fix this. The chosen solution:
Design
The fragment metadata domain is a bounding rectangle. This means that the global order writer must split the tiles of its input into fragments at tile boundaries which bisect the bounding rectangle into two smaller bounding rectangles.
To do so, we add a first pass, `identify_fragment_tile_boundaries`, which returns a list of tile offsets where new fragments will begin. Upon finishing a fragment, we use that tile offset to determine which rectangle within the target subarray the fragment actually represents, and update the fragment metadata accordingly. We use new functions `is_rectangular_domain` to determine whether a `(start_tile, num_tiles)` pair identifies a rectangle, and `domain_tile_offset` to compute that rectangle.

Much of the complexity comes from a usage of the global order writer which does happen in consolidation: multi-part writes. A user (or a consolidation operation) can set a domain `D` which it intends to write into, and then actually fill in all of the cells over multiple `submit` calls which stream in the cells to write. These cells are not required to be tile aligned. Because of that, and the need to write rectangular fragments, a single `submit` cannot always determine whether a tail of tiles belongs to its current fragment or must be deferred to the next. To get around this we keep those tiles in memory in the `global_write_state_` and prepend them to the user input in the next `submit`.
Testing

We add unit tests for `is_rectangular_domain` and `domain_tile_offset`, including both example tests and rapidcheck properties to assert general claims.

We add tests for the global order writer to make broad claims about what the writer is supposed to do with respect to the `max_fragment_size` parameter. We add examples and a rapidcheck test to exercise these claims.

Since the original report originated from consolidation, we also add some consolidation tests (TODO) mimicking the customer input.
TODO

- `FIXME` comment in `global_order_writer.cc`
- `last_tiles_buffering` which observes the amount of data buffered for a 3D write

TYPE: BUG
DESC: Fix dense global order writer use of `max_fragment_size_`