⚡️ Speed up function _merge_extracted_into_inferred_when_almost_the_same by 24% in PR #4112 (feat/track-text-source)
#4114
⚡️ This pull request contains optimizations for PR #4112

If you approve this dependent PR, these changes will be merged into the original PR branch `feat/track-text-source`.

📄 24% (0.24x) speedup for `_merge_extracted_into_inferred_when_almost_the_same` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime: 40.6 milliseconds → 32.6 milliseconds (best of 18 runs)

📝 Explanation and details
The optimized code achieves a 24% speedup through two key optimizations and one minor one:

**1. Improved `_minimum_containing_coords` function:**
- Replaced `np.vstack` with separate array creation followed by `np.column_stack`.
- The original version relied on `np.vstack`, causing redundant temporary arrays and inefficient memory access patterns. The optimized version pre-computes each coordinate array once, then combines them efficiently.

**2. Optimized comparison in `boxes_iou` function:**
- Changed `(inter_area / denom) > threshold` to `inter_area > (threshold * denom)`, which avoids the element-wise division.

**3. Minor optimization in boolean mask creation:**
- Replaced `boxes_almost_same.sum(axis=1).astype(bool)` with `np.any(boxes_almost_same, axis=1)`.
- `np.any` can short-circuit on the first True value and is semantically clearer, though the performance gain is minimal.

Test case analysis shows the optimizations are particularly effective for:
The optimizations maintain identical functionality while reducing computational overhead through better NumPy usage patterns and mathematical rearrangement.
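The comparison rearrangement and the mask change can be illustrated with a small standalone sketch. This is not the code from `pdfminer_processing.py`: the helper and function names below are hypothetical, and the sketch assumes every denominator is strictly positive, which is what makes multiplying both sides of the inequality by `denom` safe. Results can differ from the division form only through floating-point rounding at values that sit exactly on the threshold.

```python
import numpy as np


def pairwise_intersection_and_union(boxes_a, boxes_b):
    """Hypothetical helper: pairwise intersection and union areas.

    Boxes are rows of (x1, y1, x2, y2). Both returned arrays have shape
    (len(boxes_a), len(boxes_b)).
    """
    ax1, ay1, ax2, ay2 = (boxes_a[:, i][:, None] for i in range(4))
    bx1, by1, bx2, by2 = (boxes_b[:, i][None, :] for i in range(4))
    inter_w = np.clip(np.minimum(ax2, bx2) - np.maximum(ax1, bx1), 0, None)
    inter_h = np.clip(np.minimum(ay2, by2) - np.maximum(ay1, by1), 0, None)
    inter_area = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter_area
    return inter_area, union


def almost_same_by_division(boxes_a, boxes_b, threshold):
    """Threshold check written with an element-wise division."""
    inter_area, denom = pairwise_intersection_and_union(boxes_a, boxes_b)
    return (inter_area / denom) > threshold


def almost_same_by_multiplication(boxes_a, boxes_b, threshold):
    """Same check with the division moved to the other side (denom > 0)."""
    inter_area, denom = pairwise_intersection_and_union(boxes_a, boxes_b)
    return inter_area > (threshold * denom)


boxes_a = np.array([[0.0, 0.0, 10.0, 10.0], [20.0, 20.0, 30.0, 30.0]])
boxes_b = np.array(
    [
        [0.0, 0.0, 10.0, 10.0],
        [0.0, 0.0, 10.0, 5.0],
        [100.0, 100.0, 110.0, 110.0],
    ]
)

mask = almost_same_by_multiplication(boxes_a, boxes_b, threshold=0.75)

# Two ways to ask "does this row have at least one match?"
matched_via_sum = mask.sum(axis=1).astype(bool)
matched_via_any = np.any(mask, axis=1)
```

Both row-mask expressions give the same boolean vector; `np.any` states the intent directly and skips the intermediate integer sum.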
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-pr4112-2025-11-05T21.03.01` and push.

Note
Speeds up layout merging by optimizing bounding-box aggregation, boolean mask creation, and IOU comparison to avoid divisions.
- `unstructured/partition/pdf_image/pdfminer_processing.py`:
  - `_minimum_containing_coords`: precomputes `x1/y1/x2/y2` arrays and uses `np.column_stack` to build output; removes extra transpose.
  - `_merge_extracted_into_inferred_when_almost_the_same`: replaces `sum(...).astype(bool)` with `np.any(..., axis=1)` for match mask.
  - `boxes_iou`: replaces `(x/y) > t` with `x > t*y` to avoid divisions.

Written by Cursor Bugbot for commit 8a0335f. This will update automatically on new commits.
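To make the `_minimum_containing_coords` change concrete, here is a minimal sketch of the two construction styles. It is an illustration only: the function names are hypothetical, and it assumes each input is a plain `(n, 4)` array of `(x1, y1, x2, y2)` rows rather than the region objects used in the library.

```python
import numpy as np


def containing_coords_vstack(*box_sets):
    """Stack four coordinate rows, then transpose to shape (n, 4)."""
    return np.vstack(
        (
            np.min([b[:, 0] for b in box_sets], axis=0),
            np.min([b[:, 1] for b in box_sets], axis=0),
            np.max([b[:, 2] for b in box_sets], axis=0),
            np.max([b[:, 3] for b in box_sets], axis=0),
        )
    ).T


def containing_coords_column_stack(*box_sets):
    """Compute each coordinate array once, then join them as columns."""
    x1 = np.min([b[:, 0] for b in box_sets], axis=0)
    y1 = np.min([b[:, 1] for b in box_sets], axis=0)
    x2 = np.max([b[:, 2] for b in box_sets], axis=0)
    y2 = np.max([b[:, 3] for b in box_sets], axis=0)
    # column_stack produces the (n, 4) layout directly, so no transpose.
    return np.column_stack((x1, y1, x2, y2))


set_a = np.array([[0, 0, 5, 5], [10, 10, 20, 20]])
set_b = np.array([[1, -1, 6, 4], [8, 12, 18, 25]])

merged = containing_coords_column_stack(set_a, set_b)
```

Each output row is the smallest box containing the corresponding rows of every input set, and both functions return the same values.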