-
Notifications
You must be signed in to change notification settings - Fork 591
propagate trace processor errors to users instead of internal logging #3211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
ffdb86d to
620043d
Compare
Add optional error handler callbacks to BatchSpanProcessor to provide programmatic access to background errors while maintaining internal SDK logging for observability. Changes: - Add error_handler callback to BatchSpanProcessor for background errors (span drops, export failures during timer-based flushes) - Return errors from shutdown_with_timeout when spans were dropped - Maintain all internal otel_debug/otel_warn/otel_error logging - Update BatchSpanProcessor builder with with_error_handler() method - Change dropped_spans_count to Arc<AtomicUsize> for shared ownership - Both log AND invoke error handler when errors occur This gives users explicit control over error handling via opt-in callbacks while preserving default observability through internal SDK logging. The error handler is a supplement to logging, not a replacement. Addresses error handling requirements from the OpenTelemetry specification which states that SDKs MAY expose callbacks for self-diagnostics AND that errors should be logged when they cannot be returned to the caller.
620043d to
83c695a
Compare
| // This should be changed to return Result so users can be notified of export failures. | ||
| // The OpenTelemetry spec requires on_end to be fast/non-blocking, but doesn't mandate | ||
| // a void return - returning an error doesn't violate that requirement. | ||
| // Until the trait is changed, we cannot return this error to the caller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most spans end via Drop, which can't propagate errors. We should use the similar pattern as for batch processor here too.
| current_batch_size: Arc<AtomicUsize>, | ||
| max_export_batch_size: usize, | ||
| dropped_spans_count: AtomicUsize, | ||
| dropped_spans_count: Arc<AtomicUsize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need to be Arc?
| dropped_span_count = dropped_spans, | ||
| max_queue_size = max_queue_size, | ||
| message = "Spans were dropped due to a queue being full. The count represents the total count of spans dropped in the lifetime of this BatchSpanProcessor. Consider increasing the queue size and/or decrease delay between intervals." | ||
| name: "BatchSpanProcessor.Shutdown", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change event name? anything wrong with the existing name? Happy to improve it.
| message = "Spans were dropped due to a full queue. The count represents the total count of span records dropped in the lifetime of the BatchSpanProcessor. Consider increasing the queue size and/or decrease delay between intervals." | ||
| ); | ||
|
|
||
| // Also return error so user code can handle it programmatically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels incorrect. This is the shutdown() call, and that method is not failing here to warrant returning an Err - having items dropped due to buffer full is not a failure of shutdown itself.
Remove internal logging of errors in span processors and tracer provider in favor of returning errors to callers and providing an error handler callback mechanism for background operations.
Changes
This gives users explicit control over error handling rather than relying on internal SDK logging that may be missed or not actionable.
Fixes #3210
Merge requirement checklist
CHANGELOG.mdfiles updated for non-trivial, user-facing changes