Skip to content

Conversation

@svencowart
Copy link
Contributor

Remove internal logging of errors in span processors and tracer provider in favor of returning errors to callers and providing an error handler callback mechanism for background operations.

Changes

  • Add error_handler callback to BatchSpanProcessor for background errors (span drops, export failures during timer-based flushes)
  • Return errors from shutdown_with_timeout when spans were dropped
  • Remove otel_debug/otel_warn/otel_error calls from span processors
  • Simplify export error handling by propagating instead of logging
  • Update BatchSpanProcessor builder with with_error_handler() method
  • Change dropped_spans_count to Arc for shared ownership
  • Remove shutdown error logging from TracerProviderInner

This gives users explicit control over error handling rather than relying on internal SDK logging that may be missed or not actionable.

Fixes #3210

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@svencowart svencowart requested a review from a team as a code owner October 24, 2025 22:50
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 24, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: lalitb / name: Lalit Kumar Bhasin (35b0004)

Add optional error handler callbacks to BatchSpanProcessor to provide
programmatic access to background errors while maintaining internal
SDK logging for observability.

Changes:
- Add error_handler callback to BatchSpanProcessor for background errors
  (span drops, export failures during timer-based flushes)
- Return errors from shutdown_with_timeout when spans were dropped
- Maintain all internal otel_debug/otel_warn/otel_error logging
- Update BatchSpanProcessor builder with with_error_handler() method
- Change dropped_spans_count to Arc<AtomicUsize> for shared ownership
- Both log AND invoke error handler when errors occur

This gives users explicit control over error handling via opt-in callbacks
while preserving default observability through internal SDK logging. The
error handler is a supplement to logging, not a replacement.

Addresses error handling requirements from the OpenTelemetry specification
which states that SDKs MAY expose callbacks for self-diagnostics AND that
errors should be logged when they cannot be returned to the caller.
@svencowart svencowart force-pushed the propogate-trace-errors branch from 620043d to 83c695a Compare October 28, 2025 22:40
// This should be changed to return Result so users can be notified of export failures.
// The OpenTelemetry spec requires on_end to be fast/non-blocking, but doesn't mandate
// a void return - returning an error doesn't violate that requirement.
// Until the trait is changed, we cannot return this error to the caller.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most spans end via Drop, which can't propagate errors. We should use the similar pattern as for batch processor here too.

current_batch_size: Arc<AtomicUsize>,
max_export_batch_size: usize,
dropped_spans_count: AtomicUsize,
dropped_spans_count: Arc<AtomicUsize>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be Arc?

dropped_span_count = dropped_spans,
max_queue_size = max_queue_size,
message = "Spans were dropped due to a queue being full. The count represents the total count of spans dropped in the lifetime of this BatchSpanProcessor. Consider increasing the queue size and/or decrease delay between intervals."
name: "BatchSpanProcessor.Shutdown",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change event name? anything wrong with the existing name? Happy to improve it.

message = "Spans were dropped due to a full queue. The count represents the total count of span records dropped in the lifetime of the BatchSpanProcessor. Consider increasing the queue size and/or decrease delay between intervals."
);

// Also return error so user code can handle it programmatically
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels incorrect. This is the shutdown() call, and that method is not failing here to warrant returning an Err - having items dropped due to buffer full is not a failure of shutdown itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Library crates should propagate errors instead of silently logging them

3 participants