Description
What would you like to be added: The Flow Control layer has a centralized view of demand and capacity, including queue lengths, dispatch rates, and saturation status. This issue proposes designing and exposing these internal metrics as stable, public signals that can be used to drive autoscaling.
This feature requires a design phase to determine:
- Which metrics provide the most reliable signal for scaling (e.g., aggregate queue depth, time-in-queue p99, rate of dispatch vs. arrival).
- How these metrics should be aggregated and exposed (e.g., via Prometheus) so they can be readily consumed by autoscalers such as the Kubernetes HPA (see the sketch after this list).
- How we handle heterogeneous pools.
- How we handle disaggregated serving architectures (e.g., separate signals for prefill (P) and decode (D) workers).
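As a rough illustration (not a committed design), the exported signals could look like ordinary Prometheus collectors registered by the Flow Control layer. The metric and label names below (`flow_control_queue_depth`, the `pool` label, etc.) are hypothetical placeholders; the actual names, label sets, and aggregation granularity are exactly what the design phase would need to decide.

```go
package flowcontrol

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric names for illustration only; real names and labels
// would be settled during the design phase proposed in this issue.
var (
	// Aggregate depth of the flow-control queues, labeled by pool.
	queueDepth = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "flow_control_queue_depth",
			Help: "Current number of requests waiting in flow-control queues.",
		},
		[]string{"pool"},
	)

	// Time a request spends queued before dispatch; the p99 of this
	// histogram is one candidate scaling signal.
	timeInQueue = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "flow_control_time_in_queue_seconds",
			Help:    "Time requests spend in flow-control queues before dispatch.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"pool"},
	)

	// Arrival and dispatch counters; comparing their rates indicates
	// whether demand is outpacing capacity.
	arrivedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "flow_control_requests_arrived_total",
			Help: "Total requests admitted into flow-control queues.",
		},
		[]string{"pool"},
	)
	dispatchedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "flow_control_requests_dispatched_total",
			Help: "Total requests dispatched from flow-control queues.",
		},
		[]string{"pool"},
	)
)

func init() {
	prometheus.MustRegister(queueDepth, timeInQueue, arrivedTotal, dispatchedTotal)
}
```

With signals shaped like this, an autoscaler could act on queue depth or on the gap between arrival and dispatch rates, for example through the Kubernetes HPA's custom/external metrics path (e.g., via prometheus-adapter). Whether the signals are exposed per pool, per model server, or per worker role (prefill vs. decode) is part of the open questions listed above.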
Why is this needed: Autoscaling based on lagging indicators like CPU/GPU utilization can be ineffective for bursty inference workloads. The Flow Control layer's backpressure signals provide a direct, real-time measure of demand exceeding capacity. Exposing these metrics will enable users to create much more responsive and accurate autoscaling policies, preventing overload and improving resource efficiency.