[Flow Control] Expose Backpressure Metrics for Autoscaling #1798

@LukeAVanDrie

Description

What would you like to be added: The Flow Control layer has a centralized view of demand and capacity, including queue lengths, dispatch rates, and saturation status. This issue proposes designing and exposing these internal metrics as stable, public signals that can be used to drive autoscaling.

This feature requires a design phase to determine:

  • Which metrics provide the most reliable signal for scaling (e.g., aggregate queue depth, time-in-queue p99, rate of dispatch vs. arrival).
  • How these metrics should be aggregated and exposed (e.g., via Prometheus) to be easily consumed by autoscalers like the Kubernetes HPA.
  • How to handle heterogeneous pools.
  • How to handle disaggregated serving architectures (e.g., separate signals for prefill (P) and decode (D) workers).
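
To make the design questions above concrete, here is a minimal sketch of how the three candidate signals could be derived from a window of queue observations and rendered in the Prometheus text exposition format. Everything named here is an assumption for illustration: the `QueueSnapshot` type, its fields, the `role` label, and the `flow_control_*` metric names are hypothetical and are not taken from the Flow Control layer.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical per-pool observation window; field names are illustrative."""
    role: str                  # e.g. "prefill" or "decode"
    queue_depth: int           # requests currently waiting
    wait_seconds: list         # time-in-queue of recently dispatched requests
    arrivals: int              # requests that arrived during the window
    dispatches: int            # requests dispatched during the window


def p99(samples):
    """Nearest-rank 99th percentile; returns 0.0 for an empty window."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), giving the 1-based nearest rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def dispatch_ratio(snap):
    """Dispatches per arrival; a value below 1.0 means the queue is growing."""
    if snap.arrivals == 0:
        return 1.0  # assumption: no arrivals is treated as "keeping up"
    return snap.dispatches / snap.arrivals


def render(snapshots):
    """Render gauges in Prometheus text format, one group per metric name."""
    metrics = [
        ("flow_control_queue_depth", lambda s: s.queue_depth),
        ("flow_control_time_in_queue_p99_seconds", lambda s: p99(s.wait_seconds)),
        ("flow_control_dispatch_ratio", dispatch_ratio),
    ]
    lines = []
    for name, value_of in metrics:
        lines.append(f"# TYPE {name} gauge")
        for s in snapshots:
            lines.append(f'{name}{{role="{s.role}"}} {value_of(s)}')
    return "\n".join(lines) + "\n"
```

Labeling each series by worker role is one possible way to keep separate signals for prefill and decode workers; whether a label, separate metric names, or per-pool endpoints is the better fit is exactly the kind of question the design phase would need to settle.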

Why is this needed: Autoscaling based on lagging indicators like CPU/GPU utilization can be ineffective for bursty inference workloads. The Flow Control layer's backpressure signals provide a direct, real-time measure of demand exceeding capacity. Exposing these metrics will enable users to create much more responsive and accurate autoscaling policies, preventing overload and improving resource efficiency.
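
As a rough illustration of how such a signal could drive scaling, the sketch below applies the replica calculation documented for the Kubernetes HPA, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, to a per-replica queue depth. The choice of queue depth as the metric, the target value, and the function and parameter names are assumptions made for this example; the 0.1 tolerance mirrors the HPA's documented default.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     tolerance=0.1, min_replicas=1):
    """Replica count from the HPA ratio formula (illustrative sketch only)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Within the tolerance band around 1.0, keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return max(min_replicas, math.ceil(current_replicas * ratio))


# Hypothetical burst: 4 replicas, average queue depth 30 per replica, target 10.
print(desired_replicas(4, 30, 10))
```

Because queue depth rises as soon as arrivals outpace dispatches, the ratio in this calculation reacts to a burst directly, without waiting for utilization to climb.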

Metadata

Assignees

No one assigned

Labels

needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
