Description
What would you like to be added: The Flow Control layer has a centralized view of demand and capacity, including queue lengths, dispatch rates, and saturation status. This issue proposes designing and exposing these internal metrics as stable, public signals that can be used to drive autoscaling.
This feature requires a design phase to determine:
- Which metrics provide the most reliable signal for scaling (e.g., aggregate queue depth, time-in-queue p99, rate of dispatch vs. arrival).
- How these metrics should be aggregated and exposed (e.g., via Prometheus) so they can be readily consumed by autoscalers such as the Kubernetes HPA (see the sketch after this list).
- How we handle heterogeneous pools.
- How we handle disaggregated serving architectures (e.g., separate signals for prefill (P) and decode (D) workers).
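As a rough illustration (not a committed design), the exported signals could look like ordinary Prometheus collectors registered by the Flow Control layer. The metric and label names below (`flow_control_queue_depth`, the `pool` label, etc.) are hypothetical placeholders; the actual names, label sets, and aggregation granularity are exactly what the design phase would need to decide.

```go
package flowcontrol

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric names for illustration only; real names and labels
// would be settled during the design phase proposed in this issue.
var (
	// Aggregate depth of the flow-control queues, labeled by pool.
	queueDepth = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "flow_control_queue_depth",
			Help: "Current number of requests waiting in flow-control queues.",
		},
		[]string{"pool"},
	)

	// Time a request spends queued before dispatch; the p99 of this
	// histogram is one candidate scaling signal.
	timeInQueue = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "flow_control_time_in_queue_seconds",
			Help:    "Time requests spend in flow-control queues before dispatch.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"pool"},
	)

	// Arrival and dispatch counters; comparing their rates indicates
	// whether demand is outpacing capacity.
	arrivedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "flow_control_requests_arrived_total",
			Help: "Total requests admitted into flow-control queues.",
		},
		[]string{"pool"},
	)
	dispatchedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "flow_control_requests_dispatched_total",
			Help: "Total requests dispatched from flow-control queues.",
		},
		[]string{"pool"},
	)
)

func init() {
	prometheus.MustRegister(queueDepth, timeInQueue, arrivedTotal, dispatchedTotal)
}
```

With signals shaped like this, an autoscaler could act on queue depth or on the gap between arrival and dispatch rates, for example through the Kubernetes HPA's custom/external metrics path (e.g., via prometheus-adapter). Whether the signals are exposed per pool, per model server, or per worker role (prefill vs. decode) is part of the open questions listed above.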
Why is this needed: Autoscaling based on lagging indicators like CPU/GPU utilization can be ineffective for bursty inference workloads. The Flow Control layer's backpressure signals provide a direct, real-time measure of demand exceeding capacity. Exposing these metrics will enable users to create much more responsive and accurate autoscaling policies, preventing overload and improving resource efficiency.