feature: add supervisord support for docker image intergration #12

HappyAmazonian · 2025-11-04T21:53:11Z

Add Supervisor Process Management Module

This introduces a supervisor module that wraps ML frameworks with supervisord for automatic crash recovery and robust process management. It can be integrated into any Dockerfile easily.

Integration

Install and use with these commands:

pip install model-hosting-container-standards
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080

Or in a Dockerfile:

COPY model_hosting_container_standards-0.1.2-py3-none-any.whl /tmp/
RUN pip install supervisor
RUN pip install /tmp/model_hosting_container_standards-0.1.2-py3-none-any.whl

# Use supervisor entrypoint for SageMaker
ENV ENGINE_AUTO_RECOVERY=true
ENV ENGINE_MAX_RECOVERY_ATTEMPTS=3
ENTRYPOINT ["standard-supervisor", "./sagemaker-entrypoint.sh"]

Workflow

Parse command and environment → Read ML framework command and supervisor configuration
Generate supervisord config → Create robust configuration with configparser
Start supervisord → Launch supervisor daemon with your framework as managed process
Monitor and restart → Supervisor detects crashes and restarts automatically with configurable limits
Handle failures → After max retries, container exits gracefully with proper error codes

Key Components

Core Modules:

models.py - Configuration data models with comprehensive validation and environment variable parsing
generator.py - Robust supervisord configuration generation using configparser

CLI Tools & Scripts:

scripts/standard_supervisor.py - Main CLI tool for running ML frameworks under supervisor (standard-supervisor)
scripts/generate_supervisor_config.py - Standalone configuration generator CLI

Documentation & Tests:

README.md - Comprehensive setup guide with examples
tests/integration/test_supervisor_cli_integration.py - Real behavior integration tests that verify actual restart and retry behavior
tests/supervisor/ - Comprehensive unit tests for all components

Usage Examples

Simple CLI Usage

# Direct command execution with supervisor
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080

# With custom configuration
MAX_START_RETRIES=5 SUPERVISOR_PROGRAM__LLM_ENGINE_STARTSECS=30 \
standard-supervisor python -m tensorrt_llm.hlapi.llm_api

Dockerfile Integration

FROM vllm/vllm-openai:latest

# Install with supervisor support
RUN pip install model-hosting-container-standards

# Configure your ML framework with supervisor settings
ENV MAX_START_RETRIES=3
ENV SUPERVISOR_PROGRAM__LLM_ENGINE_STARTSECS=30
ENV SUPERVISOR_PROGRAM__LLM_ENGINE_STOPWAITSECS=60
ENV LOG_LEVEL=info

# Use supervisor for process management
ENTRYPOINT ["python", "-m", "model_hosting_container_standards.supervisor.scripts.standard_supervisor"]
CMD ["vllm", "serve", "model", "--host", "0.0.0.0", "--port", "8080"]

Configuration Options

Basic Configuration:

Command line arguments become the supervised process command
MAX_START_RETRIES=3 - Maximum startup attempts before giving up (0-100)
LOG_LEVEL=info - Logging level (debug, info, warn, error, critical)

Advanced Supervisor Settings:

SUPERVISOR_PROGRAM__LLM_ENGINE_STARTSECS=30 - Time process must run to be considered "started"
SUPERVISOR_PROGRAM__LLM_ENGINE_STOPWAITSECS=60 - Time to wait for graceful shutdown
SUPERVISOR_PROGRAM__LLM_ENGINE_AUTORESTART=true - Enable automatic restart on failure
SUPERVISOR_PROGRAM__LLM_ENGINE_STARTRETRIES=3 - Startup retry attempts
SUPERVISOR_CONFIG_PATH=/tmp/supervisord.conf - Custom config file location

Custom Sections:

SUPERVISOR_SUPERVISORD_LOGLEVEL=debug - Supervisord daemon log level
SUPERVISOR_EVENTLISTENER__MEMMON_COMMAND=memmon -a 200MB - Add custom event listeners

Testing & Validation

Comprehensive Test Suite:

Integration Tests - Actual supervisor processes that verify continuous restart and retry limit behavior
Test Coverage:
Continuous restart behavior - Verifies supervisor actually restarts failed processes
Startup retry limits - Confirms supervisor respects retry limits and gives up appropriately
Signal handling - Tests graceful shutdown with SIGTERM
ML framework integration - Tests with realistic ML framework startup patterns
Configuration generation - Validates all supervisor configuration options
Error handling - Tests invalid configurations and edge cases

Manual Testing:

Tested with vLLM dockerfile build
Verified with docker exec process killing to confirm restart behavior
Validated in production-like container environments

- Add comprehensive supervisor process management for ML frameworks - Support for vLLM and TensorRT-LLM with auto-recovery capabilities - Environment variable-based configuration with validation - Supervisord configuration generation and management - Complete test suite with 84 passing tests (73 unit + 11 integration) - Clean documentation and usage examples - Generic supervisor-entrypoint.sh script for any container platform Key features: - Automatic process monitoring and restart on failures - Configurable recovery attempts and backoff timing - Framework-specific command resolution - Comprehensive error handling and logging - Production-ready container integration

The default framework_name should be None, not FrameworkName.GENERIC which doesn't exist.

- Remove hardcoded default commands for vLLM and TensorRT-LLM - Require users to set FRAMEWORK_COMMAND explicitly in their Dockerfiles - Update documentation to show explicit framework command examples - Update all tests to use explicit FRAMEWORK_COMMAND - Simplify framework_config.py to focus on validation only - FRAMEWORK_NAME is now optional and used only for validation This gives users full control over their framework startup commands and removes assumptions about specific framework command patterns.

…ment - Add 47 unit tests covering configuration validation, environment parsing, and framework command resolution - Add 11 integration tests for end-to-end workflows and module consistency - Fix test failures in framework command resolution and multiple framework support - All 58 tests now passing with comprehensive coverage of supervisor functionality - Tests validate configuration generation, error handling, and integration workflows

- Update README with simplified usage guide and explicit framework commands - Remove hardcoded framework commands from framework_config.py - Rename sagemaker-entrypoint.sh to supervisor-entrypoint.sh for generic usage - Consolidate examples into main README for cleaner structure - Require explicit FRAMEWORK_COMMAND environment variable for all frameworks - Improve error handling and logging throughout supervisor modules

- Refactor file structure for clarity: - config.py → models.py (configuration data models) - supervisor_config.py → generator.py (config file generation) - Simplify API and remove redundancy: - Remove redundant launch_command parameter (use config.launch_command) - Remove FRAMEWORK_COMMAND (use LAUNCH_COMMAND consistently) - Remove SUPERVISOR_PROGRAM_NAME (use fixed 'llm-engine' default) - Remove debug functionality (log_debug, SUPERVISOR_DEBUG) - Comment unused recovery_backoff_seconds field - Improve user experience: - Add extract-supervisor-entrypoint CLI tool - Update README with clearer setup instructions - Use /tmp/supervisord.conf as default (more universal than /opt/aws/) - Add path documentation and examples - Code quality improvements: - Remove complex error capture logic in shell script - Remove unnecessary configuration validation steps - Clean up imports and dependencies - Update all tests to match new structure - Add missing validate_environment_variable function - Breaking changes: - File renames require import updates - Some environment variables removed - API signatures simplified

✅ Fixed Integration Test Issues: - Resolved timeout issues in entrypoint script tests - Updated script to use Python modules directly instead of console commands - Fixed test to properly handle supervisord unavailability scenarios 🗂️ Test Structure Reorganization: - Moved unit tests from tests/unit/ to tests/supervisor/ - Removed outdated test files using old APIs - Consolidated supervisor tests in appropriate directories 🧪 Comprehensive Test Coverage: - 38 supervisor tests now passing (28 integration + 11 unit) - Tests cover exit behavior, configuration generation, CLI tools - End-to-end validation of supervisor monitoring functionality 🛠️ Technical Improvements: - Updated entrypoint script to work in test environments - Removed dependencies on installed console scripts - Enhanced error handling and timeout management - Replaced deprecated pkg_resources with importlib.resources All tests passing: 314 passed, 2 skipped

- Add complete vLLM + SageMaker Dockerfile integration example - Fix env var name: ENGINE_MAX_RECOVERY_ATTEMPTS -> ENGINE_MAX_START_RETRIES - Add runtime override examples showing how to override ENV vars at container launch - Add validation ranges and allowed values for configuration options - Include custom entrypoint script example (sagemaker-entrypoint.sh) - Clarify what users get with the integration (SageMaker endpoints, process monitoring, LoRA support)

- Add overview section at the top explaining key benefits and use case - Remove duplicate 'What You Get' sections - Fix API usage example: max_recovery_attempts -> max_start_retries - Add missing custom entrypoint script content in complete example - Reorganize sections for better flow: Overview -> Setup -> Config -> Example -> Usage -> Troubleshooting -> API - Simplify launch command examples to be more realistic - Move troubleshooting after usage examples for better logical flow - Add launch command requirement to quick setup section

- Reduce script from 201 lines to 52 lines (74% reduction) - Remove excessive logging and verbose timestamps - Streamline validation while keeping essential checks - Improve FATAL state monitoring (1-second intervals vs 5-second) - Add required log messages for test compatibility - Maintain all core functionality: env validation, config generation, supervisord startup, failure monitoring - All 425 tests now pass

- Remove unused API documentation sections from README - Clean up formatting in generator.py and models.py - Remove unused import in generator.py - Maintain functionality while improving code readability

- Function was defined but never used anywhere in the codebase - All tests continue to pass after removal - Reduces code complexity and maintenance burden

- Consolidated test_supervisor_exit_behavior.py from 15+ methods to 8 focused tests - Removed redundant test_supervisor_monitoring_logic.py (95% duplicate functionality) - Used parametrized tests to reduce repetition and improve coverage - Simplified TestExitBehaviorLogic class from 12 methods to 4 comprehensive tests - Maintained full test coverage while improving maintainability and execution speed Benefits: - Faster test execution (eliminated 30+ second flaky subprocess test) - Easier maintenance (single source of truth for supervisor tests) - Better readability (focused tests with clear purposes) - Reduced cognitive load for developers

Lokiiiiii · 2025-11-05T21:22:33Z

python/model_hosting_container_standards/supervisor/README.md

+
+**Use Case**: Deploy ML frameworks on SageMaker or any container platform with automatic crash recovery and proper failure signaling.
+
+## Quick Setup


This seems a bit convoluted. There should just be 2-3 steps -

Install

[Optional] Set ENV variables

Launch

The ENTRYPOINT should just require inserting a CLI command in front.

"vllm serve ..."
to

"standard-supervisor vllm serve ..."

Creating a new entrypoint and extracting it seems like excessive exposure from a CX perspective. Let's try to simplify this.

Lokiiiiii · 2025-11-05T21:24:01Z

python/model_hosting_container_standards/supervisor/README.md

+
+### Optional Settings
+```bash
+export ENGINE_AUTO_RECOVERY=true        # Auto-restart on failure (default: true)


is ENGINE_AUTO_RECOVERY really required ? Can we not enable/disable the feature through ENGINE_MAX_START_RETRIES ?

MAX_START_RETRIES only checks the startup failure
AUTO_RECOVERY will restart the program during the run.

Lokiiiiii · 2025-11-05T21:24:50Z

python/model_hosting_container_standards/supervisor/README.md

+### Optional Settings
+```bash
+export ENGINE_AUTO_RECOVERY=true        # Auto-restart on failure (default: true)
+export ENGINE_MAX_START_RETRIES=3       # Max restart attempts (default: 3, range: 0-100)


Let's change to something more generic :ENGINE -> PROCESS

Open to suggestions ..

python/model_hosting_container_standards/supervisor/generator.py

✅ Task 3.2 Complete: Add custom configuration merging to generator ## What's implemented: - Refactored string template to dictionary structure for cleaner code - Added custom SUPERVISOR_* environment variable merging logic - Implemented _merge_custom_sections() for flexible configuration override - Added _dict_to_ini_string() for INI format conversion - Removed complex critical settings validation (user responsibility) - Added comprehensive test coverage for custom configuration scenarios ## Key features: - Merge custom configuration with base template - Override existing settings in any section - Add new settings to existing sections - Add completely new configuration sections - User has full control over supervisor configuration ## Requirements satisfied: - 2.1: Custom SUPERVISOR_* configuration parsing ✅ - 2.2: Merge with base template without override restrictions ✅ - 2.3: Flexible validation approach (user responsibility) ✅ ## Next tasks to implement: - 4.1: Update CLI tools to use new generator - 4.2: Add integration tests for CLI tools - 4.3: Update documentation and examples

- Add standard-supervisor CLI command for simplified ML framework supervision - Replace complex entrypoint extraction with single command approach - Implement unified SUPERVISOR_* environment variable pattern - Add comprehensive CLI integration tests - Update documentation with simplified setup guide - Remove legacy extract-supervisor-entrypoint and supervisor-entrypoint.sh - Change default program name from llm-engine to llm_engine for consistency - Add support for program-specific configuration via SUPERVISOR_PROGRAM__LLM_ENGINE_* Key improvements: - Users can now simply use: standard-supervisor vllm serve model --host 0.0.0.0 --port 8080 - No more complex script extraction or custom entrypoints needed - Unified configuration system with application-level and advanced options - Full process management with signal handling and graceful shutdown - Container-friendly exit codes for orchestrator integration All integration tests pass (19/19 CLI tests + 14/14 behavior tests)

- Add 24 unit tests for StandardSupervisor CLI components - ProcessManager: tool checking, process lifecycle, signal handling - ProcessMonitor: FATAL state detection, error handling - SignalHandler: signal setup and cleanup - StandardSupervisor: argument parsing, logging, main execution flow - Add 21 unit tests for supervisor configuration generator - Base template generation with all required sections - Custom section merging and override logic - INI string formatting and structure - Configuration validation and error handling - File I/O operations and directory creation - Maintain existing 6 unit tests for SupervisorConfig model - Environment variable parsing (AUTO_RECOVERY, MAX_START_RETRIES, etc.) - SUPERVISOR_* pattern parsing with double underscore to colon conversion - Default value handling and validation Total coverage: 51 unit tests + 33 integration tests = 84 tests All tests pass with comprehensive mocking for isolated unit testing

- Replace mock-based tests with actual supervisor process testing - Add file-based logging to verify restart and retry behavior - Implement test_continuous_restart_behavior: proves supervisor continuously restarts processes with autorestart=true - Implement test_startup_retry_limit: verifies supervisor respects startretries limit with exact attempt counting - Simplify test suite from 13 to 6 focused tests, removing redundant configuration checks - Fix subprocess execution issues with proper python executable paths and working directories - All tests now verify real supervisor behavior rather than just configuration generation

- Update supervisor generator to use configparser for robust config generation - Add comprehensive validation and error handling in supervisor models - Remove obsolete test_exit_behavior.py (functionality moved to integration tests) - Enhance test_generator.py with better config parsing validation - Add new test_models.py for supervisor configuration model testing - Update README.md with improved documentation - Fix unused import in generator.py

- Remove obsolete test_supervisor_exit_behavior.py (functionality moved to CLI integration tests) - Update supervisor generator tests to match configparser output format (key = value instead of key=value) - Fix all assertion patterns in test_generator.py to use proper spacing - All supervisor tests now pass (88/88 tests passing) - Maintain backward compatibility while using robust configparser for config generation

- Replace AUTO_RECOVERY with PROCESS_AUTO_RECOVERY throughout README - Replace MAX_START_RETRIES with PROCESS_MAX_START_RETRIES throughout README - Keep LOG_LEVEL unchanged for backward compatibility - Update all examples and documentation to use new variable names - Maintain consistency with code changes and PR description

- Move supervisor from dev to main dependencies for production use - Remove overly restrictive version upper bounds from dev dependencies - Update supervisor constraint from >=4.2.0,<5.0.0 to >=4.2.0 for flexibility - This ensures CI tests pass and users get supervisor in production installs

- Add start_new_session=True to all subprocess calls in supervisor integration tests - Add start_new_session=True to supervisord process creation in standard_supervisor.py - This prevents session conflicts in CI environments where process groups can interfere - Resolves the issue where supervisord.conf files weren't being generated in CI

- Revert all test modifications to original state - Add explicit pip install supervisor step in GitHub workflow - This ensures supervisor tools are available in CI environment - Simpler approach than complex subprocess session management

- Add -s -v flags to pytest in Makefile test command - This ensures debug print statements are visible in CI logs - Will help diagnose supervisor integration test failures

- Remove debug print statements from integration tests - Remove CI workflow debug steps for supervisor verification - Restore normal pytest output in Makefile - Fix config file cleanup logic to preserve user-specified configs - All 473 tests now passing

Lokiiiiii · 2025-11-07T17:49:17Z

python/model_hosting_container_standards/supervisor/models.py

+        raise
+
+
+def _parse_supervisor_custom_sections() -> Dict[str, Dict[str, str]]:


Let's define the expectation as a regular expression and then pattern match.

r"""^SUPERVISOR_ (?P<section>[A-Z0-9](?:[A-Z0-9_]*[A-Z0-9])?) # SECTION (no leading/trailing underscore) _ (?P<key>[A-Z0-9](?:[A-Z0-9_]*[A-Z0-9])?) # KEY (no leading/trailing underscore) $"""

This should simplify the logic.

Lokiiiiii · 2025-11-07T17:51:32Z

python/model_hosting_container_standards/supervisor/scripts/generate_supervisor_config.py

+    )
+
+    parser.add_argument(
+        "-p", "--program-name", default="llm_engine", help="Program name"


Let's just call the default as "app"

Lokiiiiii · 2025-11-07T18:49:56Z

python/model_hosting_container_standards/supervisor/generator.py

+    """Get base supervisord configuration as dictionary structure."""
+    return {
+        "unix_http_server": {
+            "file": f"/tmp/supervisor-{program_name}.sock",


My understanding is that this opens an interface to the supervisord control plane.

Under what circumstance would someone need this functionality ? Why would we enable this by default ?

Same for supervisorctl and unix_http_server

it is one way for us to probe the supervisor process status from the main process.

Let me try to see if we can remove this part

- Remove unix_http_server, supervisorctl, and rpcinterface:supervisor config sections - Remove ProcessMonitor class that used supervisorctl for status checks - Use poll() loop instead of wait() for better signal handling responsiveness - Change autorestart from 'unexpected' to 'true' for LLM server use case - Update tests to use long-running server processes instead of quick-exit commands - All 6 integration tests passing This simplifies the codebase by removing the supervisorctl dependency while maintaining all core functionality for supervising long-running LLM servers.

- Remove ProcessMonitor tests as the class has been removed - Update test expectations to not check for supervisorctl config sections - Fix mock setup for process returncode - All 77 unit tests passing

- Replace manual string parsing with compiled regex pattern - Explicitly validate section/key format (no leading/trailing underscores) - Add comprehensive test coverage for edge cases - Improve code maintainability and clarity

HappyAmazonian added 20 commits November 4, 2025 13:25

fix: correct test assertion for default framework_name

20f6309

The default framework_name should be None, not FrameworkName.GENERIC which doesn't exist.

refactor

54a9f6c

Clean up supervisor module formatting and documentation

f57e015

- Remove unused API documentation sections from README - Clean up formatting in generator.py and models.py - Remove unused import in generator.py - Maintain functionality while improving code readability

Remove unused validate_config_directory function

ac9e3b9

- Function was defined but never used anywhere in the codebase - All tests continue to pass after removal - Reduces code complexity and maintenance burden

update readme

75eb447

readme

028da2f

improve

a0f0501

add test

b0d8de2

fix ci

5b7765f

try ci

19a52c4

Lokiiiiii requested changes Nov 5, 2025

View reviewed changes

HappyAmazonian added 9 commits November 6, 2025 19:14

HappyAmazonian added 5 commits November 6, 2025 20:02

Enable pytest output in CI for debug information

776b27b

- Add -s -v flags to pytest in Makefile test command - This ensures debug print statements are visible in CI logs - Will help diagnose supervisor integration test failures

try ci

c18bed7

ci

5d57d41

Lokiiiiii requested changes Nov 7, 2025

View reviewed changes

HappyAmazonian added 3 commits November 7, 2025 14:03

Update unit tests to remove supervisorctl references

33dcba5

- Remove ProcessMonitor tests as the class has been removed - Update test expectations to not check for supervisorctl config sections - Fix mock setup for process returncode - All 77 unit tests passing

Lokiiiiii approved these changes Nov 10, 2025

View reviewed changes

HappyAmazonian and others added 4 commits November 11, 2025 11:52

Merge branch 'main' into restart

e1d7174

Merge branch 'main' into restart

fa96279

update loc

a9059e9

Merge remote-tracking branch 'upstream/restart' into restart

80b043f

HappyAmazonian merged commit df3a3e8 into main Nov 11, 2025
7 checks passed

HappyAmazonian deleted the restart branch November 11, 2025 21:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feature: add supervisord support for docker image intergration #12

feature: add supervisord support for docker image intergration #12

Uh oh!

HappyAmazonian commented Nov 4, 2025 •

edited

Loading

Uh oh!

Lokiiiiii Nov 5, 2025

Uh oh!

Lokiiiiii Nov 5, 2025

Uh oh!

HappyAmazonian Nov 7, 2025

Uh oh!

Lokiiiiii Nov 5, 2025

Uh oh!

Uh oh!

Lokiiiiii Nov 7, 2025

Uh oh!

Lokiiiiii Nov 7, 2025

Uh oh!

Lokiiiiii Nov 7, 2025

Uh oh!

HappyAmazonian Nov 7, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		Use Case: Deploy ML frameworks on SageMaker or any container platform with automatic crash recovery and proper failure signaling.

		## Quick Setup

		raise


		def _parse_supervisor_custom_sections() -> Dict[str, Dict[str, str]]:

feature: add supervisord support for docker image intergration #12

feature: add supervisord support for docker image intergration #12

Uh oh!

Conversation

HappyAmazonian commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add Supervisor Process Management Module

Integration

Workflow

Key Components

Usage Examples

Simple CLI Usage

Dockerfile Integration

Configuration Options

Testing & Validation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HappyAmazonian commented Nov 4, 2025 •

edited

Loading