-
Notifications
You must be signed in to change notification settings - Fork 1
feature: add supervisord support for docker image intergration #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add comprehensive supervisor process management for ML frameworks - Support for vLLM and TensorRT-LLM with auto-recovery capabilities - Environment variable-based configuration with validation - Supervisord configuration generation and management - Complete test suite with 84 passing tests (73 unit + 11 integration) - Clean documentation and usage examples - Generic supervisor-entrypoint.sh script for any container platform Key features: - Automatic process monitoring and restart on failures - Configurable recovery attempts and backoff timing - Framework-specific command resolution - Comprehensive error handling and logging - Production-ready container integration
The default framework_name should be None, not FrameworkName.GENERIC which doesn't exist.
- Remove hardcoded default commands for vLLM and TensorRT-LLM - Require users to set FRAMEWORK_COMMAND explicitly in their Dockerfiles - Update documentation to show explicit framework command examples - Update all tests to use explicit FRAMEWORK_COMMAND - Simplify framework_config.py to focus on validation only - FRAMEWORK_NAME is now optional and used only for validation This gives users full control over their framework startup commands and removes assumptions about specific framework command patterns.
…ment - Add 47 unit tests covering configuration validation, environment parsing, and framework command resolution - Add 11 integration tests for end-to-end workflows and module consistency - Fix test failures in framework command resolution and multiple framework support - All 58 tests now passing with comprehensive coverage of supervisor functionality - Tests validate configuration generation, error handling, and integration workflows
- Update README with simplified usage guide and explicit framework commands - Remove hardcoded framework commands from framework_config.py - Rename sagemaker-entrypoint.sh to supervisor-entrypoint.sh for generic usage - Consolidate examples into main README for cleaner structure - Require explicit FRAMEWORK_COMMAND environment variable for all frameworks - Improve error handling and logging throughout supervisor modules
- Refactor file structure for clarity: - config.py → models.py (configuration data models) - supervisor_config.py → generator.py (config file generation) - Simplify API and remove redundancy: - Remove redundant launch_command parameter (use config.launch_command) - Remove FRAMEWORK_COMMAND (use LAUNCH_COMMAND consistently) - Remove SUPERVISOR_PROGRAM_NAME (use fixed 'llm-engine' default) - Remove debug functionality (log_debug, SUPERVISOR_DEBUG) - Comment unused recovery_backoff_seconds field - Improve user experience: - Add extract-supervisor-entrypoint CLI tool - Update README with clearer setup instructions - Use /tmp/supervisord.conf as default (more universal than /opt/aws/) - Add path documentation and examples - Code quality improvements: - Remove complex error capture logic in shell script - Remove unnecessary configuration validation steps - Clean up imports and dependencies - Update all tests to match new structure - Add missing validate_environment_variable function - Breaking changes: - File renames require import updates - Some environment variables removed - API signatures simplified
✅ Fixed Integration Test Issues: - Resolved timeout issues in entrypoint script tests - Updated script to use Python modules directly instead of console commands - Fixed test to properly handle supervisord unavailability scenarios 🗂️ Test Structure Reorganization: - Moved unit tests from tests/unit/ to tests/supervisor/ - Removed outdated test files using old APIs - Consolidated supervisor tests in appropriate directories 🧪 Comprehensive Test Coverage: - 38 supervisor tests now passing (28 integration + 11 unit) - Tests cover exit behavior, configuration generation, CLI tools - End-to-end validation of supervisor monitoring functionality 🛠️ Technical Improvements: - Updated entrypoint script to work in test environments - Removed dependencies on installed console scripts - Enhanced error handling and timeout management - Replaced deprecated pkg_resources with importlib.resources All tests passing: 314 passed, 2 skipped
- Add complete vLLM + SageMaker Dockerfile integration example - Fix env var name: ENGINE_MAX_RECOVERY_ATTEMPTS -> ENGINE_MAX_START_RETRIES - Add runtime override examples showing how to override ENV vars at container launch - Add validation ranges and allowed values for configuration options - Include custom entrypoint script example (sagemaker-entrypoint.sh) - Clarify what users get with the integration (SageMaker endpoints, process monitoring, LoRA support)
- Add overview section at the top explaining key benefits and use case - Remove duplicate 'What You Get' sections - Fix API usage example: max_recovery_attempts -> max_start_retries - Add missing custom entrypoint script content in complete example - Reorganize sections for better flow: Overview -> Setup -> Config -> Example -> Usage -> Troubleshooting -> API - Simplify launch command examples to be more realistic - Move troubleshooting after usage examples for better logical flow - Add launch command requirement to quick setup section
- Reduce script from 201 lines to 52 lines (74% reduction) - Remove excessive logging and verbose timestamps - Streamline validation while keeping essential checks - Improve FATAL state monitoring (1-second intervals vs 5-second) - Add required log messages for test compatibility - Maintain all core functionality: env validation, config generation, supervisord startup, failure monitoring - All 425 tests now pass
- Remove unused API documentation sections from README - Clean up formatting in generator.py and models.py - Remove unused import in generator.py - Maintain functionality while improving code readability
- Function was defined but never used anywhere in the codebase - All tests continue to pass after removal - Reduces code complexity and maintenance burden
- Consolidated test_supervisor_exit_behavior.py from 15+ methods to 8 focused tests - Removed redundant test_supervisor_monitoring_logic.py (95% duplicate functionality) - Used parametrized tests to reduce repetition and improve coverage - Simplified TestExitBehaviorLogic class from 12 methods to 4 comprehensive tests - Maintained full test coverage while improving maintainability and execution speed Benefits: - Faster test execution (eliminated 30+ second flaky subprocess test) - Easier maintenance (single source of truth for supervisor tests) - Better readability (focused tests with clear purposes) - Reduced cognitive load for developers
|
|
||
| **Use Case**: Deploy ML frameworks on SageMaker or any container platform with automatic crash recovery and proper failure signaling. | ||
|
|
||
| ## Quick Setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit convoluted. There should just be 2-3 steps -
- Install
- [Optional] Set ENV variables
- Launch
The ENTRYPOINT should just require inserting a CLI command in front.
"vllm serve ..."
to
"standard-supervisor vllm serve ..."
Creating a new entrypoint and extracting it seems like excessive exposure from a CX perspective. Let's try to simplify this.
|
|
||
| ### Optional Settings | ||
| ```bash | ||
| export ENGINE_AUTO_RECOVERY=true # Auto-restart on failure (default: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is ENGINE_AUTO_RECOVERY really required ? Can we not enable/disable the feature through ENGINE_MAX_START_RETRIES ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX_START_RETRIES only checks the startup failure
AUTO_RECOVERY will restart the program during the run.
| ### Optional Settings | ||
| ```bash | ||
| export ENGINE_AUTO_RECOVERY=true # Auto-restart on failure (default: true) | ||
| export ENGINE_MAX_START_RETRIES=3 # Max restart attempts (default: 3, range: 0-100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change to something more generic :ENGINE -> PROCESS
Open to suggestions ..
✅ Task 3.2 Complete: Add custom configuration merging to generator ## What's implemented: - Refactored string template to dictionary structure for cleaner code - Added custom SUPERVISOR_* environment variable merging logic - Implemented _merge_custom_sections() for flexible configuration override - Added _dict_to_ini_string() for INI format conversion - Removed complex critical settings validation (user responsibility) - Added comprehensive test coverage for custom configuration scenarios ## Key features: - Merge custom configuration with base template - Override existing settings in any section - Add new settings to existing sections - Add completely new configuration sections - User has full control over supervisor configuration ## Requirements satisfied: - 2.1: Custom SUPERVISOR_* configuration parsing ✅ - 2.2: Merge with base template without override restrictions ✅ - 2.3: Flexible validation approach (user responsibility) ✅ ## Next tasks to implement: - 4.1: Update CLI tools to use new generator - 4.2: Add integration tests for CLI tools - 4.3: Update documentation and examples
- Add standard-supervisor CLI command for simplified ML framework supervision - Replace complex entrypoint extraction with single command approach - Implement unified SUPERVISOR_* environment variable pattern - Add comprehensive CLI integration tests - Update documentation with simplified setup guide - Remove legacy extract-supervisor-entrypoint and supervisor-entrypoint.sh - Change default program name from llm-engine to llm_engine for consistency - Add support for program-specific configuration via SUPERVISOR_PROGRAM__LLM_ENGINE_* Key improvements: - Users can now simply use: standard-supervisor vllm serve model --host 0.0.0.0 --port 8080 - No more complex script extraction or custom entrypoints needed - Unified configuration system with application-level and advanced options - Full process management with signal handling and graceful shutdown - Container-friendly exit codes for orchestrator integration All integration tests pass (19/19 CLI tests + 14/14 behavior tests)
- Add 24 unit tests for StandardSupervisor CLI components - ProcessManager: tool checking, process lifecycle, signal handling - ProcessMonitor: FATAL state detection, error handling - SignalHandler: signal setup and cleanup - StandardSupervisor: argument parsing, logging, main execution flow - Add 21 unit tests for supervisor configuration generator - Base template generation with all required sections - Custom section merging and override logic - INI string formatting and structure - Configuration validation and error handling - File I/O operations and directory creation - Maintain existing 6 unit tests for SupervisorConfig model - Environment variable parsing (AUTO_RECOVERY, MAX_START_RETRIES, etc.) - SUPERVISOR_* pattern parsing with double underscore to colon conversion - Default value handling and validation Total coverage: 51 unit tests + 33 integration tests = 84 tests All tests pass with comprehensive mocking for isolated unit testing
- Replace mock-based tests with actual supervisor process testing - Add file-based logging to verify restart and retry behavior - Implement test_continuous_restart_behavior: proves supervisor continuously restarts processes with autorestart=true - Implement test_startup_retry_limit: verifies supervisor respects startretries limit with exact attempt counting - Simplify test suite from 13 to 6 focused tests, removing redundant configuration checks - Fix subprocess execution issues with proper python executable paths and working directories - All tests now verify real supervisor behavior rather than just configuration generation
- Update supervisor generator to use configparser for robust config generation - Add comprehensive validation and error handling in supervisor models - Remove obsolete test_exit_behavior.py (functionality moved to integration tests) - Enhance test_generator.py with better config parsing validation - Add new test_models.py for supervisor configuration model testing - Update README.md with improved documentation - Fix unused import in generator.py
- Remove obsolete test_supervisor_exit_behavior.py (functionality moved to CLI integration tests) - Update supervisor generator tests to match configparser output format (key = value instead of key=value) - Fix all assertion patterns in test_generator.py to use proper spacing - All supervisor tests now pass (88/88 tests passing) - Maintain backward compatibility while using robust configparser for config generation
- Replace AUTO_RECOVERY with PROCESS_AUTO_RECOVERY throughout README - Replace MAX_START_RETRIES with PROCESS_MAX_START_RETRIES throughout README - Keep LOG_LEVEL unchanged for backward compatibility - Update all examples and documentation to use new variable names - Maintain consistency with code changes and PR description
- Move supervisor from dev to main dependencies for production use - Remove overly restrictive version upper bounds from dev dependencies - Update supervisor constraint from >=4.2.0,<5.0.0 to >=4.2.0 for flexibility - This ensures CI tests pass and users get supervisor in production installs
- Add start_new_session=True to all subprocess calls in supervisor integration tests - Add start_new_session=True to supervisord process creation in standard_supervisor.py - This prevents session conflicts in CI environments where process groups can interfere - Resolves the issue where supervisord.conf files weren't being generated in CI
- Revert all test modifications to original state - Add explicit pip install supervisor step in GitHub workflow - This ensures supervisor tools are available in CI environment - Simpler approach than complex subprocess session management
- Add -s -v flags to pytest in Makefile test command - This ensures debug print statements are visible in CI logs - Will help diagnose supervisor integration test failures
- Remove debug print statements from integration tests - Remove CI workflow debug steps for supervisor verification - Restore normal pytest output in Makefile - Fix config file cleanup logic to preserve user-specified configs - All 473 tests now passing
| raise | ||
|
|
||
|
|
||
| def _parse_supervisor_custom_sections() -> Dict[str, Dict[str, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's define the expectation as a regular expression and then pattern match.
r"""^SUPERVISOR_
(?P<section>[A-Z0-9](?:[A-Z0-9_]*[A-Z0-9])?) # SECTION (no leading/trailing underscore)
_
(?P<key>[A-Z0-9](?:[A-Z0-9_]*[A-Z0-9])?) # KEY (no leading/trailing underscore)
$"""
This should simplify the logic.
| ) | ||
|
|
||
| parser.add_argument( | ||
| "-p", "--program-name", default="llm_engine", help="Program name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just call the default as "app"
| """Get base supervisord configuration as dictionary structure.""" | ||
| return { | ||
| "unix_http_server": { | ||
| "file": f"/tmp/supervisor-{program_name}.sock", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that this opens an interface to the supervisord control plane.
Under what circumstance would someone need this functionality ? Why would we enable this by default ?
Same for supervisorctl and unix_http_server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is one way for us to probe the supervisor process status from the main process.
Let me try to see if we can remove this part
- Remove unix_http_server, supervisorctl, and rpcinterface:supervisor config sections - Remove ProcessMonitor class that used supervisorctl for status checks - Use poll() loop instead of wait() for better signal handling responsiveness - Change autorestart from 'unexpected' to 'true' for LLM server use case - Update tests to use long-running server processes instead of quick-exit commands - All 6 integration tests passing This simplifies the codebase by removing the supervisorctl dependency while maintaining all core functionality for supervising long-running LLM servers.
- Remove ProcessMonitor tests as the class has been removed - Update test expectations to not check for supervisorctl config sections - Fix mock setup for process returncode - All 77 unit tests passing
- Replace manual string parsing with compiled regex pattern - Explicitly validate section/key format (no leading/trailing underscores) - Add comprehensive test coverage for edge cases - Improve code maintainability and clarity
Add Supervisor Process Management Module
This introduces a supervisor module that wraps ML frameworks with supervisord for automatic crash recovery and robust process management. It can be integrated into any Dockerfile easily.
Integration
Install and use with these commands:
Or in a Dockerfile:
Workflow
Key Components
Core Modules:
models.py- Configuration data models with comprehensive validation and environment variable parsinggenerator.py- Robust supervisord configuration generation using configparserCLI Tools & Scripts:
scripts/standard_supervisor.py- Main CLI tool for running ML frameworks under supervisor (standard-supervisor)scripts/generate_supervisor_config.py- Standalone configuration generator CLIDocumentation & Tests:
README.md- Comprehensive setup guide with examplestests/integration/test_supervisor_cli_integration.py- Real behavior integration tests that verify actual restart and retry behaviortests/supervisor/- Comprehensive unit tests for all componentsUsage Examples
Simple CLI Usage
Dockerfile Integration
Configuration Options
Basic Configuration:
MAX_START_RETRIES=3- Maximum startup attempts before giving up (0-100)LOG_LEVEL=info- Logging level (debug, info, warn, error, critical)Advanced Supervisor Settings:
SUPERVISOR_PROGRAM__LLM_ENGINE_STARTSECS=30- Time process must run to be considered "started"SUPERVISOR_PROGRAM__LLM_ENGINE_STOPWAITSECS=60- Time to wait for graceful shutdownSUPERVISOR_PROGRAM__LLM_ENGINE_AUTORESTART=true- Enable automatic restart on failureSUPERVISOR_PROGRAM__LLM_ENGINE_STARTRETRIES=3- Startup retry attemptsSUPERVISOR_CONFIG_PATH=/tmp/supervisord.conf- Custom config file locationCustom Sections:
SUPERVISOR_SUPERVISORD_LOGLEVEL=debug- Supervisord daemon log levelSUPERVISOR_EVENTLISTENER__MEMMON_COMMAND=memmon -a 200MB- Add custom event listenersTesting & Validation
Comprehensive Test Suite:
Test Coverage:
Manual Testing:
docker execprocess killing to confirm restart behavior