Merged
41 commits
116fac6
feat: implement supervisor process management system
HappyAmazonian Oct 29, 2025
20f6309
fix: correct test assertion for default framework_name
HappyAmazonian Oct 29, 2025
342f656
refactor: remove hardcoded framework commands
HappyAmazonian Oct 29, 2025
0d62b4c
feat: complete comprehensive test suite for supervisor process manage…
HappyAmazonian Oct 29, 2025
aed019b
refactor: finalize supervisor process management implementation
HappyAmazonian Oct 29, 2025
1c7f066
refactor(supervisor): major cleanup and improvements
HappyAmazonian Oct 29, 2025
8b33a04
Fix supervisor integration tests and reorganize test structure
HappyAmazonian Oct 29, 2025
91f41ad
docs: update supervisor README with accurate vLLM integration example
HappyAmazonian Nov 4, 2025
cd7302a
docs: improve supervisor README structure and remove redundancy
HappyAmazonian Nov 4, 2025
54a9f6c
refactor
HappyAmazonian Nov 4, 2025
f7e308e
Simplify supervisor entrypoint script
HappyAmazonian Nov 4, 2025
f57e015
Clean up supervisor module formatting and documentation
HappyAmazonian Nov 4, 2025
ac9e3b9
Remove unused validate_config_directory function
HappyAmazonian Nov 4, 2025
75eb447
update readme
HappyAmazonian Nov 4, 2025
028da2f
readme
HappyAmazonian Nov 4, 2025
5344074
Simplify supervisor test suite
HappyAmazonian Nov 4, 2025
a0f0501
improve
HappyAmazonian Nov 4, 2025
b0d8de2
add test
HappyAmazonian Nov 4, 2025
5b7765f
fix ci
HappyAmazonian Nov 4, 2025
19a52c4
try ci
HappyAmazonian Nov 4, 2025
6dcdfd0
feat: implement custom configuration merging for supervisor generator
HappyAmazonian Nov 6, 2025
da136fd
feat: implement standard-supervisor CLI simplification
HappyAmazonian Nov 6, 2025
7e65164
test: add comprehensive unit tests for supervisor CLI components
HappyAmazonian Nov 6, 2025
dd0a6d6
Rewrite supervisor CLI integration tests with real behavior verification
HappyAmazonian Nov 7, 2025
d99b728
Complete supervisor improvements and test cleanup
HappyAmazonian Nov 7, 2025
fe99793
Fix supervisor tests and clean up obsolete test files
HappyAmazonian Nov 7, 2025
891bf2e
Update README with new environment variable names
HappyAmazonian Nov 7, 2025
8bb06f2
Fix supervisor dependency management and optimize version constraints
HappyAmazonian Nov 7, 2025
d5ffd18
Fix CI supervisor tests with start_new_session=True
HappyAmazonian Nov 7, 2025
703b0f0
Revert test changes and add supervisor installation in CI
HappyAmazonian Nov 7, 2025
776b27b
Enable pytest output in CI for debug information
HappyAmazonian Nov 7, 2025
c18bed7
try ci
HappyAmazonian Nov 7, 2025
5d57d41
ci
HappyAmazonian Nov 7, 2025
fe51ed8
Clean up debug code and fix supervisor integration tests
HappyAmazonian Nov 7, 2025
7b89a6e
Remove supervisorctl dependency and simplify process management
HappyAmazonian Nov 7, 2025
33dcba5
Update unit tests to remove supervisorctl references
HappyAmazonian Nov 7, 2025
e9e1f20
refactor: use regex pattern for SUPERVISOR_ env var validation
HappyAmazonian Nov 7, 2025
e1d7174
Merge branch 'main' into restart
HappyAmazonian Nov 11, 2025
fa96279
Merge branch 'main' into restart
HappyAmazonian Nov 11, 2025
a9059e9
update loc
HappyAmazonian Nov 11, 2025
80b043f
Merge remote-tracking branch 'upstream/restart' into restart
HappyAmazonian Nov 11, 2025
112 changes: 112 additions & 0 deletions PR_DESCRIPTION.md
@@ -0,0 +1,112 @@
# Add Supervisor Process Management Module

This PR introduces a **supervisor module** that runs an ML framework under supervisord, so a crashed process is restarted automatically up to a configurable retry limit. Integrating it into an existing Dockerfile takes one extra `pip install` and a changed entrypoint.

## Integration

Install and use with these commands:

```bash
pip install model-hosting-container-standards
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080
```

Or in a Dockerfile:
```dockerfile
COPY model_hosting_container_standards-0.1.2-py3-none-any.whl /tmp/
RUN pip install supervisor
RUN pip install /tmp/model_hosting_container_standards-0.1.2-py3-none-any.whl

# Use supervisor entrypoint for SageMaker
ENV ENGINE_AUTO_RECOVERY=true
ENV ENGINE_MAX_RECOVERY_ATTEMPTS=3
ENTRYPOINT ["standard-supervisor", "./sagemaker-entrypoint.sh"]
```

## Workflow

1. **Parse command and environment** → Read ML framework command and supervisor configuration
2. **Generate supervisord config** → Build the supervisord configuration file with `configparser`
3. **Start supervisord** → Launch supervisor daemon with your framework as managed process
4. **Monitor and restart** → Supervisor detects crashes and restarts automatically with configurable limits
5. **Handle failures** → After max retries, container exits gracefully with proper error codes
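
Step 2 of the workflow above can be sketched as follows. This is a minimal illustration, not the module's actual `generator.py`: the function name `build_supervisord_config`, its parameters, and the choice of settings are assumptions made for the example. Only the `program:app` section name and the use of `configparser` come from this description.

```python
import configparser
import io
import shlex


def build_supervisord_config(command, max_start_retries=3, log_level="info"):
    """Return supervisord.conf text that runs `command` as program "app".

    Hypothetical helper for illustration; the real generator may differ.
    """
    # interpolation=None stops configparser from interpreting "%" itself.
    # supervisord applies its own %(...)s expansion when it reads the file,
    # which this sketch does not handle.
    config = configparser.ConfigParser(interpolation=None)
    config["supervisord"] = {
        "nodaemon": "true",  # stay in the foreground, as a container entrypoint should
        "loglevel": log_level,
    }
    config["program:app"] = {
        "command": shlex.join(command),  # quote arguments that contain spaces
        "autorestart": "true",
        "startretries": str(max_start_retries),
        "stdout_logfile": "/dev/stdout",
        "stdout_logfile_maxbytes": "0",  # disable rotation on a non-seekable stream
        "redirect_stderr": "true",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_supervisord_config(["vllm", "serve", "model", "--port", "8080"]))
```

The resulting text would be written to a file and passed to supervisord with `supervisord -c <path>` (step 3).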

## Key Components

**Core Modules:**
- `models.py` - Configuration data models with comprehensive validation and environment variable parsing
- `generator.py` - Robust supervisord configuration generation using configparser

**CLI Tools & Scripts:**
- `scripts/standard_supervisor.py` - Main CLI tool for running ML frameworks under supervisor (`standard-supervisor`)
- `scripts/generate_supervisor_config.py` - Standalone configuration generator CLI

**Documentation & Tests:**
- `README.md` - Comprehensive setup guide with examples
- `tests/integration/test_supervisor_cli_integration.py` - **Real behavior integration tests** that verify actual restart and retry behavior
- `tests/supervisor/` - Comprehensive unit tests for all components

## Usage Examples

### Simple CLI Usage
```bash
# Direct command execution with supervisor
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080

# With custom configuration
PROCESS_MAX_START_RETRIES=5 SUPERVISOR_PROGRAM__APP_STARTSECS=30 \
standard-supervisor python -m tensorrt_llm.hlapi.llm_api
```

### Dockerfile Integration
```dockerfile
FROM vllm/vllm-openai:latest

# Install with supervisor support
RUN pip install model-hosting-container-standards

# Configure your ML framework with supervisor settings
ENV PROCESS_MAX_START_RETRIES=3
ENV SUPERVISOR_PROGRAM__APP_STARTSECS=30
ENV SUPERVISOR_PROGRAM__APP_STOPWAITSECS=60
ENV LOG_LEVEL=info

# Use supervisor for process management
ENTRYPOINT ["python", "-m", "model_hosting_container_standards.supervisor.scripts.standard_supervisor"]
CMD ["vllm", "serve", "model", "--host", "0.0.0.0", "--port", "8080"]
```

## Configuration Options

**Basic Configuration:**
- Command line arguments become the supervised process command
- `PROCESS_MAX_START_RETRIES=3` - Maximum startup attempts before giving up (0-100)
- `LOG_LEVEL=info` - Logging level (debug, info, warn, error, critical)
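
The range and level checks listed above could look roughly like this. It is a sketch under assumptions: the function name `read_basic_settings`, the returned dictionary shape, and the defaults of `3` and `info` are illustrative and not taken from `models.py`.

```python
import os

# Log levels listed in this description.
VALID_LOG_LEVELS = ("debug", "info", "warn", "error", "critical")


def read_basic_settings(environ=None):
    """Parse and validate the two basic settings (hypothetical helper)."""
    env = os.environ if environ is None else environ

    raw_retries = env.get("PROCESS_MAX_START_RETRIES", "3")  # default is an assumption
    try:
        retries = int(raw_retries)
    except ValueError:
        raise ValueError(
            f"PROCESS_MAX_START_RETRIES must be an integer, got {raw_retries!r}"
        ) from None
    if not 0 <= retries <= 100:
        raise ValueError(
            f"PROCESS_MAX_START_RETRIES must be between 0 and 100, got {retries}"
        )

    log_level = env.get("LOG_LEVEL", "info").lower()  # default is an assumption
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(
            f"LOG_LEVEL must be one of {VALID_LOG_LEVELS}, got {log_level!r}"
        )

    return {"max_start_retries": retries, "log_level": log_level}
```

Failing fast on a bad value at startup keeps a typo in a Dockerfile `ENV` line from turning into a confusing supervisord error later.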

**Advanced Supervisor Settings:**
- `SUPERVISOR_PROGRAM__APP_STARTSECS=30` - Time process must run to be considered "started"
- `SUPERVISOR_PROGRAM__APP_STOPWAITSECS=60` - Time to wait for graceful shutdown
- `SUPERVISOR_PROGRAM__APP_AUTORESTART=true` - Enable automatic restart on failure
- `SUPERVISOR_PROGRAM__APP_STARTRETRIES=3` - Startup retry attempts
- `SUPERVISOR_CONFIG_PATH=/tmp/supervisord.conf` - Custom config file location

**Custom Sections:**
- `SUPERVISOR_SUPERVISORD_LOGLEVEL=debug` - Supervisord daemon log level
- `SUPERVISOR_EVENTLISTENER__MEMMON_COMMAND=memmon -a 200MB` - Add custom event listeners
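
The variable names above appear to follow a `SUPERVISOR_<SECTION>_<KEY>` scheme in which a double underscore in the section stands for the colon in names such as `program:app`. The sketch below shows one way to parse that scheme with a regular expression. The pattern, the `collect_overrides` name, and the rule that section parts contain no single underscores are assumptions for illustration, not the module's actual implementation.

```python
import re

# SUPERVISOR_<SECTION>_<KEY>; "__" inside <SECTION> maps to ":".
# Assumption: section parts contain no single underscores, so the first
# single "_" after the section separates it from the key.
SUPERVISOR_VAR = re.compile(
    r"^SUPERVISOR_([A-Z0-9]+(?:__[A-Z0-9]+)?)_([A-Z0-9][A-Z0-9_]*)$"
)

# Variables that share the prefix but are not section overrides.
RESERVED = {"SUPERVISOR_CONFIG_PATH"}


def collect_overrides(environ):
    """Group SUPERVISOR_* variables into {section: {key: value}}."""
    overrides = {}
    for name, value in environ.items():
        if name in RESERVED:
            continue
        match = SUPERVISOR_VAR.match(name)
        if match is None:
            continue
        section = match.group(1).lower().replace("__", ":")
        key = match.group(2).lower()
        overrides.setdefault(section, {})[key] = value
    return overrides
```

With this scheme, `SUPERVISOR_PROGRAM__APP_STARTSECS=30` becomes `startsecs = 30` under `[program:app]`, and keys that contain underscores, such as `stdout_logfile`, still parse because only the section is restricted.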

## Testing & Validation

**Comprehensive Test Suite:**
- **Integration Tests** - Run actual supervisor processes to verify continuous restart and retry limit behavior

**Test Coverage:**
- **Continuous restart behavior** - Verifies supervisor actually restarts failed processes
- **Startup retry limits** - Confirms supervisor respects retry limits and gives up appropriately
- **Signal handling** - Tests graceful shutdown with SIGTERM
- **ML framework integration** - Tests with realistic ML framework startup patterns
- **Configuration generation** - Validates all supervisor configuration options
- **Error handling** - Tests invalid configurations and edge cases

**Manual Testing:**
- Tested with a vLLM Dockerfile build
- Verified with `docker exec` process killing to confirm restart behavior
- Validated in production-like container environments
16 changes: 16 additions & 0 deletions python/MANIFEST.in
@@ -0,0 +1,16 @@
# Include supervisor scripts
recursive-include model_hosting_container_standards/supervisor/scripts *

# Include documentation
include README.md
include LICENSE

# Include configuration files
include pyproject.toml

# Exclude development files
exclude .gitignore
exclude .pre-commit-config.yaml
recursive-exclude * __pycache__
recursive-exclude * *.py[co]
recursive-exclude * .DS_Store