# Text Salience API
A Flask API for computing text salience using sentence transformers, with HAProxy-based queue management to handle resource contention.
## Architecture
```
nginx (SSL termination, :443)
  └─► HAProxy (queue manager, 127.0.0.2:5000)
        ├─► [2 slots available] → Gunicorn workers (127.0.89.34:5000)
        │     Process request normally
        │     Track processing span
        └─► [Queue full, 120+] → /overflow endpoint (127.0.89.34:5000)
              Return 429 with stats
              Track overflow arrival
```
## Queue Management
- **Processing slots**: 2 concurrent requests
- **Queue depth**: 120 requests
- **Queue timeout**: 10 minutes
- **Processing time**: ~5 seconds per request
When the queue is full, requests are routed to `/overflow`, which returns a 429 status along with statistics about:
- Recent processing spans (last 5 minutes)
- Overflow arrival times (last 5 minutes)
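An overflow response body might look roughly like the following; the field names and shapes here are illustrative, so check the actual `/overflow` handler for the real schema:

```json
{
  "error": "queue_full",
  "processing_spans": [[1733000000.0, 1733000005.2]],
  "overflow_arrivals": [1733000010.4, 1733000012.1]
}
```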
The frontend can use these statistics to:
- Calculate queue probability using Poisson distribution
- Display estimated wait times
- Show arrival rate trends
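The Poisson-based estimate above can be sketched as follows. This is a minimal illustration of the math, not the frontend's actual code; the function name and parameters are hypothetical. It models arrivals as a Poisson process and asks how likely it is that enough requests arrive within one service window to fill both processing slots:

```python
import math

def queue_probability(arrivals_last_5min: int, slots: int = 2,
                      service_time_s: float = 5.0) -> float:
    """Estimate the chance a new request has to wait in the queue.

    lambda is the observed arrival rate (from the 429 stats); we ask
    how likely it is that `slots` or more requests arrive within one
    ~5 s service window, i.e. that all slots are busy.
    """
    rate_per_s = arrivals_last_5min / 300.0   # observed lambda (req/s)
    mu = rate_per_s * service_time_s          # expected arrivals per window
    # P(N >= slots) = 1 - sum_{k < slots} e^{-mu} * mu^k / k!
    p_below = sum(math.exp(-mu) * mu**k / math.factorial(k)
                  for k in range(slots))
    return 1.0 - p_below
```

With no recent arrivals the probability is zero, and it rises toward one as the observed arrival rate approaches and exceeds the 2-slot capacity.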
## Run API
### Development (without queue)
```bash
uv run flask --app salience run
```
### Production (with HAProxy queue)
1. **Start Gunicorn** with preloaded models (loads models once, forks 3 workers):
```bash
uv run gunicorn \
  --preload \
  --workers 3 \
  --bind 127.0.89.34:5000 \
  --timeout 300 \
  --access-logfile - \
  salience:app
```
(3 workers: 2 for model processing + 1 for overflow/stats responses)
2. **Start HAProxy** (assumes you're including `haproxy.cfg` in your main HAProxy config):
```bash
# If running standalone HAProxy for this service:
# Uncomment the global/defaults sections in haproxy.cfg first
haproxy -f haproxy.cfg
# If using a global HAProxy instance:
# Include the frontend/backend sections from haproxy.cfg in your main config
```
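The queue parameters above (2 slots, 120-deep queue, 10-minute timeout, overflow routing) might map to `haproxy.cfg` sections roughly like this sketch; backend names and the exact threshold are illustrative, so defer to the actual `haproxy.cfg` in this repo:

```haproxy
frontend salience_front
    bind 127.0.0.2:5000
    # Route to the overflow backend once the 2 slots plus the
    # 120-deep queue are saturated (names/threshold illustrative)
    acl queue_full be_conn(salience_back) ge 122
    use_backend salience_overflow if queue_full
    default_backend salience_back

backend salience_back
    timeout queue 10m
    server app 127.0.89.34:5000 maxconn 2

backend salience_overflow
    # Rewrite the path so Gunicorn serves the 429 stats response
    http-request set-path /overflow
    server app 127.0.89.34:5000
```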
3. **Configure nginx** to proxy to HAProxy:
```nginx
location /api/salience {
    proxy_pass http://127.0.0.2:5000;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_read_timeout 900s;
}
```
## Benchmarks
```bash
# Generate embeddings
uv run python3 benchmarks/generate_embeddings.py
# Run benchmarks
uv run pytest benchmarks/test_bench_cosine_sim.py --benchmark-json=benchmarks/genfiles/benchmark_results.json
# Visualize results
uv run python3 benchmarks/visualize_benchmarks.py benchmarks/genfiles/benchmark_results.json
```