salience-editor/api
nobody e50cd9757b
fix: port 5000 conflicts with AirPlay on macOS
Use port 15000 for the default development port.
If you ever cloned the repo on a Mac, ran the demo, and found that the models list
never loaded, or saw 403 errors in the browser console: check the Server headers.
Chances are the request went to the AirPlay service, which also listens on port 5000.
2025-12-03 11:08:38 -08:00
benchmarks feat: create deployment scripts 2025-11-02 14:16:56 -08:00
salience fix: port 5000 conflicts with AirPlay on macOS 2025-12-03 11:08:38 -08:00
.gitignore feat: deploy model api server to chicago-web01 2025-11-27 11:01:54 -08:00
deploy.sh feat: make version deployable 2025-11-29 13:56:55 -08:00
pyproject.toml feat: make version deployable 2025-11-29 13:56:55 -08:00
README.md feat: make version deployable 2025-11-29 13:56:55 -08:00
salience-editor-api.nomad.hcl feat: make version deployable 2025-11-29 13:56:55 -08:00
smoke-test.sh fix: port 5000 conflicts with AirPlay on macOS 2025-12-03 11:08:38 -08:00
transcript-1.txt refactor: rename ML model python backend folder 2025-10-30 17:55:24 -07:00
transcript.txt refactor: rename ML model python backend folder 2025-10-30 17:55:24 -07:00
uv.lock feat: make version deployable 2025-11-29 13:56:55 -08:00

Text Salience API

A Flask API for computing text salience using sentence transformers, with HAProxy-based queue management to handle resource contention.
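
As a rough illustration of the general approach (the scoring itself lives in the salience package; the model choice and the exact scoring rule below are assumptions, not the repo's implementation):

# Illustrative sketch only; see the salience package for the real scoring.
# Assumes sentence-level salience = cosine similarity to the mean document
# embedding; "all-MiniLM-L6-v2" is an example model choice.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def salience_scores(sentences: list[str]) -> np.ndarray:
    embs = model.encode(sentences, normalize_embeddings=True)
    doc = embs.mean(axis=0)
    doc /= np.linalg.norm(doc)      # re-normalize the mean vector
    return embs @ doc               # cosine similarity per sentence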

Architecture

nginx (SSL termination, :443)
    ↓
HAProxy (queue manager, 127.0.0.2:5000)
    ├─► [2 slots available] → Gunicorn workers (127.0.89.34:5000)
    │                          Process request normally
    │                          Track processing span
    │
    └─► [Queue full, 120+] → /overflow endpoint (127.0.89.34:5000)
                              Return 429 with stats
                              Track overflow arrival

Queue Management

  • Processing slots: 2 concurrent requests
  • Queue depth: 120 requests
  • Queue timeout: 10 minutes
  • Processing time: ~5 seconds per request
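
A rough sanity check on these numbers: 2 slots at ~5 seconds per request gives a service rate of about 0.4 requests/second, so a full queue of 120 drains in roughly 120 / 0.4 = 300 seconds, i.e. about 5 minutes, comfortably inside the 10-minute queue timeout.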

When the queue is full, requests are routed to /overflow, which returns a 429 status with statistics about:

  • Recent processing spans (last 5 minutes)
  • Overflow arrival times (last 5 minutes)

The frontend can use these statistics to:

  • Calculate queue probability using a Poisson distribution (see the sketch after this list)
  • Display estimated wait times
  • Show arrival rate trends
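
As a rough illustration of the Poisson calculation (in Python for consistency with the rest of the repo, not the actual frontend code; the field names in stats are assumptions about the 429 payload):

# Sketch of the queue-probability estimate. Field names are hypothetical;
# inspect a real 429 body for the actual shape.
import math

WINDOW_S = 300   # the stats cover the last 5 minutes
SLOTS = 2        # concurrent processing slots

def p_at_least(k: int, lam: float) -> float:
    """P(N >= k) for N ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

def summarize(stats: dict) -> dict:
    spans = stats["processing_spans"]        # assumed: [(start_ts, end_ts), ...]
    arrivals = stats["overflow_arrivals"]    # assumed: [ts, ...]
    mean_service = sum(e - s for s, e in spans) / max(len(spans), 1)
    rate = len(arrivals) / WINDOW_S          # arrivals per second
    lam = rate * mean_service                # expected arrivals per service window
    return {
        "arrival_rate_per_min": rate * 60,
        "mean_service_s": mean_service,
        # chance that arrivals during one service window exceed the free slots
        "p_contention": p_at_least(SLOTS, lam),
    }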

Run API

Development (without queue)

uv run flask --app salience run
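
A quick way to exercise the dev server (the route and payload shape below are assumptions; check the Flask app for the real ones). Note that after the AirPlay fix above, the default development port is 15000:

# Hypothetical smoke test; route and payload are assumptions.
# The dev server defaults to port 15000 (5000 collides with AirPlay on macOS).
import requests

resp = requests.post(
    "http://127.0.0.1:15000/api/salience",           # assumed route
    json={"text": "Example sentence to score."},     # assumed payload
)
print(resp.status_code, resp.json())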

Production (with HAProxy queue)

  1. Start Gunicorn with preloaded models (loads models once, forks 3 workers):
uv run gunicorn \
    --preload \
    --workers 3 \
    --bind 127.0.89.34:5000 \
    --timeout 300 \
    --access-logfile - \
    salience:app

(3 workers: 2 for model processing + 1 for overflow/stats responses)

  2. Start HAProxy (assumes you're including haproxy.cfg in your main HAProxy config):
# If running standalone HAProxy for this service:
# Uncomment the global/defaults sections in haproxy.cfg first
haproxy -f haproxy.cfg

# If using a global HAProxy instance:
# Include the frontend/backend sections from haproxy.cfg in your main config
  3. Configure nginx to proxy to HAProxy:
location /api/salience {
    proxy_pass http://127.0.0.2:5000;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_read_timeout 900s;
}
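
With the full chain up, clients should handle the 429 overflow path explicitly. A hedged sketch (host, route, and payload are placeholders; smoke-test.sh is the repo's authoritative end-to-end check):

# Sketch of a client that handles the overflow path; host, route, and
# payload are placeholders.
import requests

resp = requests.post(
    "https://example.com/api/salience",           # placeholder host/route
    json={"text": "Example sentence to score."},
    timeout=900,                                  # matches proxy_read_timeout
)
if resp.status_code == 429:
    # queue full: the body carries the overflow statistics described above
    print("overflow stats:", resp.json())
else:
    resp.raise_for_status()
    print(resp.json())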

Benchmarks

# Generate embeddings
uv run python3 benchmarks/generate_embeddings.py

# Run benchmarks
uv run pytest benchmarks/test_bench_cosine_sim.py --benchmark-json=benchmarks/genfiles/benchmark_results.json

# Visualize results
uv run python3 benchmarks/visualize_benchmarks.py benchmarks/genfiles/benchmark_results.json