Text Salience API

A Flask API for computing text salience using sentence transformers, with HAProxy-based queue management to handle resource contention.
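
One plausible way to compute salience with sentence transformers (not necessarily what the salience package actually does) is to embed each sentence and score it against the document as a whole. The snippet below is only a sketch of that idea; the model name and the cosine-similarity scoring are assumptions:

# Illustrative sketch of sentence-level salience scoring. The real model and
# scoring logic live in the salience package; the model name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def salience_scores(sentences):
    # Score each sentence by cosine similarity to the whole-document embedding.
    sentence_embeddings = model.encode(sentences)
    document_embedding = model.encode(" ".join(sentences))
    return [float(util.cos_sim(emb, document_embedding)) for emb in sentence_embeddings]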

Architecture

nginx (SSL termination, :443)
    ↓
HAProxy (queue manager, 127.0.0.2:5000)
    ├─► [2 slots available] → Gunicorn workers (127.0.89.34:5000)
    │                          Process request normally
    │                          Track processing span
    │
    └─► [Queue full, 120+] → /overflow endpoint (127.0.89.34:5000)
                              Return 429 with stats
                              Track overflow arrival
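
From the client's side, the two branches above look like this (a sketch only: the JSON payload and response fields are assumptions rather than the actual API contract; only the /api/salience path comes from the nginx config below):

import requests

# Hypothetical payload and response handling; the real request/response
# schema is defined by the Flask app, not by this sketch.
resp = requests.post(
    "https://example.com/api/salience",
    json={"text": "Some document to score."},
    timeout=900,
)

if resp.status_code == 429:
    # Queue was full: HAProxy routed the request to /overflow, which returns
    # recent processing spans and overflow arrivals instead of scores.
    print("busy, stats:", resp.json())
else:
    resp.raise_for_status()
    print("salience scores:", resp.json())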

Queue Management

  • Processing slots: 2 concurrent requests
  • Queue depth: 120 requests
  • Queue timeout: 10 minutes
  • Processing time: ~5 seconds per request

At ~5 seconds per request across 2 slots, a full queue of 120 requests takes roughly 120 × 5 / 2 = 300 seconds (about 5 minutes) to drain, comfortably within the 10-minute queue timeout.

When the queue is full, requests are routed to /overflow, which returns a 429 status with statistics about:

  • Recent processing spans (last 5 minutes)
  • Overflow arrival times (last 5 minutes)

The frontend can use these statistics to:

  • Calculate queue probability using a Poisson arrival model (see the sketch after this list)
  • Display estimated wait times
  • Show arrival rate trends
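
A rough sketch of the Poisson estimate, assuming the 429 payload exposes the overflow arrival timestamps under some field (the field name below is hypothetical):

import math

def arrival_rate(arrival_timestamps, window_seconds=300):
    # Arrivals per second, estimated over the 5-minute stats window.
    return len(arrival_timestamps) / window_seconds

def prob_at_most(k, lam, t):
    # Poisson probability of at most k arrivals in the next t seconds,
    # given an arrival rate of lam arrivals/second.
    mean = lam * t
    return sum(math.exp(-mean) * mean**n / math.factorial(n) for n in range(k + 1))

# Usage with the stats from a 429 response (field name is hypothetical):
# lam = arrival_rate(stats["overflow_arrivals"])
# prob_at_most(5, lam, 30)  # chance of at most 5 new arrivals in the next 30 s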

Run API

Development (without queue)

uv run flask --app salience run

Production (with HAProxy queue)

  1. Start Gunicorn with preloaded models (loads models once, forks 3 workers):
uv run gunicorn \
    --preload \
    --workers 3 \
    --bind 127.0.89.34:5000 \
    --timeout 300 \
    --access-logfile - \
    salience:app

(--preload loads the model in the master process before forking, so the 3 workers share it copy-on-write: 2 for model processing + 1 for overflow/stats responses)

  2. Start HAProxy (either standalone, or by including haproxy.cfg in your main HAProxy config):
# If running standalone HAProxy for this service:
# Uncomment the global/defaults sections in haproxy.cfg first
haproxy -f haproxy.cfg

# If using a global HAProxy instance:
# Include the frontend/backend sections from haproxy.cfg in your main config

  3. Configure nginx to proxy to HAProxy:
location /api/salience {
    proxy_pass http://127.0.0.2:5000;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_read_timeout 900s;
}

Benchmarks

# Generate embeddings
uv run python3 benchmarks/generate_embeddings.py

# Run benchmarks
uv run pytest benchmarks/test_bench_cosine_sim.py --benchmark-json=benchmarks/genfiles/benchmark_results.json

# Visualize results
uv run python3 benchmarks/visualize_benchmarks.py benchmarks/genfiles/benchmark_results.json
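
For reference, the general shape of a pytest-benchmark test (a sketch of the pattern only, not the actual contents of benchmarks/test_bench_cosine_sim.py):

import numpy as np

def cosine_sim(a, b):
    # Plain NumPy cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def test_bench_cosine_sim(benchmark):
    # pytest-benchmark injects the `benchmark` fixture and times the callable.
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(384), rng.standard_normal(384)
    result = benchmark(cosine_sim, a, b)
    assert -1.0 <= result <= 1.0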