salience-editor/api
nobody e50cd9757b
fix: port 5000 conflicts with AirPlay on macOS
Use port 15000 for the default development port.
If you ever cloned the repo on a Mac, ran the demo, and found that the models list
never loaded, or saw 403 errors in the browser console: check the Server headers.
Chances are the request went to the AirPlay service, which also listens on port 5000.
2025-12-03 11:08:38 -08:00
benchmarks feat: create deployment scripts 2025-11-02 14:16:56 -08:00
salience fix: port 5000 conflicts with AirPlay on macOS 2025-12-03 11:08:38 -08:00
.gitignore feat: deploy model api server to chicago-web01 2025-11-27 11:01:54 -08:00
deploy.sh feat: make version deployable 2025-11-29 13:56:55 -08:00
pyproject.toml feat: make version deployable 2025-11-29 13:56:55 -08:00
README.md feat: make version deployable 2025-11-29 13:56:55 -08:00
salience-editor-api.nomad.hcl feat: make version deployable 2025-11-29 13:56:55 -08:00
smoke-test.sh fix: port 5000 conflicts with AirPlay on macOS 2025-12-03 11:08:38 -08:00
transcript-1.txt refactor: rename ML model python backend folder 2025-10-30 17:55:24 -07:00
transcript.txt refactor: rename ML model python backend folder 2025-10-30 17:55:24 -07:00
uv.lock feat: make version deployable 2025-11-29 13:56:55 -08:00

Text Salience API

A Flask API for computing text salience using sentence transformers, with HAProxy-based queue management to handle resource contention.
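
As a rough illustration of the general approach (the scoring itself lives in the salience package; the model choice and the exact scoring rule below are assumptions, not the repo's implementation):

# Illustrative sketch only; see the salience package for the real scoring.
# Assumes sentence-level salience = cosine similarity to the mean document
# embedding; "all-MiniLM-L6-v2" is an example model choice.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def salience_scores(sentences: list[str]) -> np.ndarray:
    embs = model.encode(sentences, normalize_embeddings=True)
    doc = embs.mean(axis=0)
    doc /= np.linalg.norm(doc)      # re-normalize the mean vector
    return embs @ doc               # cosine similarity per sentence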

Architecture

nginx (SSL termination, :443)
    ↓
HAProxy (queue manager, 127.0.0.2:5000)
    ├─► [2 slots available] → Gunicorn workers (127.0.89.34:5000)
    │                          Process request normally
    │                          Track processing span
    │
    └─► [Queue full, 120+] → /overflow endpoint (127.0.89.34:5000)
                              Return 429 with stats
                              Track overflow arrival

Queue Management

  • Processing slots: 2 concurrent requests
  • Queue depth: 120 requests
  • Queue timeout: 10 minutes
  • Processing time: ~5 seconds per request
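
A rough sanity check on these numbers: 2 slots at ~5 seconds per request gives a service rate of about 0.4 requests/second, so a full queue of 120 drains in roughly 120 / 0.4 = 300 seconds, i.e. about 5 minutes, comfortably inside the 10-minute queue timeout.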

When the queue is full, requests are routed to /overflow, which returns a 429 status with statistics about:

  • Recent processing spans (last 5 minutes)
  • Overflow arrival times (last 5 minutes)

The frontend can use these statistics to:

  • Calculate queue probability using a Poisson distribution (see the sketch after this list)
  • Display estimated wait times
  • Show arrival rate trends
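
As a rough illustration of the Poisson calculation (in Python for consistency with the rest of the repo, not the actual frontend code; the field names in stats are assumptions about the 429 payload):

# Sketch of the queue-probability estimate. Field names are hypothetical;
# inspect a real 429 body for the actual shape.
import math

WINDOW_S = 300   # the stats cover the last 5 minutes
SLOTS = 2        # concurrent processing slots

def p_at_least(k: int, lam: float) -> float:
    """P(N >= k) for N ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

def summarize(stats: dict) -> dict:
    spans = stats["processing_spans"]        # assumed: [(start_ts, end_ts), ...]
    arrivals = stats["overflow_arrivals"]    # assumed: [ts, ...]
    mean_service = sum(e - s for s, e in spans) / max(len(spans), 1)
    rate = len(arrivals) / WINDOW_S          # arrivals per second
    lam = rate * mean_service                # expected arrivals per service window
    return {
        "arrival_rate_per_min": rate * 60,
        "mean_service_s": mean_service,
        # chance that arrivals during one service window exceed the free slots
        "p_contention": p_at_least(SLOTS, lam),
    }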

Run API

Development (without queue)

uv run flask --app salience run
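
A quick way to exercise the dev server (the route and payload shape below are assumptions; check the Flask app for the real ones). Note that after the AirPlay fix above, the default development port is 15000:

# Hypothetical smoke test; route and payload are assumptions.
# The dev server defaults to port 15000 (5000 collides with AirPlay on macOS).
import requests

resp = requests.post(
    "http://127.0.0.1:15000/api/salience",           # assumed route
    json={"text": "Example sentence to score."},     # assumed payload
)
print(resp.status_code, resp.json())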

Production (with HAProxy queue)

  1. Start Gunicorn with preloaded models (loads models once, forks 3 workers):
uv run gunicorn \
    --preload \
    --workers 3 \
    --bind 127.0.89.34:5000 \
    --timeout 300 \
    --access-logfile - \
    salience:app

(3 workers: 2 for model processing + 1 for overflow/stats responses)

  2. Start HAProxy (assumes you're including haproxy.cfg in your main HAProxy config):
# If running standalone HAProxy for this service:
# Uncomment the global/defaults sections in haproxy.cfg first
haproxy -f haproxy.cfg

# If using a global HAProxy instance:
# Include the frontend/backend sections from haproxy.cfg in your main config
  3. Configure nginx to proxy to HAProxy:
location /api/salience {
    proxy_pass http://127.0.0.2:5000;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_read_timeout 900s;
}
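
With the full chain up, clients should handle the 429 overflow path explicitly. A hedged sketch (host, route, and payload are placeholders; smoke-test.sh is the repo's authoritative end-to-end check):

# Sketch of a client that handles the overflow path; host, route, and
# payload are placeholders.
import requests

resp = requests.post(
    "https://example.com/api/salience",           # placeholder host/route
    json={"text": "Example sentence to score."},
    timeout=900,                                  # matches proxy_read_timeout
)
if resp.status_code == 429:
    # queue full: the body carries the overflow statistics described above
    print("overflow stats:", resp.json())
else:
    resp.raise_for_status()
    print(resp.json())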

Benchmarks

# Generate embeddings
uv run python3 benchmarks/generate_embeddings.py

# Run benchmarks
uv run pytest benchmarks/test_bench_cosine_sim.py --benchmark-json=benchmarks/genfiles/benchmark_results.json

# Visualize results
uv run python3 benchmarks/visualize_benchmarks.py benchmarks/genfiles/benchmark_results.json