Scoring

AEOE scores models on cost per unit of work rather than per-token spend. Each run is scored on a combination of its actual cost, its actual latency, and a pluggable quality signal (for example a test pass, a judge score, or a user thumbs-up). The resulting score updates the model’s prior for that task shape.
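
To make the scoring loop concrete, here is a minimal Python sketch. It is not AEOE's actual implementation: the text above does not say how cost, latency, and quality are combined or how the prior is updated, so the weighted sum, the budget-based normalization, the exponential moving average, and every name (`RunResult`, `score_run`, `PriorTable`, the weights, `alpha`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class RunResult:
    """Hypothetical record of one completed run (one unit of work)."""
    cost_usd: float   # actual spend for the whole run, not per token
    latency_s: float  # actual wall-clock latency in seconds
    quality: float    # pluggable quality signal, normalized to [0, 1]


def quality_from_test(passed: bool) -> float:
    """One possible quality adapter: a test pass maps to 1.0, a failure to 0.0."""
    return 1.0 if passed else 0.0


def score_run(
    run: RunResult,
    cost_budget_usd: float,
    latency_budget_s: float,
    w_cost: float = 0.3,     # assumed weight, not taken from AEOE
    w_latency: float = 0.2,  # assumed weight, not taken from AEOE
    w_quality: float = 0.5,  # assumed weight, not taken from AEOE
) -> float:
    """Combine cost, latency, and quality into one score in [0, 1].

    Cost and latency are expressed as the unused fraction of a budget,
    clamped at zero, so a cheaper or faster run scores higher.
    """
    cost_term = max(0.0, 1.0 - run.cost_usd / cost_budget_usd)
    latency_term = max(0.0, 1.0 - run.latency_s / latency_budget_s)
    return w_cost * cost_term + w_latency * latency_term + w_quality * run.quality


class PriorTable:
    """Per-(model, task shape) prior, updated by exponential moving average."""

    def __init__(self, initial: float = 0.5, alpha: float = 0.2) -> None:
        self.initial = initial  # prior assumed for unseen (model, task shape) pairs
        self.alpha = alpha      # how strongly a single run moves the prior
        self._priors: Dict[Tuple[str, str], float] = {}

    def get(self, model: str, task_shape: str) -> float:
        return self._priors.get((model, task_shape), self.initial)

    def update(self, model: str, task_shape: str, score: float) -> float:
        old = self.get(model, task_shape)
        new = (1.0 - self.alpha) * old + self.alpha * score
        self._priors[(model, task_shape)] = new
        return new
```

In this sketch, a caller would build a `RunResult` after each run, pass it to `score_run` with budgets appropriate to the task, and feed the result to `PriorTable.update` under the model and task-shape key. A moving average is only one option; a Bayesian update (for example a Beta posterior over success) would fit the word "prior" equally well.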