Agora Eval Artifact Registry

Maintained by: Agora (agora.so)
Last updated: 2026-03-23

The Agora eval artifact registry is a curated index of temporal degradation benchmarks produced by Agora's eval pipeline. Each artifact is a reproducible, grounded measurement of how AI model performance degrades as training data ages — with real numbers, real datasets, and real degradation curves.

Enterprise buyers can use these artifacts to:

Benchmark their own models against industry baselines
Set evidence-based retraining cadences
Evaluate AI vendors on temporal robustness (not just static benchmarks)
Prioritize retraining budgets on the entity types and domains that actually drift

Available Artifacts

1. TemporalWiki — General Knowledge Drift

| Field | Value | |---|---| | Domain | General knowledge (Wikipedia) | | Task | 12-class topic categorization | | Dataset | seonghyeonye/TemporalWiki (3.3M rows, 4 temporal snapshots) | | Published | 2026-03-22 | | Version | 1.0 |

Key degradation numbers:

| Time Since Training | Accuracy | Δ Accuracy | macro-F1 | Δ F1 | |---|---|---|---|---| | Baseline (in-distribution) | 63.5% | — | 0.6283 | — | | ~4 months | 63.3% | −0.2% | 0.6378 | +0.95% | | ~6 months | 60.7% | −2.8% | 0.6084 | −1.99% | | ~9 months | 59.7% | −3.8% | 0.5989 | −2.94% |

Finding: Degradation is monotonic. At 9 months, ~6% of headroom above chance is gone. The macro-F1 sign flip at ≥6 months confirms it's real signal, not artifact. Content drift from real-world events is enough to meaningfully hurt a static classifier.

Use this artifact for: Model freshness audits, retraining decision triggers (6-month inflection point), vendor comparison baseline, internal SLA calibration.

2. FiNER-139 — Financial NER Temporal Degradation

| Field | Value | |---|---| | Domain | Finance (SEC EDGAR 10-K/8-K filings) | | Task | Binary entity detection + per-category NER F1 (139 XBRL entity types) | | Dataset | nlpaueb/finer-139 (1.12M sentences, CC BY 4.0) | | Published | 2026-03-23 | | Version | 1.0 |

Key degradation numbers:

| Gap | Binary F1 | Δ | SHARES/COUNT F1 | M&A-volatile F1 | |---|---|---|---|---| | In-distribution (2017-18) | 0.907 | — | 0.680 | 0.614 | | 1yr gap (2019) | 0.874 | −0.033 | 0.707 | 0.463 | | 2yr gap (2020) | 0.871 | −0.036 | 0.622 | 0.281 | | 4yr+ gap (2022-24) | 0.840 | −0.067 | 0.000 | 0.182 |

Finding: Binary F1 drops 6.7 points over 4 years. But the category-level story is what matters. SHARES/COUNT entities — share counts, SPAC structures, earn-out disclosures — collapse from F1=0.680 to F1=0.000 at 4-year gap. Total collapse. M&A-volatile entities lose 70% of their relative F1. Domain-pretrained FinBERT shows the same degradation as TF-IDF — this is not a model-quality problem. It's a distribution shift problem.

Use this artifact for: Finance AI model audits, retraining trigger thresholds, vendor evaluation for financial document processing, selective retraining prioritization (not all entity types drift equally).

3. Speech ASR — Multilingual Confidence & Routing Drift

| Field | Value | |---|---| | Domain | Speech / Automatic Speech Recognition | | Task | Whisper confidence calibration + multilingual routing accuracy under distribution shift | | Dataset | LibriSpeech test-other (n=100), EU L2 accent corpus (n=30), East Asian accent corpus (n=90), GMU/Deepgram accent clips (n=13), AMI corpus (n=1), gTTS AR/EN synthetic (n=4) | | Published | 2026-03-23 | | Version | 1.0 |

Key results:

| Dimension | Finding | |---|---| | EU Spanish ECE vs native EN baseline | +155% worse calibration | | East Asian ECE at fixed T=4.0 | 0.022–0.025 (better than native EN) | | Cross-pop ECE (n=156, 7 accent groups) | 0.0394 at T=4.0 | | All adaptive-T designs tested | All regress vs fixed T=4.0 (worst: +21.8%) | | Routing accuracy — accented English (n=13) | 100% correct routing | | Routing accuracy — noisy/compressed audio (n=7) | 71% (G.711 telephony: 100%; fails at ≥15dB noise + Arabic) | | Simulation vs live inference threshold gap | +42 percentage points over-flagging (N(−0.08, 0.055) vs actual median −0.357) |

Finding: ASR confidence calibration degrades measurably as speaker population shifts from training distribution. EU Spanish ECE is 2.5× worse than the cross-pop average. Adaptive confidence-gating designs consistently regress vs a fixed calibration baseline — because no single threshold is optimal across accent groups without group-aware routing. Crucially: simulation-derived thresholds are invalidated by live inference on real audio, with a 42 percentage-point gap in false-positive rate. This is the speech equivalent of temporal drift: any ASR deployment relying on calibration assumptions from a homogeneous eval corpus will degrade on a real-world multilingual user base.

Use this artifact for: Multilingual ASR product audits, confidence threshold calibration validation, adaptive calibration design review, audio quality floor specification.

Cross-Artifact Comparison

| Dimension | TemporalWiki | FiNER-139 | Speech ASR | |---|---|---|---| | Domain | General knowledge | Finance (SEC filings) | Speech / ASR | | Shift axis | Time (months/years) | Time (years) | Speaker population + audio quality | | ID baseline | 63.5% accuracy | 90.4% binary F1 | ECE 0.0515 (native EN) | | Max degradation | −2.94 F1 pts (9mo) | −6.70 F1 pts (4yr+) | +155% ECE (EU Spanish) | | Category collapse | Not measured | SHARES: 100% collapse | Adaptive-T: all designs regress | | Degradation shape | Monotonic | Monotonic (clean) | Group-dependent, non-uniform | | Mechanism | Topic vocabulary drift | Entity type turnover (M&A, IPO) | Accent population shift + audio degradation | | Simulation gap | N/A | N/A | 42pp threshold over-flagging |

Three independent artifacts across three domains. Three different shift mechanisms. Same conclusion: distribution shift is the primary driver of production AI degradation, regardless of domain. Temporal, demographic, and acoustic distribution shifts all follow the same pattern — what the model was calibrated on is not what it faces in production.

How to Use These Artifacts

Benchmark your model

Run your production model against the eval sets in each artifact. Compare your degradation curve against the Agora baseline. Faster decay = your retraining cadence is too slow. Slower decay = you're ahead of the benchmark.

Set retraining thresholds from evidence

Most teams retrain on fixed calendars. These curves give you empirical inflection points instead:

TemporalWiki: Meaningful signal at 6 months, significant at 9 months
FiNER-139: ~3 pts binary F1 drop at 1yr, SHARES/COUNT collapse by 4yr+
Speech ASR: Validate confidence thresholds with live inference before deployment; never use simulation-derived thresholds on heterogeneous populations

Evaluate AI vendors on temporal robustness

When an AI vendor shows you static benchmark numbers, ask them to run against these datasets. A model that holds accuracy better over temporal gaps — and across population shifts — is genuinely better. Not just better at cherry-picked evals.

Coming Soon

Embedding-based TemporalWiki eval — expected to show stronger degradation signal than TF-IDF baseline
FiNER-139 v1.1 — full GPU fine-tuned BERT results (training run in progress)
Speech ASR v1.1 — real human speech corpus validation (gTTS synthetic limitation addressed), English threshold recalibration at live-inference-validated threshold (~−0.55)
OOD sentiment eval — financial sentiment degradation under domain shift

Agora is building the infrastructure to measure and prevent AI model degradation. All artifacts are reproducible and grounded in real-world data. For access, questions, or custom eval runs: agora.so