Agora East Asian ASR Test Set — README

2026-03-19

A 45-clip benchmark test set for detecting catastrophic ASR failure on East Asian-accented English. Clips cover Japanese, Korean, and Mandarin L1 speakers reading the same English passage.


Clip sourcing: Speech Accent Archive

Source: http://accent.gmu.edu/
License: Creative Commons Attribution 4.0 International
Passage used: The standard elicitation paragraph (all speakers read the same text)

All clips use the Stella passage: "Please call Stella. Ask her to bring these things with her from the store: six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."
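
For scoring (Step 2 below), the reference needs to be normalized the same way as your transcripts. One way to define it, lowercased with punctuation stripped (the REFERENCE name is our convention, not part of any SDK):

# Stella passage, lowercased and punctuation-stripped for WER scoring
REFERENCE = (
    "please call stella ask her to bring these things with her from the store "
    "six spoons of fresh snow peas five thick slabs of blue cheese and maybe a "
    "snack for her brother bob we also need a small plastic snake and a big toy "
    "frog for the kids she can scoop these things into three red bags and we "
    "will go meet her wednesday at the train station"
)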

Clip selection (45 clips)

  • Japanese L1 speakers: 15 clips
  • Korean L1 speakers: 15 clips
  • Mandarin L1 speakers: 15 clips

Selection criteria: native language confirmed, no background noise, random draw (no cherry-picking), variety of accent strengths.


How to run it

Step 1: Transcribe all 45 clips with your vendor

AssemblyAI example:

import assemblyai as aai

aai.settings.api_key = "YOUR_KEY"
transcriber = aai.Transcriber()
result = transcriber.transcribe("japanese1.mp3")
print(result.text, result.confidence)  # transcript text and overall confidence

Deepgram example:

from deepgram import DeepgramClient, PrerecordedOptions

dg = DeepgramClient("YOUR_KEY")
with open("japanese1.mp3", "rb") as f:
    response = dg.listen.rest.v("1").transcribe_file(
        {"buffer": f.read()}, PrerecordedOptions(model="nova-3", smart_format=True))
alt = response.results.channels[0].alternatives[0]
print(alt.transcript, alt.confidence)

Whisper example:

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("japanese1.mp3")
print(result["text"], result["language"])  # check the detected language!
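
Whichever vendor you pick, collect the transcripts into one dict keyed by clip filename so Step 2 can iterate over them. A minimal batching sketch along the Whisper lines above (the clips/ directory layout and the results/languages names are our assumptions):

from pathlib import Path

import whisper

model = whisper.load_model("large-v3")
results = {}    # clip filename -> transcript text
languages = {}  # clip filename -> detected language code
for path in sorted(Path("clips").glob("*.mp3")):  # assumed layout: clips/japanese1.mp3, ...
    out = model.transcribe(str(path))
    results[path.name] = out["text"]
    languages[path.name] = out["language"]  # kept for the language check in Step 3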

Step 2: Compute WER per clip

from jiwer import wer

reference = "please call stella ask her..."  # truncated; use the full normalized passage above
wer_results = {}  # clip filename -> WER, reused in Steps 3-4
for clip, transcript in results.items():  # results dict from Step 1
    w = wer(reference, transcript.lower())
    wer_results[clip] = w
    print(f"{clip}: WER={w:.3f}")

Step 3: Flag catastrophic failures

A catastrophic failure = WER >= 1.0, OR output language != English. WER can reach or exceed 1.0 because insertions count as errors: a hallucinated or wrong-language transcript shares essentially no words with the reference.

# WER criterion, plus the language criterion via the languages dict from Step 1
# (defaults to "en" if your vendor doesn't report a detected language)
catastrophic = [clip for clip, w in wer_results.items()
                if w >= 1.0 or languages.get(clip, "en") != "en"]
failure_rate = len(catastrophic) / 45
print(f"Catastrophic failure rate: {failure_rate:.1%}")

Step 4: Check confidence on failures

Key question: does confidence drop near zero on failures, or stay elevated (0.5-0.8)? If confidence stays above 0.5 on a WER=1.0 transcript, your quality gate will not catch it.
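
A minimal sketch of that check, assuming you also stored each clip's confidence in a confidences dict during Step 1 (e.g., AssemblyAI's result.confidence); the 0.5 threshold below is illustrative:

GATE_THRESHOLD = 0.5  # illustrative; substitute your pipeline's actual gate value
for clip in catastrophic:
    conf = confidences[clip]
    status = "MISSED BY GATE" if conf >= GATE_THRESHOLD else "caught by gate"
    print(f"{clip}: confidence={conf:.2f} -> {status}")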


What to look for

  • 0% catastrophic failure rate — Low tail risk. Proceed to calibration analysis.
  • 5-10% catastrophic failure rate — Meaningful. 1 in 10-20 East Asian users gets garbage output.
  • 15-20% catastrophic failure rate — High risk. Do not deploy without language output validation.
  • Failures with confidence > 0.5 — Gate will not catch — silent production exposure.
  • Failures with confidence < 0.3 — Gate can catch — detectable, manageable.
  • Mandarin failure rate differs from JA/KO — Expected. Report per-language (see the sketch below).
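
A minimal per-language breakdown, assuming clip filenames start with the speaker's L1 as in the examples above (japanese1.mp3, korean2.mp3, ...):

import re
from collections import defaultdict

totals, fails = defaultdict(int), defaultdict(int)
for clip in wer_results:
    l1 = re.match(r"[a-z]+", clip).group()  # "japanese1.mp3" -> "japanese"
    totals[l1] += 1
    fails[l1] += clip in catastrophic
for l1 in sorted(totals):
    print(f"{l1}: {fails[l1]}/{totals[l1]} catastrophic ({fails[l1] / totals[l1]:.0%})")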

Our baseline results (March 2026)

Run on AssemblyAI Universal-2 (auto-detect mode):

  • Japanese (15 clips): 20% catastrophic failure rate, avg confidence on failures 0.615
  • Korean (15 clips): 20% catastrophic failure rate, avg confidence on failures 0.665
  • Mandarin (15 clips): 0% catastrophic failure rate

Mean WER across 45 clips (excluding catastrophic): 6.8%. Deepgram Nova-3 and Whisper large-v3 returned 0 catastrophic failures on the same clips.


Expected time to run

  • Download 45 clips from Speech Accent Archive: ~30 min
  • Transcribe 45 clips (API calls): ~15-20 min
  • Score + analyze: ~30 min
  • Total: ~1 hour

Questions? Reply to the message that sent you this — happy to walk through setup, share raw JSON from our runs, or help interpret results for your vendor.

Agora — vendor eval for AI buyers