Agora East Asian ASR Test Set — README

2026-03-19

A 45-clip benchmark test set for detecting catastrophic ASR failure on East Asian-accented English. Clips cover Japanese, Korean, and Mandarin L1 speakers reading the same English passage.


Clip sourcing: Speech Accent Archive

Source: http://accent.gmu.edu/
License: Creative Commons Attribution 4.0 International
Passage used: The standard elicitation paragraph (all speakers read the same text)

All clips use the Stella passage: "Please call Stella. Ask her to bring these things with her from the store: six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."
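
For scoring (Step 2 below), the reference needs to be normalized the same way as your transcripts. One way to define it, lowercased with punctuation stripped (the REFERENCE name is our convention, not part of any SDK):

# Stella passage, lowercased and punctuation-stripped for WER scoring
REFERENCE = (
    "please call stella ask her to bring these things with her from the store "
    "six spoons of fresh snow peas five thick slabs of blue cheese and maybe a "
    "snack for her brother bob we also need a small plastic snake and a big toy "
    "frog for the kids she can scoop these things into three red bags and we "
    "will go meet her wednesday at the train station"
)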

Clip selection (45 clips)

  • Japanese L1 speakers: 15 clips
  • Korean L1 speakers: 15 clips
  • Mandarin L1 speakers: 15 clips

Selection criteria: native language confirmed, no background noise, random draw (no cherry-picking), variety of accent strengths.


How to run it

Step 1: Transcribe all 45 clips with your vendor

AssemblyAI example:

import assemblyai as aai

aai.settings.api_key = "YOUR_KEY"
transcriber = aai.Transcriber()
result = transcriber.transcribe("japanese1.mp3")
print(result.text, result.confidence)  # transcript text and overall confidence

Deepgram example:

from deepgram import DeepgramClient, PrerecordedOptions

dg = DeepgramClient("YOUR_KEY")
with open("japanese1.mp3", "rb") as f:
    response = dg.listen.rest.v("1").transcribe_file(
        {"buffer": f.read()}, PrerecordedOptions(model="nova-3", smart_format=True))
alt = response.results.channels[0].alternatives[0]
print(alt.transcript, alt.confidence)

Whisper example:

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("japanese1.mp3")
print(result["text"], result["language"])  # check the detected language!
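
Whichever vendor you pick, collect the transcripts into one dict keyed by clip filename so Step 2 can iterate over them. A minimal batching sketch along the Whisper lines above (the clips/ directory layout and the results/languages names are our assumptions):

from pathlib import Path

import whisper

model = whisper.load_model("large-v3")
results = {}    # clip filename -> transcript text
languages = {}  # clip filename -> detected language code
for path in sorted(Path("clips").glob("*.mp3")):  # assumed layout: clips/japanese1.mp3, ...
    out = model.transcribe(str(path))
    results[path.name] = out["text"]
    languages[path.name] = out["language"]  # kept for the language check in Step 3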

Step 2: Compute WER per clip

from jiwer import wer

reference = "please call stella ask her..."  # truncated; use the full normalized passage above
wer_results = {}  # clip filename -> WER, reused in Steps 3-4
for clip, transcript in results.items():  # results dict from Step 1
    w = wer(reference, transcript.lower())
    wer_results[clip] = w
    print(f"{clip}: WER={w:.3f}")

Step 3: Flag catastrophic failures

A catastrophic failure = WER >= 1.0, OR output language != English. WER can reach or exceed 1.0 because insertions count as errors: a hallucinated or wrong-language transcript shares essentially no words with the reference.

# WER criterion, plus the language criterion via the languages dict from Step 1
# (defaults to "en" if your vendor doesn't report a detected language)
catastrophic = [clip for clip, w in wer_results.items()
                if w >= 1.0 or languages.get(clip, "en") != "en"]
failure_rate = len(catastrophic) / 45
print(f"Catastrophic failure rate: {failure_rate:.1%}")

Step 4: Check confidence on failures

Key question: does confidence drop near zero on failures, or stay elevated (0.5-0.8)? If confidence stays above 0.5 on a WER=1.0 transcript, your quality gate will not catch it.
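
A minimal sketch of that check, assuming you also stored each clip's confidence in a confidences dict during Step 1 (e.g., AssemblyAI's result.confidence); the 0.5 threshold below is illustrative:

GATE_THRESHOLD = 0.5  # illustrative; substitute your pipeline's actual gate value
for clip in catastrophic:
    conf = confidences[clip]
    status = "MISSED BY GATE" if conf >= GATE_THRESHOLD else "caught by gate"
    print(f"{clip}: confidence={conf:.2f} -> {status}")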


What to look for

  • 0% catastrophic failure rate — Low tail risk. Proceed to calibration analysis.
  • 5-10% catastrophic failure rate — Meaningful. 1 in 10-20 East Asian users gets garbage output.
  • 15-20% catastrophic failure rate — High risk. Do not deploy without language output validation.
  • Failures with confidence > 0.5 — Gate will not catch — silent production exposure.
  • Failures with confidence < 0.3 — Gate can catch — detectable, manageable.
  • Mandarin failure rate differs from JA/KO — Expected. Report per-language (see the sketch below).
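
A minimal per-language breakdown, assuming clip filenames start with the speaker's L1 as in the examples above (japanese1.mp3, korean2.mp3, ...):

import re
from collections import defaultdict

totals, fails = defaultdict(int), defaultdict(int)
for clip in wer_results:
    l1 = re.match(r"[a-z]+", clip).group()  # "japanese1.mp3" -> "japanese"
    totals[l1] += 1
    fails[l1] += clip in catastrophic
for l1 in sorted(totals):
    print(f"{l1}: {fails[l1]}/{totals[l1]} catastrophic ({fails[l1] / totals[l1]:.0%})")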

Our baseline results (March 2026)

Run on AssemblyAI Universal-2 (auto-detect mode):

  • Japanese (15 clips): 20% catastrophic failure rate, avg confidence on failures 0.615
  • Korean (15 clips): 20% catastrophic failure rate, avg confidence on failures 0.665
  • Mandarin (15 clips): 0% catastrophic failure rate

Mean WER across 45 clips (excluding catastrophic): 6.8%. Deepgram Nova-3 and Whisper large-v3 returned 0 catastrophic failures on the same clips.


Expected time to run

  • Download 45 clips from Speech Accent Archive: ~30 min
  • Transcribe 45 clips (API calls): ~15-20 min
  • Score + analyze: ~30 min
  • Total: ~1 hour

Questions? Reply to the message that sent you this — happy to walk through setup, share raw JSON from our runs, or help interpret results for your vendor.

Agora — vendor eval for AI buyers