AssemblyAI Silently Outputs Japanese and Korean on English Inputs: What We Found Running Real Calibration Tests
2026-03-19
Most ASR comparisons hand you a WER number and call it a day. WER — Word Error Rate — tells you how many words the model got wrong, on average, across a benchmark dataset. It's a useful number. It's also the number that hides the most important failures.
We ran a different kind of evaluation. Instead of just measuring accuracy, we measured calibration — how well each vendor's confidence scores match their actual accuracy. A well-calibrated model that says "I'm 90% confident" should be right about 90% of the time. Most aren't.
The technical term for this is ECE: Expected Calibration Error. Lower is better. An ECE of 0.0 means the confidence scores are perfectly accurate. An ECE of 0.10 means the model is off by about 10 percentage points on average — sometimes overconfident, sometimes underconfident, and your pipeline is making routing decisions based on a signal that's systematically wrong.
We used ECE because it's the metric that actually predicts production behavior. If your pipeline routes low-confidence transcripts to human review, calibration quality determines whether that routing is trustworthy. If it isn't, you're either over-routing (paying for review you didn't need) or under-routing (letting bad transcripts reach users untouched).
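To make the metric concrete, here is a minimal sketch of the standard equal-width binned ECE computation. The post doesn't specify the eval pipeline's exact binning scheme; ten equal-width bins is a common convention and an assumption here:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean |accuracy - confidence| per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each prediction falls in exactly one bin; bin 0 also takes confidence 0.0.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that says 0.9 and is right 90% of the time scores 0.0; a model that says 1.0 but is right half the time scores 0.5.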
Here's what we found when we ran this across three major vendors on East Asian accented speech.
The Finding That Changes the Conversation
AssemblyAI Universal-2 silently outputs Japanese or Korean text — instead of English — for approximately 17% of heavy-accent Japanese and Korean speakers.
No error flag. No empty output. No low-confidence warning that would trigger your routing logic. The transcript looks valid: formatted, coherent, confident. It's just in the wrong language.
This is not a subtle calibration drift. It's a hard binary: 83% of heavy-accent JA/KO speakers get excellent results, ECE 0.018–0.021, competitive with any vendor. The other 17% get fluent, wrong-language output that your quality system will never catch — because the transcript looks correct, and the confidence score (0.42–0.755) looks plausible.
Mandarin is not affected. AssemblyAI's Mandarin calibration (ECE 0.011) is the best of the three vendors we tested.
Three Vendors, Three Failure Modes
| Vendor | Model | Overall ECE | East Asian ECE | Failure Mode |
|--------|-------|-------------|----------------|--------------|
| Whisper | large-v3 | 0.0198 | 0.107 | Underconfident — safe but expensive |
| AssemblyAI | Universal-2 | 0.0205 | 0.085 | Bimodal: 83% excellent / 17% wrong-language |
| Deepgram | Nova-3 | 0.0074 | 0.016 | Tail overconfidence — rare silent failures |
All three vendors achieve similar overall calibration on clean audio. The differentiation shows up on accented speech — and not in the way aggregate benchmarks suggest.
Vendor by Vendor
AssemblyAI Universal-2: The Wrong-Language Problem
The failure pattern is bimodal. For the majority of Japanese and Korean speakers, AssemblyAI performs excellently — better than Deepgram on per-clip accuracy, with well-calibrated confidence. For approximately 17% (n=30 per language), the model appears to detect a heavy East Asian accent and switches languages entirely. The output is a real Japanese or Korean transcription of the audio — not garbled, not low-confidence, just wrong.
Confidence scores on failure clips range 0.42–0.755. A standard routing threshold of 0.4 passes all of these through without review.
Aggregate metrics don't show this. AssemblyAI's mean East Asian WER (16.5%) looks bad — but that figure is pulled up entirely by the failure clips (WER=1.0). The 83% of non-failure clips have WER around 3–4%, competitive with any vendor. Neither mean WER nor mean ECE tells you that 17% of your Japanese and Korean users receive a wrong-language transcript, silently.
Mitigation Option 1 — Language detection post-processing:
The most reliable defense. Detect the language of the output transcript after it's returned. Flag anything that isn't English on an English-input pipeline. CJK character set detection is sufficient and adds under 1ms per transcript:
```python
import re

CJK_PATTERN = re.compile(
    r'[\u3040-\u309F'   # Hiragana
    r'\u30A0-\u30FF'    # Katakana
    r'\u4E00-\u9FFF'    # CJK Unified Ideographs
    r'\uAC00-\uD7AF]'   # Korean Hangul
)

def is_wrong_language(transcript: str, threshold: float = 0.05) -> bool:
    """Return True if the transcript looks like CJK output on an English pipeline."""
    if not transcript:
        return False
    cjk_chars = len(CJK_PATTERN.findall(transcript))
    total_chars = len(transcript.replace(' ', ''))
    return (cjk_chars / total_chars) > threshold if total_chars > 0 else False

# Usage
result = assemblyai_client.transcribe(audio)
if is_wrong_language(result.text):
    # Route to fallback vendor or human review
    flag_for_review(audio_id, reason="wrong_language_output", vendor="assemblyai")
```
This pattern catches the AssemblyAI wrong-language failure mode reliably. It doesn't require retraining, threshold tuning, or vendor configuration changes — it's a post-processing layer you add to your pipeline regardless of which vendor you use.
Mitigation Option 2 — Confidence threshold routing:
Set the threshold to 0.85 or higher and route lower-confidence outputs to review. A lower cutoff (e.g., 0.7) misses failure clips in the 0.7–0.755 range; 0.85 covers the full 0.42–0.755 range but over-routes many correct transcripts. Less targeted than language detection.
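As a sketch, the rule is a one-liner. The threshold constant and function name here are ours, not part of any vendor SDK:

```python
REVIEW_THRESHOLD = 0.85  # above the 0.755 ceiling observed on failure clips

def route_by_confidence(confidence: float, threshold: float = REVIEW_THRESHOLD) -> str:
    """Send anything below the threshold to human review; pass the rest through."""
    return "human_review" if confidence < threshold else "auto_accept"
```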
Mitigation Option 3 — Vendor routing by accent:
For speakers where input audio suggests heavy East Asian accent, route to Deepgram or Whisper. Adds latency but eliminates the failure mode entirely.
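Accent detection on input audio is itself a modeling problem. A simpler variant that achieves the same end is to re-transcribe with a fallback vendor whenever the primary output comes back in CJK. The client objects below are hypothetical stand-ins for real vendor SDKs, included only to make the sketch runnable:

```python
import re

CJK = re.compile(r'[\u3040-\u30FF\u4E00-\u9FFF\uAC00-\uD7AF]')

class StubVendor:
    """Hypothetical stand-in for a vendor client with a .transcribe() method."""
    def __init__(self, text):
        self.text = text
    def transcribe(self, audio):
        return self  # result object exposing .text

def transcribe_with_fallback(audio, primary, fallback):
    """Use the primary vendor unless its output looks like CJK text."""
    result = primary.transcribe(audio)
    if CJK.search(result.text):
        result = fallback.transcribe(audio)
    return result
```

This trades a second transcription call (extra latency and cost on the 17% of affected clips) for eliminating the failure mode without any upfront accent classifier.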
If your user base includes Japanese or Korean speakers in an automated pipeline with no human review: this failure mode is not acceptable at 17%.
Whisper large-v3: The Safe Failure
Whisper's calibration problem is the opposite of AssemblyAI's. On East Asian speech, Whisper reports average confidence of 0.846 when its actual accuracy is 0.956. It thinks it's worse than it is.
This is costly — correct transcripts get flagged for unnecessary human review. But the key distinction: errors don't escape. Underconfidence over-routes. It doesn't under-protect.
ECE of 0.107 on East Asian speech is the highest of the three vendors. But ECE alone doesn't tell you which direction the model fails, and direction determines production risk. Whisper's failure mode is recoverable: tune your confidence threshold upward and the unnecessary review load drops. You can fix it without touching the model.
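Direction is just the sign of the mean confidence-accuracy gap, which ECE's absolute value throws away. A minimal check:

```python
def calibration_gap(confidences, correct):
    """Mean confidence minus mean accuracy.

    Negative = underconfident (Whisper's pattern); positive = overconfident.
    """
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy
```

Plugging in the article's Whisper numbers (mean confidence 0.846, accuracy 0.956) gives a gap of -0.11: underconfident, so errors stay on the safe side of the threshold.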
Whisper is not viable for real-time streaming — no native streaming API. For batch transcription where safety matters more than cost efficiency, it's the safest failure mode in this comparison.
Deepgram Nova-3: Tail Overconfidence
Earlier analysis (n=15 per language) showed Deepgram East Asian ECE of 0.081 — roughly 5× worse than American English. At n=170 East Asian clips, East Asian ECE is 0.016, statistically at parity with American English (0.016). The calibration gap was small-sample noise.
What's real: Deepgram has tail overconfidence on specific clips. The worst cases in our dataset: one Mandarin clip with WER=23.2% at confidence 0.939, one LibriSpeech clip with WER=22.2% at confidence 0.999. These failures are rare and not systematic — but high-stakes when they occur. High-confidence transcripts with 23% WER pass straight through any automated pipeline without review.
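In an eval set where reference transcripts exist, these tail cases can be surfaced with a simple filter. The clip schema and cutoff values here are our assumptions, not Agora's:

```python
def tail_overconfident_clips(clips, conf_floor=0.9, wer_ceiling=0.15):
    """Return eval clips where the model was highly confident but badly wrong.

    `clips`: list of dicts with 'wer' and 'confidence' keys (assumed schema).
    """
    return [c for c in clips
            if c["confidence"] >= conf_floor and c["wer"] > wer_ceiling]
```

Run against an eval set, this flags exactly the clips that would sail through a confidence-threshold gate in production.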
Deepgram's overall calibration profile is solid. The East Asian bias finding does not hold at scale. The tail risk is real but bounded.
Deepgram is also the only vendor of the three with a production-viable real-time streaming API. For live transcription use cases, it's currently the only option in this comparison.
The Framing That Actually Matters
Your ASR vendor doesn't just transcribe. It tells you how confident it is in that transcription, and your pipeline acts on that signal. If the confidence signal is wrong — in either direction — your downstream logic breaks.
Overconfidence means bad transcripts pass through automated pipelines untouched. Errors reach users. Your QA system never saw them.
Underconfidence means correct transcripts go to human review unnecessarily. Expensive, but controllable.
The wrong-language failure is a separate category. The confidence signal isn't wildly wrong — it's 0.5–0.75, which looks uncertain but plausible. The transcript isn't partially degraded — it's entirely in the wrong language. Standard confidence thresholds don't catch it. Standard WER monitoring doesn't show it until you look per-speaker.
Aggregate benchmarks hide all of this. That's the point.
Quick Reference: Which Vendor for Which Use Case
| Use case | Recommendation |
|----------|----------------|
| Real-time streaming (sales calls, live captioning) | Deepgram Nova-3 — only real-time option |
| Batch transcription, English speakers | Any vendor — calibration comparable |
| Batch + Japanese/Korean speakers, automated pipeline | Avoid AssemblyAI without language detection post-processing |
| Batch + Mandarin speakers | AssemblyAI performs excellently (ECE 0.011) |
| High-stakes compliance / legal / medical | Do not rely on confidence thresholds alone for any vendor |
| Cost-sensitive, high-volume, Whisper viable | Acceptable — budget for over-routing on accented speech |
What Agora Measures
Agora runs ECE head-to-heads with per-clip breakdowns, calibration direction analysis, language output validation, and worst-case examples — because mean metrics hide the failures that actually matter in production.
The wrong-language finding is one Agora detects automatically. Language output validation is built into the eval pipeline: if your ASR vendor returns CJK text on an English-input workload, it shows up.
Data: Agora ECE comparison — Deepgram Nova-3, Whisper large-v3, AssemblyAI Universal-2 — March 2026.
East Asian accent dataset: Speech Accent Archive (Japanese n=45, Korean n=65, Mandarin n=60).
LibriSpeech test-other used for clean English baseline.
All ECE figures normalized for output formatting.
Agora — AI vendor evaluation platform.