TheWhisper-Large-V3-Turbo

Model Summary

TheWhisper-Large-V3-Turbo is a fine-tuned, high-performance variant of OpenAI’s Whisper Large V3 model — optimized by TheStage AI for real-time, low-latency, and low-power speech-to-text (ASR) inference across multiple platforms, including NVIDIA GPUs and Apple Silicon (CoreML).

It provides streaming transcription, word timestamps, and scalable performance for use cases like real-time captioning, meetings, and on-device voice interfaces.

📊 Benchmarks

For quality benchmarks, we used the Open ASR Leaderboard, including its multilingual track.

For comprehensive performance and quality benchmarks, see the TheWhisper repository.

To reproduce the Open ASR benchmarks, use the instructions below:

  1. Install the required packages:

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper
pip install 'thestage-elastic-models[nvidia]' --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple --extra-index-url https://pypi.nvidia.com --extra-index-url https://pypi.org/simple
pip install .[nvidia]
pip install -r benchmark/requirements.txt

  2. Generate an access token on the TheStage AI Platform in your profile and execute the following command:

pip install thestage && thestage config set --api-token <YOUR_API_TOKEN>

  3. Run the evaluation:

python benchmark/run_evaluation.py --model_name TheStageAI/thewhisper-large-v3-turbo --task open_asr --batch_size 64

Use --task multilingual_open_asr for multilingual evaluation.

[Figure: comparison of vanilla Whisper vs. TheStage AI Whisper]

Huggingface Open-ASR-Leaderboard (English)

| Dataset | TheWhisper | openai/whisper-large-v3-turbo | nvidia/parakeet-tdt-0.6b-v3 | ibm-granite/granite-speech-3.3-2b |
|---|---|---|---|---|
| librispeech_clean_test | 1.73 | 2.10 | 1.92 | 1.53 |
| librispeech_other_test | 3.69 | 4.24 | 3.59 | 3.26 |
| spgispeech_test | 1.89 | 2.97 | 3.98 | 3.87 |
| tedlium_test | 3.34 | 3.57 | 2.80 | 3.57 |
| voxpopuli_test | 6.52 | 11.87 | 6.09 | 5.93 |
| gigaspeech_test | 9.58 | 10.14 | 9.57 | 10.69 |
| earnings22_test | 11.01 | 11.63 | 11.19 | 10.25 |
| ami_test | 9.52 | 16.13 | 11.39 | 8.90 |
| Mean | 5.91 | 7.83 | 6.32 | 6.00 |
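The numbers above are word error rates (WER, %): the word-level edit distance between the model transcript and the reference, divided by the reference length. A minimal sketch of the metric (illustrative only; the leaderboard harness additionally applies text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via two-row dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```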

Multilingual Results

The table below presents mean WER values for each language, averaged across three benchmark datasets: FLEURS [8], MLS [9], and Common Voice 23.

| Language | TheWhisper | openai/whisper-large-v3-turbo | nvidia/parakeet-tdt-0.6b-v3 | nvidia/canary-1b-v2 |
|---|---|---|---|---|
| German | 4.15 | 4.91 | 5.04 | 4.96 |
| French | 5.08 | 7.97 | 5.39 | 4.86 |
| Italian | 4.50 | 6.40 | 5.59 | 5.66 |
| Spanish | 3.14 | 3.94 | 3.75 | 3.22 |
| Portuguese | 4.07 | 5.97 | 5.41 | 6.23 |
| Indonesian | 5.75 | 6.98 | - | - |
| Russian | 5.55 | 4.42 | 5.51 | - |
| Arabic | 9.31 | 10.57 | - | - |
| Hindi | 9.06 | 19.25 | - | - |
| English | 4.66 | 4.80 | 4.85 | 4.70 |

Multilingual Open-ASR-Leaderboard

| Model | Mean WER |
|---|---|
| TheWhisper | 4.30 |
| microsoft/Phi-4-multimodal-instruct | 4.60 |
| nvidia/canary-1b-v2 | 4.89 |
| nvidia/parakeet-tdt-0.6b-v3 | 5.05 |
| openai/whisper-large-v3-turbo | 5.44 |

Noisy Audio Evaluation

We evaluate robustness to background noise by testing across different Signal-to-Noise Ratios (SNR) using noise samples from the MUSAN dataset [6]:

| SNR Level (dB) | TheWhisper | nvidia/parakeet-tdt-0.6b-v3 |
|---|---|---|
| Clean | 5.91 | 6.34 |
| 10 | 6.99 | 7.12 |
| 5 | 8.20 | 8.23 |
| 0 | 11.10 | 11.66 |
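Noisy test sets like these are commonly built by scaling a noise clip so that 10·log10(P_speech / P_noise) hits the target SNR, then mixing it into the speech. A self-contained sketch of that mixing step (illustrative; loading MUSAN clips and any resampling are omitted):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix.

    `speech` and `noise` are equal-length sequences of float samples.
    """
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Target noise power is P_speech / 10^(SNR/10); solve for the amplitude scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```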

Quick start


Installation

Clone the repository

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

Install for Apple

pip install .[apple]

Install for Nvidia

pip install .[nvidia]

Install for Nvidia with TheStage AI optimized engines

pip install .[nvidia]
pip install thestage-elastic-models[nvidia] --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install thestage
# additional dependencies
pip install flash_attn==2.8.2 --no-build-isolation

Then generate an access token on the TheStage AI Platform in your profile and execute the following command:

thestage config set --api-token <YOUR_API_TOKEN>

Apple Usage

import torch
from thestage_speechkit.apple import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # optimized model with ANNA
    model_size='S',
    chunk_length_s=10
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])

Apple Usage with Streaming

from thestage_speechkit.streaming import StreamingPipeline, MicStream, FileStream, StdoutStream

streaming_pipe = StreamingPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # Optimized model by ANNA
    model_size='S',
    # Window length
    chunk_length_s=15,
    platform='apple',
    language='en'
)

# set stride in seconds (0.5 s = 500 ms)
mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()

while True:
    chunk = mic_stream.next_chunk()
    if chunk:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.rewrite(approved_text, assumption)
    else:
        break
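Each step returns approved_text (a committed prefix that will not change) and assumption (a provisional tail that may still be revised as more audio arrives). One common way to implement such a split, shown here as an illustration of the idea rather than the library's actual internals, is to commit only the prefix on which successive hypotheses agree:

```python
def split_stable(prev_hyp: str, new_hyp: str) -> tuple[str, str]:
    """Commit the word-level common prefix of two successive hypotheses;
    the disagreeing tail of the newer hypothesis stays provisional."""
    prev_words, new_words = prev_hyp.split(), new_hyp.split()
    stable = []
    for a, b in zip(prev_words, new_words):
        if a != b:
            break
        stable.append(a)
    approved = " ".join(stable)
    assumption = " ".join(new_words[len(stable):])
    return approved, assumption
```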

Nvidia Usage (HuggingFace Transformers)

import torch
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # batch size for chunked inference
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])
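Long files are processed in fixed chunk_length_s windows, typically slid with a small overlap so that words cut at a window boundary can be recovered when chunk transcripts are merged. A sketch of computing such window boundaries (illustrative only; the pipeline handles chunking internally, and the overlap value here is hypothetical):

```python
def chunk_bounds(duration_s: float, chunk_length_s: float = 10.0,
                 overlap_s: float = 1.0) -> list[tuple[float, float]]:
    """Fixed-length windows over [0, duration_s) with `overlap_s` of overlap
    between consecutive windows."""
    bounds, start = [], 0.0
    step = chunk_length_s - overlap_s
    while start < duration_s:
        bounds.append((start, min(start + chunk_length_s, duration_s)))
        start += step
    return bounds
```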

Nvidia Usage (TheStage AI engines)

import torch
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # optimized TheStage AI engines
    model_size='S',
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])

Model Details


  • Developed by: TheStage AI
  • Model type: Speech-to-Text (Automatic Speech Recognition)
  • Languages: Multilingual (same as Whisper Large V3: ~99 languages supported)
  • License: MIT
  • Finetuned from: openai/whisper-large-v3-turbo
  • Frameworks: PyTorch, CoreML
  • Supported Platforms:
    • NVIDIA GPUs (CUDA 11.8+)
    • Apple Silicon (M1–M4, macOS 15+)

Model size: 0.8B parameters (F32, Safetensors)