TheWhisper-Large-V3-Turbo
Model Summary
TheWhisper-Large-V3-Turbo is a fine-tuned, high-performance variant of OpenAI’s Whisper Large V3 model, optimized by TheStage AI for real-time, low-latency, low-power automatic speech recognition (ASR) across multiple platforms, including NVIDIA GPUs and Apple Silicon (CoreML).
It provides streaming transcription, word timestamps, and scalable performance for use cases like real-time captioning, meetings, and on-device voice interfaces.
📊 Benchmarks
For quality benchmarks, we used the Open ASR Leaderboard and its multilingual benchmarks.
For comprehensive performance and quality benchmarks, see TheWhisper.
To reproduce the Open ASR benchmarks, follow the instructions below:
- Install required packages:
git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper
pip install 'thestage-elastic-models[nvidia]' --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple --extra-index-url https://pypi.nvidia.com --extra-index-url https://pypi.org/simple
pip install '.[nvidia]'
pip install -r benchmark/requirements.txt
- Then generate an access token in your profile on the TheStage AI Platform and run the following command:
pip install thestage && thestage config set --api-token <YOUR_API_TOKEN>
- Run evaluation:
python benchmark/run_evaluation.py --model_name TheStageAI/thewhisper-large-v3-turbo --task open_asr --batch_size 64
Use --task multilingual_open_asr for multilingual evaluation.
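To run both evaluations in one go, the command above can be wrapped in a small Python helper. This is a minimal sketch: the names `build_eval_command` and `run_all` are illustrative, only the flags shown above (`--model_name`, `--task`, `--batch_size`) are used, and it assumes the script is launched from the root of the cloned TheWhisper repository.

```python
import subprocess
import sys

MODEL_NAME = "TheStageAI/thewhisper-large-v3-turbo"
TASKS = ("open_asr", "multilingual_open_asr")


def build_eval_command(task, batch_size=64):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return [
        sys.executable, "benchmark/run_evaluation.py",
        "--model_name", MODEL_NAME,
        "--task", task,
        "--batch_size", str(batch_size),
    ]


def run_all(batch_size=64):
    """Run the English and multilingual evaluations one after another."""
    for task in TASKS:
        # check=True stops at the first failing run instead of continuing.
        subprocess.run(build_eval_command(task, batch_size), check=True)
```

Calling `run_all()` launches the two runs sequentially; lower `batch_size` if the GPU runs out of memory.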
Hugging Face Open-ASR-Leaderboard (English, WER %)
| Dataset | TheWhisper | openai/whisper-large-v3-turbo | nvidia/parakeet-tdt-0.6b-v3 | ibm-granite/granite-speech-3.3-2b |
|---|---|---|---|---|
| librispeech_clean_test | 1.73 | 2.1 | 1.92 | 1.53 |
| librispeech_other_test | 3.69 | 4.24 | 3.59 | 3.26 |
| spgispeech_test | 1.89 | 2.97 | 3.98 | 3.87 |
| tedlium_test | 3.34 | 3.57 | 2.8 | 3.57 |
| voxpopuli_test | 6.52 | 11.87 | 6.09 | 5.93 |
| gigaspeech_test | 9.58 | 10.14 | 9.57 | 10.69 |
| earnings22_test | 11.01 | 11.63 | 11.19 | 10.25 |
| ami_test | 9.52 | 16.13 | 11.39 | 8.9 |
| Mean | 5.91 | 7.83 | 6.32 | 6.00 |
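The Mean row matches the unweighted average of the eight per-dataset values. The snippet below is a quick check using the numbers copied from the table; the variable names are illustrative.

```python
from statistics import mean

# WER (%) per dataset, copied from the table above, in row order.
wer = {
    "TheWhisper": [1.73, 3.69, 1.89, 3.34, 6.52, 9.58, 11.01, 9.52],
    "openai/whisper-large-v3-turbo": [2.1, 4.24, 2.97, 3.57, 11.87, 10.14, 11.63, 16.13],
    "nvidia/parakeet-tdt-0.6b-v3": [1.92, 3.59, 3.98, 2.8, 6.09, 9.57, 11.19, 11.39],
    "ibm-granite/granite-speech-3.3-2b": [1.53, 3.26, 3.87, 3.57, 5.93, 10.69, 10.25, 8.9],
}

# Unweighted mean per model, rounded to two decimals like the table.
mean_wer = {model: round(mean(values), 2) for model, values in wer.items()}
for model, value in mean_wer.items():
    print(f"{model}: {value:.2f}")
```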
Multilingual Results
The table below presents mean WER values for each language, averaged across three benchmark datasets: FLEURS [8], MLS [9], and Common Voice 23.
| Language | TheWhisper | openai/whisper-large-v3-turbo | nvidia/parakeet-tdt-0.6b-v3 | nvidia/canary-1b-v2 |
|---|---|---|---|---|
| German | 4.15 | 4.91 | 5.04 | 4.96 |
| French | 5.08 | 7.97 | 5.39 | 4.86 |
| Italian | 4.50 | 6.40 | 5.59 | 5.66 |
| Spanish | 3.14 | 3.94 | 3.75 | 3.22 |
| Portuguese | 4.07 | 5.97 | 5.41 | 6.23 |
| Indonesian | 5.75 | 6.98 | - | - |
| Russian | 5.55 | 4.42 | 5.51 | - |
| Arabic | 9.31 | 10.57 | - | - |
| Hindi | 9.06 | 19.25 | - | - |
| English | 4.66 | 4.8 | 4.85 | 4.7 |
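For reference, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below is a plain-Python illustration and not the benchmark's own implementation; evaluation pipelines typically normalize text first (casing, punctuation), which is omitted here. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion (second "the"), 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Multiply the result by 100 to compare it with the percentages in the tables.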
Multilingual Open-ASR-Leaderboard
| Model | Mean WER |
|---|---|
| TheWhisper | 4.30 |
| microsoft/Phi-4-multimodal-instruct | 4.60 |
| nvidia/canary-1b-v2 | 4.89 |
| nvidia/parakeet-tdt-0.6b-v3 | 5.05 |
| openai/whisper-large-v3-turbo | 5.44 |
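To put the gaps in context, the relative WER reduction of TheWhisper against each other model can be derived from the table. A small sketch with values copied from above; `relative_reduction` is an illustrative helper.

```python
# Mean WER (%) copied from the table above.
mean_wer = {
    "TheWhisper": 4.30,
    "microsoft/Phi-4-multimodal-instruct": 4.60,
    "nvidia/canary-1b-v2": 4.89,
    "nvidia/parakeet-tdt-0.6b-v3": 5.05,
    "openai/whisper-large-v3-turbo": 5.44,
}


def relative_reduction(baseline, candidate):
    """Relative WER reduction of candidate versus baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


ours = mean_wer["TheWhisper"]
for model, value in mean_wer.items():
    if model != "TheWhisper":
        print(f"vs {model}: {relative_reduction(value, ours):.1f}% lower WER")
```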
Noisy Audio Evaluation:
We evaluate robustness to background noise by testing across different Signal-to-Noise Ratios (SNR) using noise samples from the MUSAN dataset [6]:
| SNR Level (dB) | TheWhisper | nvidia/parakeet-tdt-0.6b-v3 |
|---|---|---|
| Clean | 5.91 | 6.34 |
| 10 | 6.99 | 7.12 |
| 5 | 8.20 | 8.23 |
| 0 | 11.10 | 11.66 |
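A common way to build such a test set is to scale a noise clip so that the speech-to-noise power ratio hits the target SNR, then add it to the clean signal. The sketch below illustrates that idea in plain Python on lists of samples; it is an assumption about the general procedure, not the exact script used for the numbers above, and `mix_at_snr` is a hypothetical helper.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so 10*log10(P_speech / P_noise) equals snr_db."""
    if not speech:
        raise ValueError("speech must not be empty")
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[: len(speech)]

    # Average power of each signal.
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # Target noise power is P_speech / 10^(snr_db / 10); amplitude scales
    # with the square root of the power ratio.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower `snr_db` values produce louder noise; `snr_db=0` mixes speech and noise at equal power.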
Quick start
Installation
Clone the repository
git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper
Install for Apple
pip install '.[apple]'
Install for Nvidia
pip install '.[nvidia]'
Install for Nvidia with TheStage AI optimized engines
pip install '.[nvidia]'
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install thestage
# additional dependencies
pip install flash_attn==2.8.2 --no-build-isolation
Then generate an access token in your profile on the TheStage AI Platform and run the following command:
thestage config set --api-token <YOUR_API_TOKEN>
Apple Usage
import torch
from thestage_speechkit.apple import ASRPipeline
model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # optimized model with ANNA
    model_size='S',
    chunk_length_s=10
)
# inference
result = model(
    "path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)
print(result["text"])
Apple Usage with Streaming
from thestage_speechkit.streaming import StreamingPipeline, MicStream, FileStream, StdoutStream
streaming_pipe = StreamingPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # Optimized model by ANNA
    model_size='S',
    # Window length
    chunk_length_s=15,
    platform='apple',
    language='en'
)
# set stride in seconds
mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()
while True:
    chunk = mic_stream.next_chunk()
    if chunk:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.rewrite(approved_text, assumption)
    else:
        break
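To get a feel for the window and stride parameters above, they can be converted to sample counts. The sketch assumes 16 kHz mono audio, which is what Whisper models expect as input; whether `MicStream` captures at that rate is an assumption, and the helper `seconds_to_samples` is illustrative.

```python
SAMPLE_RATE_HZ = 16_000  # Whisper input rate; assumed for the stream as well


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


window = seconds_to_samples(15)   # chunk_length_s=15
step = seconds_to_samples(0.5)    # step_size_s=0.5
print(f"window: {window} samples, step: {step} samples")
print(f"steps per window length: {window // step}")
```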
Nvidia Usage (HuggingFace Transformers)
import torch
from thestage_speechkit.nvidia import ASRPipeline
model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    batch_size=32,
    device='cuda'
)
# inference
result = model(
    "path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)
print(result["text"])
Nvidia Usage (TheStage AI engines)
import torch
from thestage_speechkit.nvidia import ASRPipeline
model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # optimized TheStage AI engines
    model_size='S',
    batch_size=32,
    device='cuda'
)
# inference
result = model(
    "path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)
print(result["text"])
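Since only a fixed set of chunk lengths is listed as allowed, it can help to validate the value before building the pipeline. A minimal sketch: `validate_chunk_length` is a hypothetical helper, not part of `thestage_speechkit`, and the allowed set is taken from the comments in the examples above.

```python
ALLOWED_CHUNK_LENGTHS_S = (10, 15, 20, 30)


def validate_chunk_length(chunk_length_s):
    """Return chunk_length_s if it is one of the allowed values, else raise."""
    if chunk_length_s not in ALLOWED_CHUNK_LENGTHS_S:
        allowed = ", ".join(f"{v}s" for v in ALLOWED_CHUNK_LENGTHS_S)
        raise ValueError(
            f"chunk_length_s={chunk_length_s} is not supported; allowed: {allowed}"
        )
    return chunk_length_s


chunk_length_s = validate_chunk_length(10)
# Pass the validated value both to ASRPipeline(..., chunk_length_s=chunk_length_s)
# and to the inference call, as in the examples above.
```

Using one validated variable for both places also keeps the constructor and the inference call from drifting apart.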
Model Details
- Developed by: TheStage AI
- Model type: Speech-to-Text (Automatic Speech Recognition)
- Languages: Multilingual (same as Whisper Large V3: ~99 languages supported)
- License: MIT
- Finetuned from: openai/whisper-large-v3-turbo
- Frameworks: PyTorch, CoreML
- Supported Platforms:
  - NVIDIA GPUs (CUDA 11.8+)
  - Apple Silicon (M1–M4, macOS 15+)
Links
- Repository: https://github.com/TheStageAI/TheWhisper
- Demo / Docs: https://app.thestage.ai
- Weights: https://huggingface.co/TheStageAI/thewhisper-large-v3-turbo