TheWhisper-Large-V3-Turbo

Model Summary

TheWhisper-Large-V3-Turbo is a fine-tuned, high-performance variant of OpenAI’s Whisper Large V3 model — optimized by TheStage AI for real-time, low-latency, and low-power speech-to-text (ASR) inference across multiple platforms, including NVIDIA GPUs and Apple Silicon (CoreML).

It provides streaming transcription, word timestamps, and scalable performance for use cases like real-time captioning, meetings, and on-device voice interfaces.

📊 Benchmarks

For quality benchmarks, we used the Open ASR Leaderboard, including its multilingual track.

For comprehensive performance and quality benchmarks, see the TheWhisper repository.

To reproduce the Open ASR benchmarks, use the instructions below:

  1. Install the required packages:

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper
pip install 'thestage-elastic-models[nvidia]' --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple --extra-index-url https://pypi.nvidia.com --extra-index-url https://pypi.org/simple
pip install .[nvidia]
pip install -r benchmark/requirements.txt

  2. Generate an access token on the TheStage AI Platform in your profile and execute the following command:

pip install thestage && thestage config set --api-token <YOUR_API_TOKEN>

  3. Run the evaluation:

python benchmark/run_evaluation.py --model_name TheStageAI/thewhisper-large-v3-turbo --task open_asr --batch_size 64

Use --task multilingual_open_asr for multilingual evaluation.

[Figure: comparison of vanilla Whisper vs. TheStage AI Whisper]

Huggingface Open-ASR-Leaderboard (English)

| Dataset | TheWhisper | openai/whisper-large-v3-turbo | nvidia/parakeet-tdt-0.6b-v3 | ibm-granite/granite-speech-3.3-2b |
|---|---|---|---|---|
| librispeech_clean_test | 1.73 | 2.10 | 1.92 | 1.53 |
| librispeech_other_test | 3.69 | 4.24 | 3.59 | 3.26 |
| spgispeech_test | 1.89 | 2.97 | 3.98 | 3.87 |
| tedlium_test | 3.34 | 3.57 | 2.80 | 3.57 |
| voxpopuli_test | 6.52 | 11.87 | 6.09 | 5.93 |
| gigaspeech_test | 9.58 | 10.14 | 9.57 | 10.69 |
| earnings22_test | 11.01 | 11.63 | 11.19 | 10.25 |
| ami_test | 9.52 | 16.13 | 11.39 | 8.90 |
| Mean | 5.91 | 7.83 | 6.32 | 6.00 |
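The numbers above are word error rates (WER, %): the word-level edit distance between the model transcript and the reference, divided by the reference length. A minimal sketch of the metric (illustrative only; the leaderboard harness additionally applies text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via two-row dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```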

Multilingual Results

The table below presents mean WER values for each language, averaged across three benchmark datasets: FLEURS [8], MLS [9], and Common Voice 23.

| Language | TheWhisper | openai/whisper-large-v3-turbo | nvidia/parakeet-tdt-0.6b-v3 | nvidia/canary-1b-v2 |
|---|---|---|---|---|
| German | 4.15 | 4.91 | 5.04 | 4.96 |
| French | 5.08 | 7.97 | 5.39 | 4.86 |
| Italian | 4.50 | 6.40 | 5.59 | 5.66 |
| Spanish | 3.14 | 3.94 | 3.75 | 3.22 |
| Portuguese | 4.07 | 5.97 | 5.41 | 6.23 |
| Indonesian | 5.75 | 6.98 | - | - |
| Russian | 5.55 | 4.42 | 5.51 | - |
| Arabic | 9.31 | 10.57 | - | - |
| Hindi | 9.06 | 19.25 | - | - |
| English | 4.66 | 4.80 | 4.85 | 4.70 |

Multilingual Open-ASR-Leaderboard

| Model | Mean WER |
|---|---|
| TheWhisper | 4.30 |
| microsoft/Phi-4-multimodal-instruct | 4.60 |
| nvidia/canary-1b-v2 | 4.89 |
| nvidia/parakeet-tdt-0.6b-v3 | 5.05 |
| openai/whisper-large-v3-turbo | 5.44 |

Noisy Audio Evaluation

We evaluate robustness to background noise by testing across different Signal-to-Noise Ratios (SNR) using noise samples from the MUSAN dataset [6]:

| SNR Level (dB) | TheWhisper | nvidia/parakeet-tdt-0.6b-v3 |
|---|---|---|
| Clean | 5.91 | 6.34 |
| 10 | 6.99 | 7.12 |
| 5 | 8.20 | 8.23 |
| 0 | 11.10 | 11.66 |
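Noisy test sets like these are commonly built by scaling a noise clip so that 10·log10(P_speech / P_noise) hits the target SNR, then mixing it into the speech. A self-contained sketch of that mixing step (illustrative; loading MUSAN clips and any resampling are omitted):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix.

    `speech` and `noise` are equal-length sequences of float samples.
    """
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Target noise power is P_speech / 10^(SNR/10); solve for the amplitude scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```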

Quick start


Installation

Clone the repository

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

Install for Apple

pip install .[apple]

Install for Nvidia

pip install .[nvidia]

Install for Nvidia with TheStage AI optimized engines

pip install .[nvidia]
pip install thestage-elastic-models[nvidia] --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install thestage
# additional dependencies
pip install flash_attn==2.8.2 --no-build-isolation

Then generate an access token on the TheStage AI Platform in your profile and execute the following command:

thestage config set --api-token <YOUR_API_TOKEN>

Apple Usage

import torch
from thestage_speechkit.apple import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # optimized model with ANNA
    model_size='S',
    chunk_length_s=10
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])

Apple Usage with Streaming

from thestage_speechkit.streaming import StreamingPipeline, MicStream, FileStream, StdoutStream

streaming_pipe = StreamingPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # Optimized model by ANNA
    model_size='S',
    # Window length
    chunk_length_s=15,
    platform='apple',
    language='en'
)

# set stride in seconds (0.5 s = 500 ms)
mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()

while True:
    chunk = mic_stream.next_chunk()
    if chunk:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.rewrite(approved_text, assumption)
    else:
        break
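Each step returns approved_text (a committed prefix that will not change) and assumption (a provisional tail that may still be revised as more audio arrives). One common way to implement such a split, shown here as an illustration of the idea rather than the library's actual internals, is to commit only the prefix on which successive hypotheses agree:

```python
def split_stable(prev_hyp: str, new_hyp: str) -> tuple[str, str]:
    """Commit the word-level common prefix of two successive hypotheses;
    the disagreeing tail of the newer hypothesis stays provisional."""
    prev_words, new_words = prev_hyp.split(), new_hyp.split()
    stable = []
    for a, b in zip(prev_words, new_words):
        if a != b:
            break
        stable.append(a)
    approved = " ".join(stable)
    assumption = " ".join(new_words[len(stable):])
    return approved, assumption
```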

Nvidia Usage (HuggingFace Transformers)

import torch
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # batch size for chunked inference
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])
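Long files are processed in fixed chunk_length_s windows, typically slid with a small overlap so that words cut at a window boundary can be recovered when chunk transcripts are merged. A sketch of computing such window boundaries (illustrative only; the pipeline handles chunking internally, and the overlap value here is hypothetical):

```python
def chunk_bounds(duration_s: float, chunk_length_s: float = 10.0,
                 overlap_s: float = 1.0) -> list[tuple[float, float]]:
    """Fixed-length windows over [0, duration_s) with `overlap_s` of overlap
    between consecutive windows."""
    bounds, start = [], 0.0
    step = chunk_length_s - overlap_s
    while start < duration_s:
        bounds.append((start, min(start + chunk_length_s, duration_s)))
        start += step
    return bounds
```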

Nvidia Usage (TheStage AI engines)

import torch
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # optimized TheStage AI engines
    model_size='S',
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])

Model Details


  • Developed by: TheStage AI
  • Model type: Speech-to-Text (Automatic Speech Recognition)
  • Languages: Multilingual (same as Whisper Large V3: ~99 languages supported)
  • License: MIT
  • Finetuned from: openai/whisper-large-v3-turbo
  • Frameworks: PyTorch, CoreML
  • Supported Platforms:
    • NVIDIA GPUs (CUDA 11.8+)
    • Apple Silicon (M1–M4, macOS 15+)

Model size: 0.8B parameters (F32, Safetensors)