whisper
Model ID: @cf/openai/whisper
Automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data.
Properties
Task Type: Automatic Speech Recognition
Code Examples
Workers - TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    // Fetch a sample WAV file and read it as an ArrayBuffer.
    const res = await fetch(
      "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/samples/cpp/windows/console/samples/enrollment_audio_katie.wav"
    );
    const blob = await res.arrayBuffer();

    // The model expects the audio as an array of unsigned 8-bit integer values.
    const input = {
      audio: [...new Uint8Array(blob)],
    };

    const response = await env.AI.run("@cf/openai/whisper", input);

    // Echo an empty audio array rather than returning the raw bytes.
    return Response.json({ input: { audio: [] }, response });
  },
} satisfies ExportedHandler<Env>;
curl
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  --data-binary "@talking-llama.mp3"
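The same REST call can also be made from TypeScript. Below is a minimal sketch assuming a Node.js 18+ runtime (for the built-in fetch), with CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN read from the environment and talking-llama.mp3 as a hypothetical local file, as in the curl example:

import { readFile } from "node:fs/promises";

async function transcribe(path: string): Promise<unknown> {
  // Read the raw audio bytes, equivalent to curl's --data-binary "@file".
  const audio = await readFile(path);

  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${process.env.CLOUDFLARE_ACCOUNT_ID}/ai/run/@cf/openai/whisper`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${process.env.CLOUDFLARE_API_TOKEN}` },
      body: audio,
    }
  );

  // The REST API wraps the model output in Cloudflare's standard response envelope.
  return res.json();
}

transcribe("talking-llama.mp3").then((out) =>
  console.log(JSON.stringify(out, null, 2))
);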
Response
Automatic speech recognition responses return a single string text property containing the audio transcription and, if the model supports it, an optional words array with start and end timestamps for each word.
Here’s an example of the output from the @cf/openai/whisper model:
{ "text": "It is a good day", "word_count": 5, "words": [ { "word": "It", "start": 0.5600000023841858, "end": 1 }, { "word": "is", "start": 1, "end": 1.100000023841858 }, { "word": "a", "start": 1.100000023841858, "end": 1.2200000286102295 }, { "word": "good", "start": 1.2200000286102295, "end": 1.3200000524520874 }, { "word": "day", "start": 1.3200000524520874, "end": 1.4600000381469727 } ]
}
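As a sketch of how this response might be consumed, the TypeScript snippet below types the fields shown in the example above (text, word_count, and words with start/end) and prints each word with its timestamps, falling back to the plain transcription when word-level timestamps are not returned:

interface WhisperWord {
  word: string;
  start: number;
  end: number;
}

interface WhisperOutput {
  text: string;
  word_count?: number;
  words?: WhisperWord[];
}

// Formats word-level timestamps as one line per word, e.g. "[0.56s - 1.00s] It".
function formatTimestamps(output: WhisperOutput): string {
  if (!output.words) {
    return output.text;
  }
  return output.words
    .map((w) => `[${w.start.toFixed(2)}s - ${w.end.toFixed(2)}s] ${w.word}`)
    .join("\n");
}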
API Schema
The following schema is based on JSON Schema.
Input JSON Schema
Output JSON Schema