Audio & Video
Updated Jun 4, 2026
Speech to Text
Topics
audio, speech, transcription, asr
Overview
Produces a transcript from supplied audio, with optional timestamps and a choice of output format.
What you get
Transcribes audio to text using OpenAI Whisper with automatic language detection. Supports segment-level timestamps, multiple audio formats, and output as plain text, SRT, or VTT.
Limits: 25 MB maximum per file; free-tier throughput is 2 hours of audio per hour.
If the agent needs to ask a human for missing details, it must collect and submit them using the input schema fields: audio_url, filename (optional), language (optional), need_timestamps, need_diarization, and output_format.
Returns the detected language, audio duration, and per-segment timing information for precise alignment.
- Primary transcript text
- Optional SRT/VTT artifact
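As a rough illustration of the input schema described above, the sketch below assembles a request from the documented fields (audio_url, filename, language, need_timestamps, need_diarization, output_format) and checks the 25 MB limit before submission. The function name build_request, the size_bytes argument, the output_format spellings "text", "srt", and "vtt", and the reading of 25 MB as binary megabytes are all assumptions made for this example; the listing does not specify them.

```python
from typing import Optional

# Assumption: "25 MB" is read as binary megabytes; the listing does not say which.
MAX_FILE_BYTES = 25 * 1024 * 1024
# Assumption: these spellings stand in for plain text, SRT, and VTT.
OUTPUT_FORMATS = {"text", "srt", "vtt"}


def build_request(
    audio_url: str,
    output_format: str = "text",
    filename: Optional[str] = None,
    language: Optional[str] = None,
    need_timestamps: bool = False,
    need_diarization: bool = False,
    size_bytes: Optional[int] = None,  # hypothetical helper argument, not a schema field
) -> dict:
    """Assemble a request dict from the documented schema fields."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    if size_bytes is not None and size_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 25 MB limit")

    payload = {
        "audio_url": audio_url,
        "need_timestamps": need_timestamps,
        "need_diarization": need_diarization,
        "output_format": output_format,
    }
    # Optional fields are left out when not supplied, so language
    # detection stays automatic unless a hint is given.
    if filename is not None:
        payload["filename"] = filename
    if language is not None:
        payload["language"] = language
    return payload
```

An agent collecting missing details from a human could call build_request once every required value is known and treat a ValueError as a prompt to ask again.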
When to use
Use when
- The buyer has audio bytes and needs a transcript in plain text, SRT, or VTT.
Skip if
- The source is a YouTube URL; use YouTube Subtitle instead.
How it works
Data inspected
- Uploaded/base64 audio
- Filename
- Language hints
Pipeline
- Decode audio
- Run speech recognition
- Format transcript
Evidence trail
- Detected language
- Segments/timestamps
- Transcript length
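A caller might use the evidence trail to sanity-check a result before trusting it. The sketch below assumes a response dict with language, duration, text, and segments keys; those key names and the check_evidence function are hypothetical, chosen only to mirror the items listed above, and the real response shape may differ.

```python
def check_evidence(response: dict) -> list:
    """Return a list of problems found; an empty list means the trail is consistent."""
    problems = []
    if not response.get("language"):
        problems.append("missing detected language")

    duration = response.get("duration")
    previous_end = 0.0
    for i, seg in enumerate(response.get("segments", [])):
        if seg["end"] < seg["start"]:
            problems.append(f"segment {i} ends before it starts")
        if seg["start"] < previous_end:
            problems.append(f"segment {i} overlaps the previous segment")
        if duration is not None and seg["end"] > duration:
            problems.append(f"segment {i} runs past the audio duration")
        previous_end = max(previous_end, seg["end"])

    # Transcript length: an empty transcript is flagged for review.
    if not response.get("text", "").strip():
        problems.append("empty transcript")
    return problems
```

These checks are heuristics: silent audio can legitimately produce an empty transcript, so a reported problem is a cue to review the result rather than proof of failure.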