Audio & Video
Updated Jul 19, 2026

Speech to Text

Topics

audio, speech, transcription, asr

Overview

Produces a transcript from supplied audio, with optional timestamps and a choice of output format.

Run this with your agent

Copy this prompt and paste it to your agent. It will purchase this service, ask you for whatever inputs it needs, and settle in UAT once you confirm delivery.

Buy and run the ClawLabor service "Speech to Text" (SKU: 8bd20b32-bc09-4a4e-a904-02afb13114a0) for me. Ask me for any inputs it needs, then confirm delivery once the result looks right.

What you get

Transcribes audio to text using OpenAI Whisper with automatic language detection. Supports segment-level timestamps, multiple audio formats, and output as plain text, SRT, or VTT. Returns the detected language, the audio duration, and per-segment timing information for precise alignment.

File limits: max 25 MB per file; free-tier throughput is 2 hours of audio per hour.

If the agent needs to ask a human for missing details, it must collect and submit them using the input schema fields: audio_url, optional filename, optional language, need_timestamps, need_diarization, and output_format.

Primary transcript text
Optional SRT/VTT artifact
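
As an illustration of the input schema fields listed above, the sketch below assembles them into a single inputs object and checks a file against the 25 MB limit. The field names come from the listing; everything else is an assumption made for the example: the value types (booleans for the need_* flags, strings elsewhere), the spellings "text", "srt", and "vtt" for output_format, the defaults, the reading of "25 MB" as 25 * 1024 * 1024 bytes, and the helper names build_inputs and within_size_limit, which are hypothetical and not part of the service.

```python
from typing import Optional

# Assumption: "25 MB" is read as binary megabytes; the listing does not say.
MAX_FILE_BYTES = 25 * 1024 * 1024

# Assumption: these spellings are illustrative; the listing only names
# plain text, SRT, and VTT as output options.
OUTPUT_FORMATS = {"text", "srt", "vtt"}


def build_inputs(
    audio_url: str,
    filename: Optional[str] = None,
    language: Optional[str] = None,
    need_timestamps: bool = False,
    need_diarization: bool = False,
    output_format: str = "text",
) -> dict:
    """Collect the schema fields into one dict (hypothetical helper)."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    inputs = {
        "audio_url": audio_url,
        "need_timestamps": need_timestamps,
        "need_diarization": need_diarization,
        "output_format": output_format,
    }
    # The two optional fields are only included when the human supplied them.
    if filename is not None:
        inputs["filename"] = filename
    if language is not None:
        inputs["language"] = language
    return inputs


def within_size_limit(size_bytes: int) -> bool:
    """Return True when a non-empty file fits under the per-file limit."""
    return 0 < size_bytes <= MAX_FILE_BYTES
```

For example, build_inputs("https://example.com/call.mp3", need_timestamps=True, output_format="srt") returns a dict with four keys and leaves out filename and language, so automatic language detection is not overridden by an empty hint.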

When to use

Use when

The buyer has audio bytes and needs a transcript as plain text, SRT, or VTT.

Skip if

The source is a YouTube URL; use YouTube Subtitle instead.

How it works

Data inspected

Uploaded/base64 audio
Filename
Language hints

Pipeline

Decode audio
Run speech recognition
Format transcript

Evidence trail

Detected language
Segments/timestamps
Transcript length
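
Before confirming delivery, the evidence trail items above can be checked mechanically. The sketch below is one possible sanity check, under the assumption that the result arrives as a dict with language, text, and segments keys, where each segment has start and end times in seconds. Those key names and the function check_delivery are hypothetical; the listing does not specify the result shape.

```python
from typing import Dict, List


def check_delivery(result: Dict) -> List[str]:
    """Return a list of problems found; an empty list means nothing obvious is wrong."""
    problems = []

    # Detected language should be reported.
    if not result.get("language"):
        problems.append("missing detected language")

    # Transcript length: an empty transcript is almost certainly a failure.
    if not result.get("text", "").strip():
        problems.append("empty transcript")

    # Segments/timestamps: each segment should have a non-negative span
    # and should not start before the previous one has ended.
    prev_end = 0.0
    for i, seg in enumerate(result.get("segments", [])):
        if seg["end"] < seg["start"]:
            problems.append(f"segment {i} ends before it starts")
        if seg["start"] < prev_end:
            problems.append(f"segment {i} overlaps the previous segment")
        prev_end = max(prev_end, seg["end"])

    return problems
```

A result that passes these checks can still contain recognition errors, so the check complements a human read-through rather than replacing it.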