This page outlines the end-to-end flow for creating a Voice File Job, uploading audio, and retrieving results. The current implementation uses a client-triggered upload-complete call (no S3 event).
What you need
- API key: `Authorization: Bearer {id}.{secret}`
- transcriptLocaleHints: up to 1 (optional)
  - If not provided, the language is detected automatically
- translationLocales: up to 5 (optional)
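The requirements above could be assembled on the client roughly as follows. This is a minimal sketch: the helper names are hypothetical, and passing the two locale fields as lists in the request body is an assumption. Only the header format and the limits (one hint, five translation locales) come from this page.

```python
from typing import Optional, Sequence

MAX_LOCALE_HINTS = 1         # transcriptLocaleHints: up to 1
MAX_TRANSLATION_LOCALES = 5  # translationLocales: up to 5


def build_auth_header(key_id: str, key_secret: str) -> dict:
    """Return the Authorization header in the `Bearer {id}.{secret}` form."""
    return {"Authorization": f"Bearer {key_id}.{key_secret}"}


def build_job_options(
    transcript_locale_hints: Optional[Sequence[str]] = None,
    translation_locales: Optional[Sequence[str]] = None,
) -> dict:
    """Validate the optional locale fields and return only those that were set."""
    hints = list(transcript_locale_hints or [])
    translations = list(translation_locales or [])
    if len(hints) > MAX_LOCALE_HINTS:
        raise ValueError(f"transcriptLocaleHints accepts at most {MAX_LOCALE_HINTS} locale")
    if len(translations) > MAX_TRANSLATION_LOCALES:
        raise ValueError(f"translationLocales accepts at most {MAX_TRANSLATION_LOCALES} locales")
    options = {}
    if hints:
        # Assumed body shape; when omitted, the language is detected automatically.
        options["transcriptLocaleHints"] = hints
    if translations:
        options["translationLocales"] = translations
    return options
```

Validating the counts before sending the request surfaces mistakes locally instead of as an API error.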
End-to-end checklist
- Create job → receive `{ id, uploadUri }`
- Upload audio to `uploadUri`
- Call `PUT /v1/external/voice-file/jobs/{jobId}/upload-complete`
- Poll job status until completion
- Fetch transcript and (optionally) translations
- Get paragraph summaries for better content understanding
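The first three steps of the checklist might look like the sketch below, using only the Python standard library. Several details are assumptions not stated on this page: `BASE_URL` is a placeholder host, the job is assumed to be created with `POST` on the jobs collection, and the audio is assumed to be uploaded with `PUT` to `uploadUri`. The `request` parameter is a hypothetical hook so the HTTP layer can be swapped out.

```python
import json
import urllib.request
from typing import Callable, Optional

BASE_URL = "https://api.example.com"  # placeholder; substitute the real API host
JOBS_PATH = "/v1/external/voice-file/jobs"


def http_request(method: str, url: str, headers: Optional[dict] = None,
                 body: Optional[bytes] = None) -> bytes:
    """Send one HTTP request with urllib and return the raw response bytes."""
    req = urllib.request.Request(url, data=body, headers=headers or {}, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def create_and_upload(audio: bytes, auth_header: dict,
                      options: Optional[dict] = None,
                      request: Callable = http_request) -> str:
    """Create a job, upload the audio, signal upload-complete, return the job id."""
    # Step 1: create the job (POST on the jobs collection is an assumption).
    headers = {**auth_header, "Content-Type": "application/json"}
    raw = request("POST", BASE_URL + JOBS_PATH, headers,
                  json.dumps(options or {}).encode())
    job = json.loads(raw)  # expected to contain { id, uploadUri }

    # Step 2: upload the audio to the returned uploadUri (PUT is an assumption).
    request("PUT", job["uploadUri"], {}, audio)

    # Step 3: the client must signal completion itself; no S3 event does it.
    request("PUT", f"{BASE_URL}{JOBS_PATH}/{job['id']}/upload-complete",
            auth_header, None)
    return job["id"]
```

The returned job id is what the remaining steps (polling and fetching results) operate on.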
Polling strategy (recommended)
- Poll `GET /v1/external/voice-file/jobs/{jobId}` with exponential backoff
  - Start at a 1–2s interval, then 4s, 8s, and so on, up to a 30s cap
- Stop conditions:
  - Success: `status = COMPLETED`
  - Failure: `status = FAILED`
- After success:
  - Always fetch the transcript: `GET /v1/external/voice-file/jobs/{jobId}/transcript`
  - If you requested translations: `GET /v1/external/voice-file/jobs/{jobId}/translations` or the per-locale endpoint
  - Get paragraph summaries:
    - For the transcript: `GET /v1/external/voice-file/jobs/{jobId}/transcript/paragraph-summary`
    - For translations: `GET /v1/external/voice-file/jobs/{jobId}/translations/{locale}/paragraph-summary`
The final status for successful processing is COMPLETED. At that point all processing has finished, including the transcript, translations (if requested), and paragraph summaries for both.
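The polling strategy can be sketched as a small backoff loop. Here `fetch_status` is a hypothetical callable that performs `GET /v1/external/voice-file/jobs/{jobId}` and returns the job's `status` string. The 30-minute default timeout is an assumption, and any status other than COMPLETED or FAILED is treated as still in progress.

```python
import time
from typing import Callable, Iterator


def backoff_intervals(start: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield delays that double from `start` until they reach `cap`."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job is COMPLETED or FAILED; raise TimeoutError otherwise."""
    waited = 0.0
    for delay in backoff_intervals():
        status = fetch_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        if waited + delay > timeout:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

The `sleep` parameter exists only so the loop can be exercised without real waiting; in production the default `time.sleep` is used.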
File Limits & Requirements
Based on industry-standard STT capabilities, Tiro supports the following formats:
Audio Formats:
| Format | MIME Type | Extension |
|---|---|---|
| MP3 | audio/mpeg | .mp3 |
| WAV | audio/wav | .wav |
| M4A | audio/mp4 | .m4a |
Video Formats (audio extraction):
| Format | MIME Type | Extension |
|---|---|---|
| MP4 | video/mp4 | .mp4 |
File Size & Duration Limits
| Limit Type | Value | Notes |
|---|---|---|
| Max File Size | 500 MB | Practical limit for most use cases |
| Max Duration | 3 hours | Covers most meetings and interviews |
| Min Sample Rate | 8 kHz | Minimum for speech recognition |
| Recommended Sample Rate | 16 kHz+ | Optimal for accuracy |
| Max Sample Rate | 48 kHz | Studio quality support |
| Channels | Mono or Stereo | Multi-speaker support |
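A client can check a file against these limits before creating a job. The following is a sketch under stated assumptions: `check_file` is a hypothetical helper, "500 MB" is interpreted as 500 × 1024 × 1024 bytes, and the duration and sample rate are supplied by the caller (for example from a media probe) rather than read from the file.

```python
import os
from typing import List, Optional

# Extension -> MIME type, taken from the format tables above.
SUPPORTED_EXTENSIONS = {
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
    ".mp4": "video/mp4",
}
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024  # assumes binary megabytes
MAX_DURATION_SECONDS = 3 * 60 * 60
MIN_SAMPLE_RATE_HZ = 8_000
MAX_SAMPLE_RATE_HZ = 48_000


def check_file(
    filename: str,
    size_bytes: int,
    duration_seconds: Optional[float] = None,
    sample_rate_hz: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension {ext!r}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 500 MB")
    if duration_seconds is not None and duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 3 hours")
    if sample_rate_hz is not None and not (
        MIN_SAMPLE_RATE_HZ <= sample_rate_hz <= MAX_SAMPLE_RATE_HZ
    ):
        problems.append("sample rate outside 8-48 kHz")
    return problems
```

Returning all problems at once, rather than raising on the first, lets a UI show the user everything that needs fixing in one pass.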
Processing Time Estimates
Processing times are optimized with parallel processing for longer files:
| File Duration | Typical Processing Time |
|---|---|
| < 5 minutes | 45-75 seconds |
| 5-20 minutes | 1-3 minutes |
| 20-60 minutes | 3-6 minutes |
| 1-4 hours | 6-18 minutes |
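One possible use of these estimates is to pick a polling timeout from the audio duration. The sketch below simply encodes the table; the function names, the handling of the bucket boundaries, and the safety factor of 2 are all assumptions rather than documented behavior.

```python
from typing import Tuple


def typical_processing_seconds(duration_minutes: float) -> Tuple[int, int]:
    """Return the (low, high) typical processing time in seconds from the table."""
    if duration_minutes < 5:
        return (45, 75)
    if duration_minutes <= 20:
        return (60, 180)
    if duration_minutes <= 60:
        return (180, 360)
    if duration_minutes <= 240:
        return (360, 1080)
    raise ValueError("duration is outside the documented ranges")


def suggested_timeout_seconds(duration_minutes: float, safety_factor: float = 2.0) -> float:
    """Scale the upper estimate by a safety factor (assumed, not documented)."""
    _, high = typical_processing_seconds(duration_minutes)
    return high * safety_factor
```

For a 30-minute recording this gives a 720-second timeout: the 6-minute upper estimate, doubled.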
Paragraph Summary Feature
The Paragraph Summary feature provides intelligent summarization of audio content, breaking down the transcript or translation into digestible paragraph-level summaries.
How it works
- Transcript Processing: After transcription is complete, the text is automatically split into logical paragraphs using the CompositeTextSplitter
- Summary Generation: Each paragraph is processed through the ParagraphSummarizer to create concise summaries
- Translation Integration: When translations are requested, summaries are regenerated based on the translated content for each locale
- Asynchronous Processing: Summary generation happens asynchronously after transcript/translation completion
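Because summaries exist for the transcript and separately for each translated locale, a client that wants everything ends up with one summary request per locale. The hypothetical helper below only builds the documented paths for a completed job; it makes no assumption about the response format, and the dictionary labels are purely illustrative.

```python
from typing import Dict, Sequence

JOBS_PATH = "/v1/external/voice-file/jobs"


def result_paths(job_id: str, translation_locales: Sequence[str] = ()) -> Dict[str, str]:
    """Map a label to each result path to fetch once the job is COMPLETED."""
    base = f"{JOBS_PATH}/{job_id}"
    paths = {
        "transcript": f"{base}/transcript",
        "transcript_summary": f"{base}/transcript/paragraph-summary",
    }
    if translation_locales:
        paths["translations"] = f"{base}/translations"
    for locale in translation_locales:
        # One summary per requested locale, generated from the translated text.
        paths[f"translation_summary:{locale}"] = (
            f"{base}/translations/{locale}/paragraph-summary"
        )
    return paths
```

Pass the same locales that were requested in `translationLocales` when the job was created.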
See Also