This page outlines the end-to-end flow for creating a Voice File Job, uploading audio, and retrieving results. The current implementation uses a client-triggered upload-complete call (no S3 event).

What you need

  • API key: Authorization: Bearer {id}.{secret}
  • transcriptLocaleHints: up to 1 (optional)
    • If not provided, language will be automatically detected
  • translationLocales: up to 5 (optional)
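
The requirements above can be sketched as a small request builder. This is a minimal sketch, not the official client: the function name, the assumption that both locale fields are sent as JSON arrays, and the locale strings used in the usage note are illustrative and not confirmed by this page.

```python
import json

MAX_TRANSCRIPT_LOCALE_HINTS = 1   # "up to 1 (optional)"
MAX_TRANSLATION_LOCALES = 5       # "up to 5 (optional)"


def build_create_job_request(key_id, key_secret,
                             transcript_locale_hints=None,
                             translation_locales=None):
    """Return (headers, json_body) for a create-job call.

    Hypothetical helper: the body shape (JSON arrays under the two
    documented field names) is an assumption.
    """
    hints = list(transcript_locale_hints or [])
    targets = list(translation_locales or [])
    if len(hints) > MAX_TRANSCRIPT_LOCALE_HINTS:
        raise ValueError("transcriptLocaleHints accepts at most 1 locale")
    if len(targets) > MAX_TRANSLATION_LOCALES:
        raise ValueError("translationLocales accepts at most 5 locales")

    headers = {
        # API key format from this page: Bearer {id}.{secret}
        "Authorization": f"Bearer {key_id}.{key_secret}",
        "Content-Type": "application/json",
    }
    body = {}
    if hints:
        # Omitting the hint leaves language detection to the service.
        body["transcriptLocaleHints"] = hints
    if targets:
        body["translationLocales"] = targets
    return headers, json.dumps(body)
```

For example, `build_create_job_request("my-id", "my-secret", translation_locales=["en_US"])` omits the hint so the language is detected automatically; the locale string format shown here is a placeholder.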

End-to-end checklist

  1. Create job → receive { id, uploadUri }
  2. Upload audio to uploadUri
  3. Call PUT /v1/external/voice-file/jobs/{jobId}/upload-complete
  4. Poll job status until completion
  5. Fetch transcript and (optionally) translations
  6. Get paragraph summaries for better content understanding

Polling and retrieval

  • Poll GET /v1/external/voice-file/jobs/{jobId} with exponential backoff
    • Start with a 1–2s interval, then double it (4s, 8s, ...) up to a 30s cap
    • Stop conditions:
      • Success: status = COMPLETED
      • Failure: status = FAILED
    • After success:
      • Always fetch transcript: GET /v1/external/voice-file/jobs/{jobId}/transcript
      • If you requested translations: GET /v1/external/voice-file/jobs/{jobId}/translations or per-locale endpoint
      • Get paragraph summaries:
        • For transcript: GET /v1/external/voice-file/jobs/{jobId}/transcript/paragraph-summary
        • For translations: GET /v1/external/voice-file/jobs/{jobId}/translations/{locale}/paragraph-summary

COMPLETED is the terminal status for a successful job. When a job reaches it, the transcript, any requested translations, and the paragraph summaries for both are all ready to fetch.
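
The polling guidance can be sketched as follows. This is a minimal sketch under stated assumptions: `fetch_status` is a hypothetical callable that performs the GET and returns the job's status string, `max_wait` is an illustrative safety limit that is not part of the API, and any status other than COMPLETED or FAILED is treated as "still in progress" because this page does not list the intermediate status names.

```python
import time

BASE = "/v1/external/voice-file/jobs"


def backoff_intervals(start=2.0, factor=2.0, cap=30.0):
    """Yield wait times: 2s, 4s, 8s, 16s, then 30s forever."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def poll_until_done(fetch_status, sleep=time.sleep, max_wait=1800.0):
    """Poll until the job is COMPLETED or FAILED and return that status.

    fetch_status: hypothetical callable wrapping GET {BASE}/{jobId}.
    max_wait: illustrative overall limit in seconds (an assumption).
    """
    waited = 0.0
    for delay in backoff_intervals():
        status = fetch_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        if waited + delay > max_wait:
            raise TimeoutError("job did not finish within max_wait")
        sleep(delay)
        waited += delay


def result_paths(job_id, translation_locales=()):
    """List the result endpoints to fetch once the job is COMPLETED."""
    job = f"{BASE}/{job_id}"
    paths = [f"{job}/transcript", f"{job}/transcript/paragraph-summary"]
    if translation_locales:
        paths.append(f"{job}/translations")
    for locale in translation_locales:
        paths.append(f"{job}/translations/{locale}/paragraph-summary")
    return paths
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting. On FAILED the caller should stop and skip the result endpoints.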

File Limits & Requirements

Supported Audio Formats

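Based on industry-standard STT capabilities, Tiro supports the following formats.

Audio Formats:

| Format | MIME Type | Extension |
|--------|-----------|-----------|
| MP3 | audio/mpeg | .mp3 |
| WAV | audio/wav | .wav |
| M4A | audio/mp4 | .m4a |

Video Formats (audio extraction):

| Format | MIME Type | Extension |
|--------|-----------|-----------|
| MP4 | video/mp4 | .mp4 |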

File Size & Duration Limits

| Limit Type | Value | Notes |
|------------|-------|-------|
| Max File Size | 500 MB | Practical limit for most use cases |
| Max Duration | 3 hours | Covers most meetings and interviews |
| Min Sample Rate | 8 kHz | Minimum for speech recognition |
| Recommended Sample Rate | 16 kHz+ | Optimal for accuracy |
| Max Sample Rate | 48 kHz | Studio quality support |
| Channels | Mono or Stereo | Multi-speaker support |
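
A client can check these limits before creating a job. The sketch below is a hypothetical pre-upload helper, not part of the API: it assumes the caller already knows the file's size, duration, sample rate, channel count, and MIME type (for example from a media probing tool), and it assumes 1 MB means 1024 × 1024 bytes, which this page does not specify.

```python
MAX_FILE_BYTES = 500 * 1024 * 1024      # 500 MB (binary MB is an assumption)
MAX_DURATION_SECONDS = 3 * 60 * 60      # 3 hours
MIN_SAMPLE_RATE_HZ = 8_000              # 8 kHz
MAX_SAMPLE_RATE_HZ = 48_000             # 48 kHz
SUPPORTED_MIME_TYPES = {"audio/mpeg", "audio/wav", "audio/mp4", "video/mp4"}


def check_upload(size_bytes, duration_seconds, sample_rate_hz, channels, mime_type):
    """Return a list of problems; an empty list means the file fits the documented limits."""
    problems = []
    if mime_type not in SUPPORTED_MIME_TYPES:
        problems.append(f"unsupported MIME type: {mime_type}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 500 MB")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration is longer than 3 hours")
    if not MIN_SAMPLE_RATE_HZ <= sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
        problems.append("sample rate is outside 8-48 kHz")
    if channels not in (1, 2):
        problems.append("only mono or stereo audio is supported")
    return problems
```

Returning all problems at once, rather than raising on the first, lets a client show the user everything that needs fixing before re-encoding.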

Processing Time Estimates

Processing times are optimized with parallel processing for longer files:
| File Duration | Typical Processing Time |
|---------------|-------------------------|
| < 5 minutes | 45-75 seconds |
| 5-20 minutes | 1-3 minutes |
| 20-60 minutes | 3-6 minutes |
| 1-4 hours | 6-18 minutes |
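
These estimates can help a client decide how long to keep polling. The sketch below is a hypothetical lookup over the table: the function name is illustrative, and because the table's ranges share their boundary values, it assumes a boundary duration (for example exactly 20 minutes) belongs to the longer bucket.

```python
def estimated_processing_seconds(duration_minutes):
    """Return (low, high) typical processing time in seconds, or None if out of range.

    Boundary handling is an assumption: 5, 20, and 60 minutes fall into
    the longer bucket; 240 minutes is included in the last bucket.
    """
    if duration_minutes < 0:
        raise ValueError("duration must be non-negative")
    if duration_minutes < 5:
        return (45, 75)
    if duration_minutes < 20:
        return (60, 180)
    if duration_minutes < 60:
        return (180, 360)
    if duration_minutes <= 240:
        return (360, 1080)
    return None  # outside the range covered by the table
```

A client might, for instance, use the upper bound plus a generous margin as its overall polling timeout. These are typical values, not guarantees.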

Paragraph Summary Feature

The Paragraph Summary feature provides intelligent summarization of audio content, breaking down the transcript or translation into digestible paragraph-level summaries.

How it works

  1. Transcript Processing: After transcription is complete, the text is automatically split into logical paragraphs using the CompositeTextSplitter
  2. Summary Generation: Each paragraph is processed through the ParagraphSummarizer to create concise summaries
  3. Translation Integration: When translations are requested, summaries are regenerated based on the translated content for each locale
  4. Asynchronous Processing: Summary generation happens asynchronously after transcript/translation completion
