SwiftyAI

Audio

SwiftyAI exposes two dedicated audio functions: transcribe for audio-to-text and generateSpeech for text-to-audio. Audio can also be passed to a language model as multimodal input inside a prompt.

| API | Direction | Output |
| --- | --- | --- |
| transcribe | Audio to text | TranscriptionResponse with text and optional metadata |
| generateSpeech | Text to audio | SpeechResponse with audio bytes, format, and media type |
| Multimodal audio input | Audio inside a model prompt | Text or object response from the language model |

Transcription

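// Note: the force unwrap crashes at launch if OPENAI_API_KEY is not set.
let transcriptionModel = OpenAICompatibleProvider(
    baseURL: "https://api.openai.com/v1",
    apiKey: ProcessInfo.processInfo.environment["OPENAI_API_KEY"]!,
    model: "gpt-4o-transcribe"
)

// audioData holds the raw audio bytes, loaded elsewhere (for example, from a file).
let audio = AIAudioInput(
    data: audioData,
    filename: "meeting.wav",
    mediaType: .wav
)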
 
let transcript = try await transcribe(
    model: transcriptionModel,
    audio: audio,
    options: TranscriptionOptions(
        language: "en",
        prompt: "This is a product planning meeting.",
        responseFormat: .json
    )
)
 
print(transcript.text)

TranscriptionResponse includes text plus optional language, duration, and model fields.
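
Because only text is guaranteed, code that reads the metadata has to handle missing values. The following is a minimal sketch of that pattern. TranscriptSummary is a hypothetical stand-in that mirrors the fields described above, and the field types (including duration in seconds) are assumptions; it is not the library's actual TranscriptionResponse type.

```swift
import Foundation

// Hypothetical stand-in mirroring the fields described above.
// The real TranscriptionResponse type and its field types may differ.
struct TranscriptSummary {
    let text: String
    let language: String?
    let duration: Double?   // assumed to be in seconds
    let model: String?
}

// Builds a one-line description, skipping any metadata that is absent.
func describe(_ transcript: TranscriptSummary) -> String {
    var parts = ["\(transcript.text.count) characters"]
    if let language = transcript.language {
        parts.append("language: \(language)")
    }
    if let duration = transcript.duration {
        parts.append(String(format: "duration: %.1fs", duration))
    }
    if let model = transcript.model {
        parts.append("model: \(model)")
    }
    return parts.joined(separator: ", ")
}

let full = TranscriptSummary(
    text: "Hello team",
    language: "en",
    duration: 12.5,
    model: "gpt-4o-transcribe"
)
let bare = TranscriptSummary(text: "Hello team", language: nil, duration: nil, model: nil)

print(describe(full))
print(describe(bare))
```

Unwrapping each field with if let keeps the output valid when a provider omits some or all of the metadata.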

Speech Generation

let speechModel = OpenAICompatibleProvider(
    baseURL: "https://api.openai.com/v1",
    apiKey: ProcessInfo.processInfo.environment["OPENAI_API_KEY"]!,
    model: "gpt-4o-mini-tts"
)
 
let speech = try await generateSpeech(
    model: speechModel,
    text: "Your export is ready.",
    options: SpeechOptions(
        voice: "alloy",
        format: .mp3,
        speed: 1.0,
        instructions: "Calm, clear, and brief."
    )
)
 
// outputURL is a file URL of your choosing, defined elsewhere.
try speech.data.write(to: outputURL)

SpeechResponse contains the audio bytes, format, media type, and model.
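
When saving the audio bytes, it helps to give the file an extension that matches the requested format. The sketch below uses only Foundation. AudioFileFormat and saveAudio are hypothetical names introduced for illustration, not part of SwiftyAI, and the bytes are placeholders rather than real audio.

```swift
import Foundation

// Hypothetical stand-in for the format reported by a speech response.
enum AudioFileFormat: String {
    case mp3
    case wav
}

// Writes the bytes to <directory>/<name>.<format> and returns the final URL.
func saveAudio(
    _ data: Data,
    named name: String,
    format: AudioFileFormat,
    in directory: URL
) throws -> URL {
    let url = directory
        .appendingPathComponent(name)
        .appendingPathExtension(format.rawValue)
    // Atomic write: the file appears only once it is fully written.
    try data.write(to: url, options: .atomic)
    return url
}

// Placeholder bytes, not real audio.
let bytes = Data([0x49, 0x44, 0x33])
let savedURL = try saveAudio(
    bytes,
    named: "export-ready",
    format: .mp3,
    in: FileManager.default.temporaryDirectory
)
print(savedURL.lastPathComponent)
```

Deriving the extension from the format value avoids a mismatch such as WAV bytes saved under an .mp3 name.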

| Option | Use |
| --- | --- |
| voice | Pick the provider voice or speaker style |
| format | Choose MP3, WAV, or another supported audio container |
| speed | Slow down or speed up narration where supported |
| instructions | Guide tone, pacing, or style |
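
Since speed support varies by provider, one option is to clamp a requested value into a known range before passing it along. This is a small sketch only: clampedSpeed is a hypothetical helper, and the bounds shown are illustrative rather than any provider's documented limits.

```swift
// Hypothetical helper: clamp a requested speed into the range a provider accepts.
func clampedSpeed(_ requested: Double, supported: ClosedRange<Double>) -> Double {
    min(max(requested, supported.lowerBound), supported.upperBound)
}

// Illustrative bounds only; check the provider's documentation for real limits.
let exampleRange = 0.5...2.0

print(clampedSpeed(3.0, supported: exampleRange))  // above the range, clamped down
print(clampedSpeed(1.0, supported: exampleRange))  // inside the range, unchanged
print(clampedSpeed(0.1, supported: exampleRange))  // below the range, clamped up
```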

Gemini Audio

Gemini can also be used through GeminiProvider for supported audio operations:

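let gemini = GeminiProvider(
    apiKey: ProcessInfo.processInfo.environment["GEMINI_API_KEY"]!,
    // Placeholder name: substitute a Gemini model that supports audio.
    model: "gemini-audio-model"
)

// Reuses the AIAudioInput value from the Transcription example.
let transcript = try await transcribe(model: gemini, audio: audio)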

Consult the provider's documentation to choose a model that supports the operation you need.

Related docs

For audio inside a prompt, read multimodal input. For generated media, read image generation and video.