Audio

SwiftyAI exposes two dedicated audio APIs: `transcribe` for audio-to-text and `generateSpeech` for text-to-audio. Audio can also be passed to a language model as multimodal input inside a prompt.

API	Direction	Output
`transcribe`	Audio to text	`TranscriptionResponse` with text and optional metadata
`generateSpeech`	Text to audio	`SpeechResponse` with audio bytes, format, and media type
Multimodal audio input	Audio inside a model prompt	Text or object response from the language model

Transcription

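// Configure a provider with a transcription-capable model.
// The force unwrap crashes if OPENAI_API_KEY is not set.
let transcriptionModel = OpenAICompatibleProvider(
    baseURL: "https://api.openai.com/v1",
    apiKey: ProcessInfo.processInfo.environment["OPENAI_API_KEY"]!,
    model: "gpt-4o-transcribe"
)

// Wrap the recording; `audioData` holds its bytes and is loaded elsewhere.
let audio = AIAudioInput(
    data: audioData,
    filename: "meeting.wav",
    mediaType: .wav
)

// The language hint and prompt give the model context for the recording.
let transcript = try await transcribe(
    model: transcriptionModel,
    audio: audio,
    options: TranscriptionOptions(
        language: "en",
        prompt: "This is a product planning meeting.",
        responseFormat: .json
    )
)

print(transcript.text)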

`TranscriptionResponse` includes `text` plus optional language, duration, and model fields.
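Because the metadata fields are optional, calling code should not assume they are present. The following is a minimal sketch of that handling. `TranscriptionSketch` and `summarize` are hypothetical stand-ins written for this example so it can run on its own; the real `TranscriptionResponse` may use different property names or types.

```swift
import Foundation

// Hypothetical stand-in for SwiftyAI's TranscriptionResponse.
// Property names and types are assumptions made for this sketch.
struct TranscriptionSketch {
    let text: String
    let language: String?
    let duration: TimeInterval?
    let model: String?
}

// Builds a one-line summary, skipping any metadata the provider did not return.
func summarize(_ response: TranscriptionSketch) -> String {
    var parts: [String] = []
    if let language = response.language {
        parts.append("language: \(language)")
    }
    if let duration = response.duration {
        parts.append("duration: " + String(format: "%.1f", duration) + "s")
    }
    if let model = response.model {
        parts.append("model: \(model)")
    }
    let metadata = parts.isEmpty ? "no metadata" : parts.joined(separator: ", ")
    return "\(response.text) [\(metadata)]"
}

let full = TranscriptionSketch(
    text: "Hello",
    language: "en",
    duration: 12.34,
    model: "gpt-4o-transcribe"
)
let bare = TranscriptionSketch(text: "Hello", language: nil, duration: nil, model: nil)

print(summarize(full))
print(summarize(bare))
```

Unwrapping each field separately keeps the summary useful even when a provider returns only the text.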

Speech Generation

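// Configure a provider with a text-to-speech model.
let speechModel = OpenAICompatibleProvider(
    baseURL: "https://api.openai.com/v1",
    apiKey: ProcessInfo.processInfo.environment["OPENAI_API_KEY"]!,
    model: "gpt-4o-mini-tts"
)

// Request MP3 narration; voice, speed, and instructions shape the delivery.
let speech = try await generateSpeech(
    model: speechModel,
    text: "Your export is ready.",
    options: SpeechOptions(
        voice: "alloy",
        format: .mp3,
        speed: 1.0,
        instructions: "Calm, clear, and brief."
    )
)

// Save the returned audio bytes; `outputURL` is a file URL chosen by the caller.
try speech.data.write(to: outputURL)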

`SpeechResponse` contains the audio bytes, format, media type, and model.

Option	Use
`voice`	Pick the provider voice or speaker style
`format`	Choose MP3, WAV, or another supported audio container
`speed`	Slow down or speed up narration where supported
`instructions`	Guide tone, pacing, or style
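When saving generated speech, the file extension should match the requested format. The sketch below shows one way to keep the two in sync. `SpeechFormatSketch` and `outputURL(named:format:in:)` are hypothetical helpers defined only for this example; SwiftyAI's own format type may have different cases or names.

```swift
import Foundation

// Hypothetical stand-in for the format value on SpeechResponse.
// The cases here are assumptions; only MP3 and WAV are modeled.
enum SpeechFormatSketch: String {
    case mp3, wav

    // File extension to use when saving the audio.
    var fileExtension: String { rawValue }

    // Commonly used media type for each container.
    var mediaType: String {
        switch self {
        case .mp3: return "audio/mpeg"
        case .wav: return "audio/wav"
        }
    }
}

// Builds an output URL whose extension matches the audio format.
func outputURL(named name: String, format: SpeechFormatSketch, in directory: URL) -> URL {
    return directory
        .appendingPathComponent(name)
        .appendingPathExtension(format.fileExtension)
}

let url = outputURL(
    named: "export-ready",
    format: .mp3,
    in: FileManager.default.temporaryDirectory
)
print(url.lastPathComponent, SpeechFormatSketch.mp3.mediaType)
```

Deriving the extension from the format value avoids writing WAV bytes into a file named `.mp3` when the requested format changes.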

Gemini Audio

Gemini can also be used through GeminiProvider for supported audio operations:

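// Replace the model name with one that supports the operation you need.
let gemini = GeminiProvider(
    apiKey: ProcessInfo.processInfo.environment["GEMINI_API_KEY"]!,
    model: "gemini-audio-model"
)

// Reuses the `audio` input from the transcription example.
let transcript = try await transcribe(model: gemini, audio: audio)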

Audio support varies by model, so check the provider's documentation for a model ID that supports the operation you need.
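The provider setups on this page force-unwrap the API key with `!`, which crashes when the environment variable is missing. A throwing lookup may be preferable in production code. The sketch below is one possible approach; `MissingEnvironmentVariable` and `requiredEnvironmentValue` are hypothetical names introduced for this example and are not part of SwiftyAI.

```swift
import Foundation

// Error thrown when a required environment variable is missing or empty.
struct MissingEnvironmentVariable: Error, CustomStringConvertible {
    let name: String
    var description: String { "Environment variable \(name) is not set" }
}

// Reads a required value from the given environment, throwing instead of crashing.
// The environment is injectable so the function can be exercised in tests.
func requiredEnvironmentValue(
    _ name: String,
    in environment: [String: String] = ProcessInfo.processInfo.environment
) throws -> String {
    guard let value = environment[name], !value.isEmpty else {
        throw MissingEnvironmentVariable(name: name)
    }
    return value
}

do {
    let apiKey = try requiredEnvironmentValue("GEMINI_API_KEY")
    // Avoid printing the key itself; report only that it was found.
    print("Loaded a key with \(apiKey.count) characters")
} catch {
    print(error)
}
```

The resulting string can then be passed as the `apiKey:` argument in place of the force-unwrapped lookup.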

Related docs

For audio inside a prompt, read multimodal input. For generated media, read image generation and video.