Multimodal Input
Multimodal prompts use [AIMessageContent] instead of a single String. The same prompt parts can be sent to generateText, streamText, and generateObject when the provider supports the media type.
| Part type | Constructors | Common use |
|---|---|---|
| Text | .text | Instructions, questions, surrounding context |
| Images | .imageURL, .imageData, .imageBase64 | Screenshot review, chart description, visual QA |
| PDFs | .pdfURL, .pdfData, .pdfBase64 | Report summarization, policy extraction, document Q&A |
| Files | .fileURL, .fileData, .fileBase64 | Provider-supported file prompts with explicit media type |
| Audio | .audioData, .audioBase64 | Audio understanding inside a prompt |
| Video | .videoURL, .videoData, .videoBase64 | Video understanding inside a prompt |
Images
let response = try await generateText(
    model: model,
    prompt: [
        .text("List the visible accessibility issues in this screen."),
        .imageURL(URL(string: "https://example.com/screen.png")!, detail: .high)
    ]
)
For local images, pass data:
let data = try Data(contentsOf: screenshotURL)
let response = try await generateText(
    model: model,
    prompt: [
        .text("Describe this chart in one paragraph."),
        .imageData(data, mediaType: .png, detail: .auto)
    ]
)
PDFs And Files
let report = try Data(contentsOf: reportURL)
let summary = try await generateText(
    model: model,
    prompt: [
        .text("Summarize the risks in this report."),
        .pdfData(report, filename: "q4-risk-report.pdf")
    ]
)
Generic files can be sent with fileURL, fileData, or fileBase64 and an explicit AIMediaType.
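Because generic file parts need an explicit media type, it can help to decide that type from the file extension before building the prompt. The following is a minimal sketch using only Foundation. The lookup table and the mimeType(for:) helper are hypothetical and not part of the package, and how a MIME string maps onto AIMediaType is an assumption to check against the package itself.

```swift
import Foundation

// Hypothetical lookup table (not part of the package); extend it with
// the formats your provider actually accepts.
let mimeTypesByExtension: [String: String] = [
    "pdf": "application/pdf",
    "png": "image/png",
    "wav": "audio/wav",
    "csv": "text/csv",
    "json": "application/json"
]

// Returns a MIME string for the file's extension, or nil when the
// extension is unknown, so the caller can refuse to send the file
// rather than guess a media type.
func mimeType(for fileURL: URL) -> String? {
    mimeTypesByExtension[fileURL.pathExtension.lowercased()]
}
```

Returning nil for unknown extensions keeps the "explicit media type" requirement honest: the caller has to handle the unknown case instead of silently sending a default.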
Audio And Video Parts
// Audio: pass the clip as in-memory data with an explicit media type.
let response = try await generateText(
    model: model,
    prompt: [
        .text("Extract action items from this audio clip."),
        .audioData(audioData, mediaType: .wav, filename: "standup.wav")
    ]
)

// Video: pass a URL to the clip.
let response = try await generateText(
    model: model,
    prompt: [
        .text("Describe what happens in this product demo."),
        .videoURL(videoURL)
    ]
)
There is no dedicated .audioURL prompt constructor in the current package. For audio prompts, load the bytes into Data or pass an existing base64 string.
For dedicated transcription and speech generation, use the audio APIs instead. Multimodal input is for model understanding; media APIs are for producing or transcribing media.
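Since audio parts take bytes or a base64 string rather than a URL, a local clip has to be read into memory first. A minimal sketch using only Foundation follows. The loadAudioPayload(from:) helper is hypothetical and not part of the package; it just prepares both forms so that either .audioData or .audioBase64 could be used afterwards.

```swift
import Foundation

// Hypothetical helper (not part of the package): reads a local file
// into memory and returns both the raw bytes and their base64 encoding.
func loadAudioPayload(from fileURL: URL) throws -> (bytes: Data, base64: String) {
    let bytes = try Data(contentsOf: fileURL)
    return (bytes, bytes.base64EncodedString())
}
```

Data(contentsOf:) loads the whole file at once, so for long recordings it may be worth checking the file size against the provider's upload limit before calling it.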
Provider Support
AIMessageContent can represent many media types, but the provider decides which ones are accepted. If a model does not support a media part, the provider call can fail with an API error or AIError.unsupportedFeature.
| Provider behavior | How to handle it |
|---|---|
| Accepts only text | Use generateText with a String prompt |
| Accepts images but not files | Split document workflows into extraction and generation |
| Accepts files with upload IDs | Use the file URL/base64/data constructor that matches the provider implementation |
| Rejects a media type | Catch provider errors and offer a text-only fallback |
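The last row, catching a rejection and falling back to text, can be sketched as a small wrapper. To stay self-contained, this sketch uses a stand-in error type, SketchAIError, and a hypothetical helper, withTextFallback; neither is part of the package. In real code the catch clause would match whatever the package's AIError.unsupportedFeature actually looks like, which may carry associated values.

```swift
import Foundation

// Stand-in for the package's error type (assumption): the real
// AIError may have different cases and associated values.
enum SketchAIError: Error {
    case unsupportedFeature
}

// Hypothetical helper: tries the multimodal request first and runs the
// text-only request only when the provider reports an unsupported
// feature. Any other error propagates to the caller unchanged.
func withTextFallback<T>(
    multimodal: () async throws -> T,
    textOnly: () async throws -> T
) async throws -> T {
    do {
        return try await multimodal()
    } catch SketchAIError.unsupportedFeature {
        return try await textOnly()
    }
}
```

In practice each closure would wrap a generateText call: one with the media parts, one with a String prompt. Matching only the unsupported-feature case is deliberate, so that network or authentication failures are not hidden behind a silent downgrade.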