Multimodal Tools
Multimodal tools download text files, images, videos, and audio and analyze each with the corresponding model. Inputs are referenced by S3, HTTP, or HTTPS URLs.
🧭 Tool List
- analyze_text_file: Download and extract text, then analyze per question
- analyze_image: Download images and interpret them with a vision-language model
- analyze_video: Download videos and analyze them with a video understanding model
- analyze_audio: Download audio and analyze it with an audio understanding model
🧰 Example Use Cases
- Summarize documents stored in buckets
- Explain screenshots, product photos, or chart images
- Understand video content, such as extracting key frame information, human actions, or scene descriptions
- Analyze audio content, such as transcription, speaker identification, or content summarization
- Produce per-file or per-image/video/audio answers aligned with the input order
🧾 Parameters & Behavior
analyze_text_file
- file_url_list: List of URLs (s3://bucket/key, /bucket/key, http(s)://).
- query: User question/analysis goal.
- Downloads each file, extracts text, and returns an array of analyses in input order.
analyze_image
- image_urls_list: List of URLs (s3://bucket/key, /bucket/key, http(s)://).
- query: User focus/question.
- Downloads each image, runs VLM analysis, and returns an array matching input order.
analyze_video
- video_url: Video URL (s3://bucket/key, /bucket/key, http(s)://).
- query: User focus/question.
- Downloads the video, runs video understanding model analysis, and returns the result.
analyze_audio
- audio_url: Audio URL (s3://bucket/key, /bucket/key, http(s)://).
- query: User focus/question.
- Downloads the audio, runs audio understanding model analysis, and returns the result.
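As a rough illustration of the parameters above, the arguments for each tool might be shaped as follows. The parameter names come from this section; the bucket names, URLs, questions, and the pair_with_inputs helper are hypothetical, and how the tools are actually invoked depends on the host framework, which is not shown here.

```python
# Hypothetical argument payloads. Parameter names follow the section above;
# every URL and question is made up for illustration.
text_args = {
    "file_url_list": ["s3://docs/report.pdf", "https://example.com/notes.txt"],
    "query": "Summarize the main findings.",
}
image_args = {
    "image_urls_list": ["/assets/chart.png"],
    "query": "Describe the trend shown in the chart.",
}
video_args = {"video_url": "s3://media/demo.mp4", "query": "What actions occur?"}
audio_args = {"audio_url": "https://example.com/call.wav", "query": "Transcribe the audio."}


def pair_with_inputs(urls, results):
    """Pair each input URL with its result, relying on the documented input order.

    Hypothetical helper: it assumes the tool returned one result per URL.
    """
    if len(urls) != len(results):
        raise ValueError(f"expected {len(urls)} results, got {len(results)}")
    return list(zip(urls, results))
```

Because analyze_text_file and analyze_image return arrays in input order, a helper like this keeps each answer attached to the URL it describes; the length check surfaces a mismatch instead of silently truncating.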
⚙️ Prerequisites
- Configure storage access (e.g., MinIO/S3) and the data processing service so that files can be fetched.
- Provide an LLM for analyze_text_file, a VLM for analyze_image, and a video understanding model for analyze_video and analyze_audio (must support audio/video input, e.g., Qwen3-Omni series).
🛠️ How to Use
- Prepare accessible URLs for files, images, videos, or audio; confirm permissions.
- Call the corresponding tool with the URL(s) and question. analyze_text_file and analyze_image accept a list of URLs in one call; analyze_video and analyze_audio take a single URL per call.
- Verify results before using them in follow-up steps.
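The first step, preparing accessible URLs, can be partly checked before any tool is called. The sketch below only validates that a string matches one of the three URL forms listed in this document (s3://bucket/key, /bucket/key, http(s)://); it does not confirm permissions or reachability. The function name classify_url and its return labels are assumptions made for this example.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Return 's3', 'path', or 'http' for the URL forms listed in this document.

    Hypothetical helper: a syntactic check only, with no network access.
    """
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        # s3://bucket/key: netloc holds the bucket, path holds the key
        if parsed.netloc and parsed.path.strip("/"):
            return "s3"
    elif parsed.scheme in ("http", "https"):
        if parsed.netloc:
            return "http"
    elif parsed.scheme == "" and url.startswith("/"):
        # /bucket/key: require at least a bucket segment and a key segment
        parts = [p for p in url.split("/") if p]
        if len(parts) >= 2:
            return "path"
    raise ValueError(f"unsupported or malformed URL: {url!r}")
```

Rejecting malformed URLs early gives a clearer error than a failed download inside the tool.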
💡 Best Practices
- For large files, preprocess or chunk them to reduce timeouts.
- For multiple images, be explicit about the focus (e.g., “focus on chart trends”) to improve answers.
- If results are empty or errors occur, verify URL accessibility and model readiness.
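For the chunking advice above, one simple approach is fixed-size character windows with a small overlap, so that a sentence cut at a boundary still appears whole in one chunk. This is a minimal sketch, not the tools' own preprocessing: chunk_text, max_chars, and overlap are hypothetical names and defaults, and each chunk would still need to be stored at an accessible URL before it can be passed to analyze_text_file.

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, each sharing `overlap`
    characters with the previous window. Hypothetical helper for illustration."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks = []
    step = max_chars - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Small example with tiny sizes so the overlap is visible
print(chunk_text("abcdefghij", max_chars=4, overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Character counts are only a proxy for model limits; a token-based splitter would be more precise if the model's tokenizer is available.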
