Multimodal Tools

Multimodal tools use a language model to analyze text files and a vision-language model to analyze images. Inputs are passed as URLs, which can be S3, HTTP, or HTTPS.

🧭 Tool List

analyze_text_file: Downloads each file, extracts its text, and analyzes it against the question
analyze_image: Downloads each image and interprets it with a vision-language model (VLM)

🧰 Example Use Cases

Summarize documents stored in buckets
Explain screenshots, product photos, or chart images
Produce per-file or per-image answers aligned with the input order

🧾 Parameters & Behavior

analyze_text_file

file_url_list: List of URLs (s3://bucket/key, /bucket/key, http(s)://).
query: User question/analysis goal.
Downloads each file, extracts text, and returns an array of analyses in input order.

analyze_image

image_urls_list: List of URLs (s3://bucket/key, /bucket/key, http(s)://).
query: User focus/question.
Downloads each image, runs VLM analysis, and returns an array matching input order.
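
Both tools accept the same three URL forms. The sketch below shows one way to check a URL list on the client side before calling a tool. The helper name parse_resource_url and the returned dictionary layout are illustrative assumptions, not part of the tools, and reading /bucket/key as a bucket followed by an object key is inferred from the notation above.

```python
def parse_resource_url(url: str) -> dict:
    """Classify a URL into one of the accepted forms (hypothetical helper)."""
    # http(s):// URLs are passed through unchanged.
    if url.startswith(("http://", "https://")):
        return {"kind": "http", "url": url}

    # s3://bucket/key and the /bucket/key shorthand both name a bucket and a key.
    if url.startswith("s3://"):
        path = url[len("s3://"):]
    elif url.startswith("/"):
        path = url.lstrip("/")
    else:
        raise ValueError(f"Unsupported URL form: {url!r}")

    # Everything before the first slash is the bucket; the rest is the key.
    bucket, _, key = path.partition("/")
    if not bucket or not key:
        raise ValueError(f"Expected a bucket and a key in {url!r}")
    return {"kind": "s3", "bucket": bucket, "key": key}
```

Rejecting malformed entries before the call makes it easier to tell a bad URL apart from a download or model failure.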

⚙️ Prerequisites

Configure storage access (e.g., MinIO/S3) and the data processing service so that files can be fetched.
Provide an LLM for analyze_text_file and a VLM for analyze_image.

🛠️ How to Use

Prepare accessible URLs and confirm permissions.
Call the corresponding tool with the URL list and question; multiple resources are supported at once.
Use results in the same order as inputs for display or follow-up steps.
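
The steps above can be sketched as follows. How a tool is actually invoked depends on your agent or client setup, so call_tool here is a hypothetical callable that takes a tool name and an arguments dictionary and returns the list of results. Only the parameter names (file_url_list, image_urls_list, query) come from this page.

```python
from typing import Any, Callable, Dict, List, Tuple

# Parameter that carries the URL list for each tool, as documented above.
URL_PARAM = {
    "analyze_text_file": "file_url_list",
    "analyze_image": "image_urls_list",
}


def analyze_and_pair(
    call_tool: Callable[[str, Dict[str, Any]], List[Any]],
    tool_name: str,
    urls: List[str],
    query: str,
) -> List[Tuple[str, Any]]:
    """Call a tool once for all URLs and pair each result with its input URL."""
    arguments = {URL_PARAM[tool_name]: urls, "query": query}
    results = call_tool(tool_name, arguments)

    # Results are returned in input order, so the lengths should match.
    if len(results) != len(urls):
        raise ValueError(
            f"Expected {len(urls)} results from {tool_name}, got {len(results)}"
        )
    return list(zip(urls, results))


def stub_call_tool(tool_name: str, arguments: Dict[str, Any]) -> List[str]:
    """Stand-in for a real tool call, used only to make this sketch runnable."""
    urls = arguments[URL_PARAM[tool_name]]
    return [f"{tool_name} answer for {url}" for url in urls]


if __name__ == "__main__":
    pairs = analyze_and_pair(
        stub_call_tool,
        "analyze_image",
        ["s3://assets/chart-a.png", "s3://assets/chart-b.png"],
        "Focus on chart trends",
    )
    for url, answer in pairs:
        print(url, "->", answer)
```

Pairing by position keeps each answer attached to its source URL when results are displayed or passed to a follow-up step.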

💡 Best Practices

For large files, preprocess or chunk them to reduce timeouts.
For multiple images, be explicit about the focus (e.g., “focus on chart trends”) to improve answers.
If results are empty or errors occur, verify that each URL is accessible and that the required model (LLM or VLM) is configured and available.
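
As one possible way to chunk a large text file before analysis, the sketch below splits text into fixed-size pieces with a small overlap so that sentences cut at a boundary still appear whole in one chunk. The function name and the default sizes are placeholders rather than limits documented for these tools. Each chunk would then need to be stored as its own file and passed as a separate URL.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most max_chars, overlapping by overlap chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")

    chunks: List[str] = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks
```

Character counts are only a rough proxy for model input size. If the configured model reports token limits, chunking by tokens would be a closer fit.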