PaliGemma
Google's open-source multimodal model, combining a SigLIP vision encoder with a Gemma LLM. Strong at document understanding, OCR, image captioning, and visual QA. Available via Hugging Face.
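As a rough illustration of the Hugging Face route, here is a minimal sketch of captioning an image with PaliGemma through the `transformers` library. The checkpoint name `google/paligemma-3b-mix-224`, the helper name `caption_image`, and the `caption en` task prompt are assumptions chosen for the example, not something this glossary specifies. Downloading the weights may also require accepting the model license on Hugging Face first.

```python
# Assumed checkpoint; other PaliGemma checkpoints exist on the Hugging Face Hub.
MODEL_ID = "google/paligemma-3b-mix-224"


def caption_image(image_path: str, prompt: str = "caption en",
                  max_new_tokens: int = 50) -> str:
    """Hypothetical helper: run one PaliGemma prompt against one image."""
    # Heavy imports are kept inside the function so that importing this
    # module does not require torch or transformers to be installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    # The generated sequence starts with the prompt tokens; remember how many
    # there are so they can be sliced off before decoding.
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    return processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` would download the checkpoint on first use. Swapping the prompt for a task string such as `ocr` is one way to try the other capabilities listed above, assuming the chosen checkpoint was trained on that task.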
Qwen-VL
Qwen visual language model series from Alibaba. As of 2026, the frontier open-source multimodal model is Qwen3-VL-235B-A22B-Instruct, which rivals Gemini 2.5 Pro and GPT-5 on visual reasoning. Strong at multilingual visual understanding, document parsing, and chart QA.