Vision-Language Model (VLM)

Vision-Language Models (VLMs) are models that combine image and text understanding. They support tasks such as image captioning, visual question answering (VQA), document understanding, and text-guided image generation.

Families of VLMs:
(1) **Contrastive models**: CLIP (OpenAI, 2021), ALIGN, SigLIP. They embed text and images in a shared space, which enables zero-shot classification, image search, and grounding.
(2) **Generative VLMs with an LLM backbone**: GPT-4V/4o, Claude 3 and later, Gemini, LLaVA, Qwen-VL, Pixtral, InternVL. A pre-trained LLM is paired with a vision encoder, and the two are fine-tuned jointly.
(3) **Text-guided image generation**: DALL-E 3, Midjourney, Stable Diffusion, FLUX, Imagen. These are text-to-image models, mostly diffusion-based.
(4) **Text generation from images**: BLIP, BLIP-2, InstructBLIP, OFA.
(5) **Video VLMs**: Gemini 1.5 (long-duration video), Video-LLaMA, VideoChat.
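To make the shared embedding space of the contrastive models in item (1) more concrete, here is a minimal sketch of zero-shot classification by cosine similarity. The embeddings are made-up 3-dimensional toy vectors, not the output of any real encoder, and the function names and the temperature value are assumptions chosen for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Pick the text label whose embedding is closest to the image embedding.

    label_embs maps a text prompt to its (hypothetical) text embedding.
    Returns the best label and a softmax distribution over all labels.
    """
    labels = list(label_embs)
    logits = [cosine_similarity(image_emb, label_embs[l]) / temperature
              for l in labels]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Toy embeddings, invented for this example; real encoders emit
# vectors with hundreds of dimensions.
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

best, probs = zero_shot_classify(image_emb, label_embs)
print(best)
```

No classifier is trained here: changing the set of classes only requires embedding a different list of text prompts, which is why this setup is called zero-shot.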

Typical capabilities: (1) describing an image (captioning); (2) answering questions about an image ("how many people are in this photo?"); (3) reading text from an image (OCR plus reasoning); (4) understanding charts and diagrams; (5) comparing multiple images; (6) UI and screenshot understanding (for example Anthropic's computer use capability, in which Claude operates a desktop via screenshots); (7) document analysis (PDFs with tables and forms); (8) anomaly detection in images; (9) accessibility tooling.

Evaluation benchmarks: MMMU (Massive Multi-discipline Multimodal Understanding), MMBench, MM-Vet, ScienceQA, ChartQA, DocVQA, MathVista, VQAv2.
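As an illustration of how such benchmarks score answers, here is a minimal sketch of the commonly cited simplified form of the VQA accuracy metric used with VQAv2: a prediction counts as fully correct when at least three human annotators gave the same answer. The function name and sample answers are hypothetical, and the official evaluation script applies more answer normalization and averages over annotator subsets, which this sketch omits.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Ten invented annotator answers to "how many people are in this photo?"
human_answers = ["3"] * 7 + ["2", "2", "two"]

score_majority = vqa_accuracy("3", human_answers)  # 7 matches, capped at 1.0
score_minority = vqa_accuracy("2", human_answers)  # 2 matches
score_wrong = vqa_accuracy("4", human_answers)     # 0 matches
print(score_majority, score_minority, score_wrong)
```

The partial credit for minority answers reflects that human annotators themselves disagree on some questions.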

Enterprise use cases: (1) document AI (Klippa, Mindee, Rossum); (2) e-commerce visual search; (3) medical imaging assistance; (4) industrial inspection; (5) content moderation; (6) UI testing automation; (7) accessibility apps. Relevant certifications: AI-102, AIF-C01.
