BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language model that covers both understanding and generation, so it can produce text conditioned on an image. BLIP can be used for:
- Image captioning
- Open-ended visual question answering
- Multimodal / unimodal feature extraction
- Image-text matching
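
As an illustration of the first use case, image captioning, here is a minimal sketch. It assumes the Hugging Face `transformers` port of BLIP (with `torch` and `Pillow` installed) and the `Salesforce/blip-image-captioning-base` checkpoint. Neither is named in the text above, so treat both as assumptions. The helper name `caption_image` is hypothetical.

```python
def caption_image(image_path,
                  model_name="Salesforce/blip-image-captioning-base",
                  max_new_tokens=30):
    """Return a generated caption for the image stored at image_path.

    Assumption: uses the Hugging Face transformers port of BLIP; the
    checkpoint name is one publicly available captioning model, not
    necessarily the one this document has in mind.
    """
    # Heavy dependencies are imported lazily so that merely defining the
    # function does not require them to be installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # The processor handles image preprocessing and token decoding;
    # the model generates caption token ids from the image features.
    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    # BLIP expects a 3-channel RGB image.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Generate caption token ids and decode them into a plain string.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` (with `photo.jpg` standing in for any local image file) should return a short caption string. The first call downloads the checkpoint, and the function reloads the model on every call, so for repeated use it would be better to load the processor and model once and reuse them.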