Humans understand the world by perceiving and fusing information from multiple channels, such as images seen by the eyes, voices heard by the ears, and other forms of sensory input. One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to learn effectively from multimodal data, such as combined vision and language, to make sense of the world around us. For example, vision-language (VL) systems allow searching for relevant images given a text query, or vice versa, and...

Read the full article at Microsoft Press