
PaLM-E transfers knowledge from visual-language domains into embodied reasoning – from robot planning in environments with complex dynamics and physical constraints, to answering questions about the observable world. PaLM-E operates on multimodal sentences, i.e. sequences of tokens in which inputs from arbitrary modalities (e.g. images, neural 3D representations, or states) are inserted alongside text tokens as input to an LLM, trained end-to-end.
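The core idea is that continuous sensor features are projected into the same embedding space as the LLM's word embeddings and interleaved with them. The sketch below illustrates this interleaving; it is not the official PaLM-E code, and names such as `MultimodalPrefix` and `proj` are illustrative assumptions.

```python
# Minimal sketch (assumed structure, not the PaLM-E implementation):
# project image features into the LLM embedding space and interleave
# them with text token embeddings to form a "multimodal sentence".
import torch
import torch.nn as nn

class MultimodalPrefix(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, vocab_size: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, llm_dim)  # LLM word embeddings
        self.proj = nn.Linear(vision_dim, llm_dim)             # maps vision features into LLM space

    def forward(self, segments):
        # `segments` is an ordered list of ("text", LongTensor[n_tokens]) or
        # ("image", FloatTensor[n_patches, vision_dim]) pieces.
        embedded = []
        for kind, value in segments:
            if kind == "text":
                embedded.append(self.token_embed(value))       # [n_tokens, llm_dim]
            else:
                embedded.append(self.proj(value))              # [n_patches, llm_dim]
        # Concatenate in order; the LLM consumes this sequence end-to-end.
        return torch.cat(embedded, dim=0)

# Usage: a "Q: What is on the table? <img> A:" style multimodal sentence.
mm = MultimodalPrefix(vision_dim=768, llm_dim=1024, vocab_size=32000)
text_ids = torch.randint(0, 32000, (6,))     # placeholder token ids
image_feats = torch.randn(16, 768)           # placeholder ViT patch features
seq = mm([("text", text_ids), ("image", image_feats), ("text", text_ids)])
print(seq.shape)  # torch.Size([28, 1024])
```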
Zero-Shot Learning: the goal is for a model to learn to classify categories it has never seen during training.
Zero-shot learning example: Contrastive Language-Image Pretraining (CLIP), which classifies an image by comparing its embedding against text embeddings of candidate labels, with no task-specific fine-tuning (see the sketch below).
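A minimal sketch of CLIP zero-shot classification using the Hugging Face `transformers` API; the image path and candidate labels are placeholders.

```python
# Zero-shot image classification with CLIP: no training on the target
# classes, only a comparison between image and text embeddings.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a robot"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the image embedding and each label's text embedding;
# softmax turns the similarities into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the "classifier" is just a set of natural-language label prompts, new categories can be added at inference time by editing the `labels` list.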