*Equal Contributors
Current multimodal and multitask foundation models, such as 4M or UnifiedIO, show promising results, but in practice their out-of-the-box ability to accept diverse inputs and perform diverse tasks is limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we significantly expand upon the capabilities of 4M by training it on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels from specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes.
A crucial step in this process is performing discrete tokenization on diverse modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text.
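The idea of modality-specific tokenization can be illustrated with a minimal sketch. The dispatch below is hypothetical (the function names, modality labels, and toy tokenizers are assumptions, not the paper's implementation); it only shows the pattern of routing each modality type to a tokenizer that emits a discrete token sequence:

```python
# Minimal illustrative sketch (hypothetical API, not the paper's code):
# every modality must be mapped to a sequence of discrete tokens.

def vq_tokenize(values):
    """Toy stand-in for a VQ-VAE: quantize scalars into a 4-entry codebook."""
    return [min(3, int(v * 4)) for v in values]

def text_tokenize(text):
    """Toy stand-in for a text tokenizer: lowercase whitespace split."""
    return text.lower().split()

def structured_tokenize(record):
    """Toy stand-in for structured data: flatten key/value pairs into tokens."""
    return [f"{k}={v}" for k, v in sorted(record.items())]

def tokenize(modality, data):
    """Route a modality to a suitable tokenizer (illustrative routing only)."""
    if modality in {"rgb", "depth", "feature_map"}:
        return vq_tokenize(data)            # image-like / feature maps
    if modality in {"caption", "metadata", "color_palette"}:
        return text_tokenize(data)          # representable as text
    if modality in {"instance_seg", "human_pose"}:
        return structured_tokenize(data)    # structured data
    raise ValueError(f"no tokenizer registered for {modality!r}")
```

Once every modality is reduced to discrete tokens in this way, a single sequence model can consume and generate any combination of them.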
Through this, we are able to expand the out-of-the-box capabilities of multimodal models. This enables more fine-grained and controllable multimodal generation and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets, observing promising scaling trends.