*=Equal Contributors
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
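As a concrete illustration of the kind of filtering baseline a participant might submit, the minimal sketch below scores candidate image-text pairs with a pretrained CLIP model and keeps only pairs whose image and caption embeddings are sufficiently similar. This is not the official DataComp pipeline: the use of the open_clip package, the ViT-B-32 backbone, and the threshold value are illustrative assumptions.

```python
# Minimal sketch of a CLIP-score filtering baseline (illustrative, not
# the official DataComp code). Requires the open_clip package:
#   pip install open_clip_torch
import torch
import open_clip
from PIL import Image

# Backbone choice is an assumption; any CLIP checkpoint would do here.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image_feat = model.encode_image(preprocess(image).unsqueeze(0))
    text_feat = model.encode_text(tokenizer([caption]))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    # Pairs scoring below the (hypothetical, untuned) threshold are
    # dropped from the candidate pool before training.
    return clip_score(image, caption) >= threshold
```

Under the DataComp workflow, the filtered pool produced by a rule like `keep_pair` would then be fed to the standardized CLIP training code and judged by the resulting model's performance on the 38 downstream test sets.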