App developers promote their apps by creating product pages with app images and bidding on search terms. It is therefore essential that the app images be highly relevant to the search terms. Solutions to this problem require an image-text matching model that predicts the quality of the match between a chosen image and the search terms. In this work, we present a novel approach to matching an app image to search terms based on fine-tuning a pre-trained LXMERT model. We show that, compared to the CLIP model and a baseline using a Transformer model for search terms and a ResNet model for images, we significantly improve matching accuracy. We evaluate our approach using two sets of labels: advertiser-associated (image, search term) pairs for a given application, and human ratings of the relevance of (image, search term) pairs. Our approach achieves a 0.96 AUC score on the advertiser-associated ground truth, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. On the human-labeled ground truth, our approach achieves a 0.95 AUC score, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.
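As a concrete illustration of the AUC evaluation described above, the following minimal sketch computes an AUC score with scikit-learn. The scores and labels are hypothetical placeholders, not the paper's data: labels mark whether a (image, search term) pair is relevant, and scores stand in for a matching model's predicted match quality.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical binary relevance labels for six (image, search term) pairs
# (1 = relevant, e.g. advertiser-associated; 0 = not relevant).
labels = [1, 1, 1, 0, 0, 0]

# Hypothetical match scores from an image-text matching model.
scores = [0.92, 0.85, 0.40, 0.55, 0.20, 0.10]

# AUC measures how well the scores rank relevant pairs above
# irrelevant ones, independent of any decision threshold.
auc = roc_auc_score(labels, scores)
print(round(auc, 4))  # here 8 of the 9 (positive, negative) pairs are ranked correctly
```

AUC is a natural choice for this task because it evaluates the ranking induced by the match scores rather than a single hard threshold.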