Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO can approximate a trained reward model, but it is unclear to what extent DPO can generalize to distribution shifts, an issue that may arise due to limited preference data or the changing language of the trained model. We address this question by evaluating the accuracy at distinguishing preferred and rejected answers using both DPO and RLHF rewards. Our findings indicate that DPO's implicit reward performs comparably to RLHF rewards on in-distribution data, but severely under-performs RLHF reward models out of distribution. Across five out-of-domain settings, DPO has a mean drop in accuracy of 3% and a maximum drop of 7%, highlighting the shortcomings of DPO's implicit reward model for preference optimization. These findings indicate that DPO's implicit reward model has limited generalization ability and substantiate the integration of an explicit reward model in iterative DPO approaches.
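For reference, a minimal sketch of the quantities compared above, assuming the standard DPO formulation with policy \pi_\theta, reference model \pi_{\mathrm{ref}}, and preference pairs (y_w, y_l) of preferred and rejected answers:

r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\mathrm{Acc} \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\, r\!\left(x_i, y_w^{(i)}\right) > r\!\left(x_i, y_l^{(i)}\right) \right],

where r is either DPO's implicit reward r_\theta or an explicit RLHF reward model, and Acc is the preference accuracy used to compare the two on in-distribution and out-of-domain data.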