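The main idea behind this method is as follows: to sample from a distribution p(x), we use a second, auxiliary (proposal) distribution q(x) to help draw the samples. The only constraint is that p(x) < Mq(x) for some M > 1. The method is mainly used when the form of p(x) makes it hard to sample from directly, but p(x) can still be evaluated at any point x.
- Sample x from q(x).
- Sample y from U(0, Mq(x)) (a uniform distribution).
- If y < p(x), accept x as a sample from p(x); otherwise return to step 1.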
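The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the target density (Beta(2, 2) on [0, 1]), the uniform proposal, the value M = 2 and all function names are assumptions chosen for the example.

```python
import random

def rejection_sample(p, q_sample, q_pdf, M, rng):
    """Draw one sample from the target density p using the proposal q."""
    while True:
        x = q_sample(rng)                     # step 1: x ~ q(x)
        y = rng.uniform(0.0, M * q_pdf(x))    # step 2: y ~ U(0, M q(x))
        if y < p(x):                          # step 3: accept, else retry
            return x

# Assumed target: Beta(2, 2) density, p(x) = 6 x (1 - x) on [0, 1] (max 1.5).
def p(x):
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

# Assumed proposal: Uniform(0, 1), so q(x) = 1 and M = 2 gives p(x) < M q(x).
def q_sample(rng):
    return rng.random()

def q_pdf(x):
    return 1.0

rng = random.Random(0)
samples = [rejection_sample(p, q_sample, q_pdf, 2.0, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)  # should be close to 0.5 for Beta(2, 2)
```

On average a fraction 1/M of the proposals is accepted, so a tighter M (closer to the true maximum of p(x)/q(x)) wastes fewer draws.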
WebGPT: Browser-assisted question-answering with human feedback
Rejection sampling (best-of-n). We sampled a fixed number of answers (4, 16 or 64) from either the BC model or the RL model (if left unspecified, we used the BC model), and selected the one that was ranked highest by the reward model. We used this as an alternative method of optimizing against the reward model, which requires no additional training, but instead uses more inference-time compute.
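The best-of-n procedure described in this passage can be sketched as follows. The `generate` and `reward` callables are hypothetical stand-ins for the policy model and the reward model; the toy versions below exist only to make the example runnable.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int) -> str:
    """Sample n answers and return the one the reward model ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))

# Toy stand-ins (assumptions): a fixed stream of answers and a length-based reward.
_answers = iter(["short", "a much longer answer", "medium one"])

def toy_generate(prompt: str) -> str:
    return next(_answers)

def toy_reward(prompt: str, answer: str) -> float:
    return float(len(answer))

best = best_of_n("What is rejection sampling?", toy_generate, toy_reward, n=3)
```

No parameters are updated here: all of the extra cost is inference-time compute, which grows linearly with n.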
Even though both rejection sampling and RL optimize against the same reward model, there are several possible reasons why rejection sampling outperforms RL:
- 1. It may help to have many answering attempts, simply to make use of more inference-time compute.
- 2. The environment is unpredictable: with rejection sampling, the model can try visiting many more websites, and then evaluate the information it finds with the benefit of hindsight.
- 3. The reward model was trained primarily on data collected from BC and rejection sampling policies, which may have made it more robust to overoptimization by rejection sampling than by RL.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Rejection Sampling (RS) with a 52B preference model, where samples were generated from a 52B context-distilled LM. In this case the number k of samples was a parameter, but most often we used k = 16.
We also test our online models' performance during training (Figure 15) and compare various levels of rejection sampling.
In Figure 36 we show helpfulness Elo scores for a 52B context distilled model with rejection sampling (utilizing a 52B preference model trained on pure helpfulness) for k = 1, 4, 16, 64, showing that higher values of k clearly perform better. Note that the context distilled model and the preference models discussed here were trained during an earlier stage of our research with different datasets and settings from those discussed elsewhere in the paper, so they are not directly comparable with other Elo results, though very roughly and heuristically, our online models seem to perform about as well or better than k = 64 rejection sampling. Note that k = 64 rejection sampling corresponds to DKL = log(64) ≈ 4.2.
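To summarize: rejection sampling is again applied at inference time, a larger k gives better results, and the online RLHF models seem to perform about as well as or better than k = 64 rejection sampling.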
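As a quick check on the k = 64 figure quoted above, log(k) can be evaluated directly. The sketch below also evaluates log k - (k - 1)/k, a closed form commonly cited for the KL divergence between a best-of-k policy and its base policy. That second expression does not appear in the quoted passage and is included here only as an assumption for comparison; the function names are hypothetical.

```python
import math

def kl_log_k(k: int) -> float:
    """The value used in the quoted passage: D_KL = log(k)."""
    return math.log(k)

def kl_closed_form(k: int) -> float:
    """Assumed closed form for best-of-k sampling: log(k) - (k - 1) / k."""
    return math.log(k) - (k - 1) / k

# Tabulate both quantities for the k values mentioned in the passage.
table = {k: (kl_log_k(k), kl_closed_form(k)) for k in (1, 4, 16, 64)}
for k, (upper, closed) in table.items():
    print(f"k={k:>2}  log(k)={upper:.3f}  log(k)-(k-1)/k={closed:.3f}")
```

Both quantities grow only logarithmically in k, which fits the observation that each fourfold increase in k buys a roughly constant step away from the base policy.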
Aligning Large Language Models through Synthetic Feedback
An important additional component is that we leverage the synthetic RM from the previous stage to ensure the quality of the model-to-model conversations with rejection sampling over the generated outputs (Ouyang et al., 2022). We train LLaMA-7B on the synthetic demonstrations (SFT) and further optimize the model with rewards from the synthetic RM, namely, Reinforcement Learning from Synthetic Feedback (RLSF).
To ensure a more aligned response from the assistant, we suggest including the synthetic RM, trained in the first stage, in the loop, namely Reward-Model-guided Self-Play (RMSP). In this setup, the assistant model, LLaMA-30B-Faithful-3shot, first samples N responses for a given conversational context. Then, the RM scores the N responses, and the best-scored response is chosen as the final response for the simulation, i.e., the RM performs rejection sampling (best-of-N sampling) (Nakano et al., 2021; Ouyang et al., 2022). Other procedures are the same as the Self-Play. Please see Figure 8 for the examples.
Llama 2: Open Foundation and Fine-Tuned Chat Models
This process begins with the pretraining of Llama 2 using publicly available online sources. Following this, we create an initial version of Llama 2-Chat through the application of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.
Rejection Sampling fine-tuning. We sample K outputs from the model and select the best candidate with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward.
The two RL algorithms mainly differ in:
- Breadth — in Rejection Sampling, the model explores K samples for a given prompt, while only one generation is done for PPO.
- Depth — in PPO, during training at step t the sample is a function of the updated model policy from t − 1 after the gradient update of the previous step. In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. However, since we applied iterative model updates, the fundamental differences between the two RL algorithms are less pronounced.
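The data-collection half of rejection sampling fine-tuning, as described in the passages above, can be sketched as below: every output is sampled from one fixed policy, and only the highest-reward output per prompt is kept for the later SFT-style update. The `sample` and `reward` callables and the toy stand-ins are assumptions for illustration, not the Llama 2 implementation.

```python
import itertools
from typing import Callable, List, Tuple

def collect_rs_dataset(prompts: List[str],
                       sample: Callable[[str], str],
                       reward: Callable[[str, str], float],
                       k: int) -> List[Tuple[str, str]]:
    """For each prompt, draw k outputs from the fixed policy and keep the best."""
    dataset = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]   # breadth: k samples
        best = max(candidates, key=lambda out: reward(prompt, out))
        dataset.append((prompt, best))                    # new "gold" target
    return dataset

# Toy stand-ins (assumptions): outputs carry a score digit after the colon.
_counter = itertools.count()

def toy_sample(prompt: str) -> str:
    return f"{prompt}:{next(_counter) % 4}"

def toy_reward(prompt: str, output: str) -> float:
    return float(output.rsplit(":", 1)[-1])

dataset = collect_rs_dataset(["a", "b"], toy_sample, toy_reward, k=4)
```

The resulting (prompt, best output) pairs would then be used for a supervised fine-tuning pass. Because the policy is not updated while the dataset is being collected, this differs from PPO, where each sample comes from the policy after the previous gradient step.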
To summarize: the RLHF algorithms used are PPO and rejection sampling (RS) fine-tuning (similar to best-of-N sampling). PPO is the most popular on-policy RL algorithm (arguably a form of trial-and-error learning). The key passage is "Here, we go one step further, and use the selected outputs for a gradient update": for each prompt, the sample with the highest reward score is treated as the new gold standard, and the model is then fine-tuned on the new set of ranked samples, reinforcing the reward.
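This shows that Llama 2 takes the samples selected by RM-guided rejection sampling and uses them for SFT-style training, that is, for gradient updates to the policy model. In the author's reading, the rejection-sampled outputs are also used as gold data to retrain the RM from the old checkpoint, strengthening the RM's reward signal. The author therefore believes that rejection sampling fine-tuning here iterates on both the SFT model and the RM.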
To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% and outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
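In short, to add more training samples and improve model performance without any human effort, the paper proposes rejection sampling fine-tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as an augmented fine-tuning dataset. Augmented samples that contain more distinct reasoning paths improve the mathematical reasoning of LLMs more, and RFT brings larger gains for weaker LLMs. Combining rejection samples from multiple models pushes LLaMA-7B to 49.3% accuracy, significantly above the 35.9% reached by supervised fine-tuning (SFT). Note the difference from the approaches above, which use an RM to perform rejection sampling and pick the best response: here the answer in the model's response is compared directly with the correct answer, and the reasoning paths that arrive at the correct result are kept.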
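The RFT selection step can be sketched as a filter on final answers rather than a reward-model ranking. The sketch below makes several assumptions that are not taken from the paper: reasoning paths end with a `#### answer` marker, distinct paths are detected by whitespace-normalized text, and `sample_path` is a hypothetical stand-in for the supervised model.

```python
from typing import Callable, List

def final_answer(path: str) -> str:
    """Extract the text after the last '####' marker (assumed answer format)."""
    return path.rsplit("####", 1)[-1].strip()

def collect_rft_paths(question: str,
                      gold: str,
                      sample_path: Callable[[str], str],
                      k: int) -> List[str]:
    """Sample k reasoning paths; keep the correct, distinct ones."""
    kept, seen = [], set()
    for _ in range(k):
        path = sample_path(question)
        key = " ".join(path.split())          # simple text-level deduplication
        if final_answer(path) == gold and key not in seen:
            seen.add(key)
            kept.append(path)
    return kept

# Toy stand-in (assumption): a fixed stream of sampled reasoning paths.
_paths = iter([
    "2+3=5 #### 5",
    "2*3=6 #### 6",     # wrong answer, rejected
    "3+2=5 #### 5",
    "2+3=5 #### 5",     # duplicate of the first path, dropped
])

def toy_sample_path(question: str) -> str:
    return next(_paths)

kept = collect_rft_paths("What is 2 plus 3?", "5", toy_sample_path, k=4)
```

Keeping several distinct correct paths per question, instead of a single best one, reflects the reported finding that more distinct reasoning paths help more.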
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment of generative models, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models more effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently assembles a streaming dataset. This dataset serves as the basis for aligning the generative model and can be employed under both offline and online settings. Notably, the sample generation process within RAFT is gradient-free, rendering it compatible with black-box generators. Through extensive experiments, we demonstrate that our proposed algorithm exhibits strong performance in the context of both large language models and diffusion models.