Training Language Models to Autonomously Enhance Their Capabilities by Implicitly Learning From Preference Data
In a groundbreaking development, a new approach called PIT (Preference-based Iterative Training) has been proposed to enable large language models (LLMs) to learn self-improvement from human preference data rather than relying on prompts. This innovative method could pave the way for LLMs that continuously align better with human values as they learn from experience.
The PIT approach, as detailed in a recent paper, employs curriculum reinforcement learning with two key stages. It first focuses on improving easy-to-improve references, such as human-labeled bad responses, and then switches to improving samples drawn from the LLM itself. The core insight is that the preference data used to train the LLM already provides implicit guidance on what constitutes an improvement in quality.
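As a rough sketch of how such a two-stage curriculum could be wired together, the Python below mirrors the procedure described above. Every name in it (improve, gap_reward, rl_update, the dummy data) is a hypothetical stand-in, not the authors' implementation.

```python
# Minimal sketch of a PIT-style two-stage curriculum (illustrative only).
# All functions are hypothetical stand-ins, not the paper's actual code.

import random

def sample_response(llm, prompt, temperature=0.5):
    """Stand-in for drawing a response y ~ LLM(. | prompt)."""
    return f"{llm}-response-to:{prompt}"

def improve(pit_model, prompt, reference):
    """Stand-in for PIT generating an improved response given a reference."""
    return f"improved({reference})"

def gap_reward(prompt, improved, reference):
    """Stand-in reward: quality(improved | prompt) - quality(reference | prompt)."""
    return random.random()  # a real system would query a learned reward model

def rl_update(pit_model, prompt, improved, reward):
    """Stand-in for a policy-gradient / PPO update step."""
    pass

def train_pit(pit_model, llm, preference_data, epochs_stage1=1, epochs_stage2=1):
    # Stage 1: improve "easy" references, e.g. the human-labeled bad
    # (rejected) responses already present in the preference data.
    for _ in range(epochs_stage1):
        for prompt, bad_response in preference_data:
            improved = improve(pit_model, prompt, bad_response)
            r = gap_reward(prompt, improved, bad_response)
            rl_update(pit_model, prompt, improved, r)

    # Stage 2: switch to improving the LLM's own samples, which are
    # harder to beat than the labeled bad responses.
    for _ in range(epochs_stage2):
        for prompt, _ in preference_data:
            own_sample = sample_response(llm, prompt)
            improved = improve(pit_model, prompt, own_sample)
            r = gap_reward(prompt, improved, own_sample)
            rl_update(pit_model, prompt, improved, r)

# Toy usage with dummy data
train_pit(pit_model="pit", llm="base-llm",
          preference_data=[("Explain RLHF.", "rlhf is robots")])
```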
Unlike prompt optimization methods that iteratively adjust input instructions to improve outputs without changing the model weights, PIT leverages human feedback data (such as preference comparisons) to guide continual alignment and adaptation of the model itself. This allows the model to internalize preferences and improve its responses autonomously, without requiring continual prompt engineering or external rewrites.
Whereas the standard Reinforcement Learning from Human Feedback (RLHF) objective optimizes an LLM policy to maximize the expected quality of generated responses, PIT reformulates that objective around improving a given response (the reformulation is spelled out below). In evaluations, PIT improves response quality by 7-34% over the original LLM samples, as measured by third-party evaluator models, and also significantly outperforms the prompting method Self-Refine in human evaluations.
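For reference, the standard RLHF objective can be written in generic notation, where $\pi_{\phi}$ is the LLM policy, $R$ a reward (quality) model, and $\mathcal{D}$ the prompt distribution; these symbols are ours rather than the paper's:

$$
\max_{\phi}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\phi}(\cdot \mid x)}\big[\, R(y \mid x) \,\big]
$$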
The researchers conducted experiments on real and synthetic datasets to demonstrate PIT's capabilities for self-improvement and advantages over prompting approaches. Lower temperatures around 0.4-0.6 work best for PIT, restricting diversity to focus on improvement. In contrast, prompting methods need higher diversity to avoid just re-generating the original response.
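As an illustration of that decoding difference, the snippet below contrasts the two settings using the Hugging Face transformers generate API. The model name is a placeholder, and the specific temperatures simply reflect the ranges described above rather than a prescription.

```python
# Illustrative decoding settings only; "your-org/your-model" is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")
model = AutoModelForCausalLM.from_pretrained("your-org/your-model")

prompt = "Improve the following response: ..."
inputs = tokenizer(prompt, return_tensors="pt")

# PIT-style improvement: a low temperature (~0.4-0.6) keeps sampling close
# to the reference so the model focuses on refining it.
pit_output = model.generate(**inputs, do_sample=True, temperature=0.5,
                            max_new_tokens=256)

# Prompting-based self-refinement: a higher temperature (1.0 here, chosen
# only for illustration) adds diversity so the model does not simply
# reproduce the original response.
prompt_output = model.generate(**inputs, do_sample=True, temperature=1.0,
                               max_new_tokens=256)

print(tokenizer.decode(pit_output[0], skip_special_tokens=True))
```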
Ablation studies confirm the importance of the full curriculum reinforcement learning procedure. Removing either the first stage of easy examples or the second stage of improving the LLM's own samples substantially degrades performance.
This work enables LLMs to refine themselves without direct human oversight, reducing reliance on human intervention and helping to broaden access to LLMs. Moreover, the implicit information within the preference data can be leveraged directly, instead of manually distilling improvement criteria into prompts.
The PIT objective is to maximize the quality gap between an improved response and the original response, conditioned on having the original as a reference point. This approach provides a promising way to learn nuanced goals like improving helpfulness, harmlessness, and accuracy by tapping into the implicit guidance within the training data.
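Paraphrasing this in the same generic notation as the RLHF objective above (again our symbols, not the paper's exact formulation), PIT conditions generation on a reference response $y_{\mathrm{ref}}$ and maximizes a gap reward:

$$
\max_{\phi}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\phi}(\cdot \mid y_{\mathrm{ref}},\, x)}\big[\, R_{\mathrm{gap}}(y, y_{\mathrm{ref}} \mid x) \,\big],
\qquad
R_{\mathrm{gap}}(y, y_{\mathrm{ref}} \mid x) \;\approx\; R(y \mid x) - R(y_{\mathrm{ref}} \mid x)
$$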
In conclusion, the PIT approach presents a significant step forward in enabling large language models to learn self-improvement from human preference data, offering a path toward more robust and dynamic alignment of LLMs in practical applications.