---
license: apache-2.0
datasets:
- Dahoas/synthetic-instruct-gptj-pairwise
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
library_name: transformers
tags:
- gpt2
- rlhf
- reinforcement-learning
- ppo
- reward-model
- instruction-tuning

model-index:
- name: ppo_aligned_final
  results:
  - task:
      type: text-generation
    dataset:
      type: Dahoas/synthetic-instruct-gptj-pairwise
      name: Dahoas/synthetic-instruct-gptj-pairwise
      split: evaluation
    metrics:
    - type: average_reward
      value: 2.37
      name: Average Reward Score
    - type: rouge
      value: 0.337
      name: ROUGE-1
    - type: rouge
      value: 0.139
      name: ROUGE-2
    - type: rouge
      value: 0.252
      name: ROUGE-L

- name: reward_model_final
  results:
  - task:
      type: text-classification
    dataset:
      type: Dahoas/synthetic-instruct-gptj-pairwise
      name: Dahoas/synthetic-instruct-gptj-pairwise
      split: evaluation
    metrics:
    - type: accuracy
      value: 0.98
      name: Preference Accuracy

- name: sft_full_final
  results:
  - task:
      type: text-generation
    dataset:
      type: Dahoas/synthetic-instruct-gptj-pairwise
      name: Dahoas/synthetic-instruct-gptj-pairwise
      split: evaluation
    metrics:
    - type: rouge
      value: 0.353
      name: ROUGE-1
    - type: rouge
      value: 0.149
      name: ROUGE-2
    - type: rouge
      value: 0.262
      name: ROUGE-L
---

# RLHF-Aligned GPT-2 Pipeline Models

This repository contains the three key models from an end-to-end, from-scratch implementation of the **Reinforcement Learning from Human Feedback (RLHF)** pipeline. The project's goal was to align a base `gpt2` model with human preferences, following the same three-stage process popularized by models like ChatGPT.

The complete training code, notebooks, and in-depth analysis can be found in the primary GitHub repository:
[**nabeelshan78/reinforcement-learning-human-feedback-scratch**](https://github.com/nabeelshan78/reinforcement-learning-human-feedback-scratch)

## 🎯 Models in this Repository

This repository hosts the final checkpoint for each stage of the RLHF pipeline. You can load each model independently by passing the `subfolder` argument to `from_pretrained` (see the loading sketch after the list below).

1.  `sft_full_final` - **Supervised Fine-Tuned (SFT) Model**: The base `gpt2` model after being fine-tuned on an instruction dataset (`Dahoas/synthetic-instruct-gptj-pairwise`) to learn a helpful response style.

2.  `reward_model_final` - **Reward Model (RM)**: A `gpt2`-based model trained to predict human preferences. It takes a prompt and a response and outputs a scalar *reward score*, indicating how "good" the response is. This model acts as an automated human preference judge.

3.  `ppo_aligned_final` - **PPO-Aligned Model**: The final, alignment-tuned model. This is the SFT model further trained using Proximal Policy Optimization (PPO) and the Reward Model to generate responses that maximize the reward score. **This is the main model intended for generation tasks.**
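
For example, the SFT checkpoint can be pulled from its subfolder with the standard `from_pretrained` call. This is a minimal sketch and assumes `sft_full_final` is stored as a full causal-LM checkpoint (as the PPO example below is); if it was saved as a PEFT adapter, load it the way the reward model is loaded further down.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "sft_full_final"  # swap in "ppo_aligned_final" for the PPO-aligned model

# Load the tokenizer and model from the chosen subfolder of this repository
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder=subfolder)
```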

---

## 🚀 How to Use

### 1. Using the Final PPO-Aligned Model (for Text Generation)

This is the recommended model for generating helpful, aligned responses.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Define the model ID and the specific model subfolder
model_id = "nabeelshan/rlhf-gpt2-pipeline"
subfolder = "ppo_aligned_final"

# Load the tokenizer and model from the subfolder
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(model_id, subfolder=subfolder)

# Set up the text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate a response
prompt = "How do I price my artwork?"
output = generator(prompt, max_new_tokens=100, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

print(output[0]['generated_text'])
# Expected Output (example):
# To price your art, start by researching the artist and their portfolio to determine what
# other artists are making... Consider also researching dealerships at the same time... Good luck.
```
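
Unless the checkpoint's generation config enables sampling, generation defaults to greedy decoding. For more varied outputs you can pass standard sampling parameters to the pipeline call; a minimal sketch, reusing the `generator` and `prompt` defined above (the parameter values are illustrative, not tuned for this model):

```python
# Sampling-based generation (parameter values are illustrative, not tuned)
outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,        # sample instead of greedy decoding
    temperature=0.7,       # lower values make output more deterministic
    top_p=0.9,             # nucleus sampling
    num_return_sequences=2,
    pad_token_id=tokenizer.eos_token_id,
)

for i, out in enumerate(outputs, start=1):
    print(f"--- Sample {i} ---\n{out['generated_text']}\n")
```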

### 2. Using the Reward Model (for Scoring Responses)

You can use the reward model to score how much a human might prefer a given response. The reward model is loaded as a PEFT adapter on top of the base `gpt2` sequence-classification model, as the snippet below shows.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
from huggingface_hub import snapshot_download # Import the downloader tool

# --- CONFIGURATION ---
BASE_MODEL_ID = "openai-community/gpt2"
HF_MODEL_ID = "nabeelshan/rlhf-gpt2-pipeline"
SUBFOLDER = "reward_model_final"

print(f"Downloading model files from '{HF_MODEL_ID}'...")
local_model_path = snapshot_download(
    repo_id=HF_MODEL_ID,
    allow_patterns=f"{SUBFOLDER}/*"
)
local_adapter_path = f"{local_model_path}/{SUBFOLDER}"
print(f"   Successfully downloaded to: {local_adapter_path}")


print("Loading model from local path...")
tokenizer = AutoTokenizer.from_pretrained(local_adapter_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL_ID,
    num_labels=1,
    pad_token_id=tokenizer.pad_token_id
)

model = PeftModel.from_pretrained(base_model, local_adapter_path)
model.eval()
print("   Model loaded successfully!")


prompt = "What diet should I follow to lose weight healthily?"
good_response = "A balanced, nutritious plan based on eating whole foods is best. Limit processed and sugary foods."
bad_response = "Just eat less lol."

def get_reward_score(prompt_text: str, response_text: str) -> float:
    """Tokenizes and calculates the reward score for a given prompt and response."""
    inputs = tokenizer(prompt_text, response_text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        result = model(**inputs)
        return result.logits[0].item()

score_good = get_reward_score(prompt, good_response)
score_bad = get_reward_score(prompt, bad_response)

print(f"\nScore for good response: {score_good:.2f}")
print(f"Score for bad response:  {score_bad:.2f}")



# The model should give a higher score to the better response.
# Expected: Score for good response: 2.15
# Expected: Score for bad response: -1.50
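
The same `get_reward_score` helper can be used to rank several candidate responses for a prompt, which mirrors what the PPO stage optimizes against during training. A minimal sketch, reusing the model and helper loaded above (the candidate strings are purely illustrative):

```python
# Rank candidate responses by reward score (candidates are illustrative)
candidates = [
    "Just eat less lol.",
    "Try a juice-only cleanse for a month.",
    "A balanced, nutritious plan based on whole foods is best. Limit processed and sugary foods.",
]

scored = [(get_reward_score(prompt, c), c) for c in candidates]
for score, response in sorted(scored, reverse=True):
    print(f"{score:+.2f}  {response}")
```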