update
evaluation/intro.txt  CHANGED  +19 -1
@@ -16,7 +16,25 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
 |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
 |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
 
-
+We can load the HumanEval dataset and the pass@k metric from the Hugging Face Hub:
+
+```python
+from datasets import load_dataset, load_metric
+human_eval = load_dataset("openai_humaneval")
+code_eval_metric = load_metric("code_eval")
+```
+
+We can easily compute the pass@k for a problem that asks for the implementation of a function that sums two integers:
+
+```python
+test_cases = ["assert add(2,3)==5"]
+candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]  # first completion is wrong, second is correct
+pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
+print(pass_at_k)
+{'pass@1': 0.5, 'pass@2': 1.0}
+```
+
+To better understand how the pass@k metric works, we will illustrate it with some concrete examples. We select two problems from the HumanEval dataset and see how CodeParrot 🦜 (110M) performs and which code completions pass the unit tests of the two problems below:
 #### Problem 1:
 
 ```python
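
As a side note (not part of the diff above): the values returned by `code_eval` follow the unbiased pass@k estimator popularised by the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of sampled completions per problem and c the number that pass the unit tests. A minimal sketch reproducing the toy result above (the helper name `pass_at_k` is ours, purely for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    with n sampled completions per problem and c of them passing the tests."""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example from the diff: n = 2 candidates, c = 1 passes the unit test.
print(pass_at_k(n=2, c=1, k=1))  # 0.5
print(pass_at_k(n=2, c=1, k=2))  # 1.0
```

Averaged over all HumanEval problems, this estimator yields benchmark numbers like those in the table above, which is why papers sample many (e.g. 200) completions per problem. Note also that `code_eval` executes model-generated code, so the metric requires explicitly opting in by setting the environment variable `HF_ALLOW_CODE_EVAL="1"` and is best run in a sandboxed environment.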