update
evaluation/intro.txt  CHANGED  +19 -1
@@ -16,7 +16,25 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
 |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
 |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
 
-
+We can load the HumanEval dataset and the pass@k metric from the Hugging Face Hub:
+
+```python
+from datasets import load_dataset, load_metric
+human_eval = load_dataset("openai_humaneval")
+code_eval_metric = load_metric("code_eval")
+```
+
+We can easily compute the pass@k for a problem that asks for the implementation of a function that sums two integers:
+
+```python
+test_cases = ["assert add(2,3)==5"]
+candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]  # first completion is wrong, second is correct
+pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
+print(pass_at_k)
+{'pass@1': 0.5, 'pass@2': 1.0}
+```
+
+To better understand how the pass@k metric works, we will illustrate it with some concrete examples. We select two problems from the HumanEval dataset and see how CodeParrot 🦜 (110M) performs and which code completions pass the unit tests of the two problems below:
 #### Problem 1:
 
 ```python
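
As a side note (not part of the diff above): the values returned by `code_eval` follow the unbiased pass@k estimator popularised by the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of sampled completions per problem and c the number that pass the unit tests. A minimal sketch reproducing the toy result above (the helper name `pass_at_k` is ours, purely for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    with n sampled completions per problem and c of them passing the tests."""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example from the diff: n = 2 candidates, c = 1 passes the unit test.
print(pass_at_k(n=2, c=1, k=1))  # 0.5
print(pass_at_k(n=2, c=1, k=2))  # 1.0
```

Averaged over all HumanEval problems, this estimator yields benchmark numbers like those in the table above, which is why papers sample many (e.g. 200) completions per problem. Note also that `code_eval` executes model-generated code, so the metric requires explicitly opting in by setting the environment variable `HF_ALLOW_CODE_EVAL="1"` and is best run in a sandboxed environment.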