update architecture
evaluation/intro.txt +8 -1
@@ -67,7 +67,14 @@ def truncate_number(number: float) -> float:
 """
 ````
 
-For each problem, instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use
+For each problem, instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus sampling with top-p where `p=0.95`, `temperature=0.2`, and sample tokens from the model until we encounter a stop sequence indicating the end of a method: `'\nclass'`, `'\ndef'`, `'\n#'`, `'\nif'`, or `'\nprint'`. For more details about decoding strategies for language generation, we recommend this [blog](https://huggingface.co/blog/how-to-generate).
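As an aside, the stop-sequence handling described above can be sketched as a simple post-processing step on the generated text (a hypothetical helper for illustration, not the harness's actual code):

```python
# Stop sequences that mark the end of a generated method body.
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def truncate_at_stop(completion: str) -> str:
    """Cut a generated completion at the earliest stop sequence, if any."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# A sampled completion that runs past the end of the method:
sample = "    return number - int(number)\nprint('leftover')"
print(truncate_at_stop(sample))  # keeps only the method body
```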
+
+**Remark**:
+
+Regarding the temperature parameter, the authors of the [CodeGen](https://github.com/salesforce/CodeGen) paper observed that the best-performing temperature increases as the number of permitted samples k increases. When a model is allowed only a few samples to pass the unit tests, it is beneficial to exploit the learned distribution, via a low temperature, to select candidates that are likely to pass. But when the model is allowed more chances with a high k, a higher sampling temperature tilts the learned distribution toward diverse samples, making it more likely to synthesize a correct program.
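To see why temperature trades off exploitation against exploration, note that sampling divides the logits by the temperature before the softmax: a low temperature sharpens the distribution toward the most likely token, while a high one flattens it (a toy illustration with made-up logit values):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before normalizing:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.2)  # low T: nearly greedy
flat = softmax_with_temperature(logits, 1.5)   # high T: more exploration
print(sharp[0], flat[0])  # top token gets far more mass at low T
```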
+
+
+For our experiment, we compute pass@1, pass@10, and pass@20, each corresponding to the unit-test pass rate when selecting respectively 1, 10, and 20 samples from the candidate solutions.
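pass@k is typically computed with the unbiased estimator from the Codex paper, 1 - C(n-c, k) / C(n, k), where n is the number of generated samples and c the number that pass the unit tests. Assuming that estimator is what is used here, a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n candidates passes, given that
    c of the n candidates pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples per problem, of which 4 pass:
print(pass_at_k(20, 4, 1))   # ≈ 0.2 (= 4/20)
print(pass_at_k(20, 4, 10))
print(pass_at_k(20, 4, 20))  # 1.0, since all samples are selected
```

Note that with only 20 samples, pass@20 is simply whether any sample passed, which is why larger n (such as the 200 used in the Codex paper) gives lower-variance estimates.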
 
 ```
 