update datasets
- datasets/codegen.txt +17 -0
- datasets/polycoder.txt +5 -0
datasets/codegen.txt
ADDED
@@ -0,0 +1,17 @@
CodeGen is a model for conversational program synthesis, where each problem is solved interactively in multiple steps, each consisting of a natural language specification from the user and a synthesized subprogram from the system.

It was sequentially trained on three datasets:
- The Pile
- A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
- 217GB of Python data from GitHub repositories

The second and third datasets used the following preprocessing (a sketch of the filtering step follows the list):
- Exact match deduplication
- Filtering:
- Filtering:
    - Average line length < 100 tokens
    - Maximum line length < 1000 tokens
    - Fewer than 90% of the characters being decimal or hexadecimal digits
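
A minimal sketch of what this filtering step could look like is shown below. The function name `keep_file`, the whitespace tokenization, and the keep/discard direction of each threshold are illustrative assumptions, not the published CodeGen preprocessing code:

```python
# Illustrative filter following the criteria listed above (assumptions noted in the lead-in).
def keep_file(text: str) -> bool:
    lines = text.splitlines()
    if not lines:
        return False

    # Average and maximum line length, measured here in whitespace-separated tokens.
    lengths = [len(line.split()) for line in lines]
    avg_len = sum(lengths) / len(lengths)
    max_len = max(lengths)

    # Fraction of characters that are decimal or hexadecimal digits.
    hex_digits = set("0123456789abcdefABCDEF")
    digit_fraction = sum(c in hex_digits for c in text) / max(len(text), 1)

    return avg_len < 100 and max_len < 1000 and digit_fraction < 0.9
```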

**Remark**: The reported data sizes are after preprocessing.
datasets/polycoder.txt
ADDED
@@ -0,0 +1,5 @@
The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The model was trained on **254GB** of data (after preprocessing), consisting of popular GitHub repositories with at least 50 stars, covering 12 popular programming languages, collected in October 2021. The data used the following preprocessing (a sketch of the deduplication step follows the list):
- Exact match deduplication
- Filtering:
    - Average line length < 100 tokens
    - Maximum line length < 1000 tokens
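
Exact match deduplication is typically implemented by hashing each file's content and keeping one file per hash. The sketch below illustrates the idea; the SHA-256 hashing and the `exact_dedup` helper are assumptions for illustration, not the actual PolyCoder pipeline:

```python
import hashlib

def exact_dedup(files: dict[str, str]) -> dict[str, str]:
    """Keep one file per distinct content hash (exact match deduplication)."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:  # first occurrence of this content wins
            seen.add(digest)
            kept[path] = text
    return kept
```

In practice this step would be combined with line-length filters like the ones listed above before the data is tokenized for training.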