Sonja Topf committed on
Commit
1ce331f
·
1 Parent(s): 2d32753

README, utils to src folder

README.md CHANGED
@@ -15,16 +15,28 @@ This repository hosts a Hugging Face Space that provides an examplary API for su
 
 In this example, we trained a GIN classifier on the Tox21 targets and saved the trained model in the `assets/` folder.
 
-**Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it should take a list of SMILES strings as input and return a nested prediction dictionary as output, with SMILES as keys and dictionaries containing targetname-prediction pairs as values. Therefore, any preprocessing of SMILES strings must be executed on-the-fly during inference.
+**Important:** For leaderboard submission, your Space needs to include training code. The file `train.py` should train the model using the config specified inside the `config/` folder and save the final model parameters into a file inside the `checkpoints/` folder. The model should be trained using the [Tox21_dataset](https://huggingface.co/datasets/tschouis/tox21) provided on Hugging Face. The datasets can be loaded like this:
+```python
+from datasets import load_dataset
+ds = load_dataset("tschouis/tox21", token=token)
+train_df = ds["train"].to_pandas()
+val_df = ds["validation"].to_pandas()
+```
+Additionally, the Space needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it should take a list of SMILES strings as input and return a nested prediction dictionary as output, with SMILES as keys and dictionaries containing targetname-prediction pairs as values. Therefore, any preprocessing of SMILES strings must be executed on-the-fly during inference.
 
 # Repository Structure
 - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
 - `app.py` - FastAPI application wrapper (can be used as-is).
+- `train.py` - trains and saves a model using the config in the `config/` folder.
+- `config/` - the config file used by `train.py`.
+- `logs/` - all the logs of `train.py`, the saved model, and predictions on the validation set.
+- `checkpoints/` - the saved model that is used in `predict.py` is here.
 
 - `src/` - Core model & preprocessing logic:
-  - `preprocess.py` - SMILES preprocessing pipeline
-  - `model.py` - GIN classifier
-  - `seed.py` - used to ensure reproducibility
+  - `preprocess.py` - SMILES preprocessing pipeline and dataset creation
+  - `train_evaluate.py` - train and evaluate model, compute metrics
+  - `seed.py` - set seed for everything
+  - `model.py` - contains the model class
 
 # Quickstart with Spaces
 
@@ -34,10 +46,15 @@ You can easily adapt this project in your own Hugging Face account:
 
 - Click "Duplicate this Space" (top-right corner).
 
+- Create a `.env` according to `.example.env`.
+
 - Modify `src/` for your preprocessing pipeline and model class
 
 - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
 
+- Modify `train.py` according to your model and preprocessing pipeline.
+
+- Modify the file inside `config/` to contain all hyperparameters that are set in `train.py`.
 That’s it, your model will be available as an API endpoint for the Tox21 Leaderboard.
 
 # Installation
@@ -77,6 +94,6 @@ The output will be a nested dictionary in the format:
 
 # Notes
 
-- Only adapting `predict.py` for your model inference is required for leaderboard submission.
+- Adapting `predict.py`, `train.py`, `config/`, and `checkpoints/` is required for leaderboard submission.
 
-- Preprocessing (here inside `src/preprocess.py`) must be applied at inference time, not just predicting.
+- Preprocessing (here inside `src/preprocess.py`) must be done inside `predict.py` not just `train.py`.
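The README text in this diff describes the `predict()` contract only in prose: a list of SMILES strings goes in, and a nested dictionary keyed by SMILES comes out. The following is a minimal sketch of that return shape, not the repository's implementation. It assumes the twelve standard Tox21 assay names as target keys (the diff does not list them), and `predict_stub` and `check_prediction_format` are hypothetical helper names introduced only for illustration.

```python
from typing import Dict, List

# Assumption: the twelve standard Tox21 assay names are used as target keys.
# The README diff does not spell them out, so adjust to the leaderboard's list.
TOX21_TARGETS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def predict_stub(smiles_list: List[str]) -> Dict[str, Dict[str, float]]:
    """Hypothetical stand-in for predict(): same input/output shape, constant scores."""
    return {
        smiles: {target: 0.5 for target in TOX21_TARGETS}
        for smiles in smiles_list
    }


def check_prediction_format(
    smiles_list: List[str], predictions: Dict[str, Dict[str, float]]
) -> bool:
    """Return True if every input SMILES maps to exactly one float per target."""
    if set(predictions) != set(smiles_list):
        return False
    for per_target in predictions.values():
        if set(per_target) != set(TOX21_TARGETS):
            return False
        if not all(isinstance(score, float) for score in per_target.values()):
            return False
    return True


example = ["CCO", "c1ccccc1"]
print(check_prediction_format(example, predict_stub(example)))
```

A check of this kind could be run against a real `predict()` before submitting, since a missing SMILES key or target key is easy to overlook when preprocessing drops invalid molecules.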
checkpoints/best_gin_model.pt DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:ae536ad4a4c51b097a6411bb4ad68ebcc49ab70d570d2f38a8917f42efc768f0
-size 631874
predict.py CHANGED
@@ -2,9 +2,9 @@ from torch_geometric.data import Batch
 from torch_geometric.utils import from_rdmol
 import torch
 
-from utils.model import GIN
-from utils.preprocess import create_clean_mol_objects
-from utils.seed import set_seed
+from src.model import GIN
+from src.preprocess import create_clean_mol_objects
+from src.seed import set_seed
 
 def predict(smiles_list):
     """
{utils → src}/model.py RENAMED
File without changes
{utils → src}/preprocess.py RENAMED
File without changes
{utils → src}/seed.py RENAMED
File without changes
{utils → src}/train_evaluate.py RENAMED
File without changes
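Because this commit moves the helper modules from `utils/` to `src/`, every `from utils.…` import in the Space has to be updated, as the `predict.py` and `train.py` hunks do. The sketch below is one way to check a file's source for leftover imports of the old package. It is not part of the commit, and the helper name `find_stale_imports` is hypothetical.

```python
import ast
from typing import List


def find_stale_imports(source: str, old_package: str = "utils") -> List[str]:
    """Return module names in `source` that still import from `old_package`."""
    prefix = old_package + "."
    stale = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import; relative imports are skipped.
            module = node.module or ""
            if node.level == 0 and (module == old_package or module.startswith(prefix)):
                stale.append(module)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == old_package or alias.name.startswith(prefix):
                    stale.append(alias.name)
    return stale


old_source = "from torch_geometric.utils import from_rdmol\nfrom utils.model import GIN\n"
new_source = "from torch_geometric.utils import from_rdmol\nfrom src.model import GIN\n"
print(find_stale_imports(old_source))
print(find_stale_imports(new_source))
```

Matching on the parsed module name rather than on raw text keeps unrelated packages such as `torch_geometric.utils` from being flagged.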
train.py CHANGED
@@ -6,10 +6,10 @@ import json
 import os
 from dotenv import load_dotenv
 
-from utils.model import GIN
-from utils.preprocess import get_graph_datasets
-from utils.train_evaluate import train_model, evaluate, compute_roc_auc_avg_and_per_class
-from utils.seed import set_seed
+from src.model import GIN
+from src.preprocess import get_graph_datasets
+from src.train_evaluate import train_model, evaluate, compute_roc_auc_avg_and_per_class
+from src.seed import set_seed
 
 
 def train(config):