---
title: Tox21 GIN Classifier
emoji: 🤖
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: cc-by-nc-4.0
short_description: Graph Isomorphism Network Baseline Classifier for Tox21
---

# Tox21 Graph Isomorphism Network (GIN) Classifier

This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/ml-jku/tox21_leaderboard).

Here, a [Graph Isomorphism Network (GIN)](https://arxiv.org/abs/1810.00826) is trained on the Tox21 dataset, and the trained models are provided for
inference. The model input is a SMILES string of a small molecule, and the output is 12 numeric values, one for
each of the toxic effects in the Tox21 dataset.


**Important:** For leaderboard submission, your Space needs to include training code. The file `train.py` should train the model using the config specified inside the `config/` folder and save the final model parameters into a file inside the `checkpoints/` folder. The model should be trained using the [Tox21 dataset](https://huggingface.co/datasets/ml-jku/tox21) provided on Hugging Face. The training and validation splits can be loaded like this:
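```python
from datasets import load_dataset

# Set this to your Hugging Face access token if the dataset requires authentication.
token = None
ds = load_dataset("ml-jku/tox21", token=token)
train_df = ds["train"].to_pandas()       # training split as a pandas DataFrame
val_df = ds["validation"].to_pandas()    # validation split as a pandas DataFrame
```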
Additionally, the Space needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a nested prediction dictionary, with SMILES strings as keys and dictionaries of target-name/prediction pairs as values. Because the function receives raw SMILES strings, any preprocessing must be executed on the fly during inference.
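
As a rough illustration of that contract, the sketch below shows one possible shape for `predict()`. The helpers `featurize` and `dummy_model` and the generic target names are placeholders invented for this example; they are not the code in this repository, where preprocessing and the GIN model live in `src/`.

```python
from typing import Dict, List

# Placeholder target names, mirroring the generic names used in this README.
TARGET_NAMES = [f"target{i}" for i in range(1, 13)]


def featurize(smiles: str) -> List[float]:
    """Hypothetical preprocessing: a stand-in for a real SMILES-to-graph pipeline."""
    return [float(len(smiles)), float(smiles.count("C"))]


def dummy_model(features: List[float]) -> List[float]:
    """Hypothetical stand-in for a trained model: one score per target."""
    return [0.0 for _ in TARGET_NAMES]


def predict(smiles_list: List[str]) -> Dict[str, Dict[str, float]]:
    """Return {smiles: {target_name: prediction}} for every input SMILES."""
    predictions = {}
    for smiles in smiles_list:
        features = featurize(smiles)   # preprocessing happens at inference time
        scores = dummy_model(features)
        predictions[smiles] = dict(zip(TARGET_NAMES, scores))
    return predictions
```

The point of the sketch is only the input and output types: a list of strings goes in, and a dictionary keyed by SMILES with 12 target entries per molecule comes out.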

# Repository Structure
- `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
- `app.py` - FastAPI application wrapper (can be used as-is).
- `train.py` - Trains and saves a model using the config in the `config/` folder.
- `config/` - Contains the config file used by `train.py`.
- `checkpoints/` - Contains the saved model that is used in `predict.py`.

- `src/` - Core model & preprocessing logic:
    - `preprocess.py` - SMILES preprocessing pipeline and dataset creation
    - `train_evaluate.py` - train and evaluate model, compute metrics
    - `seed.py` - set seed for everything
    - `model.py` - contains the model class
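
For orientation, a seeding helper of the kind `seed.py` describes might look like the sketch below. This is an assumption about its general shape, not the actual contents of `src/seed.py`; the function name `set_seed` is hypothetical, and NumPy and PyTorch are only seeded if they happen to be installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```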

# Quickstart with Spaces

You can easily adapt this project in your own Hugging Face account:

- Open this Space on Hugging Face.

- Click "Duplicate this Space" (top-right corner).

- Create a `.env` according to `.example.env`.

- Modify `src/` to implement your own preprocessing pipeline and model class.

- Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.

- Modify `train.py` according to your model and preprocessing pipeline.

- Modify the file inside `config/` to contain all hyperparameters that are set in `train.py`.

That’s it, your model will be available as an API endpoint for the Tox21 Leaderboard.

# Installation
To run the GIN classifier, clone the repository and install dependencies:

```bash
git clone https://huggingface.co/spaces/ml-jku/tox21_gin_classifier
cd tox21_gin_classifier
pip install -r requirements.txt
```

# Training

To train the GIN model from scratch, run:

```bash
python train.py
```

This command will:
1. Load and preprocess the Tox21 training dataset
2. Train a GIN classifier
3. Store the resulting model in the `checkpoints/` directory.
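
The config-to-checkpoint flow can be sketched roughly as follows. This is a minimal illustration with a placeholder in place of the real training loop; the JSON config format, the keys `seed` and `hidden_dim`, and the file name `model.json` are assumptions made for the example and may not match this repository.

```python
import json
import random
import tempfile
from pathlib import Path


def train(config_path: str, checkpoint_dir: str) -> Path:
    """Read hyperparameters from a config file, 'train', and save the parameters."""
    config = json.loads(Path(config_path).read_text())
    random.seed(config["seed"])  # make the run reproducible

    # Placeholder for the real training loop: random numbers instead of learned weights.
    params = {"weights": [random.random() for _ in range(config["hidden_dim"])]}

    out_dir = Path(checkpoint_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "model.json"
    out_path.write_text(json.dumps(params))
    return out_path


# Demo in a temporary directory, so nothing is written into the repository.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"seed": 0, "hidden_dim": 4}))
    ckpt = train(str(cfg), str(Path(tmp) / "checkpoints"))
    saved = json.loads(ckpt.read_text())
```

Keeping every hyperparameter in the config file rather than hard-coding it in `train.py` means the saved checkpoint can be reproduced from the config alone.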

# Inference

For inference, call `predict()` from `predict.py`, which uses the saved model in `checkpoints/`.

Example usage inside Python:

```python
from predict import predict

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
results = predict(smiles_list)

print(results)
```

The output will be a nested dictionary in the format:

```python
{
    "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
    "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
    "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
}
```
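
Before submitting, it can help to check that the returned dictionary really has this shape. The helper below is a hypothetical sanity check written for this README, not part of the repository or the leaderboard code; it only assumes the nested format shown above with 12 targets per molecule.

```python
from typing import Dict, List


def check_predictions(
    results: Dict[str, Dict[str, float]],
    smiles_list: List[str],
    n_targets: int = 12,
) -> None:
    """Raise ValueError if the results do not match the nested format."""
    missing = [s for s in smiles_list if s not in results]
    if missing:
        raise ValueError(f"No predictions for: {missing}")
    for smiles, targets in results.items():
        if len(targets) != n_targets:
            raise ValueError(f"{smiles}: expected {n_targets} targets, got {len(targets)}")
        for name, value in targets.items():
            if not isinstance(value, (int, float)):
                raise ValueError(f"{smiles}/{name}: prediction is not numeric")


# Example with generic placeholder target names.
example = {"CCO": {f"target{i}": 0 for i in range(1, 13)}}
check_predictions(example, ["CCO"])  # passes silently
```

Note that duplicate SMILES strings in the input collapse into a single key, since the output is a dictionary.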

# Notes

- Adapting `predict.py`, `train.py`, `config/`, and `checkpoints/` is required for leaderboard submission.

- Preprocessing (here inside `src/preprocess.py`) must be performed inside `predict.py`, not only in `train.py`.