Sonja Topf committed on
Commit
f484830
1 Parent(s): 101ce13

initial commit

Files changed (9)
  1. Dockerfile +16 -0
  2. README.md +76 -6
  3. app.py +78 -0
  4. assets/best_gin_model.pt +3 -0
  5. predict.py +68 -0
  6. requirements.txt +9 -0
  7. src/model.py +53 -0
  8. src/preprocess.py +101 -0
  9. src/seed.py +19 -0
Dockerfile ADDED
@@ -0,0 +1,16 @@
+ # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
+ # you will also find guides on how best to write your Dockerfile
+
+ FROM python:3.11.4
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV PATH="/home/user/.local/bin:$PATH"
+
+ WORKDIR /app
+
+ COPY --chown=user ./requirements.txt requirements.txt
+ RUN pip install --no-cache-dir --upgrade -r requirements.txt
+
+ COPY --chown=user . /app
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,12 +1,82 @@
---
- title: Tox21 Gin Classifier
- emoji: 💻
- colorFrom: pink
- colorTo: pink
+ title: Tox21 GIN Classifier
+ emoji: 🤖
+ colorFrom: green
+ colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
- short_description: GIN baseline for Tox21 dataset
+ short_description: Graph Isomorphism Network
---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Tox21 Graph Isomorphism Network Classifier
+
+ This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
+
+ In this example, we trained a GIN classifier on the Tox21 targets and saved the trained model in the `assets/` folder.
+
+ **Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a nested prediction dictionary, with SMILES strings as keys and dictionaries of target-name/prediction pairs as values. Any preprocessing of SMILES strings must therefore be executed on the fly during inference.
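+
+ A minimal sketch of that skeleton (the target names and zero values below are placeholders; the real implementation lives in `predict.py`):
+
+ ```python
+ def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
+     """Required contract: list of SMILES in, nested prediction dict out."""
+     predictions = {}
+     for smiles in smiles_list:
+         # Preprocess the SMILES string and run model inference here;
+         # the zeros are placeholders for per-target predictions.
+         predictions[smiles] = {"NR-AR": 0.0, "SR-p53": 0.0}
+     return predictions
+ ```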
+
+ # Repository Structure
+ - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+ - `app.py` - FastAPI application wrapper (can be used as-is).
+ - `src/` - Core model & preprocessing logic:
+   - `preprocess.py` - SMILES preprocessing pipeline
+   - `model.py` - GIN classifier
+   - `seed.py` - used to ensure reproducibility
+
+ # Quickstart with Spaces
+
+ You can easily adapt this project to your own Hugging Face account:
+
+ - Open this Space on Hugging Face.
+ - Click "Duplicate this Space" (top-right corner).
+ - Modify `src/` for your preprocessing pipeline and model class.
+ - Modify `predict()` inside `predict.py` to perform model inference, keeping the function skeleton unchanged to remain compatible with the leaderboard.
+
+ That’s it: your model will be available as an API endpoint for the Tox21 Leaderboard.
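+
+ Once the Space is running, you can smoke-test the endpoint yourself. A sketch using `requests` (not in `requirements.txt`); the URL is a placeholder for your own Space, and the Authorization header is only needed if you configured an `API_KEY` secret:
+
+ ```python
+ import requests
+
+ resp = requests.post(
+     "https://your-username-your-space.hf.space/predict",  # hypothetical Space URL
+     json={"smiles": ["CCO", "c1ccccc1"]},
+     headers={"Authorization": "Bearer <your-api-key>"},  # if an API key is set
+     timeout=60,
+ )
+ print(resp.json()["predictions"])
+ ```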
+
+ # Installation
+ To run the GIN classifier, clone the repository and install dependencies:
+
+ ```bash
+ git clone https://huggingface.co/spaces/tschouis/tox21_gin_classifier
+ cd tox21_gin_classifier
+ pip install -r requirements.txt
+ ```
+
+ # Inference
+
+ For inference, you only need `predict.py`.
+
+ Example usage inside Python:
+
+ ```python
+ from predict import predict
+
+ smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+ results = predict(smiles_list)
+
+ print(results)
+ ```
67
+
68
+ The output will be a nested dictionary in the format:
69
+
70
+ ```python
71
+ {
72
+ "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
73
+ "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
74
+ "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
75
+ }
76
+ ```
+
+ # Notes
+
+ - For leaderboard submission, you only need to adapt `predict.py` to run your model's inference.
+ - Preprocessing (here inside `src/preprocess.py`) must be applied on the fly at inference time; it cannot be precomputed ahead of the `predict()` call. A short sketch follows below.
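+
+ For reference, the on-the-fly conversion that `predict.py` performs at request time uses exactly these calls:
+
+ ```python
+ from torch_geometric.utils import from_rdmol
+ from src.preprocess import create_clean_mol_objects
+
+ # Standardize the SMILES string and build a PyG graph at request time
+ mols, _ = create_clean_mol_objects(["CCO"])
+ data = from_rdmol(mols[0])
+ ```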
app.py ADDED
@@ -0,0 +1,78 @@
+ """
+ This is the main entry point for the FastAPI application.
+ The app handles requests to predict toxicity for a list of SMILES strings.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies and global variable definition
+ import os
+ from typing import List, Dict, Optional
+ from fastapi import FastAPI, Header, HTTPException
+ from pydantic import BaseModel, Field
+
+ from predict import predict as predict_func
+
+ API_KEY = os.getenv("API_KEY")  # set via Space Secrets
+
+
+ # ---------------------------------------------------------------------------------------
+ class Request(BaseModel):
+     smiles: List[str] = Field(min_length=1, max_length=1000)  # list-length bounds (pydantic v2)
+
+
+ class Response(BaseModel):
+     predictions: dict
+     model_info: Dict[str, str] = {}
+
+
+ app = FastAPI(title="toxicity-api")
+
+
+ @app.get("/")
+ def root():
+     return {
+         "message": "Toxicity Prediction API",
+         "endpoints": {
+             "/metadata": "GET - API metadata and capabilities",
+             "/healthz": "GET - Health check",
+             "/predict": "POST - Predict toxicity for SMILES",
+         },
+         "usage": "Send POST to /predict with {'smiles': ['your_smiles_here']} and Authorization header",
+     }
+
+
+ @app.get("/metadata")
+ def metadata():
+     return {
+         "name": "AwesomeTox",
+         "version": "1.0.0",
+         "max_batch_size": 256,
+         "tox_endpoints": [
+             "NR-AR",
+             "NR-AR-LBD",
+             "NR-AhR",
+             "NR-Aromatase",
+             "NR-ER",
+             "NR-ER-LBD",
+             "NR-PPAR-gamma",
+             "SR-ARE",
+             "SR-ATAD5",
+             "SR-HSE",
+             "SR-MMP",
+             "SR-p53",
+         ],
+     }
+
+
+ @app.get("/healthz")
+ def healthz():
+     return {"ok": True}
+
+
+ @app.post("/predict", response_model=Response)
+ def predict(request: Request, authorization: Optional[str] = Header(None)):
+     # Require a matching Bearer token whenever an API key is configured
+     if API_KEY and authorization != f"Bearer {API_KEY}":
+         raise HTTPException(status_code=401, detail="Invalid or missing API key")
+     predictions = predict_func(request.smiles)
+     return {
+         "predictions": predictions,
+         "model_info": {"name": "gin_classifier", "version": "1.0.0"},
+     }
assets/best_gin_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6e49790b36de8674646c3c3eb6b35818f9d7319a61dbc0a483c13c2f78bcb210
+ size 634178
predict.py ADDED
@@ -0,0 +1,68 @@
+ from torch_geometric.data import Batch
+ from torch_geometric.utils import from_rdmol
+ import torch
+
+ from src.model import GIN
+ from src.preprocess import create_clean_mol_objects
+ from src.seed import set_seed
+
+
+ def predict(smiles_list):
+     """
+     Predict toxicity targets for a list of SMILES strings.
+
+     Args:
+         smiles_list (list[str]): SMILES strings
+
+     Returns:
+         dict: {smiles: {target_name: prediction_prob}}
+     """
+     set_seed(42)
+     # Tox21 targets
+     TARGET_NAMES = [
+         "NR-AR",
+         "NR-AR-LBD",
+         "NR-AhR",
+         "NR-Aromatase",
+         "NR-ER",
+         "NR-ER-LBD",
+         "NR-PPAR-gamma",
+         "SR-ARE",
+         "SR-ATAD5",
+         "SR-HSE",
+         "SR-MMP",
+         "SR-p53",
+     ]
+     DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     print(f"Received {len(smiles_list)} SMILES strings")
+
+     # Set up the model and load the trained weights
+     model = GIN(num_features=9, num_classes=12, dropout=0.1, hidden_dim=128, num_layers=5, add_or_mean="mean")
+     model_path = "./assets/best_gin_model.pt"
+     model.load_state_dict(torch.load(model_path, map_location=DEVICE))
+     print(f"Loaded model from {model_path}")
+     model.to(DEVICE)
+     model.eval()
+     predictions = {}
+
+     for smiles in smiles_list:
+         try:
+             # Convert the SMILES string to a graph on the fly
+             mols, _ = create_clean_mol_objects([smiles])
+             data = from_rdmol(mols[0]).to(DEVICE)
+             batch = Batch.from_data_list([data])
+
+             # Forward pass; from_rdmol yields integer node features, so cast
+             # them to float before they reach the first linear layer
+             with torch.no_grad():
+                 logits = model(batch.x.float(), batch.edge_index, batch.batch)
+                 probs = torch.sigmoid(logits).cpu().numpy().flatten()
+
+             # Map predictions to targets
+             predictions[smiles] = {t: float(p) for t, p in zip(TARGET_NAMES, probs)}
+
+         except Exception as e:
+             # If a SMILES string cannot be processed, return zeros for all targets
+             print(f"Prediction failed for {smiles}: {e}")
+             predictions[smiles] = {t: 0.0 for t in TARGET_NAMES}
+
+     return predictions
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ fastapi
+ uvicorn[standard]
+ torch==2.3.0
+ torch-geometric==2.6.1
+ numpy==1.26.2
+ pandas==2.2.2
+ rdkit==2024.3.6
+ pydantic
+ typing-extensions
src/model.py ADDED
@@ -0,0 +1,53 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch_geometric.nn import GINConv, global_add_pool, global_mean_pool
+
+
+ class GIN(torch.nn.Module):
+     def __init__(self, num_features, num_classes, dropout, hidden_dim=64, num_layers=5, add_or_mean="add"):
+         super().__init__()
+         self.num_layers = num_layers
+         self.hidden_dim = hidden_dim
+         self.add_or_mean = add_or_mean
+         self.dropout = dropout
+
+         self.conv_layers = nn.ModuleList()
+
+         # input features → hidden_dim
+         mlp = nn.Sequential(
+             nn.Linear(num_features, hidden_dim),
+             nn.ReLU(),
+             nn.Linear(hidden_dim, hidden_dim),
+             nn.BatchNorm1d(hidden_dim),
+         )
+         self.conv_layers.append(GINConv(mlp, train_eps=True))
+
+         # hidden GIN layers
+         for _ in range(num_layers - 1):
+             mlp = nn.Sequential(
+                 nn.Linear(hidden_dim, hidden_dim),
+                 nn.ReLU(),
+                 nn.Linear(hidden_dim, hidden_dim),
+                 nn.BatchNorm1d(hidden_dim),
+             )
+             self.conv_layers.append(GINConv(mlp, train_eps=True))
+
+         # Final classifier (after pooling)
+         self.fc = nn.Linear(hidden_dim, num_classes)
+
+     def forward(self, x, edge_index, batch):
+         for conv in self.conv_layers:
+             x = conv(x, edge_index)
+             x = F.relu(x)
+             x = F.dropout(x, p=self.dropout, training=self.training)
+         # Pool node embeddings to get a graph-level representation
+         if self.add_or_mean == "mean":
+             x = global_mean_pool(x, batch)
+         elif self.add_or_mean == "add":
+             x = global_add_pool(x, batch)
+
+         x = F.dropout(x, p=0.5, training=self.training)
+         return self.fc(x)
src/preprocess.py ADDED
@@ -0,0 +1,101 @@
+ import numpy as np
+ import torch
+ import pandas as pd
+
+ from rdkit import Chem
+ from rdkit.Chem.MolStandardize import rdMolStandardize
+ from torch_geometric.data import InMemoryDataset
+ from torch_geometric.utils import from_rdmol
+
+
+ def create_clean_mol_objects(smiles: list[str]) -> tuple[list[Chem.Mol], np.ndarray]:
+     """Create cleaned RDKit Mol objects from SMILES.
+
+     Returns (list of mols, mask of valid mols).
+     """
+     clean_mol_mask = []
+     mols = []
+
+     # Standardizer components
+     cleaner = rdMolStandardize.CleanupParameters()
+     tautomer_enumerator = rdMolStandardize.TautomerEnumerator()
+
+     for smi in smiles:
+         try:
+             mol = Chem.MolFromSmiles(smi)
+             if mol is None:
+                 clean_mol_mask.append(False)
+                 continue
+
+             # Cleanup and canonicalize
+             mol = rdMolStandardize.Cleanup(mol, cleaner)
+             mol = tautomer_enumerator.Canonicalize(mol)
+
+             # Recompute canonical SMILES & reload
+             can_smi = Chem.MolToSmiles(mol)
+             mol = Chem.MolFromSmiles(can_smi)
+
+             if mol is not None:
+                 mols.append(mol)
+                 clean_mol_mask.append(True)
+             else:
+                 clean_mol_mask.append(False)
+
+         except Exception as e:
+             print(f"Failed to standardize {smi}: {e}")
+             clean_mol_mask.append(False)
+
+     return mols, np.array(clean_mol_mask, dtype=bool)
+
+
+ class Tox21Dataset(InMemoryDataset):
+     def __init__(self, dataframe):
+         super().__init__()
+         data_list = []
+
+         # Clean molecules & filter dataframe
+         mols, clean_mask = create_clean_mol_objects(dataframe["smiles"].tolist())
+         dataframe = dataframe[clean_mask].reset_index(drop=True)
+
+         # Now mols and dataframe are aligned, so we can zip
+         for mol, (_, row) in zip(mols, dataframe.iterrows()):
+             try:
+                 data = from_rdmol(mol)
+
+                 # Extract labels as a pandas Series
+                 drop_cols = ["ID", "smiles", "inchikey", "sdftitle", "order", "set", "CVfold"]
+                 labels = row.drop(drop_cols)
+
+                 # Mask for valid labels
+                 mask = ~labels.isna()
+
+                 # Explicit numeric conversion, replaces NaN with 0.0 safely
+                 labels = pd.to_numeric(labels, errors="coerce").fillna(0.0).astype(float).values
+
+                 # Convert to tensors
+                 y = torch.tensor(labels, dtype=torch.float).unsqueeze(0)
+                 m = torch.tensor(mask.values, dtype=torch.bool).unsqueeze(0)
+
+                 data.y = y
+                 data.mask = m
+
+                 data_list.append(data)
+
+             except Exception as e:
+                 print(f"Skipping molecule {row['smiles']} due to error: {e}")
+
+         # Collate into dataset
+         self.data, self.slices = self.collate(data_list)
+
+
+ def get_graph_dataset(filepath: str):
+     """Returns an InMemoryDataset that can be used in dataloaders.
+
+     Args:
+         filepath (str): the filepath of the data csv
+
+     Returns:
+         Tox21Dataset: dataset for dataloaders
+     """
+     df = pd.read_csv(filepath)
+     dataset = Tox21Dataset(df)
+     return dataset
src/seed.py ADDED
@@ -0,0 +1,19 @@
+ import os
+ import random
+
+ import numpy as np
+ import torch
+
+
+ def set_seed(seed: int = 42):
+     random.seed(seed)
+     np.random.seed(seed)
+     torch.manual_seed(seed)
+     torch.cuda.manual_seed(seed)  # current GPU
+     torch.cuda.manual_seed_all(seed)  # all GPUs
+
+     # Ensure deterministic behavior
+     torch.backends.cudnn.deterministic = True
+     torch.backends.cudnn.benchmark = False
+
+     # For PyTorch >= 1.8
+     os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
+     torch.use_deterministic_algorithms(True, warn_only=True)