antoniaebner committed
Commit 9af3c0c · Parent: 506cb3a
Files changed (10)
  1. .gitignore +1 -0
  2. Dockerfile +16 -0
  3. README.md +93 -2
  4. app.py +78 -0
  5. predict.py +83 -0
  6. requirements.txt +10 -0
  7. src/__init__.py +0 -0
  8. src/model.py +122 -0
  9. src/preprocess.py +285 -0
  10. src/utils.py +444 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ __pycache__/
Dockerfile ADDED
@@ -0,0 +1,16 @@
+ # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
+ # you will also find guides on how best to write your Dockerfile
+
+ FROM python:3.11
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV PATH="/home/user/.local/bin:$PATH"
+
+ WORKDIR /app
+
+ COPY --chown=user ./requirements.txt requirements.txt
+ RUN pip install --no-cache-dir --upgrade -r requirements.txt
+
+ COPY --chown=user . /app
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Tox21 Snn Classifier
+ title: Tox21 SNN Classifier
  emoji: 🌖
  colorFrom: green
  colorTo: pink
@@ -9,4 +9,95 @@ license: apache-2.0
  short_description: Self-Normalizing Neural Network Baseline for Tox21
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Tox21 SNN Classifier
+
+ This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
+
+ In this example, we train a self-normalizing neural network (SNN) classifier on the Tox21 targets and save the trained model in the `assets/` folder.
+
+ **Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a prediction dictionary with SMILES and targets as keys (see the sketch after this diff). Any preprocessing of SMILES strings must therefore be executed on the fly during inference.
+
+ # Repository Structure
+ - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+ - `app.py` - FastAPI application wrapper (can be used as-is).
+ - `src/` - Core model & preprocessing logic:
+   - `preprocess.py` - SMILES preprocessing pipeline
+   - `model.py` - SNN classifier wrapper
+   - `train.py` - Script to train the classifier
+   - `utils.py` - Constants and helper functions
+
+ # Quickstart with Spaces
+
+ You can easily adapt this project in your own Hugging Face account:
+
+ - Open this Space on Hugging Face.
+ - Click "Duplicate this Space" (top-right corner).
+ - Modify `src/` for your preprocessing pipeline and model class.
+ - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
+
+ That’s it: your model will be available as an API endpoint for the Tox21 Leaderboard.
+
+ # Installation
+ To run (and train) the model, clone the repository and install dependencies:
+
+ ```bash
+ git clone https://huggingface.co/spaces/tschouis/tox21_xgboost_classifier
+ cd tox21_xgboost_classifier
+
+ conda create -n tox21_snn_cls python=3.11
+ conda activate tox21_snn_cls
+ pip install -r requirements.txt
+ ```
+
+ # Training
+
+ To train the SNN model from scratch:
+
+ ```bash
+ python -m src.train
+ ```
+
+ This will:
+
+ 1. Load and preprocess the Tox21 training dataset.
+ 2. Train an SNN classifier.
+ 3. Save the trained model to the `assets/` folder.
+ 4. Evaluate the trained classifier on the validation split.
+
+ # Inference
+
+ For inference, you only need `predict.py`.
+
+ Example usage inside Python:
+
+ ```python
+ from predict import predict
+
+ smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+ results = predict(smiles_list)
+
+ print(results)
+ ```
+
+ The output will be a nested dictionary mapping each SMILES string to the predicted probability for each of the twelve Tox21 targets:
+
+ ```python
+ {
+     "CCO": {"NR-AR": 0.02, "NR-AR-LBD": 0.11, ..., "SR-p53": 0.05},
+     "c1ccccc1": {"NR-AR": 0.87, "NR-AR-LBD": 0.34, ..., "SR-p53": 0.61},
+     "CC(=O)O": {"NR-AR": 0.01, "NR-AR-LBD": 0.03, ..., "SR-p53": 0.02}
+ }
+ ```
+
+ # Notes
+
+ - Only adapting `predict.py` for your model inference is required for leaderboard submission.
+ - Training (`src/train.py`) is provided for reproducibility.
+ - Preprocessing (here inside `src/preprocess.py`) must be applied at inference time, not only during training.
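The `predict()` contract described in the README above boils down to the following skeleton — a minimal sketch of the signature the leaderboard expects, with the model-specific steps left as placeholders (only `TASKS` is taken from this repo; everything else here is illustrative):

```python
# Minimal sketch of the predict() skeleton expected by the leaderboard.
# Input: list of SMILES strings; output: {smiles: {target: probability}}.
from src.utils import TASKS  # the 12 Tox21 target names


def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
    predictions: dict[str, dict[str, float]] = {}
    for smiles in smiles_list:
        # 1. preprocess the SMILES string on the fly (cleaning, featurization, ...)
        # 2. run your model on the resulting features
        # 3. fall back to a default score if preprocessing fails
        score = 0.5  # placeholder; replace with your model's per-target output
        predictions[smiles] = {target: score for target in TASKS}
    return predictions
```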
app.py ADDED
@@ -0,0 +1,78 @@
+ """
+ This is the main entry point for the FastAPI application.
+ The app handles the request to predict toxicity for a list of SMILES strings.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies and global variable definition
+ import os
+ from typing import List, Dict, Optional
+ from fastapi import FastAPI, Header, HTTPException
+ from pydantic import BaseModel, Field
+
+ from predict import predict as predict_func
+
+ API_KEY = os.getenv("API_KEY")  # set via Space Secrets (not enforced in this example app)
+
+
+ # ---------------------------------------------------------------------------------------
+ class Request(BaseModel):
+     smiles: List[str] = Field(min_items=1, max_items=1000)
+
+
+ class Response(BaseModel):
+     predictions: dict
+     model_info: Dict[str, str] = {}
+
+
+ app = FastAPI(title="toxicity-api")
+
+
+ @app.get("/")
+ def root():
+     return {
+         "message": "Toxicity Prediction API",
+         "endpoints": {
+             "/metadata": "GET - API metadata and capabilities",
+             "/healthz": "GET - Health check",
+             "/predict": "POST - Predict toxicity for SMILES",
+         },
+         "usage": "Send POST to /predict with {'smiles': ['your_smiles_here']} and Authorization header",
+     }
+
+
+ @app.get("/metadata")
+ def metadata():
+     return {
+         "name": "SNN",
+         "version": "1.0.0",
+         "max_batch_size": 256,
+         "tox_endpoints": [
+             "NR-AR",
+             "NR-AR-LBD",
+             "NR-AhR",
+             "NR-Aromatase",
+             "NR-ER",
+             "NR-ER-LBD",
+             "NR-PPAR-gamma",
+             "SR-ARE",
+             "SR-ATAD5",
+             "SR-HSE",
+             "SR-MMP",
+             "SR-p53",
+         ],
+     }
+
+
+ @app.get("/healthz")
+ def healthz():
+     return {"ok": True}
+
+
+ @app.post("/predict", response_model=Response)
+ def predict(request: Request):
+     predictions = predict_func(request.smiles)
+     return {
+         "predictions": predictions,
+         "model_info": {"name": "snn_classifier", "version": "1.0.0"},
+     }
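Once the Space is running, the `/predict` endpoint can be exercised with a small client. A minimal sketch using `requests` — the Space URL is a placeholder, and the `Authorization` header only matters if you actually enforce the `API_KEY` secret:

```python
# Hypothetical client for the /predict endpoint; replace the URL with your Space's.
import requests

SPACE_URL = "https://<user>-<space>.hf.space"  # placeholder

response = requests.post(
    f"{SPACE_URL}/predict",
    json={"smiles": ["CCO", "c1ccccc1"]},
    headers={"Authorization": "Bearer <API_KEY>"},  # only if configured
    timeout=60,
)
response.raise_for_status()
print(response.json()["predictions"])
```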
predict.py ADDED
@@ -0,0 +1,83 @@
+ """
+ This file includes a predict function for Tox21.
+ It takes a list of SMILES as input and outputs a nested dictionary with
+ SMILES and target names as keys.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies
+ from collections import defaultdict
+
+ import numpy as np
+
+ import torch
+
+ from src.preprocess import create_descriptors
+ from src.model import Tox21SNNClassifier, SNNConfig
+ from src.utils import load_pickle
+
+ # ---------------------------------------------------------------------------------------
+
+
+ def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
+     """Applies the classifier to a list of SMILES strings. Returns prediction=0.5 for
+     any molecule that could not be cleaned.
+
+     Args:
+         smiles_list (list[str]): list of SMILES strings
+
+     Returns:
+         dict: nested prediction dictionary, following {'<smiles>': {'<target>': <pred>}}
+     """
+     print(f"Received {len(smiles_list)} SMILES strings")
+
+     # preprocessing pipeline
+     ecdfs_path = "assets/ecdfs.pkl"
+     scaler_path = "assets/scaler.pkl"
+     ecdfs = load_pickle(ecdfs_path)
+     scaler = load_pickle(scaler_path)
+     print(f"Loaded ecdfs from {ecdfs_path}")
+     print(f"Loaded scaler from {scaler_path}")
+
+     descriptors = ["rdkit_descr_quantiles", "tox"]
+     features, mol_mask = create_descriptors(
+         smiles_list,
+         ecdfs=ecdfs,
+         scaler=scaler,
+         descriptors=descriptors,
+     )
+     print(f"Created descriptors {descriptors} for molecules.")
+     print(f"{len(mol_mask) - sum(mol_mask)} molecules removed during cleaning")
+
+     # setup model
+     cfg = SNNConfig(
+         hidden_dim=1024,
+         n_layers=8,
+         dropout=0.05,
+         layer_form="conic",
+         in_features=features.shape[1],
+         out_features=12,
+     )
+
+     model = Tox21SNNClassifier(cfg)
+     model_path = "assets/snn_best.pth"
+     model.load_model(model_path)
+     model.eval()
+     print(f"Loaded model from {model_path}")
+
+     # make predictions
+     predictions = defaultdict(dict)
+     # map each input SMILES to the row index of its features
+     # (the index is only valid where mol_mask is True)
+     feat_indices = np.cumsum(mol_mask) - 1
+
+     mask = ~np.isnan(features).any(axis=1)
+     dataset = torch.utils.data.TensorDataset(torch.FloatTensor(features[mask]))
+     loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False, num_workers=0)
+
+     with torch.no_grad():
+         preds = np.concatenate([model.predict(batch[0]) for batch in loader], axis=0)
+
+     for i, target in enumerate(model.tasks):
+         for smiles, is_clean, j in zip(smiles_list, mol_mask, feat_indices):
+             predictions[smiles][target] = float(preds[j, i]) if is_clean else 0.5
+     return predictions
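The realignment between the input list and the filtered feature matrix is the subtle step here: `np.cumsum(mol_mask) - 1` gives, for every input position, the row its features occupy after the removed molecules are dropped. A small standalone sketch of how it behaves:

```python
# Illustration of the cumsum index trick used above (not part of the pipeline).
import numpy as np

mol_mask = np.array([True, False, True, True])  # 2nd molecule failed cleaning
feat_indices = np.cumsum(mol_mask) - 1          # -> [0, 0, 1, 2]

# Where mol_mask is True, feat_indices points at the right row of the
# compacted feature matrix; where it is False the index is meaningless
# and the caller falls back to the default prediction of 0.5.
for pos, (is_clean, j) in enumerate(zip(mol_mask, feat_indices)):
    print(pos, is_clean, j if is_clean else None)
```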
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ fastapi
+ uvicorn[standard]
+ statsmodels
+ rdkit
+ numpy
+ scikit-learn==1.7.1
+ joblib
+ tabulate
+ datasets
+ torch==2.8.0
src/__init__.py ADDED
File without changes
src/model.py ADDED
@@ -0,0 +1,122 @@
+ """
+ This file includes a self-normalizing neural network (SNN) model for Tox21.
+ It takes molecule feature vectors as input and outputs a probability for each
+ of the twelve Tox21 targets.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies
+ from typing import Literal
+
+ from dataclasses import dataclass
+
+ import numpy as np
+
+ import torch
+ import torch.nn as nn
+
+ from .utils import TASKS
+
+
+ # ---------------------------------------------------------------------------------------
+ @dataclass
+ class SNNConfig:
+     hidden_dim: int
+     n_layers: int
+     dropout: float
+     layer_form: Literal["conic", "rect"]
+     in_features: int
+     out_features: int
+
+
+ class Tox21SNNClassifier(nn.Module):
+     """An SNN classifier that assigns toxicity scores to a given SMILES string's features."""
+
+     def __init__(self, config: SNNConfig):
+         """Initialize an SNN classifier for the 12 Tox21 tasks.
+
+         Args:
+             config (SNNConfig): architecture hyperparameters (width, depth, dropout, ...).
+         """
+         super().__init__()
+
+         self.tasks = TASKS
+         self.num_tasks = len(TASKS)
+
+         activation = nn.SELU()
+         dropout = nn.AlphaDropout(p=config.dropout)
+
+         # layer widths: geometrically shrinking ("conic") or constant ("rect")
+         n_hidden = (
+             (
+                 config.hidden_dim
+                 * np.power(
+                     np.power(
+                         config.out_features / config.hidden_dim, 1 / (config.n_layers)
+                     ),
+                     range(-1, config.n_layers),
+                 )
+             ).astype(int)
+             if config.layer_form == "conic"
+             else [config.hidden_dim] * (config.n_layers + 1)
+         )
+
+         n_hidden[0] = config.in_features
+         n_hidden[config.n_layers] = config.out_features
+
+         layers = []
+         for l in range(config.n_layers + 1):
+             fc = nn.Linear(
+                 in_features=n_hidden[l],
+                 out_features=(
+                     n_hidden[config.n_layers]
+                     if l == config.n_layers
+                     else n_hidden[l + 1]
+                 ),
+             )
+             if l < config.n_layers:
+                 block = [
+                     fc,
+                     activation,
+                     dropout,
+                 ]
+             else:  # last layer
+                 block = [fc]
+             layers.extend(block)
+
+         self.model = nn.Sequential(*layers)
+
+         self.reset_parameters()
+
+     def reset_parameters(self):
+         for param in self.model.parameters():
+             # biases zero
+             if len(param.shape) == 1:
+                 nn.init.constant_(param, 0)
+             # others using lecun-normal initialization
+             else:
+                 nn.init.kaiming_normal_(param, mode="fan_in", nonlinearity="linear")
+
+     def forward(self, x) -> torch.Tensor:
+         return self.model(x)
+
+     def load_model(self, path: str):
+         self.load_state_dict(torch.load(path, weights_only=True)["model"])
+         self.eval()
+
+     @torch.no_grad()
+     def predict(self, features: torch.Tensor) -> np.ndarray:
+         """Predicts probabilities for all Tox21 targets from molecule features.
+
+         Args:
+             features (torch.Tensor): molecule features used for prediction
+
+         Returns:
+             np.ndarray: predicted probability for the positive class per target
+         """
+         assert (
+             len(features.shape) == 2
+         ), f"Function expects 2D torch.Tensor. Current shape: {features.shape}"
+
+         return torch.sigmoid(self.model(features)).detach().cpu().numpy()
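The `conic` branch above builds a geometric progression of layer widths from `hidden_dim` down to `out_features`, and the two endpoints are then overwritten with the true input and output sizes. A quick standalone sketch of what it produces for the config used in `predict.py` (the `in_features` value here is illustrative, since it depends on the descriptor set):

```python
# Sketch: layer widths produced by the "conic" rule for the predict.py config.
import numpy as np

hidden_dim, n_layers, in_features, out_features = 1024, 8, 1613, 12  # in_features illustrative

ratio = np.power(out_features / hidden_dim, 1 / n_layers)        # geometric decay factor
n_hidden = (hidden_dim * np.power(ratio, range(-1, n_layers))).astype(int)
n_hidden[0] = in_features          # overwrite head with the real input size
n_hidden[n_layers] = out_features  # ...and tail with the number of targets

print(n_hidden)  # widths of the n_layers + 1 linear layers, e.g. [1613, 1024, 587, ..., 12]
```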
src/preprocess.py ADDED
@@ -0,0 +1,285 @@
+ # pipeline taken from https://huggingface.co/spaces/ml-jku/mhnfs/blob/main/src/data_preprocessing/create_descriptors.py
+
+ """
+ This file includes the data preprocessing for Tox21.
+ It takes a list of SMILES as input and outputs the feature matrix that the
+ model consumes, together with a mask marking molecules that survived cleaning.
+ """
+
+ import json
+ from typing import Iterable
+
+ import numpy as np
+
+ from sklearn.preprocessing import StandardScaler
+ from statsmodels.distributions.empirical_distribution import ECDF
+
+ from rdkit import Chem, DataStructs
+ from rdkit.Chem import Descriptors, rdFingerprintGenerator, MACCSkeys
+ from rdkit.Chem.rdchem import Mol
+
+ from .utils import (
+     KNOWN_DESCR,
+     USED_200_DESCR,
+     Standardizer,
+     write_pickle,
+ )
+
+
+ def create_cleaned_mol_objects(smiles: list[str]) -> tuple[list[Mol], np.ndarray]:
+     """This function creates cleaned RDKit mol objects from a list of SMILES.
+
+     Args:
+         smiles (list[str]): list of SMILES
+
+     Returns:
+         list[Mol]: list of cleaned molecules
+         np.ndarray[bool]: mask that contains False at index `i`, if the molecule in
+             `smiles` at index `i` could not be cleaned and was removed.
+     """
+     sm = Standardizer(canon_taut=True)
+
+     clean_mol_mask = list()
+     mols = list()
+     for smile in smiles:
+         mol = Chem.MolFromSmiles(smile)
+         standardized_mol, _ = sm.standardize_mol(mol)
+         is_cleaned = standardized_mol is not None
+         clean_mol_mask.append(is_cleaned)
+         if not is_cleaned:
+             continue
+         can_mol = Chem.MolFromSmiles(Chem.MolToSmiles(standardized_mol))
+         mols.append(can_mol)
+
+     return mols, np.array(clean_mol_mask)
+
+
+ def create_ecfp_fps(mols: list[Mol]) -> np.ndarray:
+     """This function creates ECFP fingerprints for a list of molecules.
+
+     Args:
+         mols (list[Mol]): list of molecules
+
+     Returns:
+         np.ndarray: ECFP fingerprints of molecules
+     """
+     ecfps = list()
+
+     for mol in mols:
+         fp_sparse_vec = rdFingerprintGenerator.GetCountFPs(
+             [mol], fpType=rdFingerprintGenerator.MorganFP
+         )[0]
+         fp = np.zeros((0,), np.int8)
+         DataStructs.ConvertToNumpyArray(fp_sparse_vec, fp)
+
+         ecfps.append(fp)
+
+     return np.array(ecfps)
+
+
+ def create_maccs_keys(mols: list[Mol]) -> np.ndarray:
+     maccs = [MACCSkeys.GenMACCSKeys(x) for x in mols]
+     return np.array(maccs)
+
+
+ def get_tox_patterns(filepath: str):
+     """Parses the tox SMARTS patterns defined in tox_smarts.json.
+
+     Args:
+         filepath (str): path to the JSON file with (name, SMARTS) pairs
+
+     Returns:
+         list: one (patterns, negations, merge_any) triple per SMARTS entry
+     """
+     # load patterns
+     with open(filepath) as f:
+         smarts_list = [s[1] for s in json.load(f)]
+
+     # Code does not work for this case
+     assert len([s for s in smarts_list if ("AND" in s) and ("OR" in s)]) == 0
+
+     # Chem.MolFromSmarts takes a long time, so it pays off to parse all the smarts
+     # first and then use them for all molecules. This gives a huge speedup over
+     # existing code. The result is a list of patterns, whether to negate each match
+     # result, and how to join them to obtain one boolean value.
+     all_patterns = []
+     for smarts in smarts_list:
+         patterns = []  # list of smarts-patterns
+         # one value per pattern above; negates the match results later
+         negations = []
+
+         if " AND " in smarts:
+             smarts = smarts.split(" AND ")
+             merge_any = False  # If an ' AND ' is found all 'subsmarts' have to match
+         else:
+             # If there is an ' OR ' present it's enough if any of the 'subsmarts' match.
+             # This also accumulates smarts where neither ' OR ' nor ' AND ' occur.
+             smarts = smarts.split(" OR ")
+             merge_any = True
+
+         # for all subsmarts check if they are preceded by 'NOT '
+         for s in smarts:
+             neg = s.startswith("NOT ")
+             if neg:
+                 s = s[4:]
+             patterns.append(Chem.MolFromSmarts(s))
+             negations.append(neg)
+
+         all_patterns.append((patterns, negations, merge_any))
+     return all_patterns
+
+
+ def create_tox_features(mols: list[Mol], patterns: list) -> np.ndarray:
+     """Matches the tox patterns against each molecule. Returns a boolean array."""
+     tox_data = []
+     for mol in mols:
+         mol_features = []
+         for patts, negations, merge_any in patterns:
+             matches = [mol.HasSubstructMatch(p) for p in patts]
+             matches = [m != n for m, n in zip(matches, negations)]
+             if merge_any:
+                 pres = any(matches)
+             else:
+                 pres = all(matches)
+             mol_features.append(pres)
+
+         tox_data.append(np.array(mol_features))
+
+     return np.array(tox_data)
+
+
+ def create_rdkit_descriptors(mols: list[Mol]) -> np.ndarray:
+     """This function creates RDKit descriptors for a list of molecules.
+
+     Args:
+         mols (list[Mol]): list of molecules
+
+     Returns:
+         np.ndarray: RDKit descriptors of molecules
+     """
+     rdkit_descriptors = list()
+
+     for mol in mols:
+         descrs = []
+         for _, descr_calc_fn in Descriptors._descList:
+             descrs.append(descr_calc_fn(mol))
+
+         descrs = np.array(descrs)
+         descrs = descrs[USED_200_DESCR]
+         rdkit_descriptors.append(descrs)
+
+     return np.array(rdkit_descriptors)
+
+
+ def create_quantiles(raw_features: np.ndarray, ecdfs: list) -> np.ndarray:
+     """Create quantile values for the given features using per-column ECDFs.
+
+     Args:
+         raw_features (np.ndarray): values to put into quantiles
+         ecdfs (list): ECDFs to use, one per feature column
+
+     Returns:
+         np.ndarray: computed quantiles
+     """
+     quantiles = np.zeros_like(raw_features)
+
+     for column in range(raw_features.shape[1]):
+         raw_values = raw_features[:, column].reshape(-1)
+         ecdf = ecdfs[column]
+         q = ecdf(raw_values)
+         quantiles[:, column] = q
+
+     return quantiles
+
+
+ def fill(features, mask, value=np.nan):
+     """Expands `features` to one row per input molecule, writing `value` into the
+     rows where `mask` is True (i.e. molecules removed during cleaning)."""
+     n_mols = len(mask)
+     n_features = features.shape[1]
+
+     data = np.zeros(shape=(n_mols, n_features))
+     data.fill(value)
+     data[~mask] = features
+     return data
+
+
+ def normalize_features(
+     raw_features,
+     scaler=None,
+     save_scaler_path: str = "",
+     verbose=True,
+ ):
+     if scaler is None:
+         scaler = StandardScaler()
+         scaler.fit(raw_features)
+         if verbose:
+             print("Fitted the StandardScaler")
+         if save_scaler_path:
+             write_pickle(save_scaler_path, scaler)
+             if verbose:
+                 print(f"Saved the StandardScaler under {save_scaler_path}")
+
+     # Normalize feature vectors
+     normalized_features = scaler.transform(raw_features)
+     if verbose:
+         print("Normalized molecule features")
+     return normalized_features, scaler
+
+
+ def create_descriptors(
+     smiles,
+     ecdfs=None,
+     scaler=None,
+     descriptors: Iterable = KNOWN_DESCR,
+ ):
+     # Create cleaned rdkit mol objects
+     mols, clean_mol_mask = create_cleaned_mol_objects(smiles)
+     print("Cleaned molecules")
+
+     features = []
+     if "ecfps" in descriptors:
+         # Create fingerprints and descriptors
+         ecfps = create_ecfp_fps(mols)
+         # expand using mol_mask
+         ecfps = fill(ecfps, ~clean_mol_mask)
+         features.append(ecfps)
+         print("Created ECFP fingerprints")
+
+     if "rdkit_descr_quantiles" in descriptors:
+         rdkit_descrs = create_rdkit_descriptors(mols)
+         print("Created RDKit descriptors")
+
+         # Create and save ecdfs
+         if ecdfs is None:
+             print("Create ECDFs")
+             ecdfs = []
+             for column in range(rdkit_descrs.shape[1]):
+                 raw_values = rdkit_descrs[:, column].reshape(-1)
+                 ecdfs.append(ECDF(raw_values))
+
+         # Create quantiles
+         rdkit_descr_quantiles = create_quantiles(rdkit_descrs, ecdfs)
+         # expand using mol_mask
+         rdkit_descr_quantiles = fill(rdkit_descr_quantiles, ~clean_mol_mask)
+         features.append(rdkit_descr_quantiles)
+         print("Created quantiles of RDKit descriptors")
+
+     if "maccs" in descriptors:
+         maccs = create_maccs_keys(mols)
+         maccs = fill(maccs, ~clean_mol_mask)
+         features.append(maccs)
+         print("Created MACCS keys")
+
+     if "tox" in descriptors:
+         tox_patterns = get_tox_patterns("assets/tox_smarts.json")
+         tox = create_tox_features(mols, tox_patterns)
+         tox = fill(tox, ~clean_mol_mask)
+         features.append(tox)
+         print("Created Tox features")
+
+     # concatenate features
+     raw_features = np.concatenate(features, axis=1)
+
+     # normalize with scaler if scaler is passed, else create scaler
+     features, scaler = normalize_features(
+         raw_features,
+         scaler=scaler,
+         verbose=True,
+     )
+
+     return features, clean_mol_mask
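The quantile featurization above relies on `statsmodels`' `ECDF`: each descriptor column is mapped through the empirical CDF fitted on training data, so at inference every raw value lands in [0, 1]. A minimal sketch with made-up values:

```python
# Sketch of the per-column ECDF quantile transform used in create_quantiles().
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

train_col = np.array([0.1, 0.4, 0.4, 2.0, 5.0])  # one descriptor column (training)
ecdf = ECDF(train_col)                           # fit once, persist (assets/ecdfs.pkl)

new_col = np.array([0.05, 0.4, 10.0])            # same descriptor at inference
print(ecdf(new_col))                             # -> [0.0, 0.6, 1.0]
```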
src/utils.py ADDED
@@ -0,0 +1,444 @@
+ ## These MolStandardizer classes are due to Paolo Tosco
+ ## They were taken from the FS-Mol GitHub
+ ## (https://github.com/microsoft/FS-Mol/blob/main/fs_mol/preprocessing/utils/
+ ## standardizer.py)
+ ## They ensure that a sequence of standardization operations are applied
+ ## https://gist.github.com/ptosco/7e6b9ab9cc3e44ba0919060beaed198e
+
+ import os
+ import pickle
+
+ from rdkit import Chem
+ from rdkit.Chem.MolStandardize import rdMolStandardize
+
+ HF_TOKEN = os.environ.get("HF_TOKEN")
+ PAD_VALUE = -100
+
+ TASKS = [
+     "NR-AR",
+     "NR-AR-LBD",
+     "NR-AhR",
+     "NR-Aromatase",
+     "NR-ER",
+     "NR-ER-LBD",
+     "NR-PPAR-gamma",
+     "SR-ARE",
+     "SR-ATAD5",
+     "SR-HSE",
+     "SR-MMP",
+     "SR-p53",
+ ]
+
+ KNOWN_DESCR = ["ecfps", "rdkit_descr_quantiles", "maccs", "tox"]
+
+ # Indices of the 200 RDKit descriptors used by the model: 0-16 and 25-207
+ # (written as ranges for readability; identical to the original explicit list).
+ USED_200_DESCR = list(range(0, 17)) + list(range(25, 208))
+
+
+ class Standardizer:
+     """
+     Simple wrapper class around rdkit Standardizer.
+     """
+
+     DEFAULT_CANON_TAUT = False
+     DEFAULT_METAL_DISCONNECT = False
+     MAX_TAUTOMERS = 100
+     MAX_TRANSFORMS = 100
+     MAX_RESTARTS = 200
+     PREFER_ORGANIC = True
+
+     def __init__(
+         self,
+         metal_disconnect=None,
+         canon_taut=None,
+     ):
+         """
+         Constructor.
+         All parameters are optional.
+         :param metal_disconnect: if True, metallorganic complexes are
+                                  disconnected
+         :param canon_taut: if True, molecules are converted to their
+                            canonical tautomer
+         """
+         super().__init__()
+         if metal_disconnect is None:
+             metal_disconnect = self.DEFAULT_METAL_DISCONNECT
+         if canon_taut is None:
+             canon_taut = self.DEFAULT_CANON_TAUT
+         self._canon_taut = canon_taut
+         self._metal_disconnect = metal_disconnect
+         self._taut_enumerator = None
+         self._uncharger = None
+         self._lfrag_chooser = None
+         self._metal_disconnector = None
+         self._normalizer = None
+         self._reionizer = None
+         self._params = None
+
+     @property
+     def params(self):
+         """Return the MolStandardize CleanupParameters."""
+         if self._params is None:
+             self._params = rdMolStandardize.CleanupParameters()
+             self._params.maxTautomers = self.MAX_TAUTOMERS
+             self._params.maxTransforms = self.MAX_TRANSFORMS
+             self._params.maxRestarts = self.MAX_RESTARTS
+             self._params.preferOrganic = self.PREFER_ORGANIC
+             self._params.tautomerRemoveSp3Stereo = False
+         return self._params
+
+     @property
+     def canon_taut(self):
+         """Return whether tautomer canonicalization will be done."""
+         return self._canon_taut
+
+     @property
+     def metal_disconnect(self):
+         """Return whether metallorganic complexes will be disconnected."""
+         return self._metal_disconnect
+
+     @property
+     def taut_enumerator(self):
+         """Return the TautomerEnumerator object."""
+         if self._taut_enumerator is None:
+             self._taut_enumerator = rdMolStandardize.TautomerEnumerator(self.params)
+         return self._taut_enumerator
+
+     @property
+     def uncharger(self):
+         """Return the Uncharger object."""
+         if self._uncharger is None:
+             self._uncharger = rdMolStandardize.Uncharger()
+         return self._uncharger
+
+     @property
+     def lfrag_chooser(self):
+         """Return the LargestFragmentChooser object."""
+         if self._lfrag_chooser is None:
+             self._lfrag_chooser = rdMolStandardize.LargestFragmentChooser(
+                 self.params.preferOrganic
+             )
+         return self._lfrag_chooser
+
+     @property
+     def metal_disconnector(self):
+         """Return the MetalDisconnector object."""
+         if self._metal_disconnector is None:
+             self._metal_disconnector = rdMolStandardize.MetalDisconnector()
+         return self._metal_disconnector
+
+     @property
+     def normalizer(self):
+         """Return the Normalizer object."""
+         if self._normalizer is None:
+             self._normalizer = rdMolStandardize.Normalizer(
+                 self.params.normalizationsFile, self.params.maxRestarts
+             )
+         return self._normalizer
+
+     @property
+     def reionizer(self):
+         """Return the Reionizer object."""
+         if self._reionizer is None:
+             self._reionizer = rdMolStandardize.Reionizer(self.params.acidbaseFile)
+         return self._reionizer
+
+     def charge_parent(self, mol_in):
+         """Sequentially apply a series of MolStandardize operations:
+         * MetalDisconnector
+         * Normalizer
+         * Reionizer
+         * LargestFragmentChooser
+         * Uncharger
+         The net result is that a desalted, normalized, neutral
+         molecule with implicit Hs is returned.
+         """
+         params = Chem.RemoveHsParameters()
+         params.removeAndTrackIsotopes = True
+         mol_in = Chem.RemoveHs(mol_in, params, sanitize=False)
+         if self._metal_disconnect:
+             mol_in = self.metal_disconnector.Disconnect(mol_in)
+         normalized = self.normalizer.normalize(mol_in)
+         Chem.SanitizeMol(normalized)
+         normalized = self.reionizer.reionize(normalized)
+         Chem.AssignStereochemistry(normalized)
+         normalized = self.lfrag_chooser.choose(normalized)
+         normalized = self.uncharger.uncharge(normalized)
+         # need this to reassess aromaticity on things like
+         # cyclopentadienyl, tropylium, azolium, etc.
+         Chem.SanitizeMol(normalized)
+         return Chem.RemoveHs(Chem.AddHs(normalized))
+
+     def standardize_mol(self, mol_in):
+         """
+         Standardize a single molecule.
+         :param mol_in: a Chem.Mol
+         :return: * (standardized Chem.Mol, n_taut) tuple
+                    if success. n_taut will be negative if
+                    tautomer enumeration was aborted due
+                    to reaching a limit
+                  * (None, error_msg) if failure
+         This calls self.charge_parent() and, if self._canon_taut
+         is True, runs tautomer canonicalization.
+         """
+         n_tautomers = 0
+         if isinstance(mol_in, Chem.Mol):
+             name = None
+             try:
+                 name = mol_in.GetProp("_Name")
+             except KeyError:
+                 pass
+             if not name:
+                 name = "NONAME"
+         else:
+             error = f"Expected SMILES or Chem.Mol as input, got {str(type(mol_in))}"
+             return None, error
+         try:
+             mol_out = self.charge_parent(mol_in)
+         except Exception as e:
+             error = f"charge_parent FAILED: {str(e).strip()}"
+             return None, error
+         if self._canon_taut:
+             try:
+                 res = self.taut_enumerator.Enumerate(mol_out, False)
+             except TypeError:
+                 # we are still on the pre-2021 RDKit API
+                 res = self.taut_enumerator.Enumerate(mol_out)
+             except Exception as e:
+                 # something else went wrong
+                 error = f"canon_taut FAILED: {str(e).strip()}"
+                 return None, error
+             n_tautomers = len(res)
+             if hasattr(res, "status"):
+                 completed = (
+                     res.status == rdMolStandardize.TautomerEnumeratorStatus.Completed
+                 )
+             else:
+                 # we are still on the pre-2021 RDKit API
+                 completed = len(res) < 1000
+             if not completed:
+                 n_tautomers = -n_tautomers
+             try:
+                 mol_out = self.taut_enumerator.PickCanonical(res)
+             except AttributeError:
+                 # we are still on the pre-2021 RDKit API
+                 mol_out = max(
+                     [(self.taut_enumerator.ScoreTautomer(m), m) for m in res]
+                 )[1]
+             except Exception as e:
+                 # something else went wrong
+                 error = f"canon_taut FAILED: {str(e).strip()}"
+                 return None, error
+         mol_out.SetProp("_Name", name)
+         return mol_out, n_tautomers
+
+
+ def load_pickle(path: str):
+     with open(path, "rb") as file:
+         content = pickle.load(file)
+     return content
+
+
+ def write_pickle(path: str, obj: object):
+     with open(path, "wb") as file:
+         pickle.dump(obj, file)
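The `standardize_mol()` return contract (mol plus tautomer count on success, `None` plus an error string on failure) is what `create_cleaned_mol_objects()` in `src/preprocess.py` relies on. A small sketch of driving it directly:

```python
# Sketch: using the Standardizer directly (mirrors src/preprocess.py usage).
from rdkit import Chem
from src.utils import Standardizer

sm = Standardizer(canon_taut=True)

mol = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")  # sodium acetate: salt gets stripped
standardized, info = sm.standardize_mol(mol)

if standardized is None:
    print(f"cleaning failed: {info}")           # info is the error message
else:
    print(Chem.MolToSmiles(standardized), info)  # info is the tautomer count
```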