User Guide
The project uses Python 3.11.5.
git clone --branch main --recurse-submodules git@github.com:gmum/counterfactual-masking.git
cd counterfactual-masking/DiffLinker
# Download the DiffLinker model checkpoint
mkdir -p models
wget "https://zenodo.org/record/7121300/files/zinc_difflinker_given_anchors.ckpt?download=1" -O models/zinc_difflinker_given_anchors.ckpt
cd ../
# Download and extract the CReM dataset (ChEMBL22)
mkdir -p data
wget "https://www.qsar4u.com/files/cremdb/chembl22_sa2.db.gz" -O data/chembl22_sa2.db.gz
gunzip data/chembl22_sa2.db.gz
# Libraries
pip install torch==2.5.1+cu124 --index-url "https://download.pytorch.org/whl/cu124"
pip install -r requirements.txt
# DiffLinker
pip install -e .
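Before moving on, it can help to confirm that the downloads above landed where later steps expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_assets` is hypothetical, and the two paths are taken from the download commands above (the `.db` name assumes `gunzip` has removed the `.gz` suffix).

```python
from pathlib import Path

# Paths relative to the repository root, taken from the download commands above.
EXPECTED_ASSETS = (
    "DiffLinker/models/zinc_difflinker_given_anchors.ckpt",
    "data/chembl22_sa2.db",
)


def missing_assets(repo_root="."):
    """Return the expected files that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for relative in EXPECTED_ASSETS:
        path = root / relative
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


if __name__ == "__main__":
    absent = missing_assets()
    print("All assets found." if not absent else f"Missing: {absent}")
```

Run it from the repository root; an empty result means both files are present and non-empty.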
Usage
To mask selected atoms in your molecules, you can use one of the following functions:
from source.linksGenerator import crem_fragment_replacement, diffLinker_fragment_replacement
# mol is an rdkit.Chem.Mol; toDelete is a list[int] of atom indices to mask
output = diffLinker_fragment_replacement(mol, toDelete)
output = crem_fragment_replacement(mol, toDelete)
The output is a set of molecules where the specified atoms have been replaced according to the chosen masking strategy.
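Both functions take atom indices, so a quick sanity check on `toDelete` before masking can catch off-by-one mistakes early. The sketch below is illustrative and not part of the repository: `validate_atom_indices` is a hypothetical helper, and it relies only on RDKit's `Mol.GetNumAtoms()`, so any object exposing that method works.

```python
def validate_atom_indices(mol, to_delete):
    """Return sorted, de-duplicated atom indices, or raise ValueError.

    `mol` is expected to be an rdkit.Chem.Mol; only GetNumAtoms() is used.
    This helper is hypothetical and not provided by the repository.
    """
    n_atoms = mol.GetNumAtoms()
    if not all(isinstance(i, int) for i in to_delete):
        raise ValueError("atom indices must be integers")
    indices = sorted(set(to_delete))
    if not indices:
        raise ValueError("no atom indices given: nothing to mask")
    out_of_range = [i for i in indices if not 0 <= i < n_atoms]
    if out_of_range:
        raise ValueError(
            f"indices {out_of_range} are outside 0..{n_atoms - 1}"
        )
    return indices
```

With RDKit installed, `validate_atom_indices(Chem.MolFromSmiles("CCO"), [2])` returns `[2]`, which can then be passed as `toDelete`.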
Reproducing the Results
Dataset: Common Substructure Pair
Note: A preprocessed and filtered version of the Common Substructure Pair dataset is included in the repository and can be found in the data_pubchem directory.
Regenerate the Dataset
To regenerate the dataset:
python -m scripts.superstructures.fetch_superstructures --output_folder data_pubchem --dataset_name common_substructure_pair_dataset
This script fetches superstructures from PubChem, processes them, and saves the results in the specified directory.
Pairs Experiment
This experiment evaluates different masking strategies over pairs of molecules that share common substructures.
Step 1: Train a Model
Step 2: Run Pairs Experiment
- Single anchor
python -m scripts.pairs_experiment.pairs_prediction --output_folder single_anchor_output --model_path checkpoints/gin/model_trained_without_salts_hidden_512_dropout_0.3_seed_5.pth --pairs_dataset data_pubchem/common_substructure_pair_dataset.json --size_model 512 --same_anchors --number_of_anchors 1
- Multiple anchors
python -m scripts.pairs_experiment.pairs_prediction --output_folder 2_or_more_anchors_output --model_path checkpoints/gin/model_trained_without_salts_hidden_512_dropout_0.3_seed_5.pth --pairs_dataset data_pubchem/common_substructure_pair_dataset.json --size_model 512 --number_of_anchors 2 --same_anchors
- No anchor restrictions (Both variants)
python -m scripts.pairs_experiment.pairs_prediction --output_folder no_restrictions_output --model_path checkpoints/gin/model_trained_without_salts_hidden_512_dropout_0.3_seed_5.pth --pairs_dataset data_pubchem/common_substructure_pair_dataset.json --size_model 512 --same_anchors
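The three invocations above differ only in the output folder and the anchor flags, so they can be generated from one template. The sketch below uses only the flags and paths shown above; the function name `pairs_commands` and the idea of building the commands from Python are illustrative, not something the repository provides.

```python
import shlex

MODULE = "scripts.pairs_experiment.pairs_prediction"
MODEL_PATH = (
    "checkpoints/gin/"
    "model_trained_without_salts_hidden_512_dropout_0.3_seed_5.pth"
)
PAIRS_DATASET = "data_pubchem/common_substructure_pair_dataset.json"

# Output folder -> anchor flags, copied from the three commands above.
VARIANTS = {
    "single_anchor_output": ["--same_anchors", "--number_of_anchors", "1"],
    "2_or_more_anchors_output": ["--number_of_anchors", "2", "--same_anchors"],
    "no_restrictions_output": ["--same_anchors"],
}


def pairs_commands(python="python"):
    """Build one argv list per variant of the pairs experiment."""
    commands = []
    for output_folder, anchor_flags in VARIANTS.items():
        commands.append(
            [python, "-m", MODULE,
             "--output_folder", output_folder,
             "--model_path", MODEL_PATH,
             "--pairs_dataset", PAIRS_DATASET,
             "--size_model", "512"]
            + anchor_flags
        )
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in pairs_commands():
        print(shlex.join(argv))
```

To execute the runs instead of printing them, each list can be passed to `subprocess.run(argv, check=True)` from the repository root.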
Step 3: View the Results
Open the following notebook to visualize results:
Counterfactuals Experiment
This experiment evaluates different counterfactual generation methods.
Step 1: Train Models
Step 2: Run Counterfactuals Experiment
python -m scripts.counterfactuals_experiment.counterfactuals_generation --model_size 512 --seed <SEED> --dataset <DATASET> --model_path <MODEL_PATH>
Parameters
| Argument | Description | Used Values |
|---|---|---|
| --model_size | Hidden size of the model | 512 |
| --seed | Random seed | 5, 15, 25 |
| --dataset | Name of the dataset | CYP3A4_Veith, CYP2D6_Veith, hERG |
| --model_path | Path to the trained model file | e.g., checkpoints/gin_cyp2d6_veith/model_CYP2D6_Veith_hidden_512_dropout_0.3_seed_15.pth |
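Since the table lists three seeds and three datasets, a full sweep is nine runs. The sketch below enumerates them and is illustrative only. The default `model_path_for` generalizes the single example path in the table to the other datasets and seeds, which is an assumption about the checkpoint naming; replace it with your own paths if they differ.

```python
import itertools
import shlex

SEEDS = (5, 15, 25)
DATASETS = ("CYP3A4_Veith", "CYP2D6_Veith", "hERG")
MODULE = "scripts.counterfactuals_experiment.counterfactuals_generation"


def model_path_for(dataset, seed, hidden=512):
    """ASSUMPTION: extrapolates the one example path given in the table."""
    return (
        f"checkpoints/gin_{dataset.lower()}/"
        f"model_{dataset}_hidden_{hidden}_dropout_0.3_seed_{seed}.pth"
    )


def counterfactual_commands(path_for=model_path_for):
    """Build one argv list per (dataset, seed) combination."""
    commands = []
    for dataset, seed in itertools.product(DATASETS, SEEDS):
        commands.append([
            "python", "-m", MODULE,
            "--model_size", "512",
            "--seed", str(seed),
            "--dataset", dataset,
            "--model_path", path_for(dataset, seed),
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the nine commands instead of executing them.
    for argv in counterfactual_commands():
        print(shlex.join(argv))
```

Passing a different `path_for` callable keeps the sweep logic unchanged when checkpoints live elsewhere.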
Step 3: View the Results
Parameters
| Argument | Description | Used Values |
|---|---|---|
| --dataset | Name of the dataset | CYP3A4_Veith, CYP2D6_Veith, hERG |
Explainers Experiment
Step 1: Train Models
Training parameters are defined in scripts/explainers_experiment/config.yaml.
Step 2: Run Explainers Experiment
Experiment parameters are defined in scripts/explainers_experiment/config.yaml.
Step 3: Summary of Results (Table 3)
License and legal
This project is released under the MIT License.
Contact
For questions, please open an issue on GitHub or contact Tomasz Danel or Łukasz Janisiów (tomasz.danel <at> uj.edu.pl, lukasz.janisiow <at> doctoral.uj.edu.pl).



