Quick Start

This guide shows how to train the spacer model directly from the command line using the provided train.py script. This approach is ideal for batch experiments or server environments without notebooks.

—

Running Spacer from the Command Line

After installation, navigate to your spacer directory and run:

(spacer) $ python train.py \
    --data path/to/training_data.h5ad \
    --reference_gene path/to/reference_genes.csv \
    --output_dir results/ \
    --engage_cell tcell \
    --learning_rate 0.0001 \
    --num_epochs 1000 \
    --patience 5 \
    --delta 0.001 \
    --max_instances 500 \
    --n_genes 10000 \
    --direction positive

The script automatically detects and uses a CUDA-capable GPU if one is available.
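For batch experiments, the command above can also be generated programmatically. The following is a minimal sketch, assuming train.py sits in the current working directory and accepts the flags shown above; the helper name build_command and the one-directory-per-run layout are hypothetical.

```python
import shlex
import sys


def build_command(learning_rate, output_dir,
                  data="path/to/training_data.h5ad",
                  reference_gene="path/to/reference_genes.csv"):
    """Return the argv list for one train.py run (hypothetical helper)."""
    return [
        sys.executable, "train.py",
        "--data", data,
        "--reference_gene", reference_gene,
        "--output_dir", output_dir,
        "--learning_rate", learning_rate,
        "--direction", "positive",
    ]


# One command per learning rate, each writing to its own output directory.
commands = [build_command(lr, f"results/lr_{lr}") for lr in ("0.001", "0.0001")]

for cmd in commands:
    # Only print here; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Keeping each run in a separate output directory prevents one run's best_model.pth and training_metrics.csv from overwriting another's.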

—

Argument Descriptions

The following command-line arguments are supported by train.py:

Argument	Description
--training_mode	Select "single" or "joint" training.
--data	Path to the input dataset (e.g., .h5ad or .csv).
--reference_gene	Path to a CSV file listing all reference genes.
--output_dir	Directory where models, metrics, and spacer scores are saved.
--engage_cell	Engage cell type used as bag centers (default: tcell).
--learning_rate	Learning rate for the optimizer (default: 0.0001).
--num_epochs	Total number of training epochs (default: 1000).
--patience	Early stopping patience for validation loss (default: 5).
--delta	Minimum improvement to reset early stopping (default: 0.001).
--max_instances	Maximum number of instances per bag (optional).
--n_genes	Number of top expressed genes in recruiting cell types to include.
--gene_weighting	Choose how to normalize gene-expression weights across genes.
--direction	Select "positive" (induce) or "negative" (repel) training.
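The flags in the table map naturally onto Python's argparse module. The sketch below is an illustration only, not the actual parser in train.py: defaults are taken from the table where it states them, while the choices lists, the required flags, and the missing defaults elsewhere are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    p = argparse.ArgumentParser(description="Train the spacer model (sketch).")
    p.add_argument("--training_mode", choices=["single", "joint"])  # default not documented
    p.add_argument("--data", required=True)
    p.add_argument("--reference_gene", required=True)
    p.add_argument("--output_dir", required=True)
    p.add_argument("--engage_cell", default="tcell")
    p.add_argument("--learning_rate", type=float, default=0.0001)
    p.add_argument("--num_epochs", type=int, default=1000)
    p.add_argument("--patience", type=int, default=5)
    p.add_argument("--delta", type=float, default=0.001)
    p.add_argument("--max_instances", type=int, default=None)  # optional cap per bag
    p.add_argument("--n_genes", type=int)
    p.add_argument("--gene_weighting")  # accepted values are not listed in this guide
    p.add_argument("--direction", choices=["positive", "negative"])
    return p


# Parse a minimal command line to see which defaults apply.
args = build_parser().parse_args([
    "--data", "path/to/training_data.h5ad",
    "--reference_gene", "path/to/reference_genes.csv",
    "--output_dir", "results/",
])
print(args.engage_cell, args.learning_rate, args.patience)
```

Run python train.py --help to see the authoritative list of flags and defaults.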

—

Script Overview

The train.py script performs the following steps:

Load Reference Genes: Reads the list of all genes from the specified reference file (reference_gene). In this study, we use all human/mouse genes as the reference gene set. All human genes are provided in the data/ folder of the repository:
- data/human_reference_genes.csv

Initialize the Model: Builds the model structure and initializes parameters, with modules for:
- distance attention
- gene expression weighting
- spacer module scoring

Create Dataset and DataLoaders: Loads the bag-level dataset via BagsDataset, then splits it into 70% training and 30% validation.

Train the Model: Optimizes binary cross-entropy (BCE) loss using the AdamW optimizer. Early stopping monitors validation loss (patience, delta).

Validate and Save Best Model: Evaluates validation AUROC for each epoch and saves the best-performing weights as best_model.pth.

Log Training Metrics: Saves epoch-level metrics (train_loss, val_loss, val_AUROC) to training_metrics.csv.

Track Spacer Scores: For each epoch, saves spacer_score_changes_epoch_X.csv, showing gene-level spacer scores before and after training.

Final Model Output: The fully trained model is stored as final_model.pth in your output directory.
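The early-stopping rule used during training can be sketched in plain Python. This is an assumption about how patience and delta interact: training stops after patience consecutive epochs in which validation loss fails to improve on the best value by more than delta. The class name EarlyStopping and the strict inequality are illustrative, and train.py may implement the rule differently.

```python
class EarlyStopping:
    """Hypothetical early-stopping tracker driven by validation loss."""

    def __init__(self, patience=5, delta=0.001):
        self.patience = patience
        self.delta = delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.delta:
            # Improvement larger than delta: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            # No sufficient improvement this epoch.
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative loss values only, not results from a real run.
stopper = EarlyStopping(patience=2, delta=0.001)
losses = [0.70, 0.60, 0.5995, 0.5999, 0.50]
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

With these values, epochs 3 and 4 improve on the best loss by less than delta, so the loop stops before reaching epoch 5. A larger patience tolerates longer plateaus at the cost of extra training time.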

—

Example Outputs

After training completes, your output_dir will contain:

results/
├── best_model.pth
├── final_model.pth
├── training_metrics.csv
├── spacer_score_changes_epoch_1.csv
├── spacer_score_changes_epoch_2.csv
└── ...

Each spacer_score_changes_epoch_X.csv file summarizes the gene-specific spacer scores for that epoch, with genes sorted by the magnitude of their spacer score.
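Once training has finished, training_metrics.csv can be inspected with the standard library. The sketch below assumes the file has a header row containing the train_loss, val_loss, and val_AUROC columns plus an epoch column; the epoch column name and the helper best_epoch are assumptions, so adjust them to match the actual file.

```python
import csv


def best_epoch(path, metric="val_AUROC"):
    """Return (epoch, value) for the row with the highest metric value."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    # CSV values are read as strings, so convert before comparing.
    best = max(rows, key=lambda row: float(row[metric]))
    return int(best["epoch"]), float(best[metric])


# Example call (path assumes --output_dir results/):
# epoch, auroc = best_epoch("results/training_metrics.csv")
```

The epoch returned here should correspond to the weights saved as best_model.pth, if the script selects the best model by validation AUROC as described above.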

—

Tips

GPU Acceleration: spacer automatically uses CUDA if available. You can verify this in the log output.
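To check GPU availability before launching a run, a short script such as the following can help. It assumes spacer is built on PyTorch (suggested by the .pth checkpoints and the AdamW optimizer, though this guide does not state it outright) and reports the CPU when PyTorch cannot be imported; the function name pick_device is hypothetical.

```python
def pick_device():
    """Return "cuda" if PyTorch sees a CUDA device, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Training would run on: {pick_device()}")
```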