Data Preparation
================

Before training **spacer**, prepare your spatial transcriptomics dataset in a `.h5ad` format compatible with **Scanpy**.  
This section describes the preprocessing, annotation, and structure requirements for input data.

---

Preprocessing Workflow
----------------------

Spacer expects a **Scanpy AnnData** object (`adata`) as input.  
You can generate this object from raw count matrices using **Scanpy** following these preprocessing steps:

.. code-block:: python

   import scanpy as sc

   # Load raw data (example)
   adata = sc.read_10x_h5("sample_filtered_feature_bc_matrix.h5")

   # Step 1: Filter low-quality data
   sc.pp.filter_cells(adata, min_genes=n)
   sc.pp.filter_genes(adata, min_cells=m)

   # Step 2: Normalize and log-transform
   sc.pp.normalize_total(adata, target_sum=1e4)
   sc.pp.log1p(adata)

   # Step 3: Annotate cell types (user-defined)
   # e.g., adata.obs['cell_type'] = cell_type_annotation_vector

---

Required Data Structure
-----------------------

Spacer requires the `.h5ad` object to contain the following essential fields:

- **Expression matrix (`adata.X`)**: normalized and log-transformed gene expression values.  
- **Metadata table (`adata.obs`)**: must contain the following columns:

+--------------------+--------------------------------------------------------------------------+
| **Column**         | **Description**                                                          |
+====================+==========================================================================+
| `X`, `Y`           | Spatial coordinates (in microns or pixel units) of each cell.            |
+--------------------+--------------------------------------------------------------------------+
| `cell_type`        | Integer-encoded major cell class:                                        |
|                    | - `0` → other cell                                                       |
|                    | - `1` → recruiting cell                                                  |
|                    | - `2` → engaging cell (customizable, e.g., T/B/macrophage)               |
+--------------------+--------------------------------------------------------------------------+
| `<EngagingTag>`    | Binary indicator for whether the cell at the center of the bag belongs   |
|                    | to the target engaging cell type (`1`) or not (`0`).                     |
+--------------------+--------------------------------------------------------------------------+


The `<EngagingTag>` column defines which cells will serve as the **center** for each neighborhood (“bag”) in spacer.  
Below is the default mapping used in our study:

.. code-block:: python

   mapping = {
       'tcell': 'T',
       'bcell': 'B',
       'macrophage': 'Macrophage',
       'neutrophil': 'Neutrophil',
       'fibroblast': 'Fibroblast',
       'endothelial': 'Endothelial',
   }

For example, if you are studying **T-cell recruitment**, the column name in `adata.obs` should be `"T"`,  
and its values should be `1` for T cells and `0` for all other cells.

---

Customizing the Mapping
-----------------------

In this work, we used the above mapping to ensure **consistent annotation across datasets** involving multiple stromal and immune cell types.  
Each key in the mapping corresponds to a general immune or stromal population, while the assigned value (e.g., `"T"`, `"B"`, `"Macrophage"`)  
serves as a compact label for downstream modeling and visualization.  

However, this mapping is **fully customizable**.  
Users can freely modify or extend it to match their experimental context or cell annotation schema.  
For instance, if you are analyzing **brain tissues**, you could define:

.. code-block:: python

   mapping = {
       'microglia': 'Microglia',
       'astrocyte': 'Astrocyte',
       'oligodendrocyte': 'Oligodendrocyte',
   }

Spacer will automatically adapt its bag construction and learning process to your new mapping.  
The only requirement is that the corresponding binary column (e.g., `"Microglia"`) exists in `adata.obs`  
with values `1` for target cells and `0` otherwise.