Dictionary learning for integrative, multimodal and scalable single-cell analysis

Bridge integration procedure

Our bridge integration procedure is designed to perform integration of single-cell datasets profiling different modalities by leveraging a separate multiomic dataset as a molecular bridge. The individual multiomic profiles each represent individual atoms, which together comprise a multiomic dictionary (that is, each cell in the bridge dataset represents an atom, and the entire bridge dataset represents a dictionary). This dictionary is used to transform both unimodal datasets into a shared space defined by the same set of features, facilitating cross-modality integration. Our approach consists of the following four broad steps described in detail below: (1) within-modality harmonization of unimodal and bridge datasets, (2) construction of a dictionary representation for each unimodal dataset, (3) dimensional reduction via Laplacian eigen decomposition and (4) alignment of dictionary representations across datasets. We illustrate each step of the method in Fig. 1b using the same mathematical notations that we introduce below.

All methods are implemented in our open-source R package Seurat (www.satijalab.org/seurat and www.github.com/satijalab/seurat).

Within-modality harmonization of unimodal and bridge datasets

The first step in our procedure is to harmonize the unimodal and bridge datasets based on shared modalities. For example, when performing bridge integration to map an scATAC-seq dataset onto an scRNA-seq reference (via a 10x multiome bridge), we first harmonize the gene expression measurements from the scRNA-seq and multiome experiments and the chromatin accessibility measurements from the scATAC-seq and multiome experiments. Specifically, we define the following:

\(X \in ^}} \times d_}}}\) is the scRNA-seq expression counts matrix,

\(Y \in ^}} \times d_}}}\) is the scATAC-seq accessibility counts matrix,

\(M = [M_XM_Y]\) is the multiomic expression + accessibility counts matrix, where

\(M_X \in ^}} \times d_}}}\) is the scRNA-seq subset of the multiomic matrix and

\(M_Y \in ^}} \times d_}}}\) is the scATAC-seq subset of the multiomic matrix.

Our goal is to harmonize X and MX and Y and MY. This can be performed with a wide variety of existing tools for the harmonization of single-cell datasets. For example, Seurat, Harmony, LIGER, scVI, Scanorama, fastMNN, scVI and scArches all learn a shared low-dimensional space that jointly represents the datasets and aligns cells in a matched biological state. Our goal is therefore to learn

\(X^ \ast \in ^}} \times d_}}}\), harmonized space for scRNA-seq data,

\(Y^ \ast \in ^}} \times d_}}}\), harmonized space for scATAC-seq data, and

$$M^ \ast = [M_X^ \ast M_Y^ \ast ],$$

where

\(M_X^ \ast \in ^}} \times d_}}}\) is the harmonized space for the scRNA-seq subset of the multiomic dataset and

\(M_Y^ \ast \in ^}} \times d_}}}\) is the harmonized space for the scATAC-seq subset of the multiomic dataset.

In this work, we treat the scRNA-seq dataset X as a reference and map the multiomic gene expression profiles (MX) onto this reference using the FindTransferAnchors and MapQuery functions in Seurat to obtain X* and \(M_X^ \ast\). An example workflow is provided at https://satijalab.org/seurat/articles/integration_mapping.html (‘Mapping and Annotating Query Datasets’).

The same functionality has been implemented in the Signac package for the mapping and harmonization of scATAC-seq datasets (https://satijalab.org/signac/articles/integrate_atac.html). However, we emphasize that our approach is compatible with a wide variety of preexisting approaches for within-modality harmonization, including all the methods listed above.

We also note that when finding anchors between the bridge and query datasets, we can leverage the multimodal nature of the bridge dataset to perform ‘supervised’ dimensional reduction, which uses both modalities when calculating a low-dimensional representation during harmonization. For example, we have previously described the use of ‘supervised PCA’ to learn optimized transformations from CITE-seq data4,77. When working with bridge datasets that measure ATAC-seq or CUT&Tag chromatin features (for example, Paired-Tag and 10x multiome), we use an analogous procedure for supervising the latent semantic indexing reduction.

Construction of a dictionary representation for each unimodal dataset

The goal of dictionary learning is to reconstruct individual data points as a weighted linear combination of atoms in a dictionary. We treat M* as a dictionary, with each row of this matrix representing an atom. We aim to learn reconstructions of X* and Y* based on the atoms of M* while minimizing the error between the original and reconstructed values. Specifically, we aim to identify the matrices DX and DY, where

\(D_X \in ^}} \times n_}}}\) is the dictionary representation of the scRNA-seq dataset, and

\(D_Y \in ^}} \times n_}}}\) is the dictionary representation of the scATAC-seq dataset, such that

$$\arg \mathop\limits_(||D_X(M_X^ \ast ) - X^ \ast ||_F^2 + ||D_X||_F^2)$$

and

$$\arg \mathop\limits_\left( \right).$$

As described in refs. 56,78, this optimization problem is analogous to matrix regression and has a closed-form solution for calculating DX and DY,

$$D_X = X^ \ast (M_X^ \ast )^$$

$$D_Y = Y^ \ast (M_Y^)^,$$

where † represents the pseudoinverse of the matrix.

We note that DX and DY represent transformations of the original scRNA-seq and scATAC-seq datasets. While the two experiments originally measured different sets of features, after the transformation, they now are represented by the same set of features, namely, the atoms of the multiomic experiment.

Dimensional reduction via Laplacian eigen decomposition

After the datasets have been transformed in the previous step, it is possible to integrate them directly. The dimensionality of the datasets is based on the number of cells in the multiomic dataset. Unlike the original measurements, the dictionary representations are not sparse. As multiomic datasets often consist of thousands of cells, working with high-dimensional and non-sparse dictionary representations is computationally inefficient. We therefore aimed to reduce the dimensionality of the dictionary representation. Motivated by a similar problem addressed by Laplacian eigenmaps42, a non-linear dimensionality reduction technique, we perform dimensionality reduction by computing an eigen decomposition of the graph Laplacian matrix. Unlike a PCA, which aims to identify low-dimensional representations that preserve data variance, Laplacian eigenmaps represent a low-dimensional reduction that optimally preserves the graph-defined local neighbor relationships42.

We first compute a graph representation of the multiomic dataset M*. We use a ‘shared nearest neighbor’ graph representation, as proposed by Levine et al.79 for clustering single-cell datasets. We note that the matrix representation of this graph is symmetric, which is a requirement for downstream eigen decomposition. Our approach is compatible with any user-defined distance metric when constructing this graph, although we recommend using either the Euclidean distance based on harmonized gene expression measurements (that is, \(M_X^ \ast\)) or, alternately, a weighted combination of modalities using the ‘weighted nearest neighbor’ distance metric that we have previously introduced4. We define

\(G \in ^}} \times n_}}}\) as the symmetric graph representation of the multiomic dataset and

\(L = I - D^}GD^}\) as the graph Laplacian matrix.

We next perform an eigen decomposition of the graph Laplacian matrix:

$$0 = \lambda _1 \le \lambda _2 \le \ldots \le \lambda _n.$$

Here, UL is the leftmost nLaplacian eigenvectors of U, where n specifies the reduced dimensionality of the dataset. We select nLaplacian = 50 for all examples in this work.

We now multiply the learned dictionary representations for the scRNA-seq and scATAC-seq datasets by this truncated set of eigenvectors. Doing so transforms these representations into the same lower-dimensional space (nLaplacian). We define

\(L_X \in ^}} \times n_}}}\) as the reduced dictionary representation for the scRNA-seq data,

\(L_Y \in ^}} \times n_}}}\) as the reduced dictionary representation for the scATAC-seq data and

\(L_M \in ^}} \times n_}}}\) as the reduced dictionary representation for the multiomic datasetand calculate the following matrices:

$$L_X = D_XU_L = X^ \ast \left( U_L} \right),$$

$$L_Y = D_YU_L = Y^ \ast \left( U_L} \right)$$

and

Alignment of dictionary representations across datasets

Both the scRNA-seq and scATAC-seq datasets have now been transformed into a low-dimensional space defined by the same set of features. They can now be directly harmonized using existing methods. As in step 1, multiple published methods can accomplish this goal. In this work, we use our internal implementation of the mnnCorrect integration technique to perform this harmonization39. We choose mnnCorrect, as we find that after performing the steps described above, any remaining sample-specific differences are minor and are typically far less than the differences we observe when aligning scRNA-seq datasets across different technologies. To demonstrate the compatibility of our approach with alternative methods, we also repeat our quantitative benchmarking experiments using our previously developed integration workflow in Seurat v3 (ref. 19) and observe very similar results (Supplementary Fig. 3).

Specifically, the final output of our procedure represents

\(L_X^ \ast \in ^}} \times n_}}}\) as the harmonized reduced dictionary representation for the scRNA-seq data,

\(L_Y^ \ast \in ^}} \times n_}}}\) as the harmonized reduced dictionary representation for the scATAC-seq data and

\(L_M^ \ast \in ^}} \times n_}}}\) as the harmonized reduced dictionary representation for the multiomic dataset.

These representations can be used as input for common downstream analytical tasks, including t-distributed stochastic neighbor embedding (t-SNE) or UMAP visualization, graph-based clustering and the identification of developmental trajectories.

Atomic sketch integration

Our approach consists of four steps: (1) for each dataset, sample a representative subset of cells (atoms) that span both rare and abundant populations; (2) for each dataset, learn a dictionary representation to reconstruct each cell based on the atoms; (3) integrate the atoms from each dataset and (4) for each dataset, reconstruct each cell from the integrated atoms. Each step is described in detail below. We note that steps 1, 2 and 4 are performed on each dataset individually, and step 3 only requires performing joint computation on the downsampled set of atoms. Therefore, our procedure never requires loading or processing the entirety of the datasets at one time. Our approaches should therefore successfully extend to and beyond the analysis of 100,000,000 cells, which is now an achievable scale for combinatorial barcoding technologies.

All methods are implemented in our open-source R package Seurat (www.satijalab.org/seurat, www.github.com/satijalab/seurat).

Sample a representative subset of cells (‘atoms’) from each dataset

Our first step is to selectively downsample the cells in each dataset, aiming to identify a reduced set of cells that are representative of the full dataset. In particular, we aim to ensure that rare populations continue to be represented even after downsampling. We also aim to identify cell subsets in a computationally efficient manner and to minimize any computation that must be performed on the full dataset before downsampling. We aim to select a subset of k cells from each dataset, each of which is referred to as an atom. In this manuscript, we use k = 5,000, unless otherwise noted.

We define

\(X \in ^}} \times d_}}}\) as the count matrix for scRNA-seq and

\(S \in ^}}}\) as the sampling matrix for the dataset; each row is one-hot row vector matrix indicating which cells are selected (that is, \(s_ = 1\) if cell i is the jth cell to be selected; \(i = 1,2,\ldots,n_}}\) and \(j = 1,2,\ldots,k\)).

\(SX \in ^}}}\) is the scRNA-seq matrix after downsampling to the k cells selected. We also call this matrix A, as it represents the ‘atoms’ selected from the original dataset.

We can use a variety of techniques to define the sketching matrix S. These include geometric sketching techniques, such as geosketch53 or Hopper54, or fast clustering procedures, such as mini-batch k-means55 followed by cluster-informed downsampling.

In this work, we select cells based on their statistical leverage scores, a method for selecting influential data points in a dataset. In the context of linear regression, statistical leverage represents the influence of an individual data point in determining the best least-squares fit. In this context, cells with high leverage scores will tend to make the largest contribution to the gene covariance matrix and, therefore, reflect the importance of the cell’s profile. The exact statistical leverage score for a cell can be computed via an eigen decomposition of the X matrix, but this is computationally inefficient. As an alternative, Clarkson and Woodruff56 propose a randomized algorithm that efficiently computes a fast approximation of statistical leverage56. This algorithm is attractive for single-cell sequencing analysis as it is highly scalable and runs efficiently on sparse datasets. Briefly, the algorithm amounts to constructing a ‘randomized’ sketch of the input matrix based on the Johnson–Lindenstrauss lemma and computing the Euclidean norms of the rows of that sketch. The algorithm is fully described in Clarkson and Woodruff56, but we note the key mathematical steps below.

For the randomized sketching matrix, we use the sparse random CountSketch matrix C, which consists of 0, 1 and –1 elements and is defined in ref. 80.

\(C \in ^}}}\) is the sparse randomized CountSketch matrix.

We then perform a QR decomposition

We then apply a fast Johnson–Lindenstrauss transformation using a very sparse random projection matrix ∏81. We calculate this matrix using the RandPro package82 in R (‘li’ projection function),

$$Z = X \times (R^ \times ).$$

We can now calculate the leverage score for each cell, which are the Euclidean norms of the rows of the Z matrix. We can also calculate a sampling probability for selecting each cell i as an atom based on the leverage scores.

\(l_i = ||Z\,[i,]||_2^2\) is the leverage score for cell i, and

\(p_i = \frac}\nolimits_^n }}\) is the probability of selecting cell i as an atom.

Finally, we sample k cells as atoms based on these probabilities. As described above, this procedure results in a downsampled dataset in which only the atoms remain, which we name A.

Learn a dictionary representation to reconstruct each cell based on the atoms

We aim to learn reconstructions of X based on the atoms of A while minimizing the error between the original and reconstructed values. Specifically, we aim to identify the matrix D, where

\(D \in ^}} \times k}\) is the dictionary representation of the scRNA-seq dataset

such that

$$\arg \min _D(||DA - X||_F^2 + ||D||_F^2).$$

As described previously, this optimization problem is analogous to matrix regression and has a closed-form solution for calculating D

where † represents the pseudoinverse of the matrix.

Integrate the atoms from each dataset

Let \(i = 1,2,\ldots,n_}}\) represent the datasets to be integrated, and let Ai represent the matrix of atoms that result from downsampling dataset i. Our goal is to harmonize the set of matrices \([A_1,A_2,\ldots,A_}}}]\).

This can be performed with a wide variety of existing tools for the harmonization of single-cell datasets. For example, Seurat, Harmony, LIGER, scVI, Scanorama, fastMNN, scVI and scArches all learn a shared low-dimensional space that jointly represents the datasets and aligns cells in a matched biological state together. Our goal is therefore to learn

$$[A_1^ \ast ,A_2^ \ast ,\ldots,A_}}}^ \ast ],$$

where \(A_i^ \ast \in ^}} \times d_}}}\) is the harmonized space for scRNA-seq dataset i.

In this manuscript, we use our previously developed anchor-based workflow to integrate the datasets using reciprocal PCA, which is optimized for integration tasks with large numbers of samples and cells (‘fast integration using reciprocal PCA’ at https://satijalab.org/seurat/articles/integration_rpca.html). The integration procedure returns a low-dimensional space that jointly represents atoms from all datasets.

Reconstruct each cell from the integrated atoms

The last step is performed individually for each dataset. Let \(i = 1,2,\ldots,n_}}\) represent the datasets to be integrated, and let Xi represent the full scRNA-seq count matrix representing dataset i.

We reconstruct integrated values for each cell in dataset i using the previously computed dictionary representation for the dataset along with the harmonized space \(A_i^ \ast\),

$$X_i^ \ast = D_iA_i^ \ast = X_i(A_i^ A_i^ \ast ).$$

The collection of matrices \(\left[ }}}^ \ast } \right]\) now represents a low-dimensional space that jointly represents all cells from all datasets. Because these matrices are low dimensional, each of them can be simultaneously loaded into memory. These representations can be used as input for common downstream analytical tasks, including t-SNE or UMAP visualization, graph-based clustering and the identification of developmental trajectories.

Preprocessing details for each datasetAdult mouse frontal cortex and hippocampus Paired-Tag dataset

The datasets from Zhu et al.26 were generated with Paired-Tag, which performs simultaneous profiling of histone modifications and cellular transcriptomes and contains a total of 64,849 nuclei. We extracted three datasets for the histone modifications H3K27ac, H3K4me1 and H3K27me3. We used the gene expression matrices as quantified in the original experiment. For each epigenetic modification, the original manuscript quantified read densities in 5,000 bins. These were aggregated into larger peaks using the CombineTiles function in Signac, and aggregated peaks less than 1 megabase in size were retained. We retained cells with total RNA counts between 500 and 10,000. We applied SCTransform to normalize the gene expression data and TF-IDF to normalize the histone modification data. We used PCA (dimensions 1:30) and TF-IDF (dimensions 2:30, excluding the first dimension, as this is typically correlated with technical metrics in ATAC-seq or scCUT&Tag data) to reduce the dimensionality of the RNA and histone modification modalities and construct the weighted nearest neighbor (WNN) graph.

Data acquisition source: Gene Expression Omnibus (GEO), accession number GSE152020.

Human frontal cortex snmC-seq data

This human frontal cortex dataset is an snmC-seq dataset from Luo et al.51 and contains 2,784 nuclei. We used the non-CG methylation 100,000-kb bin count matrices as quantified in the original experiment. We applied SCTransform83 to normalize the gene expression data and log normalization to normalize the methylation data. Because this dataset was used as a query dataset in this manuscript, we did not perform unsupervised dimensionality reduction on the methylation data.

Data acquisition source: GEO, accession number GSE97179 (https://brainome.ucsd.edu/annoj/brain_single_nuclei/).

Human frontal cortex snmC2T-seq data

This human frontal cortex dataset is an snmC2T-seq dataset from Luo et al.28 and contains 4,357 nuclei. We used the non-CG methylation 100,000-kb bin count matrices as quantified in the original experiment. We applied SCTransform to normalize the gene expression data and log normalization to normalize the methylation data. We used PCA to reduce the dimensionality to 30 for both datasets and construct the WNN graph.

Data acquisition source: GEO, accession number GSE140493.

BMMC multiome

We collected a total of ten 10x multiome datasets from the NeurIPS Multimodal Single-Cell Data Integration challenge website, representing 32,368 paired single-nucleus profiles of transcriptome and chromatin accessibility. We retained cells with total RNA counts between 1,000 and 10,000 and with total ATAC peak counts between 2,000 and 30,000. We applied SCTransform to normalize the gene expression data and TF-IDF to normalize ATAC peak counts. We used PCA (dimensions 1:40) and TF-IDF (dimensions 2:40) to reduce the dimensionality of each modality and construct the WNN graph.

Data acquisition source: https://openproblems.bio/competitions/neurips_2021/.

Human BMMC ATAC-seq

This human bone marrow dataset is an snATAC-seq dataset from Granja et al.43. As the reads were originally mapped to hg19, we used cellranger-atac v2 to remap fastq files to hg38. In each cell, we quantified the same set of peaks that were detected in the BMMC multiome dataset. After removing low-quality cells, 26,159 cells were retained, with total ATAC peaks of <50,000 and >2,000. We applied TF-IDF to normalize the ATAC-seq data. As this dataset was used as a query dataset in this manuscript, we did not perform unsupervised dimensionality reduction on the ATAC-seq data.

Data acquisition source: GEO, accession number GSE139369.

Human PBMC scRNA

This human PBMC scRNA dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/resources/datasets/) and consists of 33,015 cells. We retained cells with total RNA counts between 400 and 10,000. We applied log normalization for the gene expression matrix. We annotated these cells by mapping them to the Azimuth PBMC reference with the Seurat4 reference-mapping framework and refined the annotations by de novo clustering. These data were used for sketching benchmark analysis (Supplementary Fig. 5).

Data acquisition source: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc33k.

Human PBMC mulitome

This human PBMC multiome (RNA + ATAC) dataset was obtained from the 10x Genomics website (https://www.10xgenomics.com/resources/datasets/) and consists of 10,970 cells. We retained cells with total RNA counts between 500 and 10,000 and total ATAC peak counts between 2,000 and 100,000. We applied SCTransform to normalize the gene expression data and TF-IDF to normalize ATAC peak counts. We annotated these cells by mapping the RNA profile to the Azimuth PBMC RNA reference with the Seurat4 reference-mapping framework. We then used these annotations to create ATAC-seq tracks, shown in Supplementary Fig. 2d.

Data acquisition source: https://www.10xgenomics.com/resources/datasets/10-k-human-pbm-cs-multiome-v-1-0-chromium-x-1-standard-2-0-0.

Human CD34+ bone marrow multiome

This human CD34+ bone marrow multiome (RNA + ATAC) dataset was obtained from Persad et al.84 and consists of 13,398 cells from two replicates. We retained cells with total RNA counts between 500 and 30,000 and total ATAC peak counts between 1,000 and 100,000. We used the same normalization method used for the human PBMC multiome. We annotated these cells by mapping the RNA profile to the Azimuth RNA BMMC reference with the Seurat4 reference-mapping framework. When using the human PBMC multiome dataset, we did not observe sufficient numbers of ASDCs to create a chromatin track for this dataset. However, we identified 12 cells annotated as ASDCs in these CD34+ bone marrow data. We used these cells to generate a chromatin track for the SIGLEC6 locus (Supplementary Fig. 2e), which validates our predicted ASDC identified via bridge integration.

Data acquisition source: https://zenodo.org/record/6383269.

Human PBMC CyTOF dataset

This human PBMC CyTOF dataset was generated by the COVID-19 Multi-omics Blood Atlas COMBAT consortium and consists of 7.11 million cells with a panel of 47 antibodies. We removed cells from individuals with sepsis, yielding a remainder of 5.17 million cells. We used the normalized expression matrices as quantified in the original study. As this dataset was used as a query dataset in this manuscript, we did not perform unsupervised dimensionality reduction on the protein data.

Data acquisition source: https://zenodo.org/record/5139561.

Azimuth reference

Azimuth scRNA-seq references for human bone marrow (297,627 cells) and the human motor cortex (159,738 cells) were downloaded from the HuBMAP portal. The portal includes descriptions of each public data source used when compiling the reference dataset and a link to a GitHub repository and Docker Hub to reproduce the construction of the reference.

Data acquisition source: https://azimuth.hubmapconsortium.org.

Lung scRNA-seq dataset atlas

Nineteen datasets profiling human lung samples using scRNA-seq were downloaded from publicly available sources (links for each source dataset are provided in Supplementary Table 2). Low-quality cells were filtered using uniform quality control thresholds; cells with RNA counts between 300 and 100,000 and with mitochondrial read percentages below 20% were retained. Normalization was performed using log normalization implemented in Seurat. We used PCA (dimensions 1:40) to reduce the dimensionality of each dataset.

Data acquisition source: Supplementary Table 2 and lung scRNA datasets68,69,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101.

PBMC COVID scRNA-seq dataset atlas

Fourteen datasets profiling human PBMC samples using scRNA-seq were downloaded from publicly available sources (links for each source dataset are provided in Supplementary Table 2). Eleven of these datasets had been previously organized in Tian et al.60. Low-quality cells were filtered using uniform quality control thresholds; cells with RNA counts between 150 and 150,000 and with mitochondrial read percentages below 15% were retained. Normalization was performed using log normalization implemented in Seurat. We used PCA (dimensions 1:40) to reduce the dimensionality of each dataset.

Data acquisition source: Supplementary Table 2 and PBMC scRNA datasets4,62,63,102,103,104,105,106,107,108,109,110,111,112.

Differentiation trajectory and pseudotime analysis

In Fig. 2, we identify a myeloid differentiation trajectory and pseudotime ordering of cells that describes both reference (scRNA-seq) and query (scATAC-seq) cells. We extracted reference cells belonging to HSC, LMPP, GMP and CD14+ monocyte populations and query cells that mapped to any of these subsets after bridge integration. We next constructed a k-nearest neighbor (KNN) graph representing cells from both modalities using the latent space learned during the bridge integration procedure. This graph was used as input to the destiny package, which reduces the dimensionality of the data using diffusion maps113. We note that as we manually selected cell populations that are known to encompass monocytic differentiation, we did not expect or observe branching events. We used the first two diffusion map coordinates as input to monocle3 (ref. 114) to infer a pseudotemporal ordering.

We next aimed to identify cases where dynamic gene expression patterns ‘lag’ behind the accessibility dynamics of nearby regulatory regions. We can perform this analysis because our pseudotemporal ordering encompasses both scATAC-seq and scRNA-seq cells. We first associated each scATAC-seq peak with a gene using the ClosestFeature function in Signac. For each gene, we next smoothed the expression profile along the learned trajectory using the ksmooth function (‘stats’ package in R115) using 1,000 intervals and a bandwidth of 0.01. We repeated the same process for the accessibility of each peak linked to this gene (bandwidth of 0.05). We next calculated the cross-correlation of the smoothed expression and accessibility values, which measures the similarity for the two time series and calculates the optimal displacement of one relative to the other. We used the ccf function (‘stats’ package in R115) and identified a total of 574 gene–peak pairs with a cross-correlation of >0.6. Of these, we identified 236 cases exhibiting an optimal displacement of >0.01 (we illustrate 6 such cases in Fig. 2l).

Bridge cell downsampling analysis

To explore how the size and composition of the multiomic dataset affected the robustness of bridge integration, we performed 25 serial downsamplings of the entire BMMC multiomic dataset (200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 11,000, 12,000, 13,000, 14,000, 15,000, 20,000 and 30,000). We used one batch of the scATAC dataset (12,256 cells) as a query, repeated bridge integration and compared the resulting predictions with our original findings. As expected, we found that the degree of agreement after downsampling was cell-type dependent, as cells from abundant cell types were more robust to downsampling. We therefore expressed our results as a function of the number of cell types present in the bridge dataset for each cell type. For example, the 7,000-cell downsampled dataset contained 144 CD16+ monocytes (prediction concordance of 1.00) and 22 pro-B cells (prediction concordance of 0.66). The 2,000-cell downsampled dataset contained 41 CD16+ monocytes (prediction concordance of 0.94) and 6 pro-B cells (prediction concordance of 0.55). We aggregated all these results across downsamples and displayed the results in Fig. 2a. For visual clarity, we only showed an x axis range of 10 to 500 in Fig. 2a.

Bridge cell-type composition resampling analysis

To assess the robustness of our bridge integration procedure to the relative proportion of cell types in the bridge dataset, we scrambled the proportions and then repeated the bridge integration procedure. To accomplish this, we sampled 10,000 cells from the original bridge dataset without replacement. We set each cell’s sampling probability inversely to the proportion of cell types in the original dataset, ensuring that we would substantially alter cell-type composition. We then repeated bridge integration with the resampled dataset, mapping the same query dataset, and then compared the results in Supplementary Fig. 2.

Benchmark analysis with multiVI and Cobolt

To assess the performance of our bridge integration method alongside other recently proposed integration tools, we compared our results with multiVI49 from scvi-tools v 0.14.5 and Cobolt50 (v1.0.0). As both Cobolt and multiVI use variational autoencoders, both methods are run on a server with a discrete NVIDIA A100GPU with 40 gigabytes of memory and pyTorch-lightning v.1.3.8 installed. Seurat analyses are run on an Intel Xeon Platinum 8280L server and use a single computational core.

For multiVI, we used the scRNA-seq, scATAC-seq and multiomic RNA–ATAC paired counts matrices as input. We used the multiome_anndatas function to generate one anndata object for integration. We set batch information in categorical_covariate_keys, using the setup_anndata function. We then integrated the datasets by running the multiVI function, as outlined in the multiVI tutorial (https://docs.scvi-tools.org/en/stable/tutorials/notebooks/MultiVI_tutorial.html). We used 500 epochs for model training, as suggested in the multiVI tutorial. All other parameters were set to default settings. multiVI learns a latent space, which jointly represents cells across the scRNA-seq and scATAC-seq datasets. We extracted this space and performed nearest neighbor calculations and UMAP visualization in Seurat.

For Cobolt, we used the scRNA-seq, scATAC-seq and multiomic RNA–ATAC paired counts matrices as input. We used the SingleData function from cobolt_

留言 (0)

沒有登入
gif