Benchmarking computational methods for single-cell chromatin data analysis

Datasets and preprocessing

For our benchmark, we used 6 scATAC-seq or single-cell multi-omics datasets that are publicly available [9, 25,26,27] (see Additional file 2: Table S1; links to the public repositories can be found in the “Availability of data and materials” section). For datasets where the fragment files are publicly available, these were downloaded from the authors’ repositories. For datasets where the fragment files are not available, we downloaded the BAM files and used the command line tool Sinto [35] to create fragment files.

For each scATAC-seq dataset (or the ATAC component of a multi-omic dataset), we first performed per-cell quality control (QC) using ArchR (v1.0.3) [9] by thresholding the transcription start site enrichment score (TSSE) and the number of unique fragments; the thresholds for each dataset are listed in Additional file 2: Table S2. Then, we applied doublet removal using addDoubletScores() and filterDoublets() in ArchR. Key parameters for these two functions, including k in addDoubletScores() and filterRatio in filterDoublets(), are given in Additional file 2: Table S2. We then filtered the fragment files to keep only cells that passed QC. For single-cell multi-omics datasets, we filtered by QC of both the ATAC and the RNA modalities. QC of the RNA modality was conducted using Seurat (v4.3.0) [36] by thresholding nCount_RNA (> 800) and percent.mt for Chen2019, and nFeature_RNA, nCount_RNA, and percent.mt for 10XPBMC. Doublets were identified using the function scDblFinder() in the R package scDblFinder (v1.13.9) [24] and then removed. Before calling scDblFinder(), Louvain clustering [37] was performed in Seurat with resolution \(=0.5\) for Chen2019 and 0.8 for 10XPBMC, and the resulting clusters were passed to scDblFinder(). All these filtering steps were applied to the fragment files, and the final filtered fragment files were the inputs for the Snakemake pipeline.
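A minimal sketch of the ATAC QC and doublet-removal step with ArchR is shown below; the genome, the TSSE and fragment thresholds, and the k/filterRatio values are placeholders (the actual dataset-specific values are in Additional file 2: Table S2).

```r
library(ArchR)
addArchRGenome("hg38")  # genome assumed here for illustration

# Per-cell QC is applied at Arrow-file creation (placeholder thresholds shown)
arrows <- createArrowFiles(
  inputFiles  = "sample.fragments.tsv.gz",
  sampleNames = "sample",
  minTSS      = 4,      # TSSE threshold (placeholder)
  minFrags    = 1000    # unique-fragment threshold (placeholder)
)

proj <- ArchRProject(ArrowFiles = arrows)

# Doublet scoring and removal; k and filterRatio are dataset-specific (Table S2)
proj <- addDoubletScores(proj, k = 10)
proj <- filterDoublets(proj, filterRatio = 1.5)

# Cells passing QC, used to filter the fragment file downstream
keep_cells <- getCellNames(proj)
```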

Datasets from the human adult single-cell chromatin accessibility atlas

The human adult single-cell chromatin accessibility atlas [25] contains 111 distinct cell types across 30 tissues and is a rich resource for scATAC-seq data of different cell types. We took two subsets of this atlas as our evaluation datasets “Atlas1” and “Atlas2.” The idea is to use the tissue of origin as the ground truth for benchmarking clustering. By examining the fraction of cells from different tissues for each cell type, we found that many cell types exist exclusively in one tissue (Additional file 1: Fig. S19). Therefore, we selected tissue-cell-type pairs by first selecting a subset of cell types that have \(\ge 85\%\) of cells from the same tissue, and then, for each of the corresponding tissues, we selected one cell type randomly, excluding cell types with fewer than 300 cells. For each tissue with multiple samples, we selected the one sample that contains the maximum number of cells of that cell type passing QC. We used only one sample per cell type to eliminate any potential batch effect. For the tissue-cell-type pairs selected for each dataset, we downsampled cells per cell type, using fractions of 0.3 for Atlas1 and 0.5 for Atlas2. The tissue-cell-type pairs and the sampled cell IDs for “Atlas1” and “Atlas2” are available on GitHub [38].

scRNA-seq data annotation

For the 10XPBMC and Chen2019 datasets, the RNA modality was subjected to Leiden clustering in Seurat using resolution = 0.5 for Chen2019 and resolution = 0.8 for 10XPBMC. The resolution for each dataset was determined by first performing Leiden clustering using a series of resolution values (with the maximum resolution clearly over-clustering the dataset), then constructing a cluster tree showing the co-clustering consistency across resolutions [39], and finally choosing a resolution value that gives a stable clustering result and a reasonable separation of cells in the UMAP. Then, for each cell, a label was transferred by reference mapping [36] using Seurat. In the case of the 10XPBMC dataset, an annotated PBMC reference dataset [40] was utilized for label transfer. As for the Chen2019 dataset, the scRNA-seq data of the adult mouse brain from the Allen Brain Atlas [41] served as the reference dataset. Then, we performed some manual curation to get the final cluster annotation. We describe this process briefly below. For each Leiden cluster, the majority cell label was taken as the label of that cluster. If two clusters received the same label, we subsetted all cells from these two clusters and performed multiple rounds of clustering on this subset. If the split between these two clusters was stable, we labeled them separately, e.g., “CD4 Naive 1” and “CD4 Naive 2.” Otherwise, we merged the two clusters.
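A hedged sketch of the label-transfer and majority-vote steps with Seurat; `ref`, `query`, and the `celltype` column are illustrative names, and Leiden clusters are assumed to be set as the active identities of `query`.

```r
library(Seurat)

# Reference mapping / label transfer (object and column names are illustrative)
anchors <- FindTransferAnchors(reference = ref, query = query, dims = 1:30)
pred    <- TransferData(anchorset = anchors, refdata = ref$celltype)
query   <- AddMetaData(query, metadata = pred)

# Per-cluster annotation: majority transferred label within each Leiden cluster
cluster_labels <- tapply(query$predicted.id, Idents(query),
                         function(x) names(which.max(table(x))))
```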

Feature engineering methods

For all methods, we followed the procedures recommended in the author’s documentation.

Signac

Starting from the fragment file, Signac first uses MACS2 for peak calling, then performs LSI on the peak count matrix to obtain a low-dimensional representation. Peak calling was conducted in two ways: (1) aggregating all cells for peak calling (denoted as “Signac_all_cell_peaks”) or (2) aggregating cells and calling peaks for each cluster individually, followed by generating a consensus peak set from the peaks identified in all clusters, referred to as “Signac_by_cluster_peaks.” LSI consists of 3 steps: (1) normalization using term frequency-inverse document frequency (TF-IDF), (2) selecting the top 95% most common peaks, (3) performing singular value decomposition (SVD).

We used the R package Signac (v1.9.0) for its implementation. As suggested by the tutorial https://stuartlab.org/signac/articles/pbmc_multiomic.html, we created a fragment object, called peaks with MACS2, and removed peaks on nonstandard chromosomes, peaks overlapping genomic blacklist regions, and peaks with width \(<20\) bp or width \(>10000\) bp. To identify peaks per cluster, the cell embeddings generated by “Signac_all_cell_peaks” are used to define clusters using the Louvain algorithm with the default resolution of 0.8.

Subsequently, we performed normalization, feature selection, and linear dimensional reduction using RunTFIDF() with method=1, FindTopFeatures() with min.cutoff="q5", and RunSVD() with n=100, respectively. As noted in the tutorial, the first LSI components often capture sequencing depth. We therefore removed LSI components with a Pearson correlation greater than 0.75 with the total number of counts.
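A condensed sketch of this Signac workflow, assuming a Seurat object `obj` that already contains a ChromatinAssay built from the fragment file; the MACS2 path is a placeholder, and the depth filter mirrors the 0.75 correlation cutoff described above.

```r
library(Signac)
library(Seurat)

# Peak calling with MACS2 on all cells, then a cell-by-peak count matrix
peaks  <- CallPeaks(obj, macs2.path = "/path/to/macs2")
counts <- FeatureMatrix(fragments = Fragments(obj), features = peaks, cells = colnames(obj))
obj[["peaks"]] <- CreateChromatinAssay(counts = counts, fragments = Fragments(obj))
DefaultAssay(obj) <- "peaks"

# LSI: TF-IDF normalization, feature selection, SVD
obj <- RunTFIDF(obj, method = 1)
obj <- FindTopFeatures(obj, min.cutoff = "q5")
obj <- RunSVD(obj, n = 100)

# Drop LSI components strongly correlated with sequencing depth (|r| > 0.75)
depth     <- Matrix::colSums(counts)
depth_cor <- apply(Embeddings(obj, reduction = "lsi"), 2, function(z) cor(z, depth))
keep_dims <- which(abs(depth_cor) <= 0.75)
```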

Aggregation

The aggregation method starts with the peak count matrix, where the peak set is identified using the “Signac_by_cluster_peaks” method. The cell-by-peak fragment count matrix is then used for TF-IDF normalization and PCA. Minibatch k-means clustering is applied in the PCA space to cluster peaks into meta-features (K = 1000 by default). Finally, an aggregated count matrix is obtained by summing the counts per meta-feature, and PCA is performed on the aggregated count matrix to get the low-dimensional representation.

We used the function aggregateFeatures() in R package scDblFinder (v1.13.9) with the default parameters. By default, \(K=1000\) feature clusters are identified.
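The following is a conceptual sketch of the aggregation idea only, not the scDblFinder implementation: peaks are embedded by TF-IDF followed by a truncated SVD (standing in for PCA), grouped into K meta-features with k-means (scDblFinder uses minibatch k-means), and counts are summed per meta-feature.

```r
aggregate_peaks <- function(counts, K = 1000, n_pcs = 20) {
  # counts: peak-by-cell sparse matrix (dgCMatrix)
  tf    <- counts %*% Matrix::Diagonal(x = 1 / Matrix::colSums(counts))  # per-cell term frequency
  idf   <- ncol(counts) / Matrix::rowSums(counts > 0)                    # inverse document frequency
  tfidf <- log1p(Matrix::Diagonal(x = idf) %*% tf * 1e4)
  svd_  <- irlba::irlba(tfidf, nv = n_pcs)           # truncated SVD as a stand-in for PCA
  pcs   <- svd_$u %*% diag(svd_$d)                   # peak embeddings
  km    <- kmeans(pcs, centers = K, iter.max = 50)   # k-means over peaks (minibatch in scDblFinder)
  # sum counts of peaks belonging to the same meta-feature
  memb  <- Matrix::sparseMatrix(i = km$cluster, j = seq_len(nrow(counts)), x = 1,
                                dims = c(K, nrow(counts)))
  memb %*% counts                                    # K-by-cell aggregated count matrix
}
```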

ArchR

ArchR takes the fragment files as input and can use either genomic tiles or peaks as features. We implemented both options. The “ArchR_tiles” method uses 500-bp non-overlapping genomic tiles to construct a binarized tile matrix and then performs iterative LSI on the matrix to extract meaningful low-dimensional representations. Similar to the “Signac_by_cluster_peaks” approach, the “ArchR_peaks” method first uses the latent representation obtained from “ArchR_tiles” for clustering, then performs peak calling per individual cluster and generates a consensus peak set by merging these peak sets. Afterwards, iterative LSI is performed on the peak count matrix.

During the iterative LSI process, at each iteration, the top accessible features (in 1st iteration) or top variable features (since 2nd iteration) are selected for LSI. The resulting cell clusters are then identified and utilized for feature selection in the subsequent iteration, enabling an iterative refinement of the LSI procedure.

We used the R package ArchR (v1.0.3) for the implementation. When running the function addIterativeLSI(), the Louvain algorithm was used for the clustering in intermediate steps with increasing resolutions, and no subsampling of cells was performed. All other parameters were kept at their default values.
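A hedged sketch of this step: the iteration count and resolution series below are illustrative, and only the disabled subsampling and the Louvain-based intermediate clustering are taken from the description above.

```r
library(ArchR)

# "ArchR_tiles": iterative LSI on the binarized tile matrix, no subsampling of cells
proj <- addIterativeLSI(
  ArchRProj        = proj,
  useMatrix        = "TileMatrix",
  name             = "IterativeLSI_tiles",
  iterations       = 4,                                        # illustrative
  clusterParams    = list(resolution = c(0.1, 0.2, 0.4),       # illustrative increasing resolutions
                          sampleCells = NULL, n.start = 10),
  sampleCellsPre   = NULL,
  sampleCellsFinal = NULL
)

# "ArchR_peaks": cluster on the tile-based LSI, call peaks per cluster, then rerun iterative LSI
proj <- addClusters(proj, reducedDims = "IterativeLSI_tiles")
proj <- addGroupCoverages(proj, groupBy = "Clusters")
proj <- addReproduciblePeakSet(proj, groupBy = "Clusters", pathToMacs2 = "/path/to/macs2")
proj <- addPeakMatrix(proj)
proj <- addIterativeLSI(proj, useMatrix = "PeakMatrix", name = "IterativeLSI_peaks")
```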

SnapATAC

The SnapATAC method (version 1) takes the fragment files as input and first constructs a binary cell-by-bin matrix using 5000-bp non-overlapping genomic bins. Then, after filtering out unwanted bins, it computes a pairwise cell-to-cell similarity matrix using the Jaccard coefficient. This kernel matrix is subjected to a normalization of the coverage bias and then to an eigenvalue decomposition (EVD) to get the cell embeddings.

For the implementation, we used the command line tool snaptools (v1.4.8) to create snap files from fragment files, and the R package SnapATAC (v1.0.0) for the rest of the processing pipeline. We followed the standard procedures in https://github.com/r3fang/SnapATAC/tree/master/examples/10X_PBMC_15K, except that we ran the function runDiffusionMaps() using all cells instead of landmark cells. We chose this approach to maintain a consistent basis for comparison with the other methods. Furthermore, the datasets employed are of small to moderate size, and running all methods on all cells does not pose significant efficiency issues, which is the primary concern that subsampling procedures are designed to address.

SnapATAC2

SnapATAC2 is version 2 of the SnapATAC method and is released as a Python package. By implementing an AnnData object and optimizing the on-disk representation, it facilitates the processing of high-dimensional data. As demonstrated in the tutorial https://kzhang.org/SnapATAC2/tutorials/pbmc.html, SnapATAC2 first creates a cell-by-bin matrix containing insertion counts using 500-bp bins by default. Then, a pairwise cell-to-cell similarity matrix is generated, using either the Jaccard coefficient (SnapATAC2_jaccard) or cosine similarity (SnapATAC2_cosine). With this kernel matrix, the symmetric normalized graph Laplacian is computed, and the bottom eigenvectors of the graph Laplacian are used as the lower-dimensional representation. For the implementation, we used SnapATAC2 (v2.2.0). To select features, we removed bins overlapping the blacklist regions, as in the other methods, and called the function snapatac2.pp.select_features() with parameters min_cells=10 and most_variable=1000000.

Clustering

In this study, we used a well-established graph-based clustering method for all clustering analyses. We first constructed a shared nearest neighbor graph and then applied the Leiden algorithm [20] using modularity as the optimization objective, as implemented in the Seurat package (v4.3.0) [36]. The Leiden algorithm incorporates a step where node partitions are refined by randomly reassigning nodes to communities that increase the objective function, enabling a wider exploration of the partition space [20]. To account for the inherent stochasticity in the Leiden algorithm, we ran it with 5 different random seeds: 0, 2, 5, 42, and 123. Since the optimal number of clusters is not known a priori, a range of resolutions was used to obtain diverse clustering solutions yielding varying numbers of clusters. The parameters used to achieve the optimal solution for each method and dataset are presented in Additional file 2.
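A sketch of this clustering loop in Seurat; the reduction name, the number of dimensions, and the resolution range are placeholders, while `algorithm = 4` selects Leiden and the seeds are those listed above.

```r
library(Seurat)

# SNN graph on the method's embedding (stored as a DimReduc named "embedding"; 15 dims, see below)
obj <- FindNeighbors(obj, reduction = "embedding", dims = 1:15)

# Leiden clustering (algorithm = 4) across resolutions and random seeds
resolutions <- seq(0.1, 2, by = 0.1)     # illustrative range
seeds       <- c(0, 2, 5, 42, 123)
clusterings <- list()
for (res in resolutions) {
  for (s in seeds) {
    obj <- FindClusters(obj, resolution = res, algorithm = 4, random.seed = s)
    clusterings[[paste(res, s, sep = "_")]] <- Idents(obj)
  }
}
```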

Evaluation metrics

Based on the data structure that each evaluation operates on, we classified our evaluation metrics into three categories: embedding-based, graph-based, and partition-based (Fig. 1).

ASW, FNS

The silhouette width quantifies the average distance between an observation and the other observations within its cluster, relative to the average distance to the nearest neighboring cluster [42]. The Average Silhouette Width (ASW) is calculated as the mean silhouette width across all observations within a cluster, providing insights into the compactness of the cluster and its separation from other clusters. ASW values range from − 1 to 1, with 1 indicating dense and well-separated clusters, 0 representing clusters that overlap, and − 1 indicating significant misclassification, where within-cluster dispersion is greater than between-cluster dispersion.

A limitation of ASW is that it is not invariant to the scaling of the space. As a solution, we introduced the fraction of negative Silhouette score (FNS) to assess the cluster-level proportion of cells with a negative Silhouette width. FNS characterizes the fraction of cells with a smaller distance to cells within another cluster compared to their own cluster. It is robust to linear scaling and enables more meaningful comparisons across different dimensional reduction methods.
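A sketch of how ASW and FNS can be computed from an embedding, assuming `emb` is a cell-by-dimension matrix, `celltype` the ground-truth labels used as the grouping (an assumption), and Euclidean distances.

```r
library(cluster)

# Silhouette widths in the embedding space
sil <- silhouette(as.integer(factor(celltype)), dist(emb))

asw <- mean(sil[, "sil_width"])                         # average silhouette width
fns <- tapply(sil[, "sil_width"] < 0, celltype, mean)   # per-cell-type fraction of negative widths
```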

cLISI

The LISI has been proposed to evaluate either the mixing between batches or the separation between cell types [43]. To calculate it, a weighted k-nearest neighbor (kNN) graph is first generated based on Euclidean distance within an embedding space. Subsequently, for each node in the graph, it computes the expected number of cells that need to be sampled before two cells are drawn from the same batch/cluster within its neighborhood. We used the cluster-based variant of LISI, known as cluster LISI (cLISI), as a metric to assess the embedding representation. This is implemented using the function compute_lisi() from the R package lisi v1.0 [43, 44]. cLISI ranges from 1 to K, where K is the total number of cell types in the dataset. A value of 1 indicates a neighborhood consisting exclusively of cells from a single cell type, while K corresponds to complete mixing, with cells from all cell types found within the neighborhood.

PWC

Partition-based metrics are susceptible to the influence of clustering parameters, whereas embedding-based metrics rely on the proper definition of similarity within the embedding space, which may not necessarily align with the similarity employed in clustering. Therefore, we proposed a novel graph-based metric that operates directly on the graph in which cells of the same (ground truth) type are identified as communities. Radicchi et al. [45] discussed a definition of community in a network by splitting the total degree of a node i into two contributions: given a subgraph \(V \subset G\), \(k_i(V) = k_i^{\text{int}}(V) + k_i^{\text{ext}}(V)\), where \(k_i^{\text{int}}(V)\) is the number of edges connecting node i to other nodes in V, and \(k_i^{\text{ext}}(V)\) is the number of connections towards the rest of the network. The subgraph V is a community in a strong sense if \(k_i^{\text{int}}(V) > k_i^{\text{ext}}(V), \forall i \in V\). Inspired by this definition, we introduced the metric Proportion of Weakly Connected cells (PWC). PWC quantifies, for a subgraph V consisting of all the cells of the same true class, the proportion of cells that have fewer connections within V than with the rest of the graph.
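A sketch of the PWC computation on a cell-cell graph, assuming `A` is a (binary or weighted) adjacency matrix and `celltype` the ground-truth labels, with connections counted as summed edge weights.

```r
library(Matrix)

# Proportion of weakly connected cells (PWC) per cell type: compare each cell's
# within-type connectivity to its connectivity to the rest of the graph
pwc_per_type <- sapply(unique(celltype), function(ct) {
  in_type <- celltype == ct
  k_int <- Matrix::rowSums(A[in_type,  in_type, drop = FALSE])  # edges to cells of the same type
  k_ext <- Matrix::rowSums(A[in_type, !in_type, drop = FALSE])  # edges to the rest of the graph
  mean(k_int < k_ext)
})
```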

AW, AV

Wallace [46] proposed two asymmetric indices to quantify the similarity between two partitions of a set. Let \(U = \{U_1, \dots , U_R\}\) be the partition of the dataset defined by the cell types and \(Z = \{Z_1, \dots , Z_C\}\) be the partition given by the clustering prediction. The first index, W, is the proportion of object pairs joined in partition U that are also joined in partition Z. The second index, V, is the proportion of object pairs joined in partition Z that are also joined in partition U. Both W and V can be adjusted for chance using the formula:

$$\begin{aligned} \text{S}_{\text{adj}} = \frac{\text{S} - E(\text{S})}{\max (\text{S}) - E(\text{S})}, \end{aligned}$$

(1)

where S is a similarity measure that does not have value 0 under statistical independence, \(E(\text{S})\) is its expected value, and \(\max (\text{S})\) is its maximum value. A generalized hypergeometric model is assumed to calculate the expected values of V and W [47]. AW can be interpreted as the completeness of cell types: it quantifies to what extent objects belonging to the same cell type in U are assigned to the same cluster in Z. Similarly, AV can be interpreted as the homogeneity of clusters, which measures to what extent clusters do not mix objects of different cell types. AW and AV can be decomposed into indices for the individual cell types of partition U and for the individual clusters of partition Z, respectively [48], that is:

$$\begin{aligned} \text{AW}_i = \frac{N \sum _{j=1}^{C} \binom{n_{ij}}{2} - \binom{n_{i\cdot }}{2} Q}{\binom{n_{i\cdot }}{2} (N-Q)}, \end{aligned}$$

(2)

$$\begin{aligned} \text{AV}_j = \frac{N \sum _{i=1}^{R} \binom{n_{ij}}{2} - \binom{n_{\cdot j}}{2} P}{\binom{n_{\cdot j}}{2} (N-P)}, \end{aligned}$$

(3)

where \(n_{ij}\) is the number of objects placed in class \(U_i\) and in cluster \(Z_j\), \(n_{i\cdot }\) and \(n_{\cdot j}\) are the corresponding row and column totals, N is the total number of pairs of objects, \(P = \sum _{i=1}^{R} \binom{n_{i\cdot }}{2}\) is the number of object pairs that were placed in the same class in U, and \(Q = \sum _{j=1}^{C} \binom{n_{\cdot j}}{2}\) is the number of object pairs that were placed in the same cluster in Z. AW and AV range from \(-1\) to 1, with 0 for random assignments and 1 for perfect agreement.
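A worked sketch of these pair-counting quantities from the cell-type-by-cluster contingency table, assuming `celltype` and `cluster` are label vectors of equal length and following Eqs. 2 and 3 as written above.

```r
# Pair-counting quantities from the contingency table of true classes (rows) vs. clusters (columns)
tab <- table(celltype, cluster)
n   <- sum(tab)
N   <- choose(n, 2)                       # total number of object pairs
P_i <- choose(rowSums(tab), 2)            # pairs joined within each true class
Q_j <- choose(colSums(tab), 2)            # pairs joined within each predicted cluster
T_i <- rowSums(choose(tab, 2))            # pairs jointly joined, per true class
T_j <- colSums(choose(tab, 2))            # pairs jointly joined, per predicted cluster
P   <- sum(P_i); Q <- sum(Q_j)

# Per-class adjusted Wallace indices (Eqs. 2 and 3) and the global AW/AV
AW_i <- (N * T_i - P_i * Q) / (P_i * (N - Q))
AV_j <- (N * T_j - Q_j * P) / (Q_j * (N - P))
AW   <- (N * sum(T_i) - P * Q) / (P * (N - Q))
AV   <- (N * sum(T_j) - Q * P) / (Q * (N - P))
```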

ARI, ARI2

The adjusted Rand index (ARI) is the harmonic mean of AW and AV, and therefore summarizes both the homogeneity of predicted clusters and the completeness of true classes. ARI can be decomposed into a weighted average of the \(\text{AW}_i\)’s and \(\text{AV}_j\)’s as follows [48]:

$$\begin{aligned} \text{ARI} = \frac{\sum _{i=1}^{R} \text{AW}_i P_i + \sum _{j=1}^{C} \text{AV}_j Q_j}{\sum _{i=1}^{R} P_i + \sum _{j=1}^{C} Q_j}, \end{aligned}$$

(4)

where \(P_i = \binom{n_{i\cdot }}{2}\) is the number of object pairs in class \(U_i\), and \(Q_j = \binom{n_{\cdot j}}{2}\) is the number of object pairs in cluster \(Z_j\). Equation 4 shows that ARI is largely determined by the \(\text{AW}_i\) and \(\text{AV}_j\) values of large clusters. However, in many single-cell analyses the rare cell types are of greater interest. Therefore, we included a variant of ARI (ARI2) proposed by Warrens et al. [48] to alleviate the class-size bias of ARI.

$$\begin{aligned} \text{ARI2} = \frac{2\,\text{AW}^\prime \times \text{AV}^\prime }{\text{AW}^\prime + \text{AV}^\prime }, \end{aligned}$$

(5)

where

$$\begin{aligned} \text{AW}^\prime = \frac{1}{R} \sum \limits _{i=1}^{R}\text{AW}_i, \end{aligned}$$

(6)

and

$$\begin{aligned} \text{AV}^\prime = \frac{1}{C} \sum \limits _{j=1}^{C}\text{AV}_j. \end{aligned}$$

(7)
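Continuing the sketch above, ARI and ARI2 follow directly from the per-class indices (using the weighted-average form of Eq. 4 and the harmonic mean of Eqs. 5-7).

```r
# ARI as the P_i/Q_j-weighted average of the per-class indices (Eq. 4),
# and ARI2 as the harmonic mean of the unweighted averages (Eqs. 5-7)
ARI  <- (sum(AW_i * P_i) + sum(AV_j * Q_j)) / (sum(P_i) + sum(Q_j))
AWp  <- mean(AW_i)
AVp  <- mean(AV_j)
ARI2 <- 2 * AWp * AVp / (AWp + AVp)
```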

MI, VI

While ARI, AW, and AV are external evaluation metrics that count pairs of objects, Mutual Information (MI) and Variation of Information (VI) are based on information theory. These two groups of metrics do not always show consistent results, due to different underlying assumptions. The MI between two partitions U and Z is defined as follows:

$$\begin{aligned} \text{MI} = \sum \limits _{i=1}^{R} \sum \limits _{j=1}^{C} p_{ij} \log \frac{p_{ij}}{p_{i} p_{j}}, \end{aligned}$$

(8)

where \(p_{ij}=\frac{n_{ij}}{n}\), \(p_{i}=\frac{n_{i\cdot }}{n}\), \(p_{j}=\frac{n_{\cdot j}}{n}\), and n is the total number of cells. It has been shown [49] that

$$\begin{aligned} \text{MI} = H(U) - H(U|Z), \end{aligned}$$

(9)

where \(H(\cdot )\) is the Shannon entropy. Since U stays the same for a given dataset, Eq. 9 indicates that comparing MI between methods on the same dataset is equivalent to comparing the conditional Shannon entropy of U given Z. In other words, MI can be interpreted as a measure of the homogeneity of clusters, similar to AV. Note that MI is not normalized and the upper bound varies across datasets. It is therefore only meaningful to compare MI within the same dataset.

VI measures the amount of information that is lost or gained in changing from partition Z to U:

$$\begin{aligned} \text{VI} = -\sum \limits _{i=1}^{R} p_{i} \log p_{i} - \sum \limits _{j=1}^{C} p_{j} \log p_{j} - 2\sum \limits _{i=1}^{R} \sum \limits _{j=1}^{C} p_{ij} \log \frac{p_{ij}}{p_{i} p_{j}}. \end{aligned}$$

(10)

Similarly, VI is also highly related to entropy:

$$\begin{aligned} \text{VI} = H(U|Z) + H(Z|U). \end{aligned}$$

(11)

VI is not normalized, and a higher VI value indicates a worse clustering solution.
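A short sketch computing MI and VI from the same cell-type-by-cluster contingency table as above.

```r
# MI and VI from the contingency table `tab` (true classes x clusters)
p_ij <- tab / sum(tab)
p_i  <- rowSums(p_ij)
p_j  <- colSums(p_ij)

H_U  <- -sum(p_i * log(p_i))
H_Z  <- -sum(p_j * log(p_j))
nz   <- p_ij > 0                                          # skip zero cells to avoid log(0)
MI   <- sum(p_ij[nz] * log(p_ij[nz] / outer(p_i, p_j)[nz]))
VI   <- H_U + H_Z - 2 * MI                                # equivalent to H(U|Z) + H(Z|U)
```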

Calculating and comparing the area under the curve

Partition-based metrics change as the clustering resolution changes. The true number of clusters is not predetermined, and the optimal performance is not always achieved at the true number of clusters. Therefore, comparing clusterings at a fixed resolution or fixed number of clusters becomes challenging. To address this challenge, we compared clusterings across a range of resolution parameters that result in varying numbers of clusters (Additional file 1: Fig. S20a). We examined the performance as the number of clusters changes and summarized the results using the area under the curve (AUC).

To calculate the AUC and compare results across different ranges of cluster numbers, the upper bound of each metric was used to normalize the absolute AUC. Specifically, metrics such as ARI, ARI2, AW, and AV have an upper bound of 1. MI and VI were normalized using the empirical maximum value per method per dataset. Notably, in the case of VI, instead of using the normalized AUC directly for comparison, we used \(1 -\) the normalized AUC.

When plotting the AUC heatmap, we colored the heatmap using the deviations from the column-wise median, scaled by the matrix-wise median absolute deviation. Let \(\varvec{A}\) be the original matrix storing the metric values and \(\varvec{B}\) the transformed matrix, with \(A_{ij}\) and \(B_{ij}\) denoting the elements of \(\varvec{A}\) and \(\varvec{B}\), respectively. \(\varvec{B}\) is calculated as in Eqs. 12, 13, and 14. By applying this transformation, the color scale is unified across datasets.

$$\begin{aligned} B_{ij} = \frac{A_{ij} - \text{median}(\varvec{A}_{\cdot j})}{\text{median}(\varvec{M}^\prime )}, \end{aligned}$$

(12)

where

$$\begin{aligned} M^\prime _{ij} = |M_{ij}|, \end{aligned}$$

(13)

and

$$\begin{aligned} \varvec{M}_{\cdot j} = \varvec{A}_{\cdot j} - \text{median}(\varvec{A}_{\cdot j}). \end{aligned}$$

(14)
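A compact R sketch of this transformation, assuming `A` is a numeric matrix of AUC values with datasets in columns (the orientation is an assumption).

```r
# Deviation from the column-wise (per-dataset) median, scaled by the matrix-wise
# median absolute deviation (Eqs. 12-14)
col_med <- apply(A, 2, median)
M       <- sweep(A, 2, col_med)     # Eq. 14: deviations from the column medians
B       <- M / median(abs(M))       # Eqs. 12-13: scale by the matrix-wise median absolute deviation
```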

Choosing the number of dimensions

When applying dimensional reduction methods, a parameter that one needs to choose is the number of dimensions n of the embedding space. Since all the methods use either principal component analysis (PCA) or singular value decomposition (SVD), we applied the elbow approach and examined the scree plot of each method by plotting the proportion of variance explained by each component against the component indices. We observed that for nearly all methods and datasets, the elbow point occurs before 15 dimensions (Additional file 1: Fig. S20b). We therefore used \(n=15\) for all methods. How n affects the performance is discussed in more detail in the “Results” section.

Analysis of feature numbers

To understand how the number of features impacts the performance, we ran each method across a range of feature numbers: 25k, 100k, 200k, and 500k. For ArchR and SnapATAC2, adjusting these settings was straightforward, achieved by changing the varFeatures argument in the addIterativeLSI() function, or the most_variable argument in the snapatac2.pp.select_features() function, respectively. In the case of Signac and SnapATAC, retaining the desired number of features involves computing the corresponding quantiles. The calculated quantile was specified through the argument min.cutoff in the function FindTopFeatures() in Signac. For SnapATAC, after initially removing the top 5% most accessible (and thus least variable) features, a second filtering step was performed by selecting the top n% most accessible features, where n% was chosen to retain the specified number of features. Then, the clustering was performed using the same optimal hyper-parameters as in the previous analysis, which used the default feature numbers.
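As an illustration of the quantile computation for Signac, with hypothetical numbers (`n_total` is a dataset's total peak count and `obj` a Seurat object with a ChromatinAssay):

```r
# Translate a target feature count into Signac's min.cutoff quantile string (illustrative)
n_total  <- 350000                                    # hypothetical total number of peaks
n_target <- 100000                                    # desired number of retained features
q        <- floor(100 * (1 - n_target / n_total))     # quantile below which features are dropped
obj      <- FindTopFeatures(obj, min.cutoff = paste0("q", q))   # e.g., "q71"
```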

Analysis of clustering robustness

As described in the Clustering section, we performed Leiden clustering using a range of resolution values in combination with 5 random seeds. For each resolution, we compared the clustering outcomes from different seed pairs by computing the ARI. An ARI of 1 indicates identical clustering predictions from the two seeds, whereas an ARI of 0 suggests that the clusterings are independent. We then calculated the deviation of the pairwise ARI from 1, interpreting this as a measure of the clustering’s robustness at a fixed resolution. To get an overall estimate of robustness as in Additional file 1: Fig. S12a, this deviation was averaged across seed pairs and resolutions.
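A sketch of the pairwise-ARI robustness computation for one resolution, assuming `clusterings` is a list of cluster assignments (one per random seed) and using mclust for the ARI.

```r
library(mclust)

# Pairwise ARI between clusterings obtained with different seeds at a fixed resolution
seed_pairs <- combn(length(clusterings), 2)
pair_ari   <- apply(seed_pairs, 2, function(p)
  adjustedRandIndex(clusterings[[p[1]]], clusterings[[p[2]]]))

robustness_deviation <- mean(1 - pair_ari)   # averaged deviation from 1; lower = more robust
```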

Analysis of library size biases

The library size of each cell is the total number of unique fragments of that cell in the fragment file. To quantify to what extent the learned latent space is driven by the library size, we calculated an average absolute Pearson correlation coefficient \(r_{ld}\) for each latent dimension l and dataset d as follows:

$$\begin{aligned} r_{ld} = \frac{1}{K}\sum \limits _{k=1}^{K}|\text{cor}(\text{sqrt}(\varvec{x}_{kd}), \varvec{y}_{lkd})|, \end{aligned}$$

(15)

where cor() calculates the Pearson correlation coefficient between two vectors, and sqrt() calculates the element-wise square root of a vector. \(\varvec{x}_{kd}\) represents the library sizes across all cells of cell type k in dataset d, and \(\varvec{y}_{lkd}\) represents the values of latent component l across all cells of cell type k in dataset d. K is the total number of cell types. \(r_{ld}\) for dataset 10XPBMC and \(l=1,2,3,4,5\) are shown in Additional file 1: Fig. S14.

To summarize across methods, we averaged this value across \(l=1,2,3,4,5\) and the six datasets (see Fig. 5).
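A sketch of Eq. 15 in R, assuming `emb` is the cell-by-dimension embedding, `libsize` the per-cell unique fragment counts, and `celltype` the cell-type labels.

```r
# Average absolute per-cell-type Pearson correlation between sqrt(library size)
# and one latent dimension (Eq. 15)
lib_cor <- function(emb, libsize, celltype, l) {
  mean(sapply(split(seq_along(celltype), celltype), function(idx)
    abs(cor(sqrt(libsize[idx]), emb[idx, l]))))
}

r_l <- sapply(1:5, function(l) lib_cor(emb, libsize, celltype, l))   # l = 1,...,5
```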

Benchmarking methods for predicting gene activity scores

Of the five methods discussed, four offer the capability to infer gene activities. Signac employs a basic technique, counting fragments in the gene body and promoter region of each gene, followed by log-normalization of these counts to derive a gene activity score. SnapATAC calculates gene body fragment counts, normalizes them using log-transformed counts per million reads, and then employs a Markov affinity-graph-based method for imputation and smoothing. SnapATAC2 has a similar procedure to SnapATAC, with the distinction of counting Tn5 insertions instead of fragments. ArchR adopts a more complex model that weights fragment counts based on their distance to the transcription start site (TSS) and includes distal regulatory elements, while also considering neighboring genes.

To benchmark their performance in predicting gene activities, we utilized the two multi-omics datasets, namely 10XPBMC and Chen2019. Gene expression data from the RNA components were used as the ground truth. Additionally, we identified 500 metacells per dataset using k-means clustering, applied through the function fastcluster(returnType="metacells") in scDblFinder (v1.13.9). We then selected the top 1000 highly variable genes and computed Pearson’s and Kendall’s correlation coefficients to compare the predicted gene activity scores with the actual gene expression data. Specifically, we computed both the per-cell (or per-metacell) correlation across genes and the per-gene correlation across cells (or metacells).
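A sketch of the correlation step, assuming `ga` and `rna` are gene-by-cell matrices of predicted gene activity and measured expression restricted to the same highly variable genes (names are illustrative).

```r
# Per-cell correlation across genes, and per-gene correlation across cells,
# between predicted gene activity (ga) and measured expression (rna)
per_cell_pearson <- sapply(seq_len(ncol(ga)), function(i) cor(ga[, i], rna[, i], method = "pearson"))
per_gene_kendall <- sapply(seq_len(nrow(ga)), function(g) cor(ga[g, ], rna[g, ], method = "kendall"))
```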

Benchmarking methods for integration with scRNA-seq data

The cross-modality integration was performed in an unpaired fashion, meaning we treated the cells from the two modalities as if they were unmatched. Specifically, we used GLUE [50] for this integration, where the processing of the RNA data is fixed, and the features and cell embeddings generated by the different ATAC-seq processing pipelines are used as GLUE’s input. For the construction of the guidance graph in GLUE, we used the top 2000 highly variable genes (HVGs) from the RNA data and the genomic features from the ATAC data that were (i) selected by the processing method for dimensional reduction and (ii) connected to an HVG (i.e., overlapping either the gene body or the promoter region).

FOSCTTM (the fraction of samples closer than the true match) is calculated as follows: let \(X_1\) and \(X_2\) be the cell embeddings of the RNA and ATAC data within an integrated space, respectively. Each matrix is of size \(n \times m\), where n represents the number of cells and m is the number of latent dimensions. The rows within \(X_1\) and \(X_2\) are aligned such that each row corresponds to the same cell. Define \(d_{ij}\) as the cosine distance between \(X_1[i,:]\) and \(X_2[j,:]\). For a given cell i, the fraction of samples closer than its true match is computed as:

$$\begin{aligned} \text{FOSCTTM}_i = \frac{R_i - 1}{n - 1}, \end{aligned}$$

(16)

where \(R_i\) is the rank of \(d_{ii}\) among the distances \(d_{i1}, d_{i2},...,d_{in}\), i.e., the number of distances that are less than or equal to \(d_{ii}\). The \(\text{FOSCTTM}_i\) values were also calculated after switching \(X_1\) and \(X_2\), and then averaged for each cell pair i.
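A sketch of this computation, assuming `X1` and `X2` are row-aligned embedding matrices and cosine distances are obtained from row-normalized matrices.

```r
# FOSCTTM from two row-aligned embeddings X1 (RNA) and X2 (ATAC)
norm1 <- X1 / sqrt(rowSums(X1^2))
norm2 <- X2 / sqrt(rowSums(X2^2))
D     <- 1 - norm1 %*% t(norm2)            # cosine distance matrix, D[i, j] = d_ij

n     <- nrow(D)
# fraction of cells in the other modality closer than the true match (the diagonal)
fosc  <- sapply(seq_len(n), function(i) sum(D[i, ] < D[i, i]) / (n - 1))
# as described above, the same quantity is computed with X1 and X2 switched and then averaged
fosc2 <- sapply(seq_len(n), function(i) sum(D[, i] < D[i, i]) / (n - 1))
foscttm <- (fosc + fosc2) / 2
```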

Other indices

Evenness

Evenness (E) quantifies the homogeneity of abundances of different types in a sample [51]. Here, we use Eq. 17 to calculate E:

$$\begin{aligned} \text{E} = \frac{e^{H(\varvec{p})}}{K}, \end{aligned}$$

(17)

where \(H(\cdot )\) is the Shannon entropy, \(\varvec{p}\) is the vector of cell type proportions, and K is the total number of cell types. E ranges from \(\frac{1}{K}\) to 1, and a higher E indicates that the dataset is more balanced.

Geary’s C

We calculated Geary’s C index [52] of the log-transformed fragment counts, using spatial weights defined by a k-nearest neighbor (KNN) graph (k = 20) [14]. Geary’s C is calculated as:

$$\begin{aligned} \text{C} = \frac{(N-1)\sum _{i}\sum _{j} w_{ij}(x_i-x_j)^2}{2S_0\sum _{i}(x_i-\bar{x})^2}, \end{aligned}$$

(18)

where N is the total number of cells; \(x_i\) is the log-transformed fragment count of cell i, \(w_{ij}\) is the weight of the edge between cells i and j in the KNN graph, and \(S_0\) is the sum of all weights in \(\varvec{W}\).
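A sketch of this calculation, assuming `W` is a numeric (0/1 or weighted) sparse adjacency matrix of the KNN graph and `x` the vector of log-transformed fragment counts.

```r
library(Matrix)

# Geary's C of log fragment counts over a KNN graph (Eq. 18)
gearys_c <- function(x, W) {
  N   <- length(x)
  S0  <- sum(W)
  tw  <- summary(as(W, "TsparseMatrix"))                 # nonzero entries (i, j, weight)
  num <- (N - 1) * sum(tw$x * (x[tw$i] - x[tw$j])^2)
  den <- 2 * S0 * sum((x - mean(x))^2)
  num / den
}
```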
