Highly multiplexed spatial profiling with CODEX: bioinformatic analysis and application in human disease

Bioinformatics analysis of CODEX multiplexed imaging data reveals tissue structure and provides insight into biological mechanisms. This is typically done by converting the high-dimensional raw data into single-cell maps of tissue architecture and functional states. This process consists of four main steps (Fig. 2): (1) image preprocessing, (2) cell segmentation, (3) cell phenotyping, and (4) spatial analysis by one or more algorithms that relate cellular contexts to some determinative or predictive outcome.

Fig. 2figure 2

Overview of the computational workflow for CODEX multiplexed imaging data

Over the past 5 years, various computational methods and software tools have been developed to facilitate each of these analytical steps. Most carry out only one of the steps (CODEX Uploader [5], RAPID [25], CellSeg [26], Mesmer [27], CellPose [28], CELESTA [29], Astir [30], CytoMAP [31], histoCAT [32], TissueSchematics [33], and MISTy [34]), but a few attempt to address multiple steps or the entire workflow (Cytokit [35], MCMICRO [36], SIMPLI [37]) (Table 2).

Image preprocessing

Typical CODEX multiplexed imaging data consist of multiple tiles that correspond to a large tissue region. In each cycle, data are collected from four wavelength channels and multiple z-planes, resulting in terabytes of imaging data for each experiment. To visualize and analyze this dataset, it is necessary to convert the raw data into multidimensional hyperstacks (x, y, channel, cycle) by creating montages from individual tiles and aligning these montages from different cycles to generate a multidimensional tiff image. Several challenges are associated with this step: First, images can be blurry and noisy due to inherent limitations of microscopes systems (e.g., out-of-focus light and noise from the light source and the camera). Second, images can be misaligned due to axial and lateral drift during the cyclic imaging process. Third, there is autofluorescence background originating from endogenous tissue components such as erythrocytes, collagen, elastin, and lipofuscin, and formalin fixation. Finally, rapid processing of the large-scale imaging datasets. These issues are generic to all fluorescence-based multiplexed imaging technologies including CODEX.

Although several computational toolboxes are available for CODEX image preprocessing, most only address a subset of these issues. The CODEX Uploader, the first processor developed for CODEX data, was implemented as an open-source Java package by the Nolan lab [5,6,7]. This software corrects spatial drifts using 3D drift-compensation, deconvolves the z-stack images using the commercially available Microvolution algorithm, subtracts the background obtained by collection of blank image cycles, and generates hyperstacks consisting of all fluorescent channels and imaging cycles. The CODEX Uploader has a graphical user interface (GUI) that allows users without any programming background to process the complex multiplexed imaging data. The main limitations of the CODEX Uploader are that it can take several days to process terabytes of imaging data and it does not correct lateral drifts among tiles for multi-tile experiments. Cytokit, which was implemented in Python, employs a pipeline similar to that of the CODEX Uploader [35]. MCMICRO can be used to analyze multiplexed imaging data collected with a range of technologies (e.g., fluorophore-based, metal-based) [36]. The preprocessing pipeline in MCMICRO includes illumination correction, image stitching, and registration, but it does not have image deconvolution or autofluorescence removal.

RAPID [25] was developed for accurate and fast analysis of large-scale CODEX imaging data by the Nolan lab. It reduces processing time by two to threefold compared to the CODEX Uploader. RAPID deconvolves large-scale CODEX imaging data and stitches and registers images with axial and lateral drift correction. Moreover, RAPID minimizes intense tissue autofluorescence such as that introduced by erythrocytes, thereby improving the immunofluorescence detection of antigens, especially for low-abundance markers. RAPID incorporates an open-source, CUDA-driven, GPU-assisted deconvolution algorithm rather than the fee-based commercial software used in the CODEX Uploader with no impact on quality of the output.

Cell segmentation

After the raw imaging data is processed and aligned, the next step is cell segmentation. During this step, the boundaries of single cells are computationally identified and binary masks for individual cells are generated. This task has long been challenging in fluorescence microscopy image processing. While it is relatively easy to segment monodispersed cells in images of cultured cells as cells are sufficiently separated, segmenting cells that are densely packed in tissues such as lymph nodes and spleen is difficult. The accuracy of cell segmentation significantly influences the quantification of multicellular properties such as protein expression and cell morphology. An ideal cell segmentation algorithm for imaging data must accurately segment cells of various sizes and shapes independently of density across diverse tissue types and must be capable of demarcating the membrane, nucleus, and cytoplasm.

Cell segmentation algorithms typically use images of cells stained with marker of nuclei and/or membranes as input. Some algorithms generate only nuclei masks, but others can output both nuclei and whole-cell masks, thereby allowing protein quantification in sub-cellular compartments. Watershed nuclei segmentation algorithms are classical segmentation methods that perform well when segmenting cells of relatively homogenous size and shape such as lymphoid tissue in CODEX data [5, 6, 38]. However, watershed-based algorithms can be sensitive to noise such as imaging artifacts and blurred cell boundaries. In addition, significant tuning for multiple user-defined parameters based on the cell morphology and image intensity in the tissue images is necessary. Therefore, watershed cell segmentation does not work well for tissues that have cells of diverse sizes, shapes, and densities such as those in the tumor microenvironment.

Machine learning and deep learning algorithms trained on large-scale datasets are more robust to variations of image quality and generalize better across tissue types than watershed-based algorithms. Machine learning and deep learning algorithms differ in the composition and quantity of training dataset, model architecture, outputs, and post-processing methods. CellSeg is a modified Mask R-CNN algorithm that accurately segments nuclei in CODEX images and outperforms watershed-based nuclei segmentation [26]. CellSeg showed robust performance in delineating nuclei boundaries even in noisy images with low signal-to-noise ratio. A limitation is that it approximates the whole-cell mask by expanding the nuclei mask for a few pixels, which may not accurately match the cell boundaries. Cellpose is designed as a generalist algorithm for cell segmentation that is capable of generating both nuclei and whole-cell masks [28]. Although Cellpose outperformed several deep learning-based cell segmentation algorithms including Mask R-CNN, Stardist, and standard U-Nets, none of these models were trained on a multiplexed tissue imaging dataset; instead, training was implemented on cultured cells, animal cells, and non-microscopy images of non-cell objects. Recently, Noah et al. constructed a large-scale multiplexed imaging dataset named TissueNet with more than 1 million manually labeled cells [27]. Using TissueNet, they trained a deep neural network named Mesmer and achieved human-level performance in segmenting whole cells. Mesmer outperformed the pre-trained Cellpose model, FeatureNet, RetinaMask, and ilastik [39], and enabled both nuclei and whole-cell segmentation, thereby allowing for subcellular localization of protein signals. Interestingly, different deep learning models achieved similar performance when trained on the same TissueNet dataset, suggesting the importance of procuring large-scale annotated dataset.

Protein marker quantification

After cell segmentation, protein marker expression is typically quantified as the total marker intensity on each cell normalized by cell size. However, spatial spillover, which is the blending of signals between neighboring cells that are tightly packed such as in lymphoid tissue, is a common issue complicating the accurate quantification of single-cell protein expression in multiplexed imaging data. Signal spillover is often manifested by the false positive co-expression of mutually exclusive markers on the same cell. An example is the apparent co-expression of CD3 and CD20, which are expressed on T and B cells, respectively. To compensate for the cell-to-cell spillover in CODEX images, an adjacency matrix is typically calculated by measuring the fraction of shared boundary between each pair of cells, and then the raw intensity matrix of protein expression is multiplied by the inverse adjacency matrix, correcting for the spillover [5]. REDSEA is another spillover correction method that has been shown to decrease signal contamination from neighboring cells in MIBI data [40]. However, REDSEA requires that the cell mask have zero-pixel value between the adjacent cell boundaries, which is not the case with the typical cell masks generated by commonly used cell-segmentation algorithms.

Cell type identification

Assignment of cell types, namely, cell phenotyping, to segmented single cells from multiplexed imaging data remains a bottleneck for multiplexed image analysis. An ideal algorithm should automatically and precisely define cell types and subtypes of various abundances in an objective and robust manner with minimal human intervention.

The classical methods for cell phenotyping rely heavily on pathology expertise and are subjective and time-consuming. One such method employs the manual gating strategy that is typically used for analyzing flow cytometry or CyTOF datasets (e.g. Cytofkit [41], cytomapper [42]). This method is sensitive to signal spillover between adjacent cells and to segmentation noise [43]. SIMPLI [37] is a toolkit that defines cell types by global thresholding in a fashion similar to the hand-gating method; both are challenged by poor imaging quality and imaging noise.

Unsupervised clustering followed by manual annotation has been widely used for assignment of cell types [5, 7, 19]. This process typically starts with over-clustering of a protein expression matrix with each cell as a row and each marker signal as a column. Many clustering algorithms are available that can be used to group protein expression into clusters; these include X-shift, PhenoGraph, FlowSoM, and k-means, etc. Data normalization techniques and selection of the optimal cluster numbers can significantly influence the clustering results [43]. After the initial clustering, similar clusters are merged, and mixed clusters are further separated to finalize the cluster annotation. This process works in a bottom-up and iterative fashion. However, problems such as mixed clusters or unknown clusters can persist after many rounds of manual annotations, and annotating millions of cells requires significant time [7].

Although efforts have been made to develop new algorithms for automatic and fast cell type identification, robust and accurate algorithms that can detect cells of various abundances from multiplexed imaging data are still lacking. Astir [30] is a probabilistic machine learning method that infers cell types based on protein expression of cells and prior knowledge of marker proteins. This algorithm has superior accuracy and runtime compared to unsupervised clustering methods in mass cytometry data. However, it did not outperform supervised classification and cannot identify novel cell types. Additionally, Astir is sensitive to antibody staining quality and signal intensity. Although Astir utilizes protein expression profiles, it does not incorporate any spatial or morphological information unique to multiplexed imaging data. CELESTA is a fast cell-identification algorithm designed to leverage both the protein expression and spatial information from CODEX imaging data [29]. Cells with marker expressions that matches well with the pre-defined marker profile are easily identified and defined as “anchor” cells. When the marker expressions of cells do not match a pre-defined cell type, these cells are assigned an identity based on a spatial score of their neighboring cells types. One limitation of this method is that it is difficult to define rare cell types. CELESTA is also sensitive to marker staining quality and imaging noise. STELLAR [44] is new cell-type annotation method that uses graph convolutional neural network to learn the spatial and molecular similarities of cells. This method requires manually annotated data as a reference dataset to learn from.

Spatial analysis

Multiplexed imaging data contains rich information on the multicellular tissue ecosystem including cell types, cell states, and their coordinated activities across scales. The goal of spatial analysis is to decipher how cells and tissues are spatially organized and orchestrated in their native environment and how this organization influences biological function, disease progression, and response to therapies. To detangle this multilayered and interrelated spatial information, computational methods must dissect the multicellular modules that are coordinated by both the physically proximal and distant cell types across a spectrum of length scales and determine the associations of these multicellular modules with tissue functions and with physiological and pathological conditions.

Cell–cell interactions and communications can occur across a range of distances [45], broadly categorized as autocrine, juxtacrine, paracrine, and endocrine. The developed spatial analysis approaches span from identification of pair-wise cell–cell contacts to higher-order architectures.

Pair-wise cell–cell interactions: The most basic method for cell interaction analysis is to measure the pair-wise cell–cell interactions to infer attraction or repulsion between two cell populations. To measure these physical cell–cell contacts (i.e., juxtacrine contacts), Goltsev et al. identified pairs of neighboring cells from Delaunay graphs and calculated the odds ratio of the co-occurrence of two cell types [5]. This analysis showed that the most dominate pairwise cell–cell contacts are homotypic (e.g., T cells with T cells, B cells with B cells) and mostly reflect anatomic compartments in the tissue [5, 7]. histoCAT was originally designed for cell-interaction analysis in multiplexed image cytometry data [32]. It examines the significance of pairwise interactions by comparing the number of cell–cell interactions at a user-defined distance (e.g., 4 pixels) to that of a matched control containing randomly shuffled cell phenotypes [46]. This test can reveal significant enrichments or depletions of a cell in another cell’s neighborhood. Strategy used by Keren et al. [3] measured pair-wise interactions within 100 pixels (39 µm) of an index cell. Similar to the method developed by Goltsev et al., these pair-wise enrichment analysis [3, 46] identified mostly homotypic interactions between similar cells, suggesting that this might be a limitation of pair-wise analysis. Additionally, several spatial descriptive functions, including the K-function, L-function, and pair correlation function, can also be used to evaluate the spatial relationships between two cell populations [47].

Cell neighborhood, niche, or community analysis: Cell neighborhood analysis identifies higher-order (rather than paired) interactions between one or more cell phenotypes, which provides a catalog of repeating architectural units for tissues. Many methods are available to identify cell neighborhoods. A neighborhood is defined as a locally spatially homogeneous mixture of cell types [33]. The neighborhood identification methods typically raster scan each cell across the tissue with a pre-defined window that consists of N nearest neighboring cells (e.g., N = 10 was found to be suitable for colorectal tumor microenvironment [7]) or all the cells within a certain physical distance of the center cell, which is the strategy used in CytoMAP [31]. Jackson et al. used a graph-based method to identify spatially connected cell clusters, which they called cell communities [46]. These windows or graphs are grouped into cell neighborhoods by unsupervised clustering of the cell type composition of the windows. The size of these windows influences the type of cell neighborhoods identified. Spatial-LDA [48] is another method to identify cell neighborhoods with smooth boundary transitions, but it is very slow for large image data. Cell neighborhoods are composed of distinct spatial patterns of cell organization, thus reflecting characteristic local processes. Some neighborhoods correspond to known tissue microanatomies, such as tertiary lymphoid structure and bulk tumors, whereas others are novel such as the granulocyte-enriched cell neighborhood identified by Schürch et al. in their analysis of colorectal tumors [7]. In addition, other computational methods have been used to identify patterns of cell types and cell neighborhoods from multiplexed imaging data. For example, non-negative Tucker tensor decomposition and canonical correlation analysis have been used to evaluate cell type and cell neighborhood interactions [7]. “SpatialScore” was shown to predict immunotherapy response by measuring the physical distance ratio of each CD4 + T cell and its nearest tumor cell relative to its nearest Treg [19]. The tumor-immune mixing score is another spatial metric that has been used to quantify the degree of mixing between tumor and immune cells [3].

Higher-order functional units: It remains understudied how cell neighborhoods are assembled into higher-order, repetitive structures that coordinate complex tissue functions and promote disease development and progression. To infer the design principles of spatial organization of tissues, Bhate et al. proposed a conceptual and analytical framework to define and quantify tissue architectural units and their assembly at both the local and global level from multiplexed imaging data like CODEX [33]. Development of the framework, which Bhate et al. called the tissue schematic, starts by identification of cell neighborhoods from recurring patterns of spatially proximal cell types, next a graph is constructed using cell neighborhoods spanning the entire tissue sections, and finally recurring cell neighborhood combinations are classified as “motifs.” The junctions between two cell neighborhood boarders are the sites of possible interactions between the two local tissue processes; thus, “motifs” could have emergent functionality arising from signal propagation between cellular neighborhoods. Applied to CODEX imaging data of lymphoid tissues, tissue schematics identified the assembly rules shared by tonsils, lymph nodes, and spleen. In addition, this method revealed that the insertion of a tumor cell neighborhood into assembled regions of the T cell-enriched, macrophage-enriched, and vasculature neighborhoods was associated with poor survival outcomes. These results suggested the basic and translational utility of tissue schematics to understand tissue function and malfunction.

Spatial patterns of protein distribution: In addition to the geospatial distribution of cells in tissues, the expression levels and spatial patterns of protein markers are also important indicators of tissue functions. Goltsev et al. showed that cell surface marker expression is dependent on immediate neighbor cells [5]. For example, levels of CD79b and B220 are niche dependent in spleens of mice. These data suggested that geospatial location of cells within a tissue could be an indicator of potential functions. Keren et al. found that the expression of immunoregulatory proteins in distinct cell types correlates with the tissue architecture (mixed and compartmentalized) and their distance to tumor-immune borders in MIBI of human breast cancer tissue [3]. Recently, MISTy, a machine learning framework to quantify and predict the relationships between marker expression and the local cellular neighborhood (juxtaview) and broader tissue structure (paraview) was introduced [34]. Interestingly, using this algorithm to analyze imaging mass cytometry data of human breast cancer tissue, Tanevski et al. showed that broader tissue structure has more effect on the protein marker expression than do immediate neighbors.

留言 (0)

沒有登入
gif