The first six categories, grouped under a Package score (open source, version control, unit testing, GitHub repository, tutorial and function documentation), assess the quality of the code, its availability, the presence of a tutorial to guide users through one or more examples, GitHub issue activity and responsiveness and (ideally) usage in a nonnative language (that is, from Python to R or vice versa). Principal component regression, derived from PCA, has previously been used to quantify batch removal (ref. 11). Diffusion maps of erythrocyte lineage cells of the four best (upper rows a, b and c) and four worst (lower rows a, b and c) integration methods, ordered by the overall score. Where HVGs were used, the top 2,000 were selected using a custom method, which selected HVGs in a manner unaffected by batch variance. For example, metrics that run on kNN graphs can be run on all output types after preprocessing. NMI compares the overlap of two clusterings. These results are remarkably consistent across tasks for integrating real data. Considering scalability, one might want to rapidly test how integration affects a dataset and thus opt for BBKNN. In summary, we evaluated the performance of 19 data integration outputs on six scATAC-seq tasks (Table 1) using 11 evaluation metrics (feature-level metrics were not applicable, see Supplementary Table 2). For each run of the Snakemake pipeline, we used the Snakemake benchmarking function to measure time and peak memory use (max PSS). 9 Usability assessment of data integration methods. For iLISI and cLISI this involved a two-step process. Values of 1 or 0 correspond to the same order of cells on the trajectory before and after integration or the reverse order, respectively.
A single-cell data integration method aims to combine high-throughput sequencing datasets or samples to produce a self-consistent version of the data for downstream analysis (ref. 7). Assuming there is a single, optimal way in which to run an integration method, we ranked methods by their top-performing preprocessing combination, which also indicated to users how best to run each integration method (Fig. These effects can arise from variations in sequencing depth, sequencing lanes, read length, plates or flow cells, protocol, experimental laboratories, sample acquisition and handling, sample composition, reagents or media and/or sampling time. The activity score was calculated as: To get the response score we first calculated a first response time for each issue. Even unintegrated data in gene activity space lacked biological variation in cell identities compared to the same data on peaks or windows. However, if this variation is encoded (for example, neutrophil states in the lung), scGen and scANVI are the only methods that are able to preserve cell state differences that are each present only in a single batch. 2b,c) shows a consistent picture: all high-performing methods successfully removed batch effects between individuals and platforms while conserving biological variation at the cell-type and subtype levels. For scRNA-seq tasks, we chose the best-performing combination of features (HVG or full features) and scaling flavors for each integration method, and then ranked these from best- to worst-performing to give a final ranking per task. For further details on datasets, please see the Supplementary Information. In comparison, the performance of MNN, ComBat and Seurat RPCA was better using HVG selection, with scaling having little effect on the output except a slightly improved performance in tasks with stronger batch effects. P.
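The trajectory conservation values described above (1 for the same cell order, 0 for the reverse order) can be illustrated with a small sketch. It assumes the score is a Spearman rank correlation s between pseudotime values before and after integration, rescaled as (s + 1)/2. That choice, and all function names, are assumptions made for illustration and not necessarily the scIB implementation.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def trajectory_conservation(pseudotime_before, pseudotime_after):
    """Spearman correlation of pseudotime, rescaled from [-1, 1] to [0, 1]."""
    s = pearson(average_ranks(pseudotime_before), average_ranks(pseudotime_after))
    return (s + 1) / 2


same_order = trajectory_conservation([0.1, 0.4, 0.9, 1.5], [1, 2, 3, 4])
reverse_order = trajectory_conservation([0.1, 0.4, 0.9, 1.5], [4, 3, 2, 1])
```

Because only ranks enter the score, any monotonic distortion of pseudotime by an integration method leaves the value unchanged in this sketch.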
Angerer was of great help for timely bug fixes in package dependencies. 874656 awarded to F.J.T., by the Wellcome Trust grant no. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. (six samples, single-nucleus ATAC-seq protocol; retrieved from http://data.nemoarchive.org/biccn/grant/cemba/ecker/chromatin/scell/raw/) and Cusanovich et al. As we expect batches to integrate within cell identity clusters, we compute the \({\mathrm{batchASW}}_j\) (ref. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Similarly, metrics that run on joint embeddings can also be run on corrected feature outputs. The color scheme indicates the overall ranking of each method. The overlap was scaled using the mean of the entropy terms for cell-type and cluster labels. performed the analysis. As LISI scores range from 1 to B (where B denotes the number of batches), indicating perfect separation and perfect mixing, respectively, we rescaled them to the range 0 to 1. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
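As an illustration of the NMI description above (overlap of two clusterings, scaled by the mean of the entropy terms), the following minimal sketch computes mutual information between cell-type labels and cluster labels and divides it by the arithmetic mean of the two entropies. The use of the arithmetic mean and the function names are assumptions made for this example.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same cells."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_ij / n) * log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(cell_types, clusters):
    """Mutual information scaled by the mean of the two entropy terms."""
    denom = (entropy(cell_types) + entropy(clusters)) / 2
    if denom == 0:
        # Both labelings contain a single group, so they agree trivially.
        return 1.0
    return mutual_information(cell_types, clusters) / denom


perfect = nmi(["T", "T", "B", "B"], [0, 0, 1, 1])
unrelated = nmi(["T", "T", "B", "B"], [0, 1, 0, 1])
```

A value of 1 means the clustering reproduces the cell-type labels exactly, and 0 means the two labelings share no information.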
Instead, dimensionality reduction approaches designed for scATAC-seq data (ref. 35) in combination with an MNN approach as implemented in FastMNN or Scanorama may represent a promising avenue for future integration approaches for this modality. 7 and Methods), we found that ComBat, BBKNN and SAUCIE performed best in terms of runtime and scVI, scANVI and BBKNN were the most memory efficient. Further, Harmony exhibited the lowest isolated label F1 bio-conservation score among top performers. For example, while certain bio-conservation metrics prioritized clearly separated cell clusters, others measured continuous cellular variation such as trajectories and the cell cycle, or evaluated gene-level output. (b) Regression coefficients for number of cells and features on maximum memory usage for each method. Given sufficient numbers of cells, scVI has shown that it is able to remove strong batch effects while only sacrificing minimal biological variation. For the real data tasks, we downloaded 23 published datasets (see Supplementary Data 2 for a per-batch overview of datasets). Benchmarking atlas-level data integration in single-cell genomics - integration task datasets: human_pancreas_norm_complexBatch.h5ad (301.32 MB), Lung_atlas_public.h5ad (972.43 MB), Immune_ALL_hum_mou.h5ad (3.97 GB), Immune_ALL_human.h5ad (1.92 GB), large_atac_peaks.h5ad (2.27 GB), large_atac_windows.h5ad (2.3 GB). 215), we found that the varying complexity of tasks affects the ranking of integration methods: while Seurat v3 and Harmony perform well on simpler real data tasks and some simulations, Scanorama and scVI performed particularly well on more complex real data. wrote the code for the scIB package and pipeline.
Given runtime and memory limitations imposed in our benchmark, trVAE could not integrate datasets with >34,000 cells, while Seurat v3, MNN and scGen failed to integrate datasets with >100,000 cells (Supplementary Data 3). We used the same set of cell-cycle genes for mouse and human data (using capitalization to convert between the gene symbols). Some methods output more than an integrated graph, joint embedding or corrected feature space; for example, scANVI outputs predicted labels where these are not provided, and DESC outputs a clustering of the data. Methods were tested with scaled and unscaled data as input, using the full feature (gene/open chromatin window or peak) set or only HVGs. Conos, which incorporates HVG selection and scaling within its method, performed slightly better on full feature input with scaling applied depending on the task. Our overall rankings were based on metrics measuring different aspects of integration success (for an overview, see the website and Supplementary Figs. Methods that can remove strong batch effects also tend to remove nuanced biological signals or require cell identity labels obtained via per-batch data processing. prepared the simulations. Pipeline for benchmarking atlas-level single-cell integration: this repository contains the Snakemake pipeline for our benchmarking study of data integration tools. Using these subset kNN graphs, we computed the graph connectivity (GC) score using the equation: \(GC = \frac{1}{|C|}\sum_{c \in C} \frac{|{\mathrm{LCC}}(G(N_c;E_c))|}{|N_c|}\). Here, C represents the set of cell identity labels, \(|{\mathrm{LCC}}(G(N_c;E_c))|\) is the number of nodes in the largest connected component of the subset kNN graph for cell identity c and \(|N_c|\) is the number of nodes with cell identity c.
The resultant score has a range of (0;1], where 1 indicates that all cells with the same cell identity are connected in the integrated kNN graph and the lowest possible score indicates a graph where no cell is connected. As a dataset-specific alternative, method selection can be guided by running the scIB pipeline to test all methods on a user-provided dataset. Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany: Malte D. Luecken, M. Büttner, K. Chaichoompu, A. Danese, M. F. Mueller, D. C. Strobl, L. Zappia, M. Colomé-Tatché & Fabian J. Theis. Institute of Medical Informatics, University of Münster, Münster, Germany. Department of Mathematics, Technische Universität München, Garching bei München, München, Germany. Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany. TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany. Biomedical Center (BMC), Physiological Chemistry, Faculty of Medicine, Ludwig Maximilian University of Munich, Planegg-Martinsried, Germany. We find that Scanorama and scVI perform well, particularly on complex integration tasks. b,c, Visualization of the four best performers on the human immune cell integration task colored by cell identity (b) and batch annotation (c). Gene activities were particularly poorly suited to represent scATAC-seq data. Abstract: Cell atlases often include samples that span locations, labs, and conditions, leading to complex, nested batch effects in data. Trajectory structure was slightly better conserved in the overall high-performing methods Scanorama, scGen and FastMNN, while poor performers were consistent across label-free metrics (Supplementary Figs. If >25% of cells were assigned to connected components too small for kBET computation (smaller than k × 3), we assigned a kBET score of 1 to denote poor batch removal.
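The graph connectivity score described above can be sketched as follows: for each cell identity label, the kNN graph is restricted to cells with that label, the largest connected component is found, and its size is divided by the number of cells with that label; the per-label ratios are then averaged. This is a minimal sketch that assumes an undirected graph given as a symmetric adjacency dictionary; the data layout and function names are hypothetical.

```python
from collections import deque


def largest_component_size(nodes, adjacency):
    """Size of the largest connected component of the subgraph induced by nodes."""
    nodes = set(nodes)
    seen = set()
    best = 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adjacency.get(u, ()):
                # Only follow edges that stay inside the label-specific subset.
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def graph_connectivity(adjacency, labels):
    """Mean over labels of |largest connected component| / |cells with label|."""
    cells_by_label = {}
    for cell, label in labels.items():
        cells_by_label.setdefault(label, []).append(cell)
    ratios = [
        largest_component_size(cells, adjacency) / len(cells)
        for cells in cells_by_label.values()
    ]
    return sum(ratios) / len(ratios)


# Toy graph: label A is fully connected, label B has one isolated cell.
toy_adjacency = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
toy_labels = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B"}
gc_score = graph_connectivity(toy_adjacency, toy_labels)
```

In the toy example, label A contributes 2/2 and label B contributes 2/3, so the score is 5/6. The score can never reach 0 because every non-empty label has a component of at least one cell, which matches the half-open range (0;1].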
Our graph LISI extension produces consistent metric values with the standard LISI implementation for non-graph-based integration outputs (Supplementary Fig. Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Overall, the embeddings output by Scanorama, scANVI and scVI perform best, whereas SAUCIE and DESC perform poorly. All authors reviewed the final paper. If available, we computed 500 HVGs per batch. Benchmarking atlas-level data integration in single-cell genomics. Here, 0 indicates that batches are well mixed, and any deviation from 0 indicates a batch effect: To ensure higher scores indicate better batch mixing, these scores are scaled by subtracting them from 1. Yet, these often left batch structure within cell-type clusters and thus failed to fully integrate batches (Supplementary Figs. In our study, we benchmark 16 methods (see Tools) with 4 combinations of preprocessing steps leading to 68 method combinations on 85 batches of gene expression and chromatin accessibility data.
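The batch mixing score described above (0 means well mixed, deviations from 0 indicate a batch effect, and scores are subtracted from 1) can be sketched as below. The sketch assumes that the quantity in question is a per-cell silhouette width computed on batch labels, that its absolute value is taken before subtracting from 1, and that scores are averaged within each cell identity label and then across labels. These are assumptions for illustration; the silhouette widths are taken as precomputed inputs and the function name is hypothetical.

```python
def batch_asw(silhouette_widths, cell_labels):
    """Return (overall score, per-label scores) from per-cell batch silhouettes.

    silhouette_widths: per-cell silhouette width on batch labels, in [-1, 1].
    cell_labels: cell identity label of each cell, in the same order.
    """
    scaled_by_label = {}
    for s, label in zip(silhouette_widths, cell_labels):
        # |s| = 0 means well-mixed batches; 1 - |s| makes higher values better.
        scaled_by_label.setdefault(label, []).append(1 - abs(s))
    per_label = {
        label: sum(values) / len(values)
        for label, values in scaled_by_label.items()
    }
    overall = sum(per_label.values()) / len(per_label)
    return overall, per_label


overall_score, label_scores = batch_asw(
    [0.0, 0.0, 0.5, -0.5],
    ["T cell", "T cell", "B cell", "B cell"],
)
```

Averaging per label first keeps large cell populations from dominating the overall value in this sketch.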
We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. First, all integration outputs were treated as separate integration runs. LIGER performs stronger batch removal than Harmony, although it leaves some batch structure within cerebellar granule cells on large ATAC tasks. Each integration method was evaluated with regard to accuracy, usability and scalability (Methods). We reported the regression coefficients for both number of cells, \(\beta_1\), and number of features, \(\beta_2\), to compare scalability between methods (Extended Data Fig. Metrics are divided into batch correction (blue) and bio-conservation (pink) categories. (http://dropviz.org/; DGE by Region section). Although scaling aided integration across species in several methods, it did not lead to a better conservation of the trajectory, as even the best trajectory-conserving methods did not integrate perfectly across species in the human or mouse task. Overall method rankings across tasks (for example, Fig. All of the top-performing methods exhibited high trajectory conservation scores, whereas DESC (on scaled/HVG data), scGen (on scaled/full feature data) and Seurat v3 CCA (on scaled/HVG data) produced poor conservation of this trajectory due to overclustering (DESC), merging of cell types (Seurat v3 CCA) or lack of relevant biological latent structure (scGen). We focus in particular on assessing the conservation of biological variation beyond cell identity labels via new integration metrics on trajectories or cell-cycle variation.
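The scalability regression mentioned above, with one coefficient for the number of cells and one for the number of features, can be illustrated with an ordinary least-squares fit. This sketch assumes that runtime (or memory) and both size variables are log-transformed before fitting, so the coefficients act as scaling exponents; that transformation, the function names and the synthetic measurements are assumptions for illustration only.

```python
from math import log


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def scalability_coefficients(n_cells, n_features, resource):
    """Fit log(resource) = b0 + b1*log(cells) + b2*log(features); return (b1, b2)."""
    rows = [[1.0, log(c), log(f)] for c, f in zip(n_cells, n_features)]
    y = [log(v) for v in resource]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    _, beta_cells, beta_features = solve_linear_system(xtx, xty)
    return beta_cells, beta_features


# Synthetic measurements that scale as cells^1.5 * features^0.5.
cells = [1000, 1000, 10000, 10000, 100000, 50000]
features = [500, 2000, 500, 2000, 1000, 4000]
runtime = [2.0 * c ** 1.5 * f ** 0.5 for c, f in zip(cells, features)]
beta_cells, beta_features = scalability_coefficients(cells, features, runtime)
```

Under this log-log assumption, a cell coefficient near 1 indicates roughly linear scaling with dataset size, while larger values indicate methods that become expensive quickly on atlas-scale data.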
and M.C.-T., both through the Initiative and Network Fund of the Helmholtz Association awarded to F.J.T., by the Bavarian Ministry of Science and the Arts in the framework of the Bavarian Research Association ForInter (Interaction of human brain cells) awarded to F.J.T. With the growing availability of datasets, removing batch effects within scATAC-seq data is also becoming an application of interest. For example, in the lung task, three datasets sample two distinct spatial locations (airway and parenchyma). Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Completion status of all attempted integration method runs with error message summaries. Conservation of biological variation in single-cell data can be captured at the scale of cell identity labels (label conservation) and beyond this level of annotation (label-free conservation). Second, we rescaled the LISI scores as follows: \({\mathrm{cLISI}}:f(x) = \frac{{B - x}}{{B - 1}}\), where a 0 value corresponds to low cell-type separation and \({\mathrm{iLISI}}:g(x) = \frac{{x - 1}}{{B - 1}}\), where a 0 value corresponds to low batch integration. Source code for the website is available at https://github.com/theislab/scib-reproducibility/tree/main/website. On the right-hand side, criteria with poor scores across methods are highlighted for each category. Points marked with a cross use scaled features. 4c), fully merged batches within cell-type clusters.
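The two rescaling functions above translate directly into code. The sketch below applies f(x) = (B - x)/(B - 1) for cLISI and g(x) = (x - 1)/(B - 1) for iLISI to raw LISI values in the range 1 to B; the function names and the input check are additions made for this example.

```python
def rescale_clisi(x, n_batches):
    """cLISI: f(x) = (B - x) / (B - 1); 0 corresponds to low cell-type separation."""
    if n_batches < 2:
        raise ValueError("rescaling needs at least two batches (B > 1)")
    return (n_batches - x) / (n_batches - 1)


def rescale_ilisi(x, n_batches):
    """iLISI: g(x) = (x - 1) / (B - 1); 0 corresponds to low batch integration."""
    if n_batches < 2:
        raise ValueError("rescaling needs at least two batches (B > 1)")
    return (x - 1) / (n_batches - 1)


# With B = 4 batches, raw LISI values of 1 and 4 map to the ends of [0, 1].
ilisi_low, ilisi_high = rescale_ilisi(1, 4), rescale_ilisi(4, 4)
clisi_high, clisi_low = rescale_clisi(1, 4), rescale_clisi(4, 4)
```

After rescaling, higher values are better for both metrics, which lets them be averaged with the other scores in the range 0 to 1.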
Our reprocessed versions of these datasets are publicly available as preprocessed Anndata objects on Figshare (https://doi.org/10.6084/m9.figshare.12420968; ref. 52). For the mouse brain (ATAC) integration, we used FASTQ files from Fang et al. Currently, at least 49 integration methods for scRNA-seq data are available (ref. 8; as of November 2020, Supplementary Table 1). We ranked the methods in each task and computed an average rank across tasks. Thus, we tested up to 68 data integration setups per integration task, resulting in 590 attempted integration runs. In other words, the variance remains unchanged within each batch for complete conservation, while any deviation from the preintegration variance contribution reduces the score.
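The per-task ranking and cross-task averaging described above can be sketched as follows. The sketch assumes that higher overall scores are better, that ties are broken by input order, and that a method missing from a task (for example, a failed run) is averaged only over the tasks where it was ranked. These choices, the method names A, B and C and the scores are hypothetical.

```python
def rank_methods(scores):
    """Map each method to its rank within one task (1 = highest score)."""
    ordered = sorted(scores, key=lambda method: scores[method], reverse=True)
    return {method: position + 1 for position, method in enumerate(ordered)}


def average_rank(task_scores):
    """Average each method's per-task rank over the tasks it appears in."""
    ranks_by_method = {}
    for scores in task_scores.values():
        for method, rank in rank_methods(scores).items():
            ranks_by_method.setdefault(method, []).append(rank)
    return {
        method: sum(ranks) / len(ranks)
        for method, ranks in ranks_by_method.items()
    }


# Hypothetical overall scores for three methods on three tasks.
example_scores = {
    "pancreas": {"A": 0.9, "B": 0.7, "C": 0.8},
    "lung": {"A": 0.8, "B": 0.6, "C": 0.7},
    "immune": {"A": 0.5, "B": 0.9, "C": 0.7},
}
mean_ranks = average_rank(example_scores)
```

In this example, method A ranks first on two tasks and last on one, giving an average rank of 5/3, ahead of C at 2 and B at 7/3.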