Supervised integration using SIGNAL
Yang Zhou
2024-06-06
Source: vignettes/Supervised_integration.Rmd
Introduction
SIGNAL can be used to perform supervised integration of single-cell data. It takes two inputs: a feature-by-cell data matrix X (a matrix-like object) and the metadata meta (a data frame containing meta information), which must include a batch variable b and at least one group variable g. Using these, SIGNAL embeds X from multiple batches into a low-dimensional space that is not affected by batch effects and in which the subgroups defined by the group variables are aligned across batches.
In this vignette we demonstrate how to use SIGNAL to perform supervised integration, i.e., integration using ‘cell type’ as the group variable.
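The input requirements described above can be checked before running an integration. The following is a minimal sketch on toy data; the helper `check_inputs` and the objects `X_toy` and `meta_toy` are hypothetical illustrations and are not part of the SIGNAL package.

```r
# Toy feature-by-cell matrix: 5 genes (rows) x 6 cells (columns).
set.seed(1)
X_toy = matrix(rpois(30, lambda = 2), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("cell", 1:6)))

# Toy metadata: one row per cell, with a batch column and a group column.
meta_toy = data.frame(
  Batch = factor(rep(c("Batch_1", "Batch_2"), each = 3)),
  CellType = factor(c("293t", "jurkat", "293t", "jurkat", "293t", "jurkat")),
  row.names = colnames(X_toy)
)

# Hypothetical helper (not a SIGNAL function): basic sanity checks on the inputs.
check_inputs = function(X, meta, b_factor, g_factor) {
  vars = c(b_factor, g_factor)
  stopifnot(
    ncol(X) == nrow(meta),          # one metadata row per cell
    all(vars %in% colnames(meta)),  # batch and group variables are present
    !any(is.na(meta[, vars, drop = FALSE]))  # no missing labels
  )
  invisible(TRUE)
}

inputs_ok = check_inputs(X_toy, meta_toy, b_factor = "Batch", g_factor = "CellType")
```

Because `g_factor` is handled as a character vector, the same check would also cover the case of more than one group variable.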
Load data matrix and metadata
We demonstrate SIGNAL integration on a commonly used single-cell RNA sequencing (scRNA-seq) dataset of cell lines. The highly variable genes (HVGs) are already selected and are used to perform integration.
library(SIGNAL)
library(Matrix)  # X is stored as a sparse dgCMatrix
X = readRDS("/home/server/zy/group_scripts/datasets_preparation/Jurkat_293t/X.rds")
meta = readRDS("/home/server/zy/group_scripts/datasets_preparation/Jurkat_293t/meta.rds")
We can take a look at the data.
str(X)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:3137390] 3 12 14 26 30 35 36 40 43 44 ...
## ..@ p : int [1:9532] 0 325 590 932 1251 1593 1944 2318 2644 2963 ...
## ..@ Dim : int [1:2] 2000 9531
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:2000] "AP006222.2" "RP11-206L10.2" "C1orf170" "HES4" ...
## .. ..$ : chr [1:9531] "AAACATACACTGGT-1" "AAACATACAGACTC-1" "AAACATTGACCAAC-1" "AAACATTGAGGCGA-1" ...
## ..@ x : num [1:3137390] 1.9 1.975 1.53 0.937 1.121 ...
## ..@ factors : list()
str(meta)
## 'data.frame': 9531 obs. of 2 variables:
## $ CellType: Factor w/ 2 levels "293t","jurkat": 1 1 1 1 1 1 1 1 1 1 ...
## $ Batch : Factor w/ 3 levels "Batch_1","Batch_2",..: 1 1 1 1 1 1 1 1 1 1 ...
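Since supervised integration aligns cells of the same group across batches, it can be useful to see which cell types occur in which batch. A minimal sketch on a hypothetical toy metadata frame follows; `meta_toy` and its labels are made up for illustration, and the real `meta` could be substituted for it.

```r
# Toy metadata with the same two columns as `meta` (values are illustrative only).
meta_toy = data.frame(
  CellType = factor(c("293t", "293t", "jurkat", "jurkat", "293t", "jurkat")),
  Batch = factor(c("Batch_1", "Batch_1", "Batch_2", "Batch_2", "Batch_3", "Batch_3"))
)

# Cross-tabulate batches (rows) against cell types (columns).
composition = table(meta_toy$Batch, meta_toy$CellType)
print(composition)

# Cell types observed in more than one batch, i.e. groups that can be aligned across batches.
shared_groups = colnames(composition)[colSums(composition > 0) > 1]
```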
Visualization of raw data
Let us visualize the raw data using PCA and UMAP.
# Packages used for PCA, UMAP, and plotting
library(irlba)
library(uwot)
library(ggpubr)
library(ggplot2)
library(cowplot)

# Truncated SVD of the cell-by-feature matrix; PC scores are U %*% diag(d)
pca_res = irlba(t(X), nv = 30)
raw_emb = as.matrix(pca_res$u %*% diag(pca_res$d))
raw_umap = as.data.frame(umap(raw_emb))
colnames(raw_umap) = c("UMAP1", "UMAP2")
raw_umap = cbind.data.frame(meta, raw_umap)
p1 = ggscatter(raw_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "Batch", palette = "npg", legend = "right") +
  guides(colour = guide_legend(override.aes = list(size = 2)))
p2 = ggscatter(raw_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "CellType", palette = "npg", legend = "right") +
  guides(colour = guide_legend(override.aes = list(size = 2)))
plot_grid(p1, p2, align = 'h', axis = "b")
We can see that there are batch effects between Jurkat cells from batch 2 and batch 3.
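The PCA step used for the raw-data visualization takes the left singular vectors scaled by the singular values as the low-dimensional coordinates. As a minimal sketch of why this works, the toy example below shows that these scores equal the projection of the data onto the top right singular vectors. It uses base R `svd()` instead of `irlba()` so that it has no dependencies, and the matrix `A` is random toy data standing in for `t(X)`. As in the call above, the data are not explicitly centered.

```r
set.seed(42)
# Toy cell-by-feature matrix (20 cells x 8 features), standing in for t(X).
A = matrix(rnorm(160), nrow = 20, ncol = 8)
k = 3  # number of components to keep

# Exact SVD from base R, keeping the top-k left and right singular vectors.
s = svd(A, nu = k, nv = k)

# Scores computed as in the visualization code: U %*% diag(d).
scores_ud = s$u %*% diag(s$d[1:k])

# Equivalent view: project the cells onto the top-k right singular vectors.
scores_proj = A %*% s$v

# The two sets of scores agree up to floating-point error.
max_abs_diff = max(abs(scores_ud - scores_proj))
```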
SIGNAL supervised integration
We perform SIGNAL supervised integration and visualize the integrated result. Integration is usually fast: SIGNAL can integrate 1 million cells in about 2 minutes.
signal_emb = Run.gcPCA(X, meta, g_factor = "CellType", b_factor = "Batch")
## Run gcPCA!
## gcPCA done!
# Transpose the embedding so that cells are rows, as umap() expects
signal_umap = as.data.frame(umap(t(signal_emb)))
colnames(signal_umap) = c("UMAP1", "UMAP2")
signal_umap = cbind.data.frame(meta, signal_umap)
q1 = ggscatter(signal_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "Batch", palette = "npg", legend = "right") +
  guides(colour = guide_legend(override.aes = list(size = 2)))
q2 = ggscatter(signal_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "CellType", palette = "npg", legend = "right") +
  guides(colour = guide_legend(override.aes = list(size = 2)))
plot_grid(q1, q2, align = 'h', axis = "b")
Session Info
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.10
##
## Matrix products: default
## BLAS/LAPACK: /home/server/anaconda3/envs/zy/lib/libopenblasp-r0.3.21.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_1.1.1 ggpubr_0.6.0 ggplot2_3.4.4 uwot_0.2.2 irlba_2.3.5.1
## [6] Matrix_1.5-4.1 SIGNAL_1.0.0
##
## loaded via a namespace (and not attached):
## [1] MatrixGenerics_1.10.0 sass_0.4.9 tidyr_1.3.1
## [4] jsonlite_1.8.8 foreach_1.5.2 carData_3.0-5
## [7] bslib_0.7.0 highr_0.10 stats4_4.2.3
## [10] yaml_2.3.8 pillar_1.9.0 backports_1.4.1
## [13] lattice_0.21-8 glue_1.7.0 RcppEigen_0.3.4.0.0
## [16] digest_0.6.35 ggsignif_0.6.4 colorspace_2.1-0
## [19] htmltools_0.5.8.1 pkgconfig_2.0.3 broom_1.0.5
## [22] bigparallelr_0.3.2 rmio_0.4.0 purrr_1.0.2
## [25] scales_1.3.0 RSpectra_0.16-1 ff_4.0.12
## [28] BiocParallel_1.32.6 tibble_3.2.1 farver_2.1.1
## [31] car_3.1-2 generics_0.1.3 cachem_1.0.8
## [34] withr_3.0.0 BiocGenerics_0.44.0 cli_3.6.2
## [37] magrittr_2.0.3 mclust_6.0.0 ps_1.7.6
## [40] memoise_2.0.1 evaluate_0.23 bigassertr_0.1.6
## [43] fs_1.6.4 fansi_1.0.6 doParallel_1.0.17
## [46] rstatix_0.7.2 textshaping_0.3.7 tools_4.2.3
## [49] lifecycle_1.0.4 matrixStats_1.0.0 S4Vectors_0.36.2
## [52] munsell_0.5.1 ggsci_3.0.0 compiler_4.2.3
## [55] pkgdown_2.0.7 jquerylib_0.1.4 systemfonts_1.0.6
## [58] rlang_1.1.3 grid_4.2.3 iterators_1.0.14
## [61] BiocNeighbors_1.16.0 rstudioapi_0.15.0 RcppAnnoy_0.0.22
## [64] labeling_0.4.3 rmarkdown_2.26 gtable_0.3.5
## [67] codetools_0.2-19 flock_0.7 abind_1.4-5
## [70] bigstatsr_1.5.12 R6_2.5.1 knitr_1.46
## [73] dplyr_1.1.4 bit_4.0.5 fastmap_1.1.1
## [76] utf8_1.2.4 ragg_1.2.7 desc_1.4.3
## [79] parallel_4.2.3 Rcpp_1.0.12 vctrs_0.6.5
## [82] tidyselect_1.2.1 xfun_0.43 sparseMatrixStats_1.10.0