Introduction

SIGNAL performs supervised integration of single-cell data. It takes two inputs: a feature-by-cell data matrix X (a matrix-like object) and metadata meta (a data frame of cell-level annotations). The metadata must contain a batch variable b and at least one group variable g. Using X together with the batch and group variables, SIGNAL embeds cells from multiple batches into a low-dimensional space in which batch effects are removed and the subgroups defined by the group variables are aligned across batches.
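To make the expected input shapes concrete, here is a minimal sketch that builds toy inputs of the kind described above. Everything in it (`X_toy`, `meta_toy`, the gene and cell names, the cell type labels) is hypothetical, and the row-order check reflects an assumed convention that row i of the metadata describes column i of the matrix; it is not taken from the SIGNAL documentation.

```r
library(Matrix)

set.seed(1)
n_genes = 50
n_cells = 120

# Toy sparse feature-by-cell matrix: genes in rows, cells in columns
# (made-up values, not real expression data)
X_toy = rsparsematrix(n_genes, n_cells, density = 0.2,
                      rand.x = function(n) rexp(n))
dimnames(X_toy) = list(paste0("gene", seq_len(n_genes)),
                       paste0("cell", seq_len(n_cells)))

# Toy metadata: one row per cell, with one group and one batch variable
meta_toy = data.frame(
  CellType = factor(rep(c("typeA", "typeB"), each = n_cells / 2)),
  Batch    = factor(rep(c("Batch_1", "Batch_2", "Batch_3"), times = n_cells / 3)),
  row.names = colnames(X_toy)
)

# Assumed convention: row i of the metadata describes column i of the matrix
stopifnot(
  ncol(X_toy) == nrow(meta_toy),
  identical(colnames(X_toy), rownames(meta_toy)),
  is.factor(meta_toy$CellType),
  is.factor(meta_toy$Batch)
)
```

With real data, the same check guards against a metadata table that was reordered or subset independently of the matrix.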

In this vignette we demonstrate how to use SIGNAL for supervised integration, i.e., integration using ‘cell type’ as the group variable.

Load data matrix and metadata

We demonstrate SIGNAL integration on a commonly used single-cell RNA sequencing (scRNA-seq) dataset of cell lines. The highly variable genes (HVGs) have already been selected and are used for the integration.

X = readRDS("/home/server/zy/group_scripts/datasets_preparation/Jurkat_293t/X.rds")
meta = readRDS("/home/server/zy/group_scripts/datasets_preparation/Jurkat_293t/meta.rds")

We can take a look at the data.

str(X)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3137390] 3 12 14 26 30 35 36 40 43 44 ...
##   ..@ p       : int [1:9532] 0 325 590 932 1251 1593 1944 2318 2644 2963 ...
##   ..@ Dim     : int [1:2] 2000 9531
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:2000] "AP006222.2" "RP11-206L10.2" "C1orf170" "HES4" ...
##   .. ..$ : chr [1:9531] "AAACATACACTGGT-1" "AAACATACAGACTC-1" "AAACATTGACCAAC-1" "AAACATTGAGGCGA-1" ...
##   ..@ x       : num [1:3137390] 1.9 1.975 1.53 0.937 1.121 ...
##   ..@ factors : list()
str(meta)
## 'data.frame':    9531 obs. of  2 variables:
##  $ CellType: Factor w/ 2 levels "293t","jurkat": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Batch   : Factor w/ 3 levels "Batch_1","Batch_2",..: 1 1 1 1 1 1 1 1 1 1 ...

Visualization of raw data

Let us visualize the raw data using PCA and UMAP.

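library(SIGNAL)
library(Matrix)
library(irlba)    # truncated SVD
library(uwot)     # umap()
library(ggplot2)  # guides(), guide_legend()
library(ggpubr)   # ggscatter()
library(cowplot)  # plot_grid()

# Truncated SVD of the cell-by-feature matrix, keeping 30 components
pca_res = irlba(t(X), nv = 30)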
raw_emb = as.matrix(pca_res$u %*% diag(pca_res$d))
raw_umap = as.data.frame(umap(raw_emb))
colnames(raw_umap) = c("UMAP1", "UMAP2")
raw_umap = cbind.data.frame(meta, raw_umap)
p1 = ggscatter(raw_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "Batch", palette = "npg", legend = "right") + 
  guides(colour = guide_legend(override.aes = list(size = 2)))
p2 = ggscatter(raw_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "CellType", palette = "npg", legend = "right") + 
  guides(colour = guide_legend(override.aes = list(size = 2)))
plot_grid(p1, p2, align = 'h', axis = "b")

We can see that there are batch effects between Jurkat cells from batch 2 and batch 3.
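The raw embedding above is computed as `u %*% diag(d)` from a truncated SVD of the cell-by-feature matrix `t(X)`. The sketch below illustrates why this gives low-dimensional cell coordinates: the product equals the projection of the cells onto the leading right singular vectors. It uses base R's `svd()` on a small random matrix `A` (a stand-in, not the vignette data) rather than `irlba()`, and, like the code above, applies no centering.

```r
set.seed(42)

# Stand-in for t(X): 20 "cells" in rows, 5 "features" in columns
A = matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
k = 3  # number of components to keep

# SVD keeping the first k left and right singular vectors
s = svd(A, nu = k, nv = k)

# Cell coordinates as left singular vectors scaled by singular values
scores_ud = s$u %*% diag(s$d[1:k])

# The same coordinates as a projection onto the right singular vectors
scores_av = A %*% s$v

# The two agree up to floating-point error
max_abs_diff = max(abs(scores_ud - scores_av))
stopifnot(max_abs_diff < 1e-10)
```

Scaling by the singular values keeps the relative spread of each component, which matters because UMAP is run on distances in this space.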

SIGNAL supervised integration

We now run SIGNAL supervised integration and visualize the integrated result. The computation is fast: SIGNAL can integrate 1 million cells in about 2 minutes.

signal_emb = Run.gcPCA(X, meta, g_factor = "CellType", b_factor = "Batch")
## Run gcPCA!
## gcPCA done!
signal_umap = as.data.frame(umap(t(signal_emb)))
colnames(signal_umap) = c("UMAP1", "UMAP2")
signal_umap = cbind.data.frame(meta, signal_umap)
q1 = ggscatter(signal_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "Batch", palette = "npg", legend = "right") + 
  guides(colour = guide_legend(override.aes = list(size = 2)))
q2 = ggscatter(signal_umap, x = "UMAP1", y = "UMAP2", size = 0.1, color = "CellType", palette = "npg", legend = "right") + 
  guides(colour = guide_legend(override.aes = list(size = 2)))
plot_grid(q1, q2, align = 'h', axis = "b")

Session Info
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.10
## 
## Matrix products: default
## BLAS/LAPACK: /home/server/anaconda3/envs/zy/lib/libopenblasp-r0.3.21.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] cowplot_1.1.1  ggpubr_0.6.0   ggplot2_3.4.4  uwot_0.2.2     irlba_2.3.5.1 
## [6] Matrix_1.5-4.1 SIGNAL_1.0.0  
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.10.0    sass_0.4.9               tidyr_1.3.1             
##  [4] jsonlite_1.8.8           foreach_1.5.2            carData_3.0-5           
##  [7] bslib_0.7.0              highr_0.10               stats4_4.2.3            
## [10] yaml_2.3.8               pillar_1.9.0             backports_1.4.1         
## [13] lattice_0.21-8           glue_1.7.0               RcppEigen_0.3.4.0.0     
## [16] digest_0.6.35            ggsignif_0.6.4           colorspace_2.1-0        
## [19] htmltools_0.5.8.1        pkgconfig_2.0.3          broom_1.0.5             
## [22] bigparallelr_0.3.2       rmio_0.4.0               purrr_1.0.2             
## [25] scales_1.3.0             RSpectra_0.16-1          ff_4.0.12               
## [28] BiocParallel_1.32.6      tibble_3.2.1             farver_2.1.1            
## [31] car_3.1-2                generics_0.1.3           cachem_1.0.8            
## [34] withr_3.0.0              BiocGenerics_0.44.0      cli_3.6.2               
## [37] magrittr_2.0.3           mclust_6.0.0             ps_1.7.6                
## [40] memoise_2.0.1            evaluate_0.23            bigassertr_0.1.6        
## [43] fs_1.6.4                 fansi_1.0.6              doParallel_1.0.17       
## [46] rstatix_0.7.2            textshaping_0.3.7        tools_4.2.3             
## [49] lifecycle_1.0.4          matrixStats_1.0.0        S4Vectors_0.36.2        
## [52] munsell_0.5.1            ggsci_3.0.0              compiler_4.2.3          
## [55] pkgdown_2.0.7            jquerylib_0.1.4          systemfonts_1.0.6       
## [58] rlang_1.1.3              grid_4.2.3               iterators_1.0.14        
## [61] BiocNeighbors_1.16.0     rstudioapi_0.15.0        RcppAnnoy_0.0.22        
## [64] labeling_0.4.3           rmarkdown_2.26           gtable_0.3.5            
## [67] codetools_0.2-19         flock_0.7                abind_1.4-5             
## [70] bigstatsr_1.5.12         R6_2.5.1                 knitr_1.46              
## [73] dplyr_1.1.4              bit_4.0.5                fastmap_1.1.1           
## [76] utf8_1.2.4               ragg_1.2.7               desc_1.4.3              
## [79] parallel_4.2.3           Rcpp_1.0.12              vctrs_0.6.5             
## [82] tidyselect_1.2.1         xfun_0.43                sparseMatrixStats_1.10.0