Last Updated: July 20, 2022

Introduction

The challenge in analyzing scATACseq data lies with the nature of the data. It is extremely sparse and almost binary. For a diploid genome, each genomic site can be accessible on one allele, both alleles and no alleles. Many accessible regions are not transposed. The sparsity makes calling peaks very difficult. Many approaches have been developed to overcome this challenge. One approach is to group multiple regions together based on sequence similarity, such as binding motif, or k-mer, or co-activation pattern in DNaseq-seq data in ENCODE. This approach suffers from the loss of resolution. The second approach is to pool multiple cells together, which is used by cellranger-atac analysis pipeline and many other packages. However, this approach does not always solve the issue of sparsity, especially for rare cell types, and could underrepresent the rare cell types. The third approach works with the sparse matrix directly by turning the matrix into a binary accessibility matrix (SnapATAC, 5kb window size).

ArchR uses directly the sparse matrix and 500bp sliding window size to profile genome wide accessibility. It uses an optimized iterative latent semantic indexing (LSI) method for dimensionality reduction. ArchR is very efficient dealing with large datasets.

ArchR comparison

Install ArchR and load hg38 genome data

commands to install ArchR in R

the peak calling package macs2 used by ArchR need some special handling and the one version of R that I have found to work is R 4.1.0

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
}

if (!requireNamespace("rtracklayer", quietly = TRUE)){
    install.packages("rtracklayer")
}

if (!requireNamespace("devtools", quietly = TRUE)){
    install.packages("devtools")
}

if (!requireNamespace("pheatmap", quietly = TRUE)){
    install.packages("pheatmap")
}

devtools::install_github("GreenleafLab/ArchR", ref="master", repos = BiocManager::repositories())

library(ArchR)
ArchR::installExtraPackages()

## For nice plot in ArchR, package Cairo is required, but it is not absolutely necessary. The installation of Cairo needs XQuartz (https://www.xquartz.org/) for MacOS.

if (!requireNamespace("Cairo", quietly = TRUE)){
    install.packages("Cairo")
}

## Install peak calling package macs2

if (!requireNamespace("remotes", quietly = TRUE)){
    install.packages("remotes")
}

## for motif analysis
if (!requireNamespace("chromVARmotifs", quietly = TRUE)){
    devtools::install_github("GreenleafLab/chromVARmotifs")
}


#remotes::install_github("RockefellerUniversity/Herper")
#library(Herper)
#if (!requireNamespace("hexbin", quietly = TRUE)){
#    install.packages("hexbin")
#}

#install_CondaTools(tools="macs2", env="PeakCalling_analysis", pathToMiniConda="./")

ArchR is not dependent on the package Cairo, which provides the way to rasterize plots. Without Cairo, the generated plots are vectorized and difficult to edit.

Start ArchR and load the genome data

library(knitr)
library(pheatmap)
library(rtracklayer)
library(chromVARmotifs)
library(ArchR)
addArchRGenome("hg38")

There are pre-build genome data in ArchR: hg19, hg38, mm9 and mm10, but it supports the building of custom genomes.

Read in scATAC data

inputs <- system("ls */outs/fragments.tsv.gz", intern=T)
names(inputs) <- sapply(inputs, function(x){strsplit(x, "/")[[1]][1]})
#ArrowFiles <- createArrowFiles(inputFiles = inputs, sampleNames = names(inputs),
#   filterTSS = 4, # filter based on TSS enrichment score
#   filterFrags = 1000, # filter based on the minimum fragments per cell
#   addTileMat = T, # 500bp sliding window counts matrix
#   addGeneScoreMat = T) # Gene-Score Matrix uses ATAC-Seq signal proximal to the TSS to estimate gene activity
ArrowFiles <- c("A001-C-007.arrow", "A001-C-104.arrow", "B001-A-301.arrow")
ArrowFiles
## [1] "A001-C-007.arrow" "A001-C-104.arrow" "B001-A-301.arrow"

The above command create three Arrow files, one for each of our samples. An Arrow file is a hdf5 formatted file that is stored on harddrive. During the creataion of the Arrow files, ArchR does cell filtering based on the number of unique fragments, the signal-to-background ratio which is the ratio between the average accessibility within a 50-bp window around TSS as signal and the average accessibility flanking (+/- 1900-2000 bp) region. It also produce an assessment of the fragment size distribution to check the expected nucleosome periodicity.

Let’s take a look at the TSS enrichment results and the fragment size distribution.

samples <- names(inputs)
print(paste0("for sample: ", samples[1]))
knitr::include_graphics(file.path("QualityControl", samples[1], paste0(samples[1], "-Fragment_Size_Distribution.pdf")), file.path("QualityControl", samples[1], paste0(samples[1], "-TSS_by_Unique_Frag.pdf")), dpi=300)

print(paste0("for sample: ", samples[2]))
knitr::include_graphics(file.path("QualityControl", samples[1], paste0(samples[1], "-Fragment_Size_Distribution.pdf")), file.path("QualityControl", samples[1], paste0(samples[1], "-TSS_by_Unique_Frag.pdf")), dpi=300)


print(paste0("for sample: ", samples[3]))
knitr::include_graphics(file.path("QualityControl", samples[1], paste0(samples[1], "-Fragment_Size_Distribution.pdf")), file.path("QualityControl", samples[1], paste0(samples[1], "-TSS_by_Unique_Frag.pdf")), dpi=300)

Doublet identification

in silico synthetic doublets are created by randomly selecting any two cells thousands of times and projects these doublets to the UMAP of our data. Then nearest neighbor to these synthetic doublets are identified to be potential doublets.

double.score <- addDoubletScores(input = ArrowFiles,
                                 k = 10, # the number of nn to each synthetic doublet
                                 knnMethod="UMAP", # the embedding to use for knn search
                                 LSIMethod = 1, # TF-IDF normalization: 1. tf-logidf 2. log(tf-idf) 3. logtf-logidf
                                 verbose = T)

Create an ArchRProject

projatac <- ArchRProject(ArrowFiles = ArrowFiles, outputDirectory = "./scATAC_example", copyArrows = T)
projatac
## class: ArchRProject 
## outputDirectory: /Users/jli/Jessie/Research/BioInfo/tmp/scATAC/scATAC_example 
## samples(3): A001-C-007 A001-C-104 B001-A-301
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
##   BlacklistRatio
## numberOfCells(1): 16508
## medianTSS(1): 7.71
## medianFrags(1): 9557

Let’s plot some summary plots

## plot fragment size distribution
plotFragmentSizes(ArchRProj = projatac)

## plot TSS enrichment
p1 <- plotGroups(
    ArchRProj = projatac, 
    groupBy = "Sample", 
    colorBy = "cellColData", 
    name = "TSSEnrichment",
    plotAs = "ridges"
   )
p1

p2 <- plotGroups(
    ArchRProj = projatac, 
    groupBy = "Sample", 
    colorBy = "cellColData", 
    name = "TSSEnrichment",
    plotAs = "violin",
    alpha = 0.4,
    addBoxPlot = TRUE
   )
p2

plotTSSEnrichment(ArchRProj = projatac)

filter doublets

projatac.filtered <- filterDoublets(projatac)

Normalization and dimensionality reduction

ArchR normalization and dimensionality reduction

?addIterativeLSI
projatac.filtered <- addIterativeLSI(ArchRProj = projatac.filtered,
                                     useMatrix = "TileMatrix",
                                     name = "IterativeLSI",
                                     iterations = 2,
                                     clusterParams = list(resolution = c(0.2),
                                                          sampleCells = 5000,
                                                          n.start = 10),
                                     varFeatures = 25000,
                                     dimsToUse = 1:30)

Clustering

ArchR uses Seurat’s FindCluster() function to perform clustering.

projatac.filtered <- addClusters(input = projatac.filtered,
                                 reduceDims = "IterativeLSI",
                                 method = "Seurat",
                                 name = "Clusters",
                                 resolution = 0.8)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 15821
## Number of edges: 543190
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8951
## Number of communities: 19
## Elapsed time: 1 seconds

Summary plot on the distribution of clusters

cMatrix <- confusionMatrix(paste0(projatac.filtered$Clusters), paste0(projatac.filtered$Sample))
cMatrix <- cMatrix / Matrix::rowSums(cMatrix)
pheatmap::pheatmap(mat = cMatrix, color = paletteContinuous("blueYellow"), border_color = "black")

Visualization of cells with respect to clusters

In order to visualize the cells in 2D, we have to project the high dimensional data to 2D.

projatac.filtered <- addUMAP(ArchRProj = projatac.filtered, reducedDims = "IterativeLSI",
                             name = "UMAP", 
                             nNeighbors = 30, # number of nn to compute a UMAP
                             minDist = 0.5, # how tightly UMAP can pack points
                             metric = "cosine") # which distance to use, cosine for this type of data

## embedding slot is populated after the above step, please explore it
#str(projatac.filtered@embedding)

## plot the UMAP
plotEmbedding(ArchRProj = projatac.filtered, colorBy = "cellColData", name = "Sample", embedding = "UMAP")

## plot the UMAP, color by Clusters that were identified earlier
plotEmbedding(ArchRProj = projatac.filtered, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")

ArchR provides tSNE plot feature as well, which is done using the function addTSNE(). Please explore the function and plot the data using tSNE.

Gene Scores

ArchR uses a model to calculate gene activity score:

  1. Accessibility within the gene body contributes to the gene score
  2. An exponential weighting function that accounts for the activity of putative distal regulatory elements in a distance-dependent manner (e(-abs(distance)/5000) + e-1))
  3. Imposed gene boundaries that minimizes the contribution of unrelated regulatory elements to the gene scores (100kb around the gene and not overlapping other genes)
  4. Applies weights to adjust for gene size effect

Gene Scores can be created at the step of creating the ArrowFiles. And it can be added by using the function addGeneScoreMatrix().

?addGeneScoreMatrix

Using these Gene Scores, one can identify marker genes. For a cell cluster, ArchR creates a group of background cells by selecting the nearest neighbor cells across the multi-dimensional space that do not belong to the cell cluster. This enables a robust determination of significance even when the group of cells is small.

GS.markers <- getMarkerFeatures(ArchRProj = projatac.filtered, useMatrix = "GeneScoreMatrix",
                                groupBy = "Clusters", bias = c("TSSEnrichment", "log10(nFrags)"),
                                testMethod = "wilcoxon")

## To get the markers
markers <- getMarkers(GS.markers, cutOff = "FDR <= 0.01 & Log2FC >= 1.25")

## We can visualize all of the markers in a heatmap and label some of genes of interest

marker.genes <- sample(unique(unlist(lapply(1:length(markers), function(x){markers[[names(markers)[x]]]$name[1:5]}))), 15)

heatmap.geneMK <- markerHeatmap(seMarker = GS.markers, cutOff = "FDR <= 0.01 & Log2FC >= 1.25",
                                labelMarkers = marker.genes, transpose = T)
##  [1] "LINC02287"    "MIR619"       "NKX2-3"       "COL6A1"       "MIR8071-1"   
##  [6] "SCIMP"        "ST6GAL1"      "FER1L6"       "LOC100130950" "GATA5"       
## [11] "NOVA2"        "HIF3A"        "CEMIP"        "RGS6"         "SLC26A3"
ComplexHeatmap::draw(heatmap.geneMK, heatmap_legend_side = "bot",
                     annotation_legend_side = "bot")

We can overlay the marker genes on UMAP embedding.

p <- plotEmbedding(ArchRProj = projatac.filtered, colorBy = "GeneScoreMatrix",
              name = marker.genes, embedding = "UMAP", quantCut = c(0.01, 0.95), imputeWeights = NULL)
p
## $LINC02287

## 
## $MIR619

## 
## $`NKX2-3`

## 
## $COL6A1

## 
## $`MIR8071-1`

## 
## $SCIMP

## 
## $ST6GAL1

## 
## $FER1L6

## 
## $LOC100130950

## 
## $GATA5

## 
## $NOVA2

## 
## $HIF3A

## 
## $CEMIP

## 
## $RGS6

## 
## $SLC26A3

## some fancy plotting of the same set of genes
p1 <- lapply(p, function(x){
  x + guides(color = F, fill = F) + theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0,0,0,0), "cm")) +
    theme(axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())
})

do.call(cowplot::plot_grid, c(list(ncol=3), p1))

We can visualize the chromatin accessibilities around these marker genes.

p2 <- plotBrowserTrack(ArchRProj = projatac.filtered, groupBy = "Clusters", geneSymbol = marker.genes,
                       upstream = 50000, downstream = 50000)
## GRanges object with 15 ranges and 2 metadata columns:
##        seqnames              ranges strand |     gene_id      symbol
##           <Rle>           <IRanges>  <Rle> | <character> <character>
##    [1]    chr14   92905697-92907482      - |   101929002   LINC02287
##    [2]    chr12 108836908-108837006      - |      693204      MIR619
##    [3]    chr10   99532933-99536524      + |      159296      NKX2-3
##    [4]    chr21   45981737-46005050      + |        1291      COL6A1
##    [5]    chr14 105621116-105640232      + |   102465871   MIR8071-1
##    ...      ...                 ...    ... .         ...         ...
##   [11]    chr19   45933734-45973546      - |        4858       NOVA2
##   [12]    chr19   46297046-46343433      + |       64344       HIF3A
##   [13]    chr15   80779343-80951776      + |       57214       CEMIP
##   [14]    chr14   71932439-72566529      + |        9628        RGS6
##   [15]     chr7 107765467-107803225      - |        1811     SLC26A3
##   -------
##   seqinfo: 24 sequences from hg38 genome
#grid::grid.newpage()
#grid.draw(p2$IKZF1)
plotPDF(plotList = p2, name = "marker-gene-tracks.pdf", ArchRProj = projatac.filtered,
        addDOC = F, width = 5, height = 5)
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL

Integrate with scRNASeq data

The gene scores calculated enables the integration between scATACSeq data and scRNASeq data. ArchR compares the gene scores matrix of the scATACSeq dataset to the gene expression matrix of the scRNASeq dataset and tries to align the cells.

## load scRNASeq data
load("celltyped_seurat_object.RData")

## Integrate
projatac.filtered <- addGeneIntegrationMatrix(ArchRProj = projatac.filtered,
                                              useMatrix = "GeneScoreMatrix",
                                              matrixName = "GeneIntegrationMatrix",
                                              reducedDims = "IterativeLSI",
                                              seRNA = experiment.merged,
                                              addToArrow = F,
                                              groupRNA = "CellType",
                                              nameCell = "predictedCell",
                                              nameGroup = "predictedGroup",
                                              nameScore = "predictedScore")

## After this first round of integration, we can check how the clusters of scATAC data matches to the CellType of scRNA data.

integration.cMatrix <- as.matrix(confusionMatrix(projatac.filtered$Clusters, projatac.filtered$predictedGroup))
preClust <- colnames(integration.cMatrix)[apply(integration.cMatrix, 1, which.max)]
cbind(preClust, rownames(integration.cMatrix))
##       preClust                
##  [1,] "Neurons"          "C15"
##  [2,] "Podocytes"        "C12"
##  [3,] "Podocytes"        "C18"
##  [4,] "Goblet cells"     "C11"
##  [5,] "Podocytes"        "C2" 
##  [6,] "Plasma cells"     "C3" 
##  [7,] "Podocytes"        "C16"
##  [8,] "Podocytes"        "C17"
##  [9,] "Macrophages"      "C6" 
## [10,] "Goblet cells"     "C9" 
## [11,] "Epithelial cells" "C13"
## [12,] "T memory cells"   "C4" 
## [13,] "Goblet cells"     "C14"
## [14,] "Epithelial cells" "C19"
## [15,] "Enterocytes"      "C10"
## [16,] "B cells memory"   "C7" 
## [17,] "Epithelial cells" "C1" 
## [18,] "T memory cells"   "C5" 
## [19,] "Podocytes"        "C8"
## CellType from scRNA
cEpi <- "Epithelial cells"
matched.types <- unique(projatac.filtered$predictedGroup)
cNon.Epi <- paste0(matched.types[!(matched.types %in% cEpi)], collapse="|")

## Assigned scATAC
assign.Epi <- rownames(integration.cMatrix)[grep(cEpi, preClust)]
assign.Non.Epi <- rownames(integration.cMatrix)[grep(cNon.Epi, preClust)]

## Identify cells from scRNA to each categories
rna.Epi <- colnames(experiment.merged)[grep(cEpi, experiment.merged$CellType)]
rna.Non.Epi <- colnames(experiment.merged)[grep(cNon.Epi, experiment.merged$CellType)]

## Create a nested list to match the celltypes from scRNA to scATAC
groupList <- SimpleList(
  Epithelial = SimpleList(
    ATAC = projatac.filtered$cellNames[projatac.filtered$Clusters %in% assign.Epi],
    RNA = rna.Epi
  ),
  Non.Epithelial = SimpleList(
    ATAC = projatac.filtered$cellNames[projatac.filtered$Clusters %in% assign.Non.Epi],
    RNA = rna.Non.Epi
  )
)

## We pass this list to the `groupList` parameter of the addGeneIntegrationMatrix` function

projatac.filtered <- addGeneIntegrationMatrix(
  ArchRProj = projatac.filtered,
  useMatrix = "GeneScoreMatrix",
  matrixName = "GeneIntegrationMatrix",
  reducedDims = "IterativeLSI",
  seRNA = experiment.merged,
  addToArrow = FALSE,
  groupList = groupList,
  groupRNA = "CellType",
  nameCell = "predictedCell_Constrain",
  nameGroup = "predictedGroup_Constrain",
  nameScore = "predictedScore_Constrain"
)

Comparing the unconstrained and constrained integrations

pal <- paletteDiscrete(values = experiment.merged$CellType)
pal
##  Epithelial cells      Goblet cells           Neurons         Podocytes 
##         "#D51F26"         "#272E6A"         "#208A42"         "#89288F" 
##       Enterocytes    T memory cells       Macrophages      Plasma cells 
##         "#F47D2B"         "#FEE500"         "#8A9FD1"         "#C06CAB" 
## Endothelial cells    B cells memory        Tuft cells 
##         "#D8A767"         "#90D5E4"         "#89C75F"
p3 <- plotEmbedding(
  projatac.filtered,
  colorBy = "cellColData",
  name = "predictedGroup",
  pal = pal
)

p4 <- plotEmbedding(
  projatac.filtered,
  colorBy = "cellColData",
  name = "predictedGroup_Constrain",
  pal = pal
)

plotPDF(p3, p4, name="plot-UMAP-RNA-Integration.pdf", ArchRProj=projatac.filtered, addDOC=F, width=5, height=5)
saveArchRProject(ArchRProj = projatac.filtered, outputDirectory="scATAC_example", load=F)

Peak calling

ArchR use an iterative overlap scheme to form peaks. It also uses fixed width peaks. Variable width peaks makes it difficult to merge peak calls from multiple samples and makes the analysis more complicated because of the need to normalize to peak width.

Based on the nature of scATACSeq data, pseudo bulk data is generated by sampling cells. ArchR uses a complex decision tree to perform the pseudo bulk replicates generation.

generating pseudo bulk replicates

#loadArchRProject(path="./scATAC_peak")
#readRDS("scATAC_peak/Save-ArchR-Project.rds")
projatac.peak <- addGroupCoverages(ArchRProj = projatac.filtered, groupBy = "predictedGroup_Constrain")

#saveArchRProject(ArchRProj = projatac.peak, outputDirectory="scATAC_peak", load=T)
#saveArchRProject(ArchRProj = projatac.filtered, outputDirectory="scATAC_filtered", load=T)



#projatac.peak <- addReproduciblePeakSet(ArchRProj = projatac.peak,
#                                        groupBy = "predictedGroup_Constrain",
#                                        pathToMacs2 = path2macs2)

#projatac.peak <- addReproduciblePeakSet(ArchRProj = projatac.peak,
#                                        groupBy = "predictedGroup_Constrain",
#                                        peakMethod = "Tiles", method = "p")

peaks <- import("peaks.bed")
projatac.peak <- addPeakSet(ArchRProj = projatac.peak, peakSet = peaks, force=T)
projatac.peak <- addPeakMatrix(ArchRProj = projatac.peak)


## marker peaks
markersPeaks <- getMarkerFeatures(ArchRProj = projatac.peak,
                                  useMatrix = "PeakMatrix",
                                  groupBy = "predictedGroup_Constrain",
                                  bias = c("TSSEnrichment", "log10(nFrags)"),
                                  testMethod = "wilcoxon")

peak.markerList <- getMarkers(markersPeaks, cutOff = "FDR <= 0.01 & Log2FC >= 1")

heatmapPeaks <- markerHeatmap(
  seMarker = markersPeaks, 
  cutOff = "FDR <= 0.1 & Log2FC >= 0.5",
  transpose = TRUE
)
##   [1] "chr1:9787-10683"           "chr1:633519-634529"       
##   [3] "chr1:966225-966581"        "chr1:966683-967261"       
##   [5] "chr1:1305331-1306107"      "chr1:1330281-1330727"     
##   [7] "chr1:1330846-1331548"      "chr1:1692400-1693364"     
##   [9] "chr1:1737593-1738494"      "chr1:1778366-1779288"     
##  [11] "chr1:1890547-1891616"      "chr1:2493975-2494844"     
##  [13] "chr1:2547968-2548436"      "chr1:2585086-2585770"     
##  [15] "chr1:2821689-2822574"      "chr1:629499-630392"       
##  [17] "chr1:932841-933733"        "chr1:960308-961243"       
##  [19] "chr1:1004840-1005587"      "chr1:1006196-1007123"     
##  [21] "chr1:1013020-1013911"      "chr1:1024377-1025245"     
##  [23] "chr1:1032738-1033695"      "chr1:1059205-1060102"     
##  [25] "chr1:1068623-1069635"      "chr1:1073659-1074596"     
##  [27] "chr1:1243945-1244780"      "chr1:1272219-1273060"     
##  [29] "chr1:1281526-1282426"      "chr1:1317986-1318855"     
##  [31] "chr1:1337464-1338299"      "chr1:1344535-1345428"     
##  [33] "chr1:1666322-1667009"      "chr1:1723978-1724891"     
##  [35] "chr1:1725971-1726769"      "chr1:1881533-1882404"     
##  [37] "chr1:1896962-1897811"      "chr1:2048000-2048854"     
##  [39] "chr1:2049852-2050885"      "chr1:1437766-1438661"     
##  [41] "chr1:2212418-2213308"      "chr1:2341651-2342570"     
##  [43] "chr1:3221906-3222819"      "chr1:5927601-5928449"     
##  [45] "chr1:6204839-6205572"      "chr1:6359442-6360337"     
##  [47] "chr1:6472015-6472419"      "chr1:6706381-6707255"     
##  [49] "chr1:6730567-6731441"      "chr1:6750274-6751148"     
##  [51] "chr1:6899381-6900286"      "chr1:6916802-6917692"     
##  [53] "chr1:6962936-6963476"      "chr1:1125039-1125659"     
##  [55] "chr1:1126871-1127664"      "chr1:1145021-1145782"     
##  [57] "chr1:1157128-1158692"      "chr1:1162345-1163249"     
##  [59] "chr1:1167026-1167910"      "chr1:1304460-1305198"     
##  [61] "chr1:1364083-1364899"      "chr1:1721030-1721966"     
##  [63] "chr1:1728362-1729206"      "chr1:1857013-1857825"     
##  [65] "chr1:1858510-1859269"      "chr1:1915594-1916485"     
##  [67] "chr1:1143748-1144665"      "chr1:1230418-1231154"     
##  [69] "chr1:2290376-2291307"      "chr1:3890735-3891857"     
##  [71] "chr1:6056167-6056997"      "chr1:6127476-6128367"     
##  [73] "chr1:8126905-8127743"      "chr1:8211469-8212414"     
##  [75] "chr1:9179922-9180607"      "chr1:9239515-9240344"     
##  [77] "chr1:9730322-9731172"      "chr1:10414902-10415800"   
##  [79] "chr1:10530043-10530966"    "chr1:10677100-10678010"   
##  [81] "chr1:1217676-1218574"      "chr1:1658691-1659629"     
##  [83] "chr1:2250383-2251294"      "chr1:2632758-2633605"     
##  [85] "chr1:3249845-3250758"      "chr1:3292950-3293839"     
##  [87] "chr1:3320522-3321030"      "chr1:3903038-3903940"     
##  [89] "chr1:3921932-3922844"      "chr1:3942547-3943419"     
##  [91] "chr1:3948005-3948991"      "chr1:3971686-3972679"     
##  [93] "chr1:4083591-4084465"      "chr1:4267227-4268126"     
##  [95] "chr1:5367937-5368821"      "chr1:7401581-7402460"     
##  [97] "chr1:12539720-12540586"    "chr1:14258900-14259819"   
##  [99] "chr1:30823990-30824852"    "chr1:41109372-41110253"   
## [101] "chr1:62720732-62721626"    "chr1:79192946-79194003"   
## [103] "chr1:95742465-95743388"    "chr1:172381770-172382634" 
## [105] "chr1:223747799-223748651"  "chr10:6467239-6468118"    
## [107] "chr10:46910904-46911804"   "chr10:58810635-58811631"  
## [109] "chr10:75401677-75402543"   "chr10:111076576-111077105"
## [111] "chr1:1207878-1208719"      "chr1:1219068-1219971"     
## [113] "chr1:1250615-1251391"      "chr1:1615064-1615996"     
## [115] "chr1:2231570-2232313"      "chr1:2236919-2237786"     
## [117] "chr1:2412214-2413128"      "chr1:2415717-2416354"     
## [119] "chr1:2840075-2840946"      "chr1:3530914-3531848"     
## [121] "chr1:5667040-5667833"      "chr1:5955076-5956099"     
## [123] "chr1:6273122-6274039"
#draw(heatmapPeaks, heatmap_legend_side = "bot", annotation_legend_side = "bot")
plotPDF(heatmapPeaks, name = "Peak-Marker-Heatmap", width=8, height=6, ArchRProj=projatac.peak, addDOC=F)

peak.ma <- markerPlot(seMarker = markersPeaks, name = "Macrophages", cutOff="FDR <= 0.01 & Log2FC >= 1")
plotPDF(peak.ma, name = "Peak-MA", width=8, height=6, ArchRProj=projatac.peak, addDOC=F)
#saveArchRProject(ArchRProj = projatac.peak, outputDirectory="scATAC_peak", load=T)

peak.vc <- markerPlot(seMarker=markersPeaks, name = "Macrophages", cutOff="FDR <= 0.01 & Log2FC >= 1", plotAs="Volcano")
plotPDF(peak.vc, name="Peak-Volcano", width=8, height=6, ArchRProj=projatac.peak, addDOC=F)

peak.tracks <- plotBrowserTrack(
  ArchRProj = projatac.peak,
  groupBy = "predictedGroup_Constrain",
  geneSymbol = c("BMP4"),
  features = getMarkers(markersPeaks, cutOff="FDR <= 0.01 & Log2FC >= 1", returnGR = TRUE)["Macrophages"],
  upstream = 50000, downstream = 50000
)
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     gene_id      symbol
##          <Rle>         <IRanges>  <Rle> | <character> <character>
##   [1]    chr14 53949736-53958761      - |         652        BMP4
##   -------
##   seqinfo: 24 sequences from hg38 genome
plotPDF(peak.tracks, name="BMP4.markerpeak.track", width=5, height=5, ArchRProj=projatac.peak, addDOC=F)
## NULL

Peak testing between groups

T.vs.Macro <- getMarkerFeatures(
  ArchRProj = projatac.peak,
  useMatrix = "PeakMatrix",
  groupBy = "predictedGroup_Constrain",
  testMethod = "wilcoxon",
  bias = c("TSSEnrichment", "log10(nFrags)"),
  useGroups = "T memory cells",
  bgdGroups = "Macrophages"
)

Motif analysis

## motif enrichment using Catalog of Inferred Sequence Binding Preferences database
projatac.peak <- addMotifAnnotations(ArchRProj=projatac.peak, motifSet="cisbp", name="Motif")

# Up regulated TF motif
motifsUp <- peakAnnoEnrichment(
  seMarker = T.vs.Macro,
  ArchRProj = projatac.peak,
  peakAnnotation = "Motif",
  cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
)

# gather data for plotting
df <- data.frame(TF = rownames(motifsUp), mlog10Padj = assay(motifsUp)[,1])
df <- df[order(df$mlog10Padj, decreasing=TRUE),]
df$rank <- seq_len(nrow(df))

ggUp <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) + 
  geom_point(size = 1) +
  ggrepel::geom_label_repel(
        data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF), 
        size = 1.5,
        nudge_x = 2,
        color = "black"
  ) + theme_ArchR() + 
  ylab("-log10(P-adj) Motif Enrichment") + 
  xlab("Rank Sorted TFs Enriched") +
  scale_color_gradientn(colors = paletteContinuous(set = "comet"))
plotPDF(ggUp, name="Up.Motif-enrichment", width=5, height=5, addDOC=F)

# Down regulated TF motif
motifsDn <- peakAnnoEnrichment(
    seMarker = T.vs.Macro,
    ArchRProj = projatac.peak,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC <= -0.5"
  )

df <- data.frame(TF = rownames(motifsDn), mlog10Padj = assay(motifsDn)[,1])
df <- df[order(df$mlog10Padj, decreasing = TRUE),]
df$rank <- seq_len(nrow(df))

ggDn <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) + 
  geom_point(size = 1) +
  ggrepel::geom_label_repel(
        data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF), 
        size = 1.5,
        nudge_x = 2,
        color = "black"
  ) + theme_ArchR() + 
  ylab("-log10(FDR) Motif Enrichment") +
  xlab("Rank Sorted TFs Enriched") +
  scale_color_gradientn(colors = paletteContinuous(set = "comet"))

plotPDF(ggDn, name="Down.Motif-enrichment", width=5, height=5, addDOC=F)


## Motifs enriched in marker peaks

motif.marker.Up <- peakAnnoEnrichment(
  seMarker = markersPeaks,
  ArchRProj = projatac.peak,
  peakAnnotation = "Motif",
  cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
)

heatmap.markerEN <- plotEnrichHeatmap(motif.marker.Up, n = 5, transpose = TRUE)
plotPDF(heatmap.markerEN, name="MarkerPeak.Motif.Up", width=5, height=5, addDOC=F)

## Other enrichment

?addArchRAnnotations

Session Information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] motifmatchr_1.16.0                ggrepel_0.9.1                    
##  [3] hexbin_1.28.2                     circlize_0.4.13                  
##  [5] ComplexHeatmap_2.10.0             presto_1.0.0                     
##  [7] uwot_0.1.11                       ggridges_0.5.3                   
##  [9] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.62.0                  
## [11] Biostrings_2.62.0                 XVector_0.34.0                   
## [13] nabor_0.5.0                       SeuratObject_4.0.4               
## [15] Seurat_4.1.0                      rhdf5_2.38.1                     
## [17] SummarizedExperiment_1.24.0       Biobase_2.54.0                   
## [19] MatrixGenerics_1.6.0              Rcpp_1.0.8.3                     
## [21] Matrix_1.3-3                      matrixStats_0.61.0               
## [23] data.table_1.14.2                 stringr_1.4.0                    
## [25] plyr_1.8.6                        magrittr_2.0.2                   
## [27] ggplot2_3.3.5                     gtable_0.3.0                     
## [29] gtools_3.9.2                      gridExtra_2.3                    
## [31] ArchR_1.0.2                       chromVARmotifs_0.2.0             
## [33] rtracklayer_1.54.0                GenomicRanges_1.46.1             
## [35] GenomeInfoDb_1.30.0               IRanges_2.28.0                   
## [37] S4Vectors_0.32.3                  BiocGenerics_0.40.0              
## [39] pheatmap_1.0.12                   knitr_1.37                       
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  reticulate_1.24            
##   [3] R.utils_2.11.0              tidyselect_1.1.2           
##   [5] poweRlaw_0.70.6             RSQLite_2.2.9              
##   [7] AnnotationDbi_1.56.2        htmlwidgets_1.5.4          
##   [9] BiocParallel_1.28.3         Rtsne_0.15                 
##  [11] munsell_0.5.0               codetools_0.2-18           
##  [13] ica_1.0-2                   future_1.23.0              
##  [15] miniUI_0.1.1.1              withr_2.4.3                
##  [17] colorspace_2.0-2            highr_0.9                  
##  [19] rstudioapi_0.13             ROCR_1.0-11                
##  [21] tensor_1.5                  listenv_0.8.0              
##  [23] labeling_0.4.2              GenomeInfoDbData_1.2.7     
##  [25] polyclip_1.10-0             farver_2.1.0               
##  [27] bit64_4.0.5                 parallelly_1.30.0          
##  [29] vctrs_0.3.8                 generics_0.1.2             
##  [31] xfun_0.29                   doParallel_1.0.16          
##  [33] R6_2.5.1                    clue_0.3-60                
##  [35] bitops_1.0-7                rhdf5filters_1.6.0         
##  [37] spatstat.utils_2.3-0        cachem_1.0.6               
##  [39] DelayedArray_0.20.0         assertthat_0.2.1           
##  [41] promises_1.2.0.1            BiocIO_1.4.0               
##  [43] scales_1.1.1                Cairo_1.6-0                
##  [45] globals_0.14.0              goftest_1.2-3              
##  [47] seqLogo_1.60.0              rlang_1.0.2                
##  [49] GlobalOptions_0.1.2         splines_4.1.0              
##  [51] lazyeval_0.2.2              spatstat.geom_2.3-1        
##  [53] yaml_2.3.5                  reshape2_1.4.4             
##  [55] abind_1.4-5                 httpuv_1.6.5               
##  [57] tools_4.1.0                 ellipsis_0.3.2             
##  [59] spatstat.core_2.3-2         jquerylib_0.1.4            
##  [61] RColorBrewer_1.1-2          zlibbioc_1.40.0            
##  [63] purrr_0.3.4                 RCurl_1.98-1.5             
##  [65] rpart_4.1-15                deldir_1.0-6               
##  [67] GetoptLong_1.0.5            pbapply_1.5-0              
##  [69] cowplot_1.1.1               zoo_1.8-9                  
##  [71] cluster_2.1.2               RSpectra_0.16-0            
##  [73] scattermore_0.7             lmtest_0.9-39              
##  [75] RANN_2.6.1                  fitdistrplus_1.1-6         
##  [77] hms_1.1.1                   patchwork_1.1.1            
##  [79] mime_0.12                   evaluate_0.14              
##  [81] xtable_1.8-4                XML_3.99-0.8               
##  [83] shape_1.4.6                 compiler_4.1.0             
##  [85] tibble_3.1.6                KernSmooth_2.23-20         
##  [87] crayon_1.5.0                R.oo_1.24.0                
##  [89] htmltools_0.5.2             mgcv_1.8-35                
##  [91] later_1.3.0                 tzdb_0.2.0                 
##  [93] TFBSTools_1.32.0            tidyr_1.2.0                
##  [95] DBI_1.1.2                   MASS_7.3-54                
##  [97] readr_2.1.1                 cli_3.2.0                  
##  [99] R.methodsS3_1.8.1           parallel_4.1.0             
## [101] igraph_1.2.11               pkgconfig_2.0.3            
## [103] GenomicAlignments_1.30.0    TFMPvalue_0.0.8            
## [105] plotly_4.10.0               spatstat.sparse_2.1-0      
## [107] foreach_1.5.1               annotate_1.72.0            
## [109] bslib_0.3.1                 DirichletMultinomial_1.36.0
## [111] digest_0.6.29               sctransform_0.3.3          
## [113] RcppAnnoy_0.0.19            pracma_2.3.8               
## [115] CNEr_1.30.0                 spatstat.data_2.1-2        
## [117] rmarkdown_2.11              leiden_0.3.9               
## [119] restfulr_0.0.13             shiny_1.7.1                
## [121] Rsamtools_2.10.0            rjson_0.2.21               
## [123] nlme_3.1-152                lifecycle_1.0.1            
## [125] jsonlite_1.8.0              Rhdf5lib_1.16.0            
## [127] viridisLite_0.4.0           fansi_1.0.2                
## [129] pillar_1.7.0                lattice_0.20-44            
## [131] KEGGREST_1.34.0             fastmap_1.1.0              
## [133] httr_1.4.2                  survival_3.2-11            
## [135] GO.db_3.14.0                glue_1.6.2                 
## [137] iterators_1.0.13            png_0.1-7                  
## [139] bit_4.0.4                   stringi_1.7.6              
## [141] sass_0.4.0                  blob_1.2.2                 
## [143] caTools_1.18.2              memoise_2.0.1              
## [145] dplyr_1.0.8                 irlba_2.3.5                
## [147] future.apply_1.8.1