Last updated: 2022-06-24
Checks: 6 1
Knit directory:
apocrine_signature_mdamb453/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220427)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ee8cf46. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/bigwig/
Unstaged changes:
Modified: analysis/motif_analysis.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/motif_analysis.Rmd
) and
HTML (docs/motif_analysis.html
) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote
), click on
the hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ee8cf46 | Steve Pederson | 2022-06-23 | Added Richard’s curated motifs |
html | ee8cf46 | Steve Pederson | 2022-06-23 | Added Richard’s curated motifs |
Rmd | 1f959c3 | Steve Pederson | 2022-06-08 | Finished checking motif distances |
html | 1f959c3 | Steve Pederson | 2022-06-08 | Finished checking motif distances |
Rmd | b882cb8 | Steve Pederson | 2022-06-07 | Ranges are no longer centred |
html | b882cb8 | Steve Pederson | 2022-06-07 | Ranges are no longer centred |
Rmd | 0c12740 | Steve Pederson | 2022-06-06 | Finished regression analysis for motifs |
html | 0c12740 | Steve Pederson | 2022-06-06 | Finished regression analysis for motifs |
html | e893c08 | Steve Pederson | 2022-06-06 | Build page |
Rmd | 8dde309 | Steve Pederson | 2022-06-06 | Updated motif analysis |
html | 8dde309 | Steve Pederson | 2022-06-06 | Updated motif analysis |
Rmd | e1088ac | Steve Pederson | 2022-06-03 | revised comparison |
Rmd | 16615bf | Steve Pederson | 2022-05-28 | Finished motif analysis |
html | 16615bf | Steve Pederson | 2022-05-28 | Finished motif analysis |
Rmd | 1bf46f7 | Steve Pederson | 2022-05-27 | Added motifs distance plots |
html | 1bf46f7 | Steve Pederson | 2022-05-27 | Added motifs distance plots |
Rmd | 7cfa8e7 | Steve Pederson | 2022-05-26 | Added TOMTOM |
html | 7cfa8e7 | Steve Pederson | 2022-05-26 | Added TOMTOM |
Rmd | 8836090 | Steve Pederson | 2022-05-26 | Added motif centrality |
html | 8836090 | Steve Pederson | 2022-05-26 | Added motif centrality |
This step takes the peaks defined previously and
As AR is only found bound in the presence of DHT, all peaks are taken using DHT-treatment as the reference condition.
library(memes)
library(universalmotif)
library(tidyverse)
library(plyranges)
library(extraChIPs)
library(rtracklayer)
library(magrittr)
library(BSgenome.Hsapiens.UCSC.hg19)
library(scales)
library(pander)
library(glue)
library(cowplot)
library(parallel)
library(gt)
library(parallel)
library(readxl)
library(UpSetR)
library(reactable)
library(htmltools)
library(multcomp)
options(meme_bin = "/opt/meme/bin/")
check_meme_install()
theme_set(theme_bw())
panderOptions("big.mark", ",")
n_cores <- 12
getBestMatch <- function(x, subject, min.score = "80%", rc = TRUE, ...) {
pwm <- NULL
if (is.matrix(x)) pwm <- x
if (is(x, "universalmotif")) pwm <- x@motif
stopifnot(!is.null(pwm))
stopifnot(is(subject, "XStringViews"))
matches <- matchPWM(
pwm, subject, min.score = min.score, with.score = TRUE, ...
)
df <- as.data.frame(matches)
df$score <- mcols(matches)$score
df$strand <- "+"
if (rc) {
rc_matches <- matchPWM(
reverseComplement(pwm), subject ,
min.score = min.score, with.score = TRUE,
...
)
rc_df <- as.data.frame(rc_matches)
rc_df$score <- mcols(rc_matches)$score
rc_df$strand <- "-"
df <- bind_rows(df, rc_df)
}
grp_cols <- c("start", "end", "width", "seq", "score")
grp_df <- group_by(df, !!!syms(grp_cols))
summ_df <- summarise(
grp_df, strand = paste(strand, collapse = " "), .groups= "drop"
)
summ_df$strand <- str_replace_all(summ_df$strand, ". .", "*")
summ_df$strand <- factor(summ_df$strand, levels = c("+", "-", "*"))
summ_df$view <- findInterval(summ_df$start, start(subject))
summ_df <- arrange(summ_df, view, desc(score))
best_df <- distinct(summ_df, view, .keep_all = TRUE)
best_df <- mutate(
best_df,
start_in_view = start - start(subject)[view] + 1,
end_in_view = start_in_view + width
)
## This will only be added if the views have names
best_df$view_name <- names(subject[best_df$view])
best_df
}
bar_style <- function(width = 1, fill = "#e6e6e6", height = "75%", align = c("left", "right"), color = NULL) {
align <- match.arg(align)
if (align == "left") {
position <- paste0(width * 100, "%")
image <- sprintf("linear-gradient(90deg, %1$s %2$s, transparent %2$s)", fill, position)
} else {
position <- paste0(100 - width * 100, "%")
image <- sprintf("linear-gradient(90deg, transparent %1$s, %2$s %1$s)", position, fill)
}
list(
backgroundImage = image,
backgroundSize = paste("100%", height),
backgroundRepeat = "no-repeat",
backgroundPosition = "center",
color = color
)
}
hg19 <- BSgenome.Hsapiens.UCSC.hg19
sq19 <- seqinfo(hg19)
all_gr <- here::here("data", "annotations", "all_gr.rds") %>%
read_rds() %>%
lapply(
function(x) {
seqinfo(x) <- sq19
x
}
)
counts <- here::here("data", "rnaseq", "counts.out.gz") %>%
read_tsv(skip = 1) %>%
dplyr::select(Geneid, ends_with("bam")) %>%
dplyr::rename(gene_id = Geneid)
detected <- all_gr$gene %>%
as_tibble() %>%
distinct(gene_id, gene_name) %>%
left_join(counts) %>%
pivot_longer(
cols = ends_with("bam"), names_to = "sample", values_to = "counts"
) %>%
mutate(detected = counts > 0) %>%
group_by(gene_id, gene_name) %>%
summarise(detected = mean(detected) > 0.25, .groups = "drop") %>%
dplyr::filter(detected) %>%
dplyr::select(starts_with("gene"))
Pre-existing RNA-seq data was again loaded and the set of 21,173 detected genes was defined as those with \(\geq 1\) aligned read in at least 25% of samples. This dataset represented the DHT response in MDA-MB-453 cells, however, given the existence of both Vehicle and DHT conditions, represents a good representative measure of the polyadenylated components of the transcriptome.
features <- here::here("data", "h3k27ac") %>%
list.files(full.names = TRUE, pattern = "bed$") %>%
sapply(import.bed, seqinfo = sq19) %>%
lapply(granges) %>%
setNames(basename(names(.))) %>%
setNames(str_remove_all(names(.), "s.bed")) %>%
GRangesList() %>%
unlist() %>%
names_to_column("feature") %>%
sort()
The set of H3K27ac-defined features was again used as a marker of regulatory activity, with 16,544 and 15,286 regions defined as enhancers or promoters respectively.
uniprot_to_ensg <- here::here("data", "external", "uniprot_to_ensg.txt.gz") %>%
read_tsv(col_names = c("uniprot_id", "protein", "gene_id")) %>%
mutate(
gene_id = lapply(gene_id, str_split, pattern = "; ") %>%
lapply(unlist)
) %>%
unnest(gene_id) %>%
mutate(gene_id = str_replace_all(gene_id, "(ENSG[0-9]+)\\..*", "\\1")) %>%
left_join(
as_tibble(all_gr$gene) %>%
dplyr::select(gene_id, gene_name)
) %>%
dplyr::filter(!is.na(gene_id)) %>%
dplyr::select(-protein)
rime_detected <- here::here("data", "external", "MDA-MB-453_RIME_processed.xlsx") %>%
read_excel() %>%
dplyr::select(-starts_with("AVG"), -ends_with("Coverage2"), -contains("delta")) %>%
pivot_longer(cols = contains("_")) %>%
separate(name, c("sample", "measure"), sep = "_", extra = "merge") %>%
mutate(
IP = str_remove(sample, "[0-9]$"),
replicate = str_extract(sample, "[0-9]$"),
measure = paste(measure, IP, sep = "_")
) %>%
pivot_wider(
names_from = "measure", values_from = "value",
id_cols = c("Accession", "Description", "Protein", "replicate")
) %>%
mutate(
logfc_cov = log2(Coverage_AR + 1) - log2(Coverage_IgG + 1),
logfc_peptides = log2(Unique_peptides_AR + 1) - log2(Unique_peptides_IgG + 1)
) %>%
dplyr::filter(logfc_cov > 0) %>%
group_by(Accession, Description, Protein) %>%
summarise(
n_detected = dplyr::n(),
mn_logfc_coverage = mean(logfc_cov),
mn_coverage_AR = mean(Coverage_AR),
mn_coverage_IgG = mean(Coverage_IgG),
mn_peptides_AR = mean(Unique_peptides_AR),
mn_peptides_IgG = mean(Unique_peptides_IgG),
.groups = "drop"
) %>%
dplyr::filter(n_detected > 1) %>%
arrange(desc(mn_logfc_coverage)) %>%
left_join(uniprot_to_ensg, by = c("Accession" = "uniprot_id")) %>%
dplyr::select(
Accession, Description, Protein, starts_with("gene"), everything()
) %>%
dplyr::filter(!is.na(gene_name))
AR-RIME data for MDA-MB-453 cells was also used, with protein to
transcript ID mappings performed using the mappings obtained using the
following bash
command
curl https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz |\
zcat |\
cut -f1-2,19 |\
gzip -c > data/external/uniprot_to_ensg.txt.gz
For RIME data, the criteria for detection within a sample was simply defined as the coverage being greater when using AR as the IP-target, than when using IgG as the IP-target. Under this approach, 241 proteins were considered as detected in \(>2\) of 3 samples and were retained for inclusion with motif analysis. When further comparing to the RNA-Seq data only KRT9 was found in the RIME data without supporting RNA evidence. Any proteins with no gene_name associated (i.e. only an Ensembl Gene ID) were also excluded.
meme_db <- here::here("data", "external", "JASPAR2022_CORE_Hsapiens.meme") %>%
read_meme() %>%
to_df() %>%
as_tibble() %>%
mutate(
across(everything(), vctrs::vec_proxy),
altname = str_remove(altname, "MA[0-9]+\\.[0-9]*\\."),
organism = "Homo sapiens"
)
meme_altname <- setNames(meme_db$altname, meme_db$name)
meme_db_detected <- meme_db %>%
mutate(
gene_name = str_split(altname, pattern = "::", n = 2)
) %>%
dplyr::filter(
vapply(
gene_name,
function(x) all(x %in% detected$gene_name),
logical(1)
)
)
meme_info <- meme_db_detected %>%
mutate(motif_length = vapply(motif, function(x){ncol(x@motif)}, integer(1))) %>%
dplyr::select(name, altname, icscore, motif_length)
The motif database was initially defined as all 727 CORE motifs found in Homo sapiens, as contained in JASPAR2022. In order to minimise irrelevant results, only the 465 motifs associated with a detected gene, using presence within the RNA-Seq dataset as evidence of detection.
dht_consensus <- read_rds(
here::here("output/dht_consensus.rds")
)
seqinfo(dht_consensus) <- sq19
dht_peaks <- here::here("data", "peaks") %>%
list.files(recursive = TRUE, pattern = "oracle", full.names = TRUE) %>%
sapply(read_rds, simplify = FALSE) %>%
lapply(function(x) x[["DHT"]]) %>%
lapply(setNames, nm = c()) %>%
setNames(str_extract_all(names(.), "AR|FOXA1|GATA3|TFAP2B")) %>%
lapply(
function(x) {
seqinfo(x) <- sq19
subsetByOverlaps(x, dht_consensus)
}
)
targets <- names(dht_peaks)
grl_by_targets <- dht_consensus %>%
as_tibble() %>%
pivot_longer(
all_of(targets), names_to = "target", values_to = "detected"
) %>%
dplyr::filter(detected) %>%
group_by(range, H3K27ac) %>%
summarise(
targets = paste(sort(target), collapse = " "),
.groups = "drop"
) %>%
colToRanges(var = "range", seqinfo = sq19) %>%
sort() %>%
splitAsList(.$targets)
dht_consensus$targets <- bestOverlap(dht_consensus, grl_by_targets)
Union ranges were defined by grouping by the combination of detected AR, FOXA1, GATA3 and TFAP2B binding, then sequences obtained corresponding to each union range.
grl_by_targets %>%
lapply(
function(x) {
tibble(
`N Sites` = length(x),
`% H3K27ac` = mean(x$H3K27ac),
`Median Width` = floor(median(width(x)))
)
}
) %>%
bind_rows(.id = "Targets") %>%
mutate(
`N Targets` = str_count(Targets, " ") + 1,
`% H3K27ac` = percent(`% H3K27ac`, 0.1)
) %>%
dplyr::select(contains("Targets"), everything()) %>%
arrange(`N Targets`, Targets) %>%
pander(
justify = "lrrrr",
caption = "Summary of union ranges and which combinations of the targets were detected."
)
Targets | N Targets | N Sites | % H3K27ac | Median Width |
---|---|---|---|---|
AR | 1 | 344 | 10.5% | 223 |
FOXA1 | 1 | 57,755 | 9.8% | 291 |
GATA3 | 1 | 4,019 | 16.0% | 216 |
TFAP2B | 1 | 6,412 | 74.7% | 278 |
AR FOXA1 | 2 | 3,580 | 11.6% | 454 |
AR GATA3 | 2 | 157 | 12.7% | 281 |
AR TFAP2B | 2 | 134 | 49.3% | 374 |
FOXA1 GATA3 | 2 | 9,906 | 11.9% | 430 |
FOXA1 TFAP2B | 2 | 3,967 | 54.0% | 454 |
GATA3 TFAP2B | 2 | 411 | 53.3% | 324 |
AR FOXA1 GATA3 | 3 | 6,553 | 15.3% | 620 |
AR FOXA1 TFAP2B | 3 | 1,800 | 47.8% | 585 |
AR GATA3 TFAP2B | 3 | 63 | 47.6% | 354 |
FOXA1 GATA3 TFAP2B | 3 | 1,434 | 38.4% | 525 |
AR FOXA1 GATA3 TFAP2B | 4 | 6,321 | 42.3% | 746 |
seq_by_targets <- grl_by_targets %>%
lapply(function(x) split(x, x$H3K27ac)) %>%
lapply(GRangesList) %>%
lapply(get_sequence, genome = hg19)
## All bound sites
rds_ame_by_targets <- here::here("output", "ame_by_targets.rds")
if (!file.exists(rds_ame_by_targets)) {
ame_by_targets <- seq_by_targets %>%
lapply(unlist) %>%
mclapply(
runAme,
database = to_list(meme_db_detected),
mc.cores = min(n_cores, length(.))
)
nulls <- ame_by_targets %>%
vapply(is, logical(1), "data.frame") %>%
not() %>%
which() %>%
names()
if (length(nulls) > 0) {
ame_by_targets[nulls] <- seq_by_targets[nulls] %>%
lapply(unlist) %>%
lapply(
runAme,
database = to_list(meme_db_detected)
)
}
write_rds(ame_by_targets, rds_ame_by_targets, compress = "gz")
} else {
ame_by_targets <- read_rds(rds_ame_by_targets)
}
## Only sites marked as active by H3K27ac
rds_ame_by_targets_active <- here::here("output", "ame_by_targets_active.rds")
if (!file.exists(rds_ame_by_targets_active)) {
ame_by_targets_active <- seq_by_targets %>%
lapply(function(x) x[["TRUE"]]) %>%
mclapply(
runAme,
database = to_list(meme_db_detected),
mc.cores = min(n_cores, length(.))
)
nulls <- ame_by_targets_active %>%
vapply(is, logical(1), "data.frame") %>%
not() %>%
which() %>%
names()
if (length(nulls) > 0) {
ame_by_targets_active[nulls] <- seq_by_targets[nulls] %>%
lapply(function(x) x[["TRUE"]]) %>%
lapply(
runAme,
database = to_list(meme_db_detected)
)
}
write_rds(ame_by_targets_active, rds_ame_by_targets_active, compress = "gz")
} else {
ame_by_targets_active <- read_rds(rds_ame_by_targets_active)
}
## Compare the active and inactive sites for each set of targets
rds_ame_active_vs_inactive <- here::here("output", "ame_active_vs_inactive.rds")
if (!file.exists(rds_ame_active_vs_inactive)) {
ame_active_vs_inactive <- seq_by_targets %>%
mclapply(
function(x) {
runAme(x[["TRUE"]], x[["FALSE"]], database = to_list(meme_db_detected))
},
mc.cores = min(n_cores, length(.))
)
nulls <- ame_active_vs_inactive %>%
vapply(is, logical(1), "data.frame") %>%
not() %>%
which() %>%
names()
if (length(nulls) > 0) {
ame_active_vs_inactive[nulls] <- seq_by_targets[nulls] %>%
lapply(
function(x) {
runAme(x[["TRUE"]], x[["FALSE"]], database = to_list(meme_db_detected))
}
)
}
write_rds(ame_active_vs_inactive, rds_ame_active_vs_inactive, compress = "gz")
} else {
ame_active_vs_inactive <- read_rds(rds_ame_active_vs_inactive)
}
## Compare the inactive and active sites for each set of targets
rds_ame_inactive_vs_active <- here::here("output", "ame_inactive_vs_active.rds")
if (!file.exists(rds_ame_inactive_vs_active)) {
ame_inactive_vs_active <- seq_by_targets %>%
mclapply(
function(x) {
runAme(x[["FALSE"]], x[["TRUE"]], database = to_list(meme_db_detected))
},
mc.cores = min(n_cores, length(.))
)
nulls <- ame_inactive_vs_active %>%
vapply(is, logical(1), "data.frame") %>%
not() %>%
which() %>%
names()
if (length(nulls) > 0) {
ame_inactive_vs_active[nulls] <- seq_by_targets[nulls] %>%
lapply(
function(x) {
runAme(x[["FALSE"]], x[["TRUE"]], database = to_list(meme_db_detected))
}
)
}
write_rds(ame_inactive_vs_active, rds_ame_inactive_vs_active, compress = "gz")
} else {
ame_inactive_vs_active <- read_rds(rds_ame_inactive_vs_active)
}
## Compare promoters to shuffled sequences
seq_all4_prom <- dht_consensus %>%
filter(!!!syms(targets), promoter) %>%
get_sequence(genome = hg19)
ame_all4_promoters <- runAme(seq_all4_prom, database = to_list(meme_db_detected))
## Compare enhancers to shuffled sequences
seq_all4_enh <- dht_consensus %>%
filter(!!!syms(targets), enhancer) %>%
get_sequence(genome = hg19)
ame_all4_enh <- runAme(seq_all4_enh, database = to_list(meme_db_detected))
AME from the meme-suite was then run on each sequence to detect the presence of motifs associated with each target within each set of sequences. Under this approach the optimal score is estimated for considering a PWM match to be a true positive. Matches above this value are classed as true positives, whilst those below are considered as false positives. Significance is determined using using Fisher’s Exact Test against the set of control sequences, which are either shuffled input sequences, or a discrete set provided. Adjusted p-values are determined within each motif based on resampling, with the e-values representing a Bonferroni adjustment of the within-motif adjusted p-values.
Motif enrichment inevitably produces hits to highly-related motifs For unsupervised enrichment, the following strategy was used to aid interpretation
hclust
and a height of
0.9plot_grid(
plot_grid(
plotlist = lapply(
meme_db_detected %>%
dplyr::filter(altname %in% targets) %>%
arrange(altname) %>%
dplyr::filter(altname != "TFAP2B") %>%
to_list() %>%
lapply(
function(x){
x@name <- paste0(
x@name, " (", x@altname, ")"
)
x
}
),
function(x) {
view_motifs(x) +
ggtitle(x@name) +
theme(plot.title = element_text(hjust = 0.5, size = 11))
}
),
ncol = 1
),
plot_grid(
meme_db_detected %>%
dplyr::filter(altname == "TFAP2B") %>%
to_list() %>%
lapply(
function(x){
x@name <- paste0(
x@name, " (", x@altname, ")"
)
x
}
) %>%
view_motifs() +
theme(plot.title = element_text(hjust = 0.5, size = 9))
),
ncol = 2
)
All motifs matching the set of ChIP targets. AR, FOXA1 and GATA3 are not aligned with each other, whilst all TFAP2B motifs are shown using the best alignment to each other.
AME was used to assess enrichment of motifs within the sequences derived from each combination of ChIP targets. Enrichment was considered relative to shuffled sequences as the control set, with AME only returning matches considered to be statistically significant.
ame_by_targets %>%
bind_rows(.id = "targets") %>%
dplyr::filter(motif_alt_id %in% globalenv()$targets) %>%
mutate(
motif_id = as.factor(motif_id),
targets = as.factor(targets),
TFAP2B = ifelse(
str_detect(targets, "TFAP2B"), "TFAP2B", "No TFAP2B"
)
) %>%
plot_ame_heatmap(
id = motif_id,
value = tp_percent,
group = fct_rev(targets)
) +
geom_label(
aes(label = percent(tp_percent/100, 0.1)),
size = 3.5,
show.legend = FALSE
) +
facet_grid(TFAP2B ~ motif_alt_id, scales = "free", space = "free") +
scale_x_discrete(expand = expansion(c(0, 0))) +
scale_y_discrete(expand = expansion(c(0, 0))) +
labs(
x = "Detected Motif",
y = "Detected ChIP Targets",
fill = expression(widehat("TP%"))
) +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(angle = 0, hjust = 1, size = 10),
panel.grid = element_blank()
)
Estimated percentage of true positive matches to all binding motifs associated with each ChIP target within all sequences. Only motifs considered to be enriched are shown, such that missing values are able to be confidently assumed as not enriched for the motif. All provide supporting evidence for direct binding of each target to it’s motif.
Restricting the test for motif enrichment to all combinations of ChIP targets which overlap an active H3K27ac peak, similar patterns were observed. Of possible note is reduced enrichment for TFAP2\(\beta\) motifs amongst the sites where TFAP2\(\beta\) was not observed.
The set of H3K27-associated peaks where AR, GATA3 and TFAP2\(\beta\) were found in combination were notably enriched for one specific TFAP2\(\beta\) motif (MA0812) to the exclusion of others suggesting that FOXA1 independent formation of the regulatory complex may have a distinct mechanism. In conjunction with the increased occurrence of this motif, this set of peaks also contained the lowest enrichment for AR motifs amongst the AR-associated peaks.
Another clear association between the presence of FOXA1 and possible increased enrichment for the MA0811.1 TFAP2\(\beta\) motif may also be of interest.
ame_by_targets_active %>%
bind_rows(.id = "targets") %>%
dplyr::filter(motif_alt_id %in% globalenv()$targets) %>%
mutate(
motif_id = as.factor(motif_id),
targets = as.factor(targets),
TFAP2B = ifelse(
str_detect(targets, "TFAP2B"), "TFAP2B", "No TFAP2B"
)
) %>%
plot_ame_heatmap(
id = motif_id,
value = tp_percent,
group = fct_rev(targets)
) +
geom_label(
aes(label = percent(tp_percent/100, 0.1)),
size = 3.5,
show.legend = FALSE
) +
facet_grid(TFAP2B ~ motif_alt_id, scales = "free", space = "free") +
scale_x_discrete(expand = expansion(c(0, 0))) +
scale_y_discrete(expand = expansion(c(0, 0))) +
labs(
x = "Detected Motif",
y = "Detected ChIP Targets",
fill = expression(widehat("TP%"))
) +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(angle = 0, hjust = 1, size = 10),
panel.grid = element_blank()
)
Estimated percentage of true positive matches to all binding motifs associated with each ChIP target, restricting sequences to those marked by overlapping H3K27ac signal. Only motif considered to be enriched are shown, such that missing values are able to be confidently assumed as not enriched for the motif. All provide supporting evidence for direct binding of each target to it’s motif. Interestingly FOXA1 in combination with TFAP2B appears to favour the motif MA0811.1, whilst AR and GATA3 in combination with TFAP2B appears to favour MA0812.1
Motifs associated with any detected transcription factor besides the ChIP targets were also assessed for enrichment. Motifs which matched GATA[0-9], Forkhead (FOX[A-Z][0-9]) or TFAP2[A-Z] were omitted due to the high-level of motif redundancy and these being likely to represent the ChIP target, instead of an additional protein. Given that an extremely large number of motifs were found, only those found in >50% of sites in at least one group of ChIP targets were retained
The set of motifs associated with detectable genes (using RNA) was tested for enrichment against all sequences from all combinations of ChIP target binding, using shuffled sequences as controls. Given that 376 motifs were considered to be enriched in at least one set of peaks, the plot below only shows those motifs which were found in >50% of sequences in one or more sets of peaks.
The NFIX binding motif was strongly associated with most sets of peaks which were not bound by TFAP2\(\beta\), particularly with FOXA1 in the absence of TFAP2\(\beta\). However, a very strong association with all four bound targets was also observed. Given the short (6nt) nature of this motif, this may be artefactual.
The TRPS1 motif was also found to be strongly enriched where GATA3 was bound, however, this motif bears a very strong resemblance to the GATA3 motif.
meme_db %>%
dplyr::filter(str_detect(altname, "GATA3|TRPS1")) %>%
to_list() %>%
lapply(function(x) {x@name <- glue("{x@name}\n({x@altname})"); x}) %>%
view_motifs(names.pos = "right")
Comparison between GATA3 and TRPS1 motifs
x <- ame_by_targets %>%
bind_rows(.id = "targets") %>%
dplyr::filter(
!motif_alt_id %in% targets,
!str_detect(motif_alt_id, "(GATA[0-9]|FOX[A-Z][0-9]|TFAP2[A-Z])"),
tp_percent > 50
) %>%
distinct(motif_id) %>%
pull("motif_id")
cormat <- meme_db %>%
dplyr::filter(name %in% x) %>%
to_list() %>%
compare_motifs()
cormat[cormat < 0.9] <- 0
cormat[is.na(cormat)] <- 0
cl <- as.dist(1 - cormat) %>%
hclust() %>%
cutree(h= 0.9)
ame_by_targets %>%
bind_rows(.id = "targets") %>%
dplyr::filter(motif_id %in% names(cl)) %>%
mutate(
cluster = cl[motif_id],
motif_id = as.factor(motif_id),
targets = as.factor(targets),
TFAP2B = ifelse(
str_detect(targets, "TFAP2B"), "TFAP2B", "No TFAP2B"
)
) %>%
group_by(cluster) %>%
mutate(
best_tp = max(tp_percent)
) %>%
ungroup() %>%
group_by(motif_id) %>%
dplyr::filter(any(tp_percent == best_tp)) %>%
ungroup() %>%
plot_ame_heatmap(
id = motif_id,
value = tp_percent,
group = fct_rev(targets)
) +
geom_label(
aes(label = percent(tp_percent/100, 0.1)),
size = 3.5,
show.legend = FALSE
) +
facet_grid(TFAP2B ~ motif_alt_id, scales = "free", space = "free") +
scale_x_discrete(expand = expansion(c(0, 0))) +
scale_y_discrete(expand = expansion(c(0, 0))) +
labs(
x = "Detected Motif",
y = "Detected ChIP Targets",
fill = expression(widehat("TP%"))
) +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(angle = 0, hjust = 1, size = 10),
panel.grid = element_blank()
)
Enriched motifs when analysing all sequences associated with one or more ChIP targets. Motifs are only shown if found in >50% of sequences from at least one set of peaks.
When restricting the peaks to being any ChIP-target binding peak associated with H3K27ac significantly more motifs were considered to be enriched.
Notably:
No other striking patterns were observed.
x <- ame_by_targets_active %>%
bind_rows(.id = "targets") %>%
dplyr::filter(
!motif_alt_id %in% targets,
!str_detect(motif_alt_id, "(GATA[0-9]|FOX[A-Z][0-9]|TFAP2[A-Z])"),
tp_percent > 50
) %>%
distinct(motif_id) %>%
pull("motif_id")
cormat <- meme_db %>%
dplyr::filter(name %in% x) %>%
to_list() %>%
compare_motifs()
cormat[cormat < 0.9] <- 0
cormat[is.na(cormat)] <- 0
cl <- as.dist(1 - cormat) %>%
hclust() %>%
cutree(h= 0.9)
ame_by_targets_active %>%
bind_rows(.id = "targets") %>%
dplyr::filter(motif_id %in% names(cl)) %>%
mutate(
cluster = cl[motif_id],
motif_id = as.factor(motif_id),
targets = as.factor(targets),
TFAP2B = ifelse(
str_detect(targets, "TFAP2B"), "TFAP2B", "No TFAP2B"
)
) %>%
group_by(cluster) %>%
mutate(
best_tp = max(tp_percent)
) %>%
ungroup() %>%
group_by(motif_id) %>%
dplyr::filter(any(tp_percent == best_tp)) %>%
ungroup() %>%
plot_ame_heatmap(
id = motif_id,
value = tp_percent,
group = fct_rev(targets)
) +
geom_label(
aes(label = percent(tp_percent/100, 0.1)),
size = 3,
show.legend = FALSE
) +
facet_grid(TFAP2B ~ motif_alt_id, scales = "free", space = "free") +
scale_x_discrete(expand = expansion(c(0, 0))) +
scale_y_discrete(expand = expansion(c(0, 0))) +
labs(
x = "Detected Motif",
y = "Detected ChIP Targets",
fill = expression(widehat("TP%"))
) +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8),
axis.text.y = element_text(angle = 0, hjust = 1, size = 8),
panel.grid = element_blank()
)
Enriched motifs when analysing all sequences associated with one or more ChIP targets and which overlap an H3K27ac peak. Motifs are only shown if found in >50% of sequences from at least one set of peaks.
In order to ascertain whether any co-factors are playing a role in active vs inactive regulatory regions, the set of sequences associated with binding of all four targets were compared to each other, based on H3K27ac overlap as the marker of an active regulatory region. Enriched motifs were those with an \(E\)-value < 0.01 and with a True Positive match in > 10% of sequences.
Due to the high-levels of similarity between identified motifs, enriched motifs were clustered by:
For each motif cluster the values used for visualisation were the lowest \(E\)-value within the cluster and the high estimate of True Positive matches.
x <- ame_active_vs_inactive$`AR FOXA1 GATA3 TFAP2B` %>%
dplyr::filter(
evalue < 0.01,
tp_percent > 10
) %>%
distinct(motif_id) %>%
pull("motif_id")
cormat <- meme_db %>%
dplyr::filter(name %in% x) %>%
to_list() %>%
compare_motifs()
cormat[cormat < 0.80] <- 0
cormat[is.na(cormat)] <- 0
cl_active_vs_inactive <- as.dist(1 - cormat) %>%
hclust() %>%
cutree(h= 0.9)
ame_active_vs_inactive$`AR FOXA1 GATA3 TFAP2B` %>%
dplyr::filter(motif_id %in% names(cl_active_vs_inactive)) %>%
mutate(
cluster = cl_active_vs_inactive[motif_id],
in_rime = motif_alt_id %in% rime_detected$gene_name,
motif_alt_id = ifelse(in_rime, paste0(motif_alt_id, "*"), motif_alt_id)
) %>%
group_by(cluster) %>%
summarise(
tp_percent = max(tp_percent),
p = min(evalue),
id = motif_alt_id %>%
unique %>%
sort() %>%
paste(collapse = "/")
) %>%
arrange(p) %>%
mutate(
id = str_trunc(id, width = 40) %>% fct_inorder()
) %>%
ggplot(
aes(tp_percent, fct_rev(id), fill = -log10(p))
) +
geom_col() +
scale_x_continuous(expand = expansion(c(0, 0.05))) +
scale_fill_viridis_c() +
labs(
x = "True Positive Matches (%)",
y = "Motif Cluster"
)
Motifs found enriched when comparing sequences from peaks overlapping an active H3K27ac regulatory region, with those from peaks not overlapping an H3K27ac-marked region. Motifs were clustered as described in the text, with any TFs detected as bound to AR in the RIME dataset indicated with an asterisk. Interestingly, TFAP2B motifs appeared to be enriched in this set of sequences.
The same approach was then take for sequences associated with inactive regulatory regions, by comparing these sequences to those from active regulatory regions.
x <- ame_inactive_vs_active$`AR FOXA1 GATA3 TFAP2B` %>%
dplyr::filter(evalue < 0.01, tp_percent > 10) %>%
distinct(motif_id) %>%
pull("motif_id")
cormat <- meme_db %>%
dplyr::filter(name %in% x) %>%
to_list() %>%
compare_motifs()
cormat[cormat < 0.8] <- 0
cormat[is.na(cormat)] <- 0
cl_inactive_vs_active <- as.dist(1 - cormat) %>%
hclust() %>%
cutree(h= 0.9)
ame_inactive_vs_active$`AR FOXA1 GATA3 TFAP2B` %>%
dplyr::filter(motif_id %in% names(cl_inactive_vs_active)) %>%
mutate(
cluster = cl_inactive_vs_active[motif_id],
in_rime = motif_alt_id %in% rime_detected$gene_name,
motif_alt_id = ifelse(in_rime, paste0(motif_alt_id, "*"), motif_alt_id)
) %>%
group_by(cluster) %>%
summarise(
tp_percent = max(tp_percent),
p = min(evalue),
id = motif_alt_id %>%
unique %>%
sort() %>%
paste(collapse = "/")
) %>%
arrange(p) %>%
mutate(
id = str_trunc(id, width = 40) %>% fct_inorder()
) %>%
ggplot(
aes(tp_percent, fct_rev(id), fill = -log10(p))
) +
geom_col() +
scale_x_continuous(expand = expansion(c(0, 0.05))) +
scale_fill_viridis_c() +
labs(
x = "True Positive Matches (%)",
y = "Motif Cluster"
)
Motifs found enriched when comparing sequences from peaks overlapping an inactive H3K27ac regulatory region, with those from peaks overlapping an H3K27ac-marked region. Motifs were clustered as described in the text, with any TFs detected as bound to AR in the RIME dataset indicated with an asterisk. Both FOXA1 and GATA3 motifs appeared to be enriched in this set of sequences.
For the 652 sequences marked by H3K27ac which overlapped a promoter, and where all four targets were bound, sequences were compared to a random shuffled set of sequences.
x <- ame_all4_promoters %>%
dplyr::filter(evalue < 0.01, tp_percent > 10) %>%
distinct(motif_id) %>%
pull("motif_id")
cormat <- meme_db %>%
dplyr::filter(name %in% x) %>%
to_list() %>%
compare_motifs()
cormat[cormat < 0.8] <- 0
cormat[is.na(cormat)] <- 0
cl_promoters <- as.dist(1 - cormat) %>%
hclust() %>%
cutree(h= 0.9)
ame_all4_promoters %>%
dplyr::filter(motif_id %in% names(cl_promoters)) %>%
mutate(
cluster = cl_promoters[motif_id],
in_rime = motif_alt_id %in% rime_detected$gene_name,
motif_alt_id = ifelse(in_rime, paste0(motif_alt_id, "*"), motif_alt_id)
) %>%
group_by(cluster) %>%
summarise(
tp_percent = max(tp_percent),
p = min(evalue),
id = motif_alt_id %>%
unique %>%
sort() %>%
paste(collapse = "/")
) %>%
arrange(p) %>%
mutate(
id = str_trunc(id, width = 40) %>% fct_inorder()
) %>%
ggplot(
aes(tp_percent, fct_rev(id), fill = -log10(p))
) +
geom_col() +
scale_x_continuous(expand = expansion(c(0, 0.05))) +
scale_fill_viridis_c() +
labs(
x = "True Positive Matches (%)",
y = "Motif Cluster"
)
Motifs found enriched when comparing sequences from peaks overlapping an active H3K27ac regulatory region which itself overlapped a TSS, as a putative promoter, to a set of shuffled sequences. Motifs were clustered as described in the text, with any TFs detected as bound to AR in the RIME dataset indicated with an asterisk. Both FOXA1 and TFAP2B motifs appeared to be enriched in this set of sequences.
For the 2022 sequences marked by an H3K27ac peak which overlapped a putative enhancer, and where all four targets were bound, sequences were compared to a random shuffled set of sequences.
x <- ame_all4_enh %>%
dplyr::filter(evalue < 0.01, tp_percent > 10) %>%
distinct(motif_id) %>%
pull("motif_id")
cormat <- meme_db %>%
dplyr::filter(name %in% x) %>%
to_list() %>%
compare_motifs()
cormat[cormat < 0.8] <- 0
cormat[is.na(cormat)] <- 0
cl_enh <- as.dist(1 - cormat) %>%
hclust() %>%
cutree(h= 0.9)
ame_all4_enh %>%
dplyr::filter(motif_id %in% names(cl_enh)) %>%
mutate(
cluster = cl_enh[motif_id],
in_rime = motif_alt_id %in% rime_detected$gene_name,
motif_alt_id = ifelse(in_rime, paste0(motif_alt_id, "*"), motif_alt_id)
) %>%
group_by(cluster) %>%
summarise(
tp_percent = max(tp_percent),
p = min(evalue),
id = motif_alt_id %>%
unique %>%
sort() %>%
paste(collapse = "/")
) %>%
arrange(p) %>%
mutate(
id = str_trunc(id, width = 40) %>% fct_inorder()
) %>%
# dplyr::slice(1:20) %>%
ggplot(
aes(tp_percent, fct_rev(id), fill = -log10(p))
) +
geom_col() +
scale_x_continuous(expand = expansion(c(0, 0.05))) +
scale_fill_viridis_c() +
labs(
x = "True Positive Matches (%)",
y = "Motif Cluster"
)
Motifs found enriched when comparing sequences from peaks overlapping an active H3K27ac regulatory region which did not overlap a TSS, as a putative enhancer, to a set of shuffled sequences. Motifs were clustered as described in the text, with any TFs detected as bound to AR in the RIME dataset indicated with an asterisk. Both FOXA1 and TFAP2B motifs appeared to be enriched in this set of sequences.
Given the unexpected result that TFAP2\(\beta\) motifs were enriched in the H3K27ac marked peaks (i.e. active regulatory regions), whilst FOXA1 and GATA motifs were enriched in the peaks not marked by H3K27ac (i.e. inactive regulatory regions), the co-occurrence of these motifs was explored across both sets of peaks as a predictive tool for an active regulatory region. Sequences were checked for the simple presence of each motif, using matches with a score > 60% of the maximum possible score as a positive match, to be as inclusive as possible. For all sequences, only the highest scoring match to the consensus motif was retained. For comparability across motifs, motif match scores were normalised to Z-scores.
inv.logit <- binomial()$linkinv
min_score <- percent(0.6, accuracy = 1)
all4_motifs <- meme_db_detected %>%
dplyr::filter(altname %in% targets) %>%
to_list() %>%
mclapply(
getBestMatch,
subject = seq_by_targets$`AR FOXA1 GATA3 TFAP2B` %>%
unlist() %>%
as("DNAStringSet") %>%
as("Views") %>%
setNames(
seq_by_targets$`AR FOXA1 GATA3 TFAP2B` %>%
unlist() %>%
names() %>%
str_remove_all("^(TRUE|FALSE)\\.")
),
min.score = min_score,
mc.cores = length(targets)
) %>%
setNames(
dplyr::filter(meme_db_detected, altname %in% targets)$name
) %>%
bind_rows(.id = "name") %>%
left_join(
dplyr::select(meme_info, name, altname, motif_length),
by = "name"
) %>%
dplyr::rename(range = view_name) %>%
full_join(
grl_by_targets$`AR FOXA1 GATA3 TFAP2B` %>%
as_tibble(),
by = "range"
)
glm_h3K27ac <- all4_motifs %>%
group_by(name) %>%
mutate(
score = (score - mean(score, na.rm = TRUE)) / sd(score, na.rm = TRUE)
) %>%
pivot_wider(
names_from = "name", values_from = "score", id_cols = c("range", "H3K27ac")
) %>%
dplyr::filter(
!is.na(MA0811.1), !is.na(MA0812.1), !is.na(MA0813.1)
) %>%
mutate(
PC1 = cbind(MA0811.1, MA0812.1, MA0813.1) %>%
prcomp() %>%
.[["x"]] %>%
.[,"PC1"] %>%
multiply_by(-1), # This always seems to come out in the opposite direction to observed
width = width(GRanges(range))
) %>%
with(
glm(H3K27ac ~ width + (PC1 + MA0007.2 + MA0148.4 + MA0037.3)^4 , family = "binomial")
) %>%
step() %>%
broom::tidy() %>%
mutate(
term = term %>%
str_replace_all("PC1", "TFAP2B (PC1)") %>%
str_replace_all("MA0007.2", "AR") %>%
str_replace_all("MA0148.4", "FOXA1") %>%
str_replace_all("MA0037.3", "GATA3"),
fdr = p.adjust(p.value, "fdr"),
` ` = case_when(
fdr > 0.1 ~ "",
fdr > 0.05 ~ ".",
fdr > 0.01 ~ "*",
fdr > 0.001 ~ "**",
TRUE ~ "***"
)
)
all4_motifs %>%
group_by(name) %>%
mutate(
score = (score - mean(score, na.rm = TRUE)) / sd(score, na.rm = TRUE)
) %>%
ggplot(aes(name, score, fill = H3K27ac)) +
geom_boxplot() +
facet_grid(.~altname, scales = "free", space = "free") +
scale_fill_manual(values = c("grey50", "red")) +
labs(
x = "Motif name", y = "Normalised Motif Match Score"
)
Distribution of Z-Scores for matches to each motif, separated by whether the peak was associated with an H3K27ac mark. Motifs with higher scores were clearly associated with H3K27ac marks for all TFAP2B motifs, whilst some tendency towards lower scores was noted for FOXA1 and GATA3, in keeping with the above enrichment results.
To quantify the impact of motif presence within the sequences where all 4 ChIP targets were detected, the probability of a peak being associated with H3K27ac marks was modelled using the strength of each motif in respect to the PWM. To avoid the high correlations between the three TFAP2\(\beta\) motifs, these were combined using PC1 returned from PCA across all three motifs as a representative score. This may make the direction of the probability more difficult to interpret as a lower values on PC1 may represent a higher match across all 3 motifs.
A logistic regression model was fitted using H3K27ac status as a the binary response variable with the scores for AR (MA0007.2), FOXA1 (MA0148.4) and GATA3 (MA0037.3) motifs being combined with PC1 across all 3 TFAP2\(\beta\) motifs as the predictors. The initial model was fitted incorporating 4-way interactions as the initial model, along with peak width, with terms being removed in a step-wise manner using Akaike’s Information Criterion (AIC).
cp <- htmltools::tags$caption(
glue(
"
Results from final logistic regression model for the probability of a peak
being marked by H3K27ac, based on the score of each detected motif's match
to the PWM, and the overall width of the peak. Whilst the width of the peak
clearly had a significant influence on the probability of a peak being
considered active, via H3K27ac, the presence of a strong TFAP2B motif also
had a clear influence on the likelihood of a peak being considered active.
Conversely, the stronger matches to the core FOXA1 and GATA3 motifs
indicated a decreased association with H3K27ac marks, as predicted by the
sequence enrichment analysis.
"
)
)
tbl <- glm_h3K27ac %>%
mutate(term = str_replace_all(term, "width", "Peak Width")) %>%
reactable(
columns = list(
term = colDef(
name = "Model Term",
cell = function(value) str_replace_all(value, ":", " : "),
minWidth = 150
),
estimate = colDef(
name = "Estimate", format = colFormat(digits = 3)
),
std.error = colDef(show = FALSE),
statistic = colDef(name = "Test Statistic"),
p.value = colDef(
show = FALSE,
cell = function(value) {
ifelse(
value < 0.001,
sprintf("%.2e", value),
sprintf("%.3f", value)
)
}
),
fdr = colDef(
name = "P<sub>FDR</sub>", html = TRUE,
cell = function(value) {
ifelse(
value < 0.001,
sprintf("%.2e", value),
sprintf("%.3f", value)
)
}
)
),
defaultColDef = colDef(
format = colFormat(digits = 2)
)
)
div(class = "table",
div(class = "table-header",
div(class = "caption", cp),
tbl
)
)
The distance between the best match to each motif was then assessed for the set of peaks for which all four targets were detected, separated by H3K27ac status.
pwm_lengths <- structure(meme_info$motif_length, names = meme_info$name)
all4_motifs %>%
arrange(view) %>%
## The best strategy is to find the distance between centres, then add half
## the width of each motif
mutate(match_centre = (start_in_view + end_in_view) / 2) %>%
pivot_wider(
names_from = "name", values_from = "match_centre", id_cols = c("range", "H3K27ac")
) %>%
mutate(
## AR
MA0007.2_MA0037.3 = abs(MA0007.2 - MA0037.3 ) - 0.5*(sum(pwm_lengths[c("MA0007.2", "MA0037.3")])),
MA0007.2_MA0148.4 = abs(MA0007.2 - MA0148.4 ) - 0.5*(sum(pwm_lengths[c("MA0007.2", "MA0148.4")])),
MA0007.2_MA0811.1 = abs(MA0007.2 - MA0811.1 ) - 0.5*(sum(pwm_lengths[c("MA0007.2", "MA0811.1")])),
MA0007.2_MA0812.1 = abs(MA0007.2 - MA0812.1 ) - 0.5*(sum(pwm_lengths[c("MA0007.2", "MA0812.1")])),
MA0007.2_MA0813.1 = abs(MA0007.2 - MA0813.1 ) - 0.5*(sum(pwm_lengths[c("MA0007.2", "MA0813.1")])),
## FOXA1
MA0148.4_MA0037.3 = abs(MA0148.4 - MA0037.3 ) - 0.5*(sum(pwm_lengths[c("MA0148.4", "MA0037.3")])),
MA0148.4_MA0811.1 = abs(MA0148.4 - MA0811.1 ) - 0.5*(sum(pwm_lengths[c("MA0148.4", "MA0811.1")])),
MA0148.4_MA0812.1 = abs(MA0148.4 - MA0812.1 ) - 0.5*(sum(pwm_lengths[c("MA0148.4", "MA0812.1")])),
MA0148.4_MA0813.1 = abs(MA0148.4 - MA0813.1 ) - 0.5*(sum(pwm_lengths[c("MA0148.4", "MA0813.1")])),
## GATA3
MA0037.3_MA0811.1 = abs(MA0037.3 - MA0811.1 ) - 0.5*(sum(pwm_lengths[c("MA0037.3", "MA0811.1")])),
MA0037.3_MA0812.1 = abs(MA0037.3 - MA0812.1 ) - 0.5*(sum(pwm_lengths[c("MA0037.3", "MA0812.1")])),
MA0037.3_MA0813.1 = abs(MA0037.3 - MA0813.1 ) - 0.5*(sum(pwm_lengths[c("MA0037.3", "MA0813.1")])),
) %>%
dplyr::select(range, H3K27ac, contains("_")) %>%
pivot_longer(starts_with("MA"), names_to = "motifs", values_to = "d") %>%
separate(motifs, into = c("pwm_1", "pwm_2"), sep = "_") %>%
left_join(
dplyr::select(meme_info, pwm_1 = name, altname_1 = altname), by = "pwm_1"
) %>%
left_join(
dplyr::select(meme_info, pwm_2 = name, altname_2 = altname), by = "pwm_2"
) %>%
arrange(range, d) %>%
distinct(range, altname_1, altname_2, .keep_all = TRUE) %>%
unite("comparison", starts_with("altname"), sep = " - ") %>%
mutate(
d = ifelse(d < 0, 0, d),
H3K27ac = ifelse(H3K27ac, "Active", "Inactive")
) %>%
ggplot(aes(d, stat(density), colour = H3K27ac)) +
geom_density() +
geom_vline(
aes(xintercept = med, colour = H3K27ac),
data = . %>%
group_by(comparison, H3K27ac) %>%
summarise(med = median(d), .groups = "drop"),
show.legend = FALSE
) +
facet_grid(comparison ~ ., scales = "free_y") +
scale_colour_manual(values = c("red", "black")) +
coord_cartesian(xlim = c(0, 500)) +
theme(legend.position = "top")
Distances between motifs for the best match for each motif. For comparisons to TFAP2B motifs, the nearest of the three was selected. Whilst the median distances between AR, FOXA1 and GATA3 motifs tended to be between 160 and 200bp, the median distances between TFAP2B motifs and the the other motifs tended to be around 80 to 100bp, suggesting that TFAP2B may bind between two of the other three targets in many complexes. A trend towards motifs being further aart within the active regulatory regions was moted, but a simple explanation for this remains elusive.
Showing all motifs in Richard’s manually curated list
ids <- c("MA0640.2", "MA0136.2", "MA0760.1", "MA0686.1")
meme_db %>%
dplyr::filter(name %in% ids) %>%
mutate(name = factor(name, levels = ids)) %>%
arrange(name) %>%
mutate(name = as.character(name)) %>%
to_list() %>%
lapply(
function(x) {
x@name <- paste0(str_pad(x@altname, 16, side = "both"), "\n(", x@name, ")")
x
}
) %>%
view_motifs(names.pos = "right")
Version | Author | Date |
---|---|---|
ee8cf46 | Steve Pederson | 2022-06-23 |
Showing only the motifs associated with detected genes, noting that ELF5 expression was not detected in the RNA-seq dataset
meme_db_detected %>%
dplyr::filter(name %in% ids) %>%
mutate(name = factor(name, levels = ids)) %>%
arrange(name) %>%
mutate(name = as.character(name)) %>%
to_list() %>%
lapply(
function(x) {
x@name <- paste0(str_pad(x@altname, 16, side = "both"), "\n(", x@name, ")")
x
}
) %>%
view_motifs(names.pos = "right")
Version | Author | Date |
---|---|---|
ee8cf46 | Steve Pederson | 2022-06-23 |
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] multcomp_1.4-19 TH.data_1.1-1
[3] MASS_7.3-56 survival_3.2-13
[5] mvtnorm_1.1-3 htmltools_0.5.2
[7] reactable_0.3.0 UpSetR_1.4.0
[9] readxl_1.4.0 gt_0.6.0
[11] cowplot_1.1.1 glue_1.6.2
[13] pander_0.6.5 scales_1.2.0
[15] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.64.0
[17] Biostrings_2.64.0 XVector_0.36.0
[19] magrittr_2.0.3 rtracklayer_1.56.0
[21] extraChIPs_1.0.1 SummarizedExperiment_1.26.1
[23] Biobase_2.56.0 MatrixGenerics_1.8.0
[25] matrixStats_0.62.0 BiocParallel_1.30.2
[27] plyranges_1.16.0 GenomicRanges_1.48.0
[29] GenomeInfoDb_1.32.2 IRanges_2.30.0
[31] S4Vectors_0.34.0 BiocGenerics_0.42.0
[33] forcats_0.5.1 stringr_1.4.0
[35] dplyr_1.0.9 purrr_0.3.4
[37] readr_2.1.2 tidyr_1.2.0
[39] tibble_3.1.7 ggplot2_3.3.6
[41] tidyverse_1.3.1 universalmotif_1.14.1
[43] memes_1.4.1 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 R.methodsS3_1.8.1
[3] csaw_1.30.1 bit64_4.0.5
[5] knitr_1.39 R.utils_2.11.0
[7] DelayedArray_0.22.0 data.table_1.14.2
[9] rpart_4.1.16 KEGGREST_1.36.0
[11] RCurl_1.98-1.6 AnnotationFilter_1.20.0
[13] doParallel_1.0.17 generics_0.1.2
[15] GenomicFeatures_1.48.3 callr_3.7.0
[17] EnrichedHeatmap_1.26.0 usethis_2.1.6
[19] RSQLite_2.2.14 bit_4.0.4
[21] tzdb_0.3.0 xml2_1.3.3
[23] lubridate_1.8.0 httpuv_1.6.5
[25] assertthat_0.2.1 xfun_0.31
[27] hms_1.1.1 jquerylib_0.1.4
[29] evaluate_0.15 promises_1.2.0.1
[31] fansi_1.0.3 restfulr_0.0.13
[33] progress_1.2.2 dbplyr_2.1.1
[35] igraph_1.3.1 DBI_1.1.2
[37] htmlwidgets_1.5.4 ellipsis_0.3.2
[39] crosstalk_1.2.0 backports_1.4.1
[41] biomaRt_2.52.0 vctrs_0.4.1
[43] here_1.0.1 ensembldb_2.20.1
[45] cachem_1.0.6 withr_2.5.0
[47] ggforce_0.3.3 Gviz_1.40.1
[49] vroom_1.5.7 checkmate_2.1.0
[51] GenomicAlignments_1.32.0 prettyunits_1.1.1
[53] cluster_2.1.3 lazyeval_0.2.2
[55] crayon_1.5.1 labeling_0.4.2
[57] edgeR_3.38.1 pkgconfig_2.0.3
[59] tweenr_1.0.2 pkgload_1.2.4
[61] ProtGenerics_1.28.0 nnet_7.3-17
[63] rlang_1.0.2 lifecycle_1.0.1
[65] sandwich_3.0-1 filelock_1.0.2
[67] BiocFileCache_2.4.0 modelr_0.1.8
[69] dichromat_2.0-0.1 cellranger_1.1.0
[71] rprojroot_2.0.3 polyclip_1.10-0
[73] reactR_0.4.4 Matrix_1.4-1
[75] ggseqlogo_0.1 zoo_1.8-10
[77] reprex_2.0.1 base64enc_0.1-3
[79] whisker_0.4 GlobalOptions_0.1.2
[81] processx_3.5.3 viridisLite_0.4.0
[83] png_0.1-7 rjson_0.2.21
[85] GenomicInteractions_1.30.0 bitops_1.0-7
[87] R.oo_1.24.0 cmdfun_1.0.2
[89] getPass_0.2-2 blob_1.2.3
[91] shape_1.4.6 jpeg_0.1-9
[93] memoise_2.0.1 plyr_1.8.7
[95] zlibbioc_1.42.0 compiler_4.2.0
[97] scatterpie_0.1.7 BiocIO_1.6.0
[99] RColorBrewer_1.1-3 clue_0.3-61
[101] Rsamtools_2.12.0 cli_3.3.0
[103] ps_1.7.0 htmlTable_2.4.0
[105] Formula_1.2-4 ggside_0.2.0.9990
[107] tidyselect_1.1.2 stringi_1.7.6
[109] highr_0.9 yaml_2.3.5
[111] locfit_1.5-9.5 latticeExtra_0.6-29
[113] ggrepel_0.9.1 grid_4.2.0
[115] sass_0.4.1 VariantAnnotation_1.42.1
[117] tools_4.2.0 circlize_0.4.15
[119] rstudioapi_0.13 foreach_1.5.2
[121] foreign_0.8-82 git2r_0.30.1
[123] metapod_1.4.0 gridExtra_2.3
[125] farver_2.1.0 digest_0.6.29
[127] Rcpp_1.0.8.3 broom_0.8.0
[129] later_1.3.0 httr_1.4.3
[131] AnnotationDbi_1.58.0 biovizBase_1.44.0
[133] ComplexHeatmap_2.12.0 colorspace_2.0-3
[135] rvest_1.0.2 brio_1.1.3
[137] XML_3.99-0.9 fs_1.5.2
[139] splines_4.2.0 jsonlite_1.8.0
[141] ggfun_0.0.6 testthat_3.1.4
[143] R6_2.5.1 Hmisc_4.7-0
[145] pillar_1.7.0 fastmap_1.1.0
[147] codetools_0.2-18 utf8_1.2.2
[149] lattice_0.20-45 bslib_0.3.1
[151] curl_4.3.2 limma_3.52.1
[153] rmarkdown_2.14 InteractionSet_1.24.0
[155] desc_1.4.1 munsell_0.5.0
[157] GetoptLong_1.0.5 GenomeInfoDbData_1.2.8
[159] iterators_1.0.14 haven_2.5.0
[161] gtable_0.3.0