Last updated: 2021-08-20

Checks: 7 0

Knit directory: MFM-223_DHT-RNASeq/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200930)

The command set.seed(20200930) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 31bb81f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 31bb81f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  100Day_Cancer_Free.png

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/dge_analysis.Rmd) and HTML (docs/dge_analysis.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	31bb81f	Steve Ped	2021-08-20	Exported PCA pdf
html	5f368f7	Steve Ped	2021-08-20	Build site.
Rmd	9623f4a	Steve Ped	2021-08-20	Fixed CPM weirdness
html	5ef974f	Steve Ped	2021-08-20	Build site.
Rmd	add3750	Steve Ped	2021-08-20	Added export of pdf for MA & Volcano plots
html	af146a0	Steve Ped	2020-11-03	Build site.
Rmd	cbf7fbd	Steve Ped	2020-11-03	Rebuilt all after reorganising Rmd files
html	e7cd42f	Steve Ped	2020-11-03	Build site.
Rmd	154ffc2	Steve Ped	2020-11-03	Added export of CPM values
html	f2de6cc	Steve Ped	2020-10-15	Build site.
Rmd	2f025f7	Steve Pederson	2020-10-14	Reset for run with correct sjdbOverhang
Rmd	f5f296f	Steve Pederson	2020-10-14	Initial commit

library(tidyverse)
library(yaml)
library(scales)
library(pander)
library(glue)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(cowplot)
library(ggfortify)
library(magrittr)
library(cqn)
library(ggrepel)
library(DT)

panderOptions("table.split.table", Inf)
panderOptions("big.mark", ",")
theme_set(theme_bw())

config <- here::here("config/config.yml") %>%
  read_yaml()
suffix <- paste0(config$tag)
sp <- config$ref$species %>%
  str_replace("(^[a-z])[a-z]*_([a-z]+)", "\\1\\2") %>%
  str_to_title()

if (!dir.exists(here::here("docs/assets"))) dir.create(here::here("docs/assets"))

Setup

Gene Annotations

ah <- AnnotationHub() %>%
  subset(rdataclass == "EnsDb") %>%
  subset(str_detect(description, as.character(config$ref$release))) %>%
  subset(genome == config$ref$build)
stopifnot(length(ah) == 1)

ensDb <- ah[[1]]
genesGR <- genes(ensDb)
transGR <- transcripts(ensDb)

mcols(transGR) <- mcols(transGR) %>%
  cbind(
    transcriptLengths(ensDb)[rownames(.), c("nexon", "tx_len")]
  )

mcols(genesGR) <- mcols(genesGR) %>%
  as.data.frame() %>%
  dplyr::select(
    gene_id, gene_name, gene_biotype, entrezid
  ) %>%
  left_join(
    mcols(transGR) %>%
      as.data.frame() %>%
      mutate(
        tx_support_level = case_when(
          is.na(tx_support_level) ~ 1L, 
          TRUE ~ tx_support_level
        )
      ) %>%
      group_by(gene_id) %>%
      summarise(
        n_tx = n(),
        longest_tx = max(tx_len),
        ave_tx_len = mean(tx_len),
        gc_content = sum(tx_len*gc_content) / sum(tx_len)
      ),
    by = "gene_id"
  ) %>%
  set_rownames(.$gene_id) %>%
  as("DataFrame")

Annotation data was loaded as an EnsDb object, using Ensembl release 101. Transcript level gene lengths and GC content was converted to gene level values using:

GC Content: The total GC content divided by the total length of transcripts
Gene Length: The mean transcript length

Sample Merging

samples <- config$samples %>%
  here::here() %>%
  read_tsv() %>%
  mutate(
    Filename = paste0(sample, suffix)
  ) %>%
  dplyr::select(Filename, flowcell_id, treat, rep, desc) %>%
  mutate(name = str_extract(desc, "MFM[0-9].++")) %>%
  dplyr::select(-desc) %>%
  mutate_if(
    function(x){length(unique(x)) < length(x)},
    as.factor
  ) %>%
  mutate(
    treat = relevel(treat, ref = "Vehicle")
  )

mergedSamples <- samples %>%
  group_by(name, rep, treat) %>%
  tally()

pander(
  mergedSamples,
  caption = "*Summary of sequencing total runs for all samples*"
)

*Summary of sequencing total runs for all samples*
name	rep	treat	n
MFM1-D	1	DHT	8
MFM1-V	1	Vehicle	8
MFM2-D	2	DHT	8
MFM3-D	3	DHT	8
MFM3-V	3	Vehicle	8
MFM4-D	4	DHT	8
MFM4-V	4	Vehicle	8
MFM5-D	5	DHT	8
MFM5-V	5	Vehicle	8
MFM6-D	6	DHT	8
MFM6-V	6	Vehicle	8

This data set is slightly unique in that samples were sequenced across multiple lanes and flowcells. Metadata was re-organised to enable summarising of counts across the multiple sequencing runs obtained for every sample. The initial PCA showed that variability between replicates and flowcells was dwarfed by variability between samples. As such, all counts were able to be merged to give a single set of counts for each individual sample.

Importantly, no control (Vehicle) sample was found for replicate 2 and if using a nested approach this treated sample may need to be omitted

treat_cols <- hcl.colors(
  n = length(levels(mergedSamples$treat)), 
  palette = "Red-Blue", 
  rev = TRUE
  ) %>%
  setNames(levels(mergedSamples$treat)) 
rep_cols <- hcl.colors(
  n = length(levels(mergedSamples$rep)), 
  palette = "Spectral"
  ) %>%
  setNames(levels(mergedSamples$rep))

Count Data

counts <- here::here("data/aligned/counts/counts.out") %>%
  read_tsv(comment = "#") %>%
  dplyr::select(Geneid, ends_with("bam")) %>%
  rename_at(vars(ends_with("bam")), dirname) %>%
  rename_all(basename) %>%
  column_to_rownames("Geneid")

mergedCounts <- counts %>%
  rownames_to_column("gene_id") %>%
  pivot_longer(
    cols = -gene_id,
    names_to = "Filename",
    values_to = "counts"
  ) %>%
  left_join(samples) %>%
  group_by(
    gene_id, name, rep, treat
  ) %>%
  summarise(counts = sum(counts), .groups = "drop") %>%
  pivot_wider(
    id_cols = gene_id,
    values_from = counts,
    names_from = name
  ) %>%
  column_to_rownames("gene_id")

minCPM <- 1.5
minSamples <- ncol(mergedCounts) / 2
genes2Keep <- mergedCounts %>%
  cpm() %>%
  is_greater_than(minCPM) %>%
  rowSums() %>%
  is_weakly_greater_than(minSamples)

The criteria for a gene to be considered as detected, \(>\) 1.5 counts per million (CPM) were required to observed in \(\geq\) 5.5 samples. This effectively ensured \(\geq\) 10 counts in at least 5.5 merged libraries.

Of the 60,671 genes contained in the annotation for this release, 45,019 genes were removed as failing this criteria for detection, leaving 15,652 genes for downstream analysis.

a <- mergedCounts %>%
  cpm(log = TRUE) %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  as_tibble() %>%
  pivot_longer(
    cols = contains("MFM"),
    names_to = "name",
    values_to = "logCPM"
  ) %>%
  left_join(samples) %>%
  ggplot(aes(logCPM, stat(density), colour = rep, linetype = treat)) +
  geom_density() +
  scale_colour_manual(values= rep_cols) +
  labs(
    y = "Density",
    colour = "Replicate",
    linetype = "Treatment"
  )
b <- mergedCounts[genes2Keep,] %>%
  cpm(log = TRUE) %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  as_tibble() %>%
  pivot_longer(
    cols = contains("MFM"),
    names_to = "name",
    values_to = "logCPM"
  ) %>%
  left_join(samples) %>%
  ggplot(aes(logCPM, stat(density), colour = rep, linetype = treat)) +
  geom_density() +
  scale_colour_manual(values= rep_cols) +
  labs(
    y = "Density",
    colour = "Replicate",
    linetype = "Treatment"
  )
plot_grid(
  a + theme(legend.position = "none"), 
  b + theme(legend.position = "none"), 
  get_legend(a),
  labels = c("A", "B"),
  rel_widths = c(4, 4, 1),
  nrow = 1
)

Distributions of logCPM values on merged counts, A) before and B) after filtering of undetectable genes. Some differences between replicates wre noted.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

dge <- mergedCounts[genes2Keep,] %>%
  DGEList(
    samples = mergedSamples %>%
      as.data.frame() %>%
      set_rownames(.$name) %>%
      .[,colnames(.)],
    group = mergedSamples %>%
      as.data.frame() %>%
      set_rownames(.$name) %>%
      .[,colnames(.)] %>%
      pull(treat),
    genes = genesGR[rownames(.)]
  ) %>%
  calcNormFactors()

QC

Library Sizes

dge$samples %>%
  ggplot(aes(name, lib.size, fill = treat)) +
  geom_col() +
  geom_hline(yintercept = 1e7, linetype = 2) +
  facet_wrap(~treat, scales = "free_x") +
  scale_y_continuous(
    labels = comma, expand = expansion(c(0, 0.05))
  ) +
  scale_fill_manual(values= treat_cols) +
  labs(x = "Sample Name", y = "Library Size") +
  theme(legend.position = "none")

Library sizes after removal of undetectable genes. The common-use minimum library size of 10 million reads is shown as a dashed line.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

Given that previous work had only assessed the libraries prior to merging, the library sizes were checked after merging replicate sequencing runs. The median library size was found to be 13.3 million reads which is just above the common-use minimum recommendation of 10 million reads/sample.

However, samples MFM3-D and MFM4-D were both below the common-use minimum library size.

PCA: Pre Normalisation

pca <- dge %>%
  cpm(log = TRUE) %>%
  t() %>%
  prcomp()

pca %>%
  autoplot(
    data = dge$samples, 
    x = 1, y = 2,
    colour = "rep", 
    shape = "treat", 
    size = 3
  ) +
  geom_segment(
    aes(
      x = Vehicle_PC1, xend = DHT_PC1, 
      y = Vehicle_PC2, yend = DHT_PC2,
      colour = rep
    ),
    data = . %>%
      dplyr::select(any_of(colnames(dge$samples)), PC1, PC2) %>% 
      pivot_longer(cols = starts_with("PC"), names_to = "PC", values_to = "value") %>% 
      pivot_wider(names_from = c("treat", "PC"), values_from = value, id_cols = c("rep")),
    alpha = 0.4,
    arrow = arrow(),
    show.legend = FALSE
  ) +
  scale_colour_manual(values= rep_cols) +
  labs(
    colour = "Replicate",
    shape = "Treatment"
  )

PCA on logCPM from merged counts. PC1 appears to capture the variability between replicates, whilst PC2 primarily captures the effect of DHT treatment. Replicate 2, whch is missing the control sample appears to be the most divergent from the remaining samples.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

Despite concerns raised by the library sizes, replicates 3 and 4 appears to be relatively consistent with other samples of the same treatment group.

Checks for GC and Length bias

Given the clear unexpected variability captured by PC1, the impact of GC content or length bias was assessed as a possible source.

Genes were divided in 10 approximately equal sized bins based on increasing length, and 10 approximately equal sized bins based on increasing GC content, with the final GC/Length bins being the combination 100 bins using both sets. The contribution of each gene to PC1 and PC2 was assessed and a t-test performed on each bin. This tests

\[ H_0: \mu = 0 \text{ against } H_A: \mu \neq 0 \]

where \(\mu\) represents the true contribution to PC1 of all genes in that bin.

If any bin makes a contribution to PC1 the mean will be clearly non-zero, whilst if there is no contribution the mean will be near zero. In this way, the impact of gene length and GC content on variance within the dataset can be assessed. As seen below, the contribution of GC content and gene length to PC1 is very clear, with a smaller contribution being evident across PC2. As a result, Conditional Quantile Normalisation (CQN) is recommended in preference to the more common TMM normalisation.

dge$genes %>%
  dplyr::select(gene_id, ave_tx_len, gc_content) %>%
  mutate(
    GC = cut(
      x = gc_content,
      labels = seq_len(10),
      breaks = quantile(gc_content, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    Length = cut(
      x = ave_tx_len,
      labels = seq_len(10),
      breaks = quantile(ave_tx_len, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    bin = paste(GC, Length, sep = "_"),
    PC1 = pca$rotation[gene_id, "PC1"],
    PC2 = pca$rotation[gene_id, "PC2"]
  ) %>%
  pivot_longer(
    cols = c("PC1", "PC2"),
    names_to = "PC",
    values_to = "value"
  ) %>%
  group_by(PC, GC, Length, bin) %>%
  summarise(
    Size = n(),
    mean = mean(value),
    sd = sd(value),
    t = t.test(value)$statistic,
    p = t.test(value)$p.value,
    adjP = p.adjust(p, method = "bonf")
  ) %>%
  ggplot(
    aes(Length, GC, colour = t, alpha = -log10(adjP), size = Size)
  ) +
  geom_point() +
  facet_wrap(~PC) +
  scale_colour_gradient2() +
  scale_size_continuous(range = c(1, 10)) +
  labs(alpha = expression(paste(-log[10], p))) +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom"
    )

Contribution of each GC/Length Bin to PC1 and PC2. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values. The number of genes in each bin is indicated by the circle size. The clear pattern across PC1 is unambiguous.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

Given these results, the considered options for analysis are:

Use CQN normalisation and GLM-QL Fits
Use limma/voom to take advantage of the ability to down-weight samples
Exclude replicate 2, in combination with either of the above strategies

Normalisation

cqNorm <- with(
  dge,
  cqn(
    counts= counts,
    x = genes$gc_content,
    lengths = genes$ave_tx_len
  )
)
dge$offset <- cqNorm$glm.offset
logCPM <- cqNorm$y + cqNorm$offset

a <- cqNorm$func1 %>%
  as.data.frame() %>%
  mutate(x = cqNorm$grid1) %>%
  pivot_longer(
    cols = any_of(colnames(dge)),
    names_to = "name",
    values_to = "QR fit"
  ) %>%
  left_join(dge$samples) %>%
  ggplot(
    aes(x, `QR fit`, colour = rep, group = name, linetype = treat)
  ) +
  geom_line() +
  labs(x = "GC content", colour = "Replicate", linetype = "Treatment")
b <- cqNorm$func2 %>%
  as.data.frame() %>%
  mutate(x = cqNorm$grid2) %>%
  pivot_longer(
    cols = any_of(colnames(dge)),
    names_to = "name",
    values_to = "QR fit"
  ) %>%
  left_join(dge$samples) %>%
  ggplot(
    aes(x, `QR fit`, colour = rep, group = name, linetype = treat)
  ) +
  geom_line() +
  labs(
    x = expression(paste(log[10], " Gene Length (kb)")),
    colour = "Replicate", linetype = "Treatment"
  )
plot_grid(
  a + theme(legend.position = "none"), 
  b + theme(legend.position = "none"),
  get_legend(a),
  nrow = 1,
  rel_widths = c(3, 3, 1)
)

Model fits used when applying CQN. The divergent samples previously noted on the PCA are again quite divergent here. In particular, the long genes with low GC content appear to be where the primary differences are found, in keeping with the previous PCA analysis.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

pcaPost <- logCPM %>%
  t() %>%
  prcomp()

pcaPost %>%
  autoplot(
    data = dge$samples, 
    x = 1, y = 2,
    colour = "rep", 
    shape = "treat", 
    size = 3
  ) +
  geom_segment(
    aes(
      x = Vehicle_PC1, xend = DHT_PC1, 
      y = Vehicle_PC2, yend = DHT_PC2,
      colour = rep
    ),
    data = . %>%
      dplyr::select(any_of(colnames(dge$samples)), PC1, PC2) %>% 
      pivot_longer(cols = starts_with("PC"), names_to = "PC", values_to = "value") %>% 
      pivot_wider(names_from = c("treat", "PC"), values_from = value, id_cols = c("rep")),
    alpha = 0.4,
    arrow = arrow(),
    show.legend = FALSE
  ) +
  scale_colour_manual(values= rep_cols) +
  labs(
    colour = "Replicate",
    shape = "Treatment"
  )

PCA on logCPM after performing CQN. It is immediately apparent the CQN has removed much of the previous PC1, leaving the post-normalised PC1 capturing the majority of the variability. However, replicate 2 appears to be a notable outlier and removal of this ‘orphaned’ treated sample may be required.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

ggsave(
  here::here("docs/assets/pcaPost.pdf"),
  height = 8, width = 8, units = "in"
)

dge$genes %>%
  dplyr::select(gene_id, ave_tx_len, gc_content) %>%
  mutate(
    GC = cut(
      x = gc_content,
      labels = seq_len(10),
      breaks = quantile(gc_content, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    Length = cut(
      x = ave_tx_len,
      labels = seq_len(10),
      breaks = quantile(ave_tx_len, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    bin = paste(GC, Length, sep = "_"),
    `pre-CQN` = pca$rotation[gene_id, "PC1"],
    `post-CQN` = pcaPost$rotation[gene_id, "PC1"]
  ) %>%
  pivot_longer(
    cols = c("pre-CQN", "post-CQN"),
    names_to = "status",
    values_to = "value"
  ) %>%
  mutate(status = factor(status, levels = c("pre-CQN", "post-CQN"))) %>%
  group_by(status, GC, Length, bin) %>%
  summarise(
    Size = n(),
    mean = mean(value),
    sd = sd(value),
    t = t.test(value)$statistic,
    p = t.test(value)$p.value,
    FDR = p.adjust(p, method = "bonf")
  ) %>%
  ggplot(
    aes(Length, GC, colour = t, alpha = -log10(FDR), size = Size)
  ) +
  geom_point() +
  facet_wrap(~status) +
  scale_colour_gradient2() +
  scale_size_continuous(range = c(1, 10)) +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom"
  )

Contributions of gene length and GC content to PC1 shown both before and after CQN. The strong patterns seen before normalisation have clearly been reduced. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values. The number of genes in each bin is indicated by the circle size. The clear pattern across PC1 is unambiguous.

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

CQN clearly had a positive impact on the data, however, replicate 2 was removed in order to allow the use of a blocking approach, where an individual baseline level is set for each untreated sample and the common difference between vehicle and DHT-treated is then found.

dge <- dge[,!str_detect(colnames(dge), "MFM2")]
dge$samples %<>% droplevels()

Differential Expression Analysis

Main Results

A design matrix was formed giving each replicate it’s own baseline and the common DHT response was then specified as the final column. Following this, dispersions were estimated for the main DGEList object and the model fitted using the Quasi-likelihood GLM.

X <- model.matrix(~0 + rep + treat, data = dge$samples) %>%
  set_colnames(str_remove(colnames(.), "treat"))
dge <- estimateDisp(dge, design = X, robust = TRUE)
fit <- glmQLFit(dge)

alpha <- 0.05
lambda <- log2(1.2)
topTable <- glmTreat(fit, coef = "DHT", lfc = lambda) %>%
  topTags(n = Inf) %>%
  .[["table"]] %>%
  as_tibble() %>%
  mutate(
    location = paste0(seqnames, ":", start, "-", end, ":", strand),
    rankingStat = -sign(logFC)*log10(PValue),
    signedRank = rank(rankingStat),
    DE = FDR < alpha
  ) %>%
  dplyr::select(
    gene_id, gene_name, logCPM, logFC, PValue, FDR, 
    location, gene_biotype, entrezid, ave_tx_len, gc_content, 
    rankingStat, signedRank, DE
  )
de <- dplyr::filter(topTable, DE)$gene_id
up <- dplyr::filter(topTable, DE, logFC > 0)$gene_id
down <- dplyr::filter(topTable, DE, logFC < 0)$gene_id

In order to detect genes which respond to DHT the Null Hypothesis (\(H_0\)) was specified to be a range around zero, instead of the conventional point-value.

\[ H_0: -\lambda \leq \mu \leq \lambda \\ \text{Vs.} \\ H_A: |\mu| > \lambda \]

This characterises \(H_0\) as being a range instead of being the singularity 0, with \(\mu\) representing the true mean logFC. The default value of \(\lambda = \log_2 1.2 = 0.263\) was chosen. This removes any requirement for post hoc filtering based on logFC. Using this approach, 856 genes were considered as DE to an FDR of 0.05. Of these, 384 were up-regulated with the remaining 472 being down-regulated.

topTable %>%
  ggplot(aes(logCPM, logFC)) +
  geom_point(
    aes(colour = DE),
    alpha = 0.5
  ) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>%
      dplyr::filter(logFC > 1.7 | logFC < -1.5),
    show.legend = FALSE
  ) +
  geom_smooth(se = FALSE) +
  scale_y_continuous(breaks = seq(-8, 8, by = 2)) +
  scale_colour_manual(
    values = c("grey50", "red")
  ) +
  theme(
    legend.position = "none"
  )

MA plot with blue curve through the data showing the GAM fit. The curve for low-expressed genes indicated a potential bias remains in the data, whilst this is not evident for highly expressed genes. DE genes are highlighted in red.

Version	Author	Date
5ef974f	Steve Ped	2021-08-20
f2de6cc	Steve Ped	2020-10-15

ggsave(
  here::here("docs/assets/plotMA.pdf"),
  width = 10, height = 6,
  units = "in"
)

topTable %>%
  ggplot(aes(logFC, -log10(PValue))) +
  geom_point(
    aes(colour = DE),
    alpha = 0.5
  ) +
  geom_text_repel(
    aes(label = gene_name, colour = DE),
    data = . %>%
      dplyr::filter(abs(logFC) > 1.7 | PValue < 5e-13),
    show.legend = FALSE
  ) +
  geom_vline(xintercept = c(-1,1), colour = "blue", linetype = 2) +
  labs(
    y = expression(paste(-log[10], "p"))
  ) +
  scale_x_continuous(breaks = seq(-8, 8, by = 2)) +
  scale_colour_manual(
    values = c("grey50", "red")
  ) +
  theme(
    legend.position = "none"
  )

Volcano plot showing significance against log fold-change. The common-use cutoff values for considering a gene as DE (|logFC| > 1) are shown in blue as a guide only as these were not used for this analysis.

Version	Author	Date
5ef974f	Steve Ped	2021-08-20
f2de6cc	Steve Ped	2020-10-15

ggsave(
  here::here("docs/assets/plotVolcano.pdf"),
  width = 10, height = 8,
  units = "in"
)

topTable %>%
  dplyr::filter(DE) %>%
  dplyr::select(-DE, -gc_content, -ave_tx_len, -contains("rank")) %>%
  mutate(
    across(c("logFC", "logCPM"), round, digits = 2)
  ) %>%
  datatable(
    rownames = FALSE, 
    caption = glue("*All genes were considered as DE using the criteria of an FDR-adjusted p-value < {alpha}.*")
  ) %>%
  formatSignif(
    columns = c("PValue", "FDR"),
    digits = 3
  )

Top DE Genes

logCPM values for the most up and down-regulated genes were also inspected using simple boxplots.

Top ranked up-regulated genes

Version	Author	Date
5f368f7	Steve Ped	2021-08-20
5ef974f	Steve Ped	2021-08-20
f2de6cc	Steve Ped	2020-10-15

Top ranked down-regulated genes

Version	Author	Date
5f368f7	Steve Ped	2021-08-20
5ef974f	Steve Ped	2021-08-20
f2de6cc	Steve Ped	2020-10-15

Checks For Bias

Using the sign of logFC to indicate direction, and -\(\log_{10} p\) as the ranking statistic, bias was also checked within the set of results, using GC content and gene length as potential sources of bias amongst the set of DE genes.

a <- topTable %>%
  ggplot(aes(signedRank, gc_content)) +
  geom_point(aes(colour = DE), alpha = 0.4) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = mean(topTable$gc_content)) +
  labs(
    x = "Rank", y = "% GC Content"
  ) +
  scale_colour_manual(values = c("grey50", "red")) + 
  theme(legend.position = "none")
b <- topTable %>%
    ggplot(aes(gc_content, stat(density))) +
    geom_density(colour = "grey50") +
    geom_vline(xintercept = mean(topTable$gc_content)) +
    coord_flip() + 
    theme(
      axis.text = element_blank(),
      axis.title = element_blank(),
      axis.ticks = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank()
    )
c <- topTable %>%
  ggplot(aes(signedRank, ave_tx_len)) +
  geom_point(aes(colour = DE), alpha = 0.4) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 10^mean(log10(topTable$ave_tx_len))) +
  labs(
    x = "Rank", y = "Gene Length (nt)"
  ) +
  scale_y_log10(label = comma) +
  scale_colour_manual(values = c("grey50", "red")) + 
  theme(legend.position = "none")
d <- topTable %>%
  ggplot(aes(ave_tx_len, stat(density))) +
  geom_density(colour = "grey50") +
  geom_vline(xintercept = 10^mean(log10(topTable$ave_tx_len))) +
  coord_flip() + 
  scale_x_log10() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )
plot_grid(
  plot_grid(ngsReports:::.emptyPlot(""), a, b, rel_widths = c(0.35, 9, 1), align = "h", nrow = 1),
  plot_grid(c, d, rel_widths = c(9, 1), align = "h"),
  nrow = 2,
  rel_heights = c(1, 1),
  labels = c("A", "B"),
  align = "v",
  axis = "l"
)

_Genes ranked for signed differential expression shown against A) GC Content, and B) Gene Length. Ranks were assigned based on the ranking statistic R = -sign(logFC)*log₁₀p, such that genes at the left are most down-regulated, whilst those at the right are the most up-regulated. Horizontal black lines indicate the overall average value, whilst blue curves indicate the localised GAM fit. A small positive bias was noted at both extremes for both GC content and Gene Length._

Version	Author	Date
f2de6cc	Steve Ped	2020-10-15

Data Export

These above results were exported as the file MFM-223_RNASeq.tsv.

topTable %>%
  mutate(
    entrezid = vapply(entrezid, str_flatten, character(1), collapse = ";")
  ) %>%
  write_tsv(
    file = here::here("output/MFM-223_RNASeq.tsv")
  )

In addition, the final DGEList was exported for easier integration with subsequent analyses, as were the CPM values for all detected genes

write_rds(
  dge,
  file = here::here("output/dge.rds")
)

2^logCPM %>% 
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  mutate(gene_name = dge$genes$gene_name) %>%
  dplyr::select(starts_with("gene"), everything()) %>%
  write_tsv("output/MFM-223_CPM.tsv")

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] splines   stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] DT_0.18                 ggrepel_0.9.1           cqn_1.36.0             
 [4] quantreg_5.85           SparseM_1.81            preprocessCore_1.52.1  
 [7] nor1mix_1.3-0           mclust_5.4.7            magrittr_2.0.1         
[10] ggfortify_0.4.11        cowplot_1.1.1           ensembldb_2.14.1       
[13] AnnotationFilter_1.14.0 GenomicFeatures_1.42.3  AnnotationDbi_1.52.0   
[16] Biobase_2.50.0          GenomicRanges_1.42.0    GenomeInfoDb_1.26.7    
[19] IRanges_2.24.1          S4Vectors_0.28.1        AnnotationHub_2.22.1   
[22] BiocFileCache_1.14.0    dbplyr_2.1.1            BiocGenerics_0.36.1    
[25] edgeR_3.32.1            limma_3.46.0            glue_1.4.2             
[28] pander_0.6.3            scales_1.1.1            yaml_2.2.1             
[31] forcats_0.5.1           stringr_1.4.0           dplyr_1.0.5            
[34] purrr_0.3.4             readr_1.4.0             tidyr_1.1.3            
[37] tibble_3.1.1            ggplot2_3.3.3           tidyverse_1.3.1        
[40] workflowr_1.6.2        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                  backports_1.2.1              
  [3] lazyeval_0.2.2                crosstalk_1.1.1              
  [5] BiocParallel_1.24.1           digest_0.6.27                
  [7] htmltools_0.5.1.1             fansi_0.4.2                  
  [9] memoise_2.0.0                 Biostrings_2.58.0            
 [11] modelr_0.1.8                  matrixStats_0.58.0           
 [13] askpass_1.1                   prettyunits_1.1.1            
 [15] colorspace_2.0-0              blob_1.2.1                   
 [17] rvest_1.0.0                   rappdirs_0.3.3               
 [19] haven_2.4.1                   xfun_0.22                    
 [21] crayon_1.4.1                  RCurl_1.98-1.3               
 [23] jsonlite_1.7.2                zoo_1.8-9                    
 [25] gtable_0.3.0                  zlibbioc_1.36.0              
 [27] XVector_0.30.0                MatrixModels_0.5-0           
 [29] DelayedArray_0.16.3           DBI_1.1.1                    
 [31] Rcpp_1.0.6                    viridisLite_0.4.0            
 [33] xtable_1.8-4                  progress_1.2.2               
 [35] bit_4.0.4                     htmlwidgets_1.5.3            
 [37] httr_1.4.2                    ellipsis_0.3.1               
 [39] farver_2.1.0                  pkgconfig_2.0.3              
 [41] XML_3.99-0.6                  sass_0.3.1                   
 [43] here_1.0.1                    locfit_1.5-9.4               
 [45] utf8_1.2.1                    labeling_0.4.2               
 [47] tidyselect_1.1.0              rlang_0.4.10                 
 [49] later_1.2.0                   munsell_0.5.0                
 [51] BiocVersion_3.12.0            cellranger_1.1.0             
 [53] tools_4.0.3                   cachem_1.0.4                 
 [55] cli_2.5.0                     generics_0.1.0               
 [57] RSQLite_2.2.7                 broom_0.7.6                  
 [59] ggdendro_0.1.22               evaluate_0.14                
 [61] fastmap_1.1.0                 knitr_1.33                   
 [63] bit64_4.0.5                   fs_1.5.0                     
 [65] nlme_3.1-149                  whisker_0.4                  
 [67] mime_0.10                     xml2_1.3.2                   
 [69] biomaRt_2.46.3                compiler_4.0.3               
 [71] rstudioapi_0.13               plotly_4.9.4                 
 [73] curl_4.3                      interactiveDisplayBase_1.28.0
 [75] reprex_2.0.0                  statmod_1.4.35               
 [77] bslib_0.2.4                   stringi_1.5.3                
 [79] highr_0.9                     lattice_0.20-41              
 [81] ProtGenerics_1.22.0           Matrix_1.2-18                
 [83] vctrs_0.3.7                   pillar_1.6.0                 
 [85] lifecycle_1.0.0               BiocManager_1.30.12          
 [87] jquerylib_0.1.4               data.table_1.14.0            
 [89] bitops_1.0-7                  conquer_1.0.2                
 [91] httpuv_1.6.0                  rtracklayer_1.50.0           
 [93] R6_2.5.0                      promises_1.2.0.1             
 [95] gridExtra_2.3                 MASS_7.3-53                  
 [97] ngsReports_1.8.1              assertthat_0.2.1             
 [99] SummarizedExperiment_1.20.0   openssl_1.4.3                
[101] rprojroot_2.0.2               withr_2.4.2                  
[103] GenomicAlignments_1.26.0      Rsamtools_2.6.0              
[105] GenomeInfoDbData_1.2.4        mgcv_1.8-35                  
[107] hms_1.0.0                     grid_4.0.3                   
[109] rmarkdown_2.7                 MatrixGenerics_1.2.1         
[111] git2r_0.28.0                  shiny_1.6.0                  
[113] lubridate_1.7.10

Differential Gene Expression

Stephen Pederson
Dame Roma Mitchell Cancer Research Laboratories
Adelaide Medical School
University of Adelaide

20 August, 2021

Setup

Gene Annotations

Sample Merging

Count Data

QC

Library Sizes

PCA: Pre Normalisation

Checks for GC and Length bias

Normalisation

Differential Expression Analysis

Main Results

Top DE Genes

Checks For Bias

Data Export

Differential Gene Expression

Stephen PedersonDame Roma Mitchell Cancer Research LaboratoriesAdelaide Medical SchoolUniversity of Adelaide

20 August, 2021

Setup

Gene Annotations

Sample Merging

Count Data

QC

Library Sizes

PCA: Pre Normalisation

Checks for GC and Length bias

Normalisation

Differential Expression Analysis

Main Results

Top DE Genes

Checks For Bias

Data Export

Stephen Pederson
Dame Roma Mitchell Cancer Research Laboratories
Adelaide Medical School
University of Adelaide