Differential Gene Expression Analysis

DESeq2

# 
# Differential expression analysis with the DESeq2 package.
#
# https://bioconductor.org/packages/release/bioc/html/DESeq2.html
#

# Load the library.
suppressPackageStartupMessages(library(DESeq2))

# The name of the file that contains the counts.
counts_file = "counts.csv"

# The sample file is in CSV format and must have the headers "sample" and "condition".
design_file = "design.csv"
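# As a purely hypothetical illustration (the sample names are made up),
# design.csv could look like:
#
#   sample,condition
#   sampleA1,A
#   sampleA2,A
#   sampleA3,A
#   sampleB1,B
#   sampleB2,B
#   sampleB3,B
#
# counts.csv is expected to carry the feature names in its first column,
# followed by one count column per sample named above.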

# The final result file.
output_file = "results.csv"

# Read the sample file.
colData <- read.csv(design_file, stringsAsFactors=F)

# Turn conditions into factors.
colData$condition = factor(colData$condition)

# The first level should correspond to the first entry in the file!
# Required later when building a model.
colData$condition = relevel(colData$condition, toString(colData$condition[1]))

# Isolate the sample names.
sample_names <- colData$sample

# Read the count data from the counts file.
df <- read.csv(counts_file, header=TRUE, row.names=1 )

# Create rounded integers for the count data.
countData <- round(df[, sample_names])

# Other columns in the dataframe that are not sample information.
otherCols <- df[!(names(df) %in% sample_names)]

#
# Running DESeq2
#

# Create DESEq2 dataset.
dds <- DESeqDataSetFromMatrix(countData=countData, colData=colData, design = ~condition)

# Run the DESeq analysis.
dse <- DESeq(dds)

# Extract the results.
res <- results(dse)

#
# The rest of the code is about formatting the output dataframe.
#

# Turn the DESeq2 results into a data frame.
data <- cbind(otherCols, data.frame(res))

# Create the foldChange column.
data$foldChange <- 2 ^ data$log2FoldChange

# Rename columns to better reflect reality.
names(data)[names(data)=="pvalue"] <-"PValue"
names(data)[names(data)=="padj"] <- "FDR"

# Compute an adjusted p-value with the Hochberg method.
data$PAdj <- p.adjust(data$PValue, method="hochberg")

# Sort the data by PValue to compute false discovery counts.
data <- data[with(data, order(PValue, -foldChange)), ]

# Compute the false discovery counts on the sorted table.
data$falsePos = 1:nrow(data) * data$FDR

# Create placeholder columns; the actual values are computed below.
data$baseMeanA <- 1
data$baseMeanB <- 1

# Get the normalized counts.
normed <- counts(dse, normalized=TRUE)

# Round normalized counts to a single digit.
normed <- round(normed, 1)

# Merge the two datasets by row names.
total <- merge(data, normed, by=0)

# Sort again for output.
total <- total[with(total, order(PValue, -foldChange)), ]

# Sample names for condition A
col_names_A <- data.frame(split(colData, colData$condition)[1])[,1]

# Sample names for condition B
col_names_B = data.frame(split(colData, colData$condition)[2])[,1]
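
# Note on the split() idiom above: split() partitions the design table by
# condition level, and taking the first column of each piece extracts the
# sample names, assuming "sample" is the first column of the design file.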

# Create the individual baseMean columns.
total$baseMeanA = rowMeans(total[, col_names_A])
total$baseMeanB = rowMeans(total[, col_names_B])

# Bringing some sanity to numbers. Round columns to fewer digits.
total$foldChange = round(total$foldChange, 3)
total$log2FoldChange = round(total$log2FoldChange, 1)
total$baseMean = round(total$baseMean, 1)
total$baseMeanA = round(total$baseMeanA, 1)
total$baseMeanB = round(total$baseMeanB, 1)
total$lfcSE = round(total$lfcSE, 2)
total$stat = round(total$stat, 2)
total$FDR = round(total$FDR, 4)
total$falsePos = round(total$falsePos, 0)

# Reformat these columns as string.
total$PAdj <- formatC(total$PAdj, format = "e", digits = 1)
total$PValue <- formatC(total$PValue, format = "e", digits = 1)

# Rename the first column.
colnames(total)[1] <- "name"

# Reorganize columns names to make more sense.
new_cols <- c("name", names(otherCols), "baseMean","baseMeanA","baseMeanB","foldChange",
"log2FoldChange","lfcSE","stat","PValue","PAdj", "FDR","falsePos",col_names_A, col_names_B)

# Slice the dataframe with new columns.
total <- total[, new_cols]

# Write the results to the output file.
write.csv(total, file=output_file, row.names=FALSE, quote=FALSE)

# Inform the user.
print("# Tool: DESeq2")
print(paste("# Design: ", design_file))
print(paste("# Input: ", counts_file))
print(paste("# Output: ", output_file))

In DESeq2's output, padj is actually an FDR! In this document we use the convention that PAdj denotes multiple-testing correction with the Hochberg method, while FDR denotes correction with the Benjamini & Hochberg method. That is why the code above renames DESeq2's padj column to FDR and computes PAdj separately.
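
The difference is easy to demonstrate directly. Here is a minimal sketch, using made-up p-values, of the two correction methods as exposed by R's built-in p.adjust:

# Hochberg controls the familywise error rate; BH (Benjamini & Hochberg)
# controls the false discovery rate and is what DESeq2 reports as padj.
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042)
p.adjust(pvals, method = "hochberg")  # familywise-corrected p-values
p.adjust(pvals, method = "BH")        # FDR / q-values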

Output column descriptions

name - the feature identity. It must be unique within the column. It may be a gene name, transcript name, or exon - whatever feature we chose to quantify.
baseMean - the average normalized expression level across all samples. It measures how much total signal is present across both conditions.
baseMeanA - the average normalized expression level across the first condition. It measures how much signal is there for condition A.
baseMeanB - the average normalized expression level across the second condition. It measures how much signal is there for condition B.
foldChange - the ratio baseMeanB/baseMeanA. It is very important to remember that the fold change means B/A (second condition / first condition).
log2FoldChange - the base-2 logarithm of foldChange. Log 2 transformations are convenient because they put changes onto a uniform scale: a four-fold increase becomes 2, and a four-fold decrease (1/4) becomes -2. This property makes it much easier to compare the magnitudes of up and down changes.
PValue - the uncorrected p-value: the likelihood of observing an effect of size foldChange (or larger) by chance alone. This p-value is not corrected for multiple comparisons.
PAdj - the PValue corrected for multiple comparisons via the Hochberg method. It is the probability of having at least one false positive when accounting for all comparisons made. This value is usually overly conservative in genomics.
FDR - the False Discovery Rate: the fraction of false discoveries among all rows at and above the row where the value is listed. For example, if row 300 has an FDR of 0.05, it means that if you were to cut the table at this row and accept all genes at and above it as differentially expressed, then about 300 * 0.05 = 15 of the 300 genes are likely to be false positives. The values in this column are also called q-values.
falsePos - derived directly from FDR; the expected number of false positives in the rows above, computed as RowIndex * FDR to provide a direct interpretation of FDR (see the sketch after this list).
The remaining columns contain the normalized counts of the original count data, in this case three samples per condition.
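
To make the falsePos arithmetic concrete, here is a minimal sketch reproducing the example above:

# If row 300 of the table sorted by PValue has FDR = 0.05, accepting all
# genes at and above that row implies roughly 300 * 0.05 = 15 false positives.
row_index <- 300
fdr_value <- 0.05
row_index * fdr_value  # expected false discoveries: 15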

edgeR

#
# Differential expression analysis with the edgeR package.
#
# https://bioconductor.org/packages/release/bioc/html/edgeR.html
#

# Load the library
library(edgeR)

# The name of the file that contains the counts.
counts_file = "counts.csv"

# The sample file is in CSV format and must have the headers "sample" and "condition".
design_file = "design.csv"

# The final result file.
output_file = "results.csv"

# Read the sample file.
colData <- read.csv(design_file, stringsAsFactors=F)

# Turn conditions into factors.
colData$condition = factor(colData$condition)

# The first level should correspond to the first entry in the file!
# Required when building a model.
colData$condition = relevel(colData$condition, toString(colData$condition[1]))

# Isolate the sample names.
sample_names <- colData$sample

# Read the count data from the counts file.
df = read.csv(counts_file, header=TRUE, row.names=1 )

# Create rounded integers for the count data.
counts = round(df[, sample_names])

# Other columns in the dataframe that are not sample information.
otherCols = df[!(names(df) %in% sample_names)]

# Using the same naming as in the library.
group <- colData$condition

# Creates a DGEList object from a table of counts and group.
dge <- DGEList(counts=counts, group=group)

# Maximizes the negative binomial conditional common likelihood to estimate a common dispersion value across all genes.
dis <- estimateCommonDisp(dge)

# Estimates tagwise dispersion values by an empirical Bayes method based on weighted conditional maximum likelihood.
tag <- estimateTagwiseDisp(dis)
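
# Note: newer edgeR releases also provide estimateDisp(), which estimates the
# common and tagwise dispersions in one call and can replace the two steps above.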

# Compute genewise exact tests for differences in the means between the groups.
etx <- exactTest(tag)

# Extract all genes, ranked from most to least differentially expressed.
etp <- topTags(etx, n=nrow(counts))

# Compute the effective library sizes (library size times normalization factor).
scale = dge$samples$lib.size * dge$samples$norm.factors

# Get the normalized counts
normed = round(t(t(counts)/scale) * mean(scale))
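
# Note on the normalization above: dividing each column by its effective library
# size yields a CPM-like quantity; multiplying by mean(scale) rescales the values
# back to a count-like magnitude for easier comparison with the raw counts.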

# Merge the edgeR results with the other (non-sample) columns.
data = merge(otherCols, etp$table, by="row.names")

# Create column placeholders.
data$baseMean = 1
data$baseMeanA = 1
data$baseMeanB = 1
data$foldChange = 2 ^ data$logFC
data$falsePos = 1

# Rename the column.
names(data)[names(data)=="logFC"] <-"log2FoldChange"

# Compute the adjusted p-value with the Hochberg method.
data$PAdj = p.adjust(data$PValue, method="hochberg")

# Rename the first columns for consistency with other methods.
colnames(data)[1] <- "name"

# Create a merged output that contains the normalized counts.
total <- merge(data, normed, by.x='name', by.y="row.names")

# Sort the data for the output.
total = total[with(total, order(PValue, -foldChange)), ]

# Compute the false discovery counts on the sorted table.
total$falsePos = 1:nrow(total) * total$FDR

# Sample names for condition A
col_names_A = data.frame(split(colData, colData$condition)[1])[,1]

# Sample names for condition B
col_names_B = data.frame(split(colData, colData$condition)[2])[,1]

# Create the individual baseMean columns.
total$baseMeanA = rowMeans(total[, col_names_A])
total$baseMeanB = rowMeans(total[, col_names_B])
# The overall baseMean is the average of the two per-condition means.
total$baseMean = (total$baseMeanA + total$baseMeanB) / 2

# Round the numbers
total$foldChange = round(total$foldChange, 3)
total$FDR = round(total$FDR, 4)
total$PAdj = round(total$PAdj, 4)
total$logCPM = round(total$logCPM, 1)
total$log2FoldChange = round(total$log2FoldChange, 1)
total$baseMean = round(total$baseMean, 1)
total$baseMeanA = round(total$baseMeanA, 1)
total$baseMeanB = round(total$baseMeanB, 1)
total$falsePos = round(total$falsePos, 0)

# Reformat these columns as string.
total$PAdj = formatC(total$PAdj, format = "e", digits = 1)
total$PValue = formatC(total$PValue, format = "e", digits = 1)

# Reorganize columns names to make more sense.
new_cols = c("name", names(otherCols), "baseMean","baseMeanA","baseMeanB","foldChange",
"log2FoldChange","logCPM","PValue","PAdj", "FDR","falsePos",col_names_A, col_names_B)

# Slice the dataframe with new columns.
total = total[, new_cols]

# Write the results to the output file.
write.csv(total, file=output_file, row.names=FALSE, quote=FALSE)

# Inform the user.
print("# Tool: edgeR")
print(paste("# Design: ", design_file))
print(paste("# Input: ", counts_file))
print(paste("# Output: ", output_file))

Output column descriptions

The columns have the same meanings as in the DESeq2 output described above. One difference is that edgeR reports logCPM, the feature's average log2 counts-per-million, and does not produce the DESeq2-specific lfcSE and stat columns. The trailing columns again hold the normalized counts, here three samples per condition.
