Microarray analysis
Prof. William Stafford Noble
GENOME 541
Intro to Computational Molecular Biology
Lecture schedule
Lecture 1: Microarray analysis
Lecture 2: Predicting protein function
Lecture 3: Protein identification from tandem mass spectra
Lecture 4: Motif discovery
Outline
Identifying differentially expressed genes
t-test
Analysis of variance
Multiple testing correction
Clustering
Supervised learning
Gene expression matrix
The matrix entry at (i, j) is the expression level of gene i in experiment j.
[Figure: expression matrix with genes as rows and experiments as columns]
Analysis tasks
Identify up- and down-regulated genes.
Find groups of genes with similar expression profiles.
Find groups of experiments (tissues) with similar expression profiles.
Find genes that explain observed differences among tissues (feature selection).
Two modes of analysis
Similarity of gene expression profiles → gene function → biological insight
Similarity of experimental expression profiles → tissue phenotype → disease diagnosis and prognosis
Identifying differentially expressed genes
Two-fold up-regulation
Early work simply identified genes that changed a lot.
Problems with this approach:
Only identifies most changed genes.
Also identifies noise and highly variable genes.
Ratio is unstable when the denominator is small.
Conclusion: Don’t do it!
Ratios are unstable
Initial measurements:
30/60 = 0.5
500/1000 = 0.5
Add random noise (+15 numerator and -15 denominator):
45/45 = 1.0
515/985 = 0.52
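The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the four intensities from the slide; the helper name `noisy_ratio` is made up for illustration.

```python
def noisy_ratio(numerator, denominator, noise):
    """Ratio after adding `noise` to the numerator and subtracting it from the denominator."""
    return (numerator + noise) / (denominator - noise)

# Same true ratio (0.5) at low and at high intensity, same absolute noise (15).
for num, den in [(30, 60), (500, 1000)]:
    before = num / den
    after = noisy_ratio(num, den, 15)
    print(f"{num}/{den}: {before:.2f} -> {after:.2f} ({after / before:.2f}-fold change)")
```

The low-intensity pair doubles its ratio while the high-intensity pair barely moves, which is why a fixed fold-change cutoff flags noisy, weakly expressed genes.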
Replication
Replication is the only reliable way to determine confidence.
Biological replicates provide more information than technical or sample replicates.
0.80 0.91 0.85 0.82 consistent
1.83 0.21 2.11 0.40 inconsistent
An example
Two-class metrics
Standard t-test
Welch’s approximation
Mann-Whitney non-parametric test
Selecting genes with a t-test
t = (μ_1 - μ_2) / √(v/n_1 + v/n_2)
μ_i = mean expression value in class i
n_i = number of examples in class i
v = pooled variance across both classes
http://mathworld.wolfram.com/Studentst-Distribution.html
Zar. Biostatistical Analysis. 1999.
Computing the t statistic
Observed gene expression values:
Treatment A: 0.45 0.57 1.02 0.97
Treatment B: 1.50 2.07 0.51 1.63
Mean:
mean (A) = 3.01 / 4 = 0.7525
mean (B) = 5.71 / 4 = 1.4275
Sum of squares:
SS (A) = (0.7525 - 0.45)^2 + (0.7525 - 0.57)^2 + (0.7525 - 1.02)^2 + (0.7525 - 0.97)^2 = 0.243675
SS (B) = … = 1.300875
Variance: SS / (n - 1)
var (A) = 0.243675 / 3 = 0.08123
var (B) = 1.300875 / 3 = 0.43363
Pooled variance: (SS1 + SS2) / (n1 + n2 - 2) = (0.243675 + 1.300875) / (4 + 4 - 2) = 0.2574
t statistic: |0.7525 - 1.4275| / √(0.2574/4 + 0.2574/4) = 1.8815
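The same calculation can be checked in code. This is a minimal sketch using only the standard library; the function names are illustrative and not taken from any particular package.

```python
from math import sqrt

a = [0.45, 0.57, 1.02, 0.97]  # Treatment A
b = [1.50, 2.07, 0.51, 1.63]  # Treatment B

def mean(xs):
    return sum(xs) / len(xs)

def sum_of_squares(xs):
    """Sum of squared deviations from the sample mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def pooled_t(xs, ys):
    """Student's t statistic using a pooled variance estimate."""
    pooled_var = (sum_of_squares(xs) + sum_of_squares(ys)) / (len(xs) + len(ys) - 2)
    return abs(mean(xs) - mean(ys)) / sqrt(pooled_var / len(xs) + pooled_var / len(ys))

print(f"SS(A) = {sum_of_squares(a):.6f}, SS(B) = {sum_of_squares(b):.6f}")
print(f"t = {pooled_t(a, b):.4f}")
```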
Computing a p-value
The p-value is the area under the t distribution to the right of the observed value of the t statistic (here, t = 1.8815).
What is the difference between a one-tailed and a two-tailed test?
Tails
Two-tailed: Do set A and set B come from different distributions?
One-tailed: Does set A come from a distribution with larger mean than set B?
This corresponds to finding differentially regulated genes versus finding up-regulated genes.
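To make the one-tailed versus two-tailed distinction concrete, the tail area can be computed by numerically integrating the t density. This is a sketch under stated assumptions: it uses t = 1.8815 and 6 degrees of freedom (n1 + n2 - 2) from the worked example, and Simpson's rule stands in for a statistics library's t distribution function.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Area to the right of t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    # Half of the distribution lies to the right of 0; subtract the part up to t.
    return 0.5 - total * h / 3

t_obs, df = 1.8815, 6
one_tailed = upper_tail(t_obs, df)
two_tailed = 2 * one_tailed  # both tails, by symmetry of the t distribution
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

The two-tailed p-value is exactly twice the one-tailed value, so a gene can pass a cutoff as "up-regulated" and still miss it as "differentially regulated".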
Types of tests
The standard t-test assumes the samples are drawn from normal distributions with equal variance and tests whether their means differ.
Welch’s t-test allows for different variances between classes.
Mann-Whitney (Wilcoxon) converts the data to ranks, and does not assume a particular distribution.
The permutation test computes the t-statistic for many random permutations of the labels.
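For comparison with the pooled t statistic, here is a minimal sketch of the Welch and Mann-Whitney statistics on the Treatment A and B values from the worked example. The function names are illustrative; converting either statistic to a p-value is left out.

```python
from math import sqrt

a = [0.45, 0.57, 1.02, 0.97]  # Treatment A
b = [1.50, 2.07, 0.51, 1.63]  # Treatment B

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def welch(xs, ys):
    """Welch's t statistic and its approximate (Welch-Satterthwaite) degrees of freedom."""
    ex, ey = variance(xs) / len(xs), variance(ys) / len(ys)
    t = abs(mean(xs) - mean(ys)) / sqrt(ex + ey)
    df = (ex + ey) ** 2 / (ex ** 2 / (len(xs) - 1) + ey ** 2 / (len(ys) - 1))
    return t, df

def mann_whitney_u(xs, ys):
    """Mann-Whitney U: the smaller of the two pairwise 'wins' counts (a tie counts 0.5)."""
    wins = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    return min(wins, len(xs) * len(ys) - wins)

welch_t, welch_df = welch(a, b)
u = mann_whitney_u(a, b)
print(f"Welch t = {welch_t:.4f} with about {welch_df:.2f} degrees of freedom; U = {u}")
```

With equal group sizes the Welch statistic equals the pooled one; only the degrees of freedom shrink, which is where the loss of power comes from.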
Permutation test
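A permutation test on the same eight values can be run exhaustively, because there are only 70 ways to split them into two groups of four. This is a sketch, not the lecture's implementation: it uses the absolute difference of means as the test statistic, which is an assumption made for brevity, and any other statistic can be passed in.

```python
from itertools import combinations

a = [0.45, 0.57, 1.02, 0.97]  # Treatment A
b = [1.50, 2.07, 0.51, 1.63]  # Treatment B

def abs_mean_diff(xs, ys):
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

def permutation_p_value(xs, ys, statistic=abs_mean_diff):
    """Exact two-sided permutation p-value over all relabelings of the pooled data."""
    pooled = xs + ys
    observed = statistic(xs, ys)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(xs)):
        chosen = set(idx)
        group1 = [pooled[i] for i in idx]
        group2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts itself.
        hits += statistic(group1, group2) >= observed - 1e-12
    return hits / total

p_perm = permutation_p_value(a, b)
print(f"permutation p-value = {p_perm:.3f}")
```

With only 70 relabelings the smallest achievable two-sided p-value is 2/70, so small p-values need larger groups and many more permutations.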
Cost-benefits analysis
The t-test assumes both samples are drawn from normal distributions with the same variance.
Welch’s approximation allows the samples to be drawn from different normals.
Mann-Whitney makes no assumption about the distribution.
The tests, as listed, yield decreasing power.
The permutation test gives the most flexibility in choosing a test statistic that reflects prior knowledge, but it can be computationally expensive for small p-values.
Analysis of variance
ANOVA
The t-test and its variants only work when there are two sample pools.
Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.
A simple experiment
Measure response to a drug treatment in two different mouse strains.
Repeat each measurement five times.
Total experiment = 2 strains * 2 treatments * 5 repetitions = 20 arrays
If you look for treatment effects using a t-test, then you ignore the strain effects.
ANOVA lingo
Factor: a variable that is under the control of the experimenter (strain, treatment).
Level: a possible value of a factor (drug, no drug).
Main effect: an effect that involves only one factor.
Interaction effect: an effect that involves two or more factors simultaneously.
Balanced design: an experiment in which each factor and level is measured an equal number of times.
Two-factor design
ANOVA model
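E_ijk = μ + T_i + S_j + (TS)_ij + ε_ijk, for i = 1, …, n; j = 1, …, m; k = 1, …, p
E_ijk is the measured expression level for treatment level i, strain level j and replicate k.
μ is the mean expression level of the gene.
T and S are main effects (treatment, strain) with n and m levels, respectively.
TS is an interaction effect.
p is the number of replicates per group.
ε represents random error (to be minimized).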
ANOVA steps
For each gene on the array
Fit the parameters T and S, minimizing ε.
Test T, S and TS for difference from zero, yielding three F statistics.
Convert the F statistics into p-values.
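For a single gene, the steps above can be sketched for a balanced two-factor design as follows. The data are made up, the function name is illustrative, and the last step (converting F statistics to p-values with the F distribution) is left out.

```python
def _mean(xs):
    return sum(xs) / len(xs)

def two_way_anova(cells):
    """F statistics for a balanced two-factor design.

    cells[i][j] is the list of replicates for treatment level i and strain
    level j; every cell must hold the same number of replicates.
    """
    n, m, p = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[_mean(cells[i][j]) for j in range(m)] for i in range(n)]
    t_mean = [_mean(cell_mean[i]) for i in range(n)]                         # treatment means
    s_mean = [_mean([cell_mean[i][j] for i in range(n)]) for j in range(m)]  # strain means
    grand = _mean(t_mean)

    ss_t = m * p * sum((t - grand) ** 2 for t in t_mean)
    ss_s = n * p * sum((s - grand) ** 2 for s in s_mean)
    ss_ts = p * sum((cell_mean[i][j] - t_mean[i] - s_mean[j] + grand) ** 2
                    for i in range(n) for j in range(m))
    ss_err = sum((x - cell_mean[i][j]) ** 2
                 for i in range(n) for j in range(m) for x in cells[i][j])

    ms_err = ss_err / (n * m * (p - 1))  # error mean square
    return {
        "treatment": (ss_t / (n - 1)) / ms_err,
        "strain": (ss_s / (m - 1)) / ms_err,
        "interaction": (ss_ts / ((n - 1) * (m - 1))) / ms_err,
    }

# Made-up data: the drug adds 2.0 in both strains; the strains do not differ.
noise = [-0.2, -0.1, 0.0, 0.1, 0.2]
cells = [[[1.0 + 2.0 * i + e for e in noise] for j in range(2)] for i in range(2)]
f_stats = two_way_anova(cells)
print({name: round(f, 2) for name, f in f_stats.items()})
```

In this toy example only the treatment F statistic is large, matching a gene that responds to the drug identically in both strains.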
ANOVA assumptions
For a given gene, the random error terms are independent, normally distributed and have uniform variance.
The main effects and their interactions are linear.
ANOVA output
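[Figure: ANOVA output. Example genes measured in strains A and B under vehicle and drug, each with a p-value, grouped into strain effects, treatment effects and interaction effects.]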
Multiple testing correction
This and some following slides are from http://compdiag.molgen.mpg.de/ngfn/docs/2004/mar/DifferentialGenes.pdf.
Multiple testing correction
On an array of 10,000 spots, a p-value of 0.0001 may not be significant.
Bonferroni correction: divide your p-value cutoff by the number of measurements.
For significance of 0.05 with 10,000 spots, you need a p-value of 5 × 10^-6.
Bonferroni is conservative, especially when genes are correlated, because it treats every test as if it were independent.
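The numbers on this slide follow from two one-line calculations. This is a minimal sketch; the function name is illustrative.

```python
def bonferroni_cutoff(alpha, num_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests

num_spots = 10_000
cutoff = bonferroni_cutoff(0.05, num_spots)
# Expected number of spots with p < 0.0001 purely by chance, if no gene has changed.
expected_by_chance = 0.0001 * num_spots

print(f"Bonferroni cutoff: {cutoff:.1e}")
print(f"Spots expected below p = 0.0001 by chance: {expected_by_chance:.0f}")
```

About one spot is expected to reach p = 0.0001 by chance alone, which is why that p-value may not be significant on an array of this size.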
Types of errors
False positive (Type I error): the experiment indicates that the gene has changed, but it actually has not.
False negative (Type II error): the gene has changed, but the experiment failed to indicate the change.
Typically, researchers are more concerned about false positives.
Without doing many (expensive) replicates, there will always be many false negatives.
False discovery rate
The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
False positive rate: percentage of non-differentially expressed genes that are flagged.
False discovery rate: percentage of flagged genes that are not differentially expressed.
Example counts: 5 FP, 13 TP, 33 TN, 5 FN
FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%
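The two rates differ only in their denominators, as this short sketch with the counts from the example shows. The function names are illustrative.

```python
def false_discovery_rate(fp, tp):
    """Fraction of flagged genes that are not differentially expressed."""
    return fp / (fp + tp)

def false_positive_rate(fp, tn):
    """Fraction of non-differentially expressed genes that are flagged."""
    return fp / (fp + tn)

fp, tp, tn, fn = 5, 13, 33, 5  # counts from the example
fdr = false_discovery_rate(fp, tp)
fpr = false_positive_rate(fp, tn)
print(f"FDR = {fdr:.1%}, FPR = {fpr:.1%}")
```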
Bonferroni vs. FDR
Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive.
FDR is the proportion of false positives among the genes that are flagged as differentially expressed.
Controlling the FDR
Order the unadjusted p-values: p_1 ≤ p_2 ≤ … ≤ p_m.
To control FDR at level α, find j* = max { j : p_j ≤ (j/m) α }.
Reject the null hypothesis for j = 1, …, j*.
This approach is conservative if many genes are differentially expressed.
(Benjamini & Hochberg, 1995)
FDR example
Choose the threshold so that, for all the genes above it, the p-value is no greater than the corresponding (jα)/m. In the table (m = 1000, α = 0.05), the threshold falls between ranks 8 and 9.
Approximately 5% of the examples above the line are expected to be false positives.
Rank    (jα)/m     p-value
1       0.00005    0.0000008
2       0.00010    0.0000012
3       0.00015    0.0000013
4       0.00020    0.0000056
5       0.00025    0.0000078
6       0.00030    0.0000235
7       0.00035    0.0000945
8       0.00040    0.0002450
9       0.00045    0.0004700
10      0.00050    0.0008900
…
1000    0.05000    1.0000000
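The procedure can be applied to the rows shown in the table. This is a sketch under stated assumptions: only the ten listed p-values are used, the 990 rows the table omits are not reconstructed, and m = 1000 is passed explicitly. The function name is illustrative.

```python
def benjamini_hochberg(sorted_p_values, alpha, m=None):
    """Return j*, the largest rank j with p_j <= j * alpha / m (0 if there is none).

    `sorted_p_values` must be in increasing order; `m` is the total number of
    tests and defaults to the number of p-values supplied.
    """
    m = len(sorted_p_values) if m is None else m
    j_star = 0
    for j, p in enumerate(sorted_p_values, start=1):
        if p <= j * alpha / m:
            j_star = j
    return j_star

# The ten smallest p-values from the table; the remaining rows are not listed.
top_ten = [0.0000008, 0.0000012, 0.0000013, 0.0000056, 0.0000078,
           0.0000235, 0.0000945, 0.0002450, 0.0004700, 0.0008900]
j_star = benjamini_hochberg(top_ten, alpha=0.05, m=1000)
print(f"Reject the null hypothesis for ranks 1 to {j_star} (among the listed rows)")
```

Among the listed rows the last rank that passes is 8, since at rank 9 the p-value 0.00047 exceeds the threshold 0.00045.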
False discovery rate
q-value
The p-value for a particular gene G is the probability that a randomly generated expression profile would be as or more extremely differentially expressed.
The q-value for a particular gene G is the proportion of false positives among all genes that are as or more extremely differentially expressed.
Equivalently, the q-value is the minimal FDR at which this gene appears significant.
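One simple way to estimate q-values is the Benjamini-Hochberg adjusted p-value, sketched below. This is an assumption made for illustration: it treats every gene as potentially null, whereas dedicated q-value software also estimates the proportion of true nulls. The p-values are made up.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q is the minimum over its own
    # rank and every larger rank; this keeps q monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

example_p = [0.001, 0.008, 0.039, 0.041, 0.60]  # made-up p-values
example_q = bh_q_values(example_p)
print([round(v, 5) for v in example_q])
```

The third and fourth genes get the same q-value (0.05125): calling the third gene significant at its own threshold would give a higher estimated FDR than also calling the fourth, so the smaller value is taken.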
[Figure: x-axis shows the number of examples above threshold]
Q-value software
http://faculty.washington.edu/~jstorey/qvalue/
http://noble.gs.washington.edu/proj/qvality
“A common mistake is to state that the p value is the probability a feature is a false positive. We stress that the q value is also not the probability that the feature is a false positive. In the example presented above MSH2 has a q value equal to 0.013. This value does not imply that MSH2 is a false positive with probability 0.013. Rather, 0.013 is the expected proportion of false positives incurred if we call MSH2 significant. Because the q-value measure includes genes that are possibly much more significant than MSH2, the probability that MSH2 is itself a false positive may be substantially higher.”
Summary
Individual measurements from microarray experiments are not trustworthy.
Replication or independent verification is the best way to gain confidence in a result.
For simple designs, use a t-test.
For complex designs, use ANOVA.
Correct for multiple comparisons using FDR and q-values.
Clustering expression data
Why cluster?
Place genes with similar expression profiles into clusters.
What is the gene’s function?
Place experiments / samples with similar expression profiles into clusters.
What is the expression profile of a particular disease or phenotype?
Fibroblast gene clustering
Cholesterol biosynthesis
Cell cycle
Immediate-early response
Signaling and angiogenesis
Wound healing and tissue remodeling
Iyer et al. “The transcriptional program in the response of human fibroblasts to serum.” Science. 283:83-7, 1999.
Soft tissue sarcoma clustering
Post hoc analysis
Select clusters.
Select ordering of genes for visualization.
Determine cluster labels.
Determine significance of clusters.
Comparison of clustering algorithms
Hierarchical clustering
Widely used for expression analysis.
Easy to understand.
Does not require the number of clusters a priori.
Difficult to implement well.
Requires post-processing.
Unstable.
Greediness can lock in early mistakes.
There is no reason to think that expression data is organized hierarchically.
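The greedy behaviour mentioned above is easy to see in a small agglomerative sketch. The choices here are assumptions for illustration only: average linkage, Euclidean distance, made-up profiles, and a `num_clusters` argument that plays the role of cutting the tree during post-processing.

```python
from math import dist

def average_linkage(profiles, num_clusters):
    """Greedy agglomerative clustering; returns clusters as lists of row indices."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        # Average Euclidean distance over all cross-cluster pairs.
        return sum(dist(profiles[i], profiles[j]) for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Greedy step: merge the closest pair. A merge is never undone later.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(b)  # b > a, so index a is unaffected by the pop
        clusters[a].extend(merged)
    return clusters

# Made-up expression profiles: four genes measured in three experiments.
profiles = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (2.0, 2.1, 1.9), (2.1, 2.0, 2.2)]
tree_clusters = average_linkage(profiles, num_clusters=2)
print(tree_clusters)
```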
Comparison of clustering algorithms
Self-organizing maps
Less widely used for expression analysis.
Difficult to understand.
Requires the number of clusters a priori.
Easy to implement.
Scales well.
Allows imposition of partial structure.
Stable.
Comparison of clustering algorithms
k-means
Less widely used for expression analysis.
Easy to understand.
Requires the number of clusters a priori.
Easy to implement.
Scales well.
Stable.
Creates unorganized clusters that are hard to interpret.
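A minimal k-means sketch shows why the method is easy to implement. The assumptions are stated here and in the comments: the first k points serve as initial centers (practical implementations usually initialize at random and restart several times), the iteration count is fixed, and the profiles are made up.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Lloyd's algorithm, using the first k points as initial centers."""
    centers = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignment = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, assignment) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

# Made-up expression profiles: four genes measured in three experiments.
points = [(0.1, 0.2, 0.1), (2.0, 2.1, 1.9), (0.2, 0.1, 0.2), (2.1, 2.0, 2.2)]
labels, centers = k_means(points, k=2)
print(labels)
```

The output is only a label per gene; nothing relates one cluster to another, which is the "unorganized" drawback noted above.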
Some other clustering algorithms
Bayesian clustering
Matrix tree incision
Spectral clustering
Superparamagnetic clustering
Various bi-clustering algorithms.
What clustering can’t do
Identify differentially regulated genes.
Account for complex experimental design.
Provide semantics for discovered clusters.
Say whether a particular group (pathway) of genes is differentially expressed.
Incorporate prior knowledge about relevant gene groups.
Supervised learning from microarray data
Supervised learning
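[Figure: supervised learning. A learner takes a training set (a genes-by-experiments matrix with class labels) and produces a model; a predictor applies the model to a test set to assign classes.]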
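The learner, model and predictor roles can be sketched with a deliberately simple classifier. This is a nearest-centroid rule used as a stand-in, not the SVM discussed later in the lecture, and the data, labels and function names are made up for illustration.

```python
from math import dist

def train(examples, labels):
    """Learner: build a model holding one mean profile (centroid) per class."""
    model = {}
    for cls in set(labels):
        rows = [x for x, y in zip(examples, labels) if y == cls]
        model[cls] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return model

def predict(model, example):
    """Predictor: assign the class whose centroid is closest to the example."""
    return min(model, key=lambda cls: dist(example, model[cls]))

# Made-up data: each row is one sample's expression profile over three genes.
train_x = [(0.1, 1.9, 0.2), (0.3, 2.1, 0.1), (2.0, 0.2, 1.8), (1.8, 0.1, 2.2)]
train_y = ["normal", "normal", "tumor", "tumor"]
test_x = [(0.2, 2.0, 0.3), (2.1, 0.3, 1.9)]
test_y = ["normal", "tumor"]

model = train(train_x, train_y)                      # fit on the training set only
predictions = [predict(model, x) for x in test_x]    # evaluate on held-out samples
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Keeping the test set out of `train` is the point of the diagram: accuracy measured on the training samples would overstate how well the model generalizes.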
Learning gene classes
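[Figure: learning gene classes. Same learner, model and predictor layout, with expression data over 79 experiments from Eisen et al. for both the training set and the test set, gene sets of 3500 and 2465 genes, and class labels from MYGD.]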
Class prediction
Predictions of gene function
Fleischer et al. “Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes” Genes Dev, 2006.
Overview
218 human tumor samples spanning 14 common tumor types
90 normal samples
16,063 “genes” measured per sample
Overall SVM classification accuracy: 78%.
Random classification accuracy: 1/14 ≈ 7%.
Cost/Benefits of SVMs
SVMs perform well in high-dimensional data sets with few examples.
Convex optimization implies that you get the same answer every time.
Kernel functions allow encoding of prior knowledge.
Kernel functions handle arbitrary data types.
The hyperplane does not provide a good explanation, especially with a non-linear kernel function.