Microarray Data Analysis
Stuart M. Brown
NYU School of Medicine
The Central Dogma of Molecular Biology
DNA is transcribed into RNA which is then translated into protein
Measured by Microarray
What is a Microarray
A simple concept: Dot Blot + Northern
Reverse the hybridization - put the probes on the filter and label the bulk RNA
Make probes for lots of genes - a massively parallel experiment
Make it tiny so you don’t need so much RNA from your experimental cells.
Make quantitative measurements
Microarrays are Popular
At NYU Med Center we are now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments)
PubMed search "microarray"= 13,948 papers
2005 = 4406
2004 = 3509
2003 = 2421
2002 = 1557
2001 = 834
2000 = 294
A Filter Array
DNA Chip Microarrays
Put a large number (~100K) of cDNA sequences or synthetic DNA oligomers onto a glass slide (or other substrate) in known locations on a grid.
Label an RNA sample and hybridize
Measure amounts of RNA bound to each square in the grid
Make comparisons
Cancerous vs. normal tissue
Treated vs. untreated
Time course
Many applications in both basic and clinical research
cDNA Microarray Technologies
Spot cloned cDNAs onto a glass microscope slide
usually PCR amplified segments of plasmids
Label 2 RNA samples with 2 different colors of fluorescent dye - control vs. experimental
Mix two labeled RNAs and hybridize to the chip
Make two scans - one for each color
Combine the images to calculate ratios of amounts of each RNA that bind to each spot
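The ratio step for a two-color chip can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular scanner software: the function name, the constant background values and the floor of 1.0 are all assumptions made for the example.

```python
import math

def log2_ratios(red, green, red_bg=0.0, green_bg=0.0, floor=1.0):
    """Return log2(experimental / control) for each spot.

    red, green: per-spot intensities from the two scans, in the same spot order.
    red_bg, green_bg: background levels to subtract (hypothetical constants).
    floor: smallest value allowed after background subtraction, so that
           the logarithm is always defined.
    """
    ratios = []
    for r, g in zip(red, green):
        r_adj = max(r - red_bg, floor)
        g_adj = max(g - green_bg, floor)
        ratios.append(math.log2(r_adj / g_adj))
    return ratios

# Three made-up spots: 4-fold up, unchanged, 2-fold down.
spots = log2_ratios([4100, 600, 350], [1100, 600, 600], red_bg=100, green_bg=100)
print(spots)  # [2.0, 0.0, -1.0]
```

Working on a log2 scale makes a 2-fold increase (+1) and a 2-fold decrease (-1) symmetric around zero.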
Spot your own Chip
(plans available for free from Pat Brown’s website)
Robot spotter
Ordinary glass microscope slide
Combine scans for Red & Green
False color image is made from digitized fluorescence data, not by superimposing scanned images
cDNA Spotted Microarrays
Affymetrix “Gene chip” system
Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
RNA labeled and scanned in a single “color”
one sample per chip
Can have as many as 20,000 genes on a chip
Arrays get smaller every year (more genes)
Chips are expensive
Proprietary system: “black box” software, can only use their chips
Affymetrix Gene Chip
Affymetrix Technology
Affymetrix Pivot Table
Data Acquisition
Scan the arrays
Quantitate each spot
Subtract background
Normalize
Export a table of fluorescent intensities for each gene in the array
Automate!!
All of this can be done automatically by software.
Much more consistent
Mistakes will be made (especially in the spot quantitation) but you can’t manually check hundreds of thousands of spots
Affymetrix Software
Affymetrix System is totally automated
Computes a single value for each gene from 40 probes - (using surprisingly kludgy math)
Highly reproducible
(re-scan of same chip or hyb. of duplicate chips with same labeled sample gives very similar results)
Incorporates false results due to image artefacts
dust, bubbles
pixel spillover from bright spot to neighboring dark spots
Goals of a Microarray Experiment
Find the genes that change expression between experimental and control samples
Classify samples based on a gene expression profile
Find patterns: Groups of biologically related genes that change expression together across samples/treatments
Basic Data Analysis
Fold change (relative increase or decrease in intensity for each gene)
Set cutoff filter for low values
(background +noise)
Cluster genes by similar changes - only really meaningful across multiple treatments or time points
Cluster samples by similar gene expression profiles
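A fold-change calculation with a low-value cutoff might look like the sketch below. The cutoff of 100, the gene names and the signed convention (negative numbers for decreases) are assumptions chosen for illustration only.

```python
def fold_changes(control, treated, min_intensity=100.0):
    """Map gene -> signed fold change, skipping genes near background.

    control, treated: dicts of gene name -> normalized intensity.
    min_intensity: hypothetical cutoff; a gene is dropped when both
                   values fall below it (background + noise).
    """
    result = {}
    for gene, c in control.items():
        t = treated[gene]
        if c < min_intensity and t < min_intensity:
            continue  # too close to background to trust the ratio
        # Clamp to the cutoff so a near-zero value cannot inflate the ratio.
        c, t = max(c, min_intensity), max(t, min_intensity)
        result[gene] = t / c if t >= c else -(c / t)
    return result

control = {"geneA": 200.0, "geneB": 1600.0, "geneC": 40.0, "geneD": 50.0}
treated = {"geneA": 800.0, "geneB": 400.0, "geneC": 60.0, "geneD": 400.0}
print(fold_changes(control, treated))
# {'geneA': 4.0, 'geneB': -4.0, 'geneD': 4.0}
```

geneC is dropped because both values sit below the cutoff; geneD is kept, but its control value is raised to the cutoff before the ratio is taken.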
Streamlined Affy Analysis
Raw data
Normalize (RMA)
Filter
•Present/Absent
•Minimum value
•Fold change
Significance
•t-test
•SAM
•Rank Product
Classification
•PAM
•Machine learning
Clustering
Gene lists
Sources of Variability
Image analysis (identifying and quantitating each spot on the array)
Scanning (laser and detector, chemistry of the fluorescent label)
Hybridization (temperature, time, mixing, etc.)
Probe labeling
RNA extraction
Biological variability
Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2-fold and 3-fold.
Thomas Hudson, Montreal Genome Center
Normalization
Can control for many of the experimental sources of variability (systematic, not random or gene specific)
Bring each image to the same average brightness
Can use simple math or fancy -
divide by the mean (whole chip or by sectors)
LOESS (locally weighted regression)
No sure biological standards
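The "simple math" option, bringing every chip to the same average brightness, can be sketched as follows. The function name and the choice of target (the mean of the per-chip means) are assumptions; sector-wise scaling or LOESS would need more machinery than this.

```python
from statistics import fmean

def scale_to_common_mean(arrays, target=None):
    """Rescale each array so that all arrays share the same mean intensity.

    arrays: dict of array name -> list of intensities (same gene order).
    target: desired mean; defaults to the mean of the per-array means.
    """
    means = {name: fmean(values) for name, values in arrays.items()}
    if target is None:
        target = fmean(means.values())
    return {
        name: [v * target / means[name] for v in values]
        for name, values in arrays.items()
    }

# chip2 is uniformly twice as bright as chip1, a purely systematic difference.
arrays = {"chip1": [100.0, 200.0, 300.0], "chip2": [200.0, 400.0, 600.0]}
scaled = scale_to_common_mean(arrays)
print(scaled["chip1"], scaled["chip2"])
```

After scaling, the two made-up chips carry identical values, because the only difference between them was overall brightness.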
RMA
Robust Multichip Average
Bolstad, B.M., Irizarry R. A., Astrand, M., and Speed, T.P. (2003), A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193
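RMA is generally described as combining a background correction, quantile normalization across chips and a robust probe-level summary. The sketch below covers only the quantile normalization step, in plain Python, and breaks ties by position, which is a simplification. It is an illustration of the idea, not the Bioconductor implementation.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list per chip.
    Each value is replaced by the mean, across chips, of the values
    holding the same rank. Ties are broken by position (a simplification).
    """
    n_genes = len(arrays[0])
    # Indices of each chip's values, ordered from smallest to largest.
    orders = [sorted(range(n_genes), key=chip.__getitem__) for chip in arrays]
    # Reference distribution: mean of the k-th smallest value over all chips.
    reference = [
        sum(chip[order[k]] for chip, order in zip(arrays, orders)) / len(arrays)
        for k in range(n_genes)
    ]
    normalized = [[0.0] * n_genes for _ in arrays]
    for chip_idx, order in enumerate(orders):
        for rank, gene_idx in enumerate(order):
            normalized[chip_idx][gene_idx] = reference[rank]
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Each chip keeps its own gene ranking, but the set of values on every chip becomes identical.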
Are the Treatments Different?
Analysis of microarray data has tended to focus on making lists of genes that are up or down regulated between treatments
Before making these lists, ask the question:
"Are the treatments different?"
Use standard statistical methods to evaluate expression profiles for each treatment (t-test or F-test)
If there are differences, find the genes most responsible
If there are not significant overall differences, then lists of genes with large fold changes may only reflect random variability.
Statistics
When you have variability in measurements, you need replication and statistics to find real differences
It’s not just the genes with 2 fold increase, but those with a significant p-value across replicates
Non-parametric (e.g. rank-based) or paired-value statistics may be more appropriate
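One non-parametric option for a single gene is an exact permutation test on the difference in group means. The sketch below is an assumption-laden illustration (made-up log2 intensities, 4 replicates per group), not the procedure of any specific package.

```python
from itertools import combinations
from statistics import fmean

def permutation_p_value(control, treated):
    """Two-sided exact permutation p-value for a difference in means.

    Every way of relabelling the pooled replicates is enumerated, which
    is only practical for the handful of replicates typical of an array study.
    """
    pooled = list(control) + list(treated)
    n_treated = len(treated)
    observed = abs(fmean(treated) - fmean(control))
    total = sum(pooled)
    hits = count = 0
    for idx in combinations(range(len(pooled)), n_treated):
        t_sum = sum(pooled[i] for i in idx)
        t_mean = t_sum / n_treated
        c_mean = (total - t_sum) / (len(pooled) - n_treated)
        count += 1
        # Small tolerance guards against floating-point round-off.
        if abs(t_mean - c_mean) >= observed - 1e-12:
            hits += 1
    return hits / count

# Hypothetical log2 intensities for one gene, 4 replicates per group.
p = permutation_p_value([7.1, 7.3, 6.9, 7.0], [8.2, 8.5, 8.1, 8.4])
print(round(p, 4))  # 0.0286
```

With 4 replicates per group there are only 70 possible relabellings, so the smallest achievable two-sided p-value is 2/70. That is one concrete reason replication matters.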
Multiple Comparisons
In a microarray experiment, each gene (each probe or probe set) is really a separate experiment
Yet if you treat each gene as an independent comparison, you will always find some with significant differences
(the tails of a normal distribution)
False Discovery
Statisticians call a false positive a "Type I error" or a "false discovery"
The expected number of false positives is roughly the p-value cutoff of the t-test X the number of genes in the array
The False Discovery Rate (FDR) is the fraction of the genes called "different" that are false positives
For a p-value cutoff of 0.01 X 10,000 genes
= 100 false “different” genes
You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p=0.001)
The number of false discoveries must be smaller than the number of real differences that you find - which in turn depends on the size of the differences and variability of the measured expression values
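The arithmetic on this slide, plus one widely used way of controlling the false discovery rate, can be sketched as follows. The Benjamini-Hochberg adjustment is not named in these slides; it is included here as a common choice, and the four p-values are made up.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Expected number of genes passing the cutoff when nothing truly differs."""
    return p_cutoff * n_genes

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(expected_false_positives(0.01, 10_000))   # about 100 genes
print(expected_false_positives(0.001, 10_000))  # about 10 genes
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Keeping genes whose adjusted value is below, say, 0.05 aims to hold the expected share of false positives in the final list to about 5%, rather than fixing a per-gene cutoff.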
SAM
Significance Analysis of Microarrays
Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001 98: 5116-5121, (Apr 24).
Excel plugin
Free
Permutation based
Most published method of microarray data analysis
Higher Level
Microarray data analysis
Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated genes
Meta-studies using data from multiple experiments
Types of Clustering
Hierarchical
Link similar genes, build up to a tree of all
Self Organizing Maps (SOM)
Split all genes into similar sub-groups
Finds its own groups (machine learning)
Principal Component
every gene is a dimension (vector), find a single dimension that best represents the differences in the data
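The hierarchical approach (link similar genes, build up toward a tree) can be sketched in plain Python. The choices here are assumptions made for illustration: distance is 1 minus the Pearson correlation, linkage is the average over all gene pairs, and merging stops at a requested number of groups instead of building the full tree. A gene with a perfectly flat profile would cause a division by zero and should be filtered out first.

```python
import math
from statistics import fmean

def pearson_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / norm

def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters groups.

    profiles: dict of gene name -> list of expression values across samples.
    """
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        # Mean distance over every pair of genes drawn from the two clusters.
        return fmean(
            pearson_distance(profiles[g1], profiles[g2]) for g1 in c1 for g2 in c2
        )

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up time courses: g1 and g2 rise together, g3 and g4 fall together.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
result = hierarchical_clusters(profiles, 2)
print(result)
```

Correlation-based distance groups genes by the shape of their profile, so g2 joins g1 even though its absolute values are about twice as large.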
Cluster by color difference
GeneSpring
SOM Clusters
Classification
How to sort samples into two classes based on gene expression data
Cancer vs. normal
Cancer sub-types
(benign vs. malignant)
Responds well to drug vs. poor response
(e.g. tamoxifen for breast cancer)
Support Vector Machines
Fat planes: With an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one.
Again if a large margin separation exists, chances are good that we found something relevant.
Large Margin Classifiers
PAM: Prediction Analysis for Microarrays
Class Prediction and Survival Analysis for Genomic Expression Data Mining
Performs sample classification from gene expression data,
via "nearest shrunken centroid method" of Tibshirani, Hastie, Narasimhan and Chu (2002):
"Diagnosis of multiple cancer types by shrunken centroids of gene expression"
PNAS 2002 99:6567-6572 (May 14).
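The core idea of a shrunken centroid classifier can be sketched as below. This is a deliberately simplified version and not the PAM software: it soft-thresholds the raw difference between each class centroid and the overall centroid, and it omits the per-gene standardization used in the published method. The threshold delta, the sample values and the class labels are made up.

```python
from statistics import fmean

def soft_threshold(value, delta):
    """Move value toward zero by delta, stopping at zero."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0

def shrunken_centroids(samples, labels, delta):
    """Class centroids whose offsets from the overall centroid are shrunk by delta."""
    n_genes = len(samples[0])
    overall = [fmean(s[g] for s in samples) for g in range(n_genes)]
    centroids = {}
    for cls in sorted(set(labels)):
        members = [s for s, lab in zip(samples, labels) if lab == cls]
        centroids[cls] = [
            overall[g] + soft_threshold(fmean(m[g] for m in members) - overall[g], delta)
            for g in range(n_genes)
        ]
    return centroids

def classify(sample, centroids):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda cls: sum((x - c) ** 2 for x, c in zip(sample, centroids[cls])),
    )

# Two genes per sample: gene 0 separates the classes, gene 1 is noise.
samples = [[1.0, 5.0], [1.2, 5.2], [3.0, 5.1], [3.2, 4.9]]
labels = ["normal", "normal", "cancer", "cancer"]
centroids = shrunken_centroids(samples, labels, delta=0.2)
print(centroids)
print(classify([2.8, 5.3], centroids))  # cancer
```

Shrinking pulls the noisy gene's class centroids back onto the overall centroid, so that gene stops contributing to the class decision. This is how the approach selects a small set of informative genes.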
BioConductor
All of these normalization, statistical, and clustering methods are available in a free software package called BioConductor.
www.bioconductor.org
User hostile command line interface
Uses scripts in the `R` statistical language
> # Assumed: the SpikeIn example data, pm() and mm() come from the affy package
> library(affy)
> data(SpikeIn)
> # Perfect-match (PM) and mismatch (MM) probe intensities
> pms <- pm(SpikeIn)
> mms <- mm(SpikeIn)
> par(mfrow = c(1, 2))
> # R is case-sensitive: the object is SpikeIn, the accessor is sampleNames()
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20,
+ 12, byrow = TRUE)
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30,
+ 20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30,
+ 20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)
Functional Genomics
Take a list of "interesting" genes and find their biological relationships
Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods
Requires a reference set of "biological knowledge"
Gene Ontology
How to organize biological knowledge?
Biologists work on a variety of different research organisms: yeast, fruit fly, mouse, … human
the same gene can have very different functions (antennapedia)
and very different names
(sonic hedgehog…)
GO
Biologists got together a few years ago and developed a sensible system called Gene Ontology (GO)
3 hierarchical sets of terminology
Biological Process
Cellular Component (location within cell)
Molecular Function
about 1000 categories of functions
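One common way to relate a gene list to a functional category is to ask whether the category turns up in the list more often than chance would predict. The sketch below uses a one-sided hypergeometric test for this. The test itself is not named in these slides, and all the counts in the example are made up.

```python
from math import comb

def enrichment_p_value(n_universe, n_category, n_list, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: genes on the array that carry any annotation
    n_category: of those, genes annotated to the category of interest
    n_list:     genes in the "interesting" list
    n_overlap:  genes in the list that belong to the category
    """
    total = comb(n_universe, n_list)
    upper = min(n_category, n_list)
    tail = sum(
        comb(n_category, k) * comb(n_universe - n_category, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10,000 annotated genes, 100 in the category,
# 50 genes in the list, 5 of which fall in the category (0.5 expected).
p_enrich = enrichment_p_value(10_000, 100, 50, 5)
print(p_enrich)
```

Because many categories are tested at once, the resulting p-values face the same multiple-comparison problem as the per-gene tests.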
Biological Pathways
Microarray Databases
Large experiments may have hundreds of individual array hybridizations
Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments
Data-mining - look for similar patterns of gene expression across different experiments
Public Databases
Gene Expression data is an essential aspect of annotating the genome
Publication and data exchange for microarray experiments
Data mining/Meta-studies
Common data format - XML
MIAME (Minimum Information About a Microarray Experiment)
GEO at the NCBI
ArrayExpress at EMBL-EBI
Gene Expression
Technologies
cDNA (EST) libraries
SAGE
Microarray
RT-PCR
RNA-seq
The Cancer Genome Anatomy Project
CGAP has collected a large amount of cDNA and related data online
http://cgap.nci.nih.gov/
cDNA libraries from various tissues
search for genes
compare expression levels
SAGE
Serial Analysis of Gene Expression is a technology that sequences very short fragments of mRNA (10 or 17 bp) that have been randomly ligated together
The short ‘tags’ are assigned to genes and then relative counts for each gene are computed for cDNA libraries from various tissues
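Assigning tags to genes and computing relative counts can be sketched as follows. The tag sequences, the gene names and the lookup table are hypothetical. Scaling to tags per million is one common way to make libraries of different sizes comparable.

```python
from collections import Counter

def gene_counts_per_million(tags, tag_to_gene):
    """Count SAGE tags per gene and scale to tags per million.

    tags: iterable of tag sequences observed in one library.
    tag_to_gene: hypothetical lookup table from tag sequence to gene name;
                 tags missing from it are counted under "unassigned".
    """
    counts = Counter(tag_to_gene.get(tag, "unassigned") for tag in tags)
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

# A made-up library of ten 10-bp tags.
tag_to_gene = {"AAACCCGGGT": "geneA", "TTTGGGCCCA": "geneB"}
library = ["AAACCCGGGT"] * 6 + ["TTTGGGCCCA"] * 3 + ["CCCCCCCCCC"]
print(gene_counts_per_million(library, tag_to_gene))
# {'geneA': 600000.0, 'geneB': 300000.0, 'unassigned': 100000.0}
```

Comparing these scaled counts between libraries from different tissues gives the kind of "digital" expression comparison the SAGE tools provide.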
SAGE Genie
SAGE Anatomic Viewer
SAGE Digital Gene Expression Displayer
Digital Northern
SAGE Experiment Viewer
Microarray
GEO database at NCBI
Microarray experiments
Defined arrays
Published results
Also lots of inconclusive experiments
Tools to search for specific genes
Unreliable to search for tissue or disease in experiment description text
RNA-seq
Next Generation DNA sequencing
NYU currently has one Illumina Genome Analyzer
generates more than 1 million RNA sequences per sample
Currently seeking funding for a Roche/454
produces 100K reads of 250-400 bp
Count Transcripts
Technology exists to accurately count transcripts and compare samples
“Digital Gene Expression”
Can also identify alternate isoforms, splice variants, etc.