Microarray: cancer classification

Bioinformatics & Machine Learning
Yi, Jia
Oct. 17th, 2005
Contents
Background
Application of Microarray in Cancer Classification
Comparison of Discrimination Methods to Classify Tumors
SVM Algorithm in Multiclass Cancer Diagnosis
Microarray Techniques
DNA microarrays, microscopic arrays of large sets of DNA sequences immobilized on solid substrates, are valuable tools in areas of research that require the identification or quantitation of many specific DNA sequences in complex nucleic acid samples.
In the last decade it has become common in many model systems to sequence large numbers of cDNAs from an organism.
DNA microarrays are perfectly suited for comparing gene expression in different populations of cells.
Typical DNA Microarray Experiment
Current Cancer Diagnosis
A reliable and precise classification of tumors is essential for successful treatment of cancer.
Current methods rely on the subjective interpretation of both clinical and histopathological information, with an eye toward placing tumors in currently accepted categories based on the tissue of origin of the tumor.
However, clinical information can be misleading or incomplete.
There is a wide spectrum in cancer morphology, and many tumors are atypical or lack morphologic features, which results in diagnostic confusion.
DNA Microarray-based Cancer Diagnosis
Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification, but these tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
Recently, DNA microarray tumor gene expression profiles have been used for cancer diagnosis.
By allowing the monitoring of expression levels for thousands of genes simultaneously, such techniques will lead to a more complete understanding of the molecular variations among tumors, hence to a finer and more reliable classification.
Tumor Classification Types
There are three main types of statistical problems associated with tumor classification:
The identification of new tumor classes using gene expression profiles --- unsupervised learning.
The classification of malignancies into known classes --- supervised learning.
The identification of “marker” genes that characterize the different tumor classes --- variable selection.
In this presentation, two experiments focusing on the second type of problem will be introduced.
Experiment 1: Comparison of Discrimination Methods
In this experiment, we compare different discrimination methods including:
Fisher linear discriminant analysis
Maximum likelihood discriminant rules
Nearest neighbors
Classification trees
Aggregating classification
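Some Definitions
For classification purposes, gene expression data on p genes for n mRNA samples may be summarized by an $n \times p$ matrix $X = (x_{ij})$, where $x_{ij}$ denotes the expression level of gene $j$ in mRNA sample $i$.
When the mRNA samples belong to known classes, the data for each observation consist of a gene expression profile $x_i = (x_{i1}, \ldots, x_{ip})$ and a class label $y_i \in \{1, \ldots, K\}$.
A predictor or classifier for K tumor classes partitions the space $\mathcal{X}$ of gene expression profiles into K disjoint subsets $A_1, \ldots, A_K$, such that a sample whose profile falls in $A_k$ is predicted to belong to class $k$.
Learning set $L = \{(x_1, y_1), \ldots, (x_{n_L}, y_{n_L})\}$, used to build the classifier; test set $T$, used to assess its predictions.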
Fisher Linear Discriminant Analysis
FLDA is based on finding linear combinations $xa$ of the gene expression levels $x = (x_1, \ldots, x_p)$ with large ratios of between-groups to within-groups sum of squares, given by $a^{\top} B a / a^{\top} W a$, where $B$ and $W$ denote respectively the $p \times p$ matrices of between-groups and within-groups sum of squares.
The matrix $W^{-1}B$ has at most $s = \min(K-1, p)$ non-zero eigenvalues, $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_s$, with corresponding linearly independent eigenvectors $v_1, v_2, \ldots, v_s$; in particular, $v_1$ maximizes $a^{\top} B a / a^{\top} W a$.
Fisher Linear Discriminant Analysis (cont.)
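For an observation $x = (x_1, \ldots, x_p)$, let
$$d_k(x) = \sum_{l=1}^{s} \left( (x - \bar{x}_k)\, v_l \right)^2$$
denote its squared Euclidean distance, in terms of the discriminant variables, from $\bar{x}_k$, where $\bar{x}_k$ is the class k average for the learning set L.
So the predicted class for observation x is
$$C(x) = \arg\min_{k} \, d_k(x)$$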
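The FLDA rule can be illustrated with a minimal Python sketch. It assumes the simplest special case, two classes and two genes, where s = 1 and the single discriminant vector is proportional to $W^{-1}(\bar{x}_1 - \bar{x}_2)$, so no general eigen-solver is needed. The function names and the toy expression values are hypothetical and do not come from the datasets used in the experiment.

```python
def mean_vector(rows):
    """Per-gene average of a list of expression profiles."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def within_scatter(groups):
    """Within-groups sum of squares matrix W (2 x 2 in this sketch)."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    w[a][b] += d[a] * d[b]
    return w


def flda_direction(group1, group2):
    """Discriminant vector v1, proportional to W^{-1}(mean1 - mean2)."""
    w = within_scatter([group1, group2])
    m1, m2 = mean_vector(group1), mean_vector(group2)
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    # Explicit inverse of the 2 x 2 matrix W applied to diff.
    return [(w[1][1] * diff[0] - w[0][1] * diff[1]) / det,
            (-w[1][0] * diff[0] + w[0][0] * diff[1]) / det]


def flda_predict(x, group1, group2):
    """Return 1 or 2: the class whose mean is closest to x along v1."""
    v = flda_direction(group1, group2)

    def project(r):
        return r[0] * v[0] + r[1] * v[1]

    d1 = (project(x) - project(mean_vector(group1))) ** 2
    d2 = (project(x) - project(mean_vector(group2))) ** 2
    return 1 if d1 <= d2 else 2


# Toy learning set (made-up values): two genes, four samples per class.
group1 = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]]
group2 = [[5.0, 6.0], [6.0, 5.0], [6.0, 7.0], [7.0, 6.0]]
print(flda_predict([2.5, 2.0], group1, group2))
print(flda_predict([6.0, 5.5], group1, group2))
```

With thousands of genes and few samples, W is singular or badly estimated, so a practical implementation would need gene selection or regularization before this step.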
Maximum Likelihood Discriminant Rules
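When the class conditional densities pr(x|y=k) are known, the maximum likelihood (ML) discriminant rule predicts the class of an observation x by
$$C(x) = \arg\max_{k} \, \mathrm{pr}(x \mid y = k)$$
In this case there is no need for a learning set.
In practice, the parameters of the densities must be estimated from a learning set. For multivariate normal class densities, $x \mid y = k \sim N(\mu_k, \Sigma_k)$, the rule is
$$C(x) = \arg\min_{k} \left\{ (x - \mu_k)\, \Sigma_k^{-1} (x - \mu_k)^{\top} + \log |\Sigma_k| \right\}$$
with $\mu_k$ and $\Sigma_k$ replaced by their sample estimates.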
Maximum Likelihood Discriminant Rules (cont.)
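The quadratic discriminant rule for normal class densities includes some special cases:
Case 1: when the class densities have the same covariance matrix, $\Sigma_k = \Sigma$, then
$$C(x) = \arg\min_{k} \, (x - \mu_k)\, \Sigma^{-1} (x - \mu_k)^{\top}$$
Case 2: when the class densities have diagonal covariance matrices, $\Delta_k = \mathrm{diag}(\sigma_{k1}^2, \ldots, \sigma_{kp}^2)$, then
$$C(x) = \arg\min_{k} \sum_{j=1}^{p} \left\{ \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log \sigma_{kj}^2 \right\}$$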
Maximum Likelihood Discriminant Rules (cont.)
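Case 3: when the class densities have the same diagonal covariance matrix, $\Delta = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$, then
$$C(x) = \arg\min_{k} \sum_{j=1}^{p} \frac{(x_j - \mu_{kj})^2}{\sigma_j^2}$$
In this experiment, cases 2 (diagonal covariance) and 3 (common diagonal covariance) are referred to as diagonal quadratic (DQDA) and diagonal linear (DLDA) discriminant analysis, respectively.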
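A short Python sketch of the DLDA rule (common diagonal covariance) may make the estimation step concrete. It assigns x to the class minimizing $\sum_j (x_j - \bar{x}_{kj})^2 / s_j^2$, with class means and per-gene variances estimated from the learning set. The pooled variance with divisor n − K is an assumption of this sketch, as are the function names and the toy values; only the class names ALL and AML are borrowed from the leukemia dataset.

```python
from collections import defaultdict


def fit_dlda(samples, labels):
    """Estimate class mean vectors and pooled per-gene variances."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    p = len(samples[0])
    means = {
        k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        for k, rows in by_class.items()
    }
    pooled = [0.0] * p
    for k, rows in by_class.items():
        for r in rows:
            for j in range(p):
                pooled[j] += (r[j] - means[k][j]) ** 2
    dof = len(samples) - len(by_class)  # assumption: divisor n - K
    variances = [s / dof for s in pooled]
    return means, variances


def predict_dlda(x, means, variances):
    """Class with the smallest variance-scaled squared distance to x."""
    def score(k):
        return sum((x[j] - means[k][j]) ** 2 / variances[j]
                   for j in range(len(x)))
    return min(means, key=score)


# Toy learning set (made-up values): two genes, three samples per class.
samples = [[1.0, 5.0], [2.0, 7.0], [3.0, 6.0],
           [6.0, 1.0], [7.0, 3.0], [8.0, 2.0]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
means, variances = fit_dlda(samples, labels)
print(predict_dlda([2.5, 5.0], means, variances))
print(predict_dlda([6.5, 2.5], means, variances))
```

DQDA would differ only in keeping a separate variance vector per class and adding the $\log \sigma_{kj}^2$ term to each score.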
Nearest Neighbor Classifiers
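NN methods are based on a distance function for pairs of observations, for example the Euclidean distance or one minus the correlation between two gene expression profiles:
$$d_E(x, x') = \sqrt{\sum_{j=1}^{p} (x_j - x'_j)^2}, \qquad d_C(x, x') = 1 - r(x, x')$$
where $r(x, x')$ is the correlation between the profiles $x$ and $x'$.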
The k nearest neighbor rule proceeds as follows to classify test set observations on the basis of the learning set. For each element in the test set:
Find the k closest observations in the learning set;
Predict the class by majority vote, i.e., choose the class that is most common among those k neighbors.
Nearest Neighbor Classifiers (cont.)
The number of neighbors k is chosen by cross-validation, that is, by running the nearest neighbor classifier on the learning set only.
Each observation in L is treated in turn as a test case: its distance to all of the other observations in L is computed and it is classified by the nearest neighbor rule.
The classification for each observation in L is compared to the truth to produce the cross-validation error rate.
A number of k’s are tried, and the k with the smallest cross-validation error is retained for use on the test set.
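The nearest neighbor rule and the leave-one-out choice of k can be sketched in a few lines of Python. Euclidean distance is assumed here, ties in the vote go to the class encountered first among the nearest neighbors, and ties between candidate values of k go to the first candidate; these choices, the function names, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn_predict(x, samples, labels, k):
    """Majority vote among the k learning-set observations closest to x."""
    order = sorted(range(len(samples)),
                   key=lambda i: euclidean(x, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_error(samples, labels, k):
    """Leave-one-out cross-validation error rate on the learning set."""
    errors = 0
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(samples[i], rest_x, rest_y, k) != labels[i]:
            errors += 1
    return errors / len(samples)


def choose_k(samples, labels, candidates):
    """Candidate k with the smallest cross-validation error."""
    return min(candidates, key=lambda k: loocv_error(samples, labels, k))


# Toy learning set (made-up values): two genes, three samples per class.
samples = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]
best_k = choose_k(samples, labels, [1, 3, 5])
print(best_k, knn_predict([1.5, 1.5], samples, labels, best_k))
```

In this toy set k = 5 misclassifies every left-out sample, because the two remaining same-class neighbors are always outvoted by the three samples of the other class.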
Classification Trees
Binary tree structured classifiers are constructed by repeated splits of subsets of the measurement space X into two descendant subsets, starting with X itself.
Each terminal subset is assigned a class label and the resulting partition of X corresponds to the classifier.
Three aspects to tree construction:
The selection of the splits
The decision to declare a node terminal or to continue splitting
The assignment of each terminal node to a class
In this experiment, the CART (Classification And Regression Trees) method is used.
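The first aspect, the selection of splits, can be sketched as follows. This is not the CART implementation used in the experiment: it assumes the Gini index as the impurity measure and searches only thresholds on single genes, and the function names and toy values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(samples, labels):
    """Return (gene index, threshold, weighted impurity) of the best split.

    Candidate thresholds are midpoints between consecutive distinct
    values of each gene; the split sends x[j] <= threshold to the left.
    """
    n = len(samples)
    best = None
    for j in range(len(samples[0])):
        values = sorted(set(r[j] for r in samples))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(samples, labels) if r[j] <= t]
            right = [y for r, y in zip(samples, labels) if r[j] > t]
            impurity = (len(left) * gini(left)
                        + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (j, t, impurity)
    return best


# Toy node (made-up values): gene 0 separates the classes, gene 1 does not.
samples = [[1.0, 9.0], [2.0, 4.0], [3.0, 8.0],
           [7.0, 5.0], [8.0, 9.5], [9.0, 4.5]]
labels = ["A", "A", "A", "B", "B", "B"]
print(best_split(samples, labels))
```

A full tree would apply `best_split` recursively to each descendant subset until a stopping rule declares the node terminal, then label each terminal node with its majority class.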
Aggregating Classification
Main idea: gains in accuracy can be obtained by aggregating predictors built from perturbed versions of the learning set.
The key to improved accuracy is the possible instability of the prediction method, i.e., whether small changes in the learning set result in large changes in the predictor.
Let $C(\cdot, L_b)$ denote the classifier built from the bth perturbed learning set $L_b$ and let $w_b$ denote the weight given to its predictions. Then the predicted class for an observation x is:
$$\arg\max_{k} \sum_{b} w_b \, I\left( C(x, L_b) = k \right)$$
Aggregating Classification (cont.)
There are 2 main classes of methods for generating perturbed versions of the learning set:
Bagging: perturbed learning sets of the same size as the original learning set are formed by drawing bootstrap replicates of the learning set, i.e., by sampling observations with replacement.

Boosting: the data are re-sampled adaptively and predictors are aggregated by weighted voting.
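Bagging with equal weights can be sketched in Python as below. To keep the example short, the base classifier is a 1-nearest-neighbor rule, which is an assumption of this sketch; the experiment aggregates classification trees, and an unstable base method is where aggregation is expected to help. The function names, the number of bootstrap replicates, and the toy data are likewise hypothetical.

```python
import random
from collections import Counter


def build_1nn(samples, labels):
    """Return a classifier that predicts the label of the closest sample."""
    def classify(x):
        dists = [sum((u - v) ** 2 for u, v in zip(x, s)) for s in samples]
        return labels[dists.index(min(dists))]
    return classify


def bagged_predict(x, samples, labels, build, n_boot=25, seed=0):
    """Majority vote over classifiers built from bootstrap replicates."""
    rng = random.Random(seed)
    n = len(samples)
    votes = Counter()
    for _ in range(n_boot):
        # Bootstrap replicate: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        classifier = build([samples[i] for i in idx],
                           [labels[i] for i in idx])
        votes[classifier(x)] += 1  # equal weights, w_b = 1
    return votes.most_common(1)[0][0]


# Toy learning set (made-up values): two genes, three samples per class.
samples = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]
print(bagged_predict([1.5, 1.5], samples, labels, build_1nn))
print(bagged_predict([5.5, 5.5], samples, labels, build_1nn))
```

Boosting would replace the uniform resampling with adaptive resampling that favors previously misclassified observations, and the equal weights with weights that depend on each classifier's error.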
Source of Datasets
Lymphoma dataset
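This dataset contains gene expression measurements for the three most prevalent adult lymphoid malignancies: B-cell chronic lymphocytic leukemia (B-CLL), follicular lymphoma (FL), and diffuse large B-cell lymphoma (DLBCL).
This study produced gene expression data for p=4,682 genes in n=81 mRNA samples.
29 × B-CLL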
9 × FL
43 × DLBCL
http://genome-www.stanford.edu/lymphoma
Correlation Matrix
Source of Datasets (cont.)
Leukemia dataset
This dataset contains gene expression measurements for two types of acute leukemia: ALL and AML.
This study produced gene expression data for p=6,817 genes in n=72 mRNA samples.
47 × ALL (38 B-cell ALL, 9 T-cell ALL)
25 × AML
http://www.genome.wi.mit.edu/MPR
Source of Datasets (cont.)
NCI 60 dataset
This dataset is used to examine the variation in gene expression among the 60 cell lines from NCI’s anti-cancer drug screen.
This study produced gene expression data for p=5,244 genes in n=61 cell lines.
7 × breast
5 × CNS
7 × colon
6 × ovarian
9 × renal
6 × leukemia
8 × melanoma
9 × non-small-cell lung carcinoma
2 × prostate
1 × unknown
http://genome-www.stanford.edu/nci60
Results: Test Set Error Rates
Results: Test Set Error Rates (cont.)
Experiment 1: Discussion
Simple classifiers such as DLDA and nearest neighbors perform remarkably well compared to more sophisticated methods such as aggregated classification trees.
FLDA had the highest error rate, which is likely due to the poor estimation of covariance matrices with a small training set and a fairly large number of genes p.
In the lymphoma and leukemia datasets, the methods other than FLDA are not sensitive to the number of variables (genes) p. In the NCI 60 data, increasing the number of genes improves accuracy.
Much larger datasets are needed to obtain reasonably accurate estimates of error rates.
Experiment 2: Multiclass Cancer Diagnosis
Comprehensive gene expression databases have yet to be developed, and there are no established analytical methods capable of solving complex, multiclass, gene expression-based classification problems.

In this experiment, the authors created a gene expression database containing the expression profiles of 218 tumor samples representing 14 common human cancer classes.

Using an innovative analytical method, the authors demonstrate that accurate multiclass cancer classification is indeed possible.
Source of the data
Specimens, spanning 14 different tumor classes, were obtained from the National Cancer Institute/Cooperative Human Tissue Network, Massachusetts General Hospital Tumor Bank, Dana–Farber Cancer Institute, Brigham and Women’s Hospital, Children’s Hospital (all in Boston), and Memorial Sloan-Kettering Cancer Center (New York).

Targets were hybridized sequentially to oligonucleotide microarrays containing a total of 16,063 probe sets representing 14,030 GenBank and 475 The Institute for Genomic Research (TIGR) accession numbers.

Of 314 tumor and 98 normal tissue samples processed, 218 tumor and 90 normal tissue samples passed quality control criteria and were used for subsequent data analysis.
SVM Algorithm
Support Vector Machine (SVM) Algorithm
This experiment uses the SVM-FU implementation (available at http://www.ai.mit.edu/projects/cbcl).
The linear SVM algorithm maximizes the distance between a hyperplane, defined by w, and the samples closest to it from the two tumor classes, with the constraint that the samples from the two classes lie on opposite sides of the hyperplane.
This distance is calculated in 16,063-dimensional gene space.
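The optimization problem:
$$\min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i \, (w \cdot x_i + b) \ge 1 \ \text{ for all } i,$$
where $x_i$ is the expression profile of sample $i$ and $y_i \in \{-1, +1\}$ is its class label.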
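The link between the constraints and the margin can be checked numerically. The Python sketch below does not solve the optimization problem; it only evaluates a candidate hyperplane (w, b), testing the constraints $y_i (w \cdot x_i + b) \ge 1$ and computing the distance from the hyperplane to the closest sample. The function names, the hyperplanes, and the two-gene toy data are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b of sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_constraints(w, b, samples, labels):
    """True if y_i (w . x_i + b) >= 1 holds for every sample."""
    return all(y * decision_value(w, b, x) >= 1
               for x, y in zip(samples, labels))


def geometric_margin(w, b, samples, labels):
    """Distance from the hyperplane to the closest sample."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x)
               for x, y in zip(samples, labels)) / norm


# Toy data (made-up values): two genes, labels +1 and -1.
samples = [[3.0, 1.0], [4.0, 2.0], [1.0, 1.0], [0.0, 3.0]]
labels = [1, 1, -1, -1]

print(satisfies_constraints([1.0, 0.0], -2.0, samples, labels))
print(geometric_margin([1.0, 0.0], -2.0, samples, labels))
print(satisfies_constraints([0.5, 0.0], -1.0, samples, labels))
```

When the closest samples meet the constraint with equality, the margin equals $1 / \lVert w \rVert$, which is why minimizing $\lVert w \rVert$ under these constraints maximizes the margin.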
OVA Classification Scheme
Supervised learning has been used to make pairwise distinctions with gene expression data.
However, making multiclass distinctions is considerably more difficult.
An analytical scheme, one-vs-all (OVA), is therefore devised for this purpose.
OVA Classification Scheme (cont.)
The main steps in OVA:
Each test sample is presented sequentially to the 14 pairwise (one class versus all others) classifiers;
Each classifier either claims or rejects the sample as belonging to its class.
This method results in 14 separate OVA classifications per sample, each with an associated confidence.
Each test sample is assigned to the class with the highest OVA classifier confidence.
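The OVA steps above can be sketched in Python as follows. The sketch assumes that each OVA classifier is a linear rule (w, b) and that its confidence is the signed score w · x + b; the hyperplanes, class names, and values below are hypothetical placeholders, with three classes standing in for the 14 used in the experiment.

```python
def ova_predict(x, classifiers):
    """Assign x to the class whose OVA classifier is most confident.

    classifiers maps a class name to a (w, b) pair; a positive score
    means the classifier claims the sample, a negative one rejects it.
    Returns the winning class and the confidence of every classifier.
    """
    confidences = {
        name: sum(wi * xi for wi, xi in zip(w, x)) + b
        for name, (w, b) in classifiers.items()
    }
    winner = max(confidences, key=confidences.get)
    return winner, confidences


# Hypothetical OVA hyperplanes over two genes.
classifiers = {
    "breast": ([1.0, 0.0], -1.0),
    "colon": ([0.0, 1.0], -1.0),
    "lung": ([-1.0, -1.0], 1.0),
}
winner, confidences = ova_predict([3.0, 0.5], classifiers)
print(winner, confidences)
```

Taking the maximum also covers the cases where no classifier, or more than one, claims the sample: the class with the least negative or most positive score is chosen.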
SVM&OVA vs Other Algorithms
Training Data vs Test Data
Experiment 2: Discussion
The “close calls” suggest that improved accuracy might be possible by increasing the number of samples from these classes in the training set.

Increasing the number of genes likely allows for highly accurate prediction.

The analysis of poorly differentiated tumors raises the possibility that these tumors arise from distinct cellular precursors, have different molecular mechanisms of transformation, or have unique natural histories in some other respect. This finding also has important clinical implications in that it suggests that these tumors should be classified distinctly.
Summary
Introduction of DNA microarray techniques.
Microarray helps improve tumor diagnosis.
Comparison of the performance of different supervised learning algorithms on several small datasets.
Use of SVM with OVA in multiclass cancer classification.
Referenced Resources
http://www.cs.wustl.edu/~jbuhler//research/array/#cells
Comprehensive collections of microarray techniques
http://www.cancer.gov/
The website of National Cancer Institute (NCI)
http://www.beecherinstruments.com/references.html
Recommended reading for tissue microarray research
S. Dudoit, J. Fridlyand, Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, June 2000
S. Ramaswamy, P. Tamayo, Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures, Dec. 2001