7.91 / 7.36 / BE.490
Lecture #7
May 4, 2004
DNA Microarrays & Clustering
Chris Burge
DNA Microarrays & Clustering
• Why the hype?
• Microarray platforms
- cDNA vs oligo technologies
• Sample applications
• Analysis of microarray data
- clustering of co-expressed genes
- some classic microarray papers
Stanford U. Dept. of Biochemistry Web Site
http://cmgm.stanford.edu/biochem/
Why Microarrays?
• Changes in gene expression are important in
many biological contexts:
– Development
– Cancer
– Other Diseases
– Environmental Adaptation
• DNA microarrays provide a high-throughput
way to study these changes.
What’s new?
… progression to chip technology
• Hybrid detection
– radioactive labeling
– fluorescent labeling
• Solid support for sample fixation
– Southern blots, Northern blots, etc.
• Main advantage of microarrays is scale
– Probes are attached to solid support
– Efficient robotics
– Bioinformatic analysis
• Parallel measurement of thousands of genes at a time
Array Platforms
• cDNA arrays (spotted arrays)
– Probes are PCR products from cDNA libraries or clone collections
– May be printed on glass slides (e.g., P. Brown lab, Stanford), OR
– May be printed on nylon membranes (e.g., Millennium)
– Spots are 100-300 μm in size and about the same distance apart
– ~30,000 cDNAs can fit onto the surface of a microscope slide
• Oligonucleotide arrays
– 20-25 mers synthesized onto silicon wafers in situ or printed onto glass
slides by photolithography (Affymetrix) or ink-jet printing (Rosetta/Agilent)
– Presynthesized oligos can also be printed onto glass slides
• Other technologies (e.g., bead arrays attached to optical fibers)
cDNA Arrays I - Overview
Please see
Duggan, DJ, M Bittner, Y Chen, P Meltzer, and JM Trent. "Expression Profiling using
cDNA Microarrays." Nat Genet. 21, no. 1 Suppl (January 1999): 10-4.
cDNA Arrays II - Printing
1. Templates for genes of interest
obtained and amplified by PCR
2. After purification and quality control,
aliquots of ~5 nl printed on coated
glass microscope slide using high
speed robot
Please see
Duggan, DJ, M Bittner, Y Chen, P Meltzer, and JM Trent. "Expression Profiling using
cDNA Microarrays." Nat Genet. 21, no. 1 Suppl (January 1999): 10-4.
cDNA Arrays III - Labeling, Hybridization
1. Total RNA from test and reference
samples is fluorescently labeled with
Cy5/Cy3 dye using a single round of
reverse transcription
2. Pooled fluorescent targets are
hybridized to the clones on the array
Please see
Duggan, DJ, M Bittner, Y Chen, P Meltzer, and JM Trent. "Expression Profiling using
cDNA Microarrays." Nat Genet. 21, no. 1 Suppl (January 1999): 10-4.
cDNA Arrays IV - Scanning
1. Laser excitation of hybridized targets - emission spectra
measured using a scanning confocal laser microscope
2. Monochrome images (from scanner) are imported into software
in which images are pseudo-colored and merged
3. Data analyzed as normalized ratio (Cy3/Cy5) - gene expression
increase or decrease relative to reference sample
Please see
Duggan, DJ, M Bittner, Y Chen, P Meltzer, and JM Trent. "Expression Profiling using
cDNA Microarrays." Nat Genet. 21, no. 1 Suppl (January 1999): 10-4.
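The normalized-ratio step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intensities are made up, Cy3 is taken as the test channel and Cy5 as the reference (following the Cy3/Cy5 orientation of the ratio above), and normalization is done by shifting the log2 ratios so that their median is zero, which is one common choice rather than the method used in the cited paper.

```python
import math
from statistics import median

def normalized_log_ratios(cy3, cy5):
    """Return log2(Cy3/Cy5) per spot, shifted so the median log ratio is 0.

    Median centering is an assumed normalization: it presumes that most
    spots do not change between test and reference samples.
    """
    raw = [math.log2(a / b) for a, b in zip(cy3, cy5)]
    shift = median(raw)
    return [r - shift for r in raw]

# Hypothetical spot intensities (arbitrary units), one entry per spot.
cy3 = [1200.0, 300.0, 800.0, 4000.0, 150.0]
cy5 = [600.0, 310.0, 790.0, 1000.0, 600.0]

log_ratios = normalized_log_ratios(cy3, cy5)
for spot, lr in enumerate(log_ratios):
    direction = "increase" if lr > 0 else "decrease" if lr < 0 else "no change"
    print(f"spot {spot}: log2 ratio {lr:+.2f} ({direction})")
```

Working on the log scale makes a two-fold increase and a two-fold decrease symmetric around zero (+1 and -1).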
cDNA Arrays Oligo Arrays
Please see
Schulze, A, and J Downward. "Navigating Gene Expression using Microarrays
--A Technology Review." Nat Cell Biol. 3, no. 8 (August 2001): E190-5.
Oligo Arrays I - Light-directed printing
• Synthetic linkers modified with photochemically removable protecting groups
are attached to the substrate, and light is directed through a
photolithographic mask to specific areas on the surface to produce
localized photodeprotection.
• Chemical coupling occurs at those sites that were illuminated in the
preceding step. Next, light is directed to different regions and the cycle is
repeated.
• Current versions exceed one million probes per array.
Please see
Lipshutz, RJ, SP Fodor, TR Gingeras, and DJ Lockhart. "High Density Synthetic
Oligonucleotide Arrays." Nat Genet. 21, no. 1 Suppl (January 1999): 20-4.
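The deprotect-and-couple cycle can be illustrated with a toy simulation. The sketch below is hypothetical and ignores all chemistry: it cycles through the four bases, and at each step the "mask" exposes exactly those array sites whose next position needs the base being offered. The function names and the three short probes are invented for illustration.

```python
def synthesis_masks(probes, bases="ACGT"):
    """Return a list of (base, exposed_sites) steps that builds every probe.

    At each step one base is offered; the mask exposes exactly those sites
    whose next unsynthesized position requires that base.
    """
    for seq in probes:
        if any(ch not in bases for ch in seq):
            raise ValueError(f"probe {seq!r} contains a base outside {bases!r}")
    progress = [0] * len(probes)          # bases already coupled at each site
    steps = []
    while any(p < len(seq) for p, seq in zip(progress, probes)):
        for base in bases:
            exposed = [i for i, seq in enumerate(probes)
                       if progress[i] < len(seq) and seq[progress[i]] == base]
            if exposed:                   # skip steps where no site needs this base
                steps.append((base, exposed))
                for i in exposed:
                    progress[i] += 1
    return steps

def replay(steps, n_sites):
    """Rebuild the probe at each site by applying the steps in order."""
    built = [""] * n_sites
    for base, exposed in steps:
        for i in exposed:
            built[i] += base
    return built

probes = ["ACG", "AGT", "CCA"]            # hypothetical probes, one per site
steps = synthesis_masks(probes)
for base, exposed in steps:
    print(f"couple {base} at sites {exposed}")
```

Because every pass through the four bases extends each unfinished probe by at least one position, probes of length L need at most 4L coupling steps regardless of how many sites the array holds.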
Oligo Arrays II - Other types of printing
Bubble-Jet printing technology
for covalent attachment of DNA
Please see
Okamoto, T, T Suzuki, and N Yamamoto. "Microarray Fabrication with Covalent Attachment
of DNA using Bubble Jet Technology." Nat Biotechnol. 18, no. 4 (April 2000): 438-41.
Commercially
Available Microarrays
Please see
Lipshutz, RJ, SP Fodor, TR Gingeras, and DJ Lockhart. "High Density Synthetic
Oligonucleotide Arrays." Nat Genet. 21, no. 1 Suppl (January 1999): 20-4.
cDNA vs Oligo Arrays
• Requirements:
- purified DNA vs sequence info alone
• Reproducibility
• Cost
• Hybridization specificity / probe size
• Applications
Some applications of microarrays
• Temporal order of gene expression program (cell cycle)
• Effect of perturbations of the cellular environment on
gene expression (e.g., medium, temperature, drugs, etc.)
• Differential gene expression in different pathological
conditions / tissue types
• Identification of genes / exon-intron structures
• Mutation analysis
• Mapping binding sites of transcription factors
Microarray Data Analysis -
Normalization & Clustering
• Normalization
- use all genes in sample, OR
- use designated unchanging subset of genes
- measure variance of normalizing set
- use to generate expected variance, confidence
intervals
- use CIs to define up- and down-regulated genes
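One way to read the normalization steps above is sketched below in Python. Everything specific here is an assumption for illustration: the gene names and log2 ratios are invented, the "unchanging subset" is a hypothetical list of control genes, and the interval is the mean of the controls plus or minus 1.96 standard deviations, which treats the control log ratios as roughly normal.

```python
from statistics import mean, stdev

def call_regulated(log_ratios, control_genes, z=1.96):
    """Label each gene 'up', 'down', or 'unchanged'.

    log_ratios    : dict mapping gene name -> log2 expression ratio
    control_genes : names of the designated unchanging subset
    z             : half-width of the interval in standard deviations
                    (1.96 gives roughly 95% under a normal assumption)
    """
    controls = [log_ratios[g] for g in control_genes]
    center, spread = mean(controls), stdev(controls)
    low, high = center - z * spread, center + z * spread
    calls = {}
    for gene, value in log_ratios.items():
        if value > high:
            calls[gene] = "up"
        elif value < low:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls, (low, high)

# Hypothetical data: five control genes and three genes of interest.
log_ratios = {
    "ctrl1": 0.10, "ctrl2": -0.05, "ctrl3": 0.02, "ctrl4": -0.08, "ctrl5": 0.06,
    "geneA": 1.50, "geneB": -1.20, "geneC": 0.05,
}
controls = ["ctrl1", "ctrl2", "ctrl3", "ctrl4", "ctrl5"]

calls, interval = call_regulated(log_ratios, controls)
print("interval:", interval)
print({g: c for g, c in calls.items() if not g.startswith("ctrl")})
```

To normalize against all genes instead of a designated subset, pass every gene name as `control_genes`; the interval then reflects the spread of the whole array.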
What is clustering?
• A way of grouping together data samples that are
similar in some way - according to criteria of your
choice
• A form of unsupervised learning – generally don’t
have examples of how the data should be grouped
together
• So, a method of data exploration – a way of looking
for patterns or structure in the data that are of interest
Why cluster?
• Cluster genes (rows)
– Measure expression at multiple time-points, different
conditions, etc.
– Similar expression patterns may suggest similar functions
of genes
• Cluster samples (columns)
– e.g., expression levels of thousands of genes for each
tumor sample
– Similar expression patterns may suggest biological
relationship among samples
Hierarchical Agglomerative Clustering
• Start with each data point in a separate cluster
• Keep merging the most similar pairs of data
points/clusters until all form one big cluster
• Called bottom-up or agglomerative method
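The bottom-up procedure can be written as a short, deliberately naive Python sketch. Two choices here are assumptions, not part of the description above: points are compared by Euclidean distance, and clusters are compared by the smallest distance between any of their members. The example points are made up.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(points, dist=euclidean):
    """Naive agglomerative clustering; returns the merges in order.

    Each merge is (members_a, members_b, distance), where members are
    tuples of indices into `points`. Cluster-to-cluster distance is the
    minimum pairwise distance (an assumed choice for this sketch).
    """
    clusters = [(i,) for i in range(len(points))]   # one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):      # find the closest pair
            d = min(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical two-dimensional data points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerate(points)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

With n points the loop performs n - 1 merges, and the recorded merge distances are the heights at which branches join in the resulting tree. This version recomputes all pairwise distances on every pass, so it is suitable only for small examples.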
Hierarchical Clustering II
• This produces a binary tree or dendrogram
• The final cluster is the root and each data item is a leaf
• The heights of the bars indicate how close the items are
Figure axes: Data items (genes, etc.) vs. Distance
How do we define “similarity”?
• The goal is to group together “similar” data –
but how to define similarity/distance between points
(or clusters)?
• In general, depends on what we want to find or
emphasize in the data - clustering is an art
• The similarity measure is often more important than
the clustering algorithm used
Euclidean distance
$d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Here n is the number of dimensions in the data vector
For instance:
– Number of time-points/conditions (when clustering genes)
– Number of genes (when clustering samples)
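A minimal Python sketch shows how the same distance function serves both cases. The small expression matrix is invented for illustration, with genes as rows and conditions as columns.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log ratios: rows are genes, columns are conditions.
expr = [
    [0.0, 1.0, 2.0],   # gene 0
    [0.0, 1.0, 4.0],   # gene 1
]

# Clustering genes: compare rows, so n = number of conditions (3).
d_genes = euclidean(expr[0], expr[1])

# Clustering samples: compare columns, so n = number of genes (2).
samples = list(zip(*expr))            # transpose the matrix
d_samples = euclidean(samples[0], samples[2])

print(d_genes, d_samples)
```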
Correlation
• We might care more about the overall shape of expression
profiles than about the actual magnitudes
• That is, we want to consider genes similar when they go “up”
and “down” together
Figure: two plots of Log(E_t/E_0) against Time
Pearson or Product-Moment Correlation
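$$\rho(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$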
Product of corresponding terms in vector, using
difference from mean rather than value, and
normalizing by the product of the standard
deviations.
• Always between –1 and +1
• Invariant to shifting (adding a constant) and to positive scaling
of the expression values
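The properties above can be checked with a direct Python transcription of the product-moment formula. The vectors are made up; `y` is a shifted and scaled copy of `x`, and `z` is `x` with its sign flipped.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors.

    Undefined (division by zero) if either vector is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0]          # hypothetical expression profile
y = [10 + 5 * v for v in x]       # same shape, shifted and scaled up
z = [-v for v in x]               # mirror image: goes down when x goes up

r_xy = pearson(x, y)
r_xz = pearson(x, z)
print(r_xy, r_xz)
```

Here `x` and `y` are far apart in Euclidean terms but perfectly correlated, while multiplying by a negative factor flips the sign of the correlation.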
Linkage in Hierarchical Clustering
• We already know about distance measures between
data items, but what about between a data item and
a cluster or between two clusters?
• We just treat a data point as a cluster with a single
item, so our only problem is to define a linkage
method between clusters
• As usual, there are lots of choices…
Single (Minimum) Linkage
• The minimum of all pairwise distances between
points in the two clusters
• Tends to produce long, “loose” clusters
Complete (Maximum) Linkage
• The maximum of all pairwise distances between
points in the two clusters
• Tends to produce very tight clusters
Average Linkage
• M. Eisen’s cluster program defines average linkage as follows:
– Each cluster c_i is associated with a mean vector μ_i,
which is the mean of all the data items in the cluster
– The distance between two clusters c_i and c_j is then
defined as d(μ_i, μ_j)
• This is somewhat non-standard – this method is
usually referred to as centroid linkage, and average
linkage is defined as the average of all pairwise
distances between points in the two clusters
Eisen et al., PNAS 1998
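The four linkage definitions can be compared side by side in a short Python sketch. Euclidean distance between points is an assumption here, the function names are illustrative, and the two small clusters are made up.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(a, b, dist=euclidean):
    """Minimum of all pairwise distances between the two clusters."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b, dist=euclidean):
    """Maximum of all pairwise distances between the two clusters."""
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b, dist=euclidean):
    """Average of all pairwise distances (the standard definition)."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Mean vector of the data items in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_linkage(a, b, dist=euclidean):
    """Distance between cluster mean vectors."""
    return dist(centroid(a), centroid(b))

# Two hypothetical clusters of two-dimensional points.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(f"{name:8s} {fn(a, b):.3f}")
```

Average linkage always lies between the single and complete values, whereas centroid linkage depends only on the two mean vectors and can differ from the pairwise average.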
Hierarchical Clustering Examples
Clustering 8600 human genes
based on time course of expression
following serum stimulation of
fibroblasts
Genes
(A) cholesterol biosynthesis
(B) the cell cycle
(C) the immediate-early response
(D) signaling and angiogenesis
(E) wound healing and tissue remodeling
Eisen et al (1998) PNAS, 95 14863-14868. Copyright (1998) National Academy of Sciences, U.S.A. Used with permission.
Clustering yeast genes by
co-expression across
many conditions
Conditions
(B) spindle pole body assembly and function
(C) the proteasome
(D) mRNA splicing
(E) Glycolysis
(F) the mitochondrial ribosome
(G) ATP synthesis
(H) chromatin structure
(I) the ribosome and translation
(J) DNA replication
(K) the TCA cycle and respiration
Genes
Eisen et al (1998) PNAS, 95 14863-14868. Copyright (1998) National Academy of Sciences, U.S.A. Used with permission.
Clustering tumor
samples with B- and
T-cell types based
on expression
profiles
Patients with “germinal
center type” expression
profiles generally had
higher five-year survival
rates
Please see
Alizadeh, AA, MB Eisen, RE Davis, C Ma, IS Lossos, A Rosenwald, JC Boldrick, H Sabet, T Tran, X Yu, JI Powell,
L Yang, GE Marti, T Moore, J Hudson Jr, L Lu, DB Lewis, R Tibshirani, G Sherlock, WC Chan, TC Greiner, DD
Weisenburger, JO Armitage, R Warnke, R Levy, W Wilson, MR Grever, JC Byrd, D Botstein, PO Brown, and LM Staudt.
"Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling." Nature 403, no. 6769
(3 February 2000): 503-11.
Microarray analysis of alternative
splicing with exon junction probes I
Tissue-specific
splicing of
OCRL1 gene
Please see figure 1 of
Johnson, JM, J Castle, P Garrett-Engele, Z Kan, PM Loerch, CD Armour, R Santos, EE Schadt,
R Stoughton, and DD Shoemaker. "Genome-wide Survey of Human Alternative Pre-mRNA Splicing
with Exon Junction Microarrays." Science 302, no. 5653 (19 December 2003): 2141-4.
Microarray analysis of alternative
splicing with exon junction probes II
Please see figure 2 of
Johnson, JM, J Castle, P Garrett-Engele, Z Kan, PM Loerch, CD Armour, R Santos, EE Schadt,
R Stoughton, and DD Shoemaker. "Genome-wide Survey of Human Alternative Pre-mRNA Splicing
with Exon Junction Microarrays." Science 302, no. 5653 (19 December 2003): 2141-4.
Brain-specific
spliced isoforms
of APP gene
Papers for Thursday
#1
Segal, E, M Shapira, A Regev, D Pe'er, D Botstein, D Koller, and N Friedman. "Module Networks:
Identifying Regulatory Modules and Their Condition-specific Regulators from Gene Expression Data."
Nature Genetics 34, no. 2 (June 2003): 166-76.
#2
Beer, Michael A., and Saeed Tavazoie. "Predicting Gene Expression from Sequence."
Cell 117 (16 April 2004): 185-198.
#3
Friedman, N. "Inferring Cellular Networks using Probabilistic Graphical Models."
Science 303, no. 5659 (6 February 2004): 799-805.
Background reading:
Appendix of Probability & Statistics Primer