10/27/2005 Chaoqun Wu,Fudan University 1
Proteomics
Chaoqun Wu, School of Life Sciences, Fudan University
Part VIII. Mass Spectrometry
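The core of proteome research is to systematically identify every protein expressed in a cell or tissue and to determine the salient features of each protein.
The analytical techniques involved include:
– separation techniques for proteins and peptides
– analytical techniques for identifying and quantifying the analytes
– bioinformatics techniques for data management and analysis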
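Mass spectrometry has become an important tool for peptide and protein analysis, mainly thanks to soft ionization techniques such as electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI).
Biological macromolecules are mostly polar, poorly volatile compounds that are hard to vaporize, so they could not be measured by conventional mass spectrometry. With the wide adoption of new ionization techniques, intact or fragmented macromolecular biopolymers can now be ionized efficiently and then analysed by mass spectrometry.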
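Mass spectrometry (MS)
A mass spectrum is a plot in which charged atoms, molecules, and molecular fragments are arranged in order of their mass-to-charge ratio (or mass).
A mass spectrometer is an instrument that ionizes the particles of a substance, separates the ions by mass-to-charge ratio using suitable electric and magnetic fields (by position in space, by arrival time, or by the stability of their trajectories), and measures their intensities. It consists mainly of an electronics system, an analysis system, and a vacuum system.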
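1. Basic Principles:
A mass spectrometer uses electromagnetic principles to separate charged sample ions by their mass-to-charge ratio. After ionization, the ions are accelerated and enter the magnetic field; their kinetic energy depends on the accelerating voltage and on the charge z:
$$ z e U = \frac{1}{2} m v^{2} $$
(where $z$ is the number of charges, $e$ is the elementary charge ($e = 1.60 \times 10^{-19}$ C), $U$ is the accelerating voltage, $m$ is the mass of the ion, and $v$ is the velocity of the ion after acceleration.)
A charged particle with velocity $v$ enters the electromagnetic field of the mass analyzer and, depending on the separation method chosen, the ions are finally separated according to m/z.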
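As one illustration of such a separation, consider a magnetic sector analyzer (one of the instrument types listed later in these slides). The magnetic flux density $B$ and the orbit radius $r$ are symbols introduced here and do not appear on the slide; this is a sketch that ignores the initial velocity spread of the ions.

```latex
\begin{align}
z e U &= \tfrac{1}{2} m v^{2}
  && \text{(energy gained in the accelerating field)} \\
z e v B &= \frac{m v^{2}}{r}
  && \text{(magnetic force supplies the centripetal force)} \\
\Rightarrow\quad \frac{m}{z} &= \frac{e B^{2} r^{2}}{2U}
  && \text{(eliminate } v = z e B r / m \text{)}
\end{align}
```

Under these assumptions, ions of different m/z follow orbits of different radius at fixed $B$ and $U$, which is what allows them to be separated in space.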
Mass Spectrometry
Mass spectrometry is a tool for analysing the proteome.
In general, a mass spectrometer consists of:
– Ion Source
– Mass Analyzer
– Detector
Mass spectrometers measure the mass-to-charge (m/z) ratios of ionised substances. From these measurements a mass is determined, proteins are identified, and further analysis is performed.
– Introduce the sample to the instrument
– Generate ions in the gas phase
– Separate the ions on the basis of differences in m/z with a mass analyzer
– Detect the ions
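The sample molecules (or atoms) to be analysed are ionized in the ion source into singly charged molecular ions and fragment ions of different masses. These singly charged ions acquire the same kinetic energy in the accelerating field and form an ion beam, which enters an analyzer composed of an electric field and a magnetic field. In the electric field, the slower ions in the beam are deflected more and the faster ions less. In the magnetic field, the ions are deflected with the opposite angular velocity vector; again the slower ions are deflected more and the faster ions less. When the deflections of the two fields compensate each other, the trajectories intersect at one point. At the same time, mass separation takes place in the magnetic field, so ions with the same mass-to-charge ratio but different velocities are focused at the same point, while ions with different mass-to-charge ratios are focused at different points. The focal surface is nearly a plane, and recording it with the detection system gives the lines for the different mass-to-charge ratios, that is, the mass spectrum. Mass spectrometric analysis yields information on the molecular weight, molecular formula, isotopic composition, and molecular structure of the sample.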
Diagram of a simple mass spectrometer
Illustration of the basic components of a mass spectrometry system:
Inlet → Ionization Source → (all ions) → Mass Analyzer → (selected ions) → Detector → Data System
1) Ionization source
2) Analyzer
3) Detector
Analytical tool for measuring the molecular weight (MW) of a sample
Only picomolar concentrations are required
Accurate to within 0.01% of the total weight of the sample, and to within 5 ppm for small organic molecules
For a Mr of 40 kDa, this corresponds to a 4 Da error
This means it can detect amino acid substitutions and post-translational modifications
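The accuracy figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the 500 Da small molecule are assumptions chosen for illustration, only the 40 kDa / 0.01% / 5 ppm figures come from the slide.

```python
def error_from_percent(mass_da: float, percent: float) -> float:
    """Absolute mass error in Da for a relative accuracy given in percent."""
    return mass_da * percent / 100.0


def error_from_ppm(mass_da: float, ppm: float) -> float:
    """Absolute mass error in Da for a relative accuracy given in ppm."""
    return mass_da * ppm / 1_000_000.0


# 40 kDa protein at 0.01 % accuracy (the figures quoted on the slide)
protein_error_da = error_from_percent(40_000.0, 0.01)

# Hypothetical 500 Da small molecule at 5 ppm accuracy
small_molecule_error_da = error_from_ppm(500.0, 5.0)

print(f"40 kDa protein:  +/- {protein_error_da:.1f} Da")
print(f"500 Da molecule: +/- {small_molecule_error_da:.4f} Da")
```

For comparison, a phosphorylation adds about 80 Da, well above a 4 Da uncertainty, which is why such modifications can be detected on an intact 40 kDa protein.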
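Mass spectra and mass tables
The main applications of mass spectrometry are identifying complex molecules and elucidating their structures, and determining the isotopes of elements and their distribution.
A mass spectrum is plotted with the mass-to-charge ratio (m/z) on the horizontal axis and the relative intensity on the vertical axis. The strongest ion peak in the raw spectrum is usually taken as the base peak and assigned a relative intensity of 100%; the other ion peaks are expressed as percentages of the base peak.
A mass table presents the mass spectral data in tabular form, with two columns: mass-to-charge ratio and relative intensity. A mass spectrum gives an intuitive overview of the whole molecule, whereas a mass table gives the exact m/z values and relative intensities, which helps further analysis.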
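2. BioMS (biological mass spectrometry)
Mass spectrometry was first used mainly to determine the masses of elements and isotopes. With advances in ion optics theory, the instruments improved steadily, and by the late 1950s the technique was widely used for the determination of compounds. Today, mass spectrometry is widely used in the life sciences to identify biological macromolecules such as peptides, proteins, oligonucleotides, oligosaccharides, and glycoproteins.
This has injected new vitality into the development of mass spectrometry and given rise to a distinct field, biological mass spectrometry.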
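On 9 October 2002, the Royal Swedish Academy of Sciences announced that the 2002 Nobel Prize in Chemistry was awarded to three scientists for major contributions to the analysis of biological macromolecules. They are John B. Fenn, professor at Yale University and Virginia Commonwealth University, USA (he developed methods for the identification and structure analysis of biological macromolecules and for their mass spectrometric analysis); Koichi Tanaka, R&D engineer at Shimadzu Corporation, Japan, and head of the Life Science Laboratory in the Life Science Business Center of its Analytical and Measuring Instruments Division (his contribution is similar to that of J. B. Fenn); and Kurt Wüthrich, professor of molecular biophysics at the Swiss Federal Institute of Technology (ETH) Zurich and visiting professor at The Scripps Research Institute, La Jolla, California, USA (he developed the use of nuclear magnetic resonance to determine the three-dimensional structure of biological macromolecules in solution).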
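Before the 1980s, organic mass spectrometry could measure molecular weights only up to about 2000 Da, and the main targets were small organic molecules of a few hundred daltons. The ionization methods of the time readily destroyed strongly polar, thermally unstable biological macromolecules, so measuring such molecules by mass spectrometry was a major problem.
Understanding biology and medicine at the level of macromolecules such as proteins and peptides requires identifying them, characterising their properties, determining their structures, and studying their interactions; this is the foundation of life science research. Developing methods to separate and analyse biological macromolecules has therefore been a research focus since the 1980s, and "seeing" the "true face" of biological macromolecules was long the dream of many scientists. The breakthroughs of the Nobel laureates John B. Fenn and Koichi Tanaka in this field turned that dream into reality.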
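History shows that the development of mass spectrometry has depended on the development of ionization methods
In 1912, J. J. Thomson, winner of the 1906 Nobel Prize, first demonstrated that molecules can be separated according to their size and charge, laying the foundation of mass spectrometry. The mass-to-charge ratio is expressed in the unit thomson (Th).
Ionization methods before John B. Fenn and Koichi Tanaka:
– electron impact ionization (EI)
– chemical ionization (CI)
– field ionization (FI) and field desorption (FD)
– fast atom bombardment (FAB) ionization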
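Among these, the fast atom bombardment (FAB) ionization method, invented in 1981 by the British scientist M. Barber and co-workers, is suited to organic compounds of very low volatility and high polarity, ionic compounds, and thermally unstable compounds of relatively high molecular weight, such as amino acids, peptides, sugars, oligosaccharides, and metal complexes. Some have therefore hailed the discovery of FAB ionization as the milestone that brought mass spectrometry into biology.
However, these ionization methods could not reliably measure molecular weights above 10³ Da. For ordinary proteins, with molecular weights from 10³ to 10⁵ Da, and for large enzyme complexes of up to 5 × 10⁶ Da, they remained powerless. Nevertheless, these methods gave a major impetus to the later successes.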
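Electrospray ionization (ESI) and its applications were developed by J. B. Fenn between 1984 and 1987. ESI mass spectrometry can detect molecular weights of up to several hundred thousand daltons, with a sensitivity roughly at the fmol to pmol level.
Soft laser desorption (SLD) and its applications were developed by K. Tanaka between 1987 and 1988; soft laser desorption later evolved into matrix-assisted laser desorption/ionization (MALDI). With it, mass spectrometry can measure biological macromolecules with molecular weights above 10⁵ Da.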
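Their achievements were a major historic breakthrough that brought mass spectrometry into the study of biological macromolecules, and of proteomics in particular. The invention of the two soft ionization techniques, ESI and MALDI, tied mass spectrometry closely to macromolecule research. Since then, mass spectrometry has been widely used to identify peptides, proteins, oligonucleotides, oligosaccharides, glycoproteins, and other biological macromolecules.
The former (ESI) can determine peptide sequences from small molar amounts; the latter (MALDI) can be used for sequencing by post-source decay (PSD). Neither approach requires two-dimensional gel electrophoresis; the protein mixture only needs to be in solution.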
Ionization techniques
Ionization Method | Typical Analytes | Sample Introduction | Mass Range | Method Highlights
Electron Impact (EI) | Relatively small. Volatile. | GC or liquid or solid probe | To 1000 Daltons | Hard method. Provides structural info
Chemical Ionization (CI) | Relatively small. Volatile. | GC or liquid or solid probe | To 1000 Daltons | Soft method. Molecular ion peak [M+H]+
Electrospray (ESI) | Peptides/proteins. Non-volatile. | Liquid chromatography | To 200,000 Daltons | Soft method. Ions often multiply charged.
Matrix Assisted Laser Desorption (MALDI) | Peptides/proteins. Non-volatile. | Sample mixed in solid matrix | To 500,000 Daltons | Soft method. Very high mass range.
Fast Atom Bombardment (FAB) | Carbs/peptides. Non-volatile. | Sample mixed in viscous matrix | To 6000 Daltons | Soft method, but harder than ESI or MALDI
Web sites for Ionization Methods:
1. For MALDI beginner:
http://www.srsmaldi.com/Maldi/Guide.html
2. For MALDI lab user:
http://www.srsmaldi.com/Maldi/Lab.html
3. For MALDI tutorial:
http://ms.mc.vanderbilt.edu/tutorials/maldi/maldi-ie_files/frame.html
4. Ionization Methods 1:
http://www.jeol.com/ms/docs/ionize.html
5. Ionization Methods 2:
http://www.waters.com/Waters_Website/Applications/lcms/lcms_itq.html
Two ionization techniques brought mass spectrometry into proteomics:
– electrospray ionization (ESI)
– matrix-assisted laser desorption/ionization (MALDI)
Both are "soft ionization" methods: when the sample molecule is ionized, the whole molecule stays intact and fragment ions are largely avoided.
Both are now regarded as ionization techniques of major importance for the study of biological macromolecules.
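Principle of ESI
In the electrospray ionization (ESI) method invented by J. B. Fenn, the sample inlet capillary is placed in an electric field at a potential of 3–6 kV. It works as follows: as the sample solution leaves the capillary, the electric field and an auxiliary gas flow spray it into a mist of charged microdroplets. Heated gas evaporates the solvent, so the droplet diameter shrinks and the surface charge density rises. When the Rayleigh limit is reached, that is, when the Coulomb repulsion of the surface charges roughly equals the surface tension of the droplet, a "Coulomb explosion" breaks the droplet into smaller charged droplets. The solvent in these droplets evaporates in turn, and the process repeats until the droplets are small enough, and the field created by the surface charge strong enough, for the sample ions to be desorbed from the droplet. These sample ions pass through the sampling cone and the focusing lenses into the mass analyzer, where they are detected.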
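The Rayleigh limit mentioned above can be put into numbers with the standard expression q_R = 8π·sqrt(ε₀·γ·r³). The sketch below is only illustrative: the droplet radii and the water-like surface tension are assumed example values, not figures from the slides, and real electrospray solvents differ.

```python
import math

EPSILON_0 = 8.8541878128e-12         # vacuum permittivity, F/m
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def rayleigh_limit_charge(radius_m: float, surface_tension_n_per_m: float) -> float:
    """Largest charge (C) a droplet can hold before Coulomb fission.

    q_R = 8 * pi * sqrt(epsilon_0 * gamma * r**3)
    """
    return 8.0 * math.pi * math.sqrt(
        EPSILON_0 * surface_tension_n_per_m * radius_m ** 3
    )


# Assumed example values: water-like surface tension, shrinking droplet radii
GAMMA_WATER = 0.072  # N/m, approximate value near room temperature

for radius_um in (1.0, 0.1, 0.01):
    q = rayleigh_limit_charge(radius_um * 1e-6, GAMMA_WATER)
    n_charges = q / ELEMENTARY_CHARGE
    print(f"r = {radius_um:5.2f} um -> about {n_charges:,.0f} elementary charges")
```

Because the limit falls as r^1.5 while the charge on an evaporating droplet stays roughly constant, a shrinking droplet eventually exceeds the limit, which is the trigger for the repeated fission described in the text.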
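Basic principle of SLD/MALDI
In MALDI, a small organic molecule serves as the matrix. Sample and matrix are mixed evenly at a molar ratio of 1:(100–50,000), dried under vacuum or in air, and introduced into the ion source. Under vacuum the mixture is irradiated with a laser; the matrix absorbs the laser energy and converts it into electronic excitation energy, so that the solid matrix is vaporised almost instantly and ionized into matrix ions. Collisions of the uncharged, neutral sample molecules with matrix ions, protons, and metal cations ionize the sample, producing protonated molecules, cationised molecules, multiply charged ions, or multimer ions. The ionized molecules are accelerated by an electric field into the time-of-flight mass analyzer and detected.
In MALDI-TOF MS, the main role of the matrix is to absorb energy from the laser pulse and to separate the analyte molecules from one another into isolated single molecules.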
3. Applications of MS:
Structural information can be generated, particularly using tandem mass spectrometers
Fragment the sample and analyse the products
Useful for peptide and oligonucleotide sequencing
Also identifies individual compounds in complex mixtures
How does a mass spectrometer work:
Three fundamental parts: the ionisation source, the analyser, and the detector
Samples are easier to manipulate once ionised
Separation in the analyser according to mass-to-charge ratio (m/z)
Detection of the separated ions and their relative abundance
Signals are sent to the data system and formatted as an m/z spectrum
Typical MS experiment:
Sample molecule (M) → Ionization (M+ / fragmentation) → Mass Analyzer → Mass Spectrum → Data Analysis → On-line Search
The Components of a Mass Spectrometer:
Inlet system → Ion Source → Analyzer → Ion Detector → Computer → Mass Spectrum (m/z)
Mass spectrometry is a very powerful method for analysing the structure of organic compounds, but it suffers from three major limitations:
Compounds cannot be characterised without clean samples
The technique cannot provide sensitive and selective analysis of complex mixtures
For big molecules such as peptides, spectra are very complex and very difficult to interpret
Uses of MS in three major areas:
– characterization and quality control of recombinant proteins and other macromolecules
– protein identification
– detection and characterization of post-translational modifications, and potentially identification of any covalent modification that alters the mass of a protein
(1) Protein identification by mass spectrometry
What is MS?
A mass spectrometer is an instrument that determines the molecular weight of chemical compounds.
In practice:
– Ionisation of the molecules
– Separation of the ions according to their m/z ratio
Protein identification by mass spectrometry
1) Peptide mapping strategies
protein → trypsin → set of peptides (p1 … p5) → MS analysis → set of masses (Mp1 … Mp5) = experimental peptide map
The experimental peptide map is compared, in a database search, with the theoretical peptide maps from all the sequences in the databases.
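The peptide mapping strategy can be sketched in a few lines: digest every database sequence in silico, compute the peptide masses, and count how many experimental masses fall within a tolerance. This is a simplified illustration, not the algorithm of any particular search engine; the two-entry database, the function names, the 0.2 Da tolerance, and the simulated 0.05 Da measurement offset are all assumptions. The cleavage rule used (after K or R, but not before P) is the usual trypsin rule.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def peptide_map(sequence):
    """Theoretical peptide map: sorted masses of all tryptic peptides."""
    return sorted(peptide_mass(p) for p in tryptic_digest(sequence))


def count_matches(experimental, theoretical, tolerance_da):
    """Number of experimental masses explained by the theoretical map."""
    return sum(
        any(abs(m - t) <= tolerance_da for t in theoretical)
        for m in experimental
    )


def identify(experimental, database, tolerance_da=0.2):
    """Score every database entry and return the best one with all scores."""
    scores = {
        name: count_matches(experimental, peptide_map(seq), tolerance_da)
        for name, seq in database.items()
    }
    return max(scores, key=scores.get), scores


# Toy database for illustration only (not a real search database)
DATABASE = {
    "protein_A": "MKWVTFISLLLLFSSAYSRGVFRRDTHK",
    "protein_B": "GLSDGEWQQVLNVWGKVEADIAGHGQEVLIR",
}

# Simulated measurement: protein_B peptides with a 0.05 Da offset
experimental = [m + 0.05 for m in peptide_map(DATABASE["protein_B"])]
best, scores = identify(experimental, DATABASE)
print(best, scores)
```

Real search engines add scoring statistics, missed cleavages, and modifications on top of this counting scheme; the sketch only shows the core comparison of experimental and theoretical peptide maps.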
Protein identification by mass spectrometry
2) Peptide sequencing strategies
protein → trypsin → set of peptides (p1 … p5) → MS analysis → peptide map (Mp1 … Mp5)
selected peptide (Mp1) → MS/MS analysis → experimental sequence (aa1 … aa7)
The experimental sequence is compared, in a database search, with the sequences in the databases.
Peptide mass fingerprint
In this method, a protein spot from a 2-DE gel is digested with an enzyme to give a set of fragments, the accurate mass of each fragment is measured by mass spectrometry, and the resulting data are searched against a protein database to identify the protein. Because it is sensitive and fast, it has become the most widely used method for identifying protein spots on 2-DE gels.
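MALDI-TOF: matrix-assisted laser desorption/ionization time-of-flight mass spectrometer
Used for high-throughput accurate molecular weight determination of proteins, peptides, or oligonucleotides and for peptide mass fingerprinting; it supports high-throughput protein screening and single nucleotide polymorphism analysis. It offers PSD capability, and CID analysis provides structural and sequence information.
– MALDI/TOF-MS
– Matrix-assisted
– Laser desorption/ionisation
– Time-of-flight analysis
– Generation of peptide mass map (fingerprint)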
Peptide mapping using MALDI-TOF (Matrix-Assisted Laser Desorption Ionisation-Time Of Flight)
1) Ionisation: a laser beam strikes the sample target and produces positively charged ions.
2) Separation / detection: an electric field (E) accelerates the ions from the sample target through the flight tube to the detector.
[Figure: animation frames at 0.01 ns, 1.00 ns, and 20.01 ns showing the ions travelling along the flight tube]
[Figure: peptide map with peaks Mp1 to Mp5]
Tandem Mass Spectrometry
Isolates individual peptide fragments (from the trypsin digest) for a second mass spec; can obtain peptide sequence
Compare the peptide sequence with protein databases
MS analysis: peptide mapping
[Figure: Spot 14, TOF MS ES+ spectrum; x-axis m/z 450 to 1300, y-axis relative intensity in %; labelled peaks include m/z 854.411 and 901.494]
[Figure: Spot 14. Top panel, MS analysis: TOF MS ES+ spectrum, m/z 450 to 1300. Bottom panel, MS/MS analysis: Spot 14 901.49 2x, TOF MSMS 901.49 ES+, m/z 100 to 1800, with the precursor isotope peaks at m/z 901.519, 902.514, 903.007, and 903.499]
[Figure: Spot 14. Top panel, MS analysis: TOF MS ES+ spectrum, m/z 450 to 1300. Bottom panel, MS/MS analysis: Spot 14 901.49 2x, TOF MSMS 901.49 ES+, m/z 100 to 1800, showing fragment ion peaks from m/z 147.126 to 1655.925]
Analysed MS/MS spectrum
[Figure: Spot 14 901.49 2x, TOF MSMS 901.49 ES+ (MaxEnt 3 processed); x-axis m/z 100 to 1800, y-axis relative intensity in %]
Sequence along the y-ion series (yMax): K V I D E L S L D S L G A E E A L
(M+H)+: 1802.03
b ions: b2 185.14, b3 314.19, b4 443.24, b5 514.29, b6 571.31, b7 684.40, b9 886.47, b10 999.57, b14 1443.77, b15 1556.90
y ions: y1 147.13, y6 716.46, y7 803.51, y9 1031.64, y10 1118.66, y11 1231.75, y12 1288.79, y13 1359.82, y14 1488.89, y15 1617.93, y16 1688.96
Identifications by homology
Symbiosis between Medicago truncatula & Sinorhizobium meliloti
G. Bestel-Corre & E. Dumas-Gaudot, INRA - Dijon
Peptide 1 (1652.9 Da): query and four database hits (Sbjct 1 to 4), all EF-Tu
Sequences, in slide order: LLDQGQAGDNLQLLR, LLDQGQAGDNVGLLLR, LLDQGQAGDNVGLLLR, LLDQGQAGDNAGLLLR, LLDQGQAGDNVGILLR
Hit organisms, in slide order: Mycobacterium tuberculosis, Mycobacterium leprae, Xylella fastidiosa, Thiomonas cuprina
Peptide 2 (1639.9 Da): query and four database hits (Sbjct 1 to 4), all EF-Tu
Sequences, in slide order: VGEEIEIVGIRPTTK, VGEEVEIVGIRPTSK, VGEEVEIVGIRPTQK, VGEEIEIVGIRPTlK, VGEEVEIVGIKPTSK
Hit organisms, in slide order: Agrobacterium tumefaciens, Stigmatella aurantiaca, Thiomonas cuprina, Buchnera aphidicola
Peptide 3 (1786.9 Da): query and four database hits (Sbjct 1 to 4), all EF-Tu
Sequences, in slide order: GITISTAHVEYETNPR, GITINTAHVEYETNDR, GITISTAHVEYETETR, GITISTAHVEYETEAR, GITISTAHVEYETQNR
Hit organisms, in slide order: Tribonema aequale, Bacillus subtilis, Bacillus stearothermophilus, Rickettsia prowazekii
The protein database used
Consists of a combination of several widely available databases: Swiss-Prot, Protein Identification Resource, and a translation of GenBank
The resulting database contains 18.8 million residues, representing over 91,000 entries
The C-language source code for FRAGFIT can be obtained via e-mail to ckw@gene.com
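Peptide sequencing by mass spectrometry
Tandem mass spectrometry (tandem MS): the first stage of mass spectrometry gives the molecular ions of the peptides; the ion of the target peptide is selected and collided with an inert gas, which breaks the peptide bonds in the chain and produces series of ions, namely the N-terminal ion series (b series) and the C-terminal fragment ion series (y series). Analysing these fragment ion series together gives the amino acid sequence of the peptide.
Post-source decay matrix-assisted laser desorption/ionization mass spectrometry (post-source decay MALDI-MS, PSD-MALDI-MS)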
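The b and y series described above can be computed directly from a sequence. The sketch below assumes singly charged fragments and uses the sequence printed on the analysed MS/MS spectrum slide, read in reverse (that slide lists it along the y series); this reading is an assumption, and L and I cannot be distinguished by mass. The function names are illustrative.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b1 .. b(n-1): fragments that keep the N-terminus."""
    masses, running = [], 0.0
    for aa in peptide[:-1]:
        running += RESIDUE_MASS[aa]
        masses.append(running + PROTON)
    return masses


def y_ions(peptide):
    """Singly charged y1 .. y(n-1): fragments that keep the C-terminus."""
    masses, running = [], WATER
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        masses.append(running + PROTON)
    return masses


def precursor_mz(peptide, charge):
    """m/z of the intact peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


# Assumed reading of the sequence on the analysed-spectrum slide (N to C)
PEPTIDE = "LAEEAGLSDLSLEDIVK"

b = b_ions(PEPTIDE)
y = y_ions(PEPTIDE)
print(f"[M+2H]2+ = {precursor_mz(PEPTIDE, 2):.3f}")
print(f"b2 = {b[1]:.3f}, y1 = {y[0]:.3f}")
```

Under this reading, the computed doubly charged precursor, b2, and y1 fall close to the values annotated on the slides (901.49, 185.14, 147.13); differences between consecutive ions in either series equal one residue mass, which is how the sequence is read from a spectrum.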
Peptide Sequencing
Peptides of 2.5 kDa or less give the best data
The protein sample is often taken from 2-D gels and digested
A protein digest can be analysed as an entire mix
The initial MS spectrum showing the Mr of all components in the digest (peptide map) may be enough for a database search and identification
Peptides are fragmented along the amino acid backbone in tandem mass spectrometry
Some peptides generate enough information for a full sequence; others generate only partial sequences of 4-5 amino acids
Often this "tag" sequence is sufficient for database identification
Amino Acid Composition (Edman)
Pioneering method of obtaining information from proteins.
Cumbersome and tedious by today's standards.
Requires the use of terrible-smelling β-mercaptoethanol.
Not "high-throughput" by today's standards; hence aa comp is no longer the most widely used technique.
Protein Sequencing by MS
Step 1: fragmenting into peptides
Protein Sequencing
Step 2: sequencing the peptides by Edman degradation.
Separation by HPLC and detection by absorbance at 269 nm.
4. Types of Spectrometry
ESI/MS (electrospray ionization mass spectrometry)
MALDI/TOF-MS (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry)
Time-of-Flight Instruments
Magnetic Sector Instruments
Quadrupole Instruments
ESI-MS/MS (electrospray ionization tandem mass spectrometry)
Fourier Transform Ion Cyclotron Resonance (FT-ICR)
(1) Time-of-Flight
A time-of-flight (TOF) analyzer is one of the simplest mass analyzing devices and is commonly used with MALDI. Time-of-flight analysis is based on accelerating a set of ions to a detector with the same amount of energy. Because the ions have the same energy, yet a different mass, the ions reach the detector at different times.
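The relation between mass and arrival time follows from the energy balance z·e·U = m·v²/2 given under Basic Principles. The sketch below is illustrative only: the 20 kV accelerating voltage and the 1 m field-free flight tube are assumed values, the function names are hypothetical, and the time spent in the acceleration region is ignored.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C
DALTON = 1.66053906660e-27           # kg


def flight_time(mass_da, charge, accel_voltage_v, tube_length_m):
    """Drift time (s): z*e*U = m*v**2/2, so t = L / v."""
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(
        2.0 * charge * ELEMENTARY_CHARGE * accel_voltage_v / mass_kg
    )
    return tube_length_m / velocity


def mz_from_time(time_s, accel_voltage_v, tube_length_m):
    """Invert the relation: m/z (Da per charge) from a measured drift time."""
    return (
        2.0 * ELEMENTARY_CHARGE * accel_voltage_v
        * (time_s / tube_length_m) ** 2 / DALTON
    )


# Assumed instrument parameters for illustration
U_ACCEL = 20_000.0  # V
TUBE = 1.0          # m

for mass in (1000.0, 2000.0, 4000.0):
    t = flight_time(mass, 1, U_ACCEL, TUBE)
    print(f"m = {mass:6.0f} Da, z = 1 -> t = {t * 1e6:.2f} microseconds")
```

Since the flight time grows with the square root of m/z, quadrupling the mass only doubles the flight time, so heavier ions arrive later but ever more closely spaced.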
Time of Flight Reflectron
Today's TOF Instruments
Ionization Techniques: TOF Analyzers
Electrospray: TOF, Q-TOF
MALDI: linear TOF, reflector TOF, TOF-TOF, Trap-TOF
Hybrid mass spectrometry
Time-of-Flight Mass Spectrometry
The Detector System; The Vacuum Chamber
Information on Time-of-Flight Mass Spectrometry
(2) MALDI/MS
The mass range below 500 daltons (Da) is often obscured by matrix-related ions in MALDI; therefore MALDI is mostly applied to the analysis of peptides.
Matrix-assisted laser desorption/ionization (MALDI) was developed by Karas and Hillenkamp in the late 1980s.
To generate gas-phase, protonated molecules, a large excess of matrix material is coprecipitated with analyte molecules (that is, the molecules to be analyzed) by pipetting a submicroliter volume of the mixture onto a metal substrate and allowing it to dry. The resulting solid is then irradiated by nanosecond laser pulses, usually from small nitrogen lasers with a wavelength of 337 nm.
An example of protein identification using MALDI (matrix-assisted laser desorption/ionization) mass spectrometry.
The matrix is α-cyano-4-hydroxycinnamic acid or dihydroxybenzoic acid (DHB).
The α-cyano matrix, which generally leads to the highest sensitivity in MALDI, is "hotter" than DHB, so the latter is preferred when the ions need to be stable for milliseconds in trapping experiments rather than microseconds in time-of-flight experiments.
MALDI-TOF versus ESI-MS/MS: complementary approaches
MALDI-TOF (peptide mapping) | ESI-MS/MS (peptide sequencing)
Protein must be in the database | Sequence information
clean sample | extremely clean sample
easy data acquisition | more complex data acquisition
easy data interpretation | more complex data interpretation
fast | can be time consuming
extremely sensitive | sensitive
Easy to automate | More difficult to automate
MALDI-TOF + ESI-MS/MS
Mix of peptides → PEPTIDE MAPPING → Identification?
  YES
  NO → PEPTIDE SEQUENCING → Identification?
    YES
    NO → NEW PROTEIN
(3) Electrospray mass spectrometry (ESI/MS)
Electrospray mass spectrometry (ESI/MS) was developed for use in biological mass spectrometry by Fenn et al.
Liquid containing the analyte is pumped at low microliter-per-minute flow rates through a hypodermic needle at high voltage to electrostatically disperse, or electrospray, small, micrometer-sized droplets, which rapidly evaporate and which impart their charge onto the analyte molecules.
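Because electrospray ions are often multiply charged, one protein appears as a series of peaks at m/z = (M + z·1.00728)/z. A minimal sketch of how two adjacent peaks of such a series give back the charge and the neutral mass follows; the 16,951.5 Da protein mass and the function names are assumptions for illustration, and only protonation (no other adducts) is considered.

```python
PROTON = 1.00728  # Da


def mz_of_charge_state(neutral_mass, charge):
    """m/z of a molecule carrying `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge


def deconvolute_adjacent(mz_low_charge, mz_high_charge):
    """Charge and neutral mass from two adjacent peaks of one series.

    mz_low_charge  : peak with charge z (the larger m/z)
    mz_high_charge : neighbouring peak with charge z + 1 (the smaller m/z)
    """
    z = round((mz_high_charge - PROTON) / (mz_low_charge - mz_high_charge))
    neutral_mass = z * (mz_low_charge - PROTON)
    return z, neutral_mass


# Hypothetical protein mass used to simulate a charge-state series
M_TRUE = 16_951.5
series = {z: mz_of_charge_state(M_TRUE, z) for z in range(10, 15)}
for z, mz in series.items():
    print(f"z = {z:2d}: m/z = {mz:.3f}")

z_found, m_found = deconvolute_adjacent(series[10], series[11])
print(f"recovered z = {z_found}, M = {m_found:.1f} Da")
```

This is why a protein of tens of kilodaltons can be measured on an analyzer whose m/z range ends at a few thousand: the multiple charging folds the mass into that range.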
Tandem MS or MS/MS has 2 mass spectrometers in series
The first mass spectrometer (MS1) is used to SELECT, from the primary ions, those of a particular m/z value, which then pass into the fragmentation region. The ion selected by MS1 is the parent ion and can be a molecular ion or an ion resulting from the primary fragmentation. DISSOCIATION occurs in the fragmentation region. The daughter ions are analysed in the second spectrometer (MS2). In fact, MS1 can be viewed as an ion source for MS2.
MS1 → MS2
"Mass Spec" analyses can be run in tandem: MS/MS
MS/MS refers to two MS experiments performed "in tandem."
Among other things, MS/MS allows for the determination of sequence information, usually in the form of peptides (small parts of a protein).
This information is used by algorithms to identify a protein on the basis of the mass of a constituent peptide.
Example MS/MS Spectrum
This spectrum shows the fragmentation of a peptide, which is used to determine the sequence of the peptide via a search algorithm.
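(4) Q-TOF: quadrupole time-of-flight tandem mass spectrometer
Equipped with several ion sources, including ESI, nano-ESI, and MALDI; offers high sensitivity (fmol), high resolution (20,000), high mass accuracy (below 5 ppm), and high-throughput MS and MS/MS scanning. It is especially suited to amino acid sequence analysis of peptides and to studies of protein post-translational modifications (such as phosphorylation and glycosylation) and protein-protein interactions. With multidimensional capillary liquid chromatography coupled on-line to the tandem mass spectrometer, proteins in complex systems can be identified and quantified rapidly (MudPIT).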
The quadrupole ion trap mass analyser consists of three hyperbolic electrodes: the ring electrode, the entrance endcap electrode, and the exit endcap electrode. These electrodes form a cavity in which it is possible to trap and analyse ions. Both endcap electrodes have a small hole in their centres through which the ions can travel. The ring electrode is located halfway between the two endcap electrodes.
(5) LC/Mass Spectrometer
Liquid chromatography/mass spectrometry (LC/MS) analysis
(6) Capillary Mass Spectrometer
Capillary electrophoresis/mass spectrometry (CE/MS) analysis
Capillary liquid chromatography/mass spectrometry (cLC/MS)
5. Special Considerations for Protein Preparation Methods
1. Typically, protein purification starts with a whole-cell lysate and ends with a gel-separated protein band or spot.
2. Sensitivity of the overall procedure is usually determined more by the purification strategy than by the sensitivity of the MS instrument.
3. In principle, any of the classical separation methods such as centrifugation, column chromatography, and affinity-based procedures can precede the final gel electrophoresis.
4. Generally, silver-stained amounts are necessary for successful MS identification of proteins [5–50 ng or 0.1–1 pmol for a 50-kilodalton (kDa) protein].
5. It is important to minimize contamination with keratins, which are introduced by dust, chemicals, handling without gloves, etc., as the keratin peptides can easily dominate the spectrum.
6. Most detergents and salts are incompatible with both 2-D gels and MS.
7. If the protein can be eluted from reversed-phase media, the best sample preparation is achieved on small, low-pressure traps that can be incorporated into MS injection ports.
8. In affinity-based protocols it is important that the bait is pure, as contaminating proteins, for example bacterial proteins in a glutathione S-transferase (GST) fusion preparation, will hinder analysis.
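The amounts quoted in point 4 (5–50 ng, or 0.1–1 pmol, for a 50 kDa protein) are related by a simple unit conversion, sketched here. The function names are illustrative, not from any library.

```python
def ng_to_pmol(amount_ng: float, mr_da: float) -> float:
    """Convert a protein amount from nanograms to picomoles.

    ng -> g is a factor 1e-9, mol -> pmol is a factor 1e12,
    so pmol = ng * 1000 / Mr.
    """
    return amount_ng * 1000.0 / mr_da


def pmol_to_ng(amount_pmol: float, mr_da: float) -> float:
    """Convert a protein amount from picomoles to nanograms."""
    return amount_pmol * mr_da / 1000.0


MR_PROTEIN = 50_000.0  # 50 kDa protein, as in the text
for ng in (5.0, 50.0):
    print(f"{ng:4.0f} ng of a 50 kDa protein = {ng_to_pmol(ng, MR_PROTEIN):.1f} pmol")
```

The same mass of a smaller protein corresponds to more moles, so the nanogram threshold for a successful identification depends on the size of the protein.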
ISOTOPE-CODED AFFINITY TAG (ICAT): a quantitative method
Label protein samples with heavy and light reagent
The reagent contains an affinity tag and heavy or light isotopes
Chemically reactive group: forms a covalent bond to the protein or peptide
Isotope-labeled linker: heavy or light, depending on which isotope is used
Affinity tag: enables the protein or peptide bearing an ICAT to be isolated by affinity chromatography in a single step
Example of an ICAT Reagent
[Figure: chemical structure of the ICAT reagent; * marks the isotope-labelled positions in the linker]
Biotin affinity tag: binds tightly to streptavidin-agarose resin
Linker: the heavy version has deuteriums at *; the light version has hydrogens at *
Reactive group: the thiol-reactive group binds to Cys
How ICAT works
Lyse & label → MIX → Proteolysis (e.g. trypsin) → Affinity isolation on streptavidin beads → Quantification by MS (light and heavy peaks) → Identification by MS/MS (example peptide NH2-EACDPLR-COOH)
[Figure: MS spectrum, m/z 550 to 590, showing the light/heavy peak pair; MS/MS spectrum, m/z 200 to 600]
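The quantification step can be sketched as a search for peak pairs separated by the heavy/light mass shift, followed by an intensity ratio. The sketch assumes a reagent whose heavy form carries eight deuteriums (a shift of about 8.0502 Da per labelled cysteine); the slide does not state the number of labelled positions, and the peak list, tolerance, and function name are hypothetical.

```python
# Assumed shift per labelled cysteine: eight deuteriums replacing hydrogens
HEAVY_LIGHT_SHIFT = 8.0502  # Da


def find_icat_pairs(peaks, charge=1, n_cys=1, tolerance=0.02):
    """Find light/heavy pairs in a peak list.

    peaks  : list of (mz, intensity) tuples
    returns: list of (light_mz, heavy_mz, heavy/light intensity ratio)
    """
    spacing = HEAVY_LIGHT_SHIFT * n_cys / charge
    pairs = []
    for light_mz, light_int in peaks:
        for heavy_mz, heavy_int in peaks:
            if light_int > 0 and abs((heavy_mz - light_mz) - spacing) <= tolerance:
                pairs.append((light_mz, heavy_mz, heavy_int / light_int))
    return pairs


# Hypothetical peak list: one doubly charged peptide with a single cysteine
peaks = [(560.30, 1200.0), (564.33, 2400.0), (575.10, 300.0)]
pairs = find_icat_pairs(peaks, charge=2)
for light_mz, heavy_mz, ratio in pairs:
    print(f"light {light_mz:.2f} / heavy {heavy_mz:.2f}: heavy/light = {ratio:.2f}")
```

The ratio estimates the relative abundance of the peptide, and hence of its parent protein, in the heavy-labelled sample compared with the light-labelled one; in practice peak areas over the full isotope envelope would be used rather than single intensities.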
Advantages vs. Disadvantages
Advantages:
– Estimates relative protein levels between samples with a reasonable level of accuracy (within 10%)
– Can be used on complex mixtures of proteins
– Cys-specific label reduces sample complexity
– Peptides can be sequenced directly if tandem MS/MS is used
Disadvantages:
– Yield and non-specificity
– Slight chromatography differences
– Expensive
– Tag fragmentation
– Meaning of relative quantification information
– Cysteine residues absent or not accessible to the ICAT reagent
Part IX. Post-MS Analysis
Using N-terminal amino acid information to search:
– a peptide mass database with peptide masses
– a protein sequence database with peptide sequences
Outline
How to identify a protein using MALDI-TOF
How good the data need to be to identify a protein
– Mass accuracy and resolution?
– How many peptides are needed?
How to improve search specificity to obtain true positive protein identifications
Protein identification
[Flowchart; starting points: 2-DE gel / western blot; 2D analysis (LC 2D, 2D gel, western blot). Boxes shown:]
– Western immunoprobing
– Direct mass determination (MALDI-MS)
– N-terminal sequencing (Edman degradation)
– N-terminal sequence tag (3 residues)
– Amino acid compositional analysis
– Peptide ladder sequencing (MALDI-MS/ESI-MS)
– Proteolytic digestion
– HPLC
– Peptide mass profiling (MALDI-MS)
– Peptide mass profiling (ESI-MS)
– Sequence tag (MALDI-PSD)
– Sequence tag (ESI-MS/MS)
– Nanospray ESI-MS/MS
– Internal sequencing (Edman degradation)
What you need for peptide mass mapping
Some peptides
Protein database
– GenBank, Swiss-Prot, dbEST, etc.
Search engines
– Mascot, Prospector, Sequest, etc.
How good the data need to be to identify a protein
– Mass accuracy and resolution?
– How many peptides are needed?
– Additional information such as taxonomy, modifications, etc.?
Searching parameters
Modifications (e.g. cysteine residues, arginine residues, lysine residues, etc.)
Number of allowable missed cleavages
Data properties: monoisotopic/average, charge state, amino acid composition
MS/MS data
Iterative search protocol used for database searching with protein mixtures.
Schematic of the working of some search engines
[Figure: mass axis from m1 to m2 divided into sub-intervals h1, h2, h3, h4, …, h(x-1), hx, with the sweep, the step, and the error marked]
Where:
sweep = mass range between m1 and m2, as a sub-interval
step = sub-interval mass range into which the sweep is divided
error = mass error; error must be greater than or equal to half of step
The effect of mass error on matching peptides
a) mass error = 0.5 Da
b) mass error = 0.35 Da
c) mass error = 0.2 Da
d) mass error = 0.1 Da
e) mass error = 0.05 Da
f) mass error = 0.01 Da
The number of matching peptides in the mass range 1000 to 1003 Da (database: S. cerevisiae; enzyme: trypsin; monoisotopic)
How to increase search specificity
increase mass accuracy
additional information
N-terminal amino acid
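Both ways of raising specificity can be illustrated on a small list of candidate peptides: a tighter mass window leaves fewer candidates, and knowing the N-terminal residue removes more. The candidate list below is entirely hypothetical (made-up masses with an N-terminal residue each), as are the target mass and the function names.

```python
from bisect import bisect_left, bisect_right


def count_in_window(sorted_masses, target, error_da):
    """Number of candidate masses within target +/- error_da."""
    lo = bisect_left(sorted_masses, target - error_da)
    hi = bisect_right(sorted_masses, target + error_da)
    return hi - lo


def with_n_terminus(candidates, target, error_da, residue):
    """Candidates inside the mass window that also start with `residue`."""
    return [
        (mass, n_term) for mass, n_term in candidates
        if abs(mass - target) <= error_da and n_term == residue
    ]


# Hypothetical candidates: (peptide mass in Da, N-terminal residue)
CANDIDATES = [
    (1000.48, "A"), (1000.52, "H"), (1000.61, "L"), (1000.95, "H"),
    (1001.20, "Y"), (1001.55, "H"), (1002.10, "G"),
]
masses = sorted(mass for mass, _ in CANDIDATES)
TARGET = 1000.55  # hypothetical measured peptide mass

for error in (0.5, 0.35, 0.2, 0.1, 0.05, 0.01):
    n = count_in_window(masses, TARGET, error)
    print(f"mass error {error:4.2f} Da -> {n} candidate(s)")

filtered = with_n_terminus(CANDIDATES, TARGET, 0.2, "H")
print(f"0.2 Da window plus N-terminal H -> {len(filtered)} candidate(s)")
```

In this toy list the N-terminal residue narrows a moderate 0.2 Da window down to a single candidate, which a mass window alone only achieves at a much tighter tolerance.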
The nomenclature of the common peptide fragment ions developed by Roepstorff and Fohlman
[Figure: peptide backbone H2N-CHR1-CO-NH-CHR2-CO- … -NH-CHRn-COOH with the cleavage sites labelled. Fragments retaining the N-terminus are labelled A, B, C (A1, B1, C1 … An-1, Bn-1, Cn-1); fragments retaining the C-terminus are labelled X, Y, Z (X1, Y1, Z1 … Xn-1, Yn-1, Zn-1). Example structures of a B1 ion and a Y2'' ion are drawn below the backbone.]
The proposed mechanism of formation of b- and y-type ions
[Figure: reaction scheme leading from the protonated peptide (H2N-Peptide … Peptide-COOH) to the (bm)+ and (yn)+ product ions]
Edman degradation
[Figure: reaction scheme. Labeling: phenyl isothiocyanate reacts with the N-terminal amino group of the peptide Ala-Asp-Phe-Phe-Arg. Release: PTH-alanine is released, leaving Asp-Phe-Phe-Arg, the peptide shortened by one residue.]
Proposed mechanism of fragmentation of [M+2H]²⁺ ions of peptide PTC derivatives to yield b1 and yn product ions
[Figure: reaction scheme giving either (b1)+ together with (yn)+, or (yn)²⁺]
Theoretical tryptic digest of four model proteomes. The number of protein matches for a single peptide mass in the range 1000 to 2000 Da is estimated for four model proteomes.
Effect of N-terminal residue information on database search efficiency in three proteomes, calculated using monoisotopic peptide masses with 500 ppm (A) and 50 ppm (B) mass accuracy.
Protein identification from mixtures using different database search strategies.
Protein identification in database searching with varying mass accuracy for a single peptide in the S. cerevisiae proteome.
Comparison of the level of unambiguous protein identification in a mixture of three S. cerevisiae proteins with different experimental search strategies.
ESI mass spectrum of bradykinin (Bk) after N-terminal derivatization with phenyl isothiocyanate (PITC) (top); ESI-MS/MS spectrum of (PTC)Bk²⁺ (bottom)
Horse heart myoglobin
SWISS-PROT database; red = PTC-derivative peptides, green = native peptides
a: Values used for the search: the letter is the N-terminal amino acid, as identified by the Edman fragmentation; the number is the rounded mass. H-396 denotes H48-K50, Mr 396.2485; (I/L)-1270 denotes L32-K42, Mr 1270.656; and Y-1884 denotes Y103-K118, Mr 1884.015.
b: As noted in the text, photofragmentation of (I/L)-1270, leading to the partial N-terminal sequence (I/L)FTGH.
10/27/2005 Chaoqun Wu,Fudan University 106
Conclusion
In conclusion, the derivatisation and promoted
fragmentation techniques evaluated in this study promise to help
improve proteome protein identification in mixtures.
They may offer a useful alternative to extended sequence
tagging approaches via MS/MS, particularly when high sensitivities
of analysis are required.
These approaches can reduce the chance of a false positive
identification in protein mixtures and can achieve more efficient
and reliable protein identifications than standard peptide mass
fingerprinting, even when the latter uses data ten-fold superior in
mass accuracy.
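The idea summarized above, that knowing the N-terminal residue of a peptide narrows a peptide-mass search, can be illustrated with a minimal Python sketch. Everything here is an illustration rather than the method used in the study: the two toy proteins, the function names and the simplified trypsin rule (cleave after K or R, but not before P) are assumptions. The peptide HLK is assumed to be the H48-K50 fragment quoted in the myoglobin footnote, because its computed monoisotopic mass agrees with the quoted Mr 396.2485.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a free peptide


def peptide_mass(seq):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def tryptic_digest(protein):
    """Simplified trypsin rule: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def search(db, observed_mass, ppm, n_term=None):
    """Return (protein, peptide) pairs matching the mass within `ppm`.

    If `n_term` is given, only peptides starting with one of those
    residues are accepted (the extra constraint from derivatisation).
    """
    hits = []
    for name, seq in db.items():
        for pep in tryptic_digest(seq):
            if n_term and pep[0] not in n_term:
                continue
            error_ppm = abs(peptide_mass(pep) - observed_mass) / observed_mass * 1e6
            if error_ppm <= ppm:
                hits.append((name, pep))
    return hits


# Hypothetical two-protein database: HLK and LHK are isobaric.
TOY_DB = {"toy_A": "AAKHLKGGR", "toy_B": "GGKLHKAAR"}

print(search(TOY_DB, 396.2485, ppm=50))              # mass alone: ambiguous
print(search(TOY_DB, 396.2485, ppm=50, n_term="H"))  # mass + N-terminal residue
```

In this toy case the mass alone matches peptides from both proteins, because HLK and LHK have the same composition, while adding the N-terminal residue leaves a single candidate. No improvement in mass accuracy can separate such isobaric peptides.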
10/27/2005 Chaoqun Wu,Fudan University 107
Part X.
Bioinformatics
For Proteomics
10/27/2005 Chaoqun Wu,Fudan University 108
What is Bioinformatics
Coined by Hwa Lim in the late 1980s; popularized in the
1990s through its association with the HGP.
• Electrophoresis, Supercomputing and The Human Genome, edited by
Charles R. Cantor and Hwa A. Lim, New Jersey, World Scientific
Publishing Co. (URL is http://www.wspc.co.uk), 1991.
• http://www.google.com/search?q=definition+of+bioinformatics
Definitions from NIH:
– Bioinformatics:
Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or
health data, including those to acquire, store, organize, archive, analyze,
or visualize such data.
– Computational Biology:
The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.
10/27/2005 Chaoqun Wu,Fudan University 109
[Figure: diagram relating mathematical sciences, computer science and life sciences]
10/27/2005 Chaoqun Wu,Fudan University 110
Objects studied
Image from the DOE Human Genome Program
10/27/2005 Chaoqun Wu,Fudan University 111
Topics
• Role of Bioinformatics/Computational Biology in Proteomics Research
• Functional Annotation of Proteins
• Classification of Proteins
• Bioinformatics Databases and Analytical Tools
• Sequence function
10/27/2005 Chaoqun Wu,Fudan University 112
Functional Genomics/Proteomics
Proteomics studies biological systems based on global
knowledge of genomes, transcriptomes, proteomes and
metabolomes. Functional genomics studies biological
functions of proteins, complexes and pathways based on
the analysis of genome sequences. It includes functional
assignments for protein sequences.
– Genome: All the Genetic Material in the Chromosomes
– Transcriptome: Entire Set of Gene Transcripts
– Proteome: Entire Set of Proteins
– Metabolome: Entire Set of Metabolites
[Figure: Genome, Transcriptome, Proteome, Metabolome]
10/27/2005 Chaoqun Wu,Fudan University 113
Data for Proteomics
Data: Gene Expression Profiling
- Genome-Wide Analyses of Gene Expression
Data: Structural Genomics
- Determine 3D Structures of All Protein Families
Data: Genome Projects (Sequencing)
- Functional genomics
- Knowing complete genome sequences of a
number of organisms is the basis of
proteomics research
10/27/2005 Chaoqun Wu,Fudan University 114
[Figure: from DNA sequence to function. Genomic DNA sequence (promoter, 5' UTR, exons 1-3, introns, 3' UTR) passes through gene recognition to the gene and its protein sequence; structure determination gives the protein structure; function analysis links to gene networks and metabolic pathways; family classification links to protein families and molecular evolution.]
10/27/2005 Chaoqun Wu,Fudan University 115
1. Bioinformatics and
Genomics/Proteomics
[Figure: sequence and other data, pathways and regulatory circuits, hypothetical cell, unknown genes, putative functional groups]
10/27/2005 Chaoqun Wu,Fudan University 116
Most new proteins come from
genome sequencing projects
Mycoplasma genitalium - 484 proteins
Escherichia coli - 4,288 proteins
S. cerevisiae (yeast) - 5,932 proteins
C. elegans (worm) ~ 19,000 proteins
Homo sapiens ~ 40,000 proteins
... and have unknown functions
10/27/2005 Chaoqun Wu,Fudan University 117
Advantages of knowing the
complete genome sequence
• All encoded proteins can be predicted
and identified
• The missing functions can be
identified and analyzed
• Peculiarities and novelties in each
organism can be studied
• Predictions can be made and verified
10/27/2005 Chaoqun Wu,Fudan University 118
The changing face of protein
science
20th century
– Few well-studied proteins
– Mostly globular with enzymatic activity
– Biased protein set
21st century
– Many “hypothetical” proteins
– Various, often with no enzymatic activity
– Natural protein set
10/27/2005 Chaoqun Wu,Fudan University 119
Properties of the natural
protein set
• Unexpected diversity of even common
enzymes (analogous, paralogous,
xenologous, etc. enzymes)
• Conservation of the reaction chemistry,
but not the substrate specificity
• Functional diversity in closely related
proteins
• Abundance of new structures
10/27/2005 Chaoqun Wu,Fudan University 120
2. Objectives of functional analysis
for different groups of proteins
• Experimentally characterized
  – Best annotated protein database: SwissProt
• “Knowns” = Characterized by similarity (closely
related to experimentally characterized)
  – Make sure the assignment is plausible
• Function can be predicted
  – Extract maximum possible information
  – Avoid errors and overpredictions
  – Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)
  – Rank by importance
10/27/2005 Chaoqun Wu,Fudan University 121
Category                      E. coli  M. jannaschii  S. cerevisiae  H. sapiens
                                                      (yeast)        (human)
Characterized experimentally     2046             97           3307       10189
Characterized by similarity      1083           1025           1055       10901
Unknown, conserved                285            211           1007        2723
Unknown, no similarity            874            411            966        7965
(from Koonin and Galperin, 2003, with modifications)
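To make the proportions in this table easier to compare across genomes, the counts can be turned into fractions. The short Python sketch below does only that arithmetic. The dictionary keys and function names are illustrative labels, not terms from the cited source.

```python
# Protein counts per annotation category, copied from the table above.
COUNTS = {
    "E. coli":       {"experimental": 2046,  "similarity": 1083,
                      "unknown_conserved": 285,  "unknown_no_similarity": 874},
    "M. jannaschii": {"experimental": 97,    "similarity": 1025,
                      "unknown_conserved": 211,  "unknown_no_similarity": 411},
    "S. cerevisiae": {"experimental": 3307,  "similarity": 1055,
                      "unknown_conserved": 1007, "unknown_no_similarity": 966},
    "H. sapiens":    {"experimental": 10189, "similarity": 10901,
                      "unknown_conserved": 2723, "unknown_no_similarity": 7965},
}


def total_proteins(genome):
    """Sum of all four categories for one genome."""
    return sum(COUNTS[genome].values())


def unknown_fraction(genome):
    """Fraction of proteins with no assigned function (both 'unknown' rows)."""
    c = COUNTS[genome]
    return (c["unknown_conserved"] + c["unknown_no_similarity"]) / total_proteins(genome)


for genome in COUNTS:
    print(f"{genome:14s} total={total_proteins(genome):6d} "
          f"unknown={unknown_fraction(genome):.1%}")
```

The E. coli column sums to 4,288, the same protein count quoted for Escherichia coli on the earlier slide about genome sequencing projects.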
10/27/2005 Chaoqun Wu,Fudan University 122
(1) Problems in functional
assignments for “knowns”
• Previous low quality annotations
- misinterpreted experimental results
(e.g. suppressors, cofactors)
- biologically senseless annotations
Deinococcus: head morphogenesis protein
Arabidopsis: separation anxiety protein-like
Helicobacter: brute force protein
Methanococcus: centromere-binding protein
Plasmodium: frameshift
- propagated mistakes of sequence comparison
10/27/2005 Chaoqun Wu,Fudan University 123
Problems in functional assignments for
“knowns”
Multi-domain organization of proteins
Histidine
kinase
Periplasmic sensor domain His kinase domain
Periplasmic sensor domain Uncharacterized domain
10/27/2005 Chaoqun Wu,Fudan University 124
Problems in functional assignments
for “knowns”
Low sequence complexity (coiled-coil,
non-globular regions)
Non-orthologous gene displacement
Enzyme evolution (divergence in
sequence and function)
10/27/2005 Chaoqun Wu,Fudan University 125
Enzyme recruitment:
Minor mutational changes convert a
glycerol kinase into gluconate kinase
GNTK_BACSU ---MTSYMLGIDIGTTSTKAVLFSENGDVVQKESIGYPLYTPDISTAEQNPEEIFQAVIHTTARITKQHPEKR--ISFISFSSAMHS-VIAIDENDKPLTPCITWADNRSEGWAHKIKE- 113
GNTK_BACLI ---MTSYMLGIDIGTTSTKAVLFSEKGDVIQKESIGYALYTPDISTAEQNPDEIFQAVIQSTAKIMQQHPDKQ--PSFISFSSAMHS-VIAMDENDKPLTSCITWADNRSEGWAHKIKE- 113
GLPK_BACSU ---METYILSLDQGTTSSRAILFNKEGKIVHSAQKEFTQYFPHPGWVEHNANEIWGSVLAVIASVISESGISASQIAGIGITNQRETTVVWDKDTGSPVYNAIVWQSRQTSGICEELRE- 116
GLPK_MYCGE MDLKKQYIIALDEGTSSCRSIVFDHNLNQIAIAQNEFNTFFPNSGWVEQDPLEIWSAQLATMQSAKNKAQIKSHEVIAVGITNQRETIVLWNKENGLPVYNAIVWQDQRTAALCQKFNED 120
XYLB_BACSU ----MKYVIGIDLGTSAVKTILVNQNGKVCAETSKRYPLIQEKAGYSEQNPEDWVQQTIEALAELVSISNVQAKDIDGISYSGQMHGLVLLDQDR-QVLRNAILWNDTRTTPQCIR--MT 113
XYLB_ECOLI ------MYIGIDLGTSGVKVILLNEQGEVVAAQTEKLTVSRPHPLWSEQDPEQWWQATDRAMKALGDQHSLQDVKALGIAGQMHGATLLDAQQ-R--VLRPAILWNDGRCAQECT---LL 108
GNTK_BACSU ELNGHEVYKRTGTPIHPMAPLSKIAWITNERKEIASKAKK-----YIGIKEYIFKQLFN-EYVIDYSLASATGMMNLKGLDWDEEALRIAGITPDHLSKLVPTTEIFQHCSPEIAIQMGI 227
GNTK_BACLI EMNGHNVYKRTGTPIHPMAPLSKITWIVNEHPEIAVKAKK-----YIGIKEYIFKKLFD-QYVVDYSLASAMGMMNLKTLAWDEEALAIAGITPDHLSKLVPTTAIFHHCNPELAAMMGI 227
GLPK_BACSU KGYNDKFREKTGLLIDPYFSGTKVKWILDNVEGAREKAEKGELLFGTIDTWLIWKMSGGKAHVTDYSNASRTLMFNIYDLKWDDQLLDILGVPKSMLPEVKPSSHVYAETVDY--HFFGK 236
GLPK_MYCGE KLIQTKVKQKTGLPINPYFSATKIAWILKNVPLAKKLMEQKKLLFGTIDSWLIWKLTNGKMHVTDVSNASRTLLFDIVKMEWSKELCDLFEVPVSILPKVLSSNAYFGDIETNHWSSNAK 240
XYLB_BACSU EKFGDHLLDITKNRVLEGFTLPKMLWVKEHEPELFKKTAV------FLLPKDYVRFRMTGVIHTEYSDAAGTLLLHITRKEWSNDICNQIGISADICPPLVESHDCVGSLLPHVAAKTGL 227
XYLB_ECOLI EARVPQSRVITGNLMMPGFTAPKLLWVQRHEPEIFRQIDK------VLLPKDYLRLRMTGEFASDMSDAAGTMWLDVAKRDWSDVMLQACDLSRDQMPALYEGSEITGALLPEVAKAWGM 222
GNTK_BACSU DPETPFVIGASDGVLSNLGVNAIKKGEIAVTIGTSGAIRTIIDKPQTDEKGRIFCYALTDK----HWVIGGPVNNGGIVLRWIRDEFASSEIETATRLGIDPYDVLTKIAQRVRPGSDGL 343
GNTK_BACLI DPQTPFVIGASDGVLSNLGVNAIKKGEIAVTIGTSGAIRPIIDKPQTDEKGRIFCYALTEN----HWVIGGPVNNGGIVLRWIRDEFASSEIETAKRLGIDPYDVLTKIAERVRPGADGL 343
GLPK_BACSU GKNIPIAGAAGDQQSALFGQACFEEGMGKNTYGTGCFMLMNTGEKAIKSEHGLLTTIAWGIDGKVNYALEGSIFVAGSAIQWLRDGLRMFQDSSLSES-----------YAEKVDSTDGV 341
GLPK_MYCGE G-IVPIRAVLGDQQAALFGQLCTEPGMVKNTYGTGCFVLMNIGDKPTLSKHNLLTTVAWQENHPPVYALEGSVFVAGAAIKWLRDALKIIYSEKESDFY----------AELAKENEQNL 350
XYLB_BACSU LEKTKVYAGGADNACGAIGAGILSSGKTLCSIGTSGVILSYEEEKERDFKGKVHFFNHGKKDSFYTMGVTLAAGYSLDWF-------------KRTFAPNESFEQLLQGVEAIPIGANGL 334
XYLB_ECOLI A-TVPVVAGGGDNAAGAVGVGMVDANQAMLSLGTSGVYFAVSEGFLSKPESAVHSFCHALPQRWHLMSVMLSAASCLDW--------------AAKLTGLSNVPALIAAAQQADESAEPV 327
GNTK_BACSU LFHPYLAGERAPLWNPDVRGSFFGLTMSHKKEHMIRAALEGVIYNLYTVFLALTECMDGPVTRIQATGGFARSEVWRQMMSDIFESEVVVPESYESSCLGACILGLYATGKIDSFDAVSD 463
GNTK_BACLI LFHPYLAGERAPLWNPDVPGSFFGLTMSHKKEHMIRAALEGVIYNLYTVFLALTECMDGPVARIQATGGFARSDVWRQMMADIFESEVVVPESYESSCLGACILGLYATGKIDSFDVVSD 463
GLPK_BACSU YVVPAFVGLGTPYWDSDVRGSVFGLTRGTTKEHFIRATLESLAYQTKDVLDAMEADSNISLKTLRVDGGAVKNNFLMQFQGDLLNVPVERPEINETTALGAAYLAGIAVGFWKDRSEIAN 461
GLPK_MYCGE VFVPAFSGLGAPWWDASARGIILGIEASTKREHIVKASLESIAFQTNDLLNAMASDLGYKITSIKADGGIVKSNYLMQFQADIADVIVSIPKNKETTAVGVCFLAGLACGFWKDIHQLEK 470
XYLB_BACSU LYTPYLVGERTPHADSSIRGSLIGMDGAHNRKHFLRAIMEGITFSLHESIELFR-EAGKSVHTVVSIGGGAKNDTWLQMQADIFNTRVIKLENEQGPAMGAAMLAAFGSGWFESLEECAE 453
XYLB_ECOLI WFLPYLSGERTPHNNPQAKGVFFGLTHQHGPNELARAVLEGVGYALADGMDVVHACGIKP-QSVTLIGGGARSEYWRQMLADISGQQLDYRTGGDVGPALGAARLAQIAANPEKSLIELL 446
Differences between gluconate
and glycerol/xylulose kinases
Leads to non-orthologous gene displacement
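One simple way to quantify how close the kinases in this alignment are is pairwise percent identity over aligned, ungapped columns. The Python sketch below is a minimal illustration. The function name is an assumption, and only the first 30 alignment columns of three rows from the slide are used, so the numbers it prints describe that fragment rather than the full proteins.

```python
def percent_identity(a, b, gap="-"):
    """Percent identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0


# First 30 alignment columns of three rows from the slide.
FRAGMENT = {
    "GNTK_BACSU": "---MTSYMLGIDIGTTSTKAVLFSENGDVV",
    "GNTK_BACLI": "---MTSYMLGIDIGTTSTKAVLFSEKGDVI",
    "GLPK_BACSU": "---METYILSLDQGTTSSRAILFNKEGKIV",
}

for other in ("GNTK_BACLI", "GLPK_BACSU"):
    pid = percent_identity(FRAGMENT["GNTK_BACSU"], FRAGMENT[other])
    print(f"GNTK_BACSU vs {other}: {pid:.1f}% identity")
```

Within this fragment the two gluconate kinases are much closer to each other than either is to the glycerol kinase from the same organism.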
10/27/2005 Chaoqun Wu,Fudan University 126
Objectives of functional analysis
for different groups of proteins
• Experimentally characterized
• “Knowns” = Characterized by similarity
(closely related to experimentally characterized)
  – Make sure the assignment is plausible
• Function can be predicted
  – Extract maximum possible information
  – Avoid errors and overpredictions
  – Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)
  – Rank by importance
10/27/2005 Chaoqun Wu,Fudan University 127
Functional Prediction:
Dealing with “hypothetical” proteins
• Computational analysis
  – Sequence analysis of the new ORFs
• Mutational analysis
• Functional analysis
  – Expression profiling
  – Tracking of cellular localization
• Structural analysis
  – Determination of the 3D structure
10/27/2005 Chaoqun Wu,Fudan University 128
Structural Genomics
Protein Structure Initiative: Determine 3D Structures of All
Proteins
– Family Classification:
Organize Protein Sequences into Families, collect
families without known structures
– Target Selection:
Select Family Representatives as Targets
– Structure Determination:
X-Ray Crystallography or NMR Spectroscopy
– Homology Modeling:
Build Models for Other Proteins by Homology
– Functional prediction based on structure
10/27/2005 Chaoqun Wu,Fudan University 129
Structural Genomics:
Structure-Based Functional
Assignments
Methanococcus jannaschii MJ0577 (Hypothetical Protein)
Contains bound ATP: ATPase or ATP-Mediated
Molecular Switch
Confirmed by biochemical experiments
10/27/2005 Chaoqun Wu,Fudan University 130
Crystal structure is not a function!
10/27/2005 Chaoqun Wu,Fudan University 131
(2) Functional Prediction:
Improving functional assignments
for “unknowns”
Detailed manual analysis of
sequence similarities
Cluster analysis of protein
families (family databases)
Use of sophisticated database
searches (PSI-BLAST, HMM)
10/27/2005 Chaoqun Wu,Fudan University 132
Using comparative genomics
for protein analysis
Those amino acids that are conserved in
divergent proteins (archaeal and bacterial,
hyperthermophilic and mesophilic) are
likely to be important for catalytic activity.
Comparative analysis allows us to find subtle
sequence similarities in proteins that
would not have been noticed otherwise.
Prediction of the 3D fold and general function
is much easier than prediction of exact
biological (or biochemical) function.
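The first point on this slide, that residues conserved across divergent proteins are candidates for catalytic importance, can be sketched in Python as a scan for invariant alignment columns. The three short sequences and their labels below are invented for illustration; a real analysis would start from a curated multiple alignment and would usually tolerate conservative substitutions.

```python
def conserved_columns(alignment, gap="-"):
    """Return (column_index, residue) for columns identical in every sequence.

    `alignment` maps a sequence label to an aligned string; all strings
    must have the same length. Columns containing a gap are skipped.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    conserved = []
    for i in range(length):
        column = {s[i] for s in seqs}
        if len(column) == 1 and gap not in column:
            conserved.append((i, seqs[0][i]))
    return conserved


# Hypothetical aligned fragments from three divergent homologs.
TOY_ALIGNMENT = {
    "archaeal_hyperthermophile": "MKGDTSAH",
    "bacterial_mesophile":       "LRGDSSVH",
    "bacterial_thermophile":     "MKGDATGH",
}

print(conserved_columns(TOY_ALIGNMENT))
```

The more divergent the sequences in the alignment, the fewer columns survive this filter, so the ones that do survive are stronger candidates for functional residues.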
10/27/2005 Chaoqun Wu,Fudan University 133
Using comparative genomics
for protein analysis
For some reason, the reaction chemistry often
remains conserved even when sequence
diverges almost beyond recognition.
Sequence database searches that use exotic or
highly divergent query sequences often
reveal more subtle relationships than those
using queries from humans or standard
model organisms (E. coli, yeast, worm, fly).
Sequence analysis complements structural
comparisons and can greatly benefit
from them.
10/27/2005 Chaoqun Wu,Fudan University 134
Poorly characterized protein families
• Enzyme activity can be predicted, but the
substrate remains unknown
(ATPases, GTPases, oxidoreductases,
methyltransferases, acetyltransferases)
• Helix-turn-helix motif proteins (predicted
transcriptional regulators)
• Membrane transporters
10/27/2005 Chaoqun Wu,Fudan University 135
Improving functional assignments
for “unknowns”
• Phylogenetic distribution
  – Wide - most likely essential
  – Narrow - probably clade-specific
  – Patchy - most intriguing,
    niche-specific
• Domain association – Rosetta Stone for
multidomain proteins
• Gene neighborhood (operon organization)
10/27/2005 Chaoqun Wu,Fudan University 136
Using genome context
for functional prediction
10/27/2005 Chaoqun Wu,Fudan University 137
Problems in functional
assignments/predictions
Identification of protein-coding regions
Delineation of potential function(s) for
distant paralogs
Identification of domains in the absence
of close homologs
Analysis of proteins with low
sequence complexity
10/27/2005 Chaoqun Wu,Fudan University 138
“Unknown unknowns”
Phylogenetic distribution
– Wide - most likely essential
– Narrow - probably clade-specific
– Patchy - most intriguing,
niche-specific
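The wide / narrow / patchy distinction can be expressed as a rule on a presence/absence profile across genomes. The Python sketch below is one possible formalization and not a published criterion: the 80% cutoff for "wide", the rule that "narrow" means confinement to a single clade, and the six toy genomes are all assumptions made for illustration.

```python
def classify_distribution(presence, clade_of, wide_cutoff=0.8):
    """Classify a protein family by its phylogenetic distribution.

    presence: dict genome -> True/False (family found in that genome)
    clade_of: dict genome -> clade label
    wide_cutoff: assumed fraction of genomes needed to call a family 'wide'
    """
    present = [g for g, found in presence.items() if found]
    if not present:
        return "absent"
    if len(present) / len(presence) >= wide_cutoff:
        return "wide"      # found almost everywhere
    if len({clade_of[g] for g in present}) == 1:
        return "narrow"    # confined to one clade
    return "patchy"        # scattered over several clades


# Hypothetical genomes grouped into three clades.
CLADE_OF = {
    "g1": "Archaea", "g2": "Archaea",
    "g3": "Proteobacteria", "g4": "Proteobacteria",
    "g5": "Firmicutes", "g6": "Firmicutes",
}


def profile(*found):
    """Helper: presence profile with the listed genomes marked as present."""
    return {g: (g in found) for g in CLADE_OF}


print(classify_distribution(profile("g1", "g2", "g3", "g4", "g5", "g6"), CLADE_OF))
print(classify_distribution(profile("g3", "g4"), CLADE_OF))
print(classify_distribution(profile("g1", "g5"), CLADE_OF))
```

The cutoff is the main design choice here. With few genomes it is best treated as a tunable parameter rather than a biological constant.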
10/27/2005 Chaoqun Wu,Fudan University 139
3. To deal with the ocean of new
sequences, we need a “natural”
protein classification
Protein families are real and reflect evolutionary
relationships
Protein classification systems can be used to
– Improve sensitivity of protein identification
– Provide new protein sequence annotation, simplifying
the search for non-obvious relationships
– Detect and correct genome annotation errors
systematically
– Drive other annotations (active site, etc.)
– Provide basis for evolution, genomics and proteomics
research
Discovery of New Knowledge by
Using Information Embedded within
Families of Homologous Sequences and Their Structures
10/27/2005 Chaoqun Wu,Fudan University 140
The ideal system would be:
• Comprehensive, with each sequence
classified either as a member of a family or as an
“orphan” sequence, a family of one
• Hierarchical, with families united into
superfamilies on the basis of distant homology
• Allow for simultaneous use of the whole protein and
domain information (domains mapped onto proteins)
• Allow for automatic classification/annotation of new
sequences when these sequences are classifiable
into the existing families
• Expertly curated (family name, function, evidence
attribution (experimental vs. predicted), background,
etc.). This is the only way to avoid annotation errors
and prevent error propagation
10/27/2005 Chaoqun Wu,Fudan University 141
The ideal system has yet to
be created,but there are
several very useful systems
10/27/2005 Chaoqun Wu,Fudan University 142
Levels of Protein Classification
Level: example; basis of grouping; evolutionary relationship
Class: α/β; composition of structural elements; no relationships
Fold: TIM-barrel; topology of folded backbone; possible monophyly above and below
Superfamily: Aldolase; recognizable sequence similarity (motifs), basic biochemistry; monophyletic origin
Family: Class I Aldolase; high sequence similarity (alignments), biochemical properties; evolution by ancient duplications
COG: 2-keto-3-deoxy-6-phosphogluconate aldolase; orthology for a given set of species, biochemical activity, biological function; origin traceable to a single gene in LCA
LSE: PA3131 and PA3181; paralogy within a lineage; evolution by recent duplication and loss
10/27/2005 Chaoqun Wu,Fudan University 143
Protein Evolution
Tree of Life & Evolution
of Protein Families
(Dayhoff,1978)
Can build a tree
representing evolution of
a protein family,based
on sequences
Orthologous gene family:
organismal and
sequence trees match
well
10/27/2005 Chaoqun Wu,Fudan University 144
Protein Evolution
Homolog
– Common Ancestors
– Common 3D
Structure
– Common Active Sites
or Binding Domains
Ortholog
– Derived from
Speciation
Paralog
– Derived from
Duplication
10/27/2005 Chaoqun Wu,Fudan University 145
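Paralog: a homolog that arose by gene duplication.
Paralogs are the multiple proteins produced by duplication of the same gene within one organism.
Ortholog: a homolog that arose by speciation.
Orthologs are a group of genes that descend from the same ancestor and perform the same function in different organisms.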
10/27/2005 Chaoqun Wu,Fudan University 146
Orthologs and Paralogs
[Figure: globin gene tree. Clade labels: Craniata, Myxinidae, Vertebrata, Teleostomi, Tetrapoda, Amphibia, Mammalia. Leaves: Hb and Myo (Hagfish); Myo, HbA and HbB (Cod); Myo, HbA and HbB (Frog); Myo, HbA and HbB (Rat).]
10/27/2005 Chaoqun Wu,Fudan University 147
Orthologs and Paralogs
[Figure: the same globin tree viewed from the LCA of Craniata. Two groups are marked: COG myoglobins and COG hemoglobins. Leaves: Hb and Myo (Hagfish); Myo, HbA and HbB (Cod); Myo, HbA and HbB (Frog); Myo, HbA and HbB (Rat).]
10/27/2005 Chaoqun Wu,Fudan University 148
Orthologs and Paralogs
Orthologs (COG Myo): Myo (Hagfish), Myo (Cod), Myo (Frog), Myo (Rat)
Orthologs (COG Hb): Hb (Hagfish), HbA (Cod), HbB (Cod), HbA (Frog), HbB (Frog), HbA (Rat), HbB (Rat)
Out-paralogs (globin family): members of COG Myo versus members of COG Hb
In-paralogs (LSE in Vertebrata): HbA and HbB, each forming a SubCOG
10/27/2005 Chaoqun Wu,Fudan University 149
Orthologs and Paralogs
[Figure: the same globin tree viewed from the LCA of Vertebrata. Groups marked: COG myoglobins, COG hemoglobins, COG hemoglobins A. Leaves: Hb and Myo (Hagfish); Myo, HbA and HbB (Cod); Myo, HbA and HbB (Frog); Myo, HbA and HbB (Rat).]
10/27/2005 Chaoqun Wu,Fudan University 150
Orthologs and Paralogs
Orthologs (COG Myo): Myo (Cod), Myo (Frog), Myo (Rat)
Orthologs (COG HbA): HbA (Cod), HbA (Frog), HbA (Rat)
Orthologs (COG HbB): HbB (Cod), HbB (Frog), HbB (Rat)
Out-paralogs (globin family): members of different COGs
10/27/2005 Chaoqun Wu,Fudan University 151
Levels of Protein Classification
Level: example; basis of grouping; evolutionary relationship
Class: α/β; composition of structural elements; no relationships
Fold: TIM-barrel; topology of folded backbone; possible monophyly above and below
Superfamily: Aldolase; recognizable sequence similarity (motifs), basic biochemistry; monophyletic origin
Family: Class I Aldolase; high sequence similarity (alignments), biochemical properties; evolution by ancient duplications
COG: 2-keto-3-deoxy-6-phosphogluconate aldolase; orthology for a given set of species, biochemical activity, biological function; origin traceable to a single gene in LCA
LSE: PA3131 and PA3181; paralogy within a lineage; evolution by recent duplication and loss
10/27/2005 Chaoqun Wu,Fudan University 152
Protein Family-Domain-Motif
Domain: Evolutionary/Functional/Structural Unit
Domain = structurally compact, independently folding
unit that forms a stable three-dimensional structure and
shows a certain level of evolutionary conservation.
It usually corresponds to an evolutionary unit.
A protein can consist of a single domain or multiple
domains. Proteins have modular structure.
Motif: Conserved Functional/Structural Site
10/27/2005 Chaoqun Wu,Fudan University 153
Protein Evolution: Sequence
Change vs. Domain Shuffling
If enough similarity
remains,one can trace the
path to the common origin
What about these?
10/27/2005 Chaoqun Wu,Fudan University 154
Recent Domain Shuffling
[Figure: domain architectures of superfamilies SF006786, SF001501, SF001499, SF005547, SF001424 and SF001500, built from CM (AroQ type), PDH, PDT and ACT domains]
10/27/2005 Chaoqun Wu,Fudan University 155
Protein classification: proteins and
domains
Option 1: classify domains
- take individual domain sequences,
consider them as independently evolving
units and build a classification system
- allows one to go all the way to the deepest
possible level, the last point of traceable
homology and common origin (fold)
- domain databases (Pfam, SMART, CDD)
allow domains to be mapped onto a query
sequence
10/27/2005 Chaoqun Wu,Fudan University 156
Protein classification: proteins and
domains
Option 2: classify full-length proteins
- In cases of multidomain proteins, does not
allow one to go deep along the evolutionary
tree
- All proteins in a family will often have a
common biological function, which is very
convenient for annotation
- Domains will be mapped onto protein
families
10/27/2005 Chaoqun Wu,Fudan University 157
Practical Classification of Proteins:
Setting Realistic Goals
We strive to reconstruct the natural classification
of proteins to the fullest possible extent
BUT
Domain shuffling rapidly degrades the continuity in the protein
structure (faster than sequence divergence degrades similarity)
THUS
The further we extend the classification,the finer is
the domain structure we need to consider
SO
We need to compromise between the depth of analysis
and protein integrity
10/27/2005 Chaoqun Wu,Fudan University 158
Classification: current status
PIR Superfamilies:
Proteins in PIR-PSD: 283,289
Proteins classified: 187,871
(2/3 of the PIR proteins)
COGs:
~ 70% of each microbial genome
~ 50% of each Eukaryotic genome in 3-clade
COG
~ 20%? of each Eukaryotic genome in LSEs
10/27/2005 Chaoqun Wu,Fudan University 159
4. Workshop Tools and Websites
http://pga.lbl.gov/Workshop/April2002/Classwork/webtools9.htm
Prosite (Protein): Prosite Site DB of Proteins Families and Domains. Protein family and domain keyword search.
ExPaSy (multiple dbs): Swiss Inst. of Bioinformatics Mol. Bio. Server. Multiple dbs, DNA and proteomics tools and links.
PIR (Protein): Protein Data. Functionally annotated protein sequences.
PDB (Protein): Protein Data Bank. Search by ID or field match.
COG (Protein): Clusters of Orthologous Groups of Proteins. Phylogenetic classification of proteins from 21 complete genomes.
Swiss-Prot (tools): Swiss Prot Protein Database. Searchable annotated protein sequence database.
SCOP (Protein): Structural Classification of Protein (SCOP). Mirrors and links to servers of the Structural Classification of Proteins DB.
10/27/2005 Chaoqun Wu,Fudan University 160
COGs
Clusters of Orthologous Groups of proteins
Database of clusters of orthologous groups of proteins
(http://www.ncbi.nlm.nih.gov/COG/)
Phylogenetic classification of proteins encoded
in complete genomes. COGs were delineated by
comparing protein sequences encoded in
complete genomes, representing major
phylogenetic lineages. Each COG consists of
individual proteins or groups of paralogs from at
least 3 lineages and thus corresponds to an
ancient conserved domain.
10/27/2005 Chaoqun Wu,Fudan University 161
PIR (Protein Information Resource),
a protein sequence database resource
(http://www.nbrf.georgetow.edu/pir/)
PIR produces the Protein Sequence
Database (PSD) of functionally annotated
protein sequences, which grew out of the
Atlas of Protein Sequence and Structure
(1965-1978) edited by Margaret Dayhoff and
has been incorporated into an integrated
knowledge base system of value-added
databases and analytical tools.
10/27/2005 Chaoqun Wu,Fudan University 162
iProClass
A central point for exploration of protein
information; provides summary descriptions of
protein family, function and structure for
PIR-PSD, Swiss-Prot, and TrEMBL sequences,
with links to over 50 biological databases.
Release 2.15, 3-Feb-2003, contains 877,059
entries.
10/27/2005 Chaoqun Wu,Fudan University 163
PIR-NREF
A comprehensive database for sequence
searching and protein identification; contains
non-redundant protein sequences from
PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept,
and PDB. Release 1.15, 3-Feb-2003, contains
1,131,046 entries.
10/27/2005 Chaoqun Wu,Fudan University 164
PIR Web Site (http://pir.georgetown.edu)
10/27/2005 Chaoqun Wu,Fudan University 165
PIR Superfamily Concept
– Whole (Full-Length) Proteins
– Homeomorphic (Common Domain
Architecture)
– Monophyletic (Common Evolutionary
Origin)
– Hierarchical structure (Family and
Superfamily)
– Non-Overlapping placement within each
level
10/27/2005 Chaoqun Wu,Fudan University 166
PIR Superfamily vs. Other Concepts
– Evolution: Superfamily hierarchy reflects
orthology and paralogy
– Structure: PIR superfamily generally
corresponds to SCOP family
– Domain: Domains are mapped onto the
Superfamily
– Motif: Motifs (functional/structural sites)
are mapped onto the Superfamily
– Function: a Superfamily may contain
divergent functions
10/27/2005 Chaoqun Wu,Fudan University 167
PIR Superfamilies
Created by automated clustering by % identity
with coverage-by-length requirements.
Creation of new Superfamilies is an ongoing
process.
Automated classification rules are refined by
expert curation:
- Evolution rates are very different in different
“branches” of the protein universe, so very
different score cutoffs are needed
- Verify/add members
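The automated step described here, clustering by percent identity with a coverage-by-length requirement, can be sketched as single-linkage clustering over precomputed pairwise hits. This Python sketch is not PIR's actual procedure: the 50% identity and 80% coverage cutoffs, the definition of coverage as aligned length divided by the longer sequence, and the four toy proteins are all assumptions for illustration.

```python
def cluster(lengths, hits, min_identity=50.0, min_coverage=0.8):
    """Single-linkage clustering of proteins from pairwise alignment hits.

    lengths: dict protein -> sequence length
    hits: iterable of (protein_a, protein_b, percent_identity, aligned_length)
    A hit links two proteins only if identity AND coverage pass the cutoffs;
    coverage is measured against the longer sequence, so a single-domain
    protein does not get merged with a longer multidomain one.
    """
    parent = {p: p for p in lengths}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_len in hits:
        coverage = aligned_len / max(lengths[a], lengths[b])
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    groups = {}
    for p in lengths:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical data: P4 is twice as long as the others (an extra domain).
LENGTHS = {"P1": 300, "P2": 310, "P3": 305, "P4": 600}
HITS = [
    ("P1", "P2", 62.0, 295),
    ("P2", "P3", 55.0, 290),
    ("P3", "P4", 70.0, 300),  # high identity, but covers only half of P4
]

print(cluster(LENGTHS, HITS))
```

The coverage requirement is what keeps the resulting groups homeomorphic. In the toy data P4 matches P3 with high identity, but it stays in its own cluster because the match spans only half of its length.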
10/27/2005 Chaoqun Wu,Fudan University 168
• Annotation (at level of orthology): Superfamily
Name, Description, Bibliography
• In some cases, more than one orthologous
group will be included in a single Superfamily;
these Superfamilies will often be very large and
diverse
• Depth of hierarchy will be different for
single-domain and multidomain proteins
This is work in progress and will become
available through PIR (iProClass) and InterPro
10/27/2005 Chaoqun Wu,Fudan University 169
CM-Related Superfamilies
Chorismate Mutase (CM): AroQ class
– SF001501 – CM (Prokaryotic type) [PF01817]
– SF001499 – tyrA bifunctional enzyme (Prok) [PF01817-PF02153]
– SF001500 – pheA bifunctional enzyme (Prok) [PF01817-PF00800]
– SF017318 – CM (Eukaryotic type) [Regulatory Dom-PF01817]
Chorismate Mutase: AroH class
– SF005965 – CM [PF01817]
AroH
AroQ Euk
AroQ Prok
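The contrast between classifying domains and classifying full-length (homeomorphic) proteins can be illustrated with the Pfam architectures listed above. In the Python sketch below the protein names are hypothetical placeholders and the function names are assumptions. Only the Pfam identifiers and the three prokaryotic AroQ architectures come from the slide.

```python
from collections import defaultdict

# Hypothetical proteins annotated with ordered Pfam domain architectures.
ARCHITECTURES = {
    "protA": ("PF01817",),             # CM alone (as in SF001501)
    "protB": ("PF01817", "PF02153"),   # tyrA-like bifunctional (as in SF001499)
    "protC": ("PF01817", "PF00800"),   # pheA-like bifunctional (as in SF001500)
    "protD": ("PF01817", "PF02153"),   # second tyrA-like protein
}


def proteins_with_domain(architectures, pfam_id):
    """Domain-level view: every protein containing the domain, any context."""
    return sorted(p for p, doms in architectures.items() if pfam_id in doms)


def group_by_architecture(architectures):
    """Whole-protein view: group proteins sharing the same ordered domains."""
    groups = defaultdict(list)
    for protein, doms in architectures.items():
        groups[tuple(doms)].append(protein)
    return {arch: sorted(members) for arch, members in groups.items()}


print(proteins_with_domain(ARCHITECTURES, "PF01817"))
for arch, members in group_by_architecture(ARCHITECTURES).items():
    print("-".join(arch), members)
```

The domain-level query returns all four proteins, while grouping by full architecture separates them into three sets. This mirrors how one CM domain family is split across several whole-protein superfamilies.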
10/27/2005 Chaoqun Wu,Fudan University 170
iProClass Superfamily Report (I)
10/27/2005 Chaoqun Wu,Fudan University 171
iProClass Superfamily Report (II)
10/27/2005 Chaoqun Wu,Fudan University 172
InterPro
- InterPro is an integrated resource for protein families,
domains and sites.
- InterPro combines a number of databases that use
different methodologies. By uniting the member
databases, InterPro capitalizes on their individual
strengths, producing a powerful integrated diagnostic tool.
Member databases: PROSITE, PRINTS, Pfam, SMART,
ProDom, and TIGRFAMs
PIR to be added soon
SWISSPROT and TrEMBL matches used as examples
10/27/2005 Chaoqun Wu,Fudan University 173
InterPro Entry
InterPro Entry Type defines the entry as a Family,
Domain, Repeat, or Post-translational modification site
(other sites to be added: binding site, active site).
Family = protein family; PIR SFs will generally belong to
this type. The “Contains” field lists domains within this
protein.
“Found in”, for domain entries, lists families which
contain this domain
10/27/2005 Chaoqun Wu,Fudan University 174
InterPro Entry Type = Family
SF001500
Bifunctional chorismate mutase / prephenate dehydratase
(P-protein)
PIR Superfamilies are being
integrated into InterPro
CM PDT
ACT
10/27/2005 Chaoqun Wu,Fudan University 175
COGs
Clusters of Orthologous Groups
-complete genomes
- reciprocal best hits
- no score cutoffs
Comparative genomics
- a branch of computational biology
that uses complete genome sequences
10/27/2005 Chaoqun Wu,Fudan University 176
Construction of COGs:
Genome 2
Genome 1
10/27/2005 Chaoqun Wu,Fudan University 177
Construction of COGs:
Triangle - the simplest COG
[Figure: Yeast YLR377c, Synechocystis slr0952 and E. coli fbp, connected pairwise by bidirectional best hits]
10/27/2005 Chaoqun Wu,Fudan University 178
Construction of COGs:
Merge triangles
10/27/2005 Chaoqun Wu,Fudan University 179
Construction of COGs:
Add all homologs
[Figure: a new protein is added to the COG formed by Synechocystis slr0952, Yeast YLR377c and E. coli fbp]
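The construction steps on these slides (bidirectional best hits between complete genomes, triangles of such hits, merging of triangles) can be sketched in Python as follows. This is a simplified illustration and not the NCBI COG pipeline. The gene names YLR377c, slr0952 and fbp are taken from the slide, whereas geneX, Genome4, the best-hit table, the function names and the rule that triangles are merged when they share a side are assumptions.

```python
from itertools import combinations


def bbh_pairs(genome_of, best_hit):
    """Pairs of genes that are each other's best hit (no score cutoff).

    best_hit maps (query_gene, target_genome) -> best-matching gene there.
    """
    pairs = set()
    for (gene, _target_genome), hit in best_hit.items():
        if best_hit.get((hit, genome_of[gene])) == gene:
            pairs.add(frozenset((gene, hit)))
    return pairs


def triangles(genome_of, pairs):
    """Triples of genes from three different genomes, all pairwise BBHs."""
    found = []
    for a, b, c in combinations(sorted(genome_of), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue
        if all(frozenset(p) in pairs for p in ((a, b), (a, c), (b, c))):
            found.append({a, b, c})
    return found


def merge_triangles(tris):
    """Merge triangles that share a side (two genes) into larger groups."""
    cogs = []
    for tri in tris:
        group = set(tri)
        merged = True
        while merged:
            merged = False
            for cog in cogs:
                if len(cog & group) >= 2:
                    group |= cog
                    cogs.remove(cog)
                    merged = True
                    break
        cogs.append(group)
    return cogs


GENOME_OF = {"YLR377c": "Yeast", "slr0952": "Synechocystis",
             "fbp": "E. coli", "geneX": "Genome4"}
BEST_HIT = {
    ("YLR377c", "Synechocystis"): "slr0952", ("slr0952", "Yeast"): "YLR377c",
    ("YLR377c", "E. coli"): "fbp",           ("fbp", "Yeast"): "YLR377c",
    ("slr0952", "E. coli"): "fbp",           ("fbp", "Synechocystis"): "slr0952",
    ("geneX", "E. coli"): "fbp",             ("fbp", "Genome4"): "geneX",
    ("geneX", "Synechocystis"): "slr0952",   ("slr0952", "Genome4"): "geneX",
    ("geneX", "Yeast"): "YLR377c",           # one-directional, not reciprocated
}

pairs = bbh_pairs(GENOME_OF, BEST_HIT)
print(merge_triangles(triangles(GENOME_OF, pairs)))
```

In this toy table geneX forms a triangle with slr0952 and fbp but has no reciprocal hit with YLR377c. It still ends up in the same group because the two triangles share the slr0952-fbp side.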
10/27/2005 Chaoqun Wu,Fudan University 183
In COGs,the dilemma between the depth of analysis and
protein integrity is approached by keeping proteins intact
whenever possible,and dividing into modules (single- or
multidomain) when necessary
10/27/2005 Chaoqun Wu,Fudan University 184
Case Study 1:
Prediction verified: GGDEF domain
Proteins containing this domain: Caulobacter crescentus PleD
controls swarmer cell - stalk cell transition (Hecht and Newton,
1995). In Rhizobium leguminosarum and Acetobacter xylinum, it is
required for cellulose biosynthesis (regulation).
Predicted to be involved in signal transduction because it is found
in fusions with other signaling domains (receiver, etc.)
10/27/2005 Chaoqun Wu,Fudan University 185
• In Acetobacter xylinum, cyclic di-GMP is a
specific nucleotide regulator of cellulose
synthase (signalling molecule). A multidomain
protein with a GGDEF domain was shown to
have diguanylate cyclase activity (Tal et al.,
1998)
• Detailed sequence analysis tentatively
predicts GGDEF to be a diguanylate cyclase
domain (Pei and Grishin, 2001)
• Complementation experiments prove
diguanylate cyclase activity of GGDEF
(Ausmees et al., 2001)
10/27/2005 Chaoqun Wu,Fudan University 186
Case study 2:
Defining a novel domain family
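Prokaryotic Response Regulators (RRs)
[Figure: RR domain architecture: a CheY-like receiver domain followed by a variable output domain (DNA-binding or enzymatic)]
What if the domain is not described yet?
[Figure: a CheY receiver domain followed by an undescribed C-terminal region]
PSI-BLAST with C-terminal portion alone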
10/27/2005 Chaoqun Wu,Fudan University 187
Two Groups of Unusual RRs
[Receiver-X] SF006198, COG3279
1. AlgR-related
• Pseudomonas aeruginosa (AlgR): alginate
biosynthesis
• Klebsiella pneumoniae (MrkE): formation of
adhesive fimbriae
• Clostridium perfringens (VirR): virulence factors
2. Regulators of autoinduced peptide-controlled
regulons
• Staphylococcus aureus (AgrA): virulence factors
• Lactobacillus plantarum (PlnC, PlnD):
bacteriocin production
• Streptococcus pneumoniae (ComE): competence
10/27/2005 Chaoqun Wu,Fudan University 188
Properties of the CheY-LytTR transcriptional
regulators
• Regulate secreted and extracellular factors
• Often regulate their own expression
• Bind to imperfect direct repeat sites in the -80 to
-40 area (or in UAS)
• Can be phosphorylated by His kinases, but
form operons with HisK-type sensor ATPases
• Contain a conserved LytTR-type
DNA-binding domain
10/27/2005 Chaoqun Wu,Fudan University 189
LytTR - a new DNA-binding domain
not similar to HTH, winged helix,
or ribbon-helix-helix DNA-binding domains
Gram-negative response regulators
AlgR_Psaer 142 ISARTRKGIELIPLEEVIFFIAD-HKYVTLRHAQGEVLLDEPLKALEDEFG--ERFVRIHRNALVARERIERLQRTPLGHFQLYLKGLDGDALTVSRRHVAGVRRLMHQ 247
AlgR_Pssyr 142 ISARTRKGIELIPLDQVIYFIAD-HKYVTLRHEGGEVLLDEPLKALEDEFG--DRFVRIHRNALVARERIERLQRTPLGHFQLFLRGLNGDALIVSRRHVAGVRKMMQQ 247
AlgR_Azoto 142 ISARTRRGIELVPVDKAIFFIAD-HKYVTLRHESGEVLLDEPLKALEDEFG--DRFVRIHRNALVARDRIERLQRTPLGHFQLYLRGLGDAALTVSRRHVAGVRKLMHN 247
PprA_Psput 141 ISARTRKGIELIPLPQVIYFIAD-HKYVTLRHETGEVLLDEPLKALEDEFG--DQFVRIHRNALVARERIERLQRTPLGHFQLFLKGLDGDALTVSRRHVAGVRKMMQT 246
MrkE_Ecoli 139 INLVKDERIIVTPINDIYYAEAH-EKMTFVYTRRESYVMPMNITEFCSKLPP-SHFFRCHRSFCVNLNKIREIEPWFNNTYILRLKDLD-FEVPVSRSKVKEFRQLMHL 244
MrkE_Klepn 145 INLIKDERIIVTSIHDIYYAEAH-EKMTFVYTRRESFVMPMNITEFCSKLPT-AHFFRCHRSYCVNLSKIREIEPWFNNTYVLRLRDLE-FQVPVSRSKVKEFRQLMNL 250
CC3036_Ccr 159 LASTPSEAPPSADTSNVLYLRME-DHYVRIRTEHGSRLEMGPLARVTAMLTG-IEGLQTHRSWWVARRAIAGVVRDGR-NLRLRLV--DGETAPVSRASVAKLRAAGWL 262
VC0693_Vch 135 IPCTGLNRIVLLPINEVEFAYSD-ISGVNVQTAQQKATSQLTLKVLEEK----TALVRCHRQYLVNLKAIREIKLLENGLAEMITH--AGHKVPVSRRYLKELKEMLGF 236
VCA0850_Vc 157 LKASKGEEIHLIAVNELLYVKAE-DKYLSLYKV5HEYLLRSSLKELLAQLDP-NQFWQIHRSIVVNVGKIDKVTRDFGGKMWVHID---RLQLPVSRALQHLFKVS--- 261
YehT_Ecoli 137 IPCTGHSRIYLLQMKDVAFVSSR-MSGVYVTHE2KEGFTELTLRTLESR----TPLLRCHRQYLVNLAHLQEIRLEDNGQAELILR--NGLTVPVSRRYLKSLKEAIGL 239
YPO3287_Yp 131 IPCSGHNRIFLLKIEEVEYLSSE-LSGVHVVGTVQSGYTQLSLKTLEEK----TPFIRCHRQYMVNTEQLGEIQLMDNGAAEVITR--SGKHIPVSRRYLKSLKEKLGI 237
Gram-positive response regulators
LytR_Staph 141 LPVEIDDKIHMLKQQNIIGIGTH-NGITTIHTTNHKYETTEPLNRYEKRLNP-TYFIRIHRSYIINTKHIKEVQQWFNYTYMVILT--NGVKMQVGRSFMKDFKASIGL 245
LytT_Bacsu 137 LALSVGESIVIVDTKDIIYAGTE-DGHVNVKTFDHSYTVSDTLVVIEKKLPD-SDFIRVHRSFVVNTEYIKEIQPWFNSTYNLIMK--DGSKIPVSRTYAKELKKLLHI 241
AgrA_Staph 143 IELKRGSNSVYVQYDEVMFFESS2SHRLIAHLDNRQIEFYGNLKQISQID---DRFFRCHNSFVVNRHNISSINSKE--RVVYFK---NGEHCYASVRNVKKV----- 238
AgrA_Clobu 144 FTIKADDRIINIEFQKILFFETS2IHKVVLHSVNRQIEFYAKMKDIEGELD--DSFYRCHKSYIVNKKNIKEININK--RRIYMI---NGEECLISTRMLKGLIK---- 241
VirR_Clost 147 ITIKDKNNVLKIRTEDILFLETF-ERKVIIHTNSQDYIVKMSMNKLEKELNN-KGFFRCHTSYIVNLIKIEEIKKD----YLLI----NKFTLPVSKHRMKNLKLRLTS 245
EntR_Enter 146 FVFSIGSQTFTFDINDIYFVEAS2PHRLSLCTKDGQYEFYGRISELEKKY---PMLTRISRACLVNIFNVKEIDFKK--RSLYFD-------SELARNFTLGKAQKIKE 243
PlnC_Lacto 144 LGYKIGTRFFSVPINDVIMLSTN3PGSIRLTAKNKVADFPGNLNSFENKY---SQFFRCDKSSLVNIDYVDSYDYQKKKELTMIDN-ICSVSYRKSRELNKILKKK--- 247
PlnD_Lacto 144 FNYKLGTRYFSLALDDVILLSTS3PGSVQLHAINKVAEFPGNLNALEEKY---PQFFRCDKSL-VNLNHLRSFDYKEK-ELLLDGEIRCKASFRKSRELNKLLRDN--- 247
SapR_Lacto 146 FTFSIGSQVFSIDKREILFIESS2PHRVILNTKNGHYEFYGNLNDLSEKY---PYLFRINRNCLVNLKNIIEIDFKS--REILFG---SDLSRKFARGKSNQLKRAFSQ 247
BlpR_Spneu 142 FYFKSKFAQFQYPFKEVYYLETS2AHRVILYTKTDRLEFTASLEEVFKQE---PRLLQCHRSFLINPANVVHLDKKE--KLLFFP---NGGSCLIARYKVREVSEAINK 243
ComE_Spneu 142 FDYNYKGNDLKIPYHDILYIETT2SHKLRIIGKNFAKEFYGTMTDIQEKDKHTQRFYSPHKSFLVNIGNIREIDRK---NLEIVFY--EDHRCPISRLKIRKLKDILEK 246
ORF1_Bfibr 130 LKIRLDGENHFFKESTISYVEGM-GKNCILHFC3EKMECRETLGAIEARLSS-KKFYRCYKSYLINLAQVDSY-----NHEEVTMS--TGEKLLISRLKYKEFNNIYAD 231
Other
BlpS_Spneu 4 MIIQTQKTVYKVNIDDIYYIQTH3AHTVQIVTEEASFNMLQNLSNLENQCG--ETLMRCHRNCLVNLDKLKSIDFQE--RILFLGEE-GQYAVKYARRRYREIRQKWLK 109
Lmo0984_Lm 42 ISVKKDGATYLLEPKAILYFEAV-ESKIFVYTEKEIYEIHWKLYELEEKFKE-SSFFRCSKSMILNIEWIEKIAPGF-NGKFEAKLL-NREKVIISRQYAKVLKQKLNM 146
SP0161_Spn 46 IKGKIDDQVYLVEIGKIQRFYIE-NRKVLAETASQTYSIDLRLYQVLKLLP--TNFIQISQSEIINIDSISHLKLTP-NGLVEIFLK-NESFTYSSRRYLKTIKEKLEL 146
SA2153_Sau 44 LVGYIDKEIHIINVSDVITFQVI-NKNVTAITSNQKFKLKLRLYELEKQLP--QHFIRISKSEIVNKYYIEKLLLEP-NGLIRMYLK-DAHYTYSSRRYLKSIKERLSI 147
BH3894_Bha 238 IPAKVNEKMILFDPTEIDYIESH-EGFANLHVKGQVFLCTMTLNELDGRLKP-FGFFRCHRSYLVNLQKVREVMTWSRNSYSLILDTAQHDSIPLSKGRLAELKQLIGI 344
CoxC_Oligo 290 LPVEKYHGSKVINTADIFSVHAN-AHYTYVFNGREDIFCPLSIGEIIERLPP-DTFFRVHRSYIINIHCVARLKRAGDNGIAELDSP-VRRSVPVSRSRLPQLREQLRE 393
CoxH_Oligo 280 LPVEKNGQRCTMLISRLFAVQPQ-AHYTLLFDGEATWFCPLSLSQVAKVLDS-ANFAQVHRSHIIDLDRFRLVRGTGNGGMFEAISK-TPYKVPLSRARRAWIKQQLQV 385
RpfD_Xanth 190 FLVRKLGRDFLVAAADIEWLQAS-GNYVNLHVRGHDYPLRSTMTAIEAKLDP-AVFVRIHRSYIVNLGCVVSIEPLDAGEARVHLS--GNGPLPCSRRYLASLRRAAGH 294
CC0551_Ccr 407 LTVASARGVELVPLAEILSVVGA-DDYVELRLV2RSLLHAARLDALTTQLP--VSFLRVHRSVIANLTHAQRLERDGD-RWRLHLNE-GP-PLPVSRSRQPAVREALAD 510
H.pseudoflava 1 -----------VDTAEVLSLESQ-AHYTRVLTREGFHFCNLSIGDLQARLDP-AQFMRVHRCFIVNLQAVAELGREGS-KTQIVLKGKHREPIPVSRSSGAAAEALGCW 91
Unfinished genome sequences
B.cepacia 161 IPVYRKNRVILLDLKDIVRFQGD-GHYTTIVTRDDRYLSNLSLADLELRLDS-SIYLRVHRSHIVSLQYAVELVKLD-ESVNLVMDDAEQTQVPVSRSRTAQLKELLGV 266
Cl.difficile ISLWKGDKLVVIDIDDIYYCEAN-ERQTFIYTEKEKFILKEGISEVENLIND-KTFFRTHRSYIVNLTKVKEIIPWFNNTYILKLKN-SDYEVTVSRSKVKEFRLLMHI
C.hydrogenof KGGKTVLVGEQEIIYAYTE-QDYVYLKTYQDKYFTRFTLKELEGRLNP-TVFFRCHRCFIVNLQKVKEIIPFFNGTYTLVVDDKEKSEVPVSRAQAKKLRKILGL
Desulfovibrio DGKTLLIPYGEIAFVEAY-EDYSYVHTADDKYLTSYRLKNLEGRLRP-HRFFRVHRKYLVNLDMVTELAAVSGGNCLLRTAGRTRIELPISRRRLGELKQILGL 309
STRUCTURE PhD LL.L.LL.EEE.LLL..EE....-.LL.EEE.....EEEL..HHHHH..LLL-LL.EEEEEEEE.L.HHHHHH..L.LL..EEEE.LL-LLL...LL.HHHHHHHH.LL
Rel_sec 99454781786167632751322-468167704423555532355531499999-256858876174446785215465327866279-97732366469999998179
STRUCTURE PSI CCCCCCC-EEEEEHHHEEEEEEE-CCEEEEEECCCEEECCCHHHHHHHHCCHH--HHHHHHHHHHHHHHHHEEECCCCCEEEEEEECCCCCEECCCCHHHHHHHHHHCC
Conf 84322770498535444556756-77699998086486043178889761610--688899999988640034118898689999889988402131457899987519
STRUCTURE T99 EEEEELLLEEEEEHHHEEEEEEL-EEEEELLLLLLLEEELHHHHHHHHHHLL--LEEEEEHHHHHHHHHHHHHLLLLLLLEEEEELLLLLLLLLLLHHHHHHHHHHHHL
jhmm --------EEEEE---EEEEE-- --EEEEEE---EEE----HHHHHHH-- --EEE--HHHHHHHHHHHHEE--- --EEEEE-------EEHHHHHHHHHHHHH--
nnssp EEEE----EEEE----EEEEE-- --EEEEEE----EEE---HHHHHHHHH HHHHHH----EE-HHHHHHHHH-- ---HHHHH----------HHHHHHHHHHHHH
phd --EEE---EEEEE---EEEEEE- --EEEEEE----EEE---HHHHHHHH- HHHHHHHHHHHEEHHHHHHHHH-- ---EEEEE----------HHHHHHHHHHH--
pred --------EEEE----EEEEE-- --EEEEEE---HHHHHHHHHHHHHH-- ---EEEHHHHHHHHHHHHHHHH-- ----EEEE---------HHHHHHHHHHH---
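One way to read an alignment like this is to score each column for conservation. The sketch below is only an illustration: it takes an 11-column excerpt around the "R.HR" motif, copied from five rows of the alignment, and assumes the excerpt columns line up as they do on the slide. The names ALIGNED and column_profile are hypothetical, not part of any tool used to make the slide.

```python
from collections import Counter

# 11-column excerpt around the conserved "R.HR" motif, copied from five
# rows of the alignment on this slide (assumed to be column-aligned).
ALIGNED = {
    "AlgR_Psaer": "FVRIHRNALVA",
    "MrkE_Ecoli": "FFRCHRSFCVN",
    "YehT_Ecoli": "LLRCHRQYLVN",
    "LytR_Staph": "FIRIHRSYIIN",
    "LytT_Bacsu": "FIRVHRSFVVN",
}


def column_profile(rows):
    """Return (most common residue, fraction of rows sharing it) per column."""
    seqs = list(rows.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*seqs):
        residue, count = Counter(column).most_common(1)[0]
        profile.append((residue, count / len(column)))
    return profile


profile = column_profile(ALIGNED)

# Columns in which every row carries the same residue.
invariant = [i for i, (_, frac) in enumerate(profile) if frac == 1.0]

for i, (residue, frac) in enumerate(profile):
    print(f"column {i:2d}: {residue} {frac:.0%}")
print("invariant columns:", invariant)
```

In this excerpt only the R, H and R positions are identical in all five rows, which is consistent with the motif visible by eye in the full alignment. A real analysis would use all rows and treat gaps explicitly.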
10/27/2005 Chaoqun Wu,Fudan University 190
Domain organization of LytTR proteins
other than CheY-LytTR
Stand-alone LytTR
  Streptococcus pneumoniae BlpS
  Pseudomonas phage D3 Orf50
40aa - LytTR
  Lactococcus lactis L121252
  Listeria monocytogenes Lmo0984
  Staphylococcus aureus SA2153
  Streptococcus pneumoniae SP0161
ABC - LytTR
  Bacillus halodurans BH3894
MHYT - LytTR
  Oligotropha carboxidovorans CoxC, CoxH
3TM - LytTR
  Xanthomonas campestris RpfD
  Caulobacter crescentus CC1610
  Mesorhizobium loti mll0891
3TM - LytTR
  Caulobacter crescentus CC0295
4TM - LytTR
  Caulobacter crescentus CC0330, CC3036
8TM - LytTR
  Caulobacter crescentus CC0551
PAS - LytTR
  Burkholderia cepacia
  Geobacter sulfurreducens
10/27/2005 Chaoqun Wu,Fudan University 191
Consensus binding site for the
LytTR domains
10/27/2005 Chaoqun Wu,Fudan University 192
Predicted LytTR-regulated genes
• Expected
  – Bacillus subtilis natAB (Na+-ATPase)
  – Oligotropha carboxidovorans coxC, coxH (CO growth)
  – Staphylococcus aureus lrgAB (autolysis)
  – Streptococcus pneumoniae hld (hemolysin delta)
• Unexpected
  – Bacillus subtilis alr, dinB, rapI, veg, ybaJ, ybbI, yceA, ydbS, ydjL, yebB, yfiV, ykuA
  – Staphylococcus aureus capO, coa, hsdR, SA0096, SA0257, SA0285, SA0302, SA0357, SA0358, SA0513
10/27/2005 Chaoqun Wu,Fudan University 193
Impact of genomics
Single protein level
Discovery of new enzymes and superfamilies
Prediction of active sites and 3D structures
Pathway level
Identification of "missing" enzymes
Prediction of alternative enzyme forms
Identification of potential drug targets
Cellular metabolism level
Multisubunit protein systems
Membrane energy transducers
Cellular signaling systems
10/27/2005 Chaoqun Wu,Fudan University 194
Examples for analysis:
1. Retrieve one of the following protein sequences:
PIR: C69086, D64376; GenBank: GI:15679635. Using
analysis tools available on the web, check whether the functional
annotation is correct, and provide a correct annotation
without looking at the internal PIR or COG annotations
(run BLAST with CD-search and SMART to start with).
When you are done, look at the PIR curated SF annotation
(still at the internal site only):
http://pir.georgetown.edu/test-cgi/sf/pirclassif.pl?id=SF006549
http://pir.georgetown.edu/cgi-bin/ipcSF1?id=SF006549
(compare with the original automatic SF annotation at the
public site), and at the COG annotations. What caused the
wrong annotations? In the BLAST outputs for these
sequences, do you see other wrongly annotated proteins?
Next, analyze the C-terminal domain of these proteins by
PSI-BLAST (and alignment analysis) and suggest any
speculations as to its function (homework).
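For the retrieval step, a sequence can also be fetched by script rather than through the web pages. The sketch below only builds a request URL for the NCBI E-utilities efetch service; it is not part of the original exercise. The helper name fasta_url is hypothetical, and it is an assumption that the 2005-era GI number from the slide still resolves.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(identifier, db="protein"):
    """Build an efetch URL that requests one record in FASTA text format."""
    params = {
        "db": db,
        "id": identifier,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


# GI number taken from the exercise (assumed to be still resolvable).
url = fasta_url("15679635")
print(url)

# To actually download the record (needs network access):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     print(response.read().decode())
```

The download is left commented out so the sketch runs offline. The returned FASTA text can then be pasted into BLAST, CD-search or SMART as the exercise describes.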
10/27/2005 Chaoqun Wu,Fudan University 195
Examples for analysis:
2. Retrieve the following sequence: GI:7019521.
Take a look at the associated publication (reference).
Analyze the sequence to see if any additional
information can be obtained (run PSI-BLAST and, as
homework, construct a multiple alignment).
Take a look at the taxonomy report. What does it tell you?
Find an experimental paper associated with one of the
sequences found by PSI-BLAST. What annotation is
appropriate for this sequence and for the entire family?
10/27/2005 Chaoqun Wu,Fudan University 196
Examples for analysis:
3. Predict the function of the following proteins:
GenBank: GI:27716853
E. coli YjeE protein
Verify and/or correct the following functional annotations.
Can you explain why the erroneous annotations were
made?
PIR: H87387
GenBank: GI:15606003, GI:15807219
PIR: F70338
10/27/2005 Chaoqun Wu,Fudan University 197
Examples for analysis:
4. Homework: an exercise in transitive relationships.
Start with
>gi|20093648|ref|NP_613495.1| Uncharacterized membrane protein,
conserved in Archaea [Methanopyrus kandleri AV19]
(this is a short membrane protein). Run PSI-BLAST, making sure that
complexity filtering and CD-search are turned off. There are no good hits, but a bunch
of sub-threshold ones. Collect the "suspect" relations, use them as queries,
and expand the net. You will be able to come up with two proteins:
>gi|21227474|ref|NP_633396.1| hypothetical protein [Methanosarcina
mazei Goe1] and
>gi|14324537|dbj|BAB59464.1| hypothetical protein [Thermoplasma
volcanium]
When used as a PSI-BLAST query, the first will tie the Methanopyrus
protein into a group, while the second will tie this group to the Sec61
subunit of preprotein translocase.
Then, of course, you can obtain the same result with CD-search in a single
step.
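The "expand the net" procedure amounts to a breadth-first search over a graph of hits, where each protein used as a query points to the proteins it retrieves. The sketch below illustrates that idea only. The HITS table is a hypothetical toy stand-in for PSI-BLAST output, wired to mirror the chain described in the exercise; it is not real search data, and link_chain is an illustrative name.

```python
from collections import deque

# HYPOTHETICAL hit lists (query -> proteins it retrieves). This toy table
# only mirrors the chain described in the exercise; it is not real output.
HITS = {
    "NP_613495.1 (M. kandleri)": ["NP_633396.1 (M. mazei)"],
    "NP_633396.1 (M. mazei)": [
        "NP_613495.1 (M. kandleri)",
        "BAB59464.1 (T. volcanium)",
    ],
    "BAB59464.1 (T. volcanium)": [
        "NP_633396.1 (M. mazei)",
        "Sec61 subunit",
    ],
}


def link_chain(hits, start, goal):
    """Breadth-first search: shortest chain of queries linking start to goal."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parents to recover the chain.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for hit in hits.get(node, []):
            if hit not in parent:  # skip proteins already reached
                parent[hit] = node
                queue.append(hit)
    return None  # no chain of hits connects the two proteins


chain = link_chain(HITS, "NP_613495.1 (M. kandleri)", "Sec61 subunit")
print(" -> ".join(chain))
```

Each step in the returned chain corresponds to one more PSI-BLAST run with an intermediate protein as the query. In practice every link should be checked by hand, since sub-threshold hits can be spurious.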
10/27/2005 Chaoqun Wu,Fudan University 198
References
• Nature Insight: Proteomics. Nature 422, 191-237, 2003
• Zhu, H. et al. Proteomics. Annual Review of Biochemistry 72, 783-812, 2003
• Griffiths et al. Modern Genetic Analysis. Online, http://ncbi.nih.gov
10/27/2005 Chaoqun Wu,Fudan University 199
Internet sites
www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
(Dr Alison E. Ashcroft at Leeds)
www.asms.org (The American Society for Mass
Spectrometry)
www.spectroscopynow.com (Base Peak)
Mass Spec tools
www.expasy.ch/tools/#proteome
http://prowl.rockefeller.edu
www.mann.embl-heidelberg.de
10/27/2005 Chaoqun Wu,Fudan University 200
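Protein sequence databases:
SWISS-PROT http://www.expasy.ch/sport/
PIR http://www.nbrf.georgetown.edu./pir/pir.psihtml
Two-dimensional electrophoresis (2D PAGE)
WORLD-2DPAGE: a complete index of international 2D PAGE databases
http://www.expasy.ch/ch2d/2d-index.html
2DWG: a meta-database of 2D electrophoresis gel images
http://www.lecb.ncifcrf.gov/2dwgDB
Post-translational modifications
deltamass: a compilation of post-translational modifications
http://www.medstv.unimelb.edu.au/WWWDOCS/SVIMRDOCS/MassSpec/deltamassV2.html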