7.91 – Lecture #1
Introduction to Bioinformatics
Focus on Kinases
& Pairwise Sequence Comparisons
ARDFSHGLLENKLLGCDSMRWE
.::. .:::. .:::: :::.
GRDYKMALLEQWILGCD-MRWD
Reading:
This lecture: Mount pp.1-7, 29-35, 45-48, 51-64
Next lecture: Mount pp. 8-9, 65-89, 96-115, +more
The Protein Data Bank (PDB - http://www.pdb.org/) is the single
biological macromolecular structure data.
Research 28 (2000): 235-242.
(PDB Advisory Notice on using materials available in the archive:
worldwide repository for the processing and distribution of 3-D
Research 28 (2000): 235-242.
(PDB Advisory Notice on using materials available in the archive: http://www.rcsb.org/pdb/advisory.html)
Michael Yaffe
Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat,
H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data
Bank. Nucleic Acids Research 28 (2000): 235-242.
http://www.rcsb.org/pdb/advisory.html)
Outline
? Review of biological fundamentals
? Genetics example
? Biological databases/NCBI resources
? Simplified sequence analysis – where to start…
? Simple sequence comparisons
? Definitions of related sequences
? Concepts and types of alignments – the good,
the bad, and the ugly
? Dot matrix alignments
? Computational efficiency
? Recursion and dynamic programming
? Substitution matrices: PAM, BLOSUM, Gonnet
Outline (cont)
?Gaps
? Applied dynamic programming: global
alignments: Needleman-Wunsch
? Applied dynamic programming: local
alignments – Smith-Waterman
? Basic statistics of sequence alignments
Review of biological fundamentals
Genetic material
? gene as a concept DNA
? DNA as hereditary material
RNA
Protein
Critical
cell functions
DNA structure
A
G
C
T
U
Bases
DNA structure
A
G
C
T
Base pairing
DNA structure
1
5
4
3 2
1
5
4
3 2
Nucleoside
Nucleotide
DNA structure
5’
3’
DNA structure
5’
3’
DNA structure
5’
3’
3’
5’
RNA structure
1
5
4
3 2
1
5
4
3 2
Transcription
5’
3’
3’
5’
OH
OH
OH
DNA
RNA
Protein structure
Aspartic acid (Asp, D)
Arginine (Arg, R)
C C
O
-
O
H
Glycine (Gly, G) Alanine (Ala, A) Valine (Val, V)
H
H
3
N
+
C C
O
-
O
H
CH
3
CH
3
CH
3
H
3
N
+
C C
O
-
O
H
CH
H
3
N
+
C C
O
-
O
H
CH
H
3
N
+
H
3
C
OH CH
3
CH
3
CH
3
C C
O
-
O
H
CH
2
CH
2
CH
3
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
S
CH
3
H
3
N
+
C C
O
-
O
H
H
3
N
+
CH
2
C C
O
-
O
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
C
NH
2
C
C C
O
-
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
2
N
+
H
2
C
C C
O
-
O
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
-
O
C
O
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
CH
2
CH
2
NH
3
+
NH
2
+
NH
+
NH
C C
O
-
O
H
H
3
N
+
CH
2
C
C
O
-
O
H
H
3
N
+
CH
2
CH
2
CH
2
C
NH
NH
2
-
O
C
O
NH
2
CH
2
CH
2NH
C C
O
-
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
SH
OH
C C
O
-
O
H
H
3
N
+
CH
OH
Leucine (Leu, L) Isoleucine (IIe, I)
Methionine (Met, M)
Serine (Ser, S)
Aspartic acid (Asp, D)
Glutamic acid (Glu, E)
Lysine (Lys, K)
Arginine (Arg, R)
Histidine (His, H)
Threonine (Thr, T) Cysteine (Cys, C) Tyrosine (Tyr, Y) Asparagine (Asn, N) Glutamine (Gln, Q)
Acidic Basic
Phenylalanine (Phe, F)
Tryptophan (Trp, W)
Proline (Pro, P)
Electrically charged
Polar, Hydrophillic R-groups
Nonpolar, Hydrophobic R-groups
Codon Table
phe (UUU)
phe
leu
leu
leu
leu
leu
leu
ile
ile
ile
met
(and initiation)
val
val
val
val
ser
ser
ser
ser
tyr
tyr
termination
termination
cys
U C
(5')....pNpNpN....(3') in mRNA
Middle Base of Codon
Base at 5' End
of Codon
Base at 3' End
of Codon
A G
cys
termination
trp
pro
pro
pro
pro
thr
thr
thr
thr
his
his
gln
gln
asn
asn
lys
lys
arg
arg
arg
arg
ser
ser
arg
arg
ala
ala
ala
ala
asp
asp
glu
glu
gly
gly
gly
gly
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
DNA structure
5’
3’
3’
5’
“Sense”
strand
“Anti-Sense”
strand
Convention:
Sense strand
Read 5’? 3’
Reverse Complement
Genetics experiment:
phenotype
Isolate a yeat mutant that has increased chromosome number
Can rescue the phenotype with a piece of DNA
…corresponds to a site of mutation in the yeasts DNA
genotype
CGTTTTTCCTGTAAAGCGCTTAATTTGTTTACCATTCTATAAAAACCTTGAGCTAAGGCCAACTGATGCA
ATTGCTCAAGTGAATGCATAAACAAAGCAAGATCATTCTTAGCGCAAAAAAAACTGGGATTTTGAAATAC
AACAAAAGAAAGAAGTAAAAAGGGAATGCAACGCAATAGTTTAGTAAATATCAAACTAAACGCTAATTCG
CCATCGAAAAAGACCACAACAAGACCAAATACGTCCAGGATCAATAAACCATGGAGAATATCCCATTCGC
CGCAGCAAAGAAACCCGAATTCAAAAATACCTTCACCTGTAAGAGAAAAATTGAACAGATTACCTGTAAA
CAATAAGAAGTTTTTGGATATGGAAAGCTCCAAAATTCCATCACCTATAAGGAAAGCGACTTCTTCCAAA
ATGATACACGAAAATAAGAAGCTACCTAAATTTAAATCCCTATCACTCGATGACTTTGAACTGGGGAAGA
AATTAGGAAAGGGTAAATTCGGTAAAGTTTATTGCGTTCGGCACAGGAGTACAGGATATATTTGCGCACT
GAAAGTAATGGAGAAGGAAGAAATAATAAAGTATAATTTACAGAAACAATTCAGAAGGGAGGTAGAAATA
CAAACATCGCTAAATCATCCGAATCTAACTAAATCATACGGCTATTTTCATGATGAAAAAAGAGTGTACC
TGCTAATGGAATACTTAGTCAATGGGGAAATGTATAAACTATTGAGGTTACACGGACCCTTCAACGATAT
TTTAGCATCAGATTATATTTATCAAATTGCCAATGCCCTAGATTATATGCATAAAAAGAATATTATTCAT
AGAGATATTAAACCTGAAAATATACTAATAGGGTTCAATAATGTCATTAAGTTAACGGACTTCGGATGGA
GTATAATAAATCCGCCAGAAAATAGAAGGAAAACTGTCTGTGGGACAATTGACTACCTTTCTCCAGAAAT
GGTGGAGTCAAGGGAATATGATCACACTATAGATGCATGGGCTCTTGGCGTCCTGGCGTTTGAACTACTG
ACCGGTGCCCCTCCGTTCGAAGAAGAAATGAAAGATACTACATATAAAAGGATAGCAGCACTGGATATCA
AAATGCCCAGTAACATTTCTCAGGATGCGCAAGATTTAATACTTAAACTACTAAAATACGACCCCAAAGA
TAGAATGCGCCTTGGAGACGTAAAAATGCATCCTTGGATACTAAGAAACAAGCCCTTTTGGGAAAATAAG
CGGTTATAGAATTAAAGTATGACAGAATCGTTTGAAGGGCACTATTAATCACTCCCGCACATATCACATA
ATACTAAGTATCCATTTCTAATATTTCATTACTCTTTTCGGCATCGTATATTGCGATATTTGATTAAATT
TTCTTGTTCATTTTTTCCTCTTTTCTTTCGCCTTTGTCGAAAGAAAAGAGGAAAACAAGCTGAAAATTGC
TATGCATTAAAGTAGCAGATTTACTTTGTTGAGTTGGTTCTGATCAATAATAAGAGTAATGAAAGAAAGC
AAAAAAATGGCTAAAGATAATTTAACTAATTTGCTCTCTCAATTGAACATTCAATTGTCTCAA
The National Center for Biotechnology
Information
? Created as a part of NLM in 1988
– Establish public databases
– Perform research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Molecular Databases
?
–
–
information
?
?
–
?
?
–
?
–
?
Primary Databases
Original submissions by experimentalists
Database staff organize but don’t add additional
Example: GenBank
Derivative Databases
Human curated
compilation and correction of data
Example: SWISS-PROT, NCBI RefSeq mRNA
Computationally Derived
Example: UniGene
Combinations
Example: NCBI Genome Assembly
What is GenBank?
? Nucleotide only sequence database
? Archival in nature
?
– Direct submissions individual records (BankIt,
Sequin)
– Batch submissions via email (EST, GSS, STS)
– ftp accounts sequencing centers
? Data shared three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ).
–
(EMBL) at EBI.
NCBI’s Primary Sequence Database
GenBank Data
European Molecular Biology Laboratory Database
Relationships of chromosomes to genome sequencing markers.
163 Mb
The X chromosome is about 163 Mb in length. In this diagram, there are 16 overlapping
BAC clones that span the entire length. In reality, 1,408 BACs were needed to span the X
chromosome. Arrows (top) mark STSs scattered throughout the chromosome and on
overlapping BACs.
BACs
STSs
X Chromosome
GenBank: NCBI’s Primary
Sequence Database
? full release every two months
? incremental and cumulative updates daily
ftp://ftp.ncbi.nih.gov/genbank/
Release 133 December 2003
19,808,101 Records
28,507,990,166 Nucleotides
110,000 +
? available only through internet
Species
107.07 Gigabytes of data
GenBank Divisions
Bulk Sequence Divisions
PAT Patent
EST Expressed Sequence Tags
STS Sequence Tagged Sites
GSS Genome Survey Sequences
HTG High Throughput Genome
HTC High Throughput cDNA
CON Contig
Traditional Divisions
BCT INV MAM PHG PLN PRI
ROD SYN UNA VRL VRT
The International Sequence
Database Collaboration
EBI
GenBank
DDBJ
EMBL
EMBL
Entrez
SRS
getentry
NIG
CIB
NCBI
NIH
?Submissions
?Updates
?Submissions
?Updates
?Submissions
?Updates
Web
Access
Text
BLAST
VAST
Entrez
Sequence
Structure
http://www.ncbi.nlm.nih.gov
Genetics experiment:
phenotype
Isolate a yeat mutant that has increased chromosome number
Can rescue the phenotype with a piece of DNA
…corresponds to a site of mutation in the yeasts DNA
genotype
CGTTTTTCCTGTAAAGCGCTTAATTTGTTTACCATTCTATAAAAACCTTGAGCTAAGGCCAACTGATGCA
ATTGCTCAAGTGAATGCATAAACAAAGCAAGATCATTCTTAGCGCAAAAAAAACTGGGATTTTGAAATAC
AACAAAAGAAAGAAGTAAAAAGGGAATGCAACGCAATAGTTTAGTAAATATCAAACTAAACGCTAATTCG
CCATCGAAAAAGACCACAACAAGACCAAATACGTCCAGGATCAATAAACCATGGAGAATATCCCATTCGC
CGCAGCAAAGAAACCCGAATTCAAAAATACCTTCACCTGTAAGAGAAAAATTGAACAGATTACCTGTAAA
CAATAAGAAGTTTTTGGATATGGAAAGCTCCAAAATTCCATCACCTATAAGGAAAGCGACTTCTTCCAAA
ATGATACACGAAAATAAGAAGCTACCTAAATTTAAATCCCTATCACTCGATGACTTTGAACTGGGGAAGA
AATTAGGAAAGGGTAAATTCGGTAAAGTTTATTGCGTTCGGCACAGGAGTACAGGATATATTTGCGCACT
GAAAGTAATGGAGAAGGAAGAAATAATAAAGTATAATTTACAGAAACAATTCAGAAGGGAGGTAGAAATA
CAAACATCGCTAAATCATCCGAATCTAACTAAATCATACGGCTATTTTCATGATGAAAAAAGAGTGTACC
TGCTAATGGAATACTTAGTCAATGGGGAAATGTATAAACTATTGAGGTTACACGGACCCTTCAACGATAT
TTTAGCATCAGATTATATTTATCAAATTGCCAATGCCCTAGATTATATGCATAAAAAGAATATTATTCAT
AGAGATATTAAACCTGAAAATATACTAATAGGGTTCAATAATGTCATTAAGTTAACGGACTTCGGATGGA
GTATAATAAATCCGCCAGAAAATAGAAGGAAAACTGTCTGTGGGACAATTGACTACCTTTCTCCAGAAAT
GGTGGAGTCAAGGGAATATGATCACACTATAGATGCATGGGCTCTTGGCGTCCTGGCGTTTGAACTACTG
ACCGGTGCCCCTCCGTTCGAAGAAGAAATGAAAGATACTACATATAAAAGGATAGCAGCACTGGATATCA
AAATGCCCAGTAACATTTCTCAGGATGCGCAAGATTTAATACTTAAACTACTAAAATACGACCCCAAAGA
TAGAATGCGCCTTGGAGACGTAAAAATGCATCCTTGGATACTAAGAAACAAGCCCTTTTGGGAAAATAAG
CGGTTATAGAATTAAAGTATGACAGAATCGTTTGAAGGGCACTATTAATCACTCCCGCACATATCACATA
ATACTAAGTATCCATTTCTAATATTTCATTACTCTTTTCGGCATCGTATATTGCGATATTTGATTAAATT
TTCTTGTTCATTTTTTCCTCTTTTCTTTCGCCTTTGTCGAAAGAAAAGAGGAAAACAAGCTGAAAATTGC
TATGCATTAAAGTAGCAGATTTACTTTGTTGAGTTGGTTCTGATCAATAATAAGAGTAATGAAAGAAAGC
AAAAAAATGGCTAAAGATAATTTAACTAATTTGCTCTCTCAATTGAACATTCAATTGTCTCAA
Using
Entrez
An integrated database
search and retrieval system
3-D
Structur
e
Entrez: Database
Integration
VAST
Phylogeny
Genomes
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3 -D
Structur
e
Word weight
BLASTBLAST
The (ever) Expanding Entrez
Entrez
Journals
UniGene
PubMed Nucleotide
Protein
SNP
Genome
Books
OMIM
Taxonomy
3D Domains
UniSTS
Structure
System
ProbeSet
CDD
PopSet
Entrez Databases
PubMed Biomedical literature
Books Online textbooks
Nucleotide GenBank, EMBL, DDBJ, RefSeq, PDB
Protein [GenBank, EMBL, DDBJ], RefSeq,
SWISS-PROT, PIR, PRF, PDB
Genome Complete genomes
Taxonomy Organisms in NCBI sequence databases
Structure MMDB: experimental 3D structures
Domains CDD: conserved protein domains
3D Domains Compact 3D protein domains in MMDB
OMIM Online Mendelian Inheritance in Man
SNP Single nucleotide polymorphisms
UniSTS Sequence Tagged Site markers
ProbeSet Gene expression and microarray datasets
PopSet Population study datasets
UniGene Gene-based expressed sequence clusters
A Traditional GenBank Record
Definition =Title
ACCESSION U07163
VERSION U07163.1 GI:460243
Accession Number
Version Number
GI Number
NCBI’s Taxonomy
FASTA Format
>
FASTA Definition Line
>gi|460243|gb|U07163.1|SCU07163
Database Identifiers
gb GenBank
emb EMBL
dbj DDBJ
sp SWISS-PROT
pdb
pir
prf
ref RefSeq
Accession number
Locus Name
gi number
Protein Databank
PIR
PRF
Abstract Syntax Notation:
ASN.1
FASTA
Nucleotide
FASTA
Protein
GenPept GenBank
ASN.1
***************************************/
{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
,NULL ,TRUE,'e',A
*
* asn2ff.c
*
*
***********************************
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#include <accid1.h>
#endif
FILE *fpl;
{"Input is a Seq-entry","F", NULL RG_BOOLEAN,0.0,0,NULL},
Toolbox Sources
ftp> open ftp.ncbi.nih.gov
.
.
ftp://ftp.ncbi.nlm.gov/toolbox/ncbi_tools
NCBI Toolbox
/************************************************************************
convert an ASN.1 entry to flat file format, using the FFPrintArray.
#ifdef ENABLE_ID1
Args myargs[] = {
{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
{"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
ftp> cd toolbox
ftp> cd ncbi_tools
/protein_id="AAA20496.1"
/db_xref="GI:460244"
GenPept Protein IDS
Why align sequences?
? Functional predictions based on identifying
homologues.
Assumes:
conservation of
conservation of
sequence function
BUT: Function carried out at level of proteins, i.e.
3-D structure
Sequence conservation carried out at level of DNA
1-D sequence
Implicit Assumption of Evolution in
Models of Sequence Homology
Assume:
Sequence conservation
Structure conservation
Note that the converse is NOT necessarily true!
Structure conservation
X
Sequence conservation
Definitions
? Homologue:
a relatively non-specific term (meaningless?) that conveys the idea that
two sequences are somehow related
? Orthologue:
Ortho = (greek) straight….implies direct descent, 1 ancestor
Vav human
Vav bovine
human Vav
-or-
avian Vav
bovine Vav
Vav avian
dinosaur Vav
Vav elegans
Ye’ olde Vav
Definitions
? Paralogue:
Para = (greek) along side of….implies some type of gene duplication
event
Vav1 human Vav2human
Vav3human
-or
Complex problem
Chicken Vav
Vav1 human Vav2human Vav3human
Orthologue or Paralogue?
Need to know evolutionary relationships
Can try and get these from alignments as well…
Alignments- Good Bad and Ugly
HgbA-human
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
:+ +::+:::::+ :+++++::+:++ +++++::+::+ ::
GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
HgbB-human
Alignments- Good Bad and Ugly
GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
HgbA-human
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
::+ ++: + ++:+: ++ :+ :+ : +:+ : ++::+
Nematode glutathione S-transferase
SPURIOUS ALIGNMENT
HgbA-human
GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++:+ ::+ ++ +:++ + +: :+ +:+ :
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
Leghemoglobin, yellow lupin
Alignments
? Types:
? Local
? Global
? Ungapped
? Gapped (2 types- linear, affine)
? Methods:
? Dot matrix
? Dynamic Programming
? Word, k-tup
Simplest alignment representation – Dot plot
Model: Need a metric of similarity between amino acid pairs
Simplest metric – unitary matrix, identity matrix
K I H G F E D C A
0 0 0 0 0 0 0 0 1 A
C
0 0 0 0 0 0 0 1
D
0 0 0 0 0 0 1
E
0 0 0 0 0 1
F
0 0 0 0 1
G
0 0 0 1
H
0 0 1
I
0 1
K
1
Example – self alignment
V E S F E L R K F S D F G
0 0 0 0 0 0 0 0 0 0 0 0 1 G
F 0 0 0 0 1 0 0 1 1 0 0 0
D 0 0 0 0 0 0 1 0 0 0 0
S 0 0 0 0 0 1 0 1 0 0
F 0 0 0 0 1 1 0 0 0
K 0 0 0 1 0 0 0 0
R 0 0 1 0 0 0 0
L 0 1 0 0 0 0
E 1 0 0 1 0
F 1 0 0 0
S 1 0 0
Example – self alignment…..with sliding window
V E S F E L R K F S D F G
0 0 0 0 0 0 0 0 0 0 0 0 1 G
F 0 0 0 0 1 0 0 1 1 0 0 0
D 0 0 0 0 0 0 1 0 0 0 0
S 0 0 0 0 0 1 0 1 0 0
F 0 0 0 0 1 1 0 0 0
K 0 0 0 1 0 0 0 0
R 0 0 1 0 0 0 0
L 0 1 0 0 0 0
E 1 0 0 1 0
F 1 0 0 0
S 1 0 0
Example – self alignment
V E S F E L R K F S D F G
0 0 0 0 0 0 0 0 0 0 0 0 1 G
F 0 0 0 0 1 0 0 1 0 0 0 0
D 0 0 0 0 0 0 1 0 0 0 0
S 0 0 0 0 0 1 0 0 0 0
F 0 0 0 0 1 0 0 0 0
K 0 0 0 1 0 0 0 0
R 0 0 1 0 0 0 0
L 0 1 0 0 0 0
E 1 0 0 0 0
F 1 0 0 0
S 1 0 0
Simple alignments
1
1
200
1
1
200
200
200
DNA
Self alignment – Dot Matrix
symmetric
Now align two different sequences
* Consider other similarity matrices
besides identity….
? Chemical similarity – binary decision
? Amino acid conservation in aligned protein
families – min. similarity score (+/- window)
? Average of multiple scoring systems
Dot Matrix Alignments
Sequence#1
Sequence#1
1
n
1
m
Note – NOT symmetric
A Local Alignment
Dot Matrix Alignments
Sequence#1
Sequence#1
Insertion
Insertion
1
n
1
m
A Global Alignment
General Rules for Dot Matrices
? Advantage – let’s your eyes/brain do the work –
VERY EFFICIENT!!!!
? DNA comparisons – long windows and high
stringencies (11/7, 15/11)
? Proteins – short windows and stringencies (1/1)
EXCEPT in looking for domains – then longer
windows and smaller stringencies (15/5).
? Note – we can use these types of self-aligned matrices
for more than just sequence comparisons…i.e. distance
between side Cα atoms in 3d-structures etc….maybe
later in the course!
Computational Efficiency
Measure efficiency in cpu run time and memory
O() = “big-oh” notation
Both scale as size of the problem, measured in number
of units, n, in the problem, i.e. run time is f(n).
Analyse the asymptotic worst-case running time….
Sometimes just do the experiment and measure it….
If problem scales as square of the number of units
in the problem
O(n
2
) (=order n-squared)
Examples
O(n
k
) is “polynomial time” as long as
K<3 …..tractable
Consider our un-gapped dot matrix
Global alignment:
1 n
1
12345678….
12345678….
12345678….
12345678….
12345678….
m
12345678….
….essentially an O(mn) problem
O.K. Examples
O(n) better than O(n log(n)), better than
O(n
2
), better than O(n
3
)
Terrible Examples
O(k
n
) = exponential time….horrible!!!!
NP problems- no known polynomial time
Solutions = non-deterministic polynomial
Problems.