7.91 – Lecture #1
Introduction to Bioinformatics
Focus on Kinases 
& Pairwise Sequence Comparisons 
ARDFSHGLLENKLLGCDSMRWE
.::. .:::. .:::: :::.
GRDYKMALLEQWILGCD-MRWD
Reading:
This lecture: Mount pp.1-7, 29-35, 45-48, 51-64
Next lecture: Mount pp. 8-9, 65-89, 96-115, +more
The Protein Data Bank (PDB - http://www.pdb.org/) is the single 
biological macromolecular structure data.
 Research 28 (2000): 235-242.
(PDB Advisory Notice on using materials available in the archive:
 
worldwide repository for the processing and distribution of 3-D 
 
 
 Research 28 (2000): 235-242.
(PDB Advisory Notice on using materials available in the archive: http://www.rcsb.org/pdb/advisory.html)
Michael Yaffe 
Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, 
H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data 
Bank. Nucleic Acids Research 28 (2000): 235-242.
 
http://www.rcsb.org/pdb/advisory.html)
Outline
?	 Review of biological fundamentals 
?	 Genetics example 
?	 Biological databases/NCBI resources 
?	 Simplified sequence analysis – where to start…
?	 Simple sequence comparisons 
?	 Definitions of related sequences 
?	 Concepts and types of alignments – the good, 
the bad, and the ugly 
?	 Dot matrix alignments 
?	 Computational efficiency 
?	 Recursion and dynamic programming 
?	 Substitution matrices: PAM, BLOSUM, Gonnet
Outline (cont)
?Gaps
?	 Applied dynamic programming: global 
alignments: Needleman-Wunsch 
?	 Applied dynamic programming: local 
alignments – Smith-Waterman 
?	 Basic statistics of sequence alignments
Review of biological fundamentals
Genetic material 
? gene as a concept DNA 
? DNA as hereditary material 
RNA 
Protein 
Critical 
cell functions
DNA structure
A 
G 
C 
T 
U 
Bases
DNA structure
A 
G 
C 
T 
Base pairing
DNA structure
1 
5 
4 
3 2 
1 
5 
4 
3 2 
Nucleoside 
Nucleotide
DNA structure
5’ 
3’ 
DNA structure
5’
3’ 
DNA structure 
5’ 
3’ 
3’ 
5’ 
RNA structure
1 
5 
4 
3 2 
1 
5 
4 
3 2 
Transcription 
5’ 
3’ 
3’ 
5’ 
OH 
OH 
OH 
DNA 
RNA 
Protein structure
Aspartic acid (Asp, D)
Arginine (Arg, R)
C C
O
-
O
H
Glycine (Gly, G) Alanine (Ala, A) Valine (Val, V)
H
H
3
N
+
C C
O
-
O
H
CH
3
CH
3
CH
3
H
3
N
+
C C
O
-
O
H
CH
H
3
N
+
C C
O
-
O
H
CH
H
3
N
+
H
3
C
OH CH
3
CH
3
CH
3
C C
O
-
O
H
CH
2
CH
2
CH
3
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
S
CH
3
H
3
N
+
C C
O
-
O
H
H
3
N
+
CH
2
C C
O
-
O
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
C
NH
2
C
C C
O
-
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
2
N
+
H
2
C
C C
O
-
O
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
-
O
C
O
C C
O
-
O
H
H
3
N
+
CH
2
CH
2
CH
2
CH
2
NH
3
+
NH
2
+
NH
+
NH
C C
O
-
O
H
H
3
N
+
CH
2
C
C
O
-
O
H
H
3
N
+
CH
2
CH
2
CH
2
C
NH
NH
2
-
O
C
O
NH
2
CH
2
CH
2NH
C C
O
-
O
H
H
3
N
+
CH
2
C C
O
-
O
H
H
3
N
+
CH
2
SH
OH
C C
O
-
O
H
H
3
N
+
CH
OH
Leucine (Leu, L) Isoleucine (IIe, I)
Methionine (Met, M)
Serine (Ser, S)
Aspartic acid (Asp, D)
Glutamic acid (Glu, E)
Lysine (Lys, K)
Arginine (Arg, R)
Histidine (His, H)
Threonine (Thr, T) Cysteine (Cys, C) Tyrosine (Tyr, Y) Asparagine (Asn, N) Glutamine (Gln, Q)
Acidic Basic
Phenylalanine (Phe, F)
Tryptophan (Trp, W)
Proline (Pro, P)
      Electrically charged
Polar, Hydrophillic R-groups
Nonpolar, Hydrophobic R-groups
Codon Table 
phe (UUU)
phe 
leu
leu
leu
leu
leu
leu
ile
ile
ile
met 
(and initiation)
val
val
val
val
ser
ser
ser
ser
tyr
tyr
termination
termination
cys
U C
(5')....pNpNpN....(3') in mRNA
Middle Base of Codon
Base at 5' End
of Codon
Base at 3' End
of Codon
A G
cys
termination
trp
pro
pro
pro
pro
thr
thr
thr
thr
his
his
gln
gln
asn
asn
lys
lys
arg
arg
arg
arg
ser
ser
arg
arg
ala
ala
ala
ala
asp
asp
glu
glu
gly
gly
gly
gly
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
DNA structure 
5’ 
3’ 
3’ 
5’ 
“Sense” 
strand 
“Anti-Sense” 
strand 
Convention: 
Sense strand 
Read 5’? 3’ 
Reverse Complement 
Genetics experiment:
phenotype 
Isolate a yeat mutant that has increased chromosome number 
Can rescue the phenotype with a piece of DNA
…corresponds to a site of mutation in the yeasts DNA 
genotype 
CGTTTTTCCTGTAAAGCGCTTAATTTGTTTACCATTCTATAAAAACCTTGAGCTAAGGCCAACTGATGCA 
ATTGCTCAAGTGAATGCATAAACAAAGCAAGATCATTCTTAGCGCAAAAAAAACTGGGATTTTGAAATAC 
AACAAAAGAAAGAAGTAAAAAGGGAATGCAACGCAATAGTTTAGTAAATATCAAACTAAACGCTAATTCG 
CCATCGAAAAAGACCACAACAAGACCAAATACGTCCAGGATCAATAAACCATGGAGAATATCCCATTCGC 
CGCAGCAAAGAAACCCGAATTCAAAAATACCTTCACCTGTAAGAGAAAAATTGAACAGATTACCTGTAAA 
CAATAAGAAGTTTTTGGATATGGAAAGCTCCAAAATTCCATCACCTATAAGGAAAGCGACTTCTTCCAAA 
ATGATACACGAAAATAAGAAGCTACCTAAATTTAAATCCCTATCACTCGATGACTTTGAACTGGGGAAGA 
AATTAGGAAAGGGTAAATTCGGTAAAGTTTATTGCGTTCGGCACAGGAGTACAGGATATATTTGCGCACT 
GAAAGTAATGGAGAAGGAAGAAATAATAAAGTATAATTTACAGAAACAATTCAGAAGGGAGGTAGAAATA 
CAAACATCGCTAAATCATCCGAATCTAACTAAATCATACGGCTATTTTCATGATGAAAAAAGAGTGTACC 
TGCTAATGGAATACTTAGTCAATGGGGAAATGTATAAACTATTGAGGTTACACGGACCCTTCAACGATAT 
TTTAGCATCAGATTATATTTATCAAATTGCCAATGCCCTAGATTATATGCATAAAAAGAATATTATTCAT 
AGAGATATTAAACCTGAAAATATACTAATAGGGTTCAATAATGTCATTAAGTTAACGGACTTCGGATGGA 
GTATAATAAATCCGCCAGAAAATAGAAGGAAAACTGTCTGTGGGACAATTGACTACCTTTCTCCAGAAAT 
GGTGGAGTCAAGGGAATATGATCACACTATAGATGCATGGGCTCTTGGCGTCCTGGCGTTTGAACTACTG 
ACCGGTGCCCCTCCGTTCGAAGAAGAAATGAAAGATACTACATATAAAAGGATAGCAGCACTGGATATCA 
AAATGCCCAGTAACATTTCTCAGGATGCGCAAGATTTAATACTTAAACTACTAAAATACGACCCCAAAGA 
TAGAATGCGCCTTGGAGACGTAAAAATGCATCCTTGGATACTAAGAAACAAGCCCTTTTGGGAAAATAAG 
CGGTTATAGAATTAAAGTATGACAGAATCGTTTGAAGGGCACTATTAATCACTCCCGCACATATCACATA 
ATACTAAGTATCCATTTCTAATATTTCATTACTCTTTTCGGCATCGTATATTGCGATATTTGATTAAATT 
TTCTTGTTCATTTTTTCCTCTTTTCTTTCGCCTTTGTCGAAAGAAAAGAGGAAAACAAGCTGAAAATTGC 
TATGCATTAAAGTAGCAGATTTACTTTGTTGAGTTGGTTCTGATCAATAATAAGAGTAATGAAAGAAAGC 
AAAAAAATGGCTAAAGATAATTTAACTAATTTGCTCTCTCAATTGAACATTCAATTGTCTCAA 
The National Center for Biotechnology 
Information
? Created as a part of NLM in 1988 
– Establish public databases 
– Perform research in computational biology
– Develop software tools for sequence analysis 
– Disseminate biomedical information 
Molecular Databases
? 
– 
– 
information 
? 
? 
– 
? 
? 
– 
? 
– 
? 
Primary Databases 
Original submissions by experimentalists 
Database staff organize but don’t add additional 
Example: GenBank 
Derivative Databases 
Human curated 
compilation and correction of data 
Example: SWISS-PROT, NCBI RefSeq mRNA 
Computationally Derived 
Example: UniGene 
Combinations 
Example: NCBI Genome Assembly 
What is GenBank?
? Nucleotide only sequence database 
? Archival in nature 
? 
– Direct submissions individual records (BankIt, 
Sequin) 
– Batch submissions via email (EST, GSS, STS) 
– ftp accounts sequencing centers 
? Data shared three collaborating databases 
– GenBank 
– DNA Database of Japan (DDBJ). 
– 
(EMBL) at EBI. 
NCBI’s Primary Sequence Database 
GenBank Data 
European Molecular Biology Laboratory Database 
Relationships of chromosomes to genome sequencing markers.
163 Mb
The X chromosome is about 163 Mb in length. In this diagram, there are 16 overlapping
BAC clones that span the entire length. In reality, 1,408 BACs were needed to span the X
chromosome. Arrows (top) mark STSs scattered throughout the chromosome and on
overlapping BACs.
BACs
STSs
X Chromosome
GenBank: NCBI’s Primary 
Sequence Database
? full release every two months 
? incremental and cumulative updates daily 
ftp://ftp.ncbi.nih.gov/genbank/ 
Release 133 December 2003 
19,808,101 Records 
28,507,990,166 Nucleotides 
110,000 + 
? available only through internet 
Species 
107.07 Gigabytes of data
GenBank Divisions
Bulk Sequence Divisions
PAT Patent 
EST Expressed Sequence Tags
STS Sequence Tagged Sites
GSS Genome Survey Sequences
HTG High Throughput Genome
HTC High Throughput cDNA
CON Contig 
Traditional Divisions 
BCT INV MAM PHG PLN PRI 
ROD SYN UNA VRL VRT 
The International Sequence
Database Collaboration
EBI 
GenBank
DDBJ
EMBL
EMBL
Entrez 
SRS 
getentry 
NIG
CIB 
NCBI 
NIH
?Submissions 
?Updates 
?Submissions 
?Updates 
?Submissions 
?Updates 
Web 
Access
Text
BLAST 
VAST 
Entrez 
Sequence 
Structure 
http://www.ncbi.nlm.nih.gov 
Genetics experiment:
phenotype 
Isolate a yeat mutant that has increased chromosome number 
Can rescue the phenotype with a piece of DNA
…corresponds to a site of mutation in the yeasts DNA 
genotype 
CGTTTTTCCTGTAAAGCGCTTAATTTGTTTACCATTCTATAAAAACCTTGAGCTAAGGCCAACTGATGCA 
ATTGCTCAAGTGAATGCATAAACAAAGCAAGATCATTCTTAGCGCAAAAAAAACTGGGATTTTGAAATAC 
AACAAAAGAAAGAAGTAAAAAGGGAATGCAACGCAATAGTTTAGTAAATATCAAACTAAACGCTAATTCG 
CCATCGAAAAAGACCACAACAAGACCAAATACGTCCAGGATCAATAAACCATGGAGAATATCCCATTCGC 
CGCAGCAAAGAAACCCGAATTCAAAAATACCTTCACCTGTAAGAGAAAAATTGAACAGATTACCTGTAAA 
CAATAAGAAGTTTTTGGATATGGAAAGCTCCAAAATTCCATCACCTATAAGGAAAGCGACTTCTTCCAAA 
ATGATACACGAAAATAAGAAGCTACCTAAATTTAAATCCCTATCACTCGATGACTTTGAACTGGGGAAGA 
AATTAGGAAAGGGTAAATTCGGTAAAGTTTATTGCGTTCGGCACAGGAGTACAGGATATATTTGCGCACT 
GAAAGTAATGGAGAAGGAAGAAATAATAAAGTATAATTTACAGAAACAATTCAGAAGGGAGGTAGAAATA 
CAAACATCGCTAAATCATCCGAATCTAACTAAATCATACGGCTATTTTCATGATGAAAAAAGAGTGTACC 
TGCTAATGGAATACTTAGTCAATGGGGAAATGTATAAACTATTGAGGTTACACGGACCCTTCAACGATAT 
TTTAGCATCAGATTATATTTATCAAATTGCCAATGCCCTAGATTATATGCATAAAAAGAATATTATTCAT 
AGAGATATTAAACCTGAAAATATACTAATAGGGTTCAATAATGTCATTAAGTTAACGGACTTCGGATGGA 
GTATAATAAATCCGCCAGAAAATAGAAGGAAAACTGTCTGTGGGACAATTGACTACCTTTCTCCAGAAAT 
GGTGGAGTCAAGGGAATATGATCACACTATAGATGCATGGGCTCTTGGCGTCCTGGCGTTTGAACTACTG 
ACCGGTGCCCCTCCGTTCGAAGAAGAAATGAAAGATACTACATATAAAAGGATAGCAGCACTGGATATCA 
AAATGCCCAGTAACATTTCTCAGGATGCGCAAGATTTAATACTTAAACTACTAAAATACGACCCCAAAGA 
TAGAATGCGCCTTGGAGACGTAAAAATGCATCCTTGGATACTAAGAAACAAGCCCTTTTGGGAAAATAAG 
CGGTTATAGAATTAAAGTATGACAGAATCGTTTGAAGGGCACTATTAATCACTCCCGCACATATCACATA 
ATACTAAGTATCCATTTCTAATATTTCATTACTCTTTTCGGCATCGTATATTGCGATATTTGATTAAATT 
TTCTTGTTCATTTTTTCCTCTTTTCTTTCGCCTTTGTCGAAAGAAAAGAGGAAAACAAGCTGAAAATTGC 
TATGCATTAAAGTAGCAGATTTACTTTGTTGAGTTGGTTCTGATCAATAATAAGAGTAATGAAAGAAAGC 
AAAAAAATGGCTAAAGATAATTTAACTAATTTGCTCTCTCAATTGAACATTCAATTGTCTCAA 
Using 
Entrez
An integrated database 
search and retrieval system
3-D 
Structur
e
Entrez: Database 
Integration
VAST 
Phylogeny 
Genomes 
Taxonomy 
PubMed 
abstracts 
Nucleotide 
sequences 
Protein 
sequences 
3 -D 
Structur 
e 
Word weight 
BLASTBLAST 
The (ever) Expanding Entrez 
Entrez 
Journals 
UniGene 
PubMed Nucleotide 
Protein 
SNP 
Genome 
Books 
OMIM 
Taxonomy 
3D Domains 
UniSTS 
Structure 
System 
ProbeSet 
CDD 
PopSet 
Entrez Databases
PubMed Biomedical literature 
Books Online textbooks 
Nucleotide GenBank, EMBL, DDBJ, RefSeq, PDB 
Protein [GenBank, EMBL, DDBJ], RefSeq, 
SWISS-PROT, PIR, PRF, PDB 
Genome Complete genomes 
Taxonomy Organisms in NCBI sequence databases 
Structure MMDB: experimental 3D structures 
Domains CDD: conserved protein domains 
3D Domains Compact 3D protein domains in MMDB 
OMIM Online Mendelian Inheritance in Man 
SNP Single nucleotide polymorphisms 
UniSTS Sequence Tagged Site markers 
ProbeSet Gene expression and microarray datasets 
PopSet Population study datasets 
UniGene Gene-based expressed sequence clusters 
A Traditional GenBank Record
Definition =Title 
ACCESSION U07163 
VERSION U07163.1 GI:460243 
Accession Number 
Version Number 
GI Number 
NCBI’s Taxonomy 
FASTA Format
> 
FASTA Definition Line 
>gi|460243|gb|U07163.1|SCU07163 
Database Identifiers 
gb GenBank 
emb EMBL 
dbj DDBJ 
sp SWISS-PROT 
pdb 
pir 
prf 
ref RefSeq 
Accession number 
Locus Name
gi number 
Protein Databank 
PIR 
PRF 
Abstract Syntax Notation: 
ASN.1
FASTA 
Nucleotide 
FASTA 
Protein 
GenPept GenBank 
ASN.1 
***************************************/
{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
,NULL ,TRUE,'e',A
* 
* asn2ff.c 
* 
* 
*********************************** 
#include <accentr.h> 
#include "asn2ff.h" 
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h> 
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h> 
#include <accid1.h> 
#endif 
FILE *fpl; 
{"Input is a Seq-entry","F", NULL RG_BOOLEAN,0.0,0,NULL}, 
Toolbox Sources 
ftp> open ftp.ncbi.nih.gov 
. 
. 
ftp://ftp.ncbi.nlm.gov/toolbox/ncbi_tools 
NCBI Toolbox 
/************************************************************************ 
convert an ASN.1 entry to flat file format, using the FFPrintArray. 
#ifdef ENABLE_ID1 
Args myargs[] = { 
{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL}, 
{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL}, 
{"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL}, 
ftp> cd toolbox 
ftp> cd ncbi_tools 
/protein_id="AAA20496.1"
/db_xref="GI:460244" 
GenPept Protein IDS 
Why align sequences?
?	 Functional predictions based on identifying 
homologues. 
Assumes:
conservation of 
 conservation of 
sequence function 
BUT: Function carried out at level of proteins, i.e. 
3-D structure 
Sequence conservation carried out at level of DNA 
1-D sequence 
Implicit Assumption of Evolution in
Models of Sequence Homology
Assume: 
Sequence conservation 
 Structure conservation 
Note that the converse is NOT necessarily true! 
Structure conservation 
X 
       Sequence conservation 
Definitions
? Homologue:
a relatively non-specific term (meaningless?) that conveys the idea that 
two sequences are somehow related 
? Orthologue: 
Ortho = (greek) straight….implies direct descent, 1 ancestor 
Vav human 
Vav bovine 
human Vav 
-or-
avian Vav 
bovine Vav 
Vav avian 
dinosaur Vav
Vav elegans
Ye’ olde Vav
Definitions
? Paralogue:
Para = (greek) along side of….implies some type of gene duplication 
event 
Vav1 human Vav2human
 Vav3human 
-or 
Complex problem
Chicken Vav 
Vav1 human Vav2human Vav3human 
Orthologue or Paralogue?
Need to know evolutionary relationships
Can try and get these from alignments as well…
Alignments- Good Bad and Ugly
HgbA-human 
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
:+ +::+:::::+ :+++++::+:++ +++++::+::+ ::
GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
HgbB-human 
Alignments- Good Bad and Ugly
GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE 
HgbA-human 
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
::+ ++: + ++:+: ++ :+ :+ : +:+ : ++::+ 
Nematode glutathione S-transferase 
SPURIOUS ALIGNMENT 
HgbA-human 
GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++:+ ::+ ++ +:++ + +: :+ +:+ :
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
Leghemoglobin, yellow lupin 
Alignments
? Types: 
? Local 
? Global 
? Ungapped 
? Gapped (2 types- linear, affine) 
? Methods:
? Dot matrix 
? Dynamic Programming 
? Word, k-tup 
Simplest alignment representation – Dot plot
Model: Need a metric of similarity between amino acid pairs
Simplest metric – unitary matrix, identity matrix 
K I H G F E D C A
0 0 0 0 0 0 0 0 1 A
C
 0 0 0 0 0 0 0 1
D
 0 0 0 0 0 0 1
E
 0 0 0 0 0 1
F
 0 0 0 0 1
G
 0 0 0 1
H
 0 0 1
I
 0 1
K
 1
Example – self alignment
V E S F E L R K F S D F G 
0 0 0 0 0 0 0 0 0 0 0 0 1 G 
F 0 0 0 0 1 0 0 1 1 0 0 0 
D 0 0 0 0 0 0 1 0 0 0 0 
S 0 0 0 0 0 1 0 1 0 0 
F 0 0 0 0 1 1 0 0 0 
K 0 0 0 1 0 0 0 0 
R 0 0 1 0 0 0 0 
L 0 1 0 0 0 0 
E 1 0 0 1 0 
F 1 0 0 0 
S 1 0 0 
Example – self alignment…..with sliding window
V E S F E L R K F S D F G 
0 0 0 0 0 0 0 0 0 0 0 0 1 G 
F 0 0 0 0 1 0 0 1 1 0 0 0 
D 0 0 0 0 0 0 1 0 0 0 0 
S 0 0 0 0 0 1 0 1 0 0 
F 0 0 0 0 1 1 0 0 0 
K 0 0 0 1 0 0 0 0 
R 0 0 1 0 0 0 0 
L 0 1 0 0 0 0 
E 1 0 0 1 0 
F 1 0 0 0 
S 1 0 0 
Example – self alignment
V E S F E L R K F S D F G 
0 0 0 0 0 0 0 0 0 0 0 0 1 G 
F 0 0 0 0 1 0 0 1 0 0 0 0 
D 0 0 0 0 0 0 1 0 0 0 0 
S 0 0 0 0 0 1 0 0 0 0 
F 0 0 0 0 1 0 0 0 0 
K 0 0 0 1 0 0 0 0 
R 0 0 1 0 0 0 0 
L 0 1 0 0 0 0 
E 1 0 0 0 0 
F 1 0 0 0 
S 1 0 0 
Simple alignments
1 
1 
200 
1
1 
200 
200 
200 
DNA
Self alignment – Dot Matrix 
symmetric
Now align two different sequences
* 	Consider other similarity matrices 
besides identity…. 
?	 Chemical similarity – binary decision
?	 Amino acid conservation in aligned protein 
families – min. similarity score (+/- window) 
?	 Average of multiple scoring systems
Dot Matrix Alignments
Sequence#1 
Sequence#1 
1 
n 
1 
m 
Note – NOT symmetric 
A Local Alignment 
Dot Matrix Alignments
Sequence#1 
Sequence#1 
Insertion 
Insertion 
1 
n 
1 
m 
A Global Alignment 
General Rules for Dot Matrices
?	 Advantage – let’s your eyes/brain do the work –
VERY EFFICIENT!!!! 
?	 DNA comparisons – long windows and high
stringencies (11/7, 15/11) 
?	 Proteins – short windows and stringencies (1/1) 
EXCEPT in looking for domains – then longer 
windows and smaller stringencies (15/5). 
?	 Note – we can use these types of self-aligned matrices 
for more than just sequence comparisons…i.e. distance 
between side Cα atoms in 3d-structures etc….maybe
later in the course! 
Computational Efficiency
Measure efficiency in cpu run time and memory
O() = “big-oh” notation 
Both scale as size of the problem, measured in number 
of units, n, in the problem, i.e. run time is f(n). 
Analyse the asymptotic worst-case running time….
Sometimes just do the experiment and measure it….
If problem scales as square of the number of units 
in the problem 
O(n
2
) (=order n-squared) 
Examples
O(n
k
) is “polynomial time” as long as 
K<3 …..tractable 
Consider our un-gapped dot matrix 
Global alignment:
1 n 
1 
12345678….
12345678…. 
12345678….
12345678…. 
12345678….
m 
12345678…. 
….essentially an O(mn) problem 
O.K. Examples
O(n) better than O(n log(n)), better than 
O(n
2
), better than O(n
3
) 
Terrible Examples 
O(k
n
) = exponential time….horrible!!!! 
NP problems- no known polynomial time 
Solutions = non-deterministic polynomial 
Problems. 

