7.91 – Lecture #3
More Multiple Sequence Alignment
-and-
Motif Scanning, Database Searching
Michael Yaffe
Outline
? Multiple Sequence Alignment - Carillo & Lipman,
Clustal(W)
? Position-Specific Scoring Matrices (PSSM)
? Information content, Shannon entropy
? Sequence logos
? Hidden Markov Models
? …Other approaches: Genetic algorithms,
expectation maximization, MEME,Gibbs sampler
? FASTA, Blast searching, Smith-Waterman
? Psi-Blast
Reading - Mount p. 139-150, 152-157, 161-171, 185-198
Multiple Sequence Alignments
? Sequences are aligned so as to bring the
greatest number of single characters into
register.
? If we include gaps, mismatches, then even
dynamic programming becomes limited to
~ 3 sequences unless they are very
short….need an alternative approach…
Why?
Consider the 2 sequence comparison
…..an O(mn) problem – order n
2
i =0 1 2 3 4 5
j =
Gap V D S C Y
0
0
4
-8
-4
-3
-8
-16 -24 -32 -40
-8
1 -8
4
-12 -20 -28
-12 -16
-6
-9
7
-17
2
1 -73
-24
-14
3
-6
9
4 -32
-22 -14
-30
3
0
5
-40
-22
-7
1
3
13
6
-48
-38
-30
-15 5
23
For 3 sequences….
ARDFSHGLLENKLLGCDSMRWE
.::. .:::. .:::: :::.
GRDYKMALLEQWILGCD-MRWD
.::. ::.: .. :. .:::
SRDW--ALIEDCMV-CNFFRWD
An O(mnj) problem !
Consider sequences each 300 amino acids
Uh Oh !!!
2 sequences – (300)
2 Our polynomail problem
3 sequences – (300)
3
Just became exponential!
but for v sequences – (300)
v
Consider pairwise alignments
between 3 sequences
Carillo and Lipman – Sum of Pairs method
Sequence B
C
e
c
n
e
u
q
e
S
A-C
A-B
B-C
A-B-C
Do we need to
Score each node?
Sequence A
Get the multiple alignment score within the cubic lattice by
Adding together the scores of the pairwise alignments…
In practice, doesn’t give optimal alignment…But we’re close!
Seems reasonable that the optimal alignment won’t be far from
the diagonal we were on…so we just set bounds on the location
of the msa within the cube based on each pairwise-alignment.
Then just do dynamic programing within the volume defined by the
pre-imposed bounds
Still takes too long for more than three
sequences…need a better way!
? Progressive Methods of Multiple Sequence
Alignment
Concept – simple:
1-Use DP to build pairwise alignments of most closely
related sequences
2- Then progressively add less related sequences or
groups of sequences…
ClustalW
Higgins and Sharp 1988
?1-Do pairwise analysis of all the sequences
(you choose similarity matrix).
? 2- Use the alignment scores to make a
phylogenetic tree.
? 3- Align the sequences to each other guided
by the phylogenetic relationships in the tree.
New features: Clustal ?ClustalW (allows weights) ? ClustalX (GUI-based
Weighting is important to avoid biasing an alignment by many sequence
Members that are closely related to each other evolutionarily!
Pairwise Alignment
Steps in Multiple Alignment
Multiple alignment following the tree from A
S
1
S
1
S
1
6 pairwise comparisons
then cluster analysis
similarity
align most similar pair
Gaps to optimize alignment
New gap to optimize
alignment of (S
2
S
4
) with (S
1
S
3
)
align next most similar pair
align alignments-preserve gaps
Example - 4 sequences S
1
S
2
S
3
S
4
S
2
S
2
S
2
S
3
S
3
S
3
S
4
S
4
S
4
S
1
S
2
S
3
S
4
Progressive Alignments
Note that the final msa is EXTREMELY DEPENDENT on
the initial pairwise sequence alignments!
If the sequences are close in evolution, and you can see
the alignment – GREAT!
If the sequences are NOT close in evolution, and you CANNOT
See the alignment – errors will be propogated to the final msa
Has led to other approaches to do msa’s that aren’t so
Dependent on initial states….i.e. genetic algorithm
Finding patterns (i.e. motifs and
domains) in Multiple Sequence Analysis
Block Analysis, Position Specific Scoring Matrices (PSSM)
BUILD an msa from groups of related proteins
BLOCKS represent a conserved region in that msa
that is LACKING IN GAPS – i.e. no insertions/deletions
The BLOCKS are typically anwhere from 3-60 amino acids long,
based on exact amino acid matches – i.e. alignment will tolerate
mismatches, but doesn’t use any kind of PAM or
BLOSUM matrix…in fact they generate the BLOSUM matrix!
A single protein contains numerous such BLOCKS
separated by stretches of intervening sequences that can
differ in length and composition.
These blocks may be whole domains, short sequence motifs,
key parts of enzyme active sites etc, etc.
BLOCKS database….so far exploration limited. Lots of stuff to probe!
Can use these conserved BLOCKS
to derive a PSSM
? The dirty secret behind prosite! Scansite! And in a
12345………….11
twised way Psi-BLAST!
….GDSFHYFVSHG…..
….GDAFHYYISFG…..
….GDSYHYFLSFG…..
….SDSFHYFMSFG…..
….GDSFHFFASFG…..
Now build a matrix with 20 amino acids as the columns, and 11 rows
For the positions in the BLOCK
Each matrix entry is the log(frequency of the amino acid occurance)
at that position in the BLOCK.
….GDSFHYFVSHG…..
….GDAFHYYISFG…..
….GDSYHYFLSFG…..
….SDSFHYFMSFG…..
….GDSFHFFASFG…..
A C D E F G H I K….
1
Log(4)
2 Log(5)
3
4
5
We can now use the PSSM to search a database for other proteins that
have the BLOCK (or motif).
Problem 1 – We need to think about what kind of information is
Contained within the PSSM.
→ Leads to concepts of Information Content & Entropy
….GDSFHYFVSHG…..
….GDAFHYYISFG…..
….GDSYHYFLSFG…..
….SDSFHYFMSFG…..
….GDSFHFFASFG…..
Problem 2 –The PSSM must accurately represent the expected BLOCK
Or motif….and we have only limited amounts of data! Is it a good statistical
Sampling of the BLOCK/motif? Is it too narrow because of small dataset?
Should we broaden it by adding extra amino acids that we choose using
Some type of randomization scheme (called adding pseudocounts). If so,
How many should we add?
Finding patterns (i.e. motifs and
domains) in Multiple Sequence Analysis
Block Analysis, Position Specific Scoring Matrices (PSSM)
BUILD an msa from groups of related proteins
BLOCKS represent a conserved region in that msa
that is LACKING IN GAPS – i.e. no insertions/deletions
The BLOCKS are typically anwhere from 3-60 amino acids long,
based on exact amino acid matches – i.e. alignment will tolerate
mismatches, but doesn’t use any kind of PAM or
BLOSUM matrix…in fact they generate the BLOSUM matrix!
These blocks may be whole domains, short sequence motifs,
key parts of enzyme active sites etc, etc.
Position Specific Scoring Matrices
PSSM
12345………….11
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….GDSYWNYASFG…..
This BLOCK might represent some small part of a modular protein
domain, or might represent a motif for something
…..like a phosphorylation site on the S in position 9
Now build a matrix with 20 amino acids as the columns, and 11 rows
For the positions in the BLOCK
Each matrix entry is the Log(frequency of the amino acid occurance)
at that position in the BLOCK.
Posit
i
on
12345………….11
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….GDSYWNYASFG…..
T…A C D E F G H I K…. S
1
Log(3)
Log(2)
2 Log(5)
3
4
5
We can now use the PSSM to look for the BLOCK (motif) in single proteins
-or
use the PSSM to search a database for other proteins that
have the BLOCK (or motif).
Problem 1 –The PSSM must accurately represent the expected BLOCK
Or motif….and we have only limited amounts of data! Is it a good statistical
Sampling of the BLOCK/motif? Is it too narrow because of small dataset?
Should we broaden it by adding extra amino acids that we choose using
Some type of randomization scheme (called adding pseudocounts). If so,
How many should we add?
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….GDSYWNYASFG…..
Problem 2 –We need to think about what kind of information is
Contained within the PSSM.
→ Leads to concepts of Information Content & Entropy
Pseudocounts
?If the number of sequences in the training set is both large and diverse,
then the sequences in the training set represent a good statistical
sampling of the motif….if not, then we have a sampling error!
Correct for this by adding pseudocounts. How many to add?
→ Too many pseudocounts dominate the frequencies…
and the resulting matrix won’t work!
→ Too few pseudocounts then we’ll miss many amino
acid variations, and matrix will only find sequences
that produced the motif!
Add few pseudocounts if sampling is good (robust), and add
more pseudocounts if sampling is sparse
One reasonable approach is to add √N pseudocounts, where
N is the number of sequences…
As N increases, the influence of pseusocounts decreases since
N increases faster than √N, but doesn’t add enough at low N
Pseudocounts
How do we choose which amino acids to use
for the pseudocounts?
Approach 1 – Choose the randomly based on f(i), the frequency of
With which amino acid I is found in all proteins in GenBank…
……seems wrong somehow, since not all amino acids should
substitute for others in an alignment….
Approach 2 – Use information from scoring matrices (i.e. PAM250 or
BLOSUM 62 matrices, since these give estimates of amino acid
substitutions….OK, but may not make optimal use of the data from an
aligned sequence block
Approach 3 – Use data from some subset of other aligned sequence
BLOCKS as an estimate of the amino acid distribution in the block
we are working with. These capture some features of protein
structure and function. Sjolander et al have prepared such amino acid
distributions and these are referred to as Dirichlet mixtures.
A word on Dirichlet Mixtures
Concept comes from Bayesian statistics….idea of posterior
distributions. Obtained by multiplying prior probabilities
(i.e. what you know) with likelihood of new data ….OK,
maybe more on this later
An easier to understand, almost correct explanation:
(Bayes without tears)
Suppose in one column of an alignment of 27 sequences we find:
12 G’s, 6 T’s, 6 F’s and 3 P’s.
The probability of such a distribution of amino acids is:
P(12G, 6T, 6F, 3P)=n!(pG)
12
(pT)
6
(pF)
6
(pP)
3
Where n! = (12+6+6+3)!
12! 6! 6! 3!
This is called a multinomial distribution.
A word on Dirichlet Mixtures
We want to get at the prior probablities that would generate
this type of multinomial distribution,
i.e. that would statistically favor the choice amino acids we
actually observed in our alignment.
We also have to consider several other alternatives that are
close to the exact distribution of amino acids we actually
observed, but might also be reasonable, i.e.
11 G’s, 7 T’s, 5 F’s and 4 P’s.
This would have a slightly different multinomial distribution
So we also need to consider the prior probabilities that
would reflect these other similar, but not identical amino
acid distributions for the alignment
Finally, we weigh all of these different prior probabilities that
would generate these similar but not identical multinomial
distributions into a combined frequency distribution…
AND THIS IS THE DIRICHLET MIXTURE
The last word on Dirichlet
Mixtures
At the moment, these Dirchelet mixtures provide
the best way to analyze amino acid compositions
in an alignment that can detect weak but
significant sequence similarity.
Dirchelet mixtures are used extensively in several
Hidden Markov Models of sequence alignments
(later in this lecture).
Why? Because you can use fewer sequences in the training set
(i.e. 20 or so for HMMs) and still derive good motifs!
Information Content
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….LDSYWNYASFG…..
Derived from the field of Information Theory
Concept is simple – some positions in an alignment provide
alot more information than others as far as identifying a motif
is concerned
How do we quantitate this?
Information Theory
Originated with Claude Shannon,
an MIT PhD who worked for Bell Labs
and was interested in how much information
you could transmit, how fast you could send it,
and how accurately you could
transmit it, across telephone lines
Photograph courtesy of Lucent Technologies. http://www.bell-labs.com/news/2001/february/26/1.html
Copyright notice: http://www.lucent.com/copyright.html
Information Theory
What is the minimum amount of information necessary
to convey a message unambiguously?
How much information is present in each position
in a sequence motif. How can we quantitate
information content in motif sequences? How can this
help us understand the biological meaning of motifs?
How do we quantitate the uncertainty in a message?
Complex Problem
I put 1,048,576 identical boxes in a
straight line and I hide $500 into one of
them at random
…...
…
Now let’s play 20 questions….(yes/no)
Simpler Example
X
Minimum number of yes-no questions
to locate an object of interest?
Simple Example
X
Minimum number of questions
to locate the object of interest?
X
X
Log
2
8 = 3
Complex Problem
I put 1,048,576 identical boxes in a
straight line and I hide $500 into one of
them at random
…...
…
Now let’s play 20 questions….(yes/no)
2
20
=1,048,576
Information Theory
Since the information was derived using base2 of the total number
of probabilities (i.e. computers), the resulting information is said to
be coded in units of bits (binary digits)
Could have used base e for the logs of the probabilities, (i.e. natural
Logs) – then the resulting information is said to be coded in units of nats
(natural digits)
Could have used base 10 (i.e.) for the logs of the probabilities – then the
resulting information is said to be coded in units of dits
(decimal digits)
Information Theory
So we have the relationship that : 2
b
=M
where M = number of possibilities.
or
b= log
2
(M)
where b=number of bits required to uniquely determine which,
of all possibilties M, a particular value is.
b tells us something about how many bits of information we need
to specify all the possibilities for a position.
Can also write b= - log
2
(1/M)
For equivalent probabilities P = 1/M
b= - log
2
(P)
For 20 amino acids in one position in a sequence motif
If all 20 had an equal chance, P = 1/20
b= - log
2
(1/20) = 4.32 bits of information
Information Theory
What happens when the probabilities aren’t the same?
Concept of surprisals (Tribus)
Watching TV analogy….I flash up one of the amino acids, and you guess
Which one it is…..
Define u
i
= - log2 (P
i
)
u
i
= the surprisal….a measure of how surprised you are
by the amino acid that appears on the screen….
In a sequence alignment, we want to know the average surprisal in
each column of the alignment.
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….LDSYWNYASFG…..
Information Theory
Average surprisal = u
V
+u
I
+u
L
+u
M
+u
A
N
total # of sequences
20
Ni = # of sequences with
In general case: ∑
N
i
u
i
Amino acid i in that position
i=1
N =total # of sequences
N
20
Bring N inside the ∑
N
i
u
i
Ni = # of sequences with
Amino acid i in that position
Summation sign i=1
N
N =total # of sequences
But Ni/N is just Pi, the probability of finding amino acid i in that
position of the alignment.
20
So we get…
∑
P
i
u
i
i=1
We defined u
i
= - log2 (P
i
)
So we end up with Shannon’s famous formula:
20
H = -
∑
Pi(log
2
Pi)
Where H = the “Shannon Entropy”
i=1
In bits per position in the alignment
Information Theory
So we end up with Shannon’s famous formula:
20
H = -
∑
Pi(log
2
Pi)
Where H = the “Shannon Entropy”
i=1
In bits per position in the alignment
What does this mean???
H is a measure of entropy or randomness or disorder
….it tells us how much uncertainty there is for the different
amino acid abundances at one position in a sequence motif
Information Theory
Shannon’s famous formula:
20
Where H = the “Shannon Entropy”
H = -
∑
Pi(log
2
Pi)
In bits per position in the alignment
i=1
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….LDSYWNYASFG…..
Here P(D)=1 so H = -(1 * log
2
(1)) = -(1*0) = 0 !
No uncertainty in this position!!
Here H = -(1/5 * log
2
(1/5)) -(1/5 * log
2
(1/5)) -(1/5 * log
2
(1/5))
-(1/5 * log
2
(1/5)) -(1/5 * log
2
(1/5)) = 2.32
Great uncertainty in this position!!
Information Theory
From Uncertainty to Information
X
Back to our cube
Suppose I would only answer
2 questions…
Now the uncertainty before would have been
3 bits
Uncertainty after is 1 bit
Therefore, the information content you have gained, R, is
R = H
before
–H
after
= 3 bits -1 bit = 2 bits!
Information Theory
….GDSFHQFVSHG…..
….SDAFHQYISFG…..
….GDSYWNFLSFG…..
….SDSFHQFMSFG…..
….LDSYWNYASFG…..
H
before
= 4.32, H
after
=0
Assuming all 20 amino acids equally possible:
Therefore, this position encodes 4.32-0 =4.32 bits of information!
H
Another position in the motif that contains all 20 amino acids…
before
= 4.32, H
after
=4.32
Therefore, this position encodes 4.32-4.32 =0 bits of information!
Sequence Logos
Motif “A” in prenyltransferases,
Made at http://www.bio.cam.ac.uk/seqlogo
For the compulsively nerdy – error bars represent possible error in the stack
due to limited number of sequences
Sequence Logos
1- Horizontal axis is position within a sequence motif
2 - Vertical axis is amount of information in bits that the
Position holds. The height of the individual letters is
H
before
-H
letter
, i.e. how much of that information that particular
letter (i.e. amino acid) contributes
3- Very useful graphical representations that convey
relative information content at various positions in a motif
and contributions from relative amino acids.
SF
D F
Y
D F
D Y
Markov Chains & Models
In PSSMs, each position is evaluated independently.
….GD HQFVSHG…..
….S QYISFG…..
….GDS WNFLSFG…..
…. HQFMSFG…..
….L S WNYASFG…..
A H
S S
Note – we can use PSSMs for aligning sequences,
finding domains, mapping motif sites, and even
secondary structure prediction….
Are we capturing all the information?
F
F
Y
Markov Chains & Models
What other information is there?
….GDSFHQ FVSHG…..
….SDAFH QYISFG…..
….GDSYWN FLSFG…..
….SDSFHQ FMSFG…..
….LDSYWNYASFG…..
How do we model this? …A Markov model!
….GD
….SD
….GD
….SD
Q
….LD
H
WN
HQ
YWN
A Markov Chain
Probability parameters
A collection
=transition probabilities
of states:
….GDSFHQFVSHG…..
….SDAF QYISFG…..
….GDSY FLSFG…..
….SDS MSFG…..
….LDS YASFG…..
F H
W
Y
F F
Net probabilities for the entire sequence involve conditional probabilities – Bayes!
The probability of any amino acid depends ONLY on the value of the preceding
amino acid.
QF
Y
F
SHG…..
SFG…..
SFG…..
SFG…..
SFG…..
Markov Models
Hidden Markov Models
Statistical model that includes all combinations of matches,
mismatches and gaps (prior slide)
Very useful to obtain best multiple sequence alignment or
find new members of the alignment!
Make a starting model, and use 20-100 aligned sequences to train
it (i.e. get the transition probabilities tuned).
Often use Dirichlet mixtures when deciding which amino acid
substitutions to allow in a motif
Try and generate all the sequences in the training set by examining
all possible paths through the model. This might be a huge computational
problem, but it is reduced using something called the
“Forward-Backward” algorithm
Another algorithm, Baum-Welch, counts the number of times each
state-to-state transition was used, and that a particular amino acid is
required to generate one of the sequences in the training set
Make an improved HMM that uses the results of the Baum-Welch
algorithm, and repeat cyclically until the parameters (i.e. the
transition probabilities) stabilize.
Now you have a trained model that can provide the most likely
path to generate each sequence. There is another
algorithm, the Viterbi algorithm, that uses dynamic programming
to find the best path to generate each sequence, given the model.
This collection of paths, (i.e. match-match-insertion-match…) defines
the final multiple sequence alignment.
Note that in the end, all you have are transition probabilities
for each amino acid at each position in an alignment. The
model that generated these is “hidden”
…The Hidden Markov Model!
Get multiple sequence alignments as good or better than those
Obtained with other techniques.
Both Pfam and SMART, two software programs that search for domains
In amino acid quieries, use HMMs
Database Searching
Problem is simple:
I want to find homologues to my protein in the database
How do I do it?
Do the obvious – compare my protein against
every other protein in the database and look
for local alignments by dynamic programming
Uh Oh!
1 n
For k sequences in the
1
12345678…. Database
12345678….
this becomes an O(mnk)
12345678….
problem!
12345678….
12345678….
m
12345678….
….essentially an O(mn) problem
Database Searching
Still, this can be done - ~ 50x slower than Blast/FASTA,
Smith-Waterman algorithm…
SSEARCH (ftp.virginia.edu/pub/fasta) – do it locally!
But in the old days, needed a faster method…
2 approaches – Blast, FASTA – both heuristic
(i.e. tried and true) – almost always finds related
Proteins but cannot guarantee optimal solution
FASTA: Basic Idea
1- Search for matching sequence patterns or words
Called k-tuples, which are exact matches of “k” characters
between the two sequences
i.e. RW = 2-tuple
Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
Database Searching
FASTA: Basic Idea
2- Repeat for all possible k-tuples
i.e. CV = 2-tuple
Seq 1: AHFYRWNKLCV Seq 2: DRWNLFCVATYWE
3- Make a Hash Table (Hashing) that has the position of
each k-tuple in each sequence
2-tuple pos. in Seq1
RW 5
CV 10
AH 1
i.e.
pos in Seq 2 Offset (pos1-pos2)
2 3
7 3
---- ---
Database Searching
AHFYRWNKLCV DRWNLFCVATYWESeq 1: Seq 2:
3- Make a Hash Table
i.e.
(Hashing) that has the position of
each k-tuple in each sequence
2-tuple pos. in Seq1 pos in Seq 2 Offset (pos1-pos2)
3
3
RW 5 2
CV 10 7
AH 1 ---- ----
4- Look for words (k-tuples) with same offset
These are in-phase and reveal a region of alignment
between the two sequences.
5- Build a local alignment based on these, extend it outwards
Seq 1: AHFYRWNKLCV
Seq 2: DRWNLFCVATYWE
Database Searching
With hashing, number of comparisons is proportional
To the average sequence length (i.e. an O(n) problem),
Not an O(mn) problem as in dynamic programming.
Proteins – ktup = 1-2,
Nucleotides, ktup=4-6
One big problem – low complexity regions.
Seq 1: AHFYPPPPPPPPFSER
Seq 2: DVATPPPPPPPPPPPNLFK
Database Searching
BLAST
Same basic idea as FASTA, but faster and more sensitive!
How?
BLAST searches for common words or k-tuples, but
limits the search for k-tuples that are most significant,
by using the log-odds values in the Blosum62 amino
acid substitution matrix
i.e. look for WHK and might accept WHR but
not HFK as a possible match (note 8000 possibilities)
Repeat for all 3-tuples in the query
Search the database for a match to the top 50 3-tuples that
match the first query position in the sequence, the second
query position, etc.
Use any match to seed an ungapped alignment (old BLAST)
Database Searching
Word length is fixed: 3-tuple for proteins
11-tuple for nucleotides
By default, filters out low complexity regions.
Determine if the alignment is statistically significant.
calculates the probability of observing a score greater
than or equal to your alignment based on extreme
value distribution.
Calculates an E-value = expectation value:
This is the probability of finding an unrelated sequence
that shows this good an alignment just by chance.
Remember if p=.0001 and my database has 500,000
sequences, I will have an E=50! (normal starting E=10)
See: http://www.nlm.nih.gov
Psi-BLAST
Position-specific iterative BLAST
Combines BLAST searching with PSSMs!
1- Start with regular BLAST search – look at the results
Psi-BLAST
Position-specific iterative BLAST
Combines BLAST searching with PSSMs!
1- Start with regular BLAST search – look at the results
2- Pick the ones you believe are really homologous
Psi-BLAST
Position-specific iterative BLAST
Combines BLAST searching with PSSMs!
1- Start with regular BLAST search – look at the results
2- Pick the ones you believe are really homologous
3- Now align these sequences to the query sequence
and make up a PSSM that tells how much to weigh
each amino acid in each position in the alignment
4- Use this PSSM to do another BLAST search
5- Add any new sequences that come up to the old ones
if you believe they are really homologous
6- Repeat the alignment to make a new and improved PSSM
that tells how much to weigh each amino acid in each
position in the alignment
Psi-BLAST
7-- Use this PSSM to do another BLAST search
8– Keep iterating until no new sequences are found
Very good for finding weakly related sequences
The End !!!!!