N - Algorithms

Six different algorithms can be used to align sequence data.

The first is the optimistic algorithm (opt). It computes the similarity over the region that runs from the first to the last base that the two sequences under comparison have in common.

Leading and trailing non-overlapping regions are not counted, but intermediate non-overlapping regions are counted as mismatches.

The pessimistic algorithm (pes) takes into account all portions of both sequences, including those that are not in common.

Mutations, deletions and insertions are therefore all counted as mismatches.

The third algorithm is the super-optimistic one (SupOpt). It uses only the overlapping region(s) between the two sequences under comparison and, like the pessimistic algorithm, counts mutations, deletions and short insertions as mismatches.

The next three algorithms are called opt2dir, pes2dir and supopt2dir because they align both the original sequence and its reverse complement, in an optimistic, a pessimistic and a super-optimistic way respectively.

Only the better-matching of the two orientations is then used for the similarity comparison.
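The two-direction idea can be sketched in a few lines of Python. This is only an illustration, not the BioloMICS implementation: the names `reverse_complement`, `toy_similarity` and `best_orientation` are hypothetical, the scoring function is a gap-free stand-in for a real alignment, and the sketch assumes that it is the source sequence that gets reverse-complemented.

```python
# Hypothetical helpers for illustration only; not part of BioloMICS.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def toy_similarity(a: str, b: str) -> float:
    """Stand-in score: fraction of identical bases, position by position, no gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def best_orientation(source: str, reference: str, similarity=toy_similarity):
    """Score both orientations of the source and keep the better one."""
    forward = similarity(source, reference)
    reverse = similarity(reverse_complement(source), reference)
    if reverse > forward:
        return reverse, "reverse complement"
    return forward, "original"


print(best_orientation("tgtaatc", "gattaca"))  # (1.0, 'reverse complement')
```

Any of the three scoring schemes could be passed as `similarity`; the orientation logic stays the same.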

When several sequences are stored in the same field of a single record and have to be compared with one or several sequences of another record, only the best-matching pair of sequences is used to compute the similarity coefficient.

A threshold value can also be set so that only alignments with a minimum number of base pairs in common are kept.
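A minimal sketch of the best-pair selection with a threshold could look as follows. The function `best_pair`, its `min_common` parameter and the `score_pair` callable are hypothetical names. The sketch assumes that each alignment reports both a similarity and a count of bases in common, which the text does not specify in detail.

```python
from itertools import product


def best_pair(seqs_a, seqs_b, score_pair, min_common=0):
    """Return (similarity, seq_a, seq_b) for the best-scoring pair, or None.

    score_pair(a, b) is a hypothetical callable returning
    (similarity, common_bases) for the alignment of a with b.
    """
    best = None
    for a, b in product(seqs_a, seqs_b):
        similarity, common = score_pair(a, b)
        if common < min_common:
            continue  # too few bases in common: discard this alignment
        if best is None or similarity > best[0]:
            best = (similarity, a, b)
    return best


# Made-up scores standing in for real alignments.
scores = {
    ("a1", "b1"): (0.9, 5),
    ("a1", "b2"): (0.5, 40),
    ("a2", "b1"): (0.7, 30),
    ("a2", "b2"): (0.6, 50),
}
lookup = lambda a, b: scores[(a, b)]

print(best_pair(["a1", "a2"], ["b1", "b2"], lookup))                 # (0.9, 'a1', 'b1')
print(best_pair(["a1", "a2"], ["b1", "b2"], lookup, min_common=20))  # (0.7, 'a2', 'b1')
```

With the threshold set to 20, the highest-scoring pair is rejected because it shares only 5 bases, so the next best pair is returned.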

Let’s use a very simple example to demystify these algorithms. A source and a reference DNA sequence are created as follows:

Source DNA:

         10        20        30        40        50        60
gcttggagtcaccgcagacgttaacgggaaccgacgttgtcaccggggacaccctcctcttcc

Reference DNA:

ttctttcttggagtcaccgcagacgttaccacggcggacttcgcattatatagcgcatagcgcgcaggcgagagagctct
         10        20        30        40        50        60        70        80

tcatattatatcgatctcgatcatgccttgacggaaaccgacgttgtcaccggggacacctcagg
         90       100       110       120       130       140

The alignment by hand of these two sequences gives the following result:

1 10 20 30 40 50 60

gcttggagtcaccgcagacgtta ac g gg aaccgacgttgtcaccggggacaccctcctcttcc

:::::::::::::::::::::: :: : :: ::::::::::::::::::::::::: ::

ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg

1 10 20 30 110 120 130 140

There are 54 identical nucleotides. The similarity is this value (54) divided by a denominator that can be computed in different ways.

The best alignment returned by BioloMICS is:

1 10 20 30 40 50 60

gcttggagtcaccgcagacgtta acgg g aaccgacgttgtcaccggggacaccctcctcttcc

:::::::::::::::::::::: :::: : ::::::::::::::::::::::::: ::

ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg

1 10 20 30 110 120 130 140

Note that this alignment is slightly different from the previous one, which shows that the alignment solution is not unique.

Super-Optimistic

1 10 20 30 40 50 60

gcttggagtcaccgcagacgtta acgg g aaccgacgttgtcaccggggacaccctcctcttcc

:::::::::::::::::::::: :::: : ::::::::::::::::::::::::: ::

ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg

<------- Super-optimistic ---> + <--------- denominator ---->

In this case, large insertions are ignored. The denominator is the sum of the sizes of the similar segments. In the case above, there are two similar blocks. Block 1 runs in the source from nucleotide 2 to 28 with 3 insertions, so its size is 28 – 2 + 1 + 3 = 30. Block 2 covers nucleotides 29 to 55 with one insertion, so its size is 55 – 29 + 1 + 1 = 28. The denominator is 30 + 28 = 58 and the similarity is 54 / 58 = 93.10 %.
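The arithmetic above can be reproduced with a short Python sketch. The block coordinates and insertion counts are taken from the text; the helper name `block_size` and the tuple layout are assumptions made for illustration.

```python
def block_size(start: int, end: int, insertions: int) -> int:
    """Size of one similar block: its inclusive span plus its insertions."""
    return end - start + 1 + insertions


# (start, end, insertions) for each similar block, as given in the text.
blocks = [(2, 28, 3), (29, 55, 1)]
identical = 54  # identical nucleotides in the alignment

denominator = sum(block_size(*block) for block in blocks)
similarity = 100 * identical / denominator

print(denominator)            # 58
print(f"{similarity:.2f} %")  # 93.10 %
```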

Pessimistic

In this case, the denominator is the distance from the global leftmost nucleotide to the global rightmost nucleotide, whichever sequence each belongs to, as shown in the figure below.

gcttggagtcaccgcagacgtta acgg g aaccgacgttgtcaccggggacaccctcctcttcc

:::::::::::::::::::::: :::: : ::::::::::::::::::::::::: ::

ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg

<------------------------- Pessimistic denominator ---------------------------------->

In the example above, the denominator is 150 and the similarity is:

Sim = 54 / 150 = 36.00 %

Optimistic

In this case, the denominator is the distance from the leftmost nucleotide of the sequence that starts last to the rightmost nucleotide of the sequence that ends first, as shown in the figure below.

gcttggagtcaccgcagacgtta acgg g aaccgacgttgtcaccggggacaccctcctcttcc

:::::::::::::::::::::: :::: : ::::::::::::::::::::::::: ::

ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg

<-------------------- Optimistic denominator ------------------------------->

In the example above, the denominator is 141 and the similarity is:

Sim = 54 / 141 = 38.30 %
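The pessimistic and optimistic denominators can both be derived from the alignment columns that each sequence occupies. The sketch below is illustrative only. The column spans (reference in columns 1 to 146, source in columns 6 to 150) are not stated in the text. They are inferred here from the figures and are an assumption, as are the function names and the inclusive counting.

```python
def pessimistic_denominator(span_a, span_b):
    """Columns from the leftmost start to the rightmost end, counted inclusively."""
    return max(span_a[1], span_b[1]) - min(span_a[0], span_b[0]) + 1


def optimistic_denominator(span_a, span_b):
    """Columns from the later start to the earlier end, counted inclusively."""
    return min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]) + 1


# Assumed (first_column, last_column) of each sequence in the alignment.
reference_span = (1, 146)
source_span = (6, 150)
identical = 54  # identical nucleotides, from the text

pes = pessimistic_denominator(source_span, reference_span)
opt = optimistic_denominator(source_span, reference_span)

print(pes, f"{100 * identical / pes:.2f} %")  # 150 36.00 %
print(opt, f"{100 * identical / opt:.2f} %")  # 141 38.30 %
```

Under these assumed spans, the sketch reproduces the denominators of 150 and 141 given in the text, and the gap to the super-optimistic value of 93.10 % comes entirely from the denominator, since the 54 identical nucleotides are the same in all three cases.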