Sequence Alignment
Institute of Lifelong Learning, University of Delhi
Subject: Bioinformatics
Lesson: Sequence Alignment
Lesson Developer: Sandip Das
College/Department: Department of Botany, University of Delhi
Table of Contents
Chapter: Sequence Alignment
• Introduction
• Principle of alignment
• Matrices for alignment
  o DNA matrices
  o Protein matrices
• Multiple Sequence Alignment
• Summary
• Exercise/Practice
• Glossary
• References/Bibliography/Further Reading
Introduction
One of the central themes in bioinformatics is the concept of “similarity” and “relatedness”,
which in turn is based on evolutionary relationship or ancestry. We use these themes of
“similarity/relatedness” in a variety of applications, such as:
• Gene and genetic element finding
• Molecular evolution or phylogeny
• Comparative genomics
• Structure prediction through homology modeling, and several others
The principle on which all of these are based is sequence similarity, which can be deduced via
Sequence Alignment.
We can often deduce relationships among objects by identifying similar features or
characters. Alignment attempts to identify similarity between two or more sequences
by applying the same logic, except that the events that may have led to similarity or
dissimilarity (such as the types, frequency and occurrence of mutations) are also taken into
account.
Before we delve into the principles of sequence alignment, it may be useful to refresh some
of the concepts of mutation and evolution and keep them in mind while understanding
alignment.
a. Mutations occur at the level of DNA.
b. Mutations survive, or are accepted, if they are harmless (selectively neutral) or
confer some selective advantage on the organism and population. A mutation that is
harmful, and possibly lethal, will tend to be lost from the population.
c. Small mutations include single-base substitutions, namely transitions (purine to
purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or vice
versa), as well as insertions and deletions of bases.
d. Transitions are more frequently encountered than transversions.
e. Non-coding DNA can accumulate mutations at a higher rate than coding regions
(because changes in coding regions have consequences for the encoded proteins).
f. Due to the degeneracy of codons and wobble bases, not all mutations at the DNA
level have an impact at the protein level; those that do not are deemed silent.
g. Any change in DNA sequence that does not alter the protein sequence is termed
synonymous; a change in DNA that leads to the incorporation of a different
amino acid is termed non-synonymous.
h. Within proteins, replacement of one amino acid by another is rarely
observed within domains or functional units.
i. Amino acids with similar chemical or physical properties are more likely
to replace one another.
j. The rate of evolution of DNA is higher than that of proteins; in other words,
protein sequences are more conserved than DNA sequences.
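Points c, d, f and g above can be illustrated with a short sketch. The function names and the small codon table below are assumptions made for illustration only; the table is a partial excerpt of the standard genetic code, just large enough for the examples.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Partial excerpt of the standard genetic code (illustrative, not complete).
CODON_TABLE = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "CCT": "Pro", "GAA": "Glu", "GAG": "Glu", "GAT": "Asp",
}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'identical', 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("expected one of A, C, G, T")
    if ref == alt:
        return "identical"
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1: str, seq2: str) -> dict:
    """Count transitions and transversions between two equal-length, gap-free DNA strings."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts


def is_synonymous(codon1: str, codon2: str) -> bool:
    """True if both codons encode the same amino acid in the partial table above."""
    return CODON_TABLE[codon1.upper()] == CODON_TABLE[codon2.upper()]


if __name__ == "__main__":
    print(classify_substitution("A", "G"))     # purine to purine
    print(count_substitutions("ACGT", "GCTT"))
    print(is_synonymous("CTT", "CTC"))         # third-position change, Leu to Leu
    print(is_synonymous("CTT", "CCT"))         # second-position change, Leu to Pro
```

In this sketch a third-position change such as CTT to CTC leaves leucine unchanged (a silent, synonymous change), whereas CTT to CCT replaces leucine with proline (a non-synonymous change).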
As alignment aims to find matches between similar residues, concepts of evolutionary
biology are widely used. DNA sequences that shared a last common ancestor up to 600
million years ago, and proteins that diverged up to a billion years ago, can be
successfully aligned.
Principle of Alignment:
Evolution proceeds in small incremental stages, i.e. instead of large-scale
disruptions that span entire genomes, evolution favours small variations spread throughout
the genome. Of course, it is difficult to define the physical boundaries of what
constitutes “large” or “small”! For the sake of simplicity, let us limit our definition of “small”
to single bases or amino acids, and of “large” to several kilobases or even megabases.
As the majority of changes are small, it is possible for us to detect similar
regions within the genome through alignment. We also presume that regions that share
considerable levels of similarity, as measured through alignment, have shared ancestry
or a common evolutionary history. Such regions are termed homologous sequences.
Homology can be further sub-divided into orthology and paralogy, in which the shared
evolutionary history arises through speciation or through duplication, respectively. A note
of caution: two sequences can also share high similarity without sharing recent ancestry.
Such sequences are termed xenologs and are generally acquired through horizontal gene transfer.
Figure: Homologs: Orthologs and Paralogs
Source: Dr Sandeep Das
An alignment attempts to create a matrix of rows and columns in which each row denotes a
sequence and each column is occupied either by similar characters, one derived from each
sequence, or by a gap. Pairwise alignment aligns two sequences at a time, whereas
multiple sequence alignment (MSA) aligns more than two sequences. If
several sequences are derived from organisms that share a common ancestry
or evolutionary history, we expect these sequences to be similar but not
identical, i.e. we expect to find similar characters or residues and also some
differences. The differences encountered are a result of mutational events:
the more time that has elapsed since the common ancestor, the more mutations have
accumulated and therefore the more dissimilar residues there are. The number of changes is
therefore, to a first approximation, proportional to evolutionary time.
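The row-and-column view of an alignment can be made concrete with a minimal sketch. It assumes the two rows are already aligned, are of equal length, and use "-" as the gap character; the function name and the example sequences are hypothetical. The proportion of mismatched columns among gap-free columns is a simple, uncorrected measure of divergence.

```python
def summarise_alignment(row1: str, row2: str, gap: str = "-") -> dict:
    """Count matching, mismatching and gapped columns of a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same number of columns")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):       # each (a, b) pair is one alignment column
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    compared = matches + mismatches    # gap-free columns only
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "proportion_different": mismatches / compared if compared else 0.0,
    }


if __name__ == "__main__":
    # Two hypothetical aligned rows with one gap and one substitution.
    print(summarise_alignment("ACGTTGA", "AC-TTCA"))
```

For the example rows, five columns match, one is a mismatch and one contains a gap, so one of the six gap-free columns differs.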
Alignment tools therefore try to generate the matrix such that it contains as many identical
and/or similar residues per column as possible. It is worth pointing out that when a mutational
event leads to the insertion or deletion of nucleotides, “gaps” are introduced during
alignment to “mimic” the event and “achieve” an alignment with maximal identity.
Sequence alignment is thus a combination of correctly identifying similar and dissimilar
residues and placing them in columns.
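One way to see how gaps are placed is a minimal sketch of global pairwise alignment by dynamic programming, in the style of the Needleman-Wunsch algorithm. The lesson text above does not name a specific algorithm, so this choice, the function name and the scoring values (match +1, mismatch -1, gap -2) are all assumptions made for illustration.

```python
def global_align(seq1: str, seq2: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (aligned_row1, aligned_row2, score) for a global pairwise alignment."""
    n, m = len(seq1), len(seq2)

    def pair_score(a: str, b: str) -> int:
        return match if a == b else mismatch

    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(seq1[i - 1], seq2[j - 1]),  # align residues
                score[i - 1][j] + gap,                                       # gap in seq2
                score[i][j - 1] + gap,                                       # gap in seq1
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(seq1[i - 1], seq2[j - 1]):
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "ACT")
    print(top)
    print(bottom)
    print(total)
```

With these assumed scores, aligning ACGT with ACT places a single gap opposite the G, which mimics a deletion of that base. Changing the scoring values can change where gaps are placed, which is why the choice of scoring scheme matters.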