Note on sequence alignment quality:
Knowing the correct reading-frame is very important for the
statistical mutation model. When estimating the probability that a
given nucleotide in a CDS will mutate, the software needs to know
which codon the nucleotide is in, and its position within that codon.
The reference sequence CDS files are used to define all the
reading-frames. Every coding nucleotide in the reference sequence is
assigned a codon position (1st, 2nd or 3rd) and read-direction
(forward or reverse). Nucleotides in the non-reference sequences
inherit their codon position and read-direction purely from the
reference sequence nucleotides with which they align. (See also note
on reference sequence quality.)
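As an illustration of this bookkeeping, the following Python sketch
assigns a codon position and read-direction to every coding nucleotide
of a reference sequence. It is not the software's own code: the
function name codon_positions and the (start, end, strand) tuple
format are assumptions made for the example, and overlapping CDSs are
not handled (a later CDS simply overwrites an earlier one).

```python
def codon_positions(ref_length, cds_list):
    """Map each 0-based reference coordinate to (codon_position, direction).

    cds_list holds hypothetical (start, end, strand) tuples: 0-based
    start, exclusive end, strand '+' or '-'. Non-coding coordinates
    map to None.
    """
    frame = [None] * ref_length
    for start, end, strand in cds_list:
        if (end - start) % 3 != 0:
            raise ValueError("CDS length is not a multiple of three")
        direction = "forward" if strand == "+" else "reverse"
        for i in range(start, end):
            # On the reverse strand the first codon position is at the
            # highest coordinate, so count back from the end.
            offset = i - start if strand == "+" else end - 1 - i
            frame[i] = (offset % 3 + 1, direction)
    return frame
```

For example, codon_positions(8, [(0, 6, "+")]) labels coordinates 0 to
5 with positions 1, 2, 3, 1, 2, 3 in the forward direction and leaves
coordinates 6 and 7 as None.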
In order to maintain the correct reading-frame in the different
sequences, it is important that gaps in the alignment occur in groups
of three within coding sequences. A useful programme for doing this
is code2aln (Stocsits 2003), a sequence alignment programme that tries
to keep gaps in groups of three in CDSs and smoothly joins coding
regions onto non-coding regions. A newer version, called codaln, is
also available (Stocsits et al. 2005).
Provided you don't have too many sequences, you can enter them at this
site, which uses code2aln to realign them. If your sequences are
purely coding, you can instead align the translated (amino acid)
sequences with, e.g., CLUSTALW and then map the nucleotide sequences
back onto the amino acid alignment codon by codon.
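For purely coding sequences, one way to turn an amino acid alignment
into a nucleotide alignment with gaps in groups of three is to expand
each aligned residue into its codon and each gap into three gap
characters. The Python sketch below shows the idea for a single
sequence. The function name thread_codons is hypothetical, and the
sketch assumes that the unaligned CDS contains exactly one codon per
residue of the aligned protein (so a terminal stop codon must be
either present in both or removed from both).

```python
def thread_codons(aligned_protein, cds_nucleotides, gap="-"):
    """Expand an aligned protein into a codon-aligned nucleotide sequence."""
    codons = [cds_nucleotides[i:i + 3]
              for i in range(0, len(cds_nucleotides), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds_nucleotides) % 3 != 0 or residues != len(codons):
        raise ValueError("CDS does not match the aligned protein")
    next_codon = iter(codons)
    out = []
    for ch in aligned_protein:
        # A protein gap becomes three nucleotide gaps; a residue
        # consumes the next codon from the unaligned CDS.
        out.append(gap * 3 if ch == gap else next(next_codon))
    return "".join(out)
```

For example, thread_codons("M-K", "ATGAAA") gives "ATG---AAA".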
Note also that in places where the reference sequence contains
alignment gaps, there is no frame information for the non-reference
sequences. As far as calculation of statistics is concerned, all such
regions are omitted. However, for the stop and start codon annotation
on some of the output plots, any non-reference sequence stops or
starts within reference sequence gaps will be missed.
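The following Python sketch illustrates why columns at reference
sequence gaps carry no frame information. It walks along the aligned
reference sequence and copies a per-nucleotide frame label to each
alignment column, leaving None wherever the reference has a gap. The
function name column_frames and the ref_frame list (one label per
ungapped reference nucleotide) are assumptions made for the example.

```python
def column_frames(aligned_ref, ref_frame, gap="-"):
    """Return one frame label per alignment column.

    ref_frame holds one label per ungapped reference nucleotide.
    Columns where the reference has a gap get None, so any
    non-reference nucleotides there have no frame information.
    """
    frames = []
    ref_pos = 0
    for ch in aligned_ref:
        if ch == gap:
            frames.append(None)
        else:
            frames.append(ref_frame[ref_pos])
            ref_pos += 1
    return frames
```

For example, column_frames("AT-GC", [1, 2, 3, 1]) gives
[1, 2, None, 3, 1]: a non-reference nucleotide in the third column has
no codon position.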
If your alternate model is an ORF overlapping and entirely contained
within a known CDS, then alignment on the known CDS amino acid
sequence will ensure that gaps within the putative CDS occur in groups
of three. If, however, the putative CDS is partially or entirely
contained within a region that is non-coding in the null model, then
it will be harder to ensure that the alignment programme maintains
gaps within the putative CDS in groups of three. You can improve the
chances that it does so by annotating the putative CDSs, as well as
the known CDSs, on the alignment server page.
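After realignment it may be worth checking that the gaps within a
putative CDS really do occur in groups of three. The Python sketch
below reports every gap run in a given range of alignment columns
whose length is not a multiple of three. The function name
bad_gap_runs and the column-range arguments are hypothetical. Gap runs
that extend beyond the range are counted only up to its edge.

```python
import re

def bad_gap_runs(aligned_seq, region_start, region_end, gap="-"):
    """Return (start_column, length) for each gap run inside the
    0-based column range [region_start, region_end) whose length is
    not a multiple of three."""
    region = aligned_seq[region_start:region_end]
    pattern = re.escape(gap) + "+"
    return [
        (region_start + m.start(), len(m.group()))
        for m in re.finditer(pattern, region)
        if len(m.group()) % 3 != 0
    ]
```

For example, bad_gap_runs("ATG---AA-TGA", 0, 12) gives [(8, 1)]: the
three-column gap is accepted, while the single-column gap at column 8
would shift the reading-frame.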
References:
- Roman R. Stocsits, 2003, Nucleic Acid Sequence Alignments of
Partly Coding Regions, PhD thesis, University of Vienna.
- Roman R. Stocsits, Ivo L. Hofacker, Claudia Fried, Peter
F. Stadler, 2005, Multiple Sequence Alignments of Partially Coding
Nucleic Acid Sequences, BMC Bioinformatics, 6, 160.