MLOGD: Programme for detecting
overlapping coding sequences
Summary: This is a suite of software for detecting new
protein-coding sequences (CDSs) by analysing the pattern of mutations
across an input sequence alignment. In particular, the software can
be used to detect new CDSs that overlap known CDSs in a different
read-frame. Such CDSs can be difficult to detect with standard
gene-finding algorithms (see below). The software is particularly
useful for analysing virus genome alignments, where overlapping genes
and ribosomal frameshifting sites are common.
These programmes are described in Firth A. E., Brown C. M., 2005, Detecting
overlapping coding sequences with pairwise alignments, Bioinformatics,
21, 282-92 and Firth A. E., Brown C. M., 2006, Detecting Overlapping
Coding Sequences in Virus Genomes, BMC Bioinformatics, 7, 75.
You can enter your sequences into the web interface (recommended) or
download the programmes (written in C++ with csh scripts; user
instructions are for Linux) to run locally.
Limitations:
MLOGD is not a 'universal' gene-finder. The CDS has to be subject to
purifying selection (e.g. the HCV F ORF is not detected). Overlapping CDSs
that are less conserved (at the amino acid level) than the genes they
overlap are often missed, so a negative MLOGD signal does not mean that
an ORF is non-coding.
Overlaps in the -2 frame generate false positives.
Scores show some scatter into low positive values, so thresholds need
to be set to avoid false positives. (I like to discard ORFs where the
'mean log likelihood ratio per nucleotide' is less than one sixth of
the 'sequence divergence', i.e. where y < x/6 in the statement 'Sum
over phylogenetic tree is located at (x,y)' at the bottom of the
'likelihood ratio plot' in the 'Test input query CDSs' results page.)
A minimum of ~20 independent base variations is needed (e.g. two
sequences with <89% identity, for a 60-codon ORF).
Sensitivity is reduced when sequences are too divergent
(e.g. <65% pairwise identity), due to 'mutational saturation' effects.
If an ORF is very short, MLOGD may give a false positive if
certain columns are constrained (e.g. due to RNA secondary structure
or a regulatory region).
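The rules of thumb above (the y < x/6 score threshold, the ~20 base
variations, and the 65% identity limit) can be sketched as a small
pre-screening helper. This is only an illustration and not part of the
MLOGD suite: the function names are hypothetical, and the number of
differences between a single sequence pair is assumed to stand in for
'independent base variations'.

```python
def expected_differences(orf_codons: int, pairwise_identity: float) -> float:
    """Expected number of differing nucleotides between two aligned
    sequences across an ORF of the given length (in codons)."""
    return 3 * orf_codons * (1.0 - pairwise_identity)


def enough_variation(orf_codons: int, pairwise_identity: float,
                     min_variations: float = 20.0) -> bool:
    """True if one sequence pair is expected to supply roughly the ~20
    base variations suggested in the text (assumption: differences in a
    single pair approximate 'independent base variations')."""
    return expected_differences(orf_codons, pairwise_identity) >= min_variations


def too_divergent(pairwise_identity: float, min_identity: float = 0.65) -> bool:
    """True if the pair falls below the ~65% identity level at which
    mutational saturation is said to reduce sensitivity."""
    return pairwise_identity < min_identity


def passes_score_threshold(divergence: float, mean_llr_per_nt: float) -> bool:
    """Author's rule of thumb: discard a result when y < x/6, where
    x is the sequence divergence and y is the mean log likelihood
    ratio per nucleotide. Returns True when the result is kept."""
    return mean_llr_per_nt >= divergence / 6.0
```

For example, expected_differences(60, 0.89) gives about 19.8, which is
consistent with the '<89% identity for a 60-codon ORF' guideline.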
Overview:
Overlapping protein-coding sequences (CDSs) are particularly common in
viruses but also occur in more complex genomes. Detecting such genes
with conventional gene-finding algorithms can be difficult for several
reasons. Due to the double-coding constraints, overlapping CDSs often
display an atypical codon bias. Extending training-set methods, such
as HMMs, to overlapping CDSs is made difficult by the several
different frames (each requiring its own model) and limited training
data. Similarity to known sequences or conservation between species
may only point to the existence of one of an overlapping pair.
Furthermore, overlapping genes on the same read-strand (e.g. at
ribosomal frameshifting sites) may have the same promoter and mRNA, so
that looking for promoters or transcription may only identify one of the
two genes. Nonetheless overlapping CDSs have their own signatures
resulting from the mutational constraints imposed by the requirement
of simultaneously maintaining protein function in both genes.
The original MLOGD was a suite of software for analysing the mutation
patterns in a multiple sequence alignment and estimating the relative
likelihood that a given sequence region is single-coding or
double-coding. The mutation model includes a nucleotide mutation
matrix, codon usage table and amino acid substitution matrix. The
suite also included a Monte Carlo single/double-coding sequence
evolution simulator, for determining confidence scores and other
statistics (as a function of sequence composition, length, divergence
time and the double-coding frame).
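As a rough illustration of the quantity being estimated, the
single-coding versus double-coding comparison can be written as a log
likelihood ratio. The notation below (D, H_single, H_double, Lambda, N)
and the use of the natural logarithm are assumptions made here for
illustration; see the cited papers for the definitions actually used.

```latex
\Lambda = \ln \frac{P(D \mid H_{\mathrm{double}})}{P(D \mid H_{\mathrm{single}})},
\qquad
\bar{\Lambda} = \frac{\Lambda}{N}
```

Here D is the observed pattern of mutations in the aligned region,
H_single and H_double are the single-coding and double-coding
hypotheses, and N is the number of nucleotides in the region. Under
this convention a positive value favours double-coding, and the
per-nucleotide mean allows regions of different length to be compared.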
The current version of MLOGD has been improved in several respects,
and is now much more user-friendly than the original version. There
are three running modes. In the 'Test input query CDSs' mode, the
user inputs an alignment, annotation of known CDSs in a reference
sequence, and the position of a query, or hypothetical, CDS. MLOGD
then calculates the likelihood ratio between the null model (only the
known CDSs are coding) and the alternate model (both the known CDSs
and the query CDS are coding). This may involve combinations of the
non-coding, single-coding and double-coding mutation models. In the
'Find and test all non-annotated ORFs' mode, MLOGD will look for all
non-annotated ORFs above a given length in the reference sequence, and
calculate the above likelihood ratio for each ORF. In the 'Six-frame
sliding window plots' mode, MLOGD will calculate the likelihood
ratio in sliding windows in all six possible read-frames. Positive
regions in the plots may indicate unannotated CDSs.
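The first step of the 'Find and test all non-annotated ORFs' mode,
listing candidate ORFs above a given length in the reference sequence,
can be sketched as follows. This is a minimal illustration and not the
MLOGD implementation: it assumes an ORF runs from an ATG to the next
in-frame stop codon, scans the forward strand only, and uses
hypothetical function names and coordinate conventions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 50):
    """Return (frame, start, end) tuples for ATG-to-stop ORFs on the
    forward strand. Frames are numbered 1-3; coordinates are 1-based
    and inclusive, with 'end' covering the stop codon. The length
    cut-off counts codons excluding the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the current ORF's ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((frame + 1, start + 1, i + 3))
                start = None
    return orfs


def non_annotated(orfs, annotated):
    """Drop ORFs whose (start, end) coordinates match a known CDS.
    'annotated' is a set of (start, end) pairs (assumed format)."""
    return [orf for orf in orfs if (orf[1], orf[2]) not in annotated]
```

Each ORF that survives this filter would then be treated as a query
CDS and scored with the likelihood ratio described above.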
In addition, lots of pretty graphics are produced (see guided tour for examples). See also
Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences
with pairwise alignments, Bioinformatics, 21, 282-92 and
Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in
Virus Genomes, BMC Bioinformatics, 7, 75 for
more detail on the algorithms.
Notes:
You must agree to the Terms of Usage
before using any of this software.
If you use this software for publications, please cite Firth A. E.,
Brown C. M., 2005, Detecting overlapping coding sequences with pairwise
alignments, Bioinformatics, 21, 282-92 or
Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in
Virus Genomes, BMC Bioinformatics, 7, 75.
Queries or comments to Andrew Firth (aef24@cam.ac.uk).
AEF gratefully acknowledges funding from the Foundation for Research,
Science and Technology, grant number UOOX0304.
CMB gratefully acknowledges funding from the NZ Health Research Council.