MLOGD: Programme for detecting overlapping coding sequences

Summary: This is a suite of software for detecting new protein-coding sequences (CDSs) by analysing the pattern of mutations across an input sequence alignment. In particular, the software can be used to detect new CDSs that overlap known CDSs in a different read-frame. Such CDSs can be difficult to detect with standard gene-finding algorithms (see below). The software is particularly useful for analysing virus genome alignments, where overlapping genes and ribosomal frameshifting sites are common.

These programmes are described in Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92 and Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in Virus Genomes, BMC Bioinformatics, 7, 75.

You can enter your sequences into the web interface (recommended) or download the programmes (written in C++ and csh scripts; user instructions for LINUX) to run locally.

Please use the following login details if requested. Note that these will only allow access to public parts of this site. If you get an 'access denied' error then you are probably trying to access a non-public part. Please contact me (aef24@cam.ac.uk).

Web interface (guided tour; caveats).
Software download.
Supplementary material.
Virus database (retrieval form; 640 virus NCBI refseqs, each clustered with its own genome neighbours).
Generate plots on-the-fly for 630 NCBI virus refseq alignments.
Generate plots on-the-fly for 18377 Mammalian NCBI Homologene groups (mRNA alignments).
Generate plots on-the-fly for 5449 yeast CDS alignments.
Old webpage (Firth & Brown, 2005, Bioinformatics, 21, 282-92).
Terms of Usage.


Caveats (see also Supplementary material):

Not a 'universal' gene-finder. The CDS has to be subject to purifying selection (e.g. HCV F ORF not detected). Overlapping CDSs that are less conserved (at the amino acid level) than the genes they overlap are often missed, so a negative MLOGD signal doesn't mean an ORF is non-coding.
-2 frame overlaps generate false positives.
There is some scatter into low positive scores, so thresholds need to be set to avoid false positives. (I like to discard ones where the 'mean log likelihood ratio per nucleotide' is less than one sixth of the 'sequence divergence' [i.e. where y < x/6 in the statement 'Sum over phylogenetic tree is located at (x,y)' at the bottom of the 'likelihood ratio plot' in the 'Test input query CDSs' results page].)
Need minimum of ~20 independent base variations (e.g. two sequences with <89% identity, for a 60-codon ORF).
Reduced sensitivity when sequences are too divergent (e.g. < 65% pairwise identity) due to 'mutational saturation' effects.
If ORF is very short, MLOGD may give a false positive if certain columns are constrained (e.g. due to RNA secondary structure or regulatory region).
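
The two numerical rules of thumb above (the y < x/6 cut and the ~20 base variations minimum) can be checked with a few lines of Python. This is only an illustrative sketch and not part of the MLOGD suite: the function names and the example (x, y) values are made up for this page, and counting mismatches between two sequences is just a crude stand-in for 'independent base variations'.

```python
def passes_threshold(divergence, mean_llr_per_nt, slope=1.0 / 6.0):
    """Keep a query CDS unless y < x/6.

    divergence      -- x, the 'sequence divergence' from the results page
    mean_llr_per_nt -- y, the 'mean log likelihood ratio per nucleotide'
    """
    return mean_llr_per_nt >= slope * divergence


def expected_variations(n_codons, pairwise_identity):
    """Approximate number of differing nucleotides between two aligned
    sequences spanning an ORF of n_codons codons (assumption: mismatches
    are used as a rough proxy for independent base variations)."""
    return 3 * n_codons * (1.0 - pairwise_identity)


# Hypothetical (x, y) values, for illustration only.
print(passes_threshold(0.30, 0.08))   # True  (0.08 >= 0.30/6 = 0.05)
print(passes_threshold(0.30, 0.03))   # False (0.03 <  0.05)

# The 60-codon, 89% identity example from the caveat above:
# 180 nt x 11% differences is roughly 20 variations.
print(round(expected_variations(60, 0.89), 1))   # 19.8
```

The one-sixth slope is a personal preference rather than a fixed property of the method, which is why it is left as a parameter here.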

Overview:

Overlapping protein-coding sequences (CDSs) are particularly common in viruses but also occur in more complex genomes. Detecting such genes with conventional gene-finding algorithms can be difficult for several reasons. Due to the double-coding constraints, overlapping CDSs often display an atypical codon bias. Extending training-set methods, such as HMMs, to overlapping CDSs is made difficult by the several different frames (each requiring its own model) and limited training data. Similarity to known sequences or conservation between species may only point to the existence of one of an overlapping pair. Furthermore, overlapping genes on the same read-strand (e.g. at ribosomal frameshifting sites) may have the same promoter and mRNA, so that looking for promoters or transcription may only identify one of the two genes. Nonetheless overlapping CDSs have their own signatures resulting from the mutational constraints imposed by the requirement of simultaneously maintaining protein function in both genes.

The original MLOGD was a suite of software for analysing the mutation patterns in a multiple sequence alignment and estimating the relative likelihood that a given sequence region is single-coding or double-coding. The mutation model includes a nucleotide mutation matrix, codon usage table and amino acid substitution matrix. The suite also included a Monte Carlo single/double-coding sequence evolution simulator, for determining confidence scores and other statistics (as a function of sequence composition, length, divergence time and the double-coding frame).

The current version of MLOGD has been improved in several respects, and is now much more user-friendly than the original version. There are three running modes. In the 'Test input query CDSs' mode, the user inputs an alignment, annotation of known CDSs in a reference sequence, and the position of a query, or hypothetical, CDS. MLOGD then calculates the likelihood ratio between the null model (only the known CDSs are coding) and the alternate model (both the known CDSs and the query CDS are coding). This may involve combinations of the non-coding, single-coding and double-coding mutation models. In the 'Find and test all non-annotated ORFs' mode, MLOGD will look for all non-annotated ORFs above a given length in the reference sequence, and calculate the above likelihood ratio for each ORF. In the 'Six-frame sliding window plots' mode, MLOGD will calculate the likelihood ratio in sliding windows in all six possible read-frames. Positive regions in the plots may indicate unannotated CDSs.
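
To give a feel for the ORF search step in the 'Find and test all non-annotated ORFs' mode, here is a minimal six-frame ORF scan in Python. It is a sketch under stated assumptions, not the MLOGD code (which is written in C++ and csh): it assumes ORFs run from an ATG to the next in-frame stop codon, uses the standard genetic code stop codons, ignores ORFs that run off the end of the sequence, and does not filter out already-annotated CDSs. The names find_orfs and min_codons are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}          # standard genetic code stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=50):
    """Yield (strand, frame, start, end) for each ATG..stop ORF.

    start and end are 0-based, end-exclusive coordinates on the strand being
    scanned, and end includes the stop codon. An ORF is reported only if it
    has at least min_codons codons, counting the ATG but not the stop.
    (Assumption: ATG-initiated ORFs; the real software may define ORFs
    differently.)
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None                   # look for the next ORF


# Toy reference sequence: one 6-codon ORF in forward frame 2.
toy = "CC" + "ATG" + "GCA" * 5 + "TAA" + "G"
print(list(find_orfs(toy, min_codons=5)))   # [('+', 2, 2, 23)]
```

In the real mode each ORF found this way would then be passed to the likelihood ratio test described above; that step is not reproduced here.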

In addition, lots of pretty graphics are produced (see guided tour for examples). See also Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92 and Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in Virus Genomes, BMC Bioinformatics, 7, 75 for more detail on the algorithms.

Notes:

You must agree to the Terms of Usage before using any of this software.
If you use this software for publications, please cite Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92 or Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in Virus Genomes, BMC Bioinformatics, 7, 75.
Queries or comments to Andrew Firth (aef24@cam.ac.uk).
AEF gratefully acknowledges funding from the Foundation for Research, Science and Technology, grant number UOOX0304.
CMB gratefully acknowledges funding from the NZ Health Research Council.