MLOGD: Programme for detecting overlapping coding sequences

Summary: This is a suite of software for detecting new protein-coding sequences (CDSs) by analysing the pattern of mutations across an input sequence alignment. In particular, the software can be used to detect new CDSs that overlap known CDSs in a different read-frame. Such CDSs can be difficult to detect with standard gene-finding algorithms (see below). The software is particularly useful for analysing virus genome alignments, where overlapping genes and ribosomal frameshifting sites are common.

These programmes are described in Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92 and Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in Virus Genomes, BMC Bioinformatics, 7, 75.

You can enter your sequences into the web interface (recommended) or download the programmes to run locally. The programmes are written in C++ with csh scripts, and the user instructions are for Linux.

Please use the following login details if requested. Note that these will only allow access to public parts of this site. If you get an 'access denied' error then you are probably trying to access a non-public part. Please contact me (aef24@cam.ac.uk). [Image: user name and password]



Caveats (see also Supplementary material):


Overview:

Overlapping protein-coding sequences (CDSs) are particularly common in viruses but also occur in more complex genomes. Detecting such genes with conventional gene-finding algorithms can be difficult for several reasons. Due to the double-coding constraints, overlapping CDSs often display an atypical codon bias. Extending training-set methods, such as HMMs, to overlapping CDSs is difficult because each of the several possible frames requires its own model and training data are limited. Similarity to known sequences, or conservation between species, may only point to the existence of one CDS of an overlapping pair. Furthermore, overlapping genes on the same read-strand (e.g. at ribosomal frameshifting sites) may have the same promoter and mRNA, so that looking for promoters or transcription may only identify one of the two genes. Nonetheless, overlapping CDSs have their own signatures, resulting from the mutational constraints imposed by the requirement of simultaneously maintaining protein function in both genes.

The original MLOGD was a suite of software for analysing the mutation patterns in a multiple sequence alignment and estimating the relative likelihood that a given sequence region is single-coding or double-coding. The mutation model includes a nucleotide mutation matrix, codon usage table and amino acid substitution matrix. The suite also included a Monte Carlo single/double-coding sequence evolution simulator, for determining confidence scores and other statistics (as a function of sequence composition, length, divergence time and the double-coding frame).
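As a rough illustration of what a "relative likelihood" between two models means for a sequence region, the sketch below sums per-site log-likelihood differences. It is not MLOGD's implementation: the function name `log_likelihood_ratio` is hypothetical, the per-site likelihoods are assumed to have been computed already by some mutation model, and the sites are assumed to be independent.

```python
import math


def log_likelihood_ratio(alt_site_likelihoods, null_site_likelihoods):
    """Return ln(L_alt / L_null) for a region.

    Assumption (for illustration only): each list holds the likelihood of
    one alignment site under the given model, and sites are independent,
    so the region likelihood is the product of the site likelihoods.
    """
    if len(alt_site_likelihoods) != len(null_site_likelihoods):
        raise ValueError("both models must be evaluated at the same sites")
    # Summing logs avoids numerical underflow from multiplying many
    # small probabilities together.
    return sum(
        math.log(alt) - math.log(null)
        for alt, null in zip(alt_site_likelihoods, null_site_likelihoods)
    )


# Hypothetical per-site likelihoods for a three-site region.
double_coding = [0.20, 0.40, 0.10]
single_coding = [0.10, 0.20, 0.10]
score = log_likelihood_ratio(double_coding, single_coding)
```

A positive score means the first model explains the observed substitutions better than the second; a negative score favours the second model.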

The current version of MLOGD has been improved in several respects, and is now much more user-friendly than the original version. There are three running modes. In the 'Test input query CDSs' mode, the user inputs an alignment, annotation of known CDSs in a reference sequence, and the position of a query, or hypothetical, CDS. MLOGD then calculates the likelihood ratio between the null model (only the known CDSs are coding) and the alternative model (both the known CDSs and the query CDS are coding). This may involve combinations of the non-coding, single-coding and double-coding mutation models. In the 'Find and test all non-annotated ORFs' mode, MLOGD will look for all non-annotated ORFs above a given length in the reference sequence, and calculate the above likelihood ratio for each ORF. In the 'Six-frame sliding window plots' mode, MLOGD will calculate the likelihood ratio in sliding windows in all six possible read-frames. Positive regions in the plots may indicate unannotated CDSs.
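The ORF search in the second mode can be pictured with the following minimal sketch. It is not the MLOGD code: the function names, the ATG-to-stop definition of an ORF, the standard stop codons, and the coordinate convention (1-based, inclusive, reported on the forward strand, stop codon included) are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons, annotated=()):
    """List ORFs in all six read-frames that are not already annotated.

    An ORF is taken here to run from the first ATG after a stop codon up to
    and including the next in-frame stop codon. min_codons counts codons
    before the stop. Each result is (strand, start, end).
    """
    seq = seq.upper()
    n = len(seq)
    known = set(annotated)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # 0-based index of the current ATG, if any
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        lo, hi = start + 1, i + 3  # 1-based on strand s
                        if strand == "-":
                            # Map back to forward-strand coordinates.
                            lo, hi = n - hi + 1, n - lo + 1
                        if (strand, lo, hi) not in known:
                            orfs.append((strand, lo, hi))
                    start = None
    return orfs


# Toy reference sequence containing one forward-strand ORF.
reference = "CCATGAAATTTGGGTAACC"
candidates = find_orfs(reference, min_codons=3)
```

In this sketch each reported ORF would then be passed to the likelihood-ratio test as a query CDS. Passing the known CDSs as `annotated` excludes exact matches only; a real tool would presumably also handle partially overlapping annotations.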

In addition, lots of pretty graphics are produced (see guided tour for examples). See also Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92 and Firth A. E., Brown C. M., 2006, Detecting Overlapping Coding Sequences in Virus Genomes, BMC Bioinformatics, 7, 75 for more detail on the algorithms.


Notes: