A powerful technique for locating functional elements in genomes is to look for conserved columns in multiple sequence alignments (Stojanovic et al. 1999; plotcon, EMBOSS package - Rice et al. 2000; MultiPipMaker - Schwartz et al. 2003; Margulies et al. 2003; VISTA - Frazer et al. 2004). However, it is difficult to use this method to detect additional functional elements within protein-coding sequences (CDSs), since many columns in CDSs are conserved simply because of constraints on the encoded protein. It is possible to look for conserved columns at four-fold degenerate sites (some, but not all, third nucleotide positions in codons), but this discards information from at least two thirds of the columns and is more or less impossible within overlapping genes (common in viruses). Conserved RNA secondary structures may be found with programmes such as alidot (Hofacker et al. 2002) and RNA-DECODER (Pedersen et al. 2004), while other features may be detected through database similarity searches. However, novel features without significant RNA secondary structure cannot be detected using these methods.
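To make the notion of a four-fold degenerate site concrete, the following minimal sketch lists the third-codon positions of a single in-frame CDS at which any nucleotide encodes the same amino acid. It assumes the standard genetic code (NCBI translation table 1); the function names are hypothetical and are not part of any of the packages cited above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def is_fourfold(codon):
    """True if all four choices of third base give the same amino acid."""
    prefix = codon[:2]
    return len({CODON_TABLE[prefix + b] for b in BASES}) == 1


def fourfold_sites(cds):
    """Return 0-based positions of four-fold degenerate third-codon bases.

    Assumes `cds` starts in frame; codons with ambiguous or gap
    characters are skipped rather than guessed at.
    """
    sites = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        if codon in CODON_TABLE and is_fourfold(codon):
            sites.append(i + 2)
    return sites


if __name__ == "__main__":
    # ATG GCT TTT CTG: only GCT (Ala) and CTG (Leu) are four-fold degenerate.
    print(fourfold_sites("ATGGCTTTTCTG"))
```

In an alignment, a column would normally be treated as four-fold degenerate only if this holds in every sequence, which is one reason so few columns remain usable.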
The software package CDS-plotcon is specifically designed to search for conserved functional elements within CDSs. It uses an average model (12.3) of the expected mutation patterns within CDSs, incorporating a nucleotide mutation matrix, an amino acid substitution matrix, a sequence divergence parameter, a mean synonymous:nonsynonymous substitution ratio and a phylogenetic tree; it can handle up to three overlapping CDSs in different reading frames. Using this model, it calculates the expected number of mutations across the alignment in each column and compares this with the observed number of mutations. The results are plotted along the genome and optionally passed through a sliding-window (clipped) mean filter (6).
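The comparison and smoothing steps can be illustrated with a short sketch. The text does not give the exact statistic or the exact definition of the clipped filter, so two assumptions are made here: conservation is scored as the per-column ratio of observed to expected mutations, and "clipped" means that each value is clamped to a fixed range before averaging, so that a single extreme column cannot dominate a window. All names are hypothetical.

```python
def conservation_ratio(observed, expected):
    """Per-column observed/expected mutation ratio (assumed statistic).

    Values below 1 indicate fewer mutations than the model expects.
    """
    return [o / e for o, e in zip(observed, expected)]


def clipped_running_mean(values, window, clip_low, clip_high, circular=False):
    """Centred running mean of values clamped to [clip_low, clip_high].

    `window` must be odd and no longer than the sequence. For a linear
    genome the window is truncated at the ends; for a circular genome
    it wraps around.
    """
    n = len(values)
    if window % 2 == 0 or window > n:
        raise ValueError("window must be odd and at most len(values)")
    clipped = [min(max(v, clip_low), clip_high) for v in values]
    half = window // 2
    smoothed = []
    for i in range(n):
        if circular:
            idx = [(i + k) % n for k in range(-half, half + 1)]
        else:
            idx = range(max(0, i - half), min(n, i + half + 1))
        window_vals = [clipped[j] for j in idx]
        smoothed.append(sum(window_vals) / len(window_vals))
    return smoothed


if __name__ == "__main__":
    observed = [0, 1, 12, 0, 1]            # illustrative counts only
    expected = [1.0, 1.0, 1.5, 1.0, 1.0]   # illustrative model output only
    ratio = conservation_ratio(observed, expected)
    print(clipped_running_mean(ratio, window=3, clip_low=0.0, clip_high=4.0))
```

Clamping before averaging is only one plausible reading of a clipped mean; trimming the most extreme values in each window would be another.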
Particularly conserved regions may indicate non-coding functional elements, new coding ORFs, or more-conserved regions within proteins (e.g. motifs). The software also produces conservation plots for four-fold degenerate sites, which may be used to help distinguish between these alternatives. CDS-plotcon could also be used in conjunction with complementary programmes (e.g. RNA structure prediction programmes).
As well as running the core conservation-calculating programme, the master script run_mlrgd also aligns the input sequences, extracts CDS locations from GENBANK-format files or user-supplied files, calculates a phylogenetic tree, and produces the plots. In run_mlrgd, the user may alter many parameters, including those controlling model fitting, the running mean window sizes and clipping levels, whether the genome is circular or not, and the sequence range to analyse (8).
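As an illustration of what extracting CDS locations from a GENBANK-format file involves, the sketch below pulls CDS spans out of the feature table with regular expressions. It is not the parser used by run_mlrgd; it is a simplified example that assumes each location fits on one line, and the function name is hypothetical. A full parser such as Biopython's handles multi-line and more complex locations.

```python
import re

# A CDS feature line: five spaces, the key "CDS", then the location string.
CDS_LINE = re.compile(r"^ {5}CDS\s+(\S+)", re.MULTILINE)
# A single span such as 266..13468, optionally marked partial with < or >.
SPAN = re.compile(r"<?(\d+)\.\.>?(\d+)")


def cds_locations(genbank_text):
    """Return a list of (strand, [(start, end), ...]) with 1-based coordinates.

    Simplification: a CDS is reported on the minus strand if the word
    "complement" appears anywhere in its location string.
    """
    result = []
    for match in CDS_LINE.finditer(genbank_text):
        location = match.group(1)
        strand = "-" if "complement" in location else "+"
        spans = [(int(a), int(b)) for a, b in SPAN.findall(location)]
        result.append((strand, spans))
    return result


if __name__ == "__main__":
    # Made-up feature table fragment, for illustration only.
    record = (
        "FEATURES             Location/Qualifiers\n"
        "     source          1..5000\n"
        "     CDS             join(100..1200,1200..2500)\n"
        '                     /gene="example1"\n'
        "     CDS             complement(2600..3100)\n"
    )
    for strand, spans in cds_locations(record):
        print(strand, spans)
```

A CDS made of several spans, as in the first feature of the example, is the typical annotation for a ribosomal frameshift, which is one way overlapping reading frames arise in virus genomes.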
The package is particularly useful for analysing virus genomes, where CDSs (sometimes several) that overlap conserved non-coding features are common and where many sequenced genomes with a reasonable range of divergences are often available. In general, a set of viral genomes may be downloaded in GENBANK format from the NCBI website and fed straight into the package with minimal user input.