MLOGD: Notes

Notes on nucleotide, codon, amino acid and genetic code matrices:

The nucleotide, codon, amino acid and genetic code matrices used by the software are contained in the files nuc.dat, codon.dat, aa.dat and aa2codon.dat, respectively, (follow links to view default files). These are described in more detail below. See also the Supplementary Material (pdf file 1.0MB, ps file 0.9MB) for more details on the mutation model.

Nucleotide matrix

The programmes use a general reversible model (Yang 1994) for the nucleotide matrix Q, i.e. Q =

-	a.π_C	b.π_A	c.π_G
a.π_U	-	d.π_A	e.π_G
b.π_U	d.π_C	-	f.π_G
c.π_U	e.π_C	f.π_A	-

where a, b, c, d, e, f are rate parameters, the π_j are the equilibrium nucleotide frequencies, the order of the rows and columns is U, C, A, G, and the diagonals are set so that

q_ii = - ∑_{j ≠ i} q_ij.

The programmes use as default a κ = 3 Kimura (1980) model, i.e. π_U = π_C = π_A = π_G and a = f = 3b = 3c = 3d = 3e. Before using Q as input to the programmes, it needs to be diagonalized. See the file nuc.dat for an example. This file contains four matrices. The first is Q (normalized so that the row sums are zero and - ∑_i π_i.q_ii = 1). The next three matrices are U^-1, D and U, where D is diagonal and Q = U^-1 D U. In general, if you have the a, b, c, d, e, f parameters, you can make a Q matrix that conserves relative nucleotide frequencies for a particular gene or species by multiplying these values by the π_j values for that gene or species.

Codon usage matrix

In theory, the codon usage tables (CUTs) used by the software are relative frequencies, multiplied by the degeneracy of the corresponding amino acid (e.g. the score for AUG (met) is multiplied by 1 while the score for CUG (leu) is multiplied by 6), and divided by the π_j for the particular species or alignment (e.g. the score for CUG is divided by π_C.π_U.π_G). The best way to combine codon usage frequencies with amino acid matrices and nucleotide matrices is not very clear: our procedure does not, for example, properly take into account the hypermutability of GC dinucleotides. You may prefer to use a null CUT (i.e. equal codon frequencies), which is in fact the default matrix (see Firth & Brown 2005 for justification).

The EMBOSS (Rice et al. 2000) format for CUTs comprises five columns:

3-letter codon code,
1-letter amino acid code,
relative frequencies (sum to 1 for each amino acid),
absolute frequencies (total for all codons sums to 1000),
raw codon counts.

The following simple commands can be used to produce new CUTs. E.g. for the EMBOSS CUT, Ehum.cut:

Sort the CUT into the order and code used by the programmes:

awk '{print $1,$3,$4}' Ehum.cut | sed 's/T/a/g' | sed 's/U/a/g' | \
  sed 's/C/b/g' | sed 's/A/c/g' | sed 's/G/d/g' | sort | sed 's/a/U/g' | \
  sed 's/b/C/g' | sed 's/c/A/g' | sed 's/d/G/g' > Ehum2.cut

Null CUT:

awk '{print $1,1./64.}' Ehum2.cut > null.cut

Absolute CUT:

awk '{print $1,$3/1000.}' Ehum2.cut > abs.cut

Relative CUT:

awk '{print $1,$2}' Ehum2.cut > rel.cut

Relative CUT scaled by degeneracy for each amino acid:

csh
touch rel2.cut
sed 's/\*/Z/' aa2codon.dat > temp0
awk '{print $1}' temp0 > temp2
foreach i (`awk '{print $1}' rel.cut`)
  set aa = `grep $i temp0 | awk '{print $1}'`
  set freq = `grep $i rel.cut | awk '{print $2}'`
  set degen = `grep $aa temp2 | wc | awk '{print $1}'`
  echo `echo $i $freq $degen | awk '{print $1,$2*$3}'` >> rel2.cut
end
exit

CUT divided by π_i.π_j.π_k (in this example, π_U = 0.22, π_C = 0.26, π_A = 0.26, π_G = 0.26):

cat rel2.cut | sed 's/U/0.22 /g' | sed 's/C/0.26 /g' | sed 's/A/0.26 /g' | \
  sed 's/G/0.26 /g' | awk '{print $4/$1/$2/$3/64.}' > temp1
paste rel2.cut temp1 | awk '{print $1,$3}' > rel3.cut

Then copy the relevant *.cut file to codon.dat.

Alternatively you can get CUTs from www.kazusa.or.jp/codon/index.html. The former CUTs are in a rectangular array format. For each amino acid, the CUT gives the 3-letter codon code, the absolute frequencies (total for all codons sums to 1000) and the raw codon counts. You can use the following commands to obtain abs.cut and rel.cut, and then the above commands to obtain rel2.cut and rel3.cut.

cat kazusa.cut | sed 's/(/ /g' | sed 's/)/ /g' | \
  awk '{printf "%s %s\n%s %s\n%s %s\n%s %s\n",$1,$2,$4,$5,$7,$8,$10,$11}' | \
  grep "[UCAGTucagt]" | \
  sed 's/T/a/g' | sed 's/U/a/g' | sed 's/C/b/g' | sed 's/A/c/g' | \
  sed 's/G/d/g' | sort | \
  sed 's/a/U/g' | sed 's/b/C/g' | sed 's/c/A/g' | sed 's/d/G/g' | \
  awk '{print $1,$2/1000.}' > abs.cut

csh
touch rel.cut
sed 's/\*/Z/' aa2codon.dat > temp0
paste temp0 abs.cut | awk '{print $1,$5}' > temp2
foreach i (`awk '{print $1}' abs.cut`)
  set aa = `grep $i temp0 | awk '{print $1}'`
  set absfreq = `grep $i abs.cut | awk '{print $2}'`
  set aaabsfreq = `grep $aa temp2 | awk '{sum+=$2}END{print sum}'`
  echo `echo $i $absfreq $aaabsfreq | awk '{print $1,$2/$3}'` >> rel.cut
end
rm temp0 temp2
exit

Note that the CUT normalization is unimportant.

Amino acid matrix

The default amino acid matrix is derived from the BLOSUM40 matrix. The BLOSUM40 matrix is for very divergent sequences, so it maximizes the amino acid acceptability component relative to the nucleotide mutation probability component. This is best for our software because we deal with the latter separately with the nucleotide matrix. If you want to make amino acid matrices using different BLOSUM matrices (which you can ftp from ncbi.nlm.nih.gov) then the programme make_blosum.cxx will help. I used the frequencies matrices in the blosum??.out files. Cut off all the headers so that the first line contains the amino acid one-letter codes (A, R, N etc.) and the next 20 lines are a lower triangular matrix of frequency values. make_blosum.cxx converts these values to q_ij/e_ij values (see Henikoff & Henikoff 1992 for more details).

g++ -o make_blosum make_blosum.cxx
make_blosum inputmatrix aa.dat

Note that the normalization of the amino acids matrix is unimportant.

Genetic code table

The table aa2codon.dat is used to convert the amino acid one-letter codes in the amino acid matrix aa.dat into codons. The programmes only read the first column of aa2codon.dat, so they rely on you maintaining the codon order given in the second column. You can edit the first column of aa2codon.dat to use a different genetic code.

References

Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92.
Henikoff S., Henikoff J. G., 1992, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci. USA, 89, 10915-10919.
Kimura M., 1980, A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences, J. Mol. Evol., 16, 111-120.
Rice P., Longden I., Bleasby A., 2000, EMBOSS: the European molecular biology open software suite, Trends Genet., 16, 276-277.
Yang Z., 1994, Estimating the pattern of nucleotide substitution, J. Mol. Evol., 39, 105-111.