Notes on nucleotide, codon, amino acid and genetic code matrices:

The nucleotide, codon, amino acid and genetic code matrices used by the software are contained in the files nuc.dat, codon.dat, aa.dat and aa2codon.dat, respectively, (follow links to view default files). These are described in more detail below. See also the Supplementary Material (pdf file 1.0MB, ps file 0.9MB) for more details on the mutation model.


Nucleotide matrix

The programmes use a general reversible model (Yang 1994) for the nucleotide matrix Q, i.e. Q =

- a.πC b.πA c.πG
a.πU - d.πA e.πG
b.πU d.πC - f.πG
c.πU e.πC f.πA -

where a, b, c, d, e, f are rate parameters, the πj are the equilibrium nucleotide frequencies, the order of the rows and columns is U, C, A, G, and the diagonals are set so that

qii = - ∑j ≠ i qij.

The programmes use as default a κ = 3 Kimura (1980) model, i.e. πU = πC = πA = πG and a = f = 3b = 3c = 3d = 3e. Before using Q as input to the programmes, it needs to be diagonalized. See the file nuc.dat for an example. This file contains four matrices. The first is Q (normalized so that the row sums are zero and - ∑i πi.qii = 1). The next three matrices are U-1, D and U, where D is diagonal and Q = U-1 D U. In general, if you have the a, b, c, d, e, f parameters, you can make a Q matrix that conserves relative nucleotide frequencies for a particular gene or species by multiplying these values by the πj values for that gene or species.


Codon usage matrix

In theory, the codon usage tables (CUTs) used by the software are relative frequencies, multiplied by the degeneracy of the corresponding amino acid (e.g. the score for AUG (met) is multiplied by 1 while the score for CUG (leu) is multiplied by 6), and divided by the πj for the particular species or alignment (e.g. the score for CUG is divided by πCUG). The best way to combine codon usage frequencies with amino acid matrices and nucleotide matrices is not very clear: our procedure does not, for example, properly take into account the hypermutability of GC dinucleotides. You may prefer to use a null CUT (i.e. equal codon frequencies), which is in fact the default matrix (see Firth & Brown 2005 for justification).

The EMBOSS (Rice et al. 2000) format for CUTs comprises five columns:
  1. 3-letter codon code,
  2. 1-letter amino acid code,
  3. relative frequencies (sum to 1 for each amino acid),
  4. absolute frequencies (total for all codons sums to 1000),
  5. raw codon counts.
The following simple commands can be used to produce new CUTs. E.g. for the EMBOSS CUT, Ehum.cut:

Sort the CUT into the order and code used by the programmes:
awk '{print $1,$3,$4}' Ehum.cut | sed 's/T/a/g' | sed 's/U/a/g' | \
  sed 's/C/b/g' | sed 's/A/c/g' | sed 's/G/d/g' | sort | sed 's/a/U/g' | \
  sed 's/b/C/g' | sed 's/c/A/g' | sed 's/d/G/g' > Ehum2.cut

Null CUT:
awk '{print $1,1./64.}' Ehum2.cut > null.cut

Absolute CUT:
awk '{print $1,$3/1000.}' Ehum2.cut > abs.cut

Relative CUT:
awk '{print $1,$2}' Ehum2.cut > rel.cut

Relative CUT scaled by degeneracy for each amino acid:
csh
touch rel2.cut
sed 's/\*/Z/' aa2codon.dat > temp0
awk '{print $1}' temp0 > temp2
foreach i (`awk '{print $1}' rel.cut`)
  set aa = `grep $i temp0 | awk '{print $1}'`
  set freq = `grep $i rel.cut | awk '{print $2}'`
  set degen = `grep $aa temp2 | wc | awk '{print $1}'`
  echo `echo $i $freq $degen | awk '{print $1,$2*$3}'` >> rel2.cut
end
exit

CUT divided by πijk (in this example, πU = 0.22, πC = 0.26, πA = 0.26, πG = 0.26):
cat rel2.cut | sed 's/U/0.22 /g' | sed 's/C/0.26 /g' | sed 's/A/0.26 /g' | \
  sed 's/G/0.26 /g' | awk '{print $4/$1/$2/$3/64.}' > temp1
paste rel2.cut temp1 | awk '{print $1,$3}' > rel3.cut

Then copy the relevant *.cut file to codon.dat.

Alternatively you can get CUTs from www.kazusa.or.jp/codon/index.html. The former CUTs are in a rectangular array format. For each amino acid, the CUT gives the 3-letter codon code, the absolute frequencies (total for all codons sums to 1000) and the raw codon counts. You can use the following commands to obtain abs.cut and rel.cut, and then the above commands to obtain rel2.cut and rel3.cut.
cat kazusa.cut | sed 's/(/ /g' | sed 's/)/ /g' | \
  awk '{printf "%s %s\n%s %s\n%s %s\n%s %s\n",$1,$2,$4,$5,$7,$8,$10,$11}' | \
  grep "[UCAGTucagt]" | \
  sed 's/T/a/g' | sed 's/U/a/g' | sed 's/C/b/g' | sed 's/A/c/g' | \
  sed 's/G/d/g' | sort | \
  sed 's/a/U/g' | sed 's/b/C/g' | sed 's/c/A/g' | sed 's/d/G/g' | \
  awk '{print $1,$2/1000.}' > abs.cut

csh
touch rel.cut
sed 's/\*/Z/' aa2codon.dat > temp0
paste temp0 abs.cut | awk '{print $1,$5}' > temp2
foreach i (`awk '{print $1}' abs.cut`)
  set aa = `grep $i temp0 | awk '{print $1}'`
  set absfreq = `grep $i abs.cut | awk '{print $2}'`
  set aaabsfreq = `grep $aa temp2 | awk '{sum+=$2}END{print sum}'`
  echo `echo $i $absfreq $aaabsfreq | awk '{print $1,$2/$3}'` >> rel.cut
end
rm temp0 temp2
exit

Note that the CUT normalization is unimportant.


Amino acid matrix

The default amino acid matrix is derived from the BLOSUM40 matrix. The BLOSUM40 matrix is for very divergent sequences, so it maximizes the amino acid acceptability component relative to the nucleotide mutation probability component. This is best for our software because we deal with the latter separately with the nucleotide matrix. If you want to make amino acid matrices using different BLOSUM matrices (which you can ftp from ncbi.nlm.nih.gov) then the programme make_blosum.cxx will help. I used the frequencies matrices in the blosum??.out files. Cut off all the headers so that the first line contains the amino acid one-letter codes (A, R, N etc.) and the next 20 lines are a lower triangular matrix of frequency values. make_blosum.cxx converts these values to qij/eij values (see Henikoff & Henikoff 1992 for more details).
g++ -o make_blosum make_blosum.cxx
make_blosum inputmatrix aa.dat
Note that the normalization of the amino acids matrix is unimportant.


Genetic code table

The table aa2codon.dat is used to convert the amino acid one-letter codes in the amino acid matrix aa.dat into codons. The programmes only read the first column of aa2codon.dat, so they rely on you maintaining the codon order given in the second column. You can edit the first column of aa2codon.dat to use a different genetic code.


References