Notes on nucleotide, codon, amino acid and
genetic code matrices:
The nucleotide, codon, amino acid and genetic code matrices used by
the software are contained in the files nuc.dat, codon.dat, aa.dat and aa2codon.dat,
respectively, (follow links to view default files). These are
described in more detail below. See also the Supplementary Material
(pdf file 1.0MB, ps file
0.9MB) for more details on the mutation model.
Nucleotide matrix
The programmes use a general reversible model (Yang 1994) for the
nucleotide matrix Q, i.e. Q =
- |
a.πC |
b.πA |
c.πG |
a.πU |
- |
d.πA |
e.πG |
b.πU |
d.πC |
- |
f.πG |
c.πU |
e.πC |
f.πA |
- |
where a, b, c, d, e, f are
rate parameters, the πj are the equilibrium nucleotide
frequencies, the order of the rows and columns is U, C, A, G, and the
diagonals are set so that
qii = - ∑j ≠ i qij.
The programmes use as default a κ = 3 Kimura (1980) model,
i.e. πU = πC = πA =
πG and a = f = 3b = 3c =
3d = 3e. Before using Q as input to the
programmes, it needs to be diagonalized. See the file nuc.dat for an example.
This file contains four matrices. The first is Q (normalized
so that the row sums are zero and - ∑i
πi.qii = 1). The next three matrices
are U-1, D and
U, where D is diagonal and Q =
U-1 D U. In general, if you have the
a, b, c, d, e, f parameters,
you can make a Q matrix that conserves relative nucleotide
frequencies for a particular gene or species by multiplying these
values by the πj values for that gene or
species.
Codon usage matrix
In theory, the codon usage tables (CUTs) used by the software are
relative frequencies, multiplied by the degeneracy of the
corresponding amino acid (e.g. the score for AUG (met) is
multiplied by 1 while the score for CUG (leu) is multiplied by
6), and divided by the πj for the particular species or
alignment (e.g. the score for CUG is divided by
πC.πU.πG). The best way to
combine codon usage frequencies with amino acid matrices and
nucleotide matrices is not very clear: our procedure does not, for
example, properly take into account the hypermutability of GC
dinucleotides. You may prefer to use a null CUT (i.e. equal codon
frequencies), which is in fact the default matrix (see Firth & Brown
2005 for justification).
The EMBOSS (Rice et al. 2000) format for CUTs comprises five columns:
- 3-letter codon code,
- 1-letter amino acid code,
- relative frequencies (sum to 1 for each amino acid),
- absolute frequencies (total for all codons sums to 1000),
- raw codon counts.
The following simple commands can be used to produce new CUTs. E.g.
for the EMBOSS CUT, Ehum.cut:
Sort the CUT into the order and code used by the programmes:
awk '{print $1,$3,$4}' Ehum.cut | sed 's/T/a/g' | sed 's/U/a/g' | \
sed 's/C/b/g' | sed 's/A/c/g' | sed 's/G/d/g' | sort | sed 's/a/U/g' | \
sed 's/b/C/g' | sed 's/c/A/g' | sed 's/d/G/g' > Ehum2.cut
Null CUT:
awk '{print $1,1./64.}' Ehum2.cut > null.cut
Absolute CUT:
awk '{print $1,$3/1000.}' Ehum2.cut > abs.cut
Relative CUT:
awk '{print $1,$2}' Ehum2.cut > rel.cut
Relative CUT scaled by degeneracy for each amino acid:
csh
touch rel2.cut
sed 's/\*/Z/' aa2codon.dat > temp0
awk '{print $1}' temp0 > temp2
foreach i (`awk '{print $1}' rel.cut`)
set aa = `grep $i temp0 | awk '{print $1}'`
set freq = `grep $i rel.cut | awk '{print $2}'`
set degen = `grep $aa temp2 | wc | awk '{print $1}'`
echo `echo $i $freq $degen | awk '{print $1,$2*$3}'` >> rel2.cut
end
exit
CUT divided by πi.πj.πk (in
this example, πU = 0.22, πC = 0.26,
πA = 0.26, πG = 0.26):
cat rel2.cut | sed 's/U/0.22 /g' | sed 's/C/0.26 /g' | sed 's/A/0.26 /g' | \
sed 's/G/0.26 /g' | awk '{print $4/$1/$2/$3/64.}' > temp1
paste rel2.cut temp1 | awk '{print $1,$3}' > rel3.cut
Then copy the relevant *.cut file to codon.dat.
Alternatively you can get CUTs from www.kazusa.or.jp/codon/index.html. The
former CUTs are in a rectangular array format. For each amino acid,
the CUT gives the 3-letter codon code, the absolute frequencies (total
for all codons sums to 1000) and the raw codon counts. You can use
the following commands to obtain abs.cut and
rel.cut, and then the above commands to obtain
rel2.cut and rel3.cut.
cat kazusa.cut | sed 's/(/ /g' | sed 's/)/ /g' | \
awk '{printf "%s %s\n%s %s\n%s %s\n%s %s\n",$1,$2,$4,$5,$7,$8,$10,$11}' | \
grep "[UCAGTucagt]" | \
sed 's/T/a/g' | sed 's/U/a/g' | sed 's/C/b/g' | sed 's/A/c/g' | \
sed 's/G/d/g' | sort | \
sed 's/a/U/g' | sed 's/b/C/g' | sed 's/c/A/g' | sed 's/d/G/g' | \
awk '{print $1,$2/1000.}' > abs.cut
csh
touch rel.cut
sed 's/\*/Z/' aa2codon.dat > temp0
paste temp0 abs.cut | awk '{print $1,$5}' > temp2
foreach i (`awk '{print $1}' abs.cut`)
set aa = `grep $i temp0 | awk '{print $1}'`
set absfreq = `grep $i abs.cut | awk '{print $2}'`
set aaabsfreq = `grep $aa temp2 | awk '{sum+=$2}END{print sum}'`
echo `echo $i $absfreq $aaabsfreq | awk '{print $1,$2/$3}'` >> rel.cut
end
rm temp0 temp2
exit
Note that the CUT normalization is unimportant.
Amino acid matrix
The default amino acid matrix is derived from the BLOSUM40 matrix.
The BLOSUM40 matrix is for very divergent sequences, so it maximizes
the amino acid acceptability component relative to the nucleotide
mutation probability component. This is best for our software because
we deal with the latter separately with the nucleotide matrix. If you
want to make amino acid matrices using different BLOSUM matrices
(which you can ftp from ncbi.nlm.nih.gov) then the programme
make_blosum.cxx will help.
I used the frequencies matrices in the blosum??.out files.
Cut off all the headers so that the first line contains the amino acid
one-letter codes (A, R, N etc.) and the next 20 lines are a lower
triangular matrix of frequency values. make_blosum.cxx
converts these values to qij/eij
values (see Henikoff & Henikoff 1992 for more details).
g++ -o make_blosum make_blosum.cxx
make_blosum inputmatrix aa.dat
Note that the normalization of the amino acids matrix is
unimportant.
Genetic code table
The table aa2codon.dat is used to convert the amino acid
one-letter codes in the amino acid matrix aa.dat into codons.
The programmes only read the first column of aa2codon.dat, so
they rely on you maintaining the codon order given in the second
column. You can edit the first column of aa2codon.dat to use a
different genetic code.
References
- Firth A. E., Brown C. M., 2005, Detecting overlapping coding
sequences with pairwise alignments, Bioinformatics,
21, 282-92.
- Henikoff S., Henikoff J. G., 1992, Amino acid substitution
matrices from protein blocks, Proc. Natl. Acad. Sci. USA, 89,
10915-10919.
- Kimura M., 1980, A simple method for estimating evolutionary
rates of base substitutions through comparative studies of
nucleotide sequences, J. Mol. Evol., 16, 111-120.
- Rice P., Longden I., Bleasby A., 2000, EMBOSS: the European
molecular biology open software suite, Trends Genet., 16,
276-277.
- Yang Z., 1994, Estimating the pattern of nucleotide
substitution, J. Mol. Evol., 39, 105-111.