MLOGD: Notes

Note on calculation of likelihood ratios using Monte Carlo simulations:

The statistics in columns 4-9 of the table at the bottom of the Monte Carlo simulations results page are determined for each pair of reference and non-reference sequences in the input alignment. We wish to use these statistics to determine which of the null model or alternate model CDSs provides a better model for the pattern of mutations observed in the input alignment.

Column 4 (the MLOGD likelihood ratio score) may be used directly as such a classifier: values less than zero support the null model, while values greater than zero support the alternate model. We recommend using this score. However, in Firth & Brown, 2005, Bioinformatics, 21, 282-92, we also investigated using the other scores: 1st/2nd/3rd codon position mutation fractions and synonymous/nonsynonymous codon mutation fractions. In order to interpret these scores, we resort to simulations to find out what range of values we would expect to observe for each of the null and alternate models.

Ideally one would estimate the probability of obtaining each observed score, under each of the null and alternate models, directly from the simulations. In practice, it would require a very large number of simulations to estimate the near-zero probabilities that occur whenever the observed score is incompatible with one or other of the models (e.g. if the alternate model is actually correct, then estimating the probability of the observed score occurring under the null model might be very difficult). Instead, we assume a normal probability distribution and use the simulations to estimate the distribution parameters (mean and standard deviation). Using these, we can determine the probabilities P(x | null model) and P(x | alternate model), for each score x, and hence find the null versus alternate model log likelihood ratio. These are the scores in columns 10-15. Values less than zero support the null model, while values greater than zero support the alternate model.
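The normal-approximation step described above can be sketched in Python as follows. This is a minimal illustration, not the MLOGD implementation: the function names, the use of natural logarithms, the use of the sample standard deviation, and the example numbers are all assumptions made for the sketch. The sign convention follows the text (values greater than zero support the alternate model).

```python
import math
import statistics


def normal_log_pdf(x, mean, sd):
    """Natural-log density of a normal distribution N(mean, sd^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def fit_normal(samples):
    """Estimate the mean and sample standard deviation from simulated scores."""
    return statistics.mean(samples), statistics.stdev(samples)


def log_likelihood_ratio(x, null_sims, alt_sims):
    """Return log P(x | alternate) - log P(x | null) under fitted normals.

    Positive values favour the alternate model, negative values the null model.
    """
    null_mean, null_sd = fit_normal(null_sims)
    alt_mean, alt_sd = fit_normal(alt_sims)
    return normal_log_pdf(x, alt_mean, alt_sd) - normal_log_pdf(x, null_mean, null_sd)


# Hypothetical simulated values of one statistic (illustrative numbers only).
null_sims = [0.30, 0.32, 0.34, 0.36, 0.38]
alt_sims = [0.10, 0.12, 0.14, 0.16, 0.18]

observed = 0.13  # hypothetical observed score
llr = log_likelihood_ratio(observed, null_sims, alt_sims)
print("supports alternate model" if llr > 0 else "supports null model")
```

Working with log densities rather than the densities themselves avoids numerical underflow when the observed score lies far out in the tail of one of the fitted distributions.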

This classification scheme is simple to implement but ignores deviations from normal distributions (e.g. columns 5-9 are constrained to be non-negative), so columns 10-15 are not strictly speaking likelihood ratios, but may still be used as classifiers. Note that, even if they were true likelihood ratios, converting the likelihood ratios to probabilities would still require Bayesian prior probabilities on the null and alternate models.
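To illustrate why prior probabilities are needed, the sketch below converts a log likelihood ratio into a posterior probability for the alternate model using Bayes' theorem. It assumes a true natural-log likelihood ratio with the sign convention used above (positive values favour the alternate model); the function name and the prior values are hypothetical, and the scores in columns 10-15 satisfy the assumption only approximately.

```python
import math


def posterior_alternate(llr, prior_alt):
    """Posterior probability of the alternate model.

    llr       -- natural-log likelihood ratio, log[P(x | alt) / P(x | null)]
    prior_alt -- prior probability of the alternate model (strictly between 0 and 1)
    """
    if not 0.0 < prior_alt < 1.0:
        raise ValueError("prior_alt must lie strictly between 0 and 1")
    # Posterior log odds = log likelihood ratio + prior log odds.
    log_odds = llr + math.log(prior_alt / (1.0 - prior_alt))
    # Numerically stable logistic function (avoids overflow in exp).
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# The same score gives different posteriors under different (hypothetical) priors.
print(posterior_alternate(2.0, 0.5))
print(posterior_alternate(2.0, 0.01))
```

With equal priors a score of zero corresponds to a posterior of 0.5; with a small prior on the alternate model, a much larger positive score is needed to reach the same posterior.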

You may like to combine the 1st/2nd/3rd codon position mutation fraction scores into one score (i.e. columns 10 + 11 + 12), and similarly for the synonymous/nonsynonymous codon mutation fraction scores (i.e. columns 13 + 14). Be aware, however, that non-independence of these scores will compromise the statistics. In particular, the 3rd codon position score is completely determined if you know the 1st and 2nd codon position scores, so you may like to use (2/3) * (columns 10 + 11 + 12) instead.
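The combinations suggested above amount to simple sums, as in the following sketch. The function names and example values are hypothetical; the column assignments are taken from the text, and the 2/3 factor is the rescaling suggested above for the codon position scores.

```python
def combined_codon_position_score(col10, col11, col12, scale=2.0 / 3.0):
    """Sum of the three codon position scores, rescaled by 2/3 by default
    because only two of the three scores are independent."""
    return scale * (col10 + col11 + col12)


def combined_syn_nonsyn_score(col13, col14):
    """Sum of the synonymous and nonsynonymous scores."""
    return col13 + col14


# Hypothetical scores for one sequence pair (illustrative numbers only).
codon_score = combined_codon_position_score(1.5, -0.3, 0.6)
syn_nonsyn_score = combined_syn_nonsyn_score(0.8, 0.4)
print(codon_score, syn_nonsyn_score)
```

Pass scale=1.0 to obtain the plain sum of columns 10 + 11 + 12.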

Full details of the above are given in Firth A. E., Brown C. M., 2005, Detecting overlapping coding sequences with pairwise alignments, Bioinformatics, 21, 282-92, and in Section 2.5 of the Supplementary Material (pdf file, 1.0MB; ps file, 0.9MB).