You must read TermsOfUsage before installing or using any of this software.

###########################################################################
# The software is Copyright (C) 2005 Andrew E Firth, University of Otago, #
# Dunedin, New Zealand, aef(at)sanger.otago.ac.nz                         #
#                                                                         #
# The software is free software; you can redistribute it and/or modify    #
# it under the terms of the GNU General Public License (version 2) as     #
# published by the Free Software Foundation.                              #
#                                                                         #
# The software is distributed in the hope that it will be useful,         #
# but WITHOUT ANY WARRANTY; without even the implied warranty of          #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the           #
# GNU General Public License for more details.                            #
#                                                                         #
# You should have received a copy of the GNU General Public License       #
# along with this program; if not, write to the Free Software             #
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA           #
# 02110-1301, USA.                                                        #
###########################################################################

Note that the software was written for a web server, so a lot of the text 
  output is in HTML.  The easiest way to view the output is by opening it in
  a standard web browser window.  This will also allow hypertext linking to
  the help pages describing the output etc.  Note that, because the help pages
  are lifted directly from the web server, a few of the comments and links 
  referring back to the web server will be broken.

The software consists of C++ programmes and C-shell scripts, and should run under 
  any LINUX/UNIX-like environment (including MacOS-X X11 - but see below). 
  You will need a C++ compiler and the ability to run C-shell (csh) scripts.  

-------------------------------------------------------------------------------

INSTALLATION:

First unzip and unpack the distribution file:

gunzip mlogd.tar.gz 
tar xvf mlogd.tar 
cd MLOGD


This will give you the files:

calcprob.cxx   domlogd         mlogd.cxx     mlogd.cxx_mac   redoMLEnuc  
doallorfs      domlogd_mac     ntadjust.cxx  prepmlogd       redosixframe
doallorfs_mac  dosixframe      mcsim.cxx     prepmlogd_mac   runmean2.cxx
domcsims       dosixframe_mac  minmax.cxx    README          TermsOfUsage

and the sub-directories:

EXAMPLE       FORM        SCRIPTS


If you are using MacOS-X X11 then

  cp doallorfs_mac doallorfs 
  cp domlogd_mac domlogd
  cp dosixframe_mac dosixframe
  cp prepmlogd_mac prepmlogd
  cp mlogd.cxx_mac mlogd.cxx


Now compile the C++ programmes (replace 'g++' by an appropriate alternative, 
  e.g. 'c++' or 'clang++', if you're using a different C++ compiler):

g++ -o calcprob calcprob.cxx
g++ -o mcsim mcsim.cxx
g++ -o minmax minmax.cxx
g++ -o mlogd mlogd.cxx
g++ -o ntadjust ntadjust.cxx
g++ -o runmean2 runmean2.cxx


Then copy these programmes to your ~/bin directory:

mkdir -p ~/bin
cp calcprob ~/bin
cp mcsim    ~/bin
cp minmax   ~/bin
cp mlogd    ~/bin
cp ntadjust ~/bin
cp runmean2 ~/bin


Now make sure that doallorfs, domcsims, domlogd, dosixframe, prepmlogd, 
  redoMLEnuc and redosixframe are executable:

chmod u+x doallorfs domcsims domlogd dosixframe prepmlogd redoMLEnuc redosixframe

And copy these to your ~/bin directory:

cp doallorfs    ~/bin
cp domcsims     ~/bin
cp domlogd      ~/bin
cp dosixframe   ~/bin
cp prepmlogd    ~/bin
cp redoMLEnuc   ~/bin
cp redosixframe ~/bin
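
If the scripts find the compiled programmes through your PATH (an assumption; 
  this README does not say how they are located), then ~/bin needs to be listed
  there.  A minimal sketch of a check, written for sh-compatible shells; the 
  function name 'path_contains' is only illustrative:

```shell
#!/bin/sh
# Report whether a directory is listed in PATH (exact match on one entry).
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if path_contains "$HOME/bin"; then
    echo "$HOME/bin is on PATH"
else
    echo "$HOME/bin is not on PATH; add it in your shell start-up file"
fi
```

In csh the directory can be added with 'set path = ($path ~/bin)'; in sh or
  bash with 'export PATH="$PATH:$HOME/bin"'.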


OTHER REQUIRED SOFTWARE:

You will also need to download and install the following software if you don't 
already have it:

EMBOSS       The following programmes from the EMBOSS package:
               seqret, degapseq, infoseq, noreturn, getorf
             Obtain from http://emboss.sourceforge.net/
 
R            Statistics and graphics package.
             Obtain from http://www.r-project.org/
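
To see whether the required programmes are already installed, something like 
the following sketch may help.  It only tests whether each command can be found
on your PATH; the function name 'missing_tools' is illustrative and not part of
MLOGD.

```shell
#!/bin/sh
# Print the name of every listed command that cannot be found on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return $missing
}

# The EMBOSS programmes and R named in this section.
missing_tools seqret degapseq infoseq noreturn getorf R \
    || echo "(the programmes listed above were not found)"
```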


The following programme is recommended for doing your sequence alignments:

code2aln     Obtain from http://www.tbi.univie.ac.at/~roman/Code2aln/

  or, alternatively, the updated version,
codaln       Obtain from http://www.bioinf.uni-leipzig.de/Software/codaln/


BRIEF SUMMARY OF MLOGD SCRIPTS:

prepmlogd:    Prepares input files for all scripts.
domlogd:      Calculates coding (alternate model) versus non-coding (null 
              model) likelihood statistics and plots, for a given query  
              (i.e. hypothetical or potential) CDS.
doallorfs:    Finds all ORFs in reference sequence and runs 'domlogd' on each. 
dosixframe:   Runs 'domlogd' in a sliding window in all six read-frames and 
                plots scores.
domcsims:     Runs Monte Carlo sequence evolution simulations to put error bars
                on the statistics calculated by 'domlogd'.
redoMLEnuc:   Redraws the 'domlogd' plots with new plot parameters.
redosixframe: Redraws the 'dosixframe' plots with new plot parameters.


RUNNING MLOGD:

First make a working directory in the directory MLOGD:

mkdir WORK
cd WORK


Within this directory you must make another directory for a particular
  run, e.g.:

mkdir HBV.001

(The number of subdirectory levels is important: the run directory must be
  exactly two levels below the MLOGD base directory, e.g. MLOGD/WORK/HBV.001.)


Next you will need your alignment, ORF annotation, and parameter files
  (examples are shown in the 'EXAMPLE' directory; look at these to see the 
  required format):

Several alignment formats are accepted, including multi-FASTA and CLUSTALW.
  Name the alignment file 'allseqs.txt'.

You will need to fill out the parameters file 'mlogd.param' (see the
  web server help pages for more details):
   title     Title to use on plots.
   refseq    Name of reference sequence (must be one of the sequences in
               'allseqs.txt').
   range1    Nucleotide start range to use if wholeseq = 0.  (Reference
               sequence coordinates.)
   range2    Nucleotide end range to use if wholeseq = 0.  (Reference
               sequence coordinates.)
   minorf    Minimum ORF length in codons.  Only used by 'doallorfs' script.
   circular  '1' means the genome is circular; '0' means it is linear.
   wholeseq  Nucleotide range to use to calculate statistics: '1' means use 
               the whole reference sequence; '0' means use range1 to range2.
               '2' means use the query CDS(s) given in 'orfs.2.txt' (defaults
               to '0' for the 'doallorfs' and 'dosixframe' scripts).
   orftype   '0' means use 'start-stop' ORFs; '1' means use 'stop-stop' ORFs
               (only used by 'doallorfs' script).
   ncodons   Sliding window size in codons (only used by 'dosixframe' script).
   step      Step size for sliding windows in codons (only used by 'dosixframe'
               script).
   download  Leave equal to '1'.

There are two ORF/CDS files: 'orfs.1.txt' for the 'Known CDS(s)' or null model,
  and 'orfs.2.txt' for the 'Query CDS(s)' or alternate model.  The latter is 
  not used by the 'doallorfs' and 'dosixframe' scripts, but it should still be
  present even if empty.

Finally, for calculating statistics summed over a phylogenetic tree, you will
  need the file 'allpairs.txt', listing sequence pairs tracing round the
  perimeter of one possible tree (see website for details).


To start with, you may like to try running the scripts on the example files
  in the directory EXAMPLE.  In the following, be sure to include the
  '.'s and '0's - they are important.  After each command you should check 
  'output.html' (preferably in a web browser window) for any error messages.

cd WORK
mkdir HBV.001
cd HBV.001
cp ../../EXAMPLE/* .
prepmlogd . 0 > output.html
domlogd . . 0 0 >> output.html
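
As a quick first check before opening 'output.html' in a browser, you could 
  search it from the command line.  This sketch assumes that error messages
  contain the word "error", which this README does not guarantee, so it is no
  substitute for reading the file.  The function name 'show_errors' is 
  illustrative.

```shell
#!/bin/sh
# List lines mentioning "error" (any case) in an output file, with line
# numbers.  Assumption: error messages contain the word "error".
# Returns non-zero if no such line is found.
show_errors() {
    grep -i -n 'error' "$1"
}
```

For example, 'show_errors output.html' after each command.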

Now you can redo the nucleotide-by-nucleotide plots.  E.g. to change the
  running window to 7 nt, zoom in on the range 1028-1627 (reference sequence 
  coords), and add grid lines, try:

redoMLEnuc . 3 1028 1627 1 1 1 . >> output.html

In general the command is
    redoMLEnuc . halfwindow base1 base2 grids count threshold .
  where 'base1'-'base2' is the coordinate range to zoom in on, 1+2*'halfwindow'
  is the new running window size, 'grids' = '1' means add grid lines ('grids' 
  = '0' means without grid lines), 'count' is a suffix added to distinguish
  the new plot files from the originals (you should use 1, 2, 3, ...), and
  'threshold' is a number between 0 and 1 determining when to extend the 
  summed running mean plot into partially gapped regions (see website for 
  details).
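
Since the running window size is 1+2*'halfwindow', only odd window sizes are
  possible.  The following sketch converts a desired window size into the
  'halfwindow' argument; the function name 'halfwindow_for' is illustrative
  and not part of MLOGD.

```shell
#!/bin/sh
# Convert a desired running-window size (an odd number of nt) into the
# 'halfwindow' argument, using window = 1 + 2*halfwindow.
halfwindow_for() {
    window=$1
    if [ "$window" -lt 1 ] || [ $((window % 2)) -eq 0 ]; then
        echo "window size must be a positive odd integer" >&2
        return 1
    fi
    echo $(( (window - 1) / 2 ))
}

halfwindow_for 7    # prints 3, the value used in the 7 nt example
```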

You can also use 'domcsims' to use Monte Carlo sequence simulations to add 
  error bars and calculate some other statistics:

domcsims . 20 20 1 343 . >> output.html

In general the command is
    domcsims . nsim1 nsim2 count seed .
  where 'nsim1' is the number of simulations used to plot the general 
  distributions of scores as a function of sequence divergence, 'nsim2' is 
  the number of simulations used to derive each error bar, 'seed' is a random 
  seed (use a positive integer), and 'count' is a suffix added to distinguish
  the plot files from other plot files if you run the script multiple times
  (you should use 1, 2, 3, ...).  The maximum values for nsim1 and nsim2 
  will be limited to 200 (for 'mlogd.cxx') and 20 (for 'mlogd.cxx_mac').
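
If you want to catch out-of-range values before starting a long run, a check
  along the following lines could be used.  It is only a sketch of the limits
  described above; the function name 'check_domcsims_args' and its optional 
  fourth argument are illustrative and not part of MLOGD.

```shell
#!/bin/sh
# Check domcsims arguments against the limits described above.
# Usage: check_domcsims_args nsim1 nsim2 seed [limit]
# 'limit' is 200 for the standard build and 20 for the MacOS-X build.
check_domcsims_args() {
    nsim1=$1; nsim2=$2; seed=$3; limit=${4:-200}
    if [ "$nsim1" -gt "$limit" ] || [ "$nsim2" -gt "$limit" ]; then
        echo "nsim1 and nsim2 will be limited to $limit" >&2
        return 1
    fi
    if [ "$seed" -le 0 ]; then
        echo "seed should be a positive integer" >&2
        return 1
    fi
    return 0
}
```

For example:
    check_domcsims_args 20 20 343 && domcsims . 20 20 1 343 . >> output.html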


To run MLOGD on all ORFs greater than or equal to a given length (the 'minorf'
  parameter in 'mlogd.param') use:

cd ../../WORK
mkdir HBV.002
cd HBV.002
cp ../../EXAMPLE/* .
prepmlogd . 1 > output.html
doallorfs . . >> output.html


To run MLOGD in a sliding window in all six read-frames (the window size and 
  step size are set by the 'ncodons' and 'step' parameters in 'mlogd.param'), 
  use:

cd ../../WORK
mkdir HBV.003
cd HBV.003
cp ../../EXAMPLE/* .
prepmlogd . 2 > output.html
dosixframe . . >> output.html


You can redo the nucleotide-by-nucleotide plots.  E.g. to zoom in on the range 
  1028-1627 (reference sequence coords), and add grid lines, try:

redosixframe . 1028 1627 1 1 . 0.75 >> output.html

In general the command is
    redosixframe . base1 base2 grids count . threshold
  where 'base1'-'base2' is the coordinate range to zoom in on, 'grids' = '1' 
  means add grid lines ('grids' = '0' means without grid lines), 'count' is
  a suffix added to distinguish the new plot files from the originals 
  (you should use 1, 2, 3, ...), and 'threshold' is a number between 0 and 1 
  determining when to extend the plot into partially gapped regions (see
  website for details).


Note that you should start a new directory each time you run one of 'domlogd', 
  'doallorfs' and 'dosixframe', and for each of these you should always run
  'prepmlogd' first.  However, you should run 'domcsims' or 'redoMLEnuc' in 
  the same directory as you ran 'domlogd', and you should run 'redosixframe' 
  in the same directory as you ran 'dosixframe'.
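
Because each run needs a fresh directory, it can be convenient to generate the
  next unused name automatically.  The following is a minimal sketch that 
  follows the HBV.001, HBV.002, ... naming used in the examples; the function 
  name 'next_run_dir' is illustrative and not part of MLOGD.

```shell
#!/bin/sh
# Print the first unused run-directory name of the form PREFIX.NNN
# (PREFIX.001, PREFIX.002, ...) in the current directory.
next_run_dir() {
    prefix=$1
    n=1
    while :; do
        name=$(printf '%s.%03d' "$prefix" "$n")
        if [ ! -e "$name" ]; then
            echo "$name"
            return 0
        fi
        n=$((n + 1))
    done
}
```

For example, from within WORK:
    dir=$(next_run_dir HBV); mkdir "$dir"; cd "$dir"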

-------------------------------------------------------------------------------

MAXIMUM NUMBER OF INPUT SEQUENCES AND ALIGNMENT LENGTH:

These are currently set to
  maximum number of input sequences: 200 
  maximum number of input sequence pairs: 200 
  maximum length of input sequences (nt): 35000

You can change these by editing the lines
   #define maxlength 35000
   #define maxnseqs 201
   #define maxpairs 200
in mlogd.cxx (and similarly in calcprob.cxx, mcsim.cxx, ntadjust.cxx, 
if necessary) before compiling them.

Note that run-time is approximately proportional to the number of 
sequences and to the length of the alignment.
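
To compare an alignment against these limits before running, the following 
sketch counts the sequences and finds the longest one.  It assumes that 
'allseqs.txt' is in multi-FASTA format (the other accepted formats are not 
handled), and it counts gap characters as part of the length; whether the 
limit applies to the gapped or ungapped length is not stated here.  The 
function names are illustrative.

```shell
#!/bin/sh
# Count the sequences in a multi-FASTA file (one '>' header line each).
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Length of the longest sequence (gap characters included), summed over
# the wrapped lines of each record.
max_fasta_length() {
    awk '/^>/ { if (len > max) max = len; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (len > max) max = len; print max + 0 }' "$1"
}
```

For example, 'count_fasta_seqs allseqs.txt' and 'max_fasta_length allseqs.txt'
can be compared by eye with the maximum values listed above.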

-------------------------------------------------------------------------------

NOTE FOR MACOS-X USERS:

1) The software was developed on a RedHat LINUX computer.  I have tried to 
   make it compatible with MacOS-X X11 UNIX, but don't guarantee that it is.
   Please feel free to email me with bug reports.

2) I've noticed that MacOS-X tends to have a system-wide maximum stacksize 
   limit of 65536 kB.  To avoid exceeding this limit (resulting in a 
   'Segmentation Fault' error), in 'mlogd.cxx' I have edited the lines

   #define maxlength 35000
   #define maxnseqs 201
   #define maxpairs 200

   to 

   #define maxlength 10000
   #define maxnseqs 21
   #define maxpairs 20

   to make 'mlogd.cxx_mac'.  Unfortunately this limits the maximum length 
   of input sequences to 10000 nt and the maximum number of input sequences
   to 20.

3) You may find that your version of R doesn't recognize 'png256' in the R
   scripts and, as a result, fails to produce the png versions of the plots.  
   However, you should still get the eps and pdf versions of the plots.
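
If you run into the 'Segmentation Fault' error described in point 2, it may 
help to look at the stack size limits your shell reports.  The sketch below 
uses the 'ulimit' built-in of bash and similar shells (in csh the equivalent 
command is 'limit stacksize'); the function name 'stack_limits' is 
illustrative.

```shell
#!/bin/sh
# Print the current (soft) and maximum (hard) stack size limits.
# Values are in kB; either may be reported as "unlimited".
stack_limits() {
    echo "soft stack limit: $(ulimit -S -s)"
    echo "hard stack limit: $(ulimit -H -s)"
}

stack_limits
```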

-------------------------------------------------------------------------------

NUCLEOTIDE, CODON & AMINO ACID MUTATION MATRICES

If you wish to include your own amino acid, codon and nucleotide mutation 
matrices, instead of the default ones, then please follow the link from the 
base MLOGD page to 'Supplementary material' for details.