You must read TermsOfUsage before installing or using any of this software.

###########################################################################
# The software is Copyright (C) 2005 Andrew E Firth, University of Otago, #
# Dunedin, New Zealand, aef(at)sanger.otago.ac.nz                         #
#                                                                         #
# The software is free software; you can redistribute it and/or modify    #
# it under the terms of the GNU General Public License (version 2) as     #
# published by the Free Software Foundation.                              #
#                                                                         #
# The software is distributed in the hope that it will be useful,         #
# but WITHOUT ANY WARRANTY; without even the implied warranty of          #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the            #
# GNU General Public License for more details.                            #
#                                                                         #
# You should have received a copy of the GNU General Public License       #
# along with this program; if not, write to the Free Software             #
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA           #
# 02110-1301, USA.                                                        #
###########################################################################

Note that the software was written for a web server, so a lot of the text
output is in HTML. The easiest way to view the output is by opening it in a
standard web browser window. This will also allow hypertext linking to the
help pages describing the output etc. Note that, because the help pages are
lifted directly from the web server, a few of the comments and links
referring back to the web server will be broken.

The programmes are written in C++ and C-shell scripts, and should run under
any LINUX/UNIX-like environment (including MacOS-X X11 - but see below). You
will need a C++ compiler and the ability to run C-shell (csh) scripts.
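
As a quick sanity check before installing, the following is a minimal sketch
(not part of the MLOGD distribution) that reports whether a C++ compiler and
csh can be found on your PATH. It is written for a POSIX shell (sh). The
function name 'check_tools' and the choice of 'g++' are assumptions made for
illustration; substitute the name of your own compiler.

```shell
#!/bin/sh
# Hypothetical helper, not part of MLOGD.
# Report whether each named command can be found on the PATH.
# Prints one line per tool; returns non-zero if any tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# 'g++' is an assumption; replace it with whichever C++ compiler you use.
check_tools g++ csh || echo "Install the missing tools before continuing."
```

The same function can be given any other command names you want to confirm
are installed, for example 'check_tools seqret getorf R'.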
-------------------------------------------------------------------------------

INSTALLATION:

First unzip and unpack the distribution file:

    gunzip mlogd.tar.gz
    tar xvf mlogd.tar
    cd MLOGD

This will give you the files:

    calcprob.cxx   domlogd         mlogd.cxx     mlogd.cxx_mac  redoMLEnuc
    doallorfs      domlogd_mac     ntadjust.cxx  prepmlogd      redosixframe
    doallorfs_mac  dosixframe      mcsim.cxx     prepmlogd_mac  runmean2.cxx
    domcsims       dosixframe_mac  minmax.cxx    README         TermsOfUsage

and the sub-directories:

    EXAMPLE  FORM  SCRIPTS

If you are using MacOS-X X11 then

    cp doallorfs_mac doallorfs
    cp domlogd_mac domlogd
    cp dosixframe_mac dosixframe
    cp prepmlogd_mac prepmlogd
    cp mlogd.cxx_mac mlogd.cxx

Now compile the C++ programmes (replace 'g++' by an appropriate alternative,
e.g. 'c++' or 'gcc', if you're using a different C++ compiler):

    g++ -o calcprob calcprob.cxx
    g++ -o mcsim mcsim.cxx
    g++ -o minmax minmax.cxx
    g++ -o mlogd mlogd.cxx
    g++ -o ntadjust ntadjust.cxx
    g++ -o runmean2 runmean2.cxx

Then copy these programmes to your ~/bin directory:

    mkdir ~/bin
    cp calcprob ~/bin
    cp mcsim ~/bin
    cp minmax ~/bin
    cp mlogd ~/bin
    cp ntadjust ~/bin
    cp runmean2 ~/bin

Now make sure that doallorfs, domcsims, domlogd, dosixframe, prepmlogd,
redoMLEnuc and redosixframe are executable:

    chmod u+x doallorfs domcsims domlogd dosixframe prepmlogd redoMLEnuc redosixframe

And copy these to your ~/bin directory:

    cp doallorfs ~/bin
    cp domcsims ~/bin
    cp domlogd ~/bin
    cp dosixframe ~/bin
    cp prepmlogd ~/bin
    cp redoMLEnuc ~/bin
    cp redosixframe ~/bin

OTHER REQUIRED SOFTWARE:

You will also need to download and install the following software if you
don't already have it:

    seqret    Part of the EMBOSS package.
    degapseq  Obtain from http://emboss.sourceforge.net/
    infoseq
    noreturn
    getorf

    R         Statistics and graphics package.
              Obtain from http://www.r-project.org/

The following programme is recommended for doing your sequence alignments:

    code2aln  Obtain from http://www.tbi.univie.ac.at/~roman/Code2aln/

or, alternatively, the updated version,

    codaln    Obtain from http://www.bioinf.uni-leipzig.de/Software/codaln/

BRIEF SUMMARY OF MLOGD SCRIPTS:

    prepmlogd:     Prepares input files for all scripts.
    domlogd:       Calculates coding (alternate model) versus non-coding (null
                   model) likelihood statistics and plots, for a given query
                   (i.e. hypothetical or potential) CDS.
    doallorfs:     Finds all ORFs in reference sequence and runs 'domlogd' on
                   each.
    dosixframe:    Runs 'domlogd' in a sliding window in all six read-frames
                   and plots scores.
    domcsims:      Runs Monte Carlo sequence evolution simulations to put
                   error bars on the statistics calculated by 'domlogd'.
    redoMLEnuc:    Redraws the 'domlogd' plots with new plot parameters.
    redosixframe:  Redraws the 'dosixframe' plots with new plot parameters.

RUNNING MLOGD:

First make a working directory in the directory MLOGD:

    mkdir WORK
    cd WORK

Within this directory you must make another directory for a particular run,
e.g.:

    mkdir HBV.001

(The exact number of subdirectory levels between the working directory and
the MLOGD base directory is important.)

Next you will need your alignment, ORF annotation, and parameter files
(examples are shown in the 'EXAMPLE' directory; look at these to see the
required format):

Several alignment formats are accepted, including multi-FASTA and CLUSTALW.
Name the alignment file 'allseqs.txt'.

You will need to fill out the parameters file 'mlogd.param' (see the web
server help pages for more details):

    title     Title to use on plots.
    refseq    Name of reference sequence (must be one of the sequences in
              'allseqs.txt').
    range1    Nucleotide start range to use if wholeseq = 0. (Reference
              sequence coordinates.)
    range2    Nucleotide end range to use if wholeseq = 0. (Reference
              sequence coordinates.)
    minorf    Minimum ORF length in codons. Only used by 'doallorfs' script.
    circular  '1' means the genome is circular; '0' means it is linear.
    wholeseq  Nucleotide range to use to calculate statistics: '1' means use
              the whole reference sequence; '0' means use range1 to range2;
              '2' means use the query CDS(s) given in 'orfs.2.txt' (defaults
              to '0' for the 'doallorfs' and 'dosixframe' scripts).
    orftype   '0' means use 'start-stop' ORFs; '1' means use 'stop-stop' ORFs
              (only used by 'doallorfs' script).
    ncodons   Sliding window size in codons (only used by 'dosixframe'
              script).
    step      Step size for sliding windows in codons (only used by
              'dosixframe' script).
    download  Leave equal to '1'.

There are two ORF/CDS files: 'orfs.1.txt' for the 'Known CDS(s)' or null
model, and 'orfs.2.txt' for the 'Query CDS(s)' or alternate model. The latter
is not used by the 'doallorfs' and 'dosixframe' scripts, but it should still
be present even if empty.

Finally, for calculating statistics summed over a phylogenetic tree, you will
need the file 'allpairs.txt', listing sequence pairs tracing round the
perimeter of one possible tree (see website for details).

To start with, you may like to try running the scripts on the example files
in the directory EXAMPLE. In the following, be sure to include the '.'s and
'0's - they are important. After each command you should check 'output.html'
(preferably in a web browser window) for any error messages.

    cd WORK
    mkdir HBV.001
    cd HBV.001
    cp ../../EXAMPLE/* .
    prepmlogd . 0 > output.html
    domlogd . . 0 0 >> output.html

Now you can redo the nucleotide-by-nucleotide plots. E.g. to change the
running window to 7 nt, zoom in on the range 1028-1627 (reference sequence
coords), and add grid lines, try:

    redoMLEnuc . 3 1028 1627 1 1 1 . >> output.html

In general the command is

    redoMLEnuc . halfwindow base1 base2 grids count threshold .

where 'base1'-'base2' is the coordinate range to zoom in on, 1+2*'halfwindow'
is the new running window size, 'grids' = '1' means add grid lines ('grids' =
'0' means without grid lines), 'count' is a suffix added to distinguish the
new plot files from the originals (you should use 1, 2, 3, ...), and
'threshold' is a number between 0 and 1 determining when to extend the summed
running mean plot into partially gapped regions (see website for details).

You can also use 'domcsims' to use Monte Carlo sequence simulations to add
error bars and calculate some other statistics:

    domcsims . 20 20 1 343 . >> output.html

In general the command is

    domcsims . nsim1 nsim2 count seed .

where 'nsim1' is the number of simulations used to plot the general
distributions of scores as a function of sequence divergence, 'nsim2' is the
number of simulations used to derive each error bar, 'seed' is a random seed
(use a positive integer), and 'count' is a suffix added to distinguish the
plot files from other plot files if you run the script multiple times (you
should use 1, 2, 3, ...). The maximum values for nsim1 and nsim2 will be
limited to 200 (for 'mlogd.cxx') and 20 (for 'mlogd.cxx_mac').

To run MLOGD on all ORFs greater than or equal to a given length (the
'minorf' parameter in 'mlogd.param') use:

    cd ../../WORK
    mkdir HBV.002
    cd HBV.002
    cp ../../EXAMPLE/* .
    prepmlogd . 1 > output.html
    doallorfs . . >> output.html

To run MLOGD in a sliding window (window size = 'ncodons', step size = 'step'
parameters in 'mlogd.param') in all six read-frames, use:

    cd ../../WORK
    mkdir HBV.003
    cd HBV.003
    cp ../../EXAMPLE/* .
    prepmlogd . 2 > output.html
    dosixframe . . >> output.html

You can redo the nucleotide-by-nucleotide plots. E.g. to zoom in on the range
1028-1627 (reference sequence coords), and add grid lines, try:

    redosixframe . 1028 1627 1 1 . 0.75 >> output.html

In general the command is

    redosixframe . base1 base2 grids count . threshold

where 'base1'-'base2' is the coordinate range to zoom in on, 'grids' = '1'
means add grid lines ('grids' = '0' means without grid lines), 'count' is a
suffix added to distinguish the new plot files from the originals (you should
use 1, 2, 3, ...), and 'threshold' is a number between 0 and 1 determining
when to extend the plot into partially gapped regions (see website for
details).

Note that you should start a new directory each time you run one of
'domlogd', 'doallorfs' and 'dosixframe', and for each of these you should
always run 'prepmlogd' first. However, you should run 'domcsims' or
'redoMLEnuc' in the same directory as you ran 'domlogd', and you should run
'redosixframe' in the same directory as you ran 'dosixframe'.

-------------------------------------------------------------------------------

MAXIMUM NUMBER OF INPUT SEQUENCES AND ALIGNMENT LENGTH:

These are currently set to:

    maximum number of input sequences:       200
    maximum number of input sequence pairs:  200
    maximum length of input sequences (nt):  35000

You can change these by editing the lines

    #define maxlength 35000
    #define maxnseqs 201
    #define maxpairs 200

in mlogd.cxx (and similarly in calcprob.cxx, mcsim.cxx, ntadjust.cxx, if
necessary) before compiling them. Note that run-time is approximately
linearly proportional to the number of sequences and the length of the
alignment.

-------------------------------------------------------------------------------

NOTE FOR MACOS-X USERS:

1) The software was developed on a RedHat LINUX computer. I have tried to
   make it compatible with MACOS-X X11 UNIX, but don't guarantee that it is.
   Please feel free to email me with bugs.

2) I've noticed that MacOS-X tends to have a system-wide maximum stacksize
   limit of 65536 kB.
   To avoid exceeding this limit (resulting in a 'Segmentation Fault' error),
   in 'mlogd.cxx' I have edited the lines

       #define maxlength 35000
       #define maxnseqs 201
       #define maxpairs 200

   to

       #define maxlength 10000
       #define maxnseqs 21
       #define maxpairs 20

   to make 'mlogd.cxx_mac'. Unfortunately this limits the maximum length of
   input sequences to 10000 nt and the maximum number of input sequences to
   20.

3) You may find that your version of R doesn't recognize 'png256' in the R
   scripts and, as a result, fails to produce the png versions of the plots.
   However, you should still get the eps and pdf versions of the plots.

-------------------------------------------------------------------------------

NUCLEOTIDE, CODON & AMINO ACID MUTATION MATRICES:

If you wish to include your own amino acid, codon and nucleotide mutation
matrices, instead of the default ones, then please follow the link from the
base MLOGD page to 'Supplementary material' for details.
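
-------------------------------------------------------------------------------

CHECKING AN ALIGNMENT AGAINST THE SIZE LIMITS (UNOFFICIAL SKETCH):

Returning to the input limits described under 'MAXIMUM NUMBER OF INPUT
SEQUENCES AND ALIGNMENT LENGTH', the following is a minimal sketch (not part
of the MLOGD distribution) for checking an alignment before a run. It is
written for a POSIX shell (sh) and assumes that 'allseqs.txt' is in
multi-FASTA format; it will not work on CLUSTALW files. The function names
'fasta_stats' and 'check_limits' are hypothetical. It also assumes that the
length limit applies to the aligned rows with gap characters included, which
this README does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helpers, not part of MLOGD. Multi-FASTA input is assumed.

# Print "<number of sequences> <longest row length>" for a multi-FASTA file.
# Row length counts every character on the sequence lines, gaps included
# (an assumption about how the limit is applied).
fasta_stats() {
    awk '
        /^>/ { if (len > max) max = len; n++; len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (len > max) max = len; print n + 0, max + 0 }
    ' "$1"
}

# Compare a file against limits. Arguments: file, max sequences, max length.
# Returns non-zero if either limit is exceeded.
check_limits() {
    stats=$(fasta_stats "$1")
    nseqs=${stats% *}
    maxlen=${stats#* }
    if [ "$nseqs" -gt "$2" ] || [ "$maxlen" -gt "$3" ]; then
        echo "too large: $nseqs sequences, longest $maxlen nt"
        return 1
    fi
    echo "ok: $nseqs sequences, longest $maxlen nt"
}

# Limits for the default 'mlogd.cxx' as given above: 200 sequences, 35000 nt.
if [ -f allseqs.txt ]; then
    check_limits allseqs.txt 200 35000
fi
```

For the 'mlogd.cxx_mac' build, the corresponding call would be
'check_limits allseqs.txt 20 10000'.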