DRIVeR
Diversity Resulting from
In Vitro Recombination.
- Download driver.cxx and
driver.batch.cxx (for calculating the
expected number of distinct sequences in a library constructed by
in vitro recombination of two highly homologous sequences).
- Download the Monte Carlo simulation programme
driver_mc.cxx.
The programmes are written in C++ and should run under Linux, Mac OS X
or MS-Windows, provided you have a C++ compiler. Most users will not
need to download the software, as the web server provides a more
convenient interface.
Click here for some warnings.
Problem: Given a library of L sequences generated by
random recombination of two near-identical genes differing at only a
small number of known nucleotide (or codon) positions, we wish to
calculate the expected number of distinct sequences in the library.
The calculation typically assumes that the mean number of crossovers per
sequence, m, is less than 0.1 x the sequence length N.
1) driver.cxx and driver.batch.cxx
These programmes are for calculating the expected number of distinct
sequences in a library generated by random crossovers between two
near-identical sequences. In driver.cxx the user inputs the
library size L, sequence length N, mean number of
crossovers per sequence m (or lambda), and a list of the
variable positions. It then calculates the probabilities of there
being an even or odd number of crossovers between each pair of
consecutive variable positions. Multiplying these probabilities gives
the relative probabilities of each of the 2^M possible daughter
sequences (where M is the total number of variable positions).
From these it calculates the probability that each daughter sequence
will be present in the library and hence the expected number of
distinct sequences in the library. driver.batch.cxx is
similar, but calculates the expected number of distinct sequences in
the library for a range of L and m values centred on
the input L and m.
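As an illustration of the calculation described above, the following is a minimal C++ sketch, not the code in driver.cxx. It assumes that the number of crossovers in an interval is Poisson distributed with mean m x (interval length) / N, that each parent is equally likely to supply the first variable position, and that a daughter with probability p is present in a library of size L with probability 1 - (1 - p)^L. The function names prob_odd and expected_distinct are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Probability of an odd number of events for a Poisson variable with mean mu.
double prob_odd(double mu) {
    return 0.5 * (1.0 - std::exp(-2.0 * mu));
}

// Expected number of distinct daughter sequences in a library of size L.
// positions: variable positions in ascending order.
// N: sequence length; m: true mean number of crossovers per sequence.
// Assumption: crossovers are spread uniformly along the sequence, so an
// interval of length d receives on average m * d / N crossovers.
// Intended for a small number of variable positions (the loop is 2^(M-1) long).
double expected_distinct(double L, double N, double m,
                         const std::vector<long>& positions) {
    const std::size_t M = positions.size();
    if (M == 0) return L > 0 ? 1.0 : 0.0;

    // Odd-crossover probability for each of the M-1 intervals.
    std::vector<double> p_odd;
    for (std::size_t i = 1; i < M; ++i) {
        double mu = m * static_cast<double>(positions[i] - positions[i - 1]) / N;
        p_odd.push_back(prob_odd(mu));
    }

    // Each bit pattern says which intervals carry an odd number of crossovers.
    // Two daughters (one per starting parent) share each pattern, and each
    // gets half of the pattern's probability (assumed 50:50 starting parent).
    const std::size_t n_patterns = std::size_t(1) << (M - 1);
    double expected = 0.0;
    for (std::size_t pattern = 0; pattern < n_patterns; ++pattern) {
        double p = 0.5;
        for (std::size_t i = 0; i + 1 < M; ++i) {
            bool odd = (pattern >> i) & 1u;
            p *= odd ? p_odd[i] : 1.0 - p_odd[i];
        }
        // P(present) = 1 - (1 - p)^L, evaluated stably for small p.
        double present = -std::expm1(L * std::log1p(-p));
        expected += 2.0 * present;
    }
    return expected;
}
```

For example, expected_distinct(1e6, 1000, 2, {100, 500, 900}) evaluates the sketch for a library of 10^6 sequences of length 1000 with three variable positions; the result can never exceed 2^M = 8.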
Note that you may, to a good approximation, treat your sequence either as a
sequence of nucleotides with a few variable nucleotides or as a sequence of
codons with a few variable codons.
Compile the programmes as follows (replace 'g++' by an appropriate alternative,
e.g. 'c++' or 'clang++', if you're using a different C++ compiler):
g++ -o driver driver.cxx
g++ -o driver.batch driver.batch.cxx
Before running the programmes, you will need to make a file listing
the variable positions. The first line lists the number of variable
positions. The remaining lines list the positions. These must be in
numerical order. Click here for an
example position file.
Run the programmes as follows:
./driver L N m posfile outfile xtrue
./driver.batch L N m posfile outfile xtrue
where
L = library size,
N = sequence length,
m = mean number of crossovers per sequence,
posfile is the list of variable positions (e.g. use driver.in),
outfile is the output data file (e.g. use driver.dat),
and xtrue is 1 if m is the true mean number of crossovers
per sequence and 0 if m is the mean number of observable
crossovers per sequence (click here for
details on counting crossovers).
driver.cxx outputs to screen the total number of possible
sequences, the expected number of distinct sequences in the library,
the true mean number of crossovers per sequence, and the mean
number of observable crossovers per sequence (click here for details on counting
crossovers). It also produces an output file outfile (html
format) with columns:
1) coordinates of each interval between variable positions,
2) length of the interval,
3) the mean expected number of crossovers in the interval,
4) the probability for an even number of crossovers in the interval,
5) the probability for an odd number of crossovers in the interval.
driver.batch.cxx produces two output files - outfile (html
format) and outfile2 (plain text format), with columns:
1) true mean number of crossovers per sequence,
2) observed mean number of crossovers per sequence,
3-12) expected number of distinct sequences for different library sizes.
The library sizes (columns) range from L / 32 to L x 16,
while the crossover rates (rows) range from about m / 30 to m
x 30.
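The ten library-size columns can be pictured with the short C++ sketch below. It assumes that successive columns differ by a factor of 2, which is consistent with ten columns spanning L / 32 to L x 16 but is not stated explicitly above; the function name batch_library_sizes is hypothetical. The spacing of the crossover rates between m / 30 and m x 30 is not specified here, so it is not sketched.

```cpp
#include <vector>

// Library sizes for columns 3-12: L/32, L/16, ..., L, ..., L*16
// (assumed factor-of-2 spacing).
std::vector<double> batch_library_sizes(double L) {
    std::vector<double> sizes;
    double size = L / 32.0;
    for (int i = 0; i < 10; ++i) {
        sizes.push_back(size);
        size *= 2.0;
    }
    return sizes;
}
```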
Currently the maximum number of variable positions is limited to 20
(in driver.cxx) and 15 (in driver.batch.cxx). Also
the maximum sequence length is 10^8 and the maximum library size is
10^12. You can change these by editing the
#define maxpos 20
#define maxndaugh 524288 // pow(2,maxpos-1)
#define maxn 100000000 // 10^8
#define maxl 1000000000000. // 10^12
lines in the programmes, and recompiling. If you change maxpos,
also set maxndaugh to the matching power of two, as indicated by the
comment on that line. Note that maxn, maxpos and maxndaugh are
integers. In general the compiler will limit the maximum size of
integers to 2^31 - 1 ~= 2.1 x 10^9. Some compilers may limit the
maximum size of integers to 2^15 - 1 ~= 32000. If any of maxn, maxpos
or maxndaugh exceeds the relevant limit, then you will get nonsense
results when you run the programmes.
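One way to catch an out-of-range value before running anything is a compile-time check, sketched below in C++ (C++11 or later). The constants mirror the #define lines above but use hypothetical names (kMaxPos, kMaxNDaugh, kMaxN), and the helper fits_in_int is likewise an illustration, not part of the programmes.

```cpp
#include <climits>

// Hypothetical stand-ins for the #define values, held in a wide type so the
// comparison itself cannot overflow.
constexpr long long kMaxPos = 20;
constexpr long long kMaxNDaugh = 1LL << (kMaxPos - 1);  // 2^(maxpos-1)
constexpr long long kMaxN = 100000000LL;                // 10^8

// Compilation fails here if a value would not fit in this compiler's int.
static_assert(kMaxN <= INT_MAX, "maxn does not fit in int on this compiler");
static_assert(kMaxNDaugh <= INT_MAX, "maxndaugh does not fit in int on this compiler");

// Run-time version of the same check for an arbitrary value.
bool fits_in_int(long long value) {
    return value >= INT_MIN && value <= INT_MAX;
}
```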
If you get a segmentation fault error it probably means you
need to increase your stacksize - use the ulimit command (e.g.
ulimit -s in bash) or the limit command (e.g. limit stacksize in csh).
Links to download programmes: driver.cxx,
driver.batch.cxx.
2) driver_mc.cxx
This programme does a full Monte Carlo simulation for the DRIVeR scenario.
It may be useful for checking the analytic calculations used in
driver.cxx, but is relatively slow, especially for large numbers of
variable positions or large library sizes.
Compile the programme as follows (replace 'g++' by an appropriate alternative,
e.g. 'c++' or 'clang++', if you're using a different C++ compiler):
g++ -o driver_mc driver_mc.cxx
Before running the programme, you will need to make a file listing the
variable positions. The first line lists the number of variable positions.
The remaining lines list the positions. These must be in numerical order.
Click here for an example position file.
Run the programme as follows:
./driver_mc L N m posfile seed nsims approx
where
L = library size,
N = sequence length,
m = (true) mean number of crossovers per sequence
(click for details),
posfile is the list of variable positions (e.g. use driver.in),
seed = random seed (positive integer),
nsims = requested number of simulated libraries, and
approx = 1 or 0 tells the programme which method to use (see below).
The programme outputs to screen the mean and standard deviation of the
number of distinct daughter sequences per simulated library. For the
final simulated library only, the programme outputs to the file
mc.dat the number of times each of the possible daughter
sequences (encoded by 0,1,2,...,(2^M)-1) occurs in the library.
driver_mc.cxx may use one of two methods for generating the
simulated sequences:
approx = 1: For each sequence in the library, a random
Poisson variable with mean m is used to select the number of
crossovers. These are then applied at random places in the sequence.
approx = 0: Every position in each simulated sequence is
tested using a random number to decide whether a crossover occurs at that
site or not.
The approx = 1 method is quicker, but a bit less accurate.
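The two methods can be sketched in C++ as follows. This is an illustration using the standard <random> library, not the code in driver_mc.cxx. It assumes N candidate crossover sites numbered 1 to N, a per-site crossover probability of m / N (with 0 < m <= N), and an encoding in which bit i of the daughter code records which parent supplied variable position i. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// approx = 1 style: draw the number of crossovers from a Poisson
// distribution with mean m (m must be > 0), then place each at a random site.
std::vector<int> crossovers_poisson(int N, double m, std::mt19937& rng) {
    std::poisson_distribution<int> count(m);
    std::uniform_int_distribution<int> site(1, N);
    std::vector<int> sites(count(rng));
    for (int& s : sites) s = site(rng);
    std::sort(sites.begin(), sites.end());
    return sites;
}

// approx = 0 style: test every site with probability m / N (assumed rate).
std::vector<int> crossovers_per_site(int N, double m, std::mt19937& rng) {
    std::bernoulli_distribution cross(m / N);
    std::vector<int> sites;
    for (int s = 1; s <= N; ++s)
        if (cross(rng)) sites.push_back(s);
    return sites;  // already in ascending order
}

// Converts sorted crossover sites into a daughter code. A crossover at site s
// switches the parent for all positions after s. Bit i of the result is the
// parent (0 or 1) at variable position i (assumed encoding).
unsigned daughter_code(const std::vector<int>& crossovers,
                       const std::vector<int>& positions,
                       int start_parent) {
    unsigned code = 0;
    int parent = start_parent;
    std::size_t c = 0;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        while (c < crossovers.size() && crossovers[c] < positions[i]) {
            parent ^= 1;
            ++c;
        }
        code |= static_cast<unsigned>(parent) << i;
    }
    return code;
}
```

One visible difference in this sketch is that the Poisson method can place two crossovers at the same site, whereas the per-site method allows at most one crossover per site.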
Current limits are maximum number of simulated libraries = 100000,
maximum sequence length = 2000, maximum library size = 1000000, and
maximum number of variable positions = 12. You can change these by
editing the
#define maxniter 100000
#define maxn 2000
#define maxl 1000000
#define maxpos 12
#define maxndaugh 4096 // pow(2,maxpos)
lines in driver_mc.cxx, and recompiling. Beware of
increasing the maximum sequence length above about 10^9, or decreasing
the crossover rate m / N below about 1/10^9, as
these may be too extreme for the random number generator to resolve
(typically the random numbers have 9-10 random digits). If you change
maxpos, also set maxndaugh to the matching power of two, as indicated
by the comment on that line. Note also that all these numbers are
integers. In general the compiler will limit the maximum size of
integers to 2^31 - 1 ~= 2.1 x 10^9. Some compilers may limit the
maximum size of integers to 2^15 - 1 ~= 32000. If any of these numbers
exceeds the relevant limit, then you will get nonsense results when you
run the programme.
Link to download programme:
driver_mc.cxx.
Notes:
- You must agree to the Terms of Usage
before using any of this software.
- If you use this software for publications, please cite Wayne M. Patrick,
Andrew E. Firth and Jonathan M. Blackburn, 2003, User-friendly algorithms
for estimating completeness and diversity in randomized protein-encoding
libraries, Protein Engineering, 16, 451-457 or Andrew E.
Firth and Wayne M. Patrick, 2005, Statistics of protein library
construction, Bioinformatics, 21, 3314-3315.
- If you seem to be getting bizarre results, check that none of the
limitations on L, N, m etc. have been violated (see
the maths notes).
- All corrections and notifications of bugs are gratefully received.
- Queries or comments to Andrew Firth (aef24cam.ac.uk).
- AEF gratefully acknowledges funding from the Foundation for Research,
Science and Technology, grant number UOOX0304.