(last update of help was on 1/3/2021 by CMB)
17 7 2019. CRISPRDetect version 2.4 (download). https://github.com/davidchyou/CRISPRDetect_2.4
Version 2.4 has some bug fixes (compared to 2.2), an update to the CRISPRDirection (v2) module, and extends the identification of CRISPR I-E repeats. The array score is now also in the gff3 file. The gff3 and txt files can be downloaded directly from the web output. The web interface uses a similar version with web-specific additions.
7 5 2018. Updated CRISPRDirection module with an increase in accuracy for Type II direction calls. Modified to handle gbff files (e.g. GCF or GCA) and cas gene annotations in these.
6 5 2016. CRISPRDetect is a bioinformatic tool to find CRISPR arrays, and to discover and explore the CRISPR non-coding RNAs in sequence data. It is part of CRISPRSuite. The final submission version of CRISPRDetect v2.2 is described and available here. All earlier released presubmission '1' or 2.0 versions were testing versions and may have minor bugs.
News about CRISPRSuite is here:
For enquiries contact chris.brown at otago.ac.nz
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_1_451303_451334|32, Direction:Forward
TCCAATTTAAATATGCTTTTTTTACCAATTCC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_2_451364_451396|33, Direction:Forward
GACGAGGAATCTGATAACGCATTCCGTGTTCGC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_3_451426_451457|32, Direction:Forward
GTCGTTCATCAGGTCATGCTCTGACAGCTTCG
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_4_451487_451518|32, Direction:Forward
CAGACCCTCTTTGGTATACTTCAATTTTATAA
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_5_451548_451579|32, Direction:Forward
CAACGACATCGAAATCAGCGCTAAATATTTCT
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_6_451609_451640|32, Direction:Forward
GATGCGATGAAACACACGTATCTGGGCTATGC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_7_451670_451702|33, Direction:Forward
TGGAAGGCTGGGCGCGTGAAATCGGGTGCAATC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_8_451732_451763|32, Direction:Forward
TCTGTCATTGGCTGTCGGGGAGGTTCACTGCC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_9_451793_451824|32, Direction:Forward
CCCCAGCGCATAAACGACGCTATACCACAACC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_10_451854_451885|32, Direction:Forward
TCAGCCCCCCAAACAAACACAGCCGCTATCCA
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_11_451915_451946|32, Direction:Forward
TCTGCGTTAACAGGCCGCTTGTTTGGGCGTTT
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_12_451976_452007|32, Direction:Forward
ATAAATCATATTCAGAGTTGGAATTCTTGAAT
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_13_452037_452068|32, Direction:Forward
CTGTGGCTGAGCTATTTCGTGCGCGTCGTTGA
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_14_452098_452129|32, Direction:Forward
AGAAAAATACTCAATCGTGGTCGTGCCGAAAC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_15_452159_452190|32, Direction:Forward
CTGTTTGGGTTGATGAGAAGGAGAGGACTTTG
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_16_452220_452251|32, Direction:Forward
GATATGCAAAACGTCATTATCGGCGGCATTCT
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_17_452281_452312|32, Direction:Forward
GTTGCCGAAATCTTCAATGGCATGGATTTAGG
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_18_452342_452373|32, Direction:Forward
AGAAAAATACTCAATCGTGGTCGTGCCGAAAC
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_19_452403_452434|32, Direction:Forward
TGCTCGCAACGTTTGCGTATCAGGACTACGCA
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_20_452464_452496|33, Direction:Forward
TGCTAATTTGTAACCTTCCGGTATTACCGGAGA
>NC_017845|Pectobacterium-sp. SCC3193, complete genome.|2_451274_452587|2_21_452526_452557|32, Direction:Forward
TTCCAACCCTTTCAGCAAGCTCTACCTGAGTA

Description: CRISPRDetect preserves all the positional information for each spacer. Each header contains the start and stop positions as well as the index of the spacer. For example, '2_1_451303_451334|32' indicates that the spacer is the 1st spacer (from top or 5' end) of the 1st CRISPR from Pectobacterium-sp. SCC3193, and that the spacer is 32nt long and in forward direction (i.e. 5' to 3').
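The header layout described above can be unpacked with a few string splits. The following is a minimal sketch, not part of CRISPRDetect: the function name, the field names, and the reading of the leading number in each underscore-separated group as an array identifier are assumptions made for illustration.

```python
from typing import NamedTuple


class SpacerHeader(NamedTuple):
    accession: str
    description: str
    array_id: int        # assumed meaning of the leading number in each group
    array_start: int
    array_stop: int
    spacer_index: int    # 1-based index from the 5' end of the array
    spacer_start: int
    spacer_stop: int
    length: int
    direction: str


def parse_spacer_header(header: str) -> SpacerHeader:
    """Split one FASTA header line of the spacer output into its fields."""
    fields, sep, direction = header.strip().lstrip(">").rpartition(", Direction:")
    if not sep:
        raise ValueError("header has no ', Direction:' suffix")
    # The three right-most '|' fields are fixed; the description may be free text.
    head, array_part, spacer_part, length = fields.rsplit("|", 3)
    accession, description = head.split("|", 1)
    array_id, array_start, array_stop = (int(x) for x in array_part.split("_"))
    _, spacer_index, spacer_start, spacer_stop = (int(x) for x in spacer_part.split("_"))
    return SpacerHeader(accession, description, array_id, array_start, array_stop,
                        spacer_index, spacer_start, spacer_stop, int(length), direction)
```

For the first record shown above, this yields spacer index 1, coordinates 451303 to 451334, length 32 and direction Forward.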
1. Presence of either cas1 or cas2 genes in the genome (+1, or 0). (cas)
This method is only applied when an annotation file (NCBI gbk or gbff file) is used as input. The annotation file is searched (term based) to create a list of all cas genes present in the genome. The quality score is awarded ‘+1’ when an annotation of either a cas1 or a cas2 gene is present in the input file.
2. Match to known repeat using a set of reference repeats from high confidence arrays (+3). (likely repeat)
We used 26 experimentally verified representative repeats as a reference and extended the set of known repeats by allowing up to 7 base mismatches. This extended set of ~400 repeats was used to predict a higher-confidence set: arrays were predicted, and those with more than 7 repeats and scores > 4 were used to derive a set of likely repeats. This file was converted into a BLAST database, and each potential repeat is searched against it with blastn-short, which is optimised for short sequences. When a match is found, the array quality score is awarded ‘+3’. The file or the score can be modified in the command-line version.
3. Repeat has at least 23 bases and ATTGAAA(N) at the end (+3, or 0). (motif_match)
Another feature adapted from the CRISPRDirection algorithm is the presence of the motif ATTGAAA(N) at the 3’ end of repeats. We observed that this motif is an accurate indicator of the direction of transcription. In that paper we also observed that all the potential repeats that were >=23nt long and contained this motif were genuine CRISPRs. Hence, we use this information in the quality score: the score is awarded ‘+3’ when the repeat is >=23nt long and contains ATTGAAA(N) at the 3' end.
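The motif rule can be sketched with a regular expression. This is an illustrative check only, not the CRISPRDetect code: it assumes that "(N)" means an optional single trailing base after ATTGAAA, and the function and parameter names are hypothetical.

```python
import re

# ATTGAAA, optionally followed by one base, anchored at the 3' end of the repeat.
# Reading "(N)" as an optional base is an assumption.
MOTIF_3PRIME = re.compile(r"ATTGAAA[ACGTN]?\Z")


def motif_match_score(repeat: str, min_length: int = 23, award: float = 3.0) -> float:
    """Return +3 if the repeat is long enough and ends with the motif, else 0."""
    seq = repeat.strip().upper()
    if len(seq) >= min_length and MOTIF_3PRIME.search(seq):
        return award
    return 0.0
```

The repeat is expected in the direction of transcription, since the motif is only meaningful at the 3' end.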
4. Overall repeat identity within an array (0 to 1). (overall_repeat_identity)
The overall repeat identity score (S) is calculated as follows:
S= (average % identity of the repeats - 80)/20
The maximum possible positive score can be 1 (when all repeats are identical). However, the score will be negative, when the overall repeat identity is <80%.
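As a worked illustration of the formula above, the sketch below takes per-repeat percent identities as input. How those identities are computed is not described here, so they are treated as given; the function name is hypothetical.

```python
from typing import Sequence


def overall_repeat_identity_score(percent_identities: Sequence[float]) -> float:
    """S = (average % identity of the repeats - 80) / 20."""
    if not percent_identities:
        raise ValueError("at least one repeat identity is required")
    average = sum(percent_identities) / len(percent_identities)
    return (average - 80.0) / 20.0
```

An array of identical repeats (100% each) scores 1, an average of 80% scores 0, and an average of 70% scores -0.5.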
5. The repeats in the array do not form one sequence similarity cluster (-1.5, or 0). (one_repeat_cluster)
The repeats are clustered using CD-HIT-EST; if they form more than one cluster, the quality score is penalized by ‘-1.5’.
6. Scoring the repeat lengths (range -3 to +1). (exp_repeat_length)
In this method, we use the table of repeat length distribution (Figure 3). The relative score (S) for a repeat of length (L) is determined using the following rules:
S= 0.25 + L/H [where, L>=23 and L =< 47;
H is the most abundant repeat length for bacteria or archaea]
S= -0.25*(23 - L) [where, L <23]
S= -0.25*(L - 47) [where, L >47]
The maximum negative score limit is set to -3, and maximum positive score limit is +1.
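The three rules and the limits can be written as one small function. This is a sketch under stated assumptions: the most abundant repeat length H comes from the repeat length table and is not given in this text, so it is a required argument, and the function name is hypothetical.

```python
def repeat_length_score(length: int, most_abundant_length: int,
                        lower: int = 23, upper: int = 47,
                        floor: float = -3.0, cap: float = 1.0) -> float:
    """Score a repeat of the given length, limited to the range [floor, cap]."""
    if most_abundant_length <= 0:
        raise ValueError("most_abundant_length must be positive")
    if length < lower:
        score = -0.25 * (lower - length)
    elif length > upper:
        score = -0.25 * (length - upper)
    else:
        score = 0.25 + length / most_abundant_length
    return max(floor, min(cap, score))
```

With a purely illustrative H of 29, a 29nt repeat gives 1.25, which is limited to +1; a 20nt repeat gives -0.75; a 60nt repeat gives -3.25, which is limited to -3.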
7. Scoring the spacer lengths (range -3 to +3). (exp_spacer_length)
In this method, each spacer of an array is scored independently and counted towards a final spacer length score. A spacer with length (L) within the range 28-48 (see Fig 3B) is awarded a positive individual spacer length score (S) using the formula:
S = 0.01 + N/H [where, 27< L =<48;
N= Total number of spacers of this length;
H= Most abundant spacer length for bacteria or archaea]
Any spacer length outside this range is penalised by the following rule:
S=-0.10* (28 - L) [where, L<28]
S=-0.10* (L -48) [where, L>48]
Finally, an average spacer score for the current array is calculated using
Average score=Sum_of_scores/no_of_spacers
The maximum negative score limit is -3 and maximum positive score limit is +1.
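The per-spacer rules and the averaging step can be sketched as below. This is not the CRISPRDetect implementation. It assumes that N and H are both counts taken from a reference spacer length distribution (N for the spacer's own length, H for the most abundant length), which is one possible reading of the definitions above. The function name and the `reference_counts` argument are hypothetical, and the limits default to the -3 and +1 stated in the text.

```python
from typing import Mapping, Sequence


def spacer_length_score(spacer_lengths: Sequence[int],
                        reference_counts: Mapping[int, int],
                        lower: int = 28, upper: int = 48,
                        floor: float = -3.0, cap: float = 1.0) -> float:
    """Average the individual spacer length scores and limit to [floor, cap]."""
    if not spacer_lengths:
        raise ValueError("at least one spacer is required")
    # Assumed reading: H is the count of the most abundant length in the reference.
    h = max(reference_counts.values())
    total = 0.0
    for length in spacer_lengths:
        if length < lower:
            total += -0.10 * (lower - length)
        elif length > upper:
            total += -0.10 * (length - upper)
        else:
            total += 0.01 + reference_counts.get(length, 0) / h
    average = total / len(spacer_lengths)
    return max(floor, min(cap, average))
```

With a made-up reference of 100 spacers of length 32 and 50 of length 33, a single 33nt spacer scores 0.51 and a single 18nt spacer scores -1.0.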
8. Overall spacer identity (-3 to +1) (spacer_identity)
In this method we test the sequence (dis)similarity among all the spacers. If the spacers are all nearly identical, the array is more likely to be a direct repeat, possibly a tandem repeat, than a CRISPR array. The spacers are grouped into clusters at >=80% identity. If this gives a total number of clusters (C), the spacer identity score (S) for an array with number of spacers (N) is calculated using the following rule:
S= -3 [where, C =< integer (N/2); ]
S= 0.20*C [where, C > integer (N/2); ]
The positive score limit is +1.
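Given the number of clusters and the number of spacers, the rule above reduces to a short function. This is a sketch: the clustering step itself is not shown, C is taken as an input, and the function name is hypothetical.

```python
def spacer_identity_score(n_clusters: int, n_spacers: int, cap: float = 1.0) -> float:
    """S = -3 if C <= integer(N/2), else 0.20 * C limited to +1."""
    if n_spacers <= 0 or n_clusters <= 0:
        raise ValueError("n_spacers and n_clusters must be positive")
    if n_clusters <= n_spacers // 2:
        return -3.0
    return min(cap, 0.20 * n_clusters)
```

For example, 21 spacers falling into 20 clusters give 0.20 * 20 = 4, which is limited to +1, whereas 10 spacers falling into only 2 clusters give -3.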
9. Scoring total number of identical repeats (0 to +1) (log(total repeats) - log(total mutated repeats))
Since longer arrays, and those with a greater number of identical repeats, are more likely to be true CRISPRs, this scoring method uses both. If an array contains ‘P’ identical repeats out of the ‘N’ total number of repeats, then the score (S) is calculated using the following rule:
S= log (N) - log (N-P) [where, P=Identical repeats, N= total number of repeats]
The maximum positive score limit is +1.
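A minimal sketch of this rule follows. Two points are assumptions, since they are not stated here: the logarithm base (natural log is the default below and can be changed through the `base` argument) and the case where every repeat is identical, where N - P is 0 and the logarithm is undefined (treated as the maximum score). The function name is hypothetical.

```python
import math


def identical_repeat_score(total_repeats: int, identical_repeats: int,
                           base: float = math.e, cap: float = 1.0) -> float:
    """S = log(N) - log(N - P), limited to +1."""
    if total_repeats <= 0 or not 0 <= identical_repeats <= total_repeats:
        raise ValueError("need 0 <= identical_repeats <= total_repeats and total_repeats > 0")
    mutated = total_repeats - identical_repeats
    if mutated == 0:
        # log(0) is undefined; returning the cap here is an assumption.
        return cap
    score = math.log(total_repeats, base) - math.log(mutated, base)
    return min(cap, score)
```

With no identical repeats the score is 0, and it rises towards the limit as the fraction of mutated repeats shrinks.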