What is it?
Whatever biological question it addresses, an RNA-seq analysis requires the computational prediction of small-scale mutations, indels, splice junctions, or fusion RNAs. This prediction is currently performed by complex pipelines that chain multiple tools for mapping, coverage computation, and prediction in distinct steps.
We propose a novel way of analyzing reads that integrates genomic locations and local coverage, and delivers all of the above-mentioned predictions in a single step. Our program, CRAC, uses a double k-mer profiling approach to detect candidate mutations, indels, and splice or fusion junctions in each individual read.
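As a rough illustration of the idea, the toy sketch below computes two profiles over the k-mers of a read: the positions where each k-mer occurs in a reference sequence, and the number of reads that contain it. It is not CRAC's implementation; the function names (kmers, location_profile, support_profile), the dictionary-based index, and the tiny sequences are assumptions made for illustration only.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def location_profile(read, genome, k):
    """For each k-mer of the read, list its start positions in the genome."""
    index = {}
    for pos, kmer in enumerate(kmers(genome, k)):
        index.setdefault(kmer, []).append(pos)
    return [index.get(kmer, []) for kmer in kmers(read, k)]


def support_profile(read, reads, k):
    """For each k-mer of the read, count the reads that contain it."""
    counts = Counter()
    for r in reads:
        counts.update(set(kmers(r, k)))  # count each read at most once per k-mer
    return [counts[kmer] for kmer in kmers(read, k)]


# Hypothetical toy data: the first two reads carry one substitution (A -> C)
# relative to the genome; the third read is the unmodified genomic sequence.
genome = "ACGTTGCAAGGCTA"
reads = ["GTTGCCAGGC", "TTGCCAGGCT", "GTTGCAAGGC"]
k = 4

locations = location_profile(reads[0], genome, k)
support = support_profile(reads[0], reads, k)
for kmer, loc, sup in zip(kmers(reads[0], k), locations, support):
    print(kmer, loc, sup)
```

In this toy case, the k-mers that overlap the substituted base have no genomic location, yet other reads still share them. A joint signal of this kind is what lets a profile-based approach separate a candidate mutation from a sequencing error.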
Compared to existing tools, CRAC provides state-of-the-art sensitivity and improved precision for all types of predictions, yielding high rates of true positive candidates (99.5% for splice junctions). When applied to four breast cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted recurrent fusion junctions that were overlooked in previous studies. Importantly, CRAC's predictive performance improves when it is supplied with longer reads (e.g. 200 nt), so it should fit the future needs of read analysis.
The article is published in Genome Biology 2013, 14:R30.
As input, CRAC takes FASTA or FASTQ files, either uncompressed or compressed with gzip. CRAC can output a SAM/BAM file (containing extra information on the biological predictions) as well as in-house formats (see the corresponding documentation for more information).
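For readers preparing input files, here is a minimal sketch of how a script might open a read file whether or not it is gzip-compressed, and tell FASTA from FASTQ by the first character of the first record. This is not part of CRAC; the helper names open_reads and read_format are hypothetical, and the check relies only on the standard gzip magic bytes and the usual ">" (FASTA) and "@" (FASTQ) header characters.

```python
import gzip


def open_reads(path):
    """Open a read file as text, decompressing it if it is gzip-compressed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_format(path):
    """Guess 'fasta' or 'fastq' from the first character of the file."""
    with open_reads(path) as fh:
        first = fh.read(1)
    if first == ">":
        return "fasta"
    if first == "@":
        return "fastq"
    raise ValueError(f"{path}: not recognised as FASTA or FASTQ")
```

Calling read_format on a hypothetical file such as reads.fastq.gz would then return "fastq" without the file having to be decompressed by hand first.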
Because CRAC builds an index of the read collection, it requires some additional memory. For example, for 42 million reads of length 2x100 bp on the human genome, CRAC needs 10 GB of memory; TopHat2 (15 GB) and STAR (30 GB) need even more.
Because CRAC predicts much more than classical mappers do, it generally needs more time than they do. However, it is faster than specialized mappers (such as TopHat or GSNAP).