What is it?

Summary

Whatever biological question it addresses, every RNA-seq analysis requires the computational prediction of small-scale mutations, indels, splice junctions, or fusion RNAs. This prediction is currently performed with complex pipelines that chain multiple tools for mapping, coverage computation, and prediction in distinct steps.

We propose a novel way of analyzing reads that integrates genomic locations and local coverage, and delivers all of the above-mentioned predictions in a single step. Our program, CRAC, uses a double k-mer profiling approach to detect candidate mutations, indels, and splice or fusion junctions in each individual read.

Compared to existing tools, CRAC provides state-of-the-art sensitivity and improved precision for all types of predictions, yielding high rates of true positive candidates (99.5% for splice junctions). When applied to four breast cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted recurrent fusion junctions that were overlooked in previous studies. Importantly, CRAC's predictive performance improves with longer reads (e.g. 200 nt), so it should fit the future needs of read analysis.
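
To make the double k-mer profiling idea mentioned in the summary more concrete, here is a toy sketch in Python. It assumes that the two profiles are, for each k-mer of a read, (1) its location in the genome and (2) its support, meaning the number of reads that share it. The function names, the toy genome, and the use of plain string search are illustrative assumptions; this is not CRAC's implementation, which relies on indexes rather than linear scans.

```python
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq, from left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def location_profile(read, genome, k):
    """For each k-mer of the read: its first position in the genome, or -1."""
    return [genome.find(km) for km in kmers(read, k)]


def support_profile(read, reads, k):
    """For each k-mer of the read: the number of reads that contain it."""
    counts = Counter()
    for r in reads:
        counts.update(set(kmers(r, k)))  # count each read once per k-mer
    return [counts[km] for km in kmers(read, k)]


# Toy data (hypothetical): a 20 nt "genome" and two 10 nt reads.
K = 4
genome = "ACGTTGCAAGCTTAGGCATC"
exact = genome[3:13]                      # read matching the genome exactly
mutated = exact[:5] + "C" + exact[6:]     # same read with one substitution

print(location_profile(exact, genome, K))    # [3, 4, 5, 6, 7, 8, 9]
print(location_profile(mutated, genome, K))  # [3, 4, -1, -1, -1, -1, 9]
print(support_profile(mutated, [mutated, mutated, exact], K))  # [3, 3, 2, 2, 2, 2, 3]
```

In this toy example the substitution breaks the location profile over exactly K consecutive k-mers, while the support profile stays above 1 over the same window because a second read carries the same variant. Reading the two profiles together is the kind of signal that can help separate a candidate biological event from an isolated sequencing error.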

Publication

CRAC is a joint work of N. Philippe, M. Salson, T. Commes and É. Rivals.

The article is published in Genome Biology 2013, 14:R30.

Input/Output

As input, CRAC takes FASTA or FASTQ files, either uncompressed or compressed with gzip. CRAC can output a SAM file (containing extra information on the biological predictions) as well as in-house formats.
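
For readers who want to inspect such input files before running an analysis, the following Python sketch reads sequences from a FASTA or FASTQ file, whether plain or gzip-compressed. It is an independent illustration, not CRAC's own reader: the function names are hypothetical, and it assumes FASTQ records span exactly four lines.

```python
import gzip


def open_maybe_gzip(path):
    """Open path as text, decompressing if it starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_sequences(path):
    """Yield (name, sequence) pairs from a FASTA or FASTQ file."""
    with open_maybe_gzip(path) as fh:
        lines = (line.rstrip() for line in fh)
        first = next(lines, None)
        if first is None:
            return  # empty file
        if first.startswith(">"):
            # FASTA: a sequence may span several lines.
            name, chunks = first[1:], []
            for line in lines:
                if line.startswith(">"):
                    yield name, "".join(chunks)
                    name, chunks = line[1:], []
                else:
                    chunks.append(line)
            yield name, "".join(chunks)
        elif first.startswith("@"):
            # FASTQ: assumed to be strict four-line records.
            header = first
            while header:
                seq = next(lines, "")
                next(lines, None)  # '+' separator line
                next(lines, None)  # quality line (ignored here)
                yield header[1:], seq
                header = next(lines, "")
        else:
            raise ValueError("not a FASTA or FASTQ file: " + path)
```

A call such as `for name, seq in read_sequences("reads.fastq.gz"): ...` (with a file name of your own) then iterates over the reads regardless of compression.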

Memory consumption

Because CRAC builds an index of the read collection, it may require more memory than comparable software. For example, for 42 million reads of length 75 on the human genome, CRAC needs 38 GB of memory.

Time consumption

Because CRAC makes many more predictions than classical mappers, it generally needs more time than they do. However, it is faster than specialized mappers such as TopHat or GSNAP.