Gk-arrays are provided as a simple-to-use C++ library dedicated to queries on large collections of sequences, such as those produced by high-throughput sequencers (e.g. HiSeq 2000 from Illumina, 454 from Roche).

Gk-arrays index the k-mers of the reads and answer various queries on the read collection (e.g. how many reads share this k-mer? where does this k-mer occur in the read collection?).

Gk-arrays are a space-efficient alternative to hash tables, with similar query times.
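
To make the two example queries concrete, here is a minimal sketch of the hash-table baseline that Gk-arrays are compared against. It is not the Gk-arrays API: the NaiveKmerIndex type and its member names are hypothetical and exist only for this illustration. Reads are assumed to be plain strings held in memory and are numbered from 0.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical baseline, NOT the Gk-arrays API: a hash table that maps each
// k-mer to the list of (read number, position) pairs where it occurs.
class NaiveKmerIndex {
public:
    using Occurrence = std::pair<std::size_t, std::size_t>;  // (read, position)

    NaiveKmerIndex(const std::vector<std::string>& reads, std::size_t k) {
        assert(k > 0);
        for (std::size_t r = 0; r < reads.size(); ++r) {
            // Reads shorter than k contain no k-mer: the loop body never runs.
            for (std::size_t p = 0; p + k <= reads[r].size(); ++p) {
                table_[reads[r].substr(p, k)].emplace_back(r, p);
            }
        }
    }

    // "Where does this k-mer occur in the read collection?"
    std::vector<Occurrence> occurrences(const std::string& kmer) const {
        auto it = table_.find(kmer);
        if (it == table_.end()) {
            return {};
        }
        return it->second;
    }

    // "How many reads share this k-mer?"
    // Occurrences are stored in increasing read order, so counting the
    // changes of read number counts the distinct reads.
    std::size_t readCount(const std::string& kmer) const {
        std::size_t count = 0;
        bool first = true;
        std::size_t previous = 0;
        for (const Occurrence& occ : occurrences(kmer)) {
            if (first || occ.first != previous) {
                ++count;
                previous = occ.first;
                first = false;
            }
        }
        return count;
    }

private:
    std::unordered_map<std::string, std::vector<Occurrence>> table_;
};
```

Storing every k-mer as a string key is what makes this baseline memory-hungry on large read collections; it only serves to show what the queries return.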


Gk-arrays is a work by Nicolas Philippe, Mikaël Salson, Thierry Lecroq, Martine Léonard, Thérèse Commes and Éric Rivals. It has been published in the BMC Bioinformatics journal. If you use this work, please don't forget to cite this paper.


Small update
  • Add a progressBar to measure time required for the different steps of the Gk-arrays construction (by A. Mancheron)
Bug corrections
  • When processing very large files, Gk-Arrays could encounter some problems and not finish. The problem is now solved.
Small update (bug corrections)
  • Read numbers started at 1 instead of 0
  • Small issues in Makefile from the tests/ directory
Gk-arrays have dramatically evolved!
  • Multi-threaded construction (preliminary version due to A. Limasset, final version by M. Salson)
  • Improved construction algorithm which no longer relies on a suffix array construction algorithm (preliminary version due to A. Limasset, final version by N. Philippe)
  • Manage stranded (or non-stranded) libraries. In a non-stranded library, ATTG and its reverse complement CAAT are considered equivalent (preliminary version due to A. Limasset, final version by N. Philippe and M. Salson).
  • Manage paired-end reads (J. Audoux and N. Philippe)
  • GkArrays can be built using bit vectors instead of integer arrays. This guarantees lower memory usage, at the expense of longer construction and query times (preliminary version from F. Recourt, final version by A. Mancheron)
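
The equivalence used for non-stranded libraries can be illustrated with a short sketch. The function names below are hypothetical and are not taken from the library. Choosing the lexicographically smaller of a k-mer and its reverse complement as the representative is a common convention assumed here, not necessarily the one Gk-arrays use internally.

```cpp
#include <algorithm>
#include <string>

// Complement of a single nucleotide. Any other character (e.g. N) is
// returned unchanged; this is a simplification made for this sketch.
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return base;
    }
}

// Reverse the k-mer, then complement each base.
std::string reverseComplement(const std::string& kmer) {
    std::string result(kmer.rbegin(), kmer.rend());
    std::transform(result.begin(), result.end(), result.begin(), complementBase);
    return result;
}

// Assumed convention: both strands map to the lexicographically smaller
// of the k-mer and its reverse complement.
std::string canonicalKmer(const std::string& kmer) {
    const std::string rc = reverseComplement(kmer);
    return std::min(kmer, rc);
}
```

With this convention, ATTG and CAAT map to the same representative, so an index keyed on canonicalKmer treats them as one k-mer.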


Gk-arrays source code is distributed under the GPL-compliant CeCILL-C license.

Source code
Version 2.1.0: libGkArrays-2.1.0.tar.gz
Debian/Ubuntu package
Version 2.1.0 (64 bit architecture): libGkArrays_2.1.0_amd64.deb
Version 2.1.0 (32 bit architecture): libGkArrays_2.1.0_i386.deb
Previous versions
Previous versions of the library can be found on the forge.

A very simple test file can be downloaded from here. Once the library is installed, you can compile the test file using e.g. g++ -Wall -pedantic -O3 testGkArrays.cpp -o testGkArrays -lGkArrays. Another test file (measuring the query time) is also included in the source code under the src directory.


The installation creates a test executable (called buildTables) and a library that can be used in any of your programs.

Note: the library usage has been simplified since version 1.0.0. If necessary, you can see the details for using previous versions.

Installation from the source code

  1. Unpack the archive
  2. Enter the directory libGkArrays-version-number
  3. Type ./configure
  4. If everything went fine, run make
  5. To install the library on your machine, type make install as an administrator
  6. Afterwards, you may want to run ldconfig as an administrator.

You can pass parameters to the configure script. For instance, you can choose to build a static version (quicker) of the library rather than a shared version. Typing ./configure --help lists the available options.

Installation from the deb package

You just need to install the package using your distribution's package management program or by typing dpkg -i package-name.

Using Gk-arrays in your code

Inside the archive, under the doc directory, you will find documentation on how to use Gk-arrays in your code, with a simple example. Full documentation of the library is available online or as a downloadable PDF.

Contacting us

Feel free to contact us (crac-gkarrays@lists.gforge.inria.fr) if you find a bug in the library or if you encounter any problem.