PETER/PeARL: An algorithm and index structure for string similarity searches and similarity joins on sequence data

PETER is a prefix-tree based indexing algorithm supporting approximate search and approximate joins. It combines an efficient implementation of compressed prefix trees with advanced pruning techniques (length filtering, frequence filtering, q-gram filtering). PETER is written in C++ and can be used as a:

  • UNIX command line tool
  • user-defined index in Oracle DB
  • shared library for individual programs

PETER features Hamming and Edit distance as similarity measures. Our tool has been evaluated on various collections of Expressed Sequence Tags (ESTs) from dbEST. PETER is faster by orders of magnitude compared to agrep, nrgrep for search queries and compared to user-defined functions for similarity joins inside Oracle DB. For a detailed evaluation, see: PETER was developed for indexing Expressed Sequence Tags and currently only deal with strings consisting of the letters A,C,G,T. PeARL is a parallel in-memory version of PETER for similarity-based string search and join on arbitrary alphabets, which is also capable of computing exact read alignments of millions of sequence reads in parallel.

Download and instructions for use

Publications

  • Rheinländer, A., Knobloch, M., Hochmuth, N. and Leser, U.: Prefix Tree Indexing for Similarity Search and Similarity Join on Genomic Data. Int. Conf. on Statistical and Scientific Databases, Heideberg, Germany, 2010.
  • Rheinländer, A. and Leser, U.: Similarity Search and Similarity Join in Oracle DB. In: Proc. of DOAG'10, Nuremberg, Germany 2010.
  • Rheinländer, A. and Leser, U.: Scalable Sequence Similarity Search and Join in Main Memory on Multi-Cores 2nd International Workshop on High Performance Bioinformatics and Biomedicine (HiBB), Bordeaux, France, 2011.