基于块内Offset的倒排索引压缩算法研究

INVERTED INDEX COMPRESSION ALGORITHM BASED ON INTRA-BLOCK OFFSET

  • 摘要: 倒排索引是目前应用最广泛的全文检索技术之一。压缩倒排索引不仅可以节省磁盘空间,也可以减少内存访问次数。结合压缩offset和对整数序列分区的思想,提出基于块内offset的GR-OP压缩算法、Simple-OP压缩算法和PFOR-OP压缩算法。这些算法在动态分块的基础上,计算offset新序列后分别使用Golomb-Rice、Simple9、PFOR算法压缩。实验表明压缩块内offset比压缩原始docid序列和d-gap序列具有更好的压缩性能。基于块内offset的新算法在BPI和随机查询时间上有所提升,因为算法复杂度的增加,所以压缩时间较长。该算法对于实际应用十分有意义。

     

    Abstract: The compressed inverted index is one of the most widely-used techniques for full-text retrieval, which can not only save disk space but also reduce the number of memory accesses. Combining the ideas of compressing offset and chunking integer sequences, we propose the GR-OP compression algorithm, Simple-OP compression algorithm and PFOR-OP compression algorithm based on the offset within a block. These algorithms are based on dynamic chunking and compressed using Golomb-Rice, Simple9 and PFOR algorithm respectively. Experiments show that the new algorithms based on the intra-block offset have better compression performance than the original docid sequence and d-gap sequence. The new algorithms improve the BPI and random query time, while the compression time is longer due to the increased complexity of the algorithms. The new algorithms are meaningful for practical applications.

     

/

返回文章
返回