Within-document term-based index pruning techniques with statistical hypothesis testing

Date
2010
Publisher
University of Delaware
Abstract
Static index pruning methods have been proposed to reduce the index size of information retrieval systems while retaining search effectiveness. Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method for pruning inverted lists derived from the formulation of unigram language models for retrieval. This method is based on the statistical significance of term frequency ratios. Using the two-sample two-proportion (2P2N) test, the frequency of occurrence of a word within a given document is statistically compared to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can significantly decrease index size and query time with a smaller loss in retrieval effectiveness than similar heuristic methods. We also implemented a static index pruning algorithm that uses the retrievability of documents to decide whether to remove or keep them in the index, and combined it with the statistical hypothesis testing method. Retrievability is calculated from the document entropy, which is in turn calculated from the entropies of the individual terms in the document. Experimental results show that this hybrid algorithm improves the performance of the retrieval system. We also give a formal statistical justification for such methods.
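The pruning decision described in the abstract can be illustrated with a standard pooled two-proportion z-test. The following Python sketch is only an illustration under stated assumptions, not the thesis implementation: the function names, the one-sided form of the test, and the threshold of 1.645 (the one-sided 5% critical value of the standard normal distribution) are all hypothetical choices made for this example.

```python
import math


def two_proportion_z(tf_doc, doc_len, tf_coll, coll_len):
    """Pooled two-proportion z statistic (hypothetical helper).

    Compares the rate of a term in one document (tf_doc / doc_len)
    with its rate in the collection (tf_coll / coll_len).
    """
    p_doc = tf_doc / doc_len
    p_coll = tf_coll / coll_len
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (tf_doc + tf_coll) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / doc_len + 1.0 / coll_len))
    if se == 0.0:
        # Degenerate case (term absent everywhere or present in every position).
        return 0.0
    return (p_doc - p_coll) / se


def keep_posting(tf_doc, doc_len, tf_coll, coll_len, z_threshold=1.645):
    """Keep the posting only if the term is significantly more frequent
    in the document than in the collection.

    Assumption: a one-sided test; z_threshold is illustrative and would
    in practice be tuned to control how aggressively the index is pruned.
    """
    return two_proportion_z(tf_doc, doc_len, tf_coll, coll_len) > z_threshold


if __name__ == "__main__":
    # Term far more frequent in the document than in the collection.
    print(keep_posting(10, 100, 1000, 1_000_000))
    # Term occurring at the same rate as in the collection.
    print(keep_posting(1, 1000, 1000, 1_000_000))
```

In this sketch, raising `z_threshold` prunes more postings and lowering it prunes fewer, so the threshold acts as the knob that trades index size against retrieval effectiveness.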