Within-document term-based index pruning techniques with statistical hypothesis testing

Author(s): Thota, Sree
Date Accessioned: 2011-07-08T11:54:42Z
Date Available: 2011-07-08T11:54:42Z
Publication Date: 2010
Abstract: Static index pruning methods have been proposed to reduce the index size of information retrieval systems while retaining the effectiveness of the search. Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method for pruning inverted lists derived from the formulation of unigram language models for retrieval. This method is based on the statistical significance of term frequency ratios. Using the two-sample two-proportion (2P2N) test, the frequency of occurrence of a word within a given document is statistically compared to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can be used to significantly decrease the size of the index and the querying time with less compromise to retrieval effectiveness than similar heuristic methods. We also implemented a static index pruning algorithm that combines the statistical hypothesis testing method with the retrievability of documents to decide whether to remove or keep them in the index. Retrievability is calculated from the document entropy, which is in turn calculated from the entropies of the individual terms in the document. Experimental results show that this hybrid algorithm improves the performance of the retrieval system. Furthermore, a formal statistical justification for such methods is also given.
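The pruning decision described in the abstract can be illustrated with a standard pooled two-proportion z-test. The following is a minimal sketch, not the thesis implementation: the function names, the one-sided form of the test, and the threshold of 1.645 (the 5% one-sided critical value of the standard normal distribution) are assumptions chosen for illustration. It also treats the document and the collection as independent samples, which is a simplification.

```python
import math


def two_proportion_z(tf_doc, doc_len, tf_coll, coll_len):
    """Pooled two-proportion z statistic for H0: p_doc == p_coll.

    tf_doc / doc_len   : occurrences of the term in the document / document length
    tf_coll / coll_len : occurrences of the term in the collection / collection length
    (All names are hypothetical; the thesis may define these quantities differently.)
    """
    p_doc = tf_doc / doc_len
    p_coll = tf_coll / coll_len
    pooled = (tf_doc + tf_coll) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / doc_len + 1.0 / coll_len))
    if se == 0.0:
        # Degenerate case (term absent everywhere or present at every position).
        return 0.0
    return (p_doc - p_coll) / se


def should_prune(tf_doc, doc_len, tf_coll, coll_len, z_threshold=1.645):
    """Assumed rule: prune the posting unless the term is significantly
    more frequent in the document than in the collection."""
    return two_proportion_z(tf_doc, doc_len, tf_coll, coll_len) < z_threshold


if __name__ == "__main__":
    # Term occurs at the same rate in the document and the collection.
    print(should_prune(1, 1000, 1000, 1_000_000))
    # Term is far more frequent in the document than in the collection.
    print(should_prune(10, 100, 1000, 1_000_000))
```

Raising `z_threshold` in this sketch prunes more postings, which shrinks the index at the cost of dropping terms whose within-document frequency is only mildly above the collection rate.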
Advisor: Carterette, Ben
Degree: M.S.
Department: University of Delaware, Department of Computer Science
URL: http://udspace.udel.edu/handle/19716/9797
Publisher: University of Delaware
dc.subject.lcsh: Statistical hypothesis testing
dc.subject.lcsh: Computer network resources -- Abstracting and indexing
dc.subject.lcsh: Information retrieval
Title: Within-document term-based index pruning techniques with statistical hypothesis testing
Type: Thesis
Files
Original bundle
Name: Sree Lekha_Thota_thesis.pdf
Size: 1.29 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed upon to submission