Posted on 1996-01-01, 00:00. Authored by Christos Faloutsos, Raphael Chan.
High capacity disks, especially optical ones, are commercially
available. These disks are ideal for archiving large
text data bases. In this work, we examine efficient
searching techniques for such applications. We propose
a unifying framework, which reveals the similarities
between signature files and an inverted file using a hash
table. Then, we design methods that combine the ease of
insertion of the signature files with the fast retrieval of
the inverted files. We develop analytical models for their
performance and verify them through experimentation on
a 2.8 Mb data base. The agreement between theory and
experimentation is very good. The results show that the
proposed methods achieve fast retrieval and require a
modest 10%-30% space overhead (as opposed to the 50%-300%
overhead [13] of inverted files). They also do
not require re-writing; thus, they can handle insertions
easily, permit searches during an insertion, and
can be used with write-once optical disks. Using our
verified model, the performance predictions for the proposed
methods on large data bases (e.g., 250 Mb) are
very promising.
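
The similarity between signature files and a hashed inverted file mentioned above can be illustrated with a minimal sketch. It is not the authors' design: the function names, the signature width, the number of bits per word, and the table size are all assumptions chosen for illustration. Both structures hash each word; the signature file superimposes the hashed bits into one bit string per document, while the hashed inverted file keeps a list of document numbers per hash slot. Both may return false matches but never miss a true one.

```python
import hashlib

# All constants and names below are illustrative assumptions,
# not values or identifiers taken from the paper.
SIG_BITS = 64       # width of each document signature, in bits
BITS_PER_WORD = 3   # bit positions set by each word
TABLE_SIZE = 64     # number of slots in the hashed inverted file


def word_hash(word, seed, modulus):
    """Deterministic hash of a word into the range [0, modulus)."""
    digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def word_signature(word):
    """Bit pattern of one word: BITS_PER_WORD hashed positions set to 1."""
    sig = 0
    for seed in range(BITS_PER_WORD):
        sig |= 1 << word_hash(word, seed, SIG_BITS)
    return sig


def build_signature_file(docs):
    """One superimposed signature per document; new documents are appended."""
    sigs = []
    for text in docs:
        sig = 0
        for w in text.lower().split():
            sig |= word_signature(w)
        sigs.append(sig)
    return sigs


def search_signature_file(sigs, word):
    """Scan all signatures; keep documents whose bits cover the query bits."""
    q = word_signature(word)
    return [i for i, s in enumerate(sigs) if s & q == q]


def build_hashed_inverted_file(docs):
    """One postings list per hash slot, instead of one per distinct word."""
    table = [[] for _ in range(TABLE_SIZE)]
    for i, text in enumerate(docs):
        slots = {word_hash(w, 0, TABLE_SIZE) for w in text.lower().split()}
        for slot in slots:
            table[slot].append(i)
    return table


def search_hashed_inverted_file(table, word):
    """Read a single slot; colliding words may add false matches."""
    return table[word_hash(word, 0, TABLE_SIZE)]
```

In this sketch the signature file is cheap to extend (one new signature is appended) but must be scanned in full, whereas the hashed table is read at a single slot but its postings lists grow on insertion. This is the trade-off between ease of insertion and fast retrieval that the abstract describes.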