The Gene Ontology (GO) is extensively used to analyze
all types of high-throughput experiments.
However, researchers still face several challenges
when using GO and other functional annotation
databases. One problem is the large number of
hypotheses tested in each
study. In addition, categories often overlap with
both direct parents/descendants and other distant
categories in the hierarchical structure. This makes
it hard to determine whether the significant categories
represent distinct functional outcomes or
a redundant view of the same biological processes.
To overcome these problems, we developed
a generative probabilistic model that identifies a
small subset of categories that, together, explain
the selected gene set. Our model accommodates
noise and errors in both the selected gene set and GO.
Using controlled GO data, our method correctly
recovered most of the selected categories, leading
to dramatic improvements over current methods for
GO analysis. When applied to microarray expression
data and ChIP-chip data from yeast and human, our
method correctly identified both general
and specific enriched categories that were overlooked
by other methods.
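
To illustrate how a small subset of categories can jointly explain a gene set, here is a minimal sketch under stated assumptions. It is not the model described above: the noise parameters `p` and `q`, the per-category penalty `alpha`, the function names, and the toy categories are all hypothetical, and the exhaustive search is only practical for a handful of categories.

```python
import math
from itertools import combinations


def penalized_log_likelihood(active, categories, selected, all_genes,
                             p=0.05, q=0.1, alpha=3.0):
    """Score a candidate set of active categories against a selected gene set.

    Assumed toy model (hypothetical, for illustration only):
      - a gene annotated to at least one active category is selected
        with probability 1 - q (q models missed genes);
      - any other gene is selected with probability p (p models noise);
      - every active category costs a fixed penalty alpha.
    """
    covered = set()
    for name in active:
        covered |= categories[name]

    score = 0.0
    for gene in all_genes:
        if gene in covered:
            score += math.log(1 - q) if gene in selected else math.log(q)
        else:
            score += math.log(p) if gene in selected else math.log(1 - p)
    return score - alpha * len(active)


def best_subset(categories, selected, all_genes, max_size=3):
    """Exhaustively search subsets of up to max_size categories."""
    best_combo = ()
    best_score = penalized_log_likelihood((), categories, selected, all_genes)
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(categories), k):
            score = penalized_log_likelihood(combo, categories,
                                             selected, all_genes)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


# Toy data: "child" is nested inside "parent"; "other" is unrelated.
all_genes = [f"g{i}" for i in range(20)]
categories = {
    "parent": set(all_genes[0:10]),
    "child": set(all_genes[0:5]),
    "other": set(all_genes[10:14]),
}
# Selected set: every "child" gene plus three of the four "other" genes.
selected = set(all_genes[0:5]) | set(all_genes[10:13])

best_combo, best_score = best_subset(categories, selected, all_genes)
print(best_combo)  # ('child', 'other')
```

In this toy setting the overlapping "parent" category is left out: it covers the same selected genes as "child" but also five unselected genes, so including it lowers the score. A per-category enrichment test would score each category in isolation and could not express that redundancy.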