Posted on 2003-09-01, 00:00. Authored by Anupam Gupta, John Lafferty, Han Liu, Larry Wasserman, Min Xiu
We study graph estimation and density estimation in high dimensions, using a family of density
estimators based on forest-structured undirected graphical models. For density estimation, we do
not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of
the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest
on held-out data. We prove an oracle inequality on the excess risk of the resulting estimator relative
to the risk of the best forest. For graph estimation, we consider the problem of estimating forests
with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted
tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree
size as a complexity parameter, we then select a forest using data splitting, and prove bounds
on excess risk and structure selection consistency of the procedure. Experiments with simulated
data and microarray data indicate that the methods are a practical alternative to sparse Gaussian
graphical models.
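
To illustrate the forest-construction step mentioned in the abstract, the following is a minimal sketch of Kruskal's algorithm adapted to build a maximum weight forest. It is not the authors' implementation. The function name `max_weight_forest`, the `max_edges` cap, and the rule of skipping non-positive weights are assumptions made for illustration; the edge weights stand in for whatever pairwise score is estimated from the data (for example, a mutual information estimate computed from the kernel density estimates of the marginals).

```python
def max_weight_forest(num_nodes, weighted_edges, max_edges=None):
    """Greedy (Kruskal-style) maximum weight forest.

    weighted_edges: iterable of (weight, u, v) with nodes in 0..num_nodes-1.
    max_edges: optional cap on the number of edges (a hypothetical
               complexity parameter, e.g. chosen on held-out data).
    Returns a list of (u, v, weight) in the order the edges were added.
    """
    parent = list(range(num_nodes))  # union-find parent pointers

    def find(x):
        # Find the component root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    # Consider edges from heaviest to lightest.
    for w, u, v in sorted(weighted_edges, reverse=True):
        if max_edges is not None and len(forest) >= max_edges:
            break
        if w <= 0:
            # Assumption: a non-positive weight cannot improve the forest.
            break
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding the edge creates no cycle
            parent[root_u] = root_v
            forest.append((u, v, w))
    return forest


if __name__ == "__main__":
    # Toy weights, invented for illustration only.
    edges = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 0, 2), (0.1, 2, 3)]
    print(max_weight_forest(4, edges))
    print(max_weight_forest(4, edges, max_edges=2))
```

Because the edges are added in decreasing order of weight, truncating the run after k edges gives a nested sequence of forests, so a held-out criterion could pick k by scanning that sequence once. This sketch does not enforce the restricted tree size discussed for graph estimation.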