Extension of Cross Validation with Confidence to Determining Number of Communities in Stochastic Block Models

2020-05-21T21:05:51Z (GMT) by Jining Qin
The stochastic block model (SBM) and its variants constitute an important family of models for network data. There is a rich literature on methods for estimating
the block labels and model parameters of stochastic block models, as well as on the properties of such methods. Most of these methods require the number of communities K as an input, which makes choosing K an important problem. Several approaches have been proposed for this problem, including spectral methods, likelihood-based methods, information criteria, Bayesian methods, and cross-validation. Cross-validation is a natural option because it is a widely used, generic method for evaluating statistical methods and is easy to adapt to various scenarios. However, cross-validation is known to be inconsistent and prone to over-fitting unless impractical split ratios are used. Cross-validation with confidence (CVC) has recently been proposed as a variation of cross-validation with better theoretical guarantees in conventional settings. In this thesis we studied the properties of cross-validation with confidence for stochastic block models. Practically, we implemented different variations of this
method by changing the train-test split scheme, the approach to obtaining the sampling distribution of the test statistic, and the loss function used in cross-validation. We compared the performance of our method across these variations and against similar established methods in the literature, and checked its robustness under model misspecification and different data-generating processes. We also tested our algorithm
by applying it to two widely used real-world data sets. In addition, through theoretical studies, we show that under certain assumptions, our method is guaranteed
to eliminate under-fitting candidate models. We further show that CVC, unlike standard cross-validation, can consistently pick the optimal K, by showing that the validated loss of the true model is not much worse than that of a slightly over-fitting model. Therefore, the candidate set output by a CVC procedure will contain the optimal model with guaranteed probability.