Extension of Cross Validation with Confidence to Determining Number of Communities in Stochastic Block Models

Qin, Jining

doi:10.1184/R1/12327896.v1

jiningq_phd_statistics_2019.pdf (1.9 MB)

thesis

posted on 2020-05-21, 21:05, authored by Jining Qin

The stochastic block model (SBM) and its variants constitute an important family of methods for modeling network data. There is a rich literature on methods for estimating the block labels and model parameters of stochastic block models, as well as studies of the properties of such methods. Most of these methods require the number of communities K as an input, which makes choosing K an important problem. Several approaches have been proposed for this problem, including spectral methods, likelihood-based methods, information criteria, Bayesian methods, and cross-validation. Cross-validation is a natural option because it is a widely used generic method for evaluating statistical methods and is easy to adapt to various scenarios. However, cross-validation is known to be inconsistent and prone to over-fitting unless impractical split ratios are used. Cross-validation with confidence (CVC) has recently been proposed as a variation of cross-validation with better theoretical guarantees in conventional settings. In this thesis we study the properties of cross-validation with confidence for stochastic block models. Practically, we implement different variations of this method by changing the train-test split scheme, the approach for obtaining the sampling distribution of the test statistic, and the loss function used in cross-validation. We compare the performance of these variations with one another and with similar established methods in the literature. We check the robustness of the method under misspecification and under different data generating processes. We also test our algorithm by applying it to two widely used real-world data sets. In addition, through theoretical studies, we show that under certain assumptions our method is guaranteed to eliminate under-fitting candidate models. We further show that CVC, unlike standard cross-validation, can consistently pick the optimal K, by showing that the validated loss of the true model is not much worse than that of a slightly over-fitting model. Therefore, the candidate set output by a CVC procedure will contain the optimal model with guaranteed probability.
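To make the idea of a CVC candidate set concrete, the following is a minimal sketch and not the procedure implemented in the thesis. It assumes that per-entry validated losses have already been computed for each candidate K on the same held-out entries, and it swaps in a Bonferroni-corrected normal approximation on paired loss differences in place of the sampling-distribution approaches studied in the thesis. The function name `cvc_candidate_set`, its parameters, and the toy losses are all hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def cvc_candidate_set(losses, alpha=0.05):
    """Return the candidate K values not rejected as "worse than a competitor".

    losses: dict mapping candidate K -> list of per-entry validated losses,
            all evaluated on the same held-out entries (assumed precomputed).
    alpha:  nominal level; a Bonferroni correction over competitors is an
            assumption of this sketch, not the thesis's calibration.
    """
    ks = sorted(losses)
    if len(ks) < 2:
        return ks
    # One-sided critical value, adjusted for the number of competitors.
    z_crit = NormalDist().inv_cdf(1 - alpha / (len(ks) - 1))
    kept = []
    for k in ks:
        worst_z = -math.inf
        for j in ks:
            if j == k:
                continue
            # Paired differences: positive values mean K=k lost more than K=j.
            diffs = [a - b for a, b in zip(losses[k], losses[j])]
            center, spread = mean(diffs), stdev(diffs)
            if spread == 0:
                z = 0.0 if center == 0 else math.copysign(math.inf, center)
            else:
                z = center / (spread / math.sqrt(len(diffs)))
            worst_z = max(worst_z, z)
        # Keep K=k unless some competitor is significantly better.
        if worst_z <= z_crit:
            kept.append(k)
    return kept


# Toy, hand-made losses (illustrative only): K=1 under-fits, while K=2 and
# K=3 have nearly identical validated loss.
n = 100
base = [0.20 + 0.01 * ((i % 5) - 2) for i in range(n)]
toy_losses = {
    1: [b + 0.10 + 0.02 * ((i % 7) - 3) for i, b in enumerate(base)],
    2: base,
    3: [b + 0.02 * ((i % 3) - 1) for i, b in enumerate(base)],
}
kept = cvc_candidate_set(toy_losses)
print(kept)
```

In this sketch the under-fitting candidate is rejected because its paired loss differences are clearly positive, while the two near-equivalent candidates both stay in the set; the output is a set of plausible values of K rather than a single minimizer.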

History

Date

2019-12-18

Degree Type

Dissertation

Department

Statistics

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Jing Lei, Alessandro Rinaldo, Larry Wasserman, Kehui Chen

Keywords

Social network analysis, stochastic block models, cross-validation

Licence

CC BY-NC 4.0
