Statistical Data Stewardship in the 21st Century: An Academic Perspective
journal contribution
posted on 2002-01-01, 00:00 authored by George Duncan
This paper presents an academic perspective on a broad spectrum of ideas and best practices that statistical data collectors can use to ensure proper stewardship of the personal information they collect, process and disseminate. Academic researchers in confidentiality address statistical data stewardship both because of its inherent importance to society and because the mathematical and statistical problems that arise challenge their creativity and capability. To provide a factual basis for policy decisions, an information organization (IO) engages in a two-stage process: (1) It gathers sensitive personal and proprietary data of value for analysis from respondents who depend on the IO for confidentiality protection. (2) From these data, it develops and disseminates data products that are both useful and carry a low risk of confidentiality disclosure. The IO is a broker between the respondent, whose primary concern is confidentiality protection, and the data user, whose primary concern is the utility of the data. This inherent tension is difficult to resolve because deidentification of the data is generally inadequate to protect their confidentiality against attack by a data snooper. Effective stewardship of statistical data therefore requires restricted access or restricted data procedures. In developing restricted data, IOs apply disclosure limitation techniques to the original data. Ideally, the resulting restricted data have both high data utility U to users (analytically valid data) and low disclosure risk R (safe data). This paper explores the promise of the R-U confidentiality map, a chart that traces the impact on R and U of changes in the parameters of a disclosure limitation procedure. Theory for the R-U confidentiality map is developed for additive noise. By implementing the procedure through simulation, an IO can develop an empirical R-U confidentiality map.
Disclosure limitation for tabular data is discussed and a new method, called cyclic perturbation, is introduced. The challenges posed by on-line access are explored.
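As a rough illustration of how an empirical R-U confidentiality map for additive noise might be traced by simulation, the following Python sketch varies the noise standard deviation and records one (R, U) point per setting. The abstract does not define R or U, so the measures used here are assumptions made only for this example: R is taken as the reciprocal of a snooper's mean squared error when guessing a record's true value from its masked value, and U as the reciprocal of the mean squared error of the masked sample mean. The function name, parameters and synthetic data are likewise hypothetical.

```python
import random


def ru_map_additive_noise(data, noise_sds, n_reps=200, seed=0):
    """Trace (noise_sd, R, U) points for additive Gaussian noise.

    Assumed measures (not taken from the paper):
      R = 1 / mean squared error of a snooper who guesses each true
          value to be the released masked value
      U = 1 / mean squared error of the masked sample mean as an
          estimate of the original sample mean
    Every value in noise_sds must be positive.
    """
    rng = random.Random(seed)
    n = len(data)
    true_mean = sum(data) / n
    points = []
    for sd in noise_sds:
        snooper_sq_err = 0.0
        mean_sq_err = 0.0
        for _ in range(n_reps):
            # Disclosure limitation step: add independent noise to each record.
            masked = [x + rng.gauss(0.0, sd) for x in data]
            # Snooper's per-record squared error, averaged over records.
            snooper_sq_err += sum((m - x) ** 2 for m, x in zip(masked, data)) / n
            # Data user's squared error when estimating the mean.
            mean_sq_err += (sum(masked) / n - true_mean) ** 2
        risk = n_reps / snooper_sq_err
        utility = n_reps / mean_sq_err
        points.append((sd, risk, utility))
    return points


# Synthetic stand-in for the confidential data (hypothetical values).
data_rng = random.Random(1)
data = [data_rng.gauss(50.0, 10.0) for _ in range(50)]

points = ru_map_additive_noise(data, noise_sds=[0.5, 1.0, 2.0, 4.0])
for sd, risk, utility in points:
    print(f"noise sd={sd:4.1f}  R={risk:8.4f}  U={utility:10.4f}")
```

Plotting U against R for these points gives the map: under these assumed measures, increasing the noise lowers both disclosure risk and data utility, and the IO would pick the noise level whose risk falls below its tolerated threshold while retaining as much utility as possible.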