The National Science Foundation recently awarded Jerry Reiter (Department of Statistical Science), John de Figueiredo (Duke Law School), and Ashwin Machanavajjhala (Department of Computer Science) nearly $1.5 million for their research project, “An Integrated System for Public/Private Access to Large-scale, Confidential Social Science Data.” The goal: to change the way researchers access and use data.
Dealing with data confidentiality
When dealing with highly sensitive data, such as information on health, income, or even a history of infidelity (or the lack of one), researchers must be extremely cautious about how much data they release with their findings. The ability to replicate findings is one of the hallmarks of a sound study, yet it is hardly ethical to release the names of individuals who have been involved in affairs. The problem: how much information is too much information?
Reiter and his fellow researchers believe they have the solution: a three-part system of data delivery, which not only protects the identities of participants, but also offers accessible and reliable mock data.
So, how does it work?
The first component deals with generating highly redacted (altered) yet accurate data models for public use. It accomplishes this through a synthetic modeling machine, which incorporates learning and flexible techniques in novel statistical models to capture many of the relationships present in the original data.
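To make the idea of synthetic data concrete, here is a minimal sketch in Python. It is not the team's modeling machine, which the article describes only in general terms. It assumes a toy "confidential" data set of ages and incomes, fits a simple linear model to it, and then samples brand-new records from the fitted model. The variable names, the data, and the choice of a linear model are all illustrative assumptions.

```python
import random
import statistics

def fit_linear(xs, ys):
    """Least-squares slope, intercept, and residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, statistics.pstdev(residuals)

def synthesize(xs, ys, n, rng):
    """Draw n synthetic (x, y) records from models fitted to the real data."""
    slope, intercept, sigma = fit_linear(xs, ys)
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    synthetic = []
    for _ in range(n):
        x = rng.gauss(mx, sx)  # new x from a normal fitted to the real xs
        y = intercept + slope * x + rng.gauss(0, sigma)  # y from the fitted relationship
        synthetic.append((x, y))
    return synthetic

rng = random.Random(0)
# Hypothetical "confidential" records: age and income (in thousands).
ages = [rng.uniform(25, 65) for _ in range(500)]
incomes = [20 + 1.5 * a + rng.gauss(0, 10) for a in ages]

# No synthetic record is copied from a real one; each is sampled from the fitted model.
synthetic = synthesize(ages, incomes, 500, rng)
syn_slope, _, _ = fit_linear([x for x, _ in synthetic], [y for _, y in synthetic])
```

In this toy setting the synthetic records preserve the age-income relationship (the slope estimated from them is close to the one in the real data) even though none of them corresponds to a real person. A production system would need far richer models to capture many relationships at once.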
The second component, an original construct by the team, reduces the risk of drawing bad conclusions from synthetic data. Even with well-developed synthetic data, you can never truly know whether your particular analysis is “well preserved” (read: accurate and precise) relative to the original data set. To solve this, Reiter’s team created a verification server, which reports whether an analysis run on the synthetic data is a high- or low-quality result, that is, whether it is backed by the actual data. Through this method, researchers, students, and even the general public can work with data that maintains confidentiality and whose validity can be easily checked. Of course, there are safeguards so that individuals cannot abuse the system to uncover confidential specifics of the original data, such as the addresses of infidelity participants, while the released data remain highly valid.
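The verification idea can be sketched in a few lines of Python as well. The article does not say how the team's server measures quality or how it guards against abuse, so everything here is an assumption for illustration: a hypothetical `VerificationServer` class that holds the confidential data privately, compares a regression slope estimated from synthetic data with the slope in the real data, returns only a coarse "high" or "low" label, and caps the number of queries per user.

```python
import random
import statistics

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

class VerificationServer:
    """Hypothetical server: holds confidential data, answers only 'high' or 'low'."""

    def __init__(self, xs, ys, tolerance=0.1, max_queries=3):
        self._true_slope = slope(xs, ys)  # never returned to the user
        self._tolerance = tolerance       # allowed relative gap (assumed value)
        self._remaining = max_queries     # illustrative cap to limit probing

    def verify(self, synthetic_slope):
        if self._remaining <= 0:
            raise PermissionError("query budget exhausted")
        self._remaining -= 1
        # Assumes the true slope is nonzero, as it is in this toy data.
        gap = abs(synthetic_slope - self._true_slope) / abs(self._true_slope)
        return "high" if gap <= self._tolerance else "low"

rng = random.Random(1)
# Hypothetical confidential records: age and income (in thousands).
xs = [rng.uniform(25, 65) for _ in range(500)]
ys = [20 + 1.5 * x + rng.gauss(0, 10) for x in xs]

server = VerificationServer(xs, ys)
good = server.verify(1.5)  # an estimate close to the real relationship
bad = server.verify(3.0)   # an estimate far from it
```

Returning a coarse label instead of the true value, and limiting how many questions a user may ask, are two simple ways to keep a verification service from leaking the confidential data it is checking against. They stand in here for whatever safeguards the actual system uses.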
Then comes the third and final component: remote access to the system for trusted and preferred users. To work with highly sensitive data, researchers often must travel to physical centers, at a cost in time and resources. The third component allows all analysis to be done on a secure server, so the data stay protected without access being restricted by location.
What does this mean for the future of research?
Researchers, students, and the general public would have access to data that is typically reserved for individuals who have gone through a lengthy – and costly – screening process. As Reiter points out: “more people that can get to data, the better.” With each new set of eyes, there is a new possibility that people can gather further understanding from a set of data. Not surprisingly, there is already interest from professionals in the university setting.
Reiter notes that such a large and unique system could not have worked without individuals from different fields: “It’s an example of interdisciplinary Duke; the project needed all these different perspectives coming together to succeed.”