Approaching Data like a Data Scientist
Graduate student Gary Koplik and rising junior Matt Tribby didn’t know each other before this summer, but for the past seven weeks they’ve sat at a table on the third floor of Gross Hall working eight-hour days together.
The two form a team working with Duke University Health System (DUHS) data as part of Data+, an immersive summer program supported by the Social Science Research Institute (SSRI) in which students of all levels work with big data. Their work quantifying rare diseases in the health system is the first of its kind.
You would think that rare diseases, by their very nature, don’t account for a significant number of patients. By definition, a rare disease affects fewer than 200,000 Americans at any given time. But there are currently more than 6,000 rare diseases, so taken together they may be a bigger burden on the health system than expected. Koplik and Tribby have spent the last few weeks determining whether this is the case.
Five Years of Data
Led by School of Nursing professor and health informaticist Rachel Richesson, the team has been combing through the last five years of DUHS electronic medical record data. The data set uses ICD-10 codes, the medical coding system developed by the World Health Organization that is the industry standard for billing, as a shorthand for patient diagnoses and symptoms.
There’s been some trial and error, but for Koplik and Tribby this was part of the appeal. Unlike coursework, which typically leads to a known solution, this project had no predefined answer or even a set way forward.
“You’re given the data set and there’s no specific direction,” Koplik said. “So it’s up to us to talk to people with domain knowledge, explore the data set on our own, and see what we can make of it. And that’s really the point of data science.”
It’s a sentiment echoed by Tribby. Through Data+, the two were able to step into the role of data scientist rather than student.
“I thought [Data+] was interesting because we’re working on data projects for a practical reason,” Tribby said. “In school, it’s mostly things that people have done before. This is generating new knowledge and there’s value in that.”
Paving Their Own Way
With no predetermined process or conclusion to work toward, their insights have come from exploration. It has been an authentic experience of the challenges and opportunities that come with working with a data set. When one approach seemed to be a dead end, Koplik and Tribby worked to troubleshoot the issue and approach the data in a different way.
“There’s been a lot more data wrangling than I thought there’d be,” Tribby said. “The data in its original form is 30 million rows.”
With time and experience, they determined a method for analyzing the data and developed an estimate for the upper bound of rare diseases present in the Duke Health System. Approximately 2,700 codes could represent rare diseases in the system, but, as Koplik warned, a single code can map to five or six different diseases.
It isn’t yet clear how much a more accurate estimate would reduce that number, though it is likely to reduce it to some degree. That lower bound will come in time, they both said, and, with any luck, before the final Data+ presentations on Friday, July 28. Their primary focus, though, is creating a lower bound of the extent to which patients with rare diseases (regardless of how many) burden the system as a whole.
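To make the upper-bound reasoning concrete, here is a minimal sketch in Python. It is not the team’s actual method or code. It assumes the records can be reduced to (patient, ICD-10 code) pairs and that a list of candidate codes is available. The function name, the toy records, and the two candidate codes are all hypothetical placeholders invented for illustration.

```python
def count_flagged_patients(records, candidate_codes):
    """Count distinct patients with at least one candidate code.

    records: iterable of (patient_id, icd10_code) pairs (assumed layout).
    candidate_codes: codes that *could* represent a rare disease.

    The result is an upper bound on rare-disease patients: a code can map
    to several diseases, not all of them rare, so some flagged patients
    may not have a rare disease at all.
    """
    candidates = {code.strip().upper() for code in candidate_codes}
    flagged = set()
    everyone = set()
    for patient_id, code in records:
        everyone.add(patient_id)
        if code.strip().upper() in candidates:
            flagged.add(patient_id)
    return len(flagged), len(everyone)


# Toy data, invented for illustration only.
records = [
    ("p1", "E84.0"),
    ("p1", "I10"),
    ("p2", "I10"),
    ("p3", "G71.0"),
    ("p3", "E84.0"),
    ("p4", "J45.909"),
]
candidate_codes = {"E84.0", "G71.0"}  # placeholder list, not the team's

flagged, total = count_flagged_patients(records, candidate_codes)
print(f"{flagged} of {total} patients carry at least one candidate code")
```

Counting distinct patients rather than rows matters here: one patient can appear many times across 30 million rows, and streaming through the records one pair at a time keeps memory use small, which helps in a shared environment.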
Working on the PRDN
Because the health data they’re using comes with restrictions, Koplik and Tribby must use the Protected Research Data Network (PRDN) for their work. The PRDN is a virtual computing and storage environment maintained by SSRI staff for researchers using data classified as Sensitive by the University and DUHS.
“Working on the PRDN means you have to be wary of others,” Tribby said. “You’re basically working on the same computer, so you have to be aware of what the other person is using in terms of memory. We had to work together in that sense and I’d never had to think about work in that way before.”
“There’s always something else to work on,” Koplik added, referring to the number of times memory limits meant only one computer could connect to the PRDN at a time.
The experience has been great, with SSRI staff on the Sensitive Data Services team, led by Rachel Franke, supporting them every step of the way. When Koplik and Tribby needed more memory, it was quickly granted. When they have questions, it’s a quick walk downstairs to visit the staff in person.
They’ve become so friendly, in fact, that Koplik, Tribby, and the Sensitive Data Services staff recently got together for a potluck lunch.
“It was lots of fun. They’ve been great to work with,” Koplik said.
On the total Data+ experience, he’s just as enthusiastic. “It’s amazing how just a little bit of knowledge about the basics of computer programming, or a statistical programming language like R, lets you do some pretty cool stuff,” he said.
And while both he and Tribby have more than “just a little bit of knowledge,” they both recommend Data+ for anyone with a desire to learn more about data science.
“All you need to do is give someone the opportunity,” Koplik said. “And those people will shine through out of curiosity and find some interesting things you might not have expected.”