Science Grid This Week
March 1, 2006
The Future of Digital Data

Francine Berman
Over the last few decades, advances in information technology have led to a fundamental change in research and education, and digital data collections are at the heart of this change. These collections are rapidly multiplying in number and increasing in size, and their management, preservation and use were the focus of the "Expanding Universe of Digital Data Collections" symposium at the 2006 American Association for the Advancement of Science Annual Meeting.

"We used to look at compute only and ask what you can do in your local environment and what you need to go outside for," said San Diego Supercomputer Center Director Francine Berman during her presentation at the symposium. "Now we need to look at data like that too. One of the things we see across many communities is the desire to put together different kinds of data to answer bigger questions."

Enabling such multidisciplinary research across different communities and digital data collections requires a comprehensive strategy for curating data sets over the long term.

"No plan for data preservation, which is currently what we have, often means that data can be lost or damaged," added Berman. "There are many kinds of questions that one must answer about these collections. What should one preserve? How should we preserve it? What kinds of formats, what kinds of storage media? Who should pay for preservation?"

Raymond Orbach
These and other questions were also raised in a recent report from the National Science Board titled "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century," around which the symposium was based. The NSB provides oversight and policy for the National Science Foundation, and the report makes a series of recommendations to NSF for developing a strategy to support digital data collections.

Until now, the report notes, NSF strategies and policies governing long-lived data collections have been developed incrementally and not considered collectively. The report's recommendations to NSF, which it advises the agency to share and implement with other national agencies, include developing comprehensive agency-wide strategies for supporting and advancing long-lived collections; requiring research proposals for activities that will generate long-lived digital data to describe their data management plan, and for those plans to be evaluated by the funding agency; and ensuring education and training in the use of digital data sets.

Another recommendation, to develop a digital data workforce, was echoed by several symposium speakers. Data scientists with the specialized skills to manage data collections often have an ill-defined career path and face difficulties advancing in their careers. As a result, the important work of curating a data collection for a large and varied user community is often done on a volunteer basis. Another speaker suggested that agencies may need to make long-term funding commitments to the collections themselves in addition to, or instead of, the collecting organizations.

The view from the Department of Energy's Office of Science was delivered by director Raymond Orbach, who focused on the use and storage of very large data sets and the "curse of dimensionality"—the growing difficulty in interpreting data sets that are exponentially increasing in size. The techniques to make discoveries with huge experimental and simulated data sets, separating the signal from the noise and looking for patterns, need to continually change and advance.

"As our data sets become larger and larger, data mining is becoming more and more cumbersome and more and more expensive," said Orbach. "We can give example after example of discoveries that were not made at first because the data filter was wrong." Over the next few years, the DOE's Office of Science plans to double its data storage capacity and increase funding for the Office of Advanced Scientific Computing Research to initiate a long-term research program into ways of addressing the curse of dimensionality.

Read the NSB report

—Katie Yurkewicz