Data curation
Organization of collected data

Data curation is the organized process of collecting, annotating, publishing, and presenting data from various sources to ensure its long-term value, availability, and reuse. It encompasses all the steps required for controlled data creation, maintenance, and management, and often adds value to the data along the way. In scientific fields, data curation often involves extracting important information from texts, such as research articles, and converting it into structured formats such as entries in a biological database. With the rise of big data, curation has gained importance, especially for software that handles complex datasets. It also plays a key role in the digital humanities, where managing cultural and scholarly data requires specialized practices to create, maintain, and validate valuable components, and to decide what information to preserve and for how long.

History and practice

The user, rather than the database itself, typically initiates data curation and maintains metadata.[8] According to the University of Illinois' Graduate School of Library and Information Science, "Data curation is the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time."[9] The data curation workflow is distinct from data quality management, data protection, lifecycle management, and data movement.[10]

Census data has been available in tabulated punch card form since the early 20th century and has been electronic since the 1960s.[11] The Inter-university Consortium for Political and Social Research (ICPSR) website gives 1962 as the date of its first Survey Data Archive.[12]

Detailed background on data libraries appeared in a 1982 issue of the Illinois journal Library Trends.[13] For historical background on the data archive movement, see "Social Scientific Information Needs for Numeric Data: The Evolution of the International Data Archive Infrastructure."[14] The exact curation process undertaken within any organisation depends on the volume of data, how much noise the data contains, and how the data's expected future use affects its dissemination.[15]

The crises in space data led to the 1999 creation of the Open Archival Information System (OAIS) model,[16] stewarded by the Consultative Committee for Space Data Systems (CCSDS), which was formed in 1982.[17]

The term data curation is sometimes used in the context of biological databases, where specific biological information is first extracted from a range of research articles and then stored within a specific category of a database. For instance, information about antidepressant drugs can be gathered from various sources and, after checking whether it is already available in a database, saved under the antidepressant category of a drug database. Enterprises also use data curation within their operational and strategic processes to ensure data quality and accuracy.[18][19]
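The gather, check, and store sequence described for biological databases can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `CuratedDatabase` class, the `curate` function, the category name, and the sample inputs are all hypothetical and are not drawn from any real curation system or database.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class CuratedDatabase:
    """A toy category-indexed store (hypothetical, for illustration only)."""

    # Maps a category name to the set of entries already curated under it.
    entries: Dict[str, Set[str]] = field(default_factory=dict)

    def contains(self, category: str, item: str) -> bool:
        """Return True if the item is already stored under the category."""
        return item in self.entries.get(category, set())

    def add(self, category: str, item: str) -> bool:
        """Store the item; return False if it was already present."""
        if self.contains(category, item):
            return False
        self.entries.setdefault(category, set()).add(item)
        return True


def curate(db: CuratedDatabase, category: str,
           sources: Iterable[Iterable[str]]) -> List[str]:
    """Collect items from every source and store only the new ones."""
    added = []
    for source in sources:
        for item in source:
            # Normalize so the same name from two sources is stored once.
            normalized = item.strip().lower()
            if normalized and db.add(category, normalized):
                added.append(normalized)
    return added


# Illustrative inputs: names as they might be extracted from two articles.
article_a = ["Fluoxetine", "sertraline "]
article_b = ["Sertraline", "citalopram"]

drug_db = CuratedDatabase()
new_items = curate(drug_db, "antidepressant", [article_a, article_b])
print(new_items)  # ['fluoxetine', 'sertraline', 'citalopram']
```

The existence check happens before storage, so running `curate` again over the same sources adds nothing. Real curation pipelines typically involve expert review and much richer metadata than this sketch assumes.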

Projects and studies

The Dissemination Information Packages (DIPs) for Information Reuse (DIPIR) project studies research data produced and used by quantitative social scientists, archaeologists, and zoologists. Its intended audience is researchers who use secondary data, along with the digital curators, digital repository managers, data center staff, and others who collect, manage, and store digital information.[20]

The Protein Data Bank (PDB) was established in 1971 at Brookhaven National Laboratory and has grown into a global project.[21] A database of three-dimensional structural data for proteins and other large biological molecules, the PDB contains over 120,000 structures, all standardized, validated against experimental data, and annotated.

FlyBase, the primary repository of genetic and molecular data for the insect family Drosophilidae, dates back to 1992. FlyBase annotates the entire Drosophila melanogaster genome.[22]

The Linguistic Data Consortium is a data repository for linguistic data, dating back to 1992.[23]

The Sloan Digital Sky Survey (SDSS) began surveying the night sky in 2000.[24] Computer scientist Jim Gray, while working on the data architecture of the SDSS, championed the idea of data curation in the sciences.[25]

DataNet was a research program of the U.S. National Science Foundation Office of Cyberinfrastructure that funded data management projects in the sciences.[26] DataONE (Data Observation Network for Earth) is one of the projects funded through DataNet; it helps the environmental science community preserve and share data.[27]

The commercial importance of data curation is reflected in the existence of patents in the field. For example, companies such as Tamr Inc. and Praxi Data hold patents related to data curation technologies and processes.[28]

See also

  • Curation of ecological and environmental data: DataONE
  • Data management tools and services spanning multiple scientific disciplines: DataConservancy

References

  1. Renée J. Miller, “Big Data Curation” in 20th International Conference on Management of Data (COMAD) 2014, Hyderabad, India, December 17–19, 2014.

  2. BioCreative Glossary. Retrieved on 3 October 2016. http://biocreative.sourceforge.net/biocreative_glossary.html

  3. Furht, Borko; Armando Escalante (2011). Handbook of Data Intensive Computing. Springer Science & Business Media. p. 32. ISBN 9781461414155. Retrieved 2 October 2016.

  4. Sabharwal, Arjun (2015). Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Chandos Publishing. p. 60. ISBN 9780081001783. Retrieved 2 October 2016.

  5. "An Introduction to Humanities Data Curation" by Julia Flanders and Trevor Muñoz. http://guide.dhcuration.org/intro/ (no longer available; archived at archive.org).

  6. Pilin Glossary. http://www.pilin.net.au/Project_Documents/Glossary.htm (no longer available; archived at archive.org).

  7. Borgman, C (2015). Big data, little data, no data: Scholarship in the networked world. Cambridge, Massachusetts: MIT Press. p. 13. ISBN 978-0-262-02856-1.

  8. Chessell, Mandy; Nigel L Jones; Jay Limburn; David Radley; Kevin Shank (2015). Designing and Operating a Data Reservoir. IBM Redbooks. pp. 111–113. ISBN 9780837440668. Retrieved 2 October 2016.

  9. Cragin, Melissa; Heidorn, P. Bryan; Palmer, Carole L.; Smith, Linda C. (2007). "An Educational Program on Data Curation". ALA Science & Technology Section Conference. Retrieved 7 October 2013. https://www.ideals.illinois.edu/handle/2142/3493

  10. Chessell, Mandy; Nigel L Jones; Jay Limburn; David Radley; Kevin Shank (2015). Designing and Operating a Data Reservoir. IBM Redbooks. pp. 111–113. ISBN 9780837440668. Retrieved 2 October 2016.

  11. "Preserving Digital Information (PDI) report" (PDF). 1996. Retrieved 2018-03-13. https://www.clir.org/wp-content/uploads/sites/6/2016/09/pub63watersgarrett.pdf

  12. "ICPSR: History". www.icpsr.umich.edu. Retrieved 2018-03-15. https://www.icpsr.umich.edu/icpsrweb/content/about/history/

  13. Heim, Kathleen M. (November 29, 1982). "Library Trends 30 (3) Winter 1982: Data Libraries for the Social Sciences". Library Trends – via www.ideals.illinois.edu. https://www.ideals.illinois.edu/handle/2142/7218

  14. Kathleen M. Heim, "Social Scientific Information Needs for Numeric Data: The Evolution of the International Data Archive Infrastructure." in Collection Management 9 (Spring 1987): 1-53.

  15. Furht, Borko; Armando Escalante (2011). Handbook of Data Intensive Computing. Springer Science & Business Media. p. 32. ISBN 9781461414155. Retrieved 2 October 2016.

  16. "The OAIS reference model". 2015-12-09. Retrieved 2018-03-15. https://www.oclc.org/research/publications/library/2000/lavoie-oais.html

  17. "CCSDS.org - The Consultative Committee for Space Data Systems (CCSDS)". public.ccsds.org. Retrieved 2018-03-14. https://public.ccsds.org/default.aspx

  18. E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” Archived 2012-01-23 at the Wayback Machine in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47. ISBN 978-1-4419-7664-2 http://3roundstones.com/led_book/led-curry-et-al.html

  19. A. Freitas, E. Curry, “Big Data Curation,” Archived 2016-09-13 at the Wayback Machine in New Horizons for a Data-Driven Economy, Springer (Open Access), 2015. https://www.insight-centre.org/sites/default/files/publications/newhorizons_online.pdf

  20. Dissemination Information Packages for Information Reuse (DIPIR) project. http://www.oclc.org/research/themes/user-studies/dipir.html

  21. "RCSB PDB: About the PDB Archive and the RCSB PDB". Retrieved 15 March 2018. https://www.rcsb.org/pages/aboutus

  22. Gramates, LS; Marygold, SJ; dos Santos, G; Urbano, J-M; Antonazzo, G; Matthews, BB; Rey, AJ; Tabone, CJ; Crosby, MA; Emmert, DB; Falls, K; Goodman, JL; Hu, Y; Ponting, L; Schroeder, AJ; Strelets, VB; Thurmond, J; Zhou, P; FlyBase Consortium (2017). "FlyBase at 25: looking to the future". Nucleic Acids Res. 45 (D1): D663–D671. doi:10.1093/nar/gkw1016. PMC 5210523. PMID 27799470. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210523

  23. "About LDC". Linguistic Data Consortium. Retrieved 15 March 2018. https://www.ldc.upenn.edu/about

  24. "Sloan Digital Sky Survey". SDSS. Retrieved 15 March 2018. http://www.sdss.org/

  25. Palmer, Carole L.; Weber, Nicholas M.; Muñoz, Trevor; Renear, Allen H. (June 2013). "Foundations of Data Curation: The Pedagogy and Practice of "Purposeful Work" with Research Data". Archive Journal. 3. hdl:2142/78099.

  26. "Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Summary". National Science Foundation. September 28, 2007. Retrieved March 15, 2018. https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141

  27. "What is DataONE?". Archived from the original on 26 April 2019. Retrieved 15 March 2018. https://web.archive.org/web/20190426165259/https://www.dataone.org/what-dataone

  28. "What is Data Curation?". Praxi AI. Retrieved 15 June 2025. https://www.praxi.ai/what-is-data-curation

  29. Borgman, C (2015). Big data, little data, no data: Scholarship in the networked world. Cambridge, Massachusetts: MIT Press. p. 13. ISBN 978-0-262-02856-1.