Data documentation and metadata

Documentation that you will need

At the start of your project think about the sorts of documentation that you will need:

you might need to remind yourself how you collected and organised your data if you come to re-use them in another project. Good documentation will help you to help yourself in the future!
a peer researcher might need to know why and how you collected, organised and processed your data, in order to build on your research and use your data in new ways.
the data archive to which you will submit your data at the end of your project will ask you a set of questions when you register and upload your data - find out what these are and ensure that you document this information throughout the project. This will make the process of archiving your data much less time-consuming.
you may need to describe how you collected your data in a journal article, in enough detail to allow your results to be reproduced and verified.

Document your data as you go along: this is much easier than trying to remember what you have done at the end of your project!

Methods of documenting your data

There are different ways in which you can document your data depending on the context within which it is being collected:

in a 'readme' file: any information that cannot be recorded in a structured way (i.e. as the values of fields in a data or metadata file) can be recorded as free text within a readme file.
in an electronic lab notebook: the University currently recommends Signals Notebook to Chemistry students, and Jupyter Notebook to researchers who are writing code.
within the data file: some file formats can record information in addition to the main data content. For example, the Observations and Measurements XML standard provides a way of recording sampling strategies and procedures as well as measurement values.
in a separate metadata file: some disciplines have developed special file formats or data structures for recording supporting information. There is a list of resources for using structured metadata files in the 'resources for data documentation' section of this guide.
in a file mimicking a web form: in some cases, archives generate specialist metadata files from their submission forms. Find out the fields of the submission form of the archive to which you are planning to submit your data, copy these fields into your data documentation and fill these in as you go through your project.
in a published journal article: some of the information needed to understand data would normally be provided in a journal article reporting the research. To prevent duplication of effort, you can refer to an article to provide more information about a dataset, but before doing so you should be sure that (a) the article provides sufficient detail and (b) the article will be available as open access.

Metadata

'Metadata describe the content, context and provenance of datasets in a standardised and structured manner, typically describing the purpose, origin, temporal characteristics, geographic location, authorship, access and conditions and terms of use of a dataset' (UK Data Service). An example of dataset metadata is the set of field values that you use when you search the literature databases. Metadata make the data discoverable through an internet search and then help researchers to decide whether the dataset is suitable for their research. They are most useful when they have been structured, and they often use a standardised dictionary of terms. The 'resources for data documentation' section of this guide provides links to resources to identify whether there are metadata standards used within your discipline.
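
As a rough illustration of what 'structured' means here, the sketch below records the kinds of properties listed in the definition above as named fields and serialises them to JSON. The field names, the values and the choice of required fields are all hypothetical placeholders and do not follow any particular metadata standard; a real record should use the standard adopted by your discipline or archive.

```python
import json

# Hypothetical field names chosen for illustration; they mirror the properties
# named in the definition above and do not follow any specific standard.
dataset_metadata = {
    "title": "Example dataset title",
    "purpose": "Why the data were collected",
    "origin": "How and where the data were generated",
    "temporal_coverage": {"start": "2023-01-01", "end": "2023-06-30"},
    "geographic_location": "Example location",
    "authors": ["A. Researcher", "B. Researcher"],
    "access_conditions": "Open",
    "terms_of_use": "Example licence name",
    "keywords": ["example keyword", "another keyword"],
}

# Assumed minimum set of fields; a real archive defines its own.
REQUIRED_FIELDS = ["title", "authors", "terms_of_use"]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in required if not record.get(name)]


metadata_json = json.dumps(dataset_metadata, indent=2)
print(metadata_json)
print("Missing required fields:", missing_fields(dataset_metadata))
```

Because every property has its own named field, a search service or a simple script can filter on it directly, which is not possible with the same information written as free text.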

Readme file templates

You can use these readme file templates to document your data.

Example readme file template
This is an example of a generic readme template. It can be changed according to the dataset being described.
Readme file template for code-based data
This readme file template is based on the CodeMeta metadata standard and can be used for any code-based dataset.
Readme file template for experimental data
This readme file template is based on the Core Scientific Metadata Model (CSMD) metadata standard and can be used for experimental datasets and facilities datasets.

Writing your data documentation

When documenting your data, the aim is to provide enough information that a fellow researcher who is familiar with your field, but not necessarily with your work, can understand the data, interpret them and use them in new research without needing to contact you directly about the dataset.

An overview of the data should include:

the methods used to collect the data, including major methodological decisions that have been taken;
the structure of the files used;
processing or data manipulation that has been undertaken to generate the results of the project.

Specifically, you may need to include some of the following information:

details of the equipment used, such as make and model, settings and information on how it was calibrated;
the text of questionnaires and interview templates or topic guides. If these are only available under licence, details of how to access the instruments should be included;
details of who collected the data and when;
key features of the methodology, such as sampling technique, whether the experiment was blinded, how sample groups were subdivided;
legal and ethical agreements relating to the data, such as consent forms, data licences, approval documents or COSHH forms;
citations for any third-party data you have used;
details of the file formats and standard data structures used to record the data and supporting information;
a glossary of column names and abbreviations used, explaining for example which measurement resulted in a given column and what units were used;
methods of managing missing data;
the codebook you used to analyse and encode content;
the workflow used to process and manipulate the data, including steps such as applying statistical tests or removing outliers;
details of the software used to generate or process the data, including version number and platform.
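
The glossary of column names, units and missing-data codes mentioned in the list above can be kept as a small machine-readable data dictionary alongside the data. The following is a minimal sketch of one possible layout; the column names, units and the missing-value code -999 are invented for illustration.

```python
import csv
import io

# Hypothetical columns for illustration only.
columns = [
    {"name": "sample_id", "description": "Unique sample identifier",
     "units": "", "missing_code": ""},
    {"name": "temp_c", "description": "Water temperature at time of sampling",
     "units": "degrees Celsius", "missing_code": "-999"},
    {"name": "ph", "description": "pH measured with a handheld meter",
     "units": "pH units", "missing_code": "-999"},
]

FIELDNAMES = ["name", "description", "units", "missing_code"]


def data_dictionary_csv(rows):
    """Render the column definitions as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented_columns(data_header, rows):
    """Return columns present in a data file header but absent from the dictionary."""
    documented = {row["name"] for row in rows}
    return [name for name in data_header if name not in documented]


print(data_dictionary_csv(columns))
# "depth_m" appears in the data header but has no dictionary entry yet.
print(undocumented_columns(["sample_id", "temp_c", "ph", "depth_m"], columns))
```

Running a check like `undocumented_columns` whenever a data file changes is one way to keep the glossary in step with the data as the project progresses.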

You may be recording some of this information in a lab notebook or research journal. If so, you may find it convenient to maintain an index file that links data files to the corresponding page numbers until you have an opportunity to transfer the information into a documentation file.
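
Such an index can be as simple as a small table. The sketch below assumes a hypothetical layout with one row per data file, the notebook it is described in and the relevant pages; the file names, notebook names and page numbers are made up.

```python
import csv
import io

# Hypothetical entries: (data file, notebook, pages).
index_entries = [
    ("run_001.csv", "Lab notebook 2", "14-16"),
    ("run_002.csv", "Lab notebook 2", "17"),
    ("interview_03.txt", "Research journal", "42-45"),
]


def build_index(entries):
    """Return CSV text linking each data file to its notebook pages."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["data_file", "notebook", "pages"])
    writer.writerows(entries)
    return buffer.getvalue()


def pages_for(data_file, entries):
    """Look up where a given data file is described, or None if it is not indexed."""
    for name, notebook, pages in entries:
        if name == data_file:
            return notebook, pages
    return None


print(build_index(index_entries))
print(pages_for("run_002.csv", index_entries))
```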

A 'readme' file is a plain text file that is named 'readme' to encourage users to read it before looking at the remainder of the content. It can contain documentation directly or instruct the reader where to look to find more information. Even though it is free text, the file should be structured into sections as an aid to the reader. The following table summarises suggestions on what to include. There are some examples of readme files provided as links below the table.

Section	What to include
Citation information	Information needed so that the reader can cite your dataset: title of the dataset; names of the people responsible for the dataset; year it was (or will be) released; location of the dataset (this should normally be the name of the data archive that holds the data); identifier for the dataset, such as a Digital Object Identifier (DOI) or accession number.
Methodology	Describe how you collected the data: reference to a published article describing the methods, including the DOI and link to an open access copy; any additional information needed to allow for reproduction of the dataset or of a comparable one.
Third-party inputs	If you used third-party data, provide a data citation or a description of how you accessed the data.
Workflow	Provide details of the steps you took to process the data: preparatory steps such as data cleaning and reformatting; name of the software, services or scripts you used and where they can be found; how to install, invoke or run any software, services or scripts; any settings needed for the software.
Outputs	If your workflow generates auxiliary files as well as data files, explain which are which. Relate the outputs of your workflow to the data files you have, or will be submitting, for archiving.
Inventory of files	Give the names of the files in the dataset, a short description of each, and how they interrelate. Mention related data that was not selected for inclusion, such as auxiliary files generated by your workflow.
File structure and conventions	Provide details on how to interpret your data files: explain what measurement each column heading represents; units of measurement used; definitions of categorical variable groups; abbreviations; key to identifying missing data; coding or controlled vocabulary that was used.
Licence information	Give a short statement about the terms under which others may use the dataset. If necessary, the full text of the licence may be given in a separate plain-text file called 'licence.txt'.
Relationships	If applicable, give links to related datasets, alternative records or publications.

The University of Bath Research Data Archive contains some examples of readme files you can look at for inspiration. The University of Minnesota provides an example of a readme file template.
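
If you would like a starting point of your own, the section headings suggested in the table can be turned into a plain-text skeleton to fill in as the project progresses. This is only a sketch: the headings come from the table, but the bracketed prompts are abbreviated paraphrases and the underlined-heading layout is an assumption, not a required format.

```python
# Section headings taken from the table above; the prompts are abbreviated.
README_SECTIONS = [
    ("Citation information", "Title, creators, year, archive, identifier (e.g. DOI)"),
    ("Methodology", "How the data were collected; reference to a methods article"),
    ("Third-party inputs", "Citations for, or access details of, third-party data"),
    ("Workflow", "Processing steps, software, scripts and settings"),
    ("Outputs", "Which files are data files and which are auxiliary"),
    ("Inventory of files", "File names, descriptions and how they interrelate"),
    ("File structure and conventions", "Column meanings, units, missing-data codes"),
    ("Licence information", "Terms under which others may use the dataset"),
    ("Relationships", "Related datasets, alternative records or publications"),
]


def readme_skeleton(sections=README_SECTIONS):
    """Build plain-text readme content with one underlined heading per section."""
    lines = []
    for heading, prompt in sections:
        lines.append(heading)
        lines.append("-" * len(heading))
        lines.append(f"[{prompt}]")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


print(readme_skeleton())
```

The printed text can be saved as a plain-text file named 'readme' and each bracketed prompt replaced with the real information.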

As a researcher, the three main types of metadata you will be asked to provide are contextual metadata, discovery metadata, and metadata for reuse.

Contextual metadata

This describes the context within which the project was conducted. You will provide this when you create a record of a dataset in Pure. This helps to connect your data to your own research profile, and to your project, funding body and publications.

Discovery metadata

This helps other researchers to find your data, and as a result may help to increase the impact of your research. You will provide discovery metadata when you complete a record in the University of Bath Research Data Archive or another research data archive or repository.

Metadata for reuse

The metadata you provide for reuse will depend on the field of your research.

Social scientists often package their data and metadata together using DDI, or if the data are strongly statistical in nature, SDMX.
Many biological and biomedical investigations have a corresponding Minimum Information Standard setting out what information would be needed to interpret the data unambiguously and reproduce the experiment.
Geospatial datasets are usually packaged in a format that complies with the standard ISO 19115. There are many profiles of this standard aimed at different communities; UK researchers are encouraged to use UK GEMINI, which is in turn compliant with the European INSPIRE Directive.
Some subject-specific data archives ask for data to be submitted in a particular format. For example, the NCBI Gene Expression Omnibus specifies a metadata set to be submitted along with data, and has developed the spreadsheet-based GEOarchive format for capturing it.

The resources section of this guide has links to a number of subject-specific metadata standards and to catalogues of metadata standards.

Some subject areas have agreed on a common set of terminology to use when describing data. Metadata standards list the properties of the dataset that need to be known and vocabularies provide a standardised set of terms with which these properties can be recorded.

The NERC Vocabulary Server provides access to many different vocabularies in use in geoscience and oceanography.
The Open Knowledge Foundation runs the Linked Open Vocabularies service, which provides access to many different vocabularies that are suitable for use in Resource Description Framework (RDF) applications.
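
To illustrate the difference between free-text values and values drawn from a vocabulary, the sketch below checks recorded values against a small set of permitted terms. The property name, the vocabulary and the recorded values are invented for the example and are not taken from the NERC Vocabulary Server or any other real service.

```python
# Hypothetical controlled vocabulary for a "sediment_type" property.
SEDIMENT_TERMS = {"clay", "silt", "sand", "gravel"}


def invalid_terms(values, vocabulary):
    """Return recorded values not in the vocabulary, ignoring case and surrounding spaces."""
    return [value for value in values if value.strip().lower() not in vocabulary]


recorded = ["Sand", "silt ", "mud", "gravel", "sandy"]
print(invalid_terms(recorded, SEDIMENT_TERMS))  # prints ['mud', 'sandy']
```

Values that fail the check either need correcting to a permitted term or point to a concept the vocabulary does not cover, which is worth noting in your documentation.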

Some labs are now moving away from paper-based laboratory notebooks to electronic lab notebooks. At the University we are currently recommending the use of Signals Notebook to Chemistry students. More information on this software can be found on the Library Chemistry subject page. The Advancing Research Computing Service is recommending Jupyter Notebook for coding and bioinformatics. Jupyter is open-source software and is therefore freely available to use.

The Harvard Biomedical Data Management Group has created an Electronic Lab Notebook Matrix that contains information on a wide range of the currently available software.

Resources for writing data documentation

Digital Curation Centre - metadata standards
Introduction to metadata standards with a summary of commonly used standards.
Scientific American - introduction to metadata, with a festive theme
Scientific American blog that provides a lay description of metadata with festive examples.
UK Data Service - intellectual property rights resources
Links to resources on data documentation from the UK Data Service.
UK Data Service - documenting your data
Overview of study- and data-level documentation and an overview of catalogue metadata.

FAIRSharing
Catalogue of databases relevant to science and standards used with them.
Digital Curation Centre - Disciplinary Metadata
Catalogue of metadata standards by discipline (Biology, Earth Science, Physical science, Social Science & Humanities) and for general research data.
RDM Metadata Standards Catalog
Open directory of metadata standards applicable to research data.
Data Documentation Initiative (DDI) Specification
International metadata standard for describing social, behavioural and economics sciences data. DDI is used by the UK Data Service repository.
Minimum Information for Biological and Biomedical Investigations (MIBBI)
This is a set of guidelines for reporting data derived by relevant methods in biosciences.
Community Inventory of EarthCube Resources for Geosciences Interoperability
Resource for discovering metadata standards for the geosciences.
Content Standard References
Marine Metadata Interoperability metadata standard catalogue aimed at marine science.
GEOSS Standards and Interoperability Registry
Metadata standards for Earth observation research.
INSPIRE Directive
Metadata standard for geospatial data (European directive).
UK GEMINI
Metadata standard for geospatial data.

NERC Vocabulary Server
Vocabularies used in geoscience and oceanography.
Linked Open Vocabularies
This service provides access to many different vocabularies that are suitable for use in Resource Description Framework (RDF) applications.