Weeding your data

It might seem safest to keep all of the data that you generate during the course of your project, but doing so can cause problems. Temporary and intermediate processing files clutter your file system and make it harder to find the files that you actually want to use. Without robust version control, you might use an older version of a file by mistake. If you are generating large quantities of data, you also run the risk of exceeding the limits of your storage devices. Buying additional space can be costly, so look carefully at whether you need to keep all of your files or whether you can delete some of them. Weeding your data throughout the project will also help when you deposit your final dataset, which should be structured and documented appropriately, with adequate metadata attached to it.
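If you are not sure which kinds of file are taking up your storage, a short script can give you a starting point for the review. The following is a minimal sketch in Python, not a recommended tool: the function name usage_by_suffix is hypothetical, and it simply totals file sizes by extension under a folder that you choose. It does not delete anything.

```python
from collections import defaultdict
from pathlib import Path


def usage_by_suffix(root):
    """Return (extension, total bytes) pairs under root, largest first."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Extensions are lower-cased so ".TMP" and ".tmp" are counted together.
            # Files without an extension are grouped under "(none)".
            totals[path.suffix.lower() or "(none)"] += path.stat().st_size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "." means the current folder; replace it with your project folder.
    for suffix, size in usage_by_suffix("."):
        print(f"{suffix:>10}  {size / 1e6:10.2f} MB")
```

A large total for an extension used only by temporary or intermediate files suggests a good place to start weeding.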

Deciding what to retain and what to delete

Data type	Comments	Retain	Delete
Raw data	If you are confident in your processing you may want to retain a cleaned version rather than the true raw data. If the research is the development or validation of a novel method then retention of the raw data is appropriate. Unless there is a reason to do so, we generally recommend not keeping raw audio or video recordings of interviews or focus groups due to difficulties in anonymising them.
Input data	If you are running a simulation, it is probably more important to keep the data you fed into the simulation than the raw data that came out of it.
Settings	Make sure you keep a note of calibrations, instrument settings, and any other data you would need to repeat your experiment.
Consent documents	Retain your consent form wording and information sheets. You might also need to retain your signed consent forms. These are important for understanding how the data may be used in the future. You should check the duration of retention of any signed consent documents with the research ethics committee that has approved your study.
Questionnaires and interview / focus group topic guides	Retain the templates of your questionnaires and interview or focus group topic guides and upload these to an archive alongside any related data. If the response data has been digitised you might not need to retain the completed versions of questionnaires; the timing of destruction should be addressed in your Data Management Plan.
Raw interview or focus group transcripts	Transcripts containing sensitive and / or personally identifiable data from human participants should be securely deleted after anonymisation, unless there are justifications for keeping the un-redacted transcripts.
Software and model code	By retaining these, you give yourself the option of re-running your data processing or simulations. You can also provide code to other researchers who can validate and build on the code you have developed.
Final outputs	You should retain the data on which your research results are based, and the data that could be used as the basis for future research. Retention of the final outputs is also important for research integrity and validation of findings.
Temporary or auxiliary files	Some processing tools write out files on their first pass through the data, and use those files when reading the data on subsequent passes. Such files are rarely needed once the process has finished, and can usually be regenerated if needed.
Intermediate files	If you are working with an automated processing workflow, you probably only need your input data and final output data. The files passed between processes within the workflow are rarely needed again outside the workflow.
Obsolete versions	In some cases it can be useful to keep previous versions of a file as a form of audit trail. If, however, the provenance of the file is adequately covered by other means, it may be safer to delete the obsolete versions.
Third party data	If your research data has been obtained from a third party, check the licence to see if you are able to archive or share the data with other researchers. You may be able to share derived datasets. If you cannot archive the data then you should keep sufficiently detailed documentation to enable other researchers to reproduce your findings.

NERC Data Management Planning
How to appraise and select research data for curation
Digital Curation Centre guidance on appraising data.

Scheduling file deletion

Having decided which files you are going to retain and delete, schedule some points at which you will review your files and enact your decision. This will help to keep the task manageable. If you are running automated workflows, you might find it useful to write cleanup routines into your code to remove unnecessary files as you go along.
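One way to build cleanup into an automated workflow is to write intermediate files to a temporary folder that is removed when the workflow finishes. The following is a minimal sketch in Python, assuming a simple two-step process: the function run_workflow, the file name step1.txt, and the processing steps themselves are hypothetical placeholders for your own code.

```python
import tempfile
from pathlib import Path


def run_workflow(input_path, output_path):
    """Hypothetical two-step workflow that cleans up after itself."""
    # Intermediate files live in a temporary folder that is deleted
    # automatically when the `with` block exits, even if a step fails.
    with tempfile.TemporaryDirectory() as scratch:
        intermediate = Path(scratch) / "step1.txt"

        # Step 1 (placeholder): tidy the input and write an intermediate file.
        text = Path(input_path).read_text()
        intermediate.write_text(text.strip().lower())

        # Step 2 (placeholder): produce the final output from the intermediate file.
        words = intermediate.read_text().split()
        Path(output_path).write_text(f"{len(words)}\n")
    # Only the input file and the final output file remain at this point.
```

Because the temporary folder is removed as soon as the workflow finishes, only use this approach for files that you are confident you can regenerate from your input data and code.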

What should I do with my data at the end of my project?

The University research storage service only provides storage for active projects. Once you no longer need regular access to your data, you should evaluate whether you need to retain the data for the long term and, therefore, whether it should be archived. Researchers should be aware that any research data held in an individual's University file storage area (i.e. the H: and X: drives, as well as OneDrive) is at risk of deletion if they leave the University, as their accounts will be closed following their departure. Postgraduate students should consult their supervisor about archiving and deleting research data at the end of their project.

Data that underpins a publication, or that may be of future benefit to you or other members of the research community, should be archived in an appropriate research data archive or repository. Any dataset that is deposited in a research data archive should be structured and documented appropriately, with adequate metadata attached to it. For more information, see our guide to archiving and sharing data.