Working with data: Weeding data

Guide on working with data

On this page you will find information to help you decide which data to keep and which to delete, and guidance on scheduling file deletion.

Weeding your data

It might seem safest to keep all of the data that you generate during the course of your project, but doing so can cause problems. For example, temporary and intermediate processing files can clutter up your file system and make it harder to find the files that you actually want to use. Without robust version control, you might also end up using older versions of files by mistake. In addition, if you are generating large quantities of data you run the risk of exceeding the limits of your storage devices. There can be substantial costs associated with buying additional space, so look carefully to see whether you need to keep all of your files or whether you can delete some of them.

Deciding what to retain and what to delete

Data type Comments Retain Delete
Raw data

If you are confident in your processing you may want to retain a cleaned version rather than the true raw data. If the research is the development or validation of a novel method then retention of the raw data is appropriate. 

Unless there is a reason to do so, we generally recommend not keeping raw audio or video recordings of interviews or focus groups due to difficulties in anonymising them.

Input data

If you are running a simulation, it is probably more important to keep the data you fed into the simulation than the raw data that came out of it.

Settings

Make sure you keep a note of calibrations, instrument settings, and any other data you would need to repeat your experiment.

Consent documents

Retain your consent form wording and information sheets. You might also need to retain your signed consent forms. These are important for understanding how the data may be used in the future. You should check the duration of retention of any signed consent documents with the research ethics committee that has approved your study.

Questionnaires and interview / focus group topic guides

Retain the templates of your questionnaires and interview or focus group topic guides and upload these to an archive alongside any related data. If the response data has been digitised you might not need to retain the completed versions of questionnaires; the timing of destruction should be addressed in your Data Management Plan.

Raw interview or focus group transcripts

Transcripts containing sensitive and / or personally identifiable data from human participants should be securely deleted after anonymisation, unless there are justifications for keeping the un-redacted transcripts.

Software and model code

By retaining these, you give yourself the option of re-running your data processing or simulations. You can also provide code to other researchers who can validate and build on the code you have developed.

Final outputs

You should retain the data on which your research results are based, and the data that could be used as the basis for future research. Retention of the final outputs is also important for research integrity and validation of findings.

Temporary or auxiliary files

Some processing tools write out files on their first pass through the data, and use those files when reading the data on subsequent passes. Such files are rarely needed once the process has finished, and can usually be regenerated if needed.

Intermediate files

If you are working with an automated processing workflow, you probably only need your input data and final output data. The files passed between processes within the workflow are rarely needed again outside the workflow.

Obsolete versions

In some cases it can be useful to keep previous versions of a file as a form of audit trail. If, however, the provenance of the file is adequately covered by other means, it may be safer to delete the obsolete versions.

Third party data

If your research data has been obtained from a third party, check the licence to see if you are able to archive or share the data with other researchers. You may be able to share derived datasets. If you cannot archive the data then you should keep sufficiently detailed documentation to enable other researchers to reproduce your findings.
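As a rough illustration of how the temporary and intermediate file categories above might be applied to a project folder, the Python sketch below lists files whose names match a set of patterns, together with their sizes, so that you can review them before deciding what to delete. It does not delete anything. The patterns (`*.tmp`, `*.bak`, `*.cache`) and the function name `find_deletion_candidates` are assumptions made for this example, not recommendations; the right patterns depend on the tools you use.

```python
from pathlib import Path

# Hypothetical patterns; adjust to the tools used in your own project.
CANDIDATE_PATTERNS = ("*.tmp", "*.bak", "*.cache")


def find_deletion_candidates(root, patterns=CANDIDATE_PATTERNS):
    """Return (path, size_in_bytes) pairs for files matching the patterns.

    Nothing is deleted: the result is a list to review by hand.
    """
    root = Path(root)
    found = {}
    for pattern in patterns:
        # rglob searches the folder and all of its subfolders.
        for path in root.rglob(pattern):
            if path.is_file():
                found[path] = path.stat().st_size
    # Largest files first, since they free the most space.
    return sorted(found.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = find_deletion_candidates(".")
    for path, size in candidates:
        print(f"{size:>12} bytes  {path}")
    total = sum(size for _, size in candidates)
    print(f"{len(candidates)} candidate files, {total} bytes in total")
```

Producing a list for review, rather than deleting directly, leaves the decision with you and makes it easier to spot files that match a pattern but still need to be retained.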

Scheduling file deletion

Having decided which files you are going to retain and delete, schedule some points at which you will review your files and enact your decision. This will help to keep the task manageable. If you are running automated workflows, you might find it useful to write cleanup routines into your code to remove unnecessary files as you go along.
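One possible way to build cleanup into an automated workflow is sketched below in Python. Intermediate files are written to a scratch directory created with the standard library's `tempfile.TemporaryDirectory`, which is removed automatically when the workflow finishes, so only the input and the final output remain. The function name `run_workflow`, the file name `step1.txt`, and the two placeholder processing steps are hypothetical and stand in for whatever your own workflow does.

```python
import tempfile
from pathlib import Path


def run_workflow(input_path, output_path):
    """Hypothetical two-step workflow that cleans up after itself."""
    input_path, output_path = Path(input_path), Path(output_path)
    # Intermediate files live in a scratch directory that is removed
    # automatically when the `with` block exits, even if a step fails.
    with tempfile.TemporaryDirectory(prefix="workflow_") as scratch:
        intermediate = Path(scratch) / "step1.txt"
        # Step 1 (placeholder): strip surrounding whitespace from each line.
        lines = input_path.read_text().splitlines()
        intermediate.write_text("\n".join(line.strip() for line in lines))
        # Step 2 (placeholder): drop empty lines and write the final output.
        kept = [line for line in intermediate.read_text().splitlines() if line]
        output_path.write_text("\n".join(kept) + "\n")
    return output_path
```

Because the scratch directory is deleted without review, this approach only suits files that you have already decided you do not need to retain and that can be regenerated from the input data.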