Research Data Management

Data collection: volume, type and formats of data, data standards and methodologies, folder structures and versioning

In a data management plan you will need to mention what standards or methodologies you are using to collect the data. By this we mean any protocols or standard methods you are using when collecting/handling the data.

This might include your protocols for:

  • data quality assurance
  • anonymisation and cleaning of data
  • data transcription standards
  • disciplinary standards [such as taxonomies like MeSH]
  • confidentiality agreements for data handlers

You might also mention any specific software that is relevant to how you will be collecting or handling the data.

The volume and type of data you are collecting will affect your decisions around storage and back-up. It might also affect how you can share the data or collaborate with others. Try to pre-empt any of these issues and explain in your data management plan how you will deal with them.

File formats are integral to ensuring that data is reusable in the future. Your choice of file format will determine the software that is required to open it. If that piece of software becomes obsolete, or its file format is updated to the point where older versions are no longer compatible, or there are barriers to obtaining the software required, then you and others will not be able to make use of the data in the future! You should also check with your chosen repository which file types are accepted.

Some considerations that may affect which file format you use:

  • Does your funder have any expectations as to how the data would be presented?

  • Does your research community have expectations as to which file formats are used?

  • What file formats does your chosen repository allow?

  • Is your file format widely adopted? Is it proprietary or open?

  • Is there any backwards compatibility of the file format?

  • Is there good support for metadata within the file format?

You should consider saving the data that you wish to retain in a format that can be opened by a wide variety of software, and which is unlikely to become outdated in the foreseeable future. 

You may wish to save your files in an Open File Format at a later stage of the project, but it is also worth bearing in mind that transforming a file from one type to another may change the quality or functionality of the file. The MANTRA Research Data Management course has a module on file formats and transformations, which may be of use if you would like to learn more. An updated version of this content can also be found in Week 3 of the Research Data Management and Sharing MOOC.

The table below includes recommendations on file formats gathered from a variety of sources, including the Lancaster University library pages and the UK Data Service.


Textual

  • Recommended: Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema
  • Acceptable: Hypertext Mark-up Language (.html); widely-used formats: MS Word (.doc/.docx); some software-specific formats: NUD*IST, NVivo and ATLAS.ti
  • Avoid: .doc

Audio

  • Recommended: Free Lossless Audio Codec (FLAC) (.flac)
  • Acceptable: MPEG-1 Audio Layer 3 (.mp3) if original created in this format; Audio Interchange File Format (.aif); Waveform Audio Format (.wav)
  • Avoid: .wma; .ra; .ram; compression

Geospatial Data

  • Recommended: ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data; Geography Markup Language (.gml)
  • Acceptable: ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages

Video

  • Recommended: MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2)
  • Acceptable: AVCHD video (.avchd)
  • Avoid: .wmv; .mov; .avi; compression

Image

  • Recommended: TIFF 6.0 uncompressed (.tif)
  • Acceptable: JPEG (.jpeg, .jpg, .jp2) if original created in this format; GIF (.gif); TIFF other versions (.tif, .tiff); RAW image format (.raw); Photoshop files (.psd); BMP (.bmp); PNG (.png); Adobe Portable Document Format (PDF/A, PDF) (.pdf)
  • Avoid: .psd; compression

Data

  • Recommended: .sql; .csv; .xml
  • Acceptable: .xlsx
  • Avoid: .xls; proprietary DB formats

Tabular Data (extensive metadata)

  • Recommended: SPSS portable format (.por); delimited text and command ('setup') file (SPSS, Stata, SAS, etc.); structured text or mark-up file of metadata information, e.g. DDI XML file
  • Acceptable: proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)

Tabular Data (minimal metadata)

  • Recommended: comma-separated values (.csv); tab-delimited file (.tab); delimited text with SQL data definition statements
  • Acceptable: delimited text (.txt) with characters not present in data used as delimiters; widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)
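If you have a lot of files to review, the format recommendations above can be turned into a quick automated check. The sketch below is illustrative only: the FORMAT_ADVICE lookup is a small, hand-picked subset of the recommendations, and the function names advise and summarise are made up for this example. You would need to extend the lookup to match your own repository's and funder's requirements.

```python
from pathlib import Path

# Hand-picked subset of the format recommendations (an assumption for this
# example, not a complete or authoritative list).
FORMAT_ADVICE = {
    ".rtf": "recommended",
    ".csv": "recommended",
    ".tab": "recommended",
    ".flac": "recommended",
    ".mp4": "recommended",
    ".wav": "acceptable",
    ".sav": "acceptable",
    ".dta": "acceptable",
    ".png": "acceptable",
    ".wma": "avoid",
    ".wmv": "avoid",
    ".mov": "avoid",
    ".avi": "avoid",
}


def advise(filename):
    """Return the advice for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    return FORMAT_ADVICE.get(suffix, "unknown")


def summarise(filenames):
    """Group file names under 'recommended', 'acceptable', 'avoid' or 'unknown'."""
    report = {}
    for name in filenames:
        report.setdefault(advise(name), []).append(name)
    return report


if __name__ == "__main__":
    files = ["survey.csv", "Interview_01.WAV", "clip.mov", "notes.xyz"]
    for advice, names in summarise(files).items():
        print(f"{advice}: {', '.join(names)}")
```

A check like this only looks at the file extension, so it cannot tell you whether a file is compressed or whether it opens correctly. Treat the result as a prompt to review files, not as a final verdict.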

 

 

Have you ever tried to organise your files by date, only to find that the order is incorrect because of the way you have written the date in the file name? Or are you guilty of naming final drafts "...final.doc", "...finalfinal.doc", "finalfinalabsolute.doc"?

The Software Carpentry 'Data Management' video below gives a good introduction to file naming, folder structure and versioning and how they relate to each other.

Being able to locate your files and use them is a key aspect of good research data management, particularly when working collaboratively. It is good practice to decide on a strategy for naming your files at the beginning of the project. This can also help with versioning and reduce mistakes. Stanford Libraries have some useful examples of good and bad file naming practice on their web pages.

Tips on naming files:

The name should help to describe the file, so name files purposefully according to their content or their role in your work, as opposed to using individuals' names or document types.

Try to keep the name concise: 25 characters or fewer.

Certain software and different operating systems have restrictions on characters that can be used within a file name. To ensure interoperability, avoid using spaces in file names and special characters such as: * : \ / < > | " ? [ ] ; = + & £ $ , .

If you are collaborating, decide on a protocol for file naming to help keep things consistent, then stick to your protocol! You can document your protocol in your data management plan.

For dates, put them in YYYY_MM_DD order at the beginning of the file/folder name.

Examples:

2019_04_03_Transcript_KYD

2011_03_01_Analysis_v03
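The tips above can be sketched as a small helper that builds a name from a date and checks it against the rules. This is a minimal illustration, not a standard tool: the function names build_name and check_name are made up for this example, the check applies to the name without its file extension, and the allowed character set (letters, digits, underscores and hyphens) is a stricter assumption than the list of characters to avoid.

```python
import re
from datetime import date

MAX_LENGTH = 25  # "25 characters or fewer", not counting the file extension
ALLOWED = re.compile(r"^[A-Za-z0-9_-]+$")  # assumption: letters, digits, _ and - only
DATE_PREFIX = re.compile(r"^\d{4}_\d{2}_\d{2}_")  # YYYY_MM_DD_ at the start


def build_name(day, *parts):
    """Join a date and descriptive parts, e.g. 2019_04_03_Transcript_KYD."""
    return "_".join([day.strftime("%Y_%m_%d"), *parts])


def check_name(stem):
    """Return a list of problems with a file name (without its extension)."""
    problems = []
    if not ALLOWED.match(stem):
        problems.append("contains spaces or special characters")
    if len(stem) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if not DATE_PREFIX.match(stem):
        problems.append("does not start with a YYYY_MM_DD date")
    return problems


if __name__ == "__main__":
    name = build_name(date(2019, 4, 3), "Transcript", "KYD")
    print(name, check_name(name))
    print("final final", check_name("final final"))
    # Names that start with YYYY_MM_DD sort chronologically as plain text.
    print(sorted(["2019_04_03_Transcript_KYD", "2011_03_01_Analysis_v03"]))
```

Because the year comes first, an ordinary alphabetical sort in a file browser puts these names in date order, which is the reason for the YYYY_MM_DD convention.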

Bulk File Renaming

There are tools to help you rename files in bulk. You may wish to try the free Bulk Rename Utility tool (Windows) or Renamer 6 (Mac).
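If you are comfortable with a little scripting, a bulk rename can also be done in a few lines of Python. The sketch below is one possible approach and is not related to the tools named above: the function names plan_renames and apply_renames are made up for this example, and it only replaces spaces with underscores. It defaults to a dry run, so nothing is renamed until you have checked the plan.

```python
from pathlib import Path


def plan_renames(folder, replace_with="_"):
    """Return (old, new) path pairs for files whose names contain spaces."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and " " in path.name:
            new_name = path.name.replace(" ", replace_with)
            plan.append((path, path.with_name(new_name)))
    return plan


def apply_renames(plan, dry_run=True):
    """Print each planned rename; only rename on disk when dry_run is False."""
    for old, new in plan:
        if new.exists():
            # Never overwrite an existing file.
            print(f"skip (target exists): {old.name}")
            continue
        print(f"{old.name} -> {new.name}")
        if not dry_run:
            old.rename(new)


if __name__ == "__main__":
    # "." is the current folder; point this at a copy of your data first.
    apply_renames(plan_renames("."), dry_run=True)
```

Whichever tool you use, try it on a copy of your files first and keep a record of the old and new names.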

Folder structure can greatly affect your efficiency when dealing with your data, and it is particularly important to make a decision on folder structure when working collaboratively. If you are joining a team, there may be existing protocols and a folder structure in place that you will have to follow.

There are 3 main common-sense tips for organising your folder structure.

1. Decide on a hierarchy that fits your project and needs.

For example, will you have a lot of types of file and want to look for files of a type, such as interviews, transcripts, videos?

Alternatively, will you want to find files by which stage of a process you are at, such as initial data collection, cleaning, analysis?

Would it be best to keep documentation with the files it refers to, or in a separate folder?

2. Name folders appropriately, referring to the guidance given for file naming.

3. Separate active and completed files. If you have a lot of files you no longer need, consider moving them to a different secure storage location.

Once you've made a decision, you will want to consider writing this up in a README file that can be viewed by those adding files/editing them so that your structure can be consistently applied.
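As an illustration of the three tips and the README suggestion, the sketch below creates a stage-based folder hierarchy and writes a short README describing it. Everything in it is a hypothetical example: the folder names, their descriptions and the function name create_structure are assumptions, and a hierarchy organised by file type may suit your project better.

```python
from pathlib import Path

# Hypothetical stage-based hierarchy; rename the folders to suit your project.
STRUCTURE = {
    "01_raw_data": "Original files as collected. Do not edit.",
    "02_cleaned_data": "Anonymised and cleaned copies of the raw data.",
    "03_analysis": "Scripts, outputs and working files for analysis.",
    "04_documentation": "Protocols, consent forms and other documentation.",
    "99_completed": "Completed files that are no longer in active use.",
}


def create_structure(root):
    """Create the folders under root and write a README.txt describing them."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    lines = ["Folder structure for this project", ""]
    for name, purpose in STRUCTURE.items():
        (root / name).mkdir(exist_ok=True)
        lines.append(f"{name}: {purpose}")
    readme = root / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


if __name__ == "__main__":
    print(create_structure("example_project").read_text(encoding="utf-8"))
```

The numeric prefixes keep the folders in workflow order when sorted by name, and running the script again does not overwrite existing folders.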

The Consortium of European Social Science Data Archives [CESSDA] has created some useful guidance on creating a data file structure, which may be of use.


File Structure and Research Sites

If you are using a Research Site, you will need to configure your site and library and add metadata to files to make them easy to filter and find. The Research Area column in the example library (shown below) demonstrates the metadata that this site uses to sort and search files.

Research site library with columns

 

 

Versioning simply means managing different versions of your data. For example, you may want the ability to revisit a previous version of your data, or you may want to make sure that collaborators working on the same document are not overwriting the wrong versions.

There are many ways to version your files. Below we cover file numbering, version control tables and version control software:

Numbering Files

Your project may only require a way for you to distinguish between versions of your documents or files. You can easily do this by numbering the different versions of your document.

Version.Majorchange.Minorchange
This might appear as 2.4.1, where you are on the 2nd major version of a document, the 4th major change and the 1st minor change.

Majorchange.Minorchange
This might appear as 2.3, which would mean that you are on the 2nd major change and the 3rd minor change or edit.

 

There may be a disciplinary norm for version numbers. For example, in software development you may use semantic version numbers, which are structured as follows:

Major.Minor.Patch
Example: 3.2.4
Major = new features, structure, ideas or architecture that are not compatible with previous versions
Minor = new features and improvements that are backwards compatible
Patch = bug fixes and security updates that are backwards compatible
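To show how Major.Minor.Patch numbers behave, the sketch below parses a version string and works out the next number for each kind of change. It is a simplified illustration: the function names parse_version and bump are made up for this example, and it ignores the pre-release and build labels that full semantic versioning also allows.

```python
def parse_version(text):
    """Turn '3.2.4' into (3, 2, 4) so versions compare numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def bump(text, level):
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = parse_version(text)
    if level == "major":
        return f"{major + 1}.0.0"  # incompatible change: reset minor and patch
    if level == "minor":
        return f"{major}.{minor + 1}.0"  # compatible new feature: reset patch
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"  # compatible bug fix
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    for level in ("patch", "minor", "major"):
        print(level, bump("3.2.4", level))
    # Compare as numbers, not text: as text, "2.10.0" would sort before "2.9.1".
    print(parse_version("2.10.0") > parse_version("2.9.1"))
```

Comparing the parsed numbers rather than the raw text matters once any part of the version reaches two digits.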

 

Version Control Table

In each case it may be a good idea to include a version control table in your documents. This would detail which versions of the document there are, what the change was, who made it and when. This allows you to keep control of what changes have been made to the document.

Example of a Version Control Table

Version | Author | Purpose/Change | Date
1.1 | KYD | Amended results table | 2019.03.02
1.2 | KJR | Formatted section 3 | 2019.03.20
1.3 | SB | Altered references 3 and 4 | 2019.04.11
2.0 | JT | Restructured sections as per group meeting | 2019.04.28
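For data files, where a table cannot sit inside the document itself, the same information could be kept in a small log file stored alongside the data. The sketch below is one possible approach, assuming a CSV file with the same four columns as the example table; the function name add_entry and the file name used in the example are hypothetical.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["Version", "Author", "Purpose/Change", "Date"]


def add_entry(path, version, author, change, day=None):
    """Append one row to the version log, writing the header if the file is new."""
    path = Path(path)
    day = day or date.today()
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(FIELDS)
        # Date written as YYYY.MM.DD to match the example table.
        writer.writerow([version, author, change, day.strftime("%Y.%m.%d")])


if __name__ == "__main__":
    # Each run appends another row to the hypothetical log file.
    add_entry("version_log.csv", "1.1", "KYD", "Amended results table", date(2019, 3, 2))
    print(Path("version_log.csv").read_text(encoding="utf-8"))
```

Appending rather than rewriting the file means earlier entries are never changed, which keeps the history of the data intact.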

Version Control in Software

Software can help with version control, and the available tools vary in their features. Which one is best depends on you and your project.

OneDrive

OneDrive saves previous versions of documents and files and allows you to revert to them. For more information on this, see the TIS SharePoint pages for Team Sites. There are also tutorials available via LinkedIn Learning, such as the one below.

GitHub

Originally used by software developers, GitHub is a popular solution for projects where you want full control over versions and collaboration. GitHub skills transfer to many jobs and disciplines, so learning it is a good use of time. These slides from the Mozilla Science Lab 'Working Open Workshop' in Berlin are a great introduction to how GitHub can be useful for managing the files and data in your project.

LinkedIn has videos on these tools and features; you can find links to a couple of these in the gallery below.

Useful Videos