The volume and type of data you are collecting will affect your decisions around storage and back-up. It might also affect how you can share the data or collaborate with others. Try to anticipate these issues and explain how you will deal with them in your data management plan.
File formats are integral to ensuring that data is reusable in the future. Your choice of file format determines the software required to open the file. If that software becomes obsolete, if the format is updated to the point where older versions are no longer compatible, or if there are barriers to obtaining the software, then you and others will not be able to use the data in the future. You should also check with your chosen repository which file types it accepts.
Some considerations that may affect which file format you use:
Does your funder have any expectations as to how the data would be presented?
Does your research community have expectations as to which file formats are used?
What file formats does your chosen repository allow?
Is your file format widely adopted? Is it proprietary or open?
Is there any backwards compatibility of the file format?
Is there good support for metadata within the file format?
You should consider saving the data that you wish to retain in a format that can be opened by a wide variety of software, and which is unlikely to become outdated in the foreseeable future.
You may wish to save your files in an Open File Format at a later stage of the project, but it is also worth bearing in mind that converting a file from one type to another may change the quality or functionality of the file. The MANTRA Research Data Management course has a module on file formats and transformations, which may be of use if you would like to learn more. An updated version of this content can also be found via the Research Data Management and Sharing MOOC in Week 3.
The table below includes recommendations on file formats gathered from a variety of sources, including the Lancaster University library pages and the UK Data Service.
|Use||Recommended Format||Acceptable Formats||Avoid|
|Text||Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema||Hypertext Mark-up Language (.html); widely-used formats: MS Word (.doc/.docx); some software-specific formats: NUD*IST, NVivo and ATLAS.ti|| |
|Audio||Free Lossless Audio Codec (FLAC) (.flac)||MPEG-1 Audio Layer 3 (.mp3) if original created in this format; Audio Interchange File Format (.aif); Waveform Audio Format (.wav)||.wma; .ra; .ram; compression|
|Geospatial||ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data; Geography Markup Language (.gml)||ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages|| |
|Video||OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2)||AVCHD video (.avchd)||.wmv; .mov; .avi; compression|
|Image||TIFF 6.0 uncompressed (.tif)||JPEG (.jpeg, .jpg, .jp2) if original created in this format; TIFF other versions (.tif, .tiff); RAW image format (.raw); Photoshop files (.psd); Adobe Portable Document Format (PDF/A, PDF) (.pdf)|| |
|Data||.sql; .csv; .xml||.xlsx||.xls; proprietary DB formats|
|Tabular Data (extensive metadata)||SPSS portable format (.por); delimited text and command ('setup') file (SPSS, Stata, SAS, etc.); structured text or mark-up file of metadata information, e.g. DDI XML file||proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)|| |
|Tabular Data (minimal metadata)||comma-separated values (.csv); tab-delimited file (.tab); delimited text with SQL data definition statements||delimited text (.txt) with characters not present in data used as delimiters; widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)|| |
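If you have many files to review, a short script may help you spot formats to reconsider before deposit. The Python sketch below is illustrative only: the function name `classify_format` is hypothetical, the extension sets are a small hand-picked subset of the table above rather than the full list, and checking by extension alone cannot tell you whether, for example, an .mp3 was originally created in that format.

```python
from pathlib import Path

# A small subset of the table above, chosen for illustration (assumption:
# you would extend these sets to match your repository's own guidance).
RECOMMENDED = {".rtf", ".flac", ".csv", ".tab", ".por", ".xml"}
ACCEPTABLE = {".html", ".doc", ".docx", ".mp3", ".wav", ".aif", ".xlsx", ".sav", ".dta", ".psd"}
AVOID = {".wma", ".ra", ".ram", ".wmv", ".mov", ".avi"}


def classify_format(file_name: str) -> str:
    """Classify a file by its extension only, ignoring upper/lower case."""
    extension = Path(file_name).suffix.lower()
    if extension in RECOMMENDED:
        return "recommended"
    if extension in ACCEPTABLE:
        return "acceptable"
    if extension in AVOID:
        return "avoid"
    return "not listed"


if __name__ == "__main__":
    for name in ["interview_01.flac", "survey_results.xlsx", "site_visit.wmv"]:
        print(name, "->", classify_format(name))
```

A "not listed" result simply means the extension is missing from this subset, so check the table or your chosen repository before drawing a conclusion.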
In a data management plan you will need to mention what standards or methodologies you are using to collect the data. By this we mean any protocols or standard methods you are using when collecting or handling the data. This might include your protocols for data quality assurance, anonymisation and cleaning of data, or disciplinary standards such as taxonomies. You might also mention any specific software that is relevant to how you will collect or handle the data.
Have you ever tried to organise your files by date, only to find that the order is wrong because of the way you wrote the date in the file name? Or are you guilty of naming final drafts "...final.doc", "...finalfinal.doc" and "finalfinalabsolute.doc"?
The Software Carpentry 'Data Management' video below gives a good introduction to file naming, folder structure and versioning and how they relate to each other.
Being able to locate your files and use them is a key aspect of good research data management, particularly when working collaboratively. It is good practice to decide on a strategy for naming your files at the beginning of the project. This can also help with versioning and reduce mistakes. Stanford Libraries have some useful examples of good and bad file naming practice on their web pages.
Tips on naming files:
The name should describe the file, so name files purposefully according to their content or their role in your work, rather than using individuals' names or document types.
Try to keep the name concise: 25 characters or fewer.
Certain software and different operating systems have restrictions on characters that can be used within a file name. To ensure interoperability, avoid using spaces in file names and special characters such as: * : \ / < > | " ? [ ] ; = + & £ $ , .
If you are collaborating, decide on a protocol for file naming to help keep things consistent, then stick to your protocol! You can document your protocol in your data management plan.
For dates, put them in YYYY_MM_DD order at the beginning of the file/folder name.
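The tips above can be turned into a quick automated check. The Python sketch below is one possible approach, not a standard tool: the names `check_file_name` and `dated_name` are hypothetical, and it assumes the 25-character limit applies to the name without its extension and that the full stop separating the extension is allowed.

```python
from datetime import date

# Characters the tips above suggest avoiding, plus the space character.
CHARACTERS_TO_AVOID = set('*:\\/<>|"?[];=+&£$,. ')
MAX_LENGTH = 25  # assumption: measured without the file extension


def check_file_name(file_name: str) -> list[str]:
    """Return a list of problems with a file name (empty if none found)."""
    stem = file_name.rsplit(".", 1)[0]  # drop the extension, if there is one
    problems = []
    if len(stem) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    found = sorted(set(stem) & CHARACTERS_TO_AVOID)
    if found:
        problems.append("contains characters to avoid: " + ", ".join(repr(c) for c in found))
    return problems


def dated_name(day: date, description: str, extension: str) -> str:
    """Build a name with the date first, in YYYY_MM_DD order."""
    return f"{day:%Y_%m_%d}_{description}.{extension}"


if __name__ == "__main__":
    print(check_file_name("final final.doc"))
    print(dated_name(date(2019, 3, 2), "interview", "csv"))
    # Names that start with YYYY_MM_DD sort chronologically as plain text.
    print(sorted(["2019_10_01_survey.csv", "2019_03_02_interview.csv"]))
```

Putting the year first is what makes the alphabetical order and the chronological order agree, which is why the date goes at the start of the name.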
Bulk File Renaming
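If you already have a folder of inconsistently named files, renaming them one by one is slow and error-prone. The Python sketch below shows one way you might do this in bulk, under some stated assumptions: the function names `plan_renames` and `apply_renames` are hypothetical, only letters, digits, hyphens and underscores are kept (so accented letters are also replaced), and nothing is renamed until you switch off the dry run. Work on a copy of your files first.

```python
import re
from pathlib import Path


def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """List (old, new) pairs for files whose names would change."""
    plan = []
    for path in sorted(folder.iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue  # leave sub-folders and hidden files alone
        # Replace each run of other characters with a single underscore.
        new_stem = re.sub(r"[^A-Za-z0-9_-]+", "_", path.stem).strip("_")
        if not new_stem:
            continue  # nothing usable left in the name; rename by hand
        new_path = path.with_name(new_stem + path.suffix)
        if new_path != path:
            plan.append((path, new_path))
    return plan


def apply_renames(plan: list[tuple[Path, Path]], dry_run: bool = True) -> None:
    """Print the planned renames, and carry them out if dry_run is False."""
    for old, new in plan:
        if new.exists():
            print(f"skipping {old.name}: {new.name} already exists")
        elif dry_run:
            print(f"would rename {old.name} -> {new.name}")
        else:
            old.rename(new)


if __name__ == "__main__":
    # Preview the changes for the current folder without renaming anything.
    apply_renames(plan_renames(Path(".")), dry_run=True)
```

Separating the plan from the renaming step lets you review every change before it happens, and the check for existing files avoids overwriting anything.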
Folder structure can greatly affect your efficiency when dealing with your data, and it is particularly important to make a decision on folder structure when working collaboratively. If you are joining a team, there may be existing protocols and folder structures in place that you will have to follow.
There are 3 main common-sense tips for organising your folder structure.
1. Decide on a hierarchy that fits your project and needs:
For example, will you have a lot of types of file and want to look for files of a type, such as interviews, transcripts, videos?
Alternatively, will you want to find files by which stage of a process you are at, such as initial data collection, cleaning, analysis?
Would it be best to keep documentation with the files it refers to, or in a separate folder?
2. Name folders appropriately, referring to the guidance given for file naming.
3. Separate active and completed files: consider a different secure storage location for files you no longer need if there are too many of these.
Once you've made a decision, you will want to consider writing this up in a README file that can be viewed by those adding files/editing them so that your structure can be consistently applied.
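Once agreed, a structure can be created the same way for every project with a few lines of code. The Python sketch below is a minimal example: the folder names are hypothetical (a stage-based hierarchy with separate documentation and archive folders), as is the function name `create_structure`, so substitute whatever your team decides.

```python
from pathlib import Path

# Hypothetical stage-based hierarchy; replace with your agreed structure.
FOLDERS = ["01_raw_data", "02_cleaned_data", "03_analysis", "documentation", "archive"]


def create_structure(root: Path) -> Path:
    """Create the project folders and a README listing them."""
    root.mkdir(parents=True, exist_ok=True)
    for name in FOLDERS:
        (root / name).mkdir(exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing README
        lines = ["Folder structure", ""] + [f"{name}/" for name in FOLDERS]
        readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


if __name__ == "__main__":
    print(create_structure(Path("my_project")))
```

The numeric prefixes keep the folders in process order when sorted by name. You would still need to add a sentence to the README explaining what belongs in each folder.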
The Consortium of European Social Sciences Data Archives [CESSDA] have created some useful guidance on creating a data file structure, which may be of use.
File Structure and Research Sites
If you are using a Research Site, you will not need to create a file structure but will instead need to configure your site and library and add metadata to files to make them easy to filter and find. The Research Area column in the example library (shown below) demonstrates metadata that this site is using to sort and search files. You can access University of Plymouth training on team sites via Employee Self Service, or find more information and contacts via the Research Sites pages.
Versioning simply means managing different versions of your data. For example, you may want the ability to revisit a previous version of your data, or you may want to make sure that collaborators working on the same document are not overwriting the wrong versions.
There are many ways to version, some of which are explored below:
Your project may only require a way for you to distinguish between versions of your documents or files. You can easily do this by numbering the different versions of your document.
There may be a disciplinary norm for version numbers. For example, in software/coding you may use semantic version numbers, which are structured as so...
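As an illustration of semantic version numbers, the Semantic Versioning specification uses three numbers in the form MAJOR.MINOR.PATCH, for example 2.1.0. The Python sketch below shows how such numbers can be parsed, compared and incremented. The names `Version`, `parse_version` and `bump` are hypothetical, and the sketch ignores the optional pre-release and build labels that the specification also allows.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int  # incremented for incompatible changes
    minor: int  # incremented for backwards-compatible additions
    patch: int  # incremented for backwards-compatible fixes


def parse_version(text: str) -> Version:
    """Parse a 'MAJOR.MINOR.PATCH' string such as '1.4.2'."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def bump(version: Version, part: str) -> Version:
    """Increase one part of the version and reset the parts to its right."""
    if part == "major":
        return Version(version.major + 1, 0, 0)
    if part == "minor":
        return Version(version.major, version.minor + 1, 0)
    if part == "patch":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown part: {part}")


if __name__ == "__main__":
    current = parse_version("1.9.3")
    print(bump(current, "minor"))
    # Comparing the numbers part by part avoids the text-sorting trap
    # in which "1.10.0" would come before "1.9.3".
    print(parse_version("1.10.0") > current)
```

Storing the parts as numbers rather than text is the key design choice, because it makes version 1.10.0 correctly sort after 1.9.3.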
In each case it may be a good idea to include a version control table in your documents. This would detail which versions of the document there are, what the change was, who made it and when. This allows you to keep control of what changes have been made to the document.
|Version||Author||Change||Date|
|1.1||KYD||Amended results table||2019.03.02|
|1.2||KJR||Formatted section 3||2019.03.20|
|1.3||SB||Altered references 3 and 4||2019.04.11|
|2.0||JT||Restructured sections as per group meeting||2019.04.28|
OneDrive saves previous versions of documents and files and allows you to revert to them. For more information on this and on using OneDrive collaboratively, refer to the Research Sites pages and the Office:365 Advanced Techniques pages. There are also tutorials available via LinkedIn Learning, such as the one below.
Originally developed for software developers, GitHub is a popular solution for projects where you want full control over versions and collaboration. GitHub skills transfer to many jobs and disciplines, so learning it is a good use of time. These slides from the Mozilla Science Lab 'Working Open Workshop' in Berlin are a great introduction to how GitHub can be useful for managing the files and data in your project.
Subversion is another software version control tool. It operates on a centralised system, whereas Git, the tool that GitHub is built around, is decentralised. There are pros and cons to both, and an Internet search will usually turn up information biased towards one side or the other. It really comes down to what works best for you and what you prefer to use.