The volume and type of data you are collecting will affect your decisions around storage and back-up. It might also affect how you can share the data or collaborate with others. Try to anticipate these issues and explain in your data management plan how you will deal with them.
File formats are integral to ensuring that data is reusable in the future. Your choice of file format determines the software required to open the file. If that software becomes obsolete, if its file format is updated to the point where older versions are no longer compatible, or if there are barriers to obtaining the software, then you and others will not be able to make use of the data in the future. You should also check with your chosen repository which file types it accepts.
Some considerations that may affect which file format you use:
Does your funder have any expectations as to how the data would be presented?
Does your research community have expectations as to which file formats are used?
What file formats does your chosen repository allow?
Is your file format widely adopted? Is it proprietary or open?
Is the file format backwards compatible?
Is there good support for metadata within the file format?
You should consider saving the data that you wish to retain in a format that can be opened by a wide variety of software, and which is unlikely to become outdated in the foreseeable future.
You may wish to save your files in an Open File Format at a later stage of the project, but it is also worth bearing in mind that transforming a file from one type to another may change the quality or functionality of the file. The MANTRA Research Data Management course has a module on file formats and transformations, which may be of use if you would like to learn more. An updated version of this content can also be found in Week 3 of the Research Data Management and Sharing MOOC.
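As a small illustration of how a transformation can change a file's functionality, the sketch below writes a couple of records to CSV text and reads them back using only Python's standard library. The records, field names, and the `round_trip_through_csv` helper are hypothetical and chosen purely for illustration; the point is that plain CSV carries no type information, so numbers and missing values come back as text.

```python
import csv
import io

# Hypothetical records: an integer ID, a float measurement, and a missing note.
records = [
    {"sample_id": 1, "mass_g": 2.5, "note": None},
    {"sample_id": 2, "mass_g": 10.0, "note": "re-weighed"},
]


def round_trip_through_csv(rows):
    """Write rows to CSV text and read them back, as a format conversion would."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = round_trip_through_csv(records)

# Compare the type of each value before and after the conversion.
for before, after in zip(records, restored):
    for field in before:
        print(field, type(before[field]).__name__, "->", type(after[field]).__name__)
```

After the round trip every value is a string, and the missing note has become an empty string. This is one reason it can help to document data types and missing-value conventions alongside a converted file.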
The table below includes recommendations on file formats gathered from a variety of sources, including the Lancaster University library pages and the UK Data Service.
Use | Recommended Format | Acceptable Formats | Avoid |
---|---|---|---|
Textual | Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema | Hypertext Mark-up Language (.html); widely-used formats: MS Word (.doc/.docx); some software-specific formats: NUD*IST, NVivo and ATLAS.ti | .doc |
Audio | Free Lossless Audio Codec (FLAC) (.flac) | MPEG-1 Audio Layer 3 (.mp3) if original created in this format; Audio Interchange File Format (.aif); Waveform Audio Format (.wav) | .wma; .ra; .ram; compression |
Geospatial Data | ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data; Geography Markup Language (.gml) | ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages | |
Video | MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2) | AVCHD video (.avchd) | .wmv; .mov; .avi; compression |
Image | TIFF 6.0 uncompressed (.tif) | JPEG (.jpeg, .jpg, .jp2) if original created in this format; GIF (.gif); TIFF other versions (.tif, .tiff); RAW image format (.raw); Photoshop files (.psd); BMP (.bmp); PNG (.png); Adobe Portable Document Format (PDF/A, PDF) (.pdf) | .psd; compression |
Data | .sql; .csv; .xml | .xlsx | .xls; proprietary DB formats |
Tabular Data (extensive metadata) | SPSS portable format (.por); delimited text and command ('setup') file (SPSS, Stata, SAS, etc.); structured text or mark-up file of metadata information, e.g. DDI XML file | proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb) | |
Tabular Data (minimal metadata) | comma-separated values (.csv); tab-delimited file (.tab); delimited text with SQL data definition statements | delimited text (.txt) with characters not present in data used as delimiters; widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods) | |
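
If you want to check a folder of files against recommendations like those in the table above, a lookup of this kind can be sketched in a few lines of Python. This is only a minimal illustration: the `FORMAT_GUIDANCE` dictionary and the `check_format` function are hypothetical names, and the dictionary covers just the Audio, Video and Data rows of the table, so you would need to extend it for other uses and confirm the result against your repository's own list.

```python
from pathlib import PurePath

# Hypothetical lookup built from the Audio, Video and Data rows of the table.
FORMAT_GUIDANCE = {
    "Audio": {
        "recommended": {".flac"},
        "acceptable": {".mp3", ".aif", ".wav"},
        "avoid": {".wma", ".ra", ".ram"},
    },
    "Video": {
        "recommended": {".mp4", ".ogv", ".ogg", ".mj2"},
        "acceptable": {".avchd"},
        "avoid": {".wmv", ".mov", ".avi"},
    },
    "Data": {
        "recommended": {".sql", ".csv", ".xml"},
        "acceptable": {".xlsx"},
        "avoid": {".xls"},
    },
}


def check_format(filename, use):
    """Return 'recommended', 'acceptable', 'avoid' or 'unlisted' for a file."""
    extension = PurePath(filename).suffix.lower()
    for status, extensions in FORMAT_GUIDANCE[use].items():
        if extension in extensions:
            return status
    return "unlisted"


# Example file names are made up for illustration.
for name, use in [("interview_01.WAV", "Audio"), ("results.xls", "Data")]:
    print(name, "->", check_format(name, use))
```

The check looks only at the file extension, which says nothing about compression settings or the version of a format, so treat it as a first pass and not as a substitute for reading the guidance.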