File formats

What to consider when choosing a file format

Several factors matter when choosing a file format for digital data. The choice should be planned early in the research cycle to ensure that the format suits every purpose the data are expected to serve.

The points to consider are:

  • What format is best suited for data creation?
  • What format is best suited for data analyses and other planned uses?
  • What format is best suited for long-term sustainability and sharing of data?
  • Should you choose an open or a proprietary format?
  • Should the format be lossy or not?
  • Is the format suitable for conversion?

The format and software in which research data are created usually depend on how researchers choose to collect and analyse data, on the hardware being used, or on the availability of software. They can also be determined by discipline-specific standards and customs. For example:

  • Image, audio and video data formats may be determined by the kind of camera or recording equipment being used. If data are not recorded at high quality initially, their quality cannot be improved later. It may therefore be wise to collect data at maximum fidelity, as they can always be downgraded and reduced in size, but not the other way around. Also consider which format would be best suited to all the planned uses and conversions.
  • Numerical data are typically placed in spreadsheets or databases, where cases or records are arranged against variables or measurements. For social science surveys, the file format of choice is often SPSS because of its statistical analysis capabilities. In ecological research, CSV or MS Excel are more widely used, being the standard data input formats for many analytical packages.
  • Qualitative research data, like interviews, may initially be collected as digital audio recordings, such as in WAV or MP3 format, and then be transcribed as textual files, such as in MS Word. Such data are frequently analysed using computer-assisted qualitative data analysis software (CAQDAS), such as NVivo or ATLAS.ti, whereby textual files are imported into the CAQDAS database.

Formats for long-term accessibility

When thinking about long-term accessibility and usability of research data, sustainable digital file formats and software are needed. For many formats, there is a danger that they will become obsolete in the future, which would make the data impossible to read and interpret.

Many software packages are backward compatible and can import data created in earlier versions, and competing popular software programmes are often interoperable. Even so, the safest option to guarantee long-term data access is to convert data to standard or open formats.

Not only can most software packages interpret these, but they are also suitable for data interchange and transformation, and are likely to stand a better chance of being reused well into the future.
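As a rough illustration of such a conversion, the sketch below writes a small table to CSV and reads it back using only the Python standard library. The records, field names and helper functions (to_csv_text, from_csv_text) are hypothetical stand-ins for data exported from an analysis package; reading a proprietary file in the first place would need additional software that is not shown here.

```python
import csv
import io

# Hypothetical survey records, standing in for data exported from an analysis package.
records = [
    {"respondent_id": 1, "age": 34, "region": "North"},
    {"respondent_id": 2, "age": 51, "region": "South, coastal"},
]

def to_csv_text(rows):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def from_csv_text(text):
    """Read CSV text back into a list of dicts; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))

csv_text = to_csv_text(records)
restored = from_csv_text(csv_text)
```

Note that CSV carries no type information, so numbers are returned as text. When converting, it may help to document variable types and labels in a separate plain-text file kept alongside the data.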

File formats can be proprietary or open

  • Proprietary formats are owned by a company that holds the intellectual property rights and controls use of the associated software by granting licences. Widely used proprietary formats include those of the Microsoft Office software products (MS Word, Rich Text Format and MS Excel) and the popular SPSS format. These are likely to have long-term sustainability because they are so widely used.
  • Examples of open file formats are PDF/A, TIFF, OpenDocument Format (ODF), ASCII, tab-delimited format, comma-separated values (CSV) and XML.
  • File formats can also be lossy or lossless. Lossy formats save space by removing detailed information that is assumed to be unimportant. For example, the lossy format JPEG removes fine detail in images, whilst the lossless format TIFF keeps all the detail. Repeatedly editing and saving files in a lossy format also results in a cumulative loss of information.
  • While researchers will use the most suitable data formats and software for their planned analyses during their research, data conversion must be considered once data analysis is completed and data are to be prepared for long-term storage. Using open, standard, interchangeable and longer-lasting formats avoids being unable to use the data in the future. This is also recommended for any backups. For long-term digital preservation, data centres and archives hold data in open and standard formats.
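The difference between lossless and lossy storage described above can be sketched with a toy example. The sample values and the quantise helper are hypothetical, and the rounding step is only a crude stand-in for lossy compression; it is not how JPEG or MP3 actually work. The lossless path uses zlib from the Python standard library.

```python
import zlib

# Hypothetical 8-bit samples standing in for pixel or audio values.
samples = bytes([12, 13, 14, 200, 201, 203, 90, 91])

# Lossless: compress and decompress; the bytes come back identical.
restored_lossless = zlib.decompress(zlib.compress(samples))

def quantise(data, step=16):
    """Crude lossy step: round each value down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)

# Lossy: fine differences between neighbouring values are discarded for good.
restored_lossy = quantise(samples)
```

Here restored_lossless equals the original samples, whereas restored_lossy no longer distinguishes 12, 13 and 14 from one another, and nothing in the stored result allows the original values to be recovered.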

The UK Data Archive provides information on the file formats it recommends for long-term preservation.