Digitisation

Converting research data into digital format

Research data and accompanying documentation are easier to manage and share when they are in digital format.

Non-digital data can be converted to digital form in a variety of ways, depending on their format and condition. Information can be entered manually by keyboard into a text or database template.

Image scans can be created with a document scanner or by digital photography. Text can also be digitised from image scans using optical character recognition (OCR).

Research data collected in the past might be available in a variety of media and formats. The most common are as follows:

  • Text-based materials, such as handwritten diaries, field notes, tables, diagrams, annotated printed questionnaires or typewritten text. These are usually in paper format.
  • Images of people, places or objects, such as from ethnographic work. These are usually photographs or slides.
  • Audio-visual recordings of interviews or observation, typically held as analogue audio cassettes, microcassettes, video tapes or reel-to-reel tapes.
  • Diagrams, such as maps, plans or blueprints, usually paper-based.

Digitising text

Textual data can be digitised to different levels depending on the quality of the writing or typeface.

Scanning

Text can be scanned and saved as a TIFF image file. This is the best method for text in a poor typeface, legible handwritten text, or text with multiple tables and graphs. If information needs to be anonymised, it should be blacked out with a marker on a copy (not the original) before scanning.

Precious materials should be photocopied before being fed into any multi-feed scanner, in case they get damaged. If a digital camera is used to capture images, it should have sufficient resolution, measured in megapixels. The camera should be secured on a horizontal mount, with good overhead or dedicated lighting. Camera images should be transferred to safe media and the files kept well organised.
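What counts as sufficient resolution depends on the page size and the level of detail needed. The following is a minimal sketch of the arithmetic, not a recommendation: the 300 dpi default and the function name are assumptions made for illustration, and the calculation assumes the page exactly fills the camera frame, which rarely happens in practice.

```python
import math

MM_PER_INCH = 25.4


def required_megapixels(width_mm, height_mm, dpi=300):
    """Return (pixel_width, pixel_height, megapixels) for a page at a target dpi.

    The 300 dpi default is an illustrative assumption, not a standard
    taken from this guidance.
    """
    px_w = math.ceil(width_mm / MM_PER_INCH * dpi)
    px_h = math.ceil(height_mm / MM_PER_INCH * dpi)
    return px_w, px_h, px_w * px_h / 1_000_000


# An A4 page measures 210 mm x 297 mm.
a4_w, a4_h, a4_mp = required_megapixels(210, 297)
print(f"A4 at 300 dpi: {a4_w} x {a4_h} pixels, about {a4_mp:.1f} megapixels")
```

Because a photographed page usually leaves some margin around the edges, a camera with a sensor somewhat larger than the computed figure is the safer choice.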

Searchable PDF/A

If the original document has multiple pages, the resulting scanned TIFF files can be collated into a searchable PDF/A file using ‘Paper Capture’ in Adobe Acrobat. The PDF/A file can be bookmarked to aid navigation, with a contents page and headings.

Rich text version via OCR

Text with good typeface can be scanned as an image file and then processed using optical character recognition (OCR) software that recognises text. Some training of the system may be required to enable it to recognise non-standardised words, such as technical terminology.

Checking and proofing of the resulting OCR text against the original text source is necessary as errors do occur. This can be rather time-consuming work. The resulting file can be saved and formatted as a word-processed file.
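Proofing can be made a little quicker by flagging words that do not appear in a known word list, so that the proofreader knows where to look first. The sketch below is one possible aid, not a substitute for checking against the original: the function name, the word list and the sample text are all hypothetical, and a real word list would need to include the project's technical terminology.

```python
import re


def flag_unknown_words(ocr_text, known_words):
    """Return (line_number, word) pairs for words missing from known_words.

    Comparison is case-insensitive. Tokens made only of digits are skipped.
    """
    known = {word.lower() for word in known_words}
    flagged = []
    for line_no, line in enumerate(ocr_text.splitlines(), start=1):
        for token in re.findall(r"[\w'-]+", line):
            word = token.strip("'-")
            if not word or word.isdigit():
                continue
            if word.lower() not in known:
                flagged.append((line_no, word))
    return flagged


# Hypothetical OCR output in which "employed" was misread as "ernployed".
sample_ocr = "The respondent was ernployed\nin the textile industry in 1962"
sample_words = ["the", "respondent", "was", "employed", "in", "textile", "industry"]
flagged_words = flag_unknown_words(sample_ocr, sample_words)
print(flagged_words)
```

A flagged word is only a prompt to look at the original page. OCR errors that happen to produce a valid word will not be caught this way.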

Transcription

Text can be manually transcribed by keyboard from the original source. The transcription should be kept as close to the original as possible; if changes are made, such as correcting typos, they should be indicated in brackets.
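If corrections are applied after transcription, marking them consistently can be automated. The following is a minimal sketch under stated assumptions: the square-bracket form `[corrected]` is only one possible convention, and the function name and sample text are hypothetical.

```python
import re


def mark_corrections(text, corrections):
    """Replace whole words listed in corrections with their bracketed fix.

    corrections maps the word as transcribed to the corrected word.
    The "[corrected]" bracket style is an assumed convention.
    """
    if not corrections:
        return text
    # Longest keys first so that a longer word wins over a shorter prefix.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A single pass, so already corrected text is never corrected again.
    return pattern.sub(lambda m: "[" + corrections[m.group(1)] + "]", text)


marked = mark_corrections(
    "We recieved teh forms.",
    {"recieved": "received", "teh": "the"},
)
print(marked)
```

Whatever convention is chosen, it is worth recording it alongside the transcripts so that later users can tell editorial changes from the original wording.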

Formats and copyright

Formats

  • Images can be scanned or photographed and saved in a TIFF file format.
  • Audio is best digitised to WAV file format and video to MPEG or motion JPEG 2000 format.
  • Maps can be digitised into raster format through scanning or into vector format by digitising map features, such as lines, points, shapes, with a digitising tablet or via on-screen digitisation from a scanned image.
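For audio digitised to WAV, the technical parameters of the result can be checked directly from the file. The sketch below uses Python's standard `wave` module; the function name `wav_summary` is hypothetical, and the one second of silence is generated in memory purely so that the example runs without an existing file.

```python
import io
import wave


def wav_summary(source):
    """Return basic technical parameters of a WAV file (path or file object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bits": wav.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }


# Build one second of 16-bit mono silence in memory for demonstration only.
demo = io.BytesIO()
with wave.open(demo, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes per sample = 16 bits
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 44100)
demo.seek(0)

summary = wav_summary(demo)
print(summary)
```

The same function accepts a file path, so it could be run over a folder of digitised recordings to confirm that they share the intended settings.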

Copyright and process

When considering the digitisation of sources where the content is not your own, attention needs to be paid to the copyright of the original material. You should also document the digitisation process, as this provides important information on the data quality, the data source and the purpose of the digitisation. Remember that the files themselves already contain some useful metadata.
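One way to document the process is to keep a small structured record alongside each digitised file. The following is a sketch only: the field names, the function name and the sample values are assumptions, and the SHA-256 checksum is an addition not mentioned above, included so that later copies of the file can be verified.

```python
import hashlib
import json
from datetime import date


def digitisation_record(file_bytes, file_name, source_description,
                        method, purpose, digitised_on=None):
    """Build a dictionary describing how one file was digitised.

    All field names are illustrative rather than a formal metadata standard.
    """
    when = digitised_on or date.today()
    return {
        "file_name": file_name,
        "size_bytes": len(file_bytes),
        "sha256": hashlib.sha256(file_bytes).hexdigest(),
        "source_description": source_description,
        "method": method,
        "purpose": purpose,
        "digitised_on": when.isoformat(),
    }


# Hypothetical example; real use would read the bytes from the scanned file.
record = digitisation_record(
    file_bytes=b"abc",
    file_name="fieldnotes_p001.tif",
    source_description="Handwritten field notes, page 1, paper",
    method="Flatbed scan saved as TIFF",
    purpose="Preservation and sharing",
    digitised_on=date(2020, 1, 31),
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the image file keeps the description of the process together with the data it describes.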