Data documentation: quantitative data
Structured tabular data should be documented with the following, where applicable:
- Variable names, labels and descriptions (maximum 80 characters).
- Units of measurement for variables.
- Reference to the question number of a survey or questionnaire.
Example: variable ‘q11hexw’ with label ‘Q11: Hours spent taking physical exercise in a typical week’. The label gives the unit of measurement and a reference to the question number (Q11).
- Value code labels.
Example: variable ‘p1sex’ = ‘sex of respondent’ with codes ‘1=female’, ‘2=male’, ‘8=don’t know’, ‘9=not answered’.
- Coding and classification schemes explained, with a bibliographic and dated reference (some standards change over time).
Examples: Standard Occupational Classification, 2000 (a series of codes to classify respondents’ jobs); ISO 3166 alpha-2 country codes (an international standard of 2-letter country codes).
- Codes for missing data, with reason data are missing (blanks, system-missing or ‘0’ values are best avoided).
Example: ‘99=not recorded’, ‘98=not provided (no answer)’, ‘97=not applicable’, ‘96=not known’, ‘95=error’.
- Deviating universe information for variables in case of skipped cases or questions.
- Derived or constructed variables created after collection, giving the code, algorithm or command files used to create them. Simple derivations, such as grouping age data into age intervals, can be explained in the variable and value labels. Complex derivations can be described by providing the algorithms, logical statements or functions used to create the derived variables, such as the SPSS or Stata command files.
Uncoded, ungrouped and underived raw data provide more re-use options than those where coding, grouping or derivation has been applied, allowing secondary users to apply their own codes, groupings or derivations.
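The items listed above can be kept together in a small machine-readable data dictionary. The following is a minimal sketch in Python, not a prescribed format: the `Variable` class, the `age` variable, its missing code 999 and the age bands are all hypothetical, chosen only for illustration. It shows value labels, missing-data codes with reasons, and a simple derived variable whose rule is documented in code while the raw value is kept.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Hypothetical container for variable-level documentation."""
    name: str
    label: str
    values: dict = field(default_factory=dict)   # code -> value label
    missing: dict = field(default_factory=dict)  # code -> reason data are missing

    def decode(self, code):
        """Return the value label, or None if the code marks missing data."""
        if code in self.missing:
            return None
        return self.values.get(code, code)


p1sex = Variable(
    name="p1sex",
    label="Sex of respondent",
    values={1: "female", 2: "male"},
    missing={8: "don't know", 9: "not answered"},
)

# Hypothetical 'age' variable. The missing code 999 is an assumption, chosen
# so that it cannot collide with a valid age (99 would be ambiguous here).
age = Variable(name="age", label="Age in years", missing={999: "not recorded"})

# Hypothetical age bands: (lowest age, highest age, label).
AGE_BANDS = [(0, 17, "0-17"), (18, 34, "18-34"), (35, 64, "35-64"), (65, 120, "65+")]


def age_group(raw_age):
    """Derive an age band from a raw age; missing codes map to None."""
    if raw_age in age.missing:
        return None
    for low, high, band in AGE_BANDS:
        if low <= raw_age <= high:
            return band
    raise ValueError(f"age outside documented range: {raw_age}")


# Keep the raw value next to the derived one so re-users can regroup it.
record = {"age": 42, "age_group": age_group(42)}
print(record)
```

Storing the band boundaries as data rather than burying them in conditional statements means the same table can be published as part of the documentation.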
Embedding data documentation
Many data software packages have facilities for data annotation and description as variable attributes (labels, codes, data type, missing values), table relationships, etc.
- Example embedded documentation SPSS file: Variable descriptions and attributes, such as codes, data type and missing values, can be documented for each variable in ‘Variable View’ or via syntax. When syntax is used, the embedded data documentation is also contained in the SPSS command file.
- Example embedded documentation MS Access database: Variable descriptions and attributes can be documented in ‘Design View’ and relationships between tables and files can be created.
- Example embedded documentation GIS (e.g. ArcGIS): Shapefiles or layers and tables can be organised in a geo-database with rich metadata created in ArcCatalog.
- Example embedded documentation MS Excel spreadsheet: An additional worksheet within the data file can contain variable and data-related documentation.
A structured dataset may also be accompanied by a codebook detailing all variables and their values. This can be created by importing frequency distribution outputs, created from the software package used, into a word processor, with annotation added where necessary.
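As a rough illustration of a codebook entry built from a frequency distribution, the sketch below uses only the Python standard library. The function name `codebook_entry` and the sample responses are hypothetical; in practice the frequencies would come from the statistical package used to hold the data.

```python
from collections import Counter


def codebook_entry(name, label, value_labels, data):
    """Format one codebook entry: variable label, codes, labels, frequencies."""
    counts = Counter(data)
    lines = [f"{name}: {label}"]
    # List every documented code, plus any undocumented code found in the data.
    for code in sorted(set(value_labels) | set(counts)):
        text = value_labels.get(code, "(no label)")
        lines.append(f"  {code} = {text}: n={counts[code]}")
    return "\n".join(lines)


# Hypothetical responses for illustration only.
responses = [1, 2, 2, 9, 1, 2, 8]
labels = {1: "female", 2: "male", 8: "don't know", 9: "not answered"}

entry = codebook_entry("p1sex", "Sex of respondent", labels, responses)
print(entry)
```

Codes that appear in the data without a label are flagged as ‘(no label)’, which is a quick way to spot gaps in the documentation before annotating the codebook by hand.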
Structured metadata: XML schemas
More comprehensive variable level documentation, including basic data dictionary information, question text and question routing instructions, can also be created using a structured metadata format.
XML is often used to enable this, such as in the Data Documentation Initiative (DDI). Detailed DDI documentation can be directly created from various software packages, using DDI-specific XML authoring tools.
Such standardised documentation in XML format can be used for data extract and analysis engines, such as Nesstar; see for example the datasets included in our Nesstar catalogue.
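To give a feel for variable-level documentation in XML, the following sketch builds a single variable description with Python's standard `xml.etree.ElementTree` module. The element names (`var`, `labl`, `qstn`, `qstnLit`, `catgry`, `catValu`) are modelled loosely on DDI Codebook, but this is a simplified fragment under stated assumptions: namespaces and required parent elements are omitted, it has not been validated against the DDI schema, and the question text is hypothetical.

```python
import xml.etree.ElementTree as ET


def variable_to_xml(name, label, question_text, categories, missing_codes):
    """Build a simplified, DDI-inspired XML element for one variable."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question_text
    for code, text in categories.items():
        # Mark categories that represent missing data.
        attrs = {"missing": "Y"} if code in missing_codes else {}
        catgry = ET.SubElement(var, "catgry", attrs)
        ET.SubElement(catgry, "catValu").text = str(code)
        ET.SubElement(catgry, "labl").text = text
    return var


var = variable_to_xml(
    name="p1sex",
    label="Sex of respondent",
    question_text="What is your sex?",  # hypothetical question wording
    categories={1: "female", 2: "male", 8: "don't know", 9: "not answered"},
    missing_codes={8, 9},
)
xml_text = ET.tostring(var, encoding="unicode")
print(xml_text)
```

A DDI-specific authoring tool would normally produce and validate the full document; the point here is only that labels, question text and category codes become structured fields that software can read.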