Strategies for structuring and documenting data so that you, and others, can find and use it into the future.
- Defining documentation and metadata
- Purpose of documentation and metadata
- Decisions about documentation and metadata
- Types of documentation and metadata standards
- Controlled vocabularies
Researchers must ensure that sufficient documentation or metadata (i.e. information about the data) is created and maintained to enable research data to be found, used and managed throughout its lifecycle.
Documentation and metadata requirements will differ depending on the discipline and the nature of the research. They should be identified during data planning and adopted for use by all the researchers working on the project.
Data documentation provides provenance or context for the data and ensures that the data can be understood in the long term. It may include information such as:
- why the data was collected
e.g. information about the research project aims and objectives
- how data was collected
e.g. instruments and processes used
- how data is structured
e.g. names, labels and descriptions for data elements, and any rules relating to the values that are in them (coding schemes, classification schemes)
- quality control measures, and any modifications to the data over time
- confidentiality and consent agreements
- listings of data objects
- any other information aimed at helping data users to analyse and interpret the data
e.g. user guides or manuals.
In the context of research data, metadata can be considered as a subset of your overall data documentation. Metadata is usually structured using standards or schemas. Common types of metadata include:
- descriptive metadata: identifies the resource and enables it to be discovered
- administrative and technical metadata: enables a resource to be better managed, and in some cases preserved over time, by capturing information such as creation and modification dates, file formats and access restrictions.
Documentation and metadata assist with many aspects of research, scholarly publishing and research data management, including the following:
- identifying the data
- helping to find the data
- associating the data with its owners and creators
- creating links between the data and other related data or publications
- providing context for the data, e.g. by locating its collection or creation at a certain time and place
- enabling the quality of the data to be assessed, and research results to be validated.
Data that has been poorly documented will be difficult (or impossible) to find. Even if the data can be found, its value will be diminished if it is hard to interpret the contents, and to judge the quality or validity of the data. If it is not possible to determine when, where, how, and by whom the data was originally produced, there is also the risk that the data could be exploited inappropriately, or even accidentally destroyed.
- What documentation and metadata is needed?
This should be driven by the short- and long-term needs of the users of the data. You need to consider how the project team will retrieve and use data during the project, and also how the needs of any future researchers might be met. If you plan to deposit data in a repository or archive, you should consider deposit requirements early in the project to avoid having to retrospectively create the required level of documentation and metadata.
- What is the data object being described?
The documentation or metadata may need to describe a data collection or dataset, individual items, or parts of an item: more than one type of documentation and metadata is likely to apply, depending on the needs of the users of the data that you have identified.
- Where will the documentation and metadata be located?
Some metadata can be stored within the data object that is being described, while some documentation and metadata would usually be stored externally. If you are storing documentation and metadata externally, you will need to consider what tools may be available for storing and managing them.
- Who is responsible for the documentation and metadata?
You need to determine who will create the documentation and metadata during the project, and who will be responsible for maintaining it in future.
- How will the documentation and metadata be created?
In some cases metadata can be generated or extracted from data objects automatically or semi-automatically. In other cases, human effort will be required to create documentation and metadata.
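As an illustration of automatic extraction, the following minimal sketch reads technical metadata that the file system already holds for a file. The function name and the dictionary keys are assumptions chosen for this example, not part of any standard, and the format is only guessed from the file extension.

```python
import mimetypes
import os
from datetime import datetime, timezone


def extract_technical_metadata(path):
    """Collect basic technical metadata that the file system already knows."""
    info = os.stat(path)
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        # Last-modified time, expressed in UTC as an ISO 8601 string.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
        # Guessed from the file extension only; None if the extension is unknown.
        "format": mime_type,
    }
```

Descriptive information, such as why the data was collected, cannot be recovered this way and still has to be written by a person.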
Data registers and inventories
The Australian Code for the Responsible Conduct of Research requires all researchers to maintain a list of their research data assets.
The UK Data Audit Framework Methodology proposes the following as the absolute minimum set of elements for a data register or inventory:
- Asset manager
e.g. web address, identifier
i.e. is the data asset vital, important or minor?
- Classification comments
- General comments
Metadata standards (general) - Dublin Core
Some common descriptive standards are available that work for many different kinds of material and across disciplines. The most widely used of these is Dublin Core.
This simple and general metadata standard facilitates the finding, sharing and management of data. It includes elements such as Title, Creator, Subject, Date and Type.
Dublin Core (or DC) is not specific to the research environment, to certain disciplines or to particular technologies. It can be used to describe many different types of data (not just digital), and is widely used as the metadata standard in institutional repositories, including the Monash University Research Repository.
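To make this concrete, here is a minimal sketch of a Dublin Core description serialised as XML, using the five elements named above. The namespace URI is the standard one for the Dublin Core element set. The wrapping `metadata` element, the function name and all of the values are placeholders assumed for this example, and a repository may expect a different wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML element holding one dc:* child per (element, value) pair."""
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Placeholder values, for illustration only.
record = dublin_core_record([
    ("title", "Rainfall observations, hypothetical study site"),
    ("creator", "Example, Researcher"),
    ("subject", "Rainfall"),
    ("date", "2015-06-30"),
    ("type", "Dataset"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Because elements are passed as a list of pairs, an element such as `subject` or `creator` can be repeated, which Dublin Core permits.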
Metadata standards (discipline-specific)
In many disciplines, existing standards or best practices will be available that are specifically designed for describing and sharing data within a particular discipline or cluster.
Some examples include:
- Humanities
The Text Encoding Initiative produces a standard way of both describing and marking up digital textual materials, used widely for studies in areas such as language and linguistics, literature and history. The Visual Resources Association Core is used in the cultural heritage community and has a focus on image collections. Dublin Core is also widely used in the humanities.
- Geospatial data
The Content Standard for Digital Geospatial Metadata (CSDGM) and ISO 19115:2003.
- Social sciences
Data Documentation Initiative (DDI)
- Scientific experimental data
No generally agreed model yet exists. However, the Core Scientific Meta-Data Model (CSMD), a study-data oriented model, has been developed to capture high-level information about scientific studies and the data that they produce.
Discipline-specific standards abound. In choosing documentation and metadata standards for your research data, consideration of what is commonly used in your discipline should be part of your data planning.
An identifier is a reference number or name for a data object and forms a key part of your documentation and metadata. To be useful over the long term, identifiers need to be:
- unique - globally unique if possible, but at the very least unique within your particular systems and processes
- persistent - the identifier should not change over time.
Some common kinds of identifiers are:
- International Standard Book Numbers (ISBN) for published books
- Digital Object Identifiers (DOIs) for electronically published journal articles - can also be used for datasets. For further information refer to the ANDS website - Digital Object Identifiers (DOIs) for Datasets.
- Uniform Resource Locators (URLs) or website addresses (though the persistence of these over time is not always guaranteed)
- Primary keys - reference numbers assigned (usually automatically, by database software) to each record in a database
- Handles - reference numbers assigned, e.g. as part of the process of deposit into the Monash University Research Repository.
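Where no external identifier scheme is available, the two properties above can be approximated locally. The sketch below is one possible approach under stated assumptions: `IdentifierRegister` is a hypothetical helper, not a real library, and it uses random UUIDs for uniqueness. It keeps its assignments in memory only, so a real project would need to save them somewhere durable for the identifiers to be persistent.

```python
import uuid


class IdentifierRegister:
    """Hypothetical in-memory register that assigns each data object one identifier."""

    def __init__(self):
        self._ids = {}

    def identifier_for(self, object_name):
        # Persistent: once assigned, the same identifier is always returned.
        if object_name not in self._ids:
            # Unique: a random UUID is, for practical purposes, globally unique.
            self._ids[object_name] = str(uuid.uuid4())
        return self._ids[object_name]


register = IdentifierRegister()
first = register.identifier_for("interview_transcripts")
again = register.identifier_for("interview_transcripts")
other = register.identifier_for("survey_responses")
```

Here `first` and `again` hold the same value, while `other` differs from both.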
Wherever possible, you should use an existing controlled vocabulary. Even if you need to adapt or customise an existing standard, this is likely to be preferable to creating something from scratch. Agreeing on a controlled vocabulary and applying it consistently will make your documentation and metadata more valuable in terms of providing searchability and context for your data in the future and enabling it to be shared with other researchers in the same discipline. Keywords and tags can be easier to apply, but if researchers do not agree on their choice of terminology then the ability of the data to be found and used in future may be diminished.
File naming for digital files
With the increasing use of systems such as Google Drive, LabArchives and Monash figshare that encourage collaborative working, it is important that your folders, documents and records are named in a consistent and logical manner so they can be located, identified and retrieved as quickly and easily as possible.
You should develop file naming conventions early in a research project, and agree on these with colleagues and collaborators before data is created.
When you are deciding on digital file naming conventions, consider the following guidelines:
- Use capital letters to delimit words, not spaces.
- Avoid punctuation altogether, or use hyphens and underscores rather than spaces, especially where files may be accessed using a web browser.
- Try to make file names short, but meaningful.
- If you need to incorporate a date in a file or folder name, always state it as YYYY, YYYYMM or YYYYMMDD.
- If you need to incorporate a number in a file name, give it two or three digits (e.g. 01 or 001). Never use a single digit.
- When using version numbering, put this at the end of the name.
- Even where your system allows them, do not create names containing these characters: : / < > | " ? ; = + & * $
- Avoid initials, abbreviations and codes that are not commonly understood.
- Avoid using common words such as ‘draft’ or ‘letter’ at the start of file names, unless this will make it easier to retrieve the record.
- Avoid unnecessary repetition and redundancy in file names and file paths.
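Several of these conventions can be applied and checked mechanically. The following is a minimal sketch, not a prescribed scheme: the function names, the underscore separators and the `v` prefix for versions are assumptions made for this example, and your project may agree on a different pattern.

```python
import re
from datetime import date

# Characters the guidance above says to keep out of names.
FORBIDDEN = set(':/<>|"?;=+&*$')


def build_file_name(words, on, version, extension):
    """Assemble DescriptiveWords_YYYYMMDD_vNN.ext from its parts."""
    # Capital letters delimit words, so no spaces are needed.
    stem = "".join(word[:1].upper() + word[1:] for word in words)
    # Date as YYYYMMDD; version as two digits, placed at the end of the name.
    return f"{stem}_{on:%Y%m%d}_v{version:02d}.{extension}"


def naming_problems(name):
    """Return a list of ways in which a name breaks the conventions."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if FORBIDDEN & set(name):
        problems.append("contains forbidden characters")
    stem = name.rsplit(".", 1)[0]
    # A digit with no digit on either side is a single-digit number.
    if re.search(r"(?<!\d)\d(?!\d)", stem):
        problems.append("contains a single-digit number")
    return problems


good = build_file_name(["soil", "survey"], date(2015, 3, 9), 2, "csv")
# good == "SoilSurvey_20150309_v02.csv"
issues = naming_problems("draft 1: notes.txt")
# issues lists all three problems for this name
```

A check like this only covers the rules that can be tested automatically. Whether a name is meaningful, or an abbreviation commonly understood, still needs human judgement.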
File "properties" and internal document structures
Many software programs enable the addition of structured metadata in the form of "Properties". Common pieces of metadata that can be added include title, author, organisation, subjects and keywords, and additional comments.
Researchers can also ensure that digital files are well-structured internally. By simply adding document titles, authors and their contact details, dates, version control information, and column and row labels for tables and spreadsheets, you greatly increase the ability of your research data to be found, managed and interpreted over time.
Data dictionaries, data definition files and schema
A data dictionary, data definition file or schema describes the attributes of data fields, and may include any rules relating to how data is entered. This information can be stored internally (e.g. as a table in a database) or externally (e.g. as a separate document). External documentation should be retained with the data in the long term, as it will provide valuable context to the data over time.
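A data dictionary can also be kept in machine-readable form, which allows records to be checked against it. The sketch below assumes a small, entirely hypothetical dictionary: the field names, descriptions and codes are invented for illustration, and `check_record` is an example helper rather than part of any library.

```python
# Hypothetical data dictionary: one entry per field, with a description,
# an expected type, and (optionally) the codes the field may contain.
DATA_DICTIONARY = {
    "participant_id": {"description": "Identifier assigned at enrolment", "type": str},
    "age_years": {"description": "Age at interview, in whole years", "type": int},
    "smoker": {
        "description": "Smoking status",
        "type": int,
        "codes": {0: "never", 1: "former", 2: "current"},
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of ways in which a record departs from the dictionary."""
    problems = []
    for field, rule in dictionary.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
        elif "codes" in rule and value not in rule["codes"]:
            problems.append(f"{field}: {value!r} is not a defined code")
    # Fields that nobody has documented are also worth flagging.
    for field in record:
        if field not in dictionary:
            problems.append(f"{field}: not described in the dictionary")
    return problems
```

For example, `check_record({"participant_id": "P001", "age_years": 34, "smoker": 1})` returns an empty list, while a record with an undefined code or an undocumented field returns one message per problem.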
Subject headings, thesauri, taxonomies and ontologies are all examples of controlled vocabularies, i.e. lists of words or phrases used to provide consistent classification (or tagging). Like metadata standards, these vocabularies range from the very generic (e.g. Library of Congress Subject Headings and Dewey Decimal Classification) through to very discipline-specific lists created and maintained by experts in that field.
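Applying a controlled vocabulary consistently can be partly automated by mapping free-text keywords to preferred terms. The following minimal sketch assumes a tiny, made-up vocabulary. The terms, variants and function name are hypothetical, and a real project would load an existing vocabulary instead of defining its own.

```python
# Hypothetical vocabulary: each preferred term maps to the variants that
# should be replaced by it.
VOCABULARY = {
    "rainfall": {"rain", "precipitation (liquid)"},
    "groundwater": {"ground water", "ground-water"},
}


def normalise_keyword(keyword, vocabulary=VOCABULARY):
    """Return the preferred term for a keyword, or None if it is not recognised."""
    cleaned = keyword.strip().lower()
    for preferred, variants in vocabulary.items():
        if cleaned == preferred or cleaned in variants:
            return preferred
    return None
```

With this vocabulary, `normalise_keyword(" Ground Water ")` returns `"groundwater"`, and an unrecognised keyword returns `None` so that it can be reviewed rather than silently accepted.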
Storing your documentation and metadata
Storage and backup are just as essential for your documentation and metadata as they are for your research data, and the same guidelines apply.