This lesson is still being designed and assembled (Pre-Alpha version)

File Review

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Why is file review important for reproducibility?

  • What curation activities are associated with file review?

Objectives
  • Understand why it is important to review deposited files

  • Recognize file review best practices

In this episode, we unpack the Data Quality Review Framework, focusing on File Review. Other episodes in this lesson elaborate on Documentation, Data, and Code Review.

File Review in Practice

File review is an important part of the Data Quality Review (DQR) covered in the previous episode. This episode will outline best practices to ensure the integrity of files in the reproducibility package.

Data reuse relies on clearly identified, functional, and long-term accessible files. Tasks associated with file review include inspecting the files and identifying file format-specific curation tasks, creating persistent identifiers and metadata, and creating preservation file formats. Preservation-oriented steps, such as implementing a migration strategy for file formats, and ongoing bit monitoring, are also part of file review.

File review is especially important as a preparation for packaging the files for deposit into a trustworthy repository for long-term access and sharing (see Lesson 4: Compendium Packaging).

File Inspection

Files must be inspected to ensure that they are all present and correctly named, and that they open. File sizes and formats may be recorded and checksums created.
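Some of these checks can be scripted. The following is a minimal sketch, not a prescribed workflow: it assumes the curator has a list of expected file names (for example, transcribed from the depositor's file list) and a local copy of the package. The function name `inspect_files` and the structure of the report are illustrative assumptions.

```python
from pathlib import Path


def inspect_files(package_dir, expected_names):
    """Compare the files in package_dir against a list of expected names.

    Returns a report listing missing files, unexpected files, zero-byte
    files, and the size in bytes of every file that is present.
    Only the top level of the folder is inspected (no subfolders).
    """
    package = Path(package_dir)
    # Map each file name to its size in bytes.
    sizes = {p.name: p.stat().st_size for p in package.iterdir() if p.is_file()}
    present = set(sizes)
    expected = set(expected_names)
    return {
        "missing": sorted(expected - present),
        "unexpected": sorted(present - expected),
        "empty": sorted(name for name, size in sizes.items() if size == 0),
        "sizes": sizes,
    }
```

A file that appears under "empty" or "missing" is a prompt to contact the depositor, not something the script can resolve. Whether a file actually opens still has to be checked in the appropriate software.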

File inspection tasks may include the following:

File Format-specific Curation Tasks

Some aspects of file review may vary based on the file format or subject domain. The Data Curation Network Primers have helpful curation process recommendations for particular file formats.

Spotlight: Data Repository Metadata

Data curation involves the creation of metadata at both the file level and the study level to facilitate further data quality review and discoverability. Required metadata fields vary by repository, but should be mapped to standard metadata schemas for preservation and interoperability.

Metadata librarians or other metadata experts on staff may create a Metadata Application Profile for the data repository, which data curators consult to determine what metadata to create for each deposit they curate.
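Mapping repository fields to a standard schema can be pictured as a simple crosswalk. The sketch below is illustrative only: the local field names (`author`, `study_title`, `language`), the choice of Dublin Core targets, and the set of required fields are all assumptions, and a real Metadata Application Profile would define its own.

```python
# Hypothetical crosswalk from a repository's local field names to
# Dublin Core elements. A real application profile defines its own.
CROSSWALK = {
    "author": "dc.creator",
    "study_title": "dc.title",
    "language": "dc.language",
}

# Assumed required fields, for illustration only.
REQUIRED = ("dc.creator", "dc.title")


def to_dublin_core(local_record):
    """Map a local metadata record onto the crosswalk.

    Returns a tuple of (mapped record, local fields with no mapping,
    required fields that are absent or empty).
    """
    mapped = {CROSSWALK[k]: v for k, v in local_record.items() if k in CROSSWALK}
    unmapped = sorted(k for k in local_record if k not in CROSSWALK)
    missing = [field for field in REQUIRED if not mapped.get(field)]
    return mapped, unmapped, missing
```

Reporting unmapped and missing fields, rather than silently dropping them, gives the curator a list of items to resolve with the depositor or the metadata staff.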

Here are some resources for identifying metadata fields to include:

Verify and Enhance File-Level Metadata

Verify and enhance metadata associated with each individual file. Metadata-related tasks may include:

Spotlight: Fixity in Digital Preservation

Fixity, in the preservation sense, means the assurance that a digital file has remained unchanged, i.e. fixed. Fixity helps establish trust between data producers, stewards (e.g. repositories and archives), and users.

Fixity checking is the process of verifying that a digital object has not been altered or corrupted. A checksum on a file is a ‘digital fingerprint’ used to detect if the contents of a file have changed. Checksums can be generated using a range of readily available and open source tools.
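As one illustration, a checksum can be generated and later re-checked with Python's standard `hashlib` module. This is a minimal sketch under stated assumptions: SHA-256 is chosen here only as a common default, the function names are hypothetical, and a repository may require a different algorithm or tool.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hexadecimal checksum of the file at path.

    The file is read in chunks so that large files do not have to be
    loaded into memory all at once.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum, algorithm="sha256"):
    """Return True if the file still matches a previously recorded checksum."""
    return file_checksum(path, algorithm) == recorded_checksum.lower()
```

In use, the checksum is recorded once at deposit and stored alongside the file-level metadata. Running `fixity_ok` at a later date then reports whether the contents have changed since that record was made.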

See more information here:

Code File Metadata

Curators may want to enhance the metadata for code files, which includes some specific fields. For example,

Spotlight: Software Metadata

If the research compendium includes software, curators may add or enhance associated metadata. The Software Metadata Recommended Format Guide (SMRF) summarizes and defines the metadata elements recommended by the Software Preservation Network to describe software materials in the context of a wide range of collections.

Christophersen, Allan; Colón-Marrero, Elena; Dietrich, Dianne; Falcao, Patricia; Fox, Claire; Hanson, Karen; Kwan, Allen; McEniry, Matthew. (2022, February 8). Software Metadata Recommended Formats Guide. Software Preservation Network. https://www.softwarepreservationnetwork.org/smrf-guide/

Transform or Migrate to Preservation Formats

Files that are not already in a preservation format (preservation formats include .csv, .txt, and .pdf) and that do not have a built-in converter should be manually converted to one. Transforming files into non-proprietary formats facilitates potential reuse and helps ensure that the files remain accessible in the long term. For files from proprietary statistical software (e.g., .dta, .sav, .do, and others), both a converted preservation-format file and the original file may be included in the reproducibility package.
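A first pass at deciding which files need conversion can be sketched in code. The mapping below is illustrative only: it reuses the conversions suggested in the exercise solution later in this episode, and the names `PRESERVATION_TARGETS`, `PRESERVATION_FORMATS`, and `conversion_plan` are assumptions. A repository's own format policy takes precedence, and the script only plans conversions; it does not perform them.

```python
from pathlib import Path

# Illustrative mapping from source extension to a suggested
# preservation format. Not an authoritative policy.
PRESERVATION_TARGETS = {
    ".xlsx": ".csv",
    ".docx": ".txt",
    ".dta": ".csv",
    ".do": ".txt",
}

# Extensions treated here as already being preservation formats.
PRESERVATION_FORMATS = {".csv", ".txt", ".pdf"}


def conversion_plan(filenames):
    """Return {file name: suggested target extension} for files to convert.

    Files already in a preservation format are left out. Files with an
    unrecognized extension map to None, flagging them for curator review.
    """
    plan = {}
    for name in filenames:
        extension = Path(name).suffix.lower()
        if extension in PRESERVATION_FORMATS:
            continue
        plan[name] = PRESERVATION_TARGETS.get(extension)
    return plan
```

Returning `None` for unknown extensions, rather than guessing, keeps the decision with the curator, who can consult the format-specific resources for that file type.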

Spotlight: Preservation Formats

Some resources for identifying recommended preservation formats:

  • Library of Congress Recommended Formats
  • Cornell eCommons’ recommended file formats and probability for long-term preservation matrix
  • OpenAIRE Data formats for preservation
  • National Archives Tables of File Formats
  • UK Data Service Recommended formats
  • UK National Archives

Verify and Enhance Study-Level Metadata

Verify and enhance metadata to be applied at the level of the study. Metadata-related tasks may include:

Study Mnemonic

Following good data management practice, researchers would have created a project folder for the study that contains all the files used in the study. The name of the project folder should be unique to identify the particular set of files. This study mnemonic is also useful for reviewers who can use it to tag communications with authors and as codebook identifiers in archives.

An example reproducibility package prepared by CISER’s Results Reproduction Service team uses the study mnemonic “R2-2019-MEEMKEN-1” as the Codebook Identifier and as an e-mail subject tag to track communications with authors: https://archive.ciser.cornell.edu/reproduction-packages/2828.
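If file names are also expected to carry the study mnemonic, as in the exercise files later in this episode (for example, 2018-Kim-Documentation.docx), the check can be automated. This minimal sketch assumes the convention is "mnemonic, then a hyphen, then the rest of the name"; that rule and the function name `check_naming` are assumptions, and a repository may define its convention differently.

```python
def check_naming(filenames, mnemonic):
    """Return the file names that do not start with '<mnemonic>-'."""
    prefix = mnemonic + "-"
    return [name for name in filenames if not name.startswith(prefix)]
```

For instance, `check_naming(["2018-Kim-Documentation.docx", "analysis_final.do"], "2018-Kim")` returns `["analysis_final.do"]`, the one file that lacks the prefix.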

Reviewing the quality of deposited research files can seem abstract when considered only hypothetically. The exercise below builds understanding of file review through hands-on work with example files. It asks you to identify problems with the files and suggest resolutions. The Solution copy of the example files shows ways of resolving the issues.

Exercise: File Review Challenge

Review the example files for inclusion in a research compendium. Make recommendations to resolve any issues. In real life, you would communicate with the researchers to resolve any missing files or problematic issues you found.

  • The depositors submitted a file list document: 2018-Kim-Documentation.docx. Are all files present?
  • The study mnemonic “2018-Kim” will be used for this example. Do the files follow the appropriate naming convention?
  • Do all the files open? If not, what is the problem?
  • Record file sizes. Does anything look off?
  • What study-level metadata could be suggested based on these files?
  • Should any files be converted to a different file format for preservation?

Solution

Are all files present?

  • The file 2018-Kim-anonymized_participants.xlsx, listed in 2018-Kim-Documentation.docx, is missing.

Do files follow the appropriate naming convention?

  • One file, “analysis_final.do”, does not follow the convention: it lacks the study mnemonic “2018-Kim” in its name.

Do all the files open?

  • One file, 2018-Kim-Acknowledgements.xlzx, has a typo in the file extension that prevents it from opening.
  • One file is a .dta file, which requires Stata to open.
  • The .do file is run in Stata, but can be viewed in any text editor.

Record file sizes

  • One file, 2018-Kim-questionnaire.txt, is 0 KB, which indicates that it is empty.

Suggest study-level metadata

  • Dc.creator = Kim, Hyuncheol Bryant
  • Dc.title = Reproduction Materials for: The Role of Education Interventions in Improving Economic Rationality
  • Dc.language = English
  • Dc.language.iso = eng
  • Dc.creator.orcid = 0000-0001-5304-0274
  • Dc.relation.isreferencedby = Choi, Syngjoo, Kim, Hyuncheol Bryant, Kim, Booyuel, et al. “The role of education interventions in improving economic rationality.” Science. Volume 362, Issue 6410 (2018-10-05): 83-86. https://doi.org/10.1126/science.aar6987.

Recommend a preservation format for each file

  • .xlsx → .csv
  • .docx → .txt
  • .dta → .csv
  • .do → .txt
  • For proprietary file formats, it is best practice to include both a preservation-format version and the original proprietary-format version of the file in the reproducibility package. For example, include both the submitted .dta file and a .csv version.

Key Points

  • Files must be inspected.

  • Study and file metadata should be detailed, accurate, and formatted in a standard schema.

  • Aspects of file review may vary based on file format.

  • Transforming files into non-proprietary formats facilitates reuse and helps ensure long-term accessibility.