datasafe.manifest module

Manifests for datasafe items.

Each item (currently: dataset) stored in the datasafe is accompanied by a file containing a wealth of (automatically obtained) useful information about the stored item. Typically, the YAML format is used for the manifest file, and the file is generically named MANIFEST.yaml.

Manifests provide easy access to information on the items of a dataset, such as data format, associated files and their meanings, and checksums allowing data corruption to be detected. In particular, information regarding the file format could be retrieved from the stored item(s), but only by using specialised data and metadata importers. Thus, manifests allow the datasafe component to be much more independent of other packages and modules.

Note

While manifest files are a general concept, currently they are only implemented for datasets stored in the datasafe. This will, however, most probably change in the future with the further development of the datasafe.

Manifests for datasets

In case of a dataset, the information contained ranges from general information on the dataset (LOI, whether it is complete) to the format of data and metadata to the actual file names and the checksums over data alone as well as over data and metadata combined.

As an example, the contents of a manifest file for a dataset are shown below, for a fictitious dataset consisting of two (empty) files (data in test and metadata in test.info):

format:
  type: datasafe dataset manifest
  version: 0.1.0
dataset:
  loi: ''
  complete: false
files:
  metadata:
  - name: test.info
    format: cwEPR Info file
    version: 0.1.4
  data:
    format: undetected
    names:
    - test
checksums:
- name: CHECKSUM
  format: MD5 checksum
  span: data, metadata
  value: f46475b4905fe2e1a388dc5c6a07ecbc
- name: CHECKSUM_data
  format: MD5 checksum
  span: data
  value: 74be16979710d4c4e7c6647856088456

A few comments on this example:

  • The file identifies its own format, using the format key on the highest level, including type and version. This allows for automatically handling different formats and different versions of the same format.

  • YAML is a human-readable and (even more important) human-writable standard supported by many programming languages. Hence, information stored in this way can easily be processed both by other programs and in the (far) future. Text files are probably the only format with real long-term support.

  • Checksums are used to allow for integrity checks, i.e. inadvertent change of data or metadata. At the same time, as they are generated using the content, but not the names of the files, they can be used to check for duplicates.

  • Using MD5 as a hashing algorithm may raise some criticism. Clearly, MD5 should no longer be used for security purposes, as it has to be considered broken (for years already, as of 2021). However, for merely checking for inadvertent changes (or duplicates) of data, it is still a good choice, being fast and widely supported.
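The content-based checksum idea can be sketched as follows. This is a minimal illustration only; the exact checksum scheme the datasafe package uses (e.g. the order and combination of hashed files) may well differ:

```python
import hashlib


def content_checksum(filenames):
    """Hash file contents, not names: renamed duplicates hash alike.

    A minimal sketch of the idea described above; the actual checksum
    scheme of the datasafe package may differ in detail.
    """
    md5 = hashlib.md5()
    for filename in sorted(filenames):  # make input order irrelevant
        with open(filename, "rb") as file:
            md5.update(file.read())
    return md5.hexdigest()
```

As only the bytes of the files enter the hash, two datasets with identical contents but different file names yield identical checksums, which is what makes duplicate detection possible.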

Working with manifests

To work with manifests in a program, the YAML file needs to be represented in the form of an object, and this object should be able to read its contents from as well as write its contents to a YAML file. Furthermore, wouldn’t it be helpful if a manifest object could check the integrity of the accompanying files, (re)creating checksums and comparing them to those stored in the manifest?

This is what the Manifest class provides you with. Suppose you have a dataset and an accompanying manifest. Checking the integrity of the dataset could be as simple as:

manifest = Manifest()
manifest.from_file()
integrity = manifest.check_integrity()

if not all(integrity.values()):
    fails = [key for key, value in integrity.items() if not value]
    for fail in fails:
        print(f"The following checksum failed: '{fail}'")

Of course, in your code, you will most probably do more sensible things than only printing which checksum check failed.

Conversely, if you want to create a manifest file, in the simplest case all you need to do is specify which filenames are data and metadata files, respectively:

manifest = Manifest()
manifest.data_filenames = [<your data filenames>]
manifest.metadata_filenames = [<your metadata filenames>]
manifest.to_file()

This would create a file MANIFEST.yaml including the auto-generated checksums and the information regarding the metadata file format (as long as it is either the info file format or YAML).

File format detection

A big question remains: How to (automatically) detect the file format of a given dataset? Probably there is no general solution to this problem that would work in all possible cases. Furthermore, this package cannot realistically contain format detectors for every file format one could think of. Therefore, the following strategy is employed:

  • File format detection is delegated to helper classes that are provided with the list of filenames a dataset consists of.

  • Using the Python plugin architecture (entry points), users can provide their own helper classes to detect file formats.
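Discovery of such plugins can be sketched with the standard library alone. The group name labinform_fileformats is the one used by this package (see the section on registering detectors); the version check is an assumption about which Python versions need the older API:

```python
import sys
from importlib.metadata import entry_points


def load_format_detectors():
    """Load all classes registered under the entry point group.

    A sketch of plugin discovery via entry points; the datasafe
    package itself may iterate and instantiate detectors differently.
    """
    group = "labinform_fileformats"
    if sys.version_info >= (3, 10):
        eps = entry_points(group=group)
    else:  # selecting by group arrived in Python 3.10
        eps = entry_points().get(group, [])
    return [entry_point.load() for entry_point in eps]
```

Any installed package declaring this entry point group would thus contribute its detector classes without the datasafe package knowing about them in advance.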

The general helper class to derive your own helper classes from is FormatDetector. Rather than being a purely abstract class, it already does its job of detecting metadata file formats, namely info and YAML files. Therefore, you will usually only need to implement the logic for detecting the data file format(s) as such.
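Such a subclass might look as follows. To keep the sketch self-contained, FormatDetector is replaced here by a minimal stand-in; only the hook name _detect_data_format follows the description above, while the CSV detection logic is purely illustrative:

```python
class FormatDetector:
    """Minimal stand-in for datasafe.manifest.FormatDetector."""

    def __init__(self):
        self.data_filenames = []

    def data_format(self):
        return self._detect_data_format()

    def _detect_data_format(self):
        return "test"  # dummy output, as in the real base class


class CSVFormatDetector(FormatDetector):
    """Hypothetical detector recognising data files by extension."""

    def _detect_data_format(self):
        if self.data_filenames and all(
            name.lower().endswith(".csv") for name in self.data_filenames
        ):
            return "generic CSV"
        return "undetected"
```

The base class handles everything else; overriding the single private method is all that is needed to plug in format-specific logic.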

Important

Please bear in mind that detecting a file format is entirely different from actually importing the data contained in the respective files. The latter is the responsibility of separate packages dealing with data processing and analysis, e.g. packages derived from the ASpecD framework.

Registering your own file format detectors

The entry point for registering file format detectors is called labinform_fileformats, and an excerpt of the setup.py file of this package showing the relevant section is shown below:

setuptools.setup(
    # ...
    entry_points={
        'labinform_fileformats': [
            'epr = datasafe.manifest:EPRFormatDetector',
        ],
    },
    # ...
)

As you can see, a detector (for electron paramagnetic resonance spectroscopy data) is already registered in this package. Similarly, as soon as you create your own package containing classes derived from FormatDetector, give it a setup.py file defining an entry point as shown above (with your own name(s) instead of “epr”), and your detectors will automatically be used as soon as you install your package (in the virtual environment you are using, of course).

Detectors provided with this package

As always, this package has been designed with concrete use cases in mind, and therefore, detectors for those file formats regularly encountered by the package authors are available.

Currently, the following file format detectors are available in this package and can be used as templates for your own developments (but note that the strategies used may not be the best, though they should work):

  • EPRFormatDetector

    Detector for different vendor file formats used in electron paramagnetic resonance (EPR) spectroscopy.

Limitations

In its current implementation (as of 11/2021), manifests require metadata files to accompany the data files of a dataset. Without an existing metadata file, the Manifest class will raise an error when trying to create a manifest file.

This may be a practical limitation for legacy data that were recorded without a metadata file being created at the same time, containing some of the most important pieces of information not contained in the data files themselves.

Module documentation

class datasafe.manifest.Manifest[source]

Bases: object

Representation of the information contained in a manifest file

A file named MANIFEST.yaml contains relevant information about the data storage of a single measurement. Besides the type and format of the MANIFEST.yaml itself, it contains the LOI of the dataset, the names, format and versions of data and metadata files and the respective checksums.

loi

lab object identifier (LOI) corresponding to dataset

Type:

str

data_filenames

filenames of data files

Type:

list

metadata_filenames

filenames of metadata files

Type:

list

data_checksum

checksum over data only

Type:

str

checksum

checksum over data and metadata

Type:

str

manifest_filename

filename for Manifest file, defaults to MANIFEST.yaml

Type:

str

format_detector

Helper class to detect file formats

Type:

datasafe.manifest.FormatDetector

from_dict(manifest_dict)[source]

Obtain information from (ordered) dict

Parameters:

manifest_dict (collections.OrderedDict) – Dict containing information of a manifest

from_file(filename='MANIFEST.yaml')[source]

Obtain information from Manifest file

Usually, manifests are stored as YAML file on the file system in files named MANIFEST.yaml.

Parameters:

filename (str) –

Name of the file to read manifest from

Default: “MANIFEST.yaml”

Raises:
  • datasafe.exceptions.MissingInformationError – Raised if no filename to read from is provided

  • datasafe.exceptions.NoFileError – Raised if file to read from does not exist on the file system

to_dict()[source]

Return information contained in object as (ordered) dict

Returns:

manifest_ – Manifest as (ordered) dict

Return type:

collections.OrderedDict

to_file()[source]

Save manifest to file

The information for the actual manifest file first gets collected in an ordered dict of the designated structure using to_dict(). The dict populated this way is then written to a YAML file (usually) named MANIFEST.yaml (as specified by manifest_filename).

check_integrity()[source]

Check integrity of dataset, comparing stored with generated checksums.

To check the integrity of a dataset, the checksums stored within the manifest file will be compared to newly generated checksums over data and metadata together as well as over data alone.

Allows checking for consistency of manifest and data, and hence detecting any corruption of the data. You may check for integrity like this:

integrity = manifest.check_integrity()
if not all(integrity.values()):
    fails = [key for key, value in integrity.items() if not value]
    for fail in fails:
        print(f"The following checksum failed: '{fail}'")

This would first check if there are any failed checks, and if so, for each fail print the failing key. Of course, in your case you will do more sensible things than just printing the keys.

Returns:

integrity – dict with fields data and all containing boolean values

Return type:

dict

Raises:

datasafe.exceptions.MissingInformationError – Raised if not all necessary information is available.

class datasafe.manifest.FormatDetector[source]

Bases: object

Helper class to detect file formats.

For real use, you need to implement a subclass of this class providing real information, as this class only provides dummy test output.

As each format has its own peculiarities, you need to come up with sensible ways to actually detect both metadata and data formats. Generally, it should be sufficient to provide an implementation for the private method _detect_data_format() that returns the actual data format as string.

However, at least info files and YAML files (with a certain structure) as metadata files are supported out of the box. To add detectors for further metadata formats, add methods with the naming scheme _parse_<extension> with “<extension>” being the file extension of your metadata file.

For YAML files, requirements are that there exists a key “format” at the top level of the file that contains the keys “type” and “version”.
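This structural requirement can be expressed as a simple check on the parsed file contents. The function name here is illustrative; the real detector performs an equivalent check internally:

```python
def yaml_metadata_recognisable(contents):
    """Check the requirement stated above on parsed YAML contents:
    a top-level "format" key containing both "type" and "version".

    Illustrative sketch; not part of the datasafe API.
    """
    file_format = contents.get("format", {})
    return (
        isinstance(file_format, dict)
        and "type" in file_format
        and "version" in file_format
    )
```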

data_filenames

filenames of data files

Type:

list

metadata_filenames

filenames of metadata files

Type:

list

Raises:

datasafe.exceptions.NoFileError – Raised if no data file(s) are provided

metadata_format()[source]

Obtain format of the metadata file(s).

Generally, the metadata format is checked using the file extension.

Two formats are automatically detected: info (.info) and YAML (.yml, .yaml). To support other formats, you need to provide methods with the naming scheme _parse_<extension> with “<extension>” being the file extension of your metadata file.
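The naming scheme translates into a simple dynamic dispatch on the file extension, sketched here with illustrative names (the class and return values are assumptions for illustration only; the real class is FormatDetector):

```python
import collections
import os


class ExtensionDispatcher:
    """Toy illustration of the _parse_<extension> naming scheme."""

    def metadata_format(self, filenames):
        metadata_info = []
        for filename in filenames:
            extension = os.path.splitext(filename)[1].lstrip(".").lower()
            # Look up a parser method matching the file extension.
            parser = getattr(self, f"_parse_{extension}", None)
            if parser:
                metadata_info.append(parser(filename))
        return metadata_info

    def _parse_info(self, filename):
        # A real parser would read the file to obtain format and version.
        return collections.OrderedDict(
            name=filename, format="Info file", version="unknown"
        )
```

Adding support for a new metadata format thus amounts to adding one more _parse_<extension> method.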

Returns:

metadata_info – List of ordered dicts (collections.OrderedDict) each containing “name”, “format”, and “version” as fields.

If no metadata filenames are provided in metadata_filenames, an empty list is returned.

Return type:

list

data_format()[source]

Obtain format of the data file(s).

You need to subclass this class and override the non-public method _detect_data_format() to actually detect the file format, as this method only provides “test” as format.

Returns:

data_format – Name of the format of the data files

Return type:

str

Raises:

datasafe.exceptions.NoFileError – Raised if no data file(s) are provided

detection_successful()[source]

Return whether a file format could be detected successfully.

Returns:

success – Status of file format detection.

Can be used upstream to decide whether this class should be applied as a file format detector.

Return type:

bool

class datasafe.manifest.EPRFormatDetector[source]

Bases: FormatDetector

Detector for EPR file formats.

Here, EPR stands for electron paramagnetic resonance (spectroscopy).

Currently, the following vendor-specific file formats can be distinguished (without guarantee that each of the formats will be accurately detected in every case):

  • Bruker EMX (.par, .spc)

  • Bruker ESP (.par, .spc)

  • Bruker BES3T (.DSC, .DTA [, …])

  • Magnettech XML (.xml)

  • Magnettech CSV (.csv)

The following assumptions will be made for each of the formats, besides existing data files with the given extensions (case-insensitive):

  • Bruker EMX (.par, .spc)

    First line of PAR file starts with DOS

    Note that according to its official specification, this format cannot be distinguished from the Bruker ESP format that stores its actual data in a different binary format. But in practice, the criterion given seems to be quite robust.

  • Bruker ESP (.par, .spc)

    First line of PAR file does not start with DOS

    Note that according to its official specification, this format cannot be distinguished from the Bruker EMX format that stores its actual data in a different binary format. But in practice, the criterion given seems to be quite robust.

  • Bruker BES3T (.DSC, .DTA [, …])

    Existence of at least two files with extension ‘.DSC’ and ‘.DTA’ (case-insensitive)

  • Magnettech XML (.xml)

    Second line of file starts with <ESRXmlFile

  • Magnettech CSV (.csv)

    First line of file starts with Name,, and third line of file starts with Recipe
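The first two criteria above amount to sniffing the first line of the PAR file. A self-contained sketch (the function name is illustrative and not part of the datasafe API):

```python
def guess_bruker_par_family(par_filename):
    """Distinguish Bruker EMX from ESP by the PAR file's first line.

    Follows the criteria listed above: EMX if the first line starts
    with DOS, ESP otherwise. A sketch, not the package's actual code.
    """
    with open(par_filename, encoding="ascii", errors="replace") as file:
        first_line = file.readline()
    return "Bruker EMX" if first_line.startswith("DOS") else "Bruker ESP"
```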