datasafe.manifest module
Manifests for datasafe items.
Each item (currently: dataset) stored in the datasafe is accompanied by a
file containing a lot of (automatically obtained) useful information about
the item stored. Typically, the YAML format is used for the manifest file,
and the file is generically named MANIFEST.yaml.
Manifests provide easy access to information on the items of a dataset, such as the data format, associated files and their meanings, and checksums that allow data corruption to be detected. In particular, information regarding the file format could be retrieved from the item(s) stored, but only by using specialised data and metadata importers. Thus, manifests allow the datasafe component to be much more independent of other packages and modules.
Note
While manifest files are a general concept, currently they are only implemented for datasets stored in the datasafe. This will, however, most probably change in the future with the further development of the datasafe.
Manifests for datasets
In case of a dataset, the information contained ranges from general information on the dataset (LOI, whether it is complete) to the format of data and metadata to the actual file names and the checksums over data alone and over data and metadata together.
As an example, the contents of a manifest file for a dataset are shown below, for a fictitious dataset consisting of two (empty) files (data in test and metadata in test.info):
format:
  type: datasafe dataset manifest
  version: 0.1.0
dataset:
  loi: ''
  complete: false
files:
  metadata:
  - name: test.info
    format: cwEPR Info file
    version: 0.1.4
  data:
    format: undetected
    names:
    - test
  checksums:
  - name: CHECKSUM
    format: MD5 checksum
    span: data, metadata
    value: f46475b4905fe2e1a388dc5c6a07ecbc
  - name: CHECKSUM_data
    format: MD5 checksum
    span: data
    value: 74be16979710d4c4e7c6647856088456
A few comments on this example:
The file identifies its own format, using the format key on the highest level, including type and version. This allows for automatically handling different formats and different versions of the same format.
YAML is a human-readable and (even more important) human-writable standard supported by many programming languages. Hence, information stored in this way can be easily processed both by other programs and in the (far) future. Text files are probably the only format with real long-term support.
Checksums are used to allow for integrity checks, i.e. to detect inadvertent changes of data or metadata. At the same time, as they are generated using the content, but not the names, of the files, they can be used to check for duplicates.
Using MD5 as a hashing algorithm may raise some criticism. Clearly, MD5 should no longer be used for security purposes, as it has to be considered broken (and has been for years, as of 2021). However, for only checking for inadvertent changes (or duplicates) of data, it is still a good choice, as it is fast and still widely supported.
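To illustrate why content-based checksums detect both changes and duplicates, here is a minimal sketch using only the Python standard library. It is not the implementation used by this package: the helper names file_md5 and combined_md5 are hypothetical, and the way the per-file digests are combined (sorted, then hashed again) is an assumption made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def combined_md5(paths):
    """Return one MD5 hex digest over the contents of several files.

    Assumption for this sketch: the per-file digests are sorted and then
    hashed again, so neither file names nor file order matter.
    """
    digests = sorted(file_md5(path) for path in paths)
    return hashlib.md5("".join(digests).encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp, "test")
    renamed = Path(tmp, "copy-with-other-name")
    modified = Path(tmp, "modified")
    original.write_bytes(b"1.0 2.0 3.0\n")
    renamed.write_bytes(b"1.0 2.0 3.0\n")
    modified.write_bytes(b"1.0 2.0 3.1\n")

    # Identical content under a different name gives the same checksum ...
    same_content = combined_md5([original]) == combined_md5([renamed])
    # ... whereas any change of the content gives a different one.
    changed_content = combined_md5([original]) != combined_md5([modified])
```

Because only the bytes of the files enter the digest, a renamed copy is recognised as a duplicate, while a single changed character is recognised as a modification.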
Working with manifests
To work with manifests in a program, the YAML file needs to be represented in the form of an object, and this object should be able to read its contents from, as well as write its contents to, a YAML file. Furthermore, wouldn’t it be helpful if a manifest object could check the integrity of the accompanying files by (re)creating checksums and comparing them to those stored in the manifest?
This is what the Manifest class provides you with. Suppose you have a dataset and an accompanying manifest. Checking the integrity of the dataset could be as simple as:
from datasafe.manifest import Manifest

manifest = Manifest()
manifest.from_file()  # reads MANIFEST.yaml by default
integrity = manifest.check_integrity()
if not all(integrity.values()):
    fails = [key for key, value in integrity.items() if not value]
    for fail in fails:
        print(f"The following checksum failed: '{fail}'")
Of course, in your code, you will most probably do more sensible things than only printing which checksum check failed.
Conversely, if you want to create a manifest file, in the simplest case all you need to do is specify which filenames are data and metadata files, respectively:
manifest = Manifest()
manifest.data_filenames = [<your data filenames>]
manifest.metadata_filenames = [<your metadata filenames>]
manifest.to_file()
This would create a file MANIFEST.yaml including the auto-generated checksums and the information regarding the metadata file format (as long as it is either the info file format or YAML).
File format detection
A big question remains: How to (automatically) detect the file format of a given dataset? Probably there is no general solution to this problem that would work in all possible cases. Furthermore, it is implausible for this package to contain format detectors for all file formats one could think of. Therefore, the following strategy is employed:
File format detection is delegated to helper classes that are provided with the list of filenames a dataset consists of.
Using the Python plugin architecture (entry points), users can provide their own helper classes to detect file formats.
The general helper class to derive your own helper classes from is FormatDetector. Rather than being a purely abstract class, it already does its job of detecting metadata file formats, namely info and YAML files. Therefore, you will usually only need to implement the logic for detecting the data file format(s) as such.
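The following sketch shows what deriving such a helper class might look like. To keep it runnable without datasafe installed, it defines a small stand-in FormatDetector that only mimics the behaviour described in this documentation; in real code you would import the class from datasafe.manifest instead. The subclass PlainTextFormatDetector, the format name "Plain text table" and the use of FileNotFoundError in the stand-in are made up for illustration.

```python
import os


class FormatDetector:
    """Stand-in for datasafe.manifest.FormatDetector (NOT the real class).

    It only mimics what the text describes: data_format() delegates to the
    non-public method _detect_data_format(), which subclasses override.
    """

    def __init__(self):
        self.data_filenames = []
        self.metadata_filenames = []

    def data_format(self):
        if not self.data_filenames:
            # The real class raises datasafe.exceptions.NoFileError here.
            raise FileNotFoundError("No data file(s) provided")
        return self._detect_data_format()

    def _detect_data_format(self):
        # The base class only provides dummy output.
        return "test"


class PlainTextFormatDetector(FormatDetector):
    """Hypothetical detector recognising a made-up '.txt' data format."""

    def _detect_data_format(self):
        extensions = {
            os.path.splitext(name)[1].lower() for name in self.data_filenames
        }
        if extensions == {".txt"}:
            return "Plain text table"
        return "undetected"


detector = PlainTextFormatDetector()
detector.data_filenames = ["measurement.TXT"]
detected = detector.data_format()
```

Only the non-public detection method is overridden; everything else, including the handling of metadata files, stays with the base class.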
Important
Please bear in mind that detecting a file format is entirely different from actually importing the data contained in the respective files. The latter is the responsibility of separate packages dealing with data processing and analysis, e.g. packages derived from the ASpecD framework.
Registering your own file format detectors
The entry point for registering file format detectors is called labinform_fileformats, and an excerpt of the setup.py file of this package showing the relevant section is shown below:
setuptools.setup(
    # ...
    entry_points={
        'labinform_fileformats': [
            'epr = datasafe.manifest:EPRFormatDetector',
        ],
    },
    # ...
)
As you can see, a detector (for electron paramagnetic resonance spectroscopy data) is already registered in this package. Similarly, as soon as you create your own package containing classes derived from FormatDetector and give it a setup.py file defining an entry point as shown above (with your own name/s instead of “epr”), your detectors will automatically be used as soon as you install your package (in the virtual environment you are using, of course).
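For orientation, this is a sketch of how code consuming the entry point group could discover the registered detector classes with the standard library module importlib.metadata. It is not necessarily how this package does it internally; the function name load_detectors is hypothetical.

```python
from importlib.metadata import entry_points


def load_detectors(group="labinform_fileformats"):
    """Return a dict mapping entry point names to the loaded objects."""
    found = entry_points()
    if hasattr(found, "select"):
        # Python 3.10 and later: entry points are selectable by group.
        selected = found.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        selected = found.get(group, [])
    return {entry.name: entry.load() for entry in selected}


# With this package installed, the result would be expected to contain
# the key "epr"; without any registered detectors it is an empty dict.
detectors = load_detectors()
```

Since entry points are read from the metadata of installed distributions, a newly written detector only shows up after its package has been installed in the active environment.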
Detectors provided with this package
As always, this package has been designed with some concrete use cases in mind, and therefore, detectors for those file formats regularly encountered by the package authors are available.
Currently, the following file format detectors are available in this package and can be used as templates for your own developments (but note that the strategies used may not be the best, though they should work):
- EPRFormatDetector
Detector for different vendor file formats used in electron paramagnetic resonance (EPR) spectroscopy.
Limitations
In its current implementation (as of 11/2021), manifests require metadata files to accompany the data files of a dataset. Without an existing metadata file, the Manifest class will raise an error when trying to create a manifest file.
This may be a practical limitation for legacy data that were recorded without creating, at the same time, a metadata file containing some of the most important pieces of information not contained in the data files themselves.
Module documentation
- class datasafe.manifest.Manifest
Bases: object
Representation of the information contained in a manifest file
A file named MANIFEST.yaml contains relevant information about the data storage of a single measurement. Besides the type and format of the MANIFEST.yaml itself, it contains the LOI of the dataset, the names, formats and versions of data and metadata files and the respective checksums.
- format_detector
Helper class to detect file formats
- from_dict(manifest_dict)
Obtain information from (ordered) dict
- Parameters:
manifest_dict (collections.OrderedDict) – Dict containing information of a manifest
- from_file(filename='MANIFEST.yaml')
Obtain information from Manifest file
Usually, manifests are stored as YAML files on the file system in files named MANIFEST.yaml.
- Parameters:
filename (str) – Name of the file to read manifest from
Default: “MANIFEST.yaml”
- Raises:
datasafe.exceptions.MissingInformationError – Raised if no filename to read from is provided
datasafe.exceptions.NoFileError – Raised if file to read from does not exist on the file system
- to_dict()
Return information contained in object as (ordered) dict
- Returns:
manifest_ – Manifest as (ordered) dict
- Return type:
collections.OrderedDict
- Raises:
MissingInformationError – Raised if data_filenames is empty
NoFileError – Raised if any of the files listed in either data_filenames or metadata_filenames does not exist on the file system
- to_file()
Save manifest to file
The information for the actual manifest file first gets collected in an ordered dict of the designated structure using to_dict(). The dict populated this way is then written to a YAML file (usually) named MANIFEST.yaml (as specified by manifest_filename).
- check_integrity()
Check integrity of dataset, comparing stored with generated checksums.
To check the integrity of a dataset, the checksums stored within the manifest file will be compared to newly generated checksums over data and metadata together as well as over data alone.
This allows checking for consistency of manifest and data, and hence detecting any corruption of the data. You may check for integrity like this:
integrity = manifest.check_integrity()
if not all(integrity.values()):
    fails = [key for key, value in integrity.items() if not value]
    for fail in fails:
        print(f"The following checksum failed: '{fail}'")
This would first check if there are any failed checks, and if so, print the failing key for each failure. Of course, in your case you will do more sensible things than just printing the keys.
- Returns:
integrity – dict with fields data and all containing boolean values
- Return type:
dict
- Raises:
datasafe.exceptions.MissingInformationError – Raised if not all necessary information is available.
- class datasafe.manifest.FormatDetector
Bases: object
Helper class to detect file formats.
For real use, you need to implement a class subclassing this class and providing real information, as this class only provides dummy test output.
As each format has its own peculiarities, you need to come up with sensible ways to actually detect both metadata and data formats. Generally, it should be sufficient to provide an implementation for the private method _detect_data_format() that returns the actual data format as a string.
However, at least info files and YAML files (with a certain structure) as metadata files are supported out of the box. To add detectors for further metadata formats, add methods with the naming scheme _parse_<extension>, with “<extension>” being the file extension of your metadata file.
For YAML files, the requirements are that there exists a key “format” at the top level of the file that contains the keys “type” and “version”.
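The naming scheme for metadata parsers suggests a lookup of methods by file extension. The sketch below shows one way such a dispatch could work, using getattr; it is an illustration under that assumption, not the code of this class. The class MetadataFormatSketch and the JSON parser are hypothetical, and a real parser would read the file to obtain format and version.

```python
import os


class MetadataFormatSketch:
    """Hypothetical stand-in, not datasafe's FormatDetector."""

    def __init__(self, metadata_filenames=None):
        self.metadata_filenames = metadata_filenames or []

    def metadata_format(self):
        """Return one dict per metadata file, found via _parse_<extension>."""
        results = []
        for filename in self.metadata_filenames:
            extension = os.path.splitext(filename)[1].lstrip(".").lower()
            # Look up a method named after the extension, if there is one.
            parser = getattr(self, f"_parse_{extension}", None)
            if parser is not None:
                results.append(parser(filename))
        return results

    def _parse_json(self, filename):
        # Made-up example format; a real parser would read the file.
        return {"name": filename, "format": "JSON", "version": ""}


sketch = MetadataFormatSketch(["sample.json", "notes.unknown"])
formats = sketch.metadata_format()
```

With this pattern, supporting a further metadata format only requires adding one method whose name ends in the file extension; files without a matching method are skipped in this sketch.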
- Raises:
datasafe.exceptions.NoFileError – Raised if no data file(s) are provided
- metadata_format()
Obtain format of the metadata file(s).
Generally, the metadata format is checked using the file extension.
Two formats are automatically detected: info (.info) and YAML (.yml, .yaml). To support other formats, you need to provide methods with the naming scheme _parse_<extension>, with “<extension>” being the file extension of your metadata file.
- Returns:
metadata_info – List of ordered dicts (collections.OrderedDict) each containing “name”, “format”, and “version” as fields.
If no metadata filenames are provided in metadata_filenames, an empty list is returned.
- Return type:
list
- data_format()
Obtain format of the data file(s).
You need to subclass this class and override the non-public method _detect_data_format() to actually detect the file format, as this method only provides “test” as format.
- Returns:
data_format – Name of the format of the data files
- Return type:
str
- Raises:
datasafe.exceptions.NoFileError – Raised if no data file(s) are provided
- class datasafe.manifest.EPRFormatDetector
Bases: FormatDetector
Detector for EPR file formats.
Here, EPR stands for electron paramagnetic resonance (spectroscopy).
Currently, the following vendor-specific file formats can be distinguished (without guarantee that each of the formats will be accurately detected in any case):
Bruker EMX (.par, .spc)
Bruker ESP (.par, .spc)
Bruker BES3T (.DSC, .DTA [, …])
Magnettech XML (.xml)
Magnettech CSV (.csv)
The following assumptions will be made for each of the formats, besides existing data files with the given extensions (case-insensitive):
Bruker EMX (.par, .spc)
First line of PAR file starts with DOS
Note that according to its official specification, this format cannot be distinguished from the Bruker ESP format that stores its actual data in a different binary format. But in practice, the criterion given seems to be quite robust.
Bruker ESP (.par, .spc)
First line of PAR file does not start with DOS
Note that according to its official specification, this format cannot be distinguished from the Bruker EMX format that stores its actual data in a different binary format. But in practice, the criterion given seems to be quite robust.
Bruker BES3T (.DSC, .DTA [, …])
Existence of at least two files with extension ‘.DSC’ and ‘.DTA’ (case-insensitive)
Magnettech XML (.xml)
Second line of file starts with <ESRXmlFile
Magnettech CSV (.csv)
First line of file starts with “Name,”, third line of file starts with “Recipe”
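The criteria listed above can be turned into a small standalone function. The following is a sketch of such a check, written from the description alone and not taken from the EPRFormatDetector source; the function names, the returned strings and the file contents in the usage part are assumptions made for illustration.

```python
import os
import tempfile


def _read_lines(path, count):
    """Return the first `count` lines of a text file, ignoring bad bytes."""
    with open(path, "r", errors="ignore") as handle:
        # readline() returns "" at end of file, so the list has `count` items.
        return [handle.readline() for _ in range(count)]


def guess_epr_format(filenames):
    """Guess the EPR data format from the criteria listed above."""
    by_ext = {os.path.splitext(n)[1].lower(): n for n in filenames}
    if ".dsc" in by_ext and ".dta" in by_ext:
        return "Bruker BES3T"
    if ".par" in by_ext and ".spc" in by_ext:
        first = _read_lines(by_ext[".par"], 1)[0]
        return "Bruker EMX" if first.startswith("DOS") else "Bruker ESP"
    if ".xml" in by_ext:
        lines = _read_lines(by_ext[".xml"], 2)
        if lines[1].startswith("<ESRXmlFile"):
            return "Magnettech XML"
    if ".csv" in by_ext:
        lines = _read_lines(by_ext[".csv"], 3)
        if lines[0].startswith("Name,") and lines[2].startswith("Recipe"):
            return "Magnettech CSV"
    return "undetected"


with tempfile.TemporaryDirectory() as tmp:
    par = os.path.join(tmp, "sample.par")
    spc = os.path.join(tmp, "sample.spc")
    open(spc, "wb").close()  # empty placeholder for the binary data file

    with open(par, "w") as handle:
        handle.write("DOS  Format\nJSS 2\n")  # made-up parameter file
    emx_guess = guess_epr_format([par, spc])

    with open(par, "w") as handle:
        handle.write("JSS 2\n")  # same file without the leading DOS line
    esp_guess = guess_epr_format([par, spc])
```

Note that the extension check for BES3T does not open any file, whereas distinguishing EMX from ESP depends entirely on the first line of the PAR file, which is why that criterion is described as robust in practice rather than guaranteed.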