datasafe.checksum module

Checksums for datasafe items.

Checksums serve two purposes within the dataset component of the LabInform framework: they make it easy to check whether the data items of a dataset entry have been corrupted in transfer or over time, and they make it easy to detect duplicates.

Design goals

To serve these purposes, the module follows a few general design goals:

Checksums are always generated for file contents, not file names, so the file names (which may easily change) are irrelevant for the checksum.
A checksum over a list of files is generated in three steps: a checksum is computed per file (from its content), the resulting checksums are sorted, and a final checksum is computed over the sorted list of checksums. File names therefore cannot affect the final checksum, because they play no role in the sort order.
For datasets, two checksums are generated: one spanning both data and metadata, the other spanning only the data. The reason: metadata are of human origin and therefore inherently prone to errors and subject to more or less frequent updates and corrections. Data, however, shall never change after they have been recorded.
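
The design goals above can be illustrated with hashlib alone. The following is a minimal sketch, not the module's implementation: the helper names content_checksum and combined_checksum, the sample file names and contents, and the choice to concatenate the sorted hex digests without a separator are all assumptions made for this example.

```python
import hashlib


def content_checksum(content, algorithm="md5"):
    """Return the hex digest of a bytes object (hypothetical helper)."""
    return hashlib.new(algorithm, content).hexdigest()


def combined_checksum(contents, algorithm="md5"):
    """Checksum over the sorted per-item checksums (hypothetical helper)."""
    checksums = sorted(content_checksum(c, algorithm) for c in contents)
    # Assumption: the sorted hex digests are simply concatenated.
    return content_checksum("".join(checksums).encode("utf-8"), algorithm)


# File names are only dictionary keys here; they never enter a checksum.
data = {"scan-001.dat": b"1.0 2.0 3.0", "scan-002.dat": b"4.0 5.0 6.0"}
metadata = {"scan.info": b"operator: jdoe"}

data_checksum = combined_checksum(data.values())
full_checksum = combined_checksum([*data.values(), *metadata.values()])

# Correcting the metadata changes only the checksum spanning data and metadata.
metadata["scan.info"] = b"operator: J. Doe"
assert combined_checksum(data.values()) == data_checksum
assert combined_checksum([*data.values(), *metadata.values()]) != full_checksum
```

Because the per-item checksums are sorted before the final checksum is computed, passing the contents in a different order gives the same result.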

Algorithms

A note on the algorithms used: the module can create checksums with any algorithm currently supported by the hashlib module. Although MD5 is generally considered unsafe for cryptographic purposes, it is sufficient for the non-cryptographic purposes checksums serve in the current context. For this reason, MD5 remains the default algorithm of the Generator class.
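
To illustrate what choosing another algorithm amounts to, the following sketch calls hashlib directly. It does not use the datasafe API; the helper name digest is hypothetical, and the sketch assumes a fixed-length algorithm (the variable-length SHAKE algorithms need a length argument and are not covered).

```python
import hashlib


def digest(data, algorithm="md5"):
    """Return the hex digest of ``data`` for a fixed-length hashlib algorithm."""
    if algorithm not in hashlib.algorithms_available:
        raise ValueError(f"Unknown hash algorithm: {algorithm}")
    hasher = hashlib.new(algorithm)
    hasher.update(data)
    return hasher.hexdigest()


md5_digest = digest(b"LabInform")
sha256_digest = digest(b"LabInform", algorithm="sha256")

# MD5 yields 32 hex characters, SHA-256 yields 64.
print(len(md5_digest), len(sha256_digest))
```

For detecting accidental corruption or duplicates, the shorter MD5 digest is usually adequate; switching to a stronger algorithm only changes the name passed to hashlib.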

Module documentation

class datasafe.checksum.Generator

Bases: object

Class for creating checksums

algorithm

Hash algorithm to use.

Defaults to “md5”; can be any algorithm available in the hashlib module.

Type: str

hash_string(string='')

Create checksum for string

Parameters: string (str) – String to compute hash for
Returns: checksum – Computed checksum
Return type: str

hash_strings(list_of_strings=None)

Create checksum for list of strings

The strings are sorted before the checksum is generated. Hence, if you create a checksum of checksums (e.g., a checksum over several files), the sort order is independent of the file names and depends only on the actual file contents, resulting in stable hashes.

Parameters: list_of_strings (list) – List of strings to compute hash for
Returns: checksum – Computed checksum
Return type: str

generate(filenames=None)

Generate checksum for (list of) file(s).

For a single file, the checksum of its content will be generated.

For a list of files, the checksum of each file's content will be generated first, and afterwards a checksum over these checksums.

The checksums of the individual files will be sorted before generating the final checksum. Hence, the sort order is independent of the file names and depends only on the actual file contents, resulting in stable hashes.

Parameters: filenames (str or list of str) – Name(s) of the file(s) to generate a checksum of their content(s)
Returns: checksum – Checksum generated over (list of) file(s)
Return type: str
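
The behaviour described for generate can be sketched as follows. This is an illustration under stated assumptions, not the source of the method: the function names file_checksum and generate_checksum, the chunked reading, and the way the sorted checksums are joined before hashing are hypothetical.

```python
import hashlib


def file_checksum(filename, algorithm="md5", chunk_size=65536):
    """Hash the content of one file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(filename, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def generate_checksum(filenames, algorithm="md5"):
    """Checksum for a single file name or a list of file names."""
    if isinstance(filenames, str):
        # Single file: the checksum of its content.
        return file_checksum(filenames, algorithm)
    checksums = sorted(file_checksum(name, algorithm) for name in filenames)
    # Assumption: the sorted hex digests are concatenated before hashing.
    joined = "".join(checksums).encode("utf-8")
    return hashlib.new(algorithm, joined).hexdigest()
```

Since only file contents are hashed and the per-file checksums are sorted, the result in this sketch does not depend on the order of the list or on the names of the files.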