ISCC - Meta-Code

A similarity preserving hash for digital asset metadata.

The Meta-Code is the first unit of a canonical ISCC. It is calculated from the metadata of a digital asset. The purpose of the Meta-Code is to aid the discovery of digital assets with similar metadata and the detection of metadata anomalies.

Because of its special role we call the metadata supplied to the algorithm SEED-METADATA. SEED-METADATA has 3 possible inputs:

  • name (required): The name or title of the work manifested by the digital asset.
  • description (optional): A user-presentable textual description of the digital asset.
  • properties (optional): Industry-sector or use-case specific structured metadata or a raw file header (binary blob).

The first 32 bits of a Meta-Code are calculated as a similarity hash from the name field. The second 32 bits are also calculated from the name field if no other input was supplied. If a description is supplied but no properties, the description is used for the second 32 bits. If properties are supplied, they are used in favour of the description for the second 32 bits.
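The precedence rule above can be sketched as a small selection function. This is a minimal illustration, not the library's code: the name `second_half_input` is hypothetical, and treating empty values (such as `""` or `{}`) as "not supplied" is an assumption.

```python
from typing import Optional, Union


def second_half_input(
    name: str,
    description: Optional[str] = None,
    properties: Union[dict, bytes, None] = None,
) -> Union[str, dict, bytes]:
    """Return the SEED-METADATA input that feeds the second 32 bits."""
    if properties:
        # Properties take precedence over the description.
        return properties
    if description:
        # No properties: fall back to the description.
        return description
    # Only the name was supplied: it feeds both halves.
    return name
```

The first 32 bits are always derived from the name, so only the second half needs this decision.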

Due to the broad applicability of the ISCC we do not prescribe a particular schema for the metadata supplied to the properties-field. However, structured metadata should be supplied as an object that is JSON/JCS serializable - preferably JSON-LD to support interoperability and machine interpretation.

In addition to the Meta-Code we also create a cryptographic hash (the metahash) of the supplied SEED-METADATA. It is used to bind metadata to the digital asset. We use a BLAKE3 multihash with base32 encoding as the cryptographic hash. If properties are supplied, their raw bytes payload or their JCS-serialized JSON data is the sole input to the cryptographic hash. Otherwise we use a space-separated concatenation of the cleaned name and description fields as input.
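The choice of input bytes for the metahash might look roughly like the sketch below. The function name `metahash_payload` is hypothetical. The `json.dumps` call only approximates JCS (RFC 8785): it matches for simple objects with string keys and integer or string values, but not necessarily for floats or unusual key ordering. The hashing step itself is left out because BLAKE3 is not part of the Python standard library.

```python
import json
from typing import Optional, Union


def metahash_payload(
    name: str,
    description: Optional[str] = None,
    properties: Union[dict, bytes, None] = None,
) -> bytes:
    """Select the bytes that would be fed to the cryptographic hash."""
    if isinstance(properties, bytes) and properties:
        # Raw binary properties (e.g. a file header) are hashed as-is.
        return properties
    if properties:
        # Assumption: a JCS-like canonical form (sorted keys, no whitespace).
        canonical = json.dumps(
            properties, sort_keys=True, separators=(",", ":"), ensure_ascii=False
        )
        return canonical.encode("utf-8")
    # No properties: space-separated name and description (already cleaned).
    return " ".join((name, description or "")).strip().encode("utf-8")
```

The returned bytes would then be hashed with BLAKE3 and wrapped as a multihash; that step depends on a third-party BLAKE3 implementation.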

Tip

For reasons of reproducibility, applications that generate ISCCs should prioritize metadata that is automatically extracted from the digital asset.

If embedded metadata is not available or is known to be unreliable, an application should rely on external metadata or explicitly ask users to supply at least the name field. Applications should then embed the user-supplied metadata into the asset before calculating the ISCC-CODE.

If neither embedded nor external metadata is available, the application may resort to using the normalized filename of the digital asset as the value for the name field. An application may also skip generation of a Meta-Code entirely and create an ISCC-CODE without a Meta-Code.
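One way an application could apply this priority order is sketched below. Everything here is illustrative: `choose_name` and its parameters are hypothetical, and the filename normalization (dropping the extension, replacing underscores, collapsing whitespace) is an assumption, since the text does not define it.

```python
from pathlib import PurePath
from typing import Optional


def choose_name(
    embedded: Optional[str] = None,
    external: Optional[str] = None,
    user_supplied: Optional[str] = None,
    filename: Optional[str] = None,
) -> Optional[str]:
    """Pick a value for the name field, or None to skip the Meta-Code."""
    # Prefer embedded metadata, then external metadata, then user input.
    for candidate in (embedded, external, user_supplied):
        if candidate and candidate.strip():
            return candidate.strip()
    if filename:
        # Assumed normalization: file stem, underscores to spaces, single spaces.
        stem = PurePath(filename).stem
        return " ".join(stem.replace("_", " ").split()) or None
    return None
```

A return value of `None` corresponds to the case where the application creates an ISCC-CODE without a Meta-Code.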

gen_meta_code(name, description = None, properties = None, bits = 64)

Create an ISCC Meta-Code using the latest standard algorithm.

Parameters:

  • name (str, required): Name or title of the work manifested by the digital asset.
  • description (Union[str,bytes,None], default None): Optional description for disambiguation.
  • bits (int, default 64): Bit-length of resulting Meta-Code (multiple of 64).

Returns:

  • dict: ISCC object with Meta-Code and properties name, description, properties, metahash.

gen_meta_code_v0(name, description = None, properties = None, bits = 64)

Create an ISCC Meta-Code with the algorithm version 0.

Note

The input for the properties field can be:

  • Structured (JSON/JCS serializable) metadata
  • Raw bytes from a file header

Parameters:

  • name (str, required): Name or title of the work manifested by the digital asset.
  • description (Optional[str], default None): A user-presentable textual description of the digital asset for disambiguation purposes (may include markdown).
  • properties (Optional[Properties], default None): Use-case or industry-specific metadata. Either JSON-serializable structured data or a binary blob.
  • bits (int, default 64): Bit-length of resulting Meta-Code (multiple of 64).

Returns:

  • dict: ISCC object with possible fields: iscc, name, description, properties, metahash.
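The bits parameter can be read as selecting a prefix of the 256-bit similarity digest. The sketch below shows that reading; it is an assumption rather than documented behaviour, and `truncate_digest` is a hypothetical helper.

```python
def truncate_digest(digest: bytes, bits: int = 64) -> bytes:
    """Keep the first `bits` bits of a similarity digest."""
    # The text requires a multiple of 64; the upper bound is the digest size.
    if bits % 64 != 0 or not 64 <= bits <= len(digest) * 8:
        raise ValueError(f"bits must be a multiple of 64 up to {len(digest) * 8}")
    return digest[: bits // 8]
```

With the default of 64 bits, the result covers the first two 32-bit chunks of the digest.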

soft_hash_meta_v0(name, extra = None)

Calculate a similarity-preserving 256-bit hash digest from asset metadata.

Textual input should be stripped of markup, normalized and trimmed before hashing. JSON metadata should be normalized with JCS.

Note

The processing algorithm depends on the type of the extra input. If extra is supplied and non-empty, we create separate hashes for name and extra and interleave them in 32-bit chunks:

  • If extra is None or an empty str/bytes object, the similarity hash is generated from the name field only.

  • If extra is a non-empty text string (str), the string is lower-cased and the processing unit is a UTF-8 encoded character (possibly multibyte). The resulting hash is interleaved with the name hash.

  • If extra is a non-empty bytes object, the processing is done bytewise and the resulting hash is interleaved with the name hash.

Parameters:

  • name (str, required): Title of the work manifested in the digital asset.
  • extra (Union[str,bytes,None], default None): Additional metadata for disambiguation.

Returns:

  • bytes: 256-bit simhash digest for Meta-Code.
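The 32-bit interleaving mentioned in the note can be illustrated as follows. This sketch assumes both inputs are already 256-bit similarity digests and that only the first 128 bits of each are used, so that the result is again 256 bits; `interleave_digests` is a hypothetical name, not a library function.

```python
def interleave_digests(name_hash: bytes, extra_hash: bytes) -> bytes:
    """Interleave two 256-bit digests in 32-bit (4-byte) chunks."""
    if len(name_hash) != 32 or len(extra_hash) != 32:
        raise ValueError("both digests must be 32 bytes (256 bits)")
    out = bytearray()
    # Assumption: take the first 16 bytes of each digest, 4 bytes at a time.
    for i in range(0, 16, 4):
        out += name_hash[i:i + 4]   # 32 bits from the name hash
        out += extra_hash[i:i + 4]  # 32 bits from the extra hash
    return bytes(out)
```

With this layout, the first 64 bits of the result hold 32 bits derived from the name followed by 32 bits derived from the extra input.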

trim_text(text, nbytes)

Trim text such that its UTF-8 encoded size does not exceed nbytes.
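A minimal sketch of byte-limited trimming, under the assumption that a multibyte character cut in half by the limit is dropped entirely. The name `trim_text_sketch` is hypothetical and the library's own function may differ in detail.

```python
def trim_text_sketch(text: str, nbytes: int) -> str:
    """Trim text so that its UTF-8 encoding is at most nbytes long."""
    # Cut at the byte level, then discard any incomplete trailing character.
    return text.encode("utf-8")[:nbytes].decode("utf-8", "ignore")
```

Cutting at the byte level and decoding with `"ignore"` guarantees that the result is valid text and never exceeds the limit.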

remove_newlines(text)

Remove newlines.

The name field serves as a displayable title. We remove newlines and leading and trailing whitespace. We also collapse consecutive spaces to single spaces.

Parameters:

  • text (required): Text for newline removal.

Returns:

  • str: Single line of text.
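The behaviour described for remove_newlines can be approximated in one line. Note the assumption: `str.split()` without arguments splits on any whitespace, so tabs are collapsed along with spaces and newlines. `remove_newlines_sketch` is a hypothetical name.

```python
def remove_newlines_sketch(text: str) -> str:
    """Return text as a single line with single spaces between words."""
    # split() drops leading/trailing whitespace and splits on runs of it.
    return " ".join(text.split())
```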

clean_text(text)

Clean text for display.

  • Normalize with NFKC normalization.
  • Remove control characters (except newlines).
  • Reduce multiple consecutive newlines to a maximum of two newlines.
  • Strip leading and trailing whitespace.
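The four steps listed above could be sketched with the standard library as shown here. `clean_text_sketch` is a hypothetical name, and "control characters" is taken to mean the Unicode category `Cc`, which is an assumption; under it, carriage returns are removed, so `\r\n` becomes `\n`.

```python
import re
import unicodedata


def clean_text_sketch(text: str) -> str:
    """Clean text for display following the four listed steps."""
    # 1. NFKC normalization (e.g. the ligature U+FB01 becomes "fi").
    text = unicodedata.normalize("NFKC", text)
    # 2. Drop control characters (category Cc) but keep newlines.
    text = "".join(
        ch for ch in text if ch == "\n" or unicodedata.category(ch) != "Cc"
    )
    # 3. Collapse runs of three or more newlines to exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 4. Strip leading and trailing whitespace.
    return text.strip()
```

Unlike the single-line cleanup used for the name field, this keeps paragraph breaks, which suits multi-line descriptions.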