ISCC - Text Code

A similarity preserving hash for plain-text content (soft hash).

The ISCC Text-Code is generated from plain text that has been extracted from a media asset.
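To illustrate what "similarity preserving" means in practice: hashes of this kind are typically compared by counting differing bits (Hamming distance), so that similar inputs yield a small distance. The following is a minimal sketch using only the standard library. The function name `hamming_distance` and the two digest values are hypothetical and are not real Text-Code digests.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length digests."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical 64-bit digests, made up for illustration only.
digest_a = bytes.fromhex("ff00ff00ff00ff00")
digest_b = bytes.fromhex("ff00ff00ff00ff0f")

# Only the low four bits of the last byte differ.
distance = hamming_distance(digest_a, digest_b)
```

A smaller distance suggests more similar text. Any threshold for treating two texts as "similar" is application specific and is not defined on this page.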

Warning

Plain-text extraction from documents in various formats (especially PDF) may yield very different results depending on the extraction tools used. The iscc-sdk uses Apache Tika to extract text from documents for Text-Code generation.

Algorithm overview

1. Apply the `text_collapse` function to the text input.
2. Count the characters of the collapsed text.
3. Apply `soft_hash_text_v0` to the collapsed text.

`gen_text_code_v0(text, bits = ic.core_opts.text_bits)`

Create an ISCC Text-Code with algorithm v0.

Note

Any markup (like HTML tags or markdown) should be removed from the plain-text before passing it to this function.
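As a rough illustration of the note above, HTML tags could be stripped with the standard library before the text is handed to `gen_text_code_v0`. This is only a sketch: the names `_TextExtractor` and `strip_html` are hypothetical helpers, not part of iscc-core, and real documents usually call for a more robust extraction tool.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of an HTML fragment with all tags dropped.

    Note: the contents of <script> and <style> elements are kept as text,
    so this helper is only suitable for simple fragments.
    """
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


plain = strip_html("<p>Hello <b>World</b></p>")
```

The resulting `plain` string is what would then be passed as the `text` argument.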

Parameters:

Name	Type	Description	Default
`text`	`str`	Text for Text-Code creation	required
`bits`	`int`	Bit-length of ISCC Code Hash (default 64)	`ic.core_opts.text_bits`

Returns:

Type	Description
`dict`	ISCC schema instance with Text-Code and an additional property `characters`

Source code in iscc_core\code_content_text.py

def gen_text_code_v0(text, bits=ic.core_opts.text_bits):
    # type: (str, int) -> dict
    """
    Create an ISCC Text-Code with algorithm v0.

    !!! note
        Any markup (like HTML tags or markdown) should be removed from the plain-text
        before passing it to this function.

    :param str text: Text for Text-Code creation
    :param int bits: Bit-length of ISCC Code Hash (default 64)
    :return: ISCC schema instance with Text-Code and an additional property `characters`
    :rtype: dict
    """

    text = text_collapse(text)
    characters = len(text)
    digest = soft_hash_text_v0(text)

    text_code = ic.encode_component(
        mtype=ic.MT.CONTENT,
        stype=ic.ST_CC.TEXT,
        version=ic.VS.V0,
        bit_length=bits,
        digest=digest,
    )

    iscc = "ISCC:" + text_code

    return dict(iscc=iscc, characters=characters)

`text_collapse(text)`

Normalize and simplify text for similarity hashing.

- Decompose with NFD normalization.
- Remove all whitespace characters and convert text to lower case.
- Filter control characters, marks (diacritics), and punctuation.
- Recombine with NFKC normalization.

Note

See: Unicode normalization.

Parameters:

Name	Type	Description	Default
`text`	`str`	Plain text to be collapsed.	required

Returns:

Type	Description
`str`	Collapsed plain text.
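The effect of the steps above can be tried out with a standalone sketch that depends only on the standard library. It assumes that `ic.core_opts.text_unicode_filter` covers the Unicode major categories C (control), M (mark), and P (punctuation), as the description suggests. The names `UNICODE_FILTER` and `text_collapse_sketch` are hypothetical.

```python
import unicodedata

# Assumed stand-in for ic.core_opts.text_unicode_filter:
# C = control, M = mark (diacritics), P = punctuation.
UNICODE_FILTER = frozenset({"C", "M", "P"})


def text_collapse_sketch(text: str) -> str:
    """Collapse text following the four steps listed above."""
    # Decompose so that diacritics become separate mark characters.
    text = unicodedata.normalize("NFD", text).lower()
    # Drop whitespace and every character in a filtered major category.
    kept = [
        ch
        for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in UNICODE_FILTER
    ]
    # Recombine and fold compatibility characters.
    return unicodedata.normalize("NFKC", "".join(kept))


raw = "Héllo,\n Wörld!"
collapsed = text_collapse_sketch(raw)
characters = len(collapsed)
```

Here `collapsed` is `"helloworld"` and `characters` is 10, while the raw input has 14 characters. The count is taken after collapsing, not over the raw input.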

Source code in iscc_core\code_content_text.py

def text_collapse(text):
    # type: (str) -> str
    """
    Normalize and simplify text for similarity hashing.

    - Decompose with NFD normalization.
    - Remove all whitespace characters and convert text to lower case
    - Filter control characters, marks (diacritics), and punctuation
    - Recombine with NFKC normalization.

    !!! note

        See: [Unicode normalization](https://unicode.org/reports/tr15/).

    :param str text: Plain text to be collapsed.
    :return: Collapsed plain text.
    :rtype: str
    """

    # Decompose with NFD and convert to lower case
    text = unicodedata.normalize("NFD", text).lower()

    # Remove whitespace and filter characters in one pass
    filtered_chars = []

    for ch in text:
        if not ch.isspace() and unicodedata.category(ch)[0] not in ic.core_opts.text_unicode_filter:
            filtered_chars.append(ch)

    # Recombine
    return unicodedata.normalize("NFKC", "".join(filtered_chars))

`soft_hash_text_v0(text)`

Creates a 256-bit similarity preserving hash for text input with algorithm v0.

- Slide over text with a `text_ngram_size` wide window and create `xxh32` hashes.
- Create a `minhash_256` from the hashes generated in the previous step.

Note

Before passing text to this function it must be:

- stripped of markup
- normalized
- stripped of whitespace
- lowercased

Parameters:

Name	Type	Description	Default
`text`	`str`	Plain text to be hashed.	required

Returns:

Type	Description
`bytes`	256-bit similarity preserving byte hash.
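The first step, sliding a fixed-width window over the text and hashing each n-gram, can be sketched without third-party packages. This sketch makes several assumptions: `zlib.crc32` is swapped in for `xxh32` purely to avoid a dependency (the values will differ from what iscc-core produces), the window width of 13 is an assumed value for `text_ngram_size`, and the local `sliding_window` and `ngram_features` helpers are hypothetical stand-ins. The minhash step is not reproduced here.

```python
import zlib


def sliding_window(text: str, width: int) -> list:
    """Return all overlapping substrings of the given width.

    Text shorter than the width yields a single window with the whole text.
    """
    count = max(len(text) - width + 1, 1)
    return [text[i : i + width] for i in range(count)]


def ngram_features(text: str, width: int = 13) -> set:
    """Hash every n-gram to a 32-bit integer (crc32 as a stand-in for xxh32)."""
    return {zlib.crc32(s.encode("utf-8")) for s in sliding_window(text, width)}


# Two already collapsed texts that differ only in their last three characters.
a = ngram_features("thequickbrownfoxjumpsoverthelazydog")
b = ngram_features("thequickbrownfoxjumpsoverthelazycat")

# Share of n-gram features the two texts have in common (Jaccard index).
jaccard = len(a & b) / len(a | b)
```

Because most windows are identical for the two inputs, most features are shared. A minhash over such feature sets is what lets the resulting digests stay close for similar texts.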

Source code in iscc_core\code_content_text.py

def soft_hash_text_v0(text):
    # type: (str) -> bytes
    """
    Creates a 256-bit similarity preserving hash for text input with algorithm v0.

    - Slide over text with a
      [`text_ngram_size`][iscc_core.options.CoreOptions.text_ngram_size] wide window
      and create [`xxh32`](https://cyan4973.github.io/xxHash/) hashes
    - Create a [`minhash_256`][iscc_core.minhash.alg_minhash_256] from the hashes generated
      in the previous step.

    !!! note

        Before passing text to this function it must be:

        - stripped of markup
        - normalized
        - stripped of whitespace
        - lowercased

    :param str text: Plain text to be hashed.
    :return: 256-bit similarity preserving byte hash.
    :rtype: bytes
    """
    ngrams = ic.sliding_window(text, ic.core_opts.text_ngram_size)
    features = [xxhash.xxh32_intdigest(s.encode("utf-8")) for s in ngrams]
    hash_digest = ic.alg_minhash_256(features)
    return hash_digest
