ISCC - Text Code
A similarity preserving hash for plain-text content (soft hash).
The ISCC Text-Code is generated from plain-text that has been extracted from a media asset.
Tip
Plain-text extraction from documents in various formats (especially PDF) may yield very different results depending on the extraction tools being used. For reproducible Text-Code generation, use Apache Tika v2.2.1 to extract text from your documents.
Algorithm overview
- Apply the `text_collapse` function to the text input
- Count the characters of the collapsed text
- Apply `soft_hash_text_v0` to the collapsed text
gen_text_code_v0(text, bits = 64)
Create an ISCC Text-Code with algorithm v0.
Note
Any markup (like HTML tags or markdown) should be removed from the plain-text before passing it to this function.
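As an illustration of the note above, here is a minimal sketch of stripping HTML tags with Python's standard library before text is passed on. The names `_TextExtractor` and `strip_html` are hypothetical helpers and not part of `iscc_core`; the sketch keeps the contents of `<script>` and `<style>` elements, so a production pipeline would likely need a more thorough extractor.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper, not part of iscc_core)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_html(markup):
    # type: (str) -> str
    """Return the text content of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


plain = strip_html("<p>Hello <b>World</b> &amp; friends</p>")
print(plain)
```

The resulting plain text could then be passed as the `text` argument.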
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text for Text-Code creation | required |
bits | int | Bit-length of ISCC Code Hash (default 64) | 64 |
Returns:
Type | Description |
---|---|
dict | ISCC schema instance with Text-Code and an additional property `characters` |
Source code in iscc_core\code_content_text.py
def gen_text_code_v0(text, bits=ic.core_opts.text_bits):
    # type: (str, int) -> dict
    """
    Create an ISCC Text-Code with algorithm v0.

    !!! note
        Any markup (like HTML tags or markdown) should be removed from the plain-text
        before passing it to this function.

    :param str text: Text for Text-Code creation
    :param int bits: Bit-length of ISCC Code Hash (default 64)
    :return: ISCC schema instance with Text-Code and an additional property `characters`
    :rtype: dict
    """
    # Normalize the input, then record the length of the collapsed text
    text = text_collapse(text)
    characters = len(text)
    # Similarity hash of the collapsed text
    digest = soft_hash_text_v0(text)
    text_code = ic.encode_component(
        mtype=ic.MT.CONTENT,
        stype=ic.ST_CC.TEXT,
        version=ic.VS.V0,
        bit_length=bits,
        digest=digest,
    )
    iscc = "ISCC:" + text_code
    return dict(iscc=iscc, characters=characters)
soft_hash_text_v0(text)
Creates a 256-bit similarity preserving hash for text input with algorithm v0.
- Slide over text with a `text_ngram_size` wide window and create `xxh32` hashes
- Create a `minhash_256` from the hashes generated in the previous step.
Note
Before passing text to this function it must be:
- stripped of markup
- normalized
- stripped of whitespace
- lowercased
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Plain text to be hashed. | required |
Returns:
Type | Description |
---|---|
bytes | 256-bit similarity preserving byte hash. |
Source code in iscc_core\code_content_text.py
def soft_hash_text_v0(text):
    # type: (str) -> bytes
    """
    Creates a 256-bit similarity preserving hash for text input with algorithm v0.

    - Slide over text with a
      [`text_ngram_size`][iscc_core.options.CoreOptions.text_ngram_size] wide window
      and create [`xxh32`](https://cyan4973.github.io/xxHash/) hashes
    - Create a [`minhash_256`][iscc_core.minhash.alg_minhash_256] from the hashes generated
      in the previous step.

    !!! note
        Before passing text to this function it must be:

        - stripped of markup
        - normalized
        - stripped of whitespace
        - lowercased

    :param str text: Plain text to be hashed.
    :return: 256-bit similarity preserving byte hash.
    :rtype: bytes
    """
    ngrams = ic.sliding_window(text, ic.core_opts.text_ngram_size)
    features = [xxhash.xxh32_intdigest(s.encode("utf-8")) for s in ngrams]
    hash_digest = ic.alg_minhash_256(features)
    return hash_digest
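To make the two steps of `soft_hash_text_v0` more concrete, the following is a rough, standard-library-only sketch of the general technique (character n-grams, 32-bit feature hashes, MinHash). It is not the `iscc_core` implementation and will not produce matching digests: `zlib.crc32` is swapped in for `xxh32`, the n-gram width of 13 is an assumed value for `text_ngram_size`, and the 64 hash functions use seeded random parameters rather than any constants from `alg_minhash_256`. All names in the block are hypothetical.

```python
import random
import zlib

NGRAM_SIZE = 13           # assumption: stands in for text_ngram_size
NUM_HASH_FUNCS = 64       # 64 minima x 4 bits each = 256 bits
PRIME = (1 << 61) - 1     # modulus for the toy universal hash functions
MAX_HASH = (1 << 32) - 1  # keep the low 32 bits of each permuted value


def sliding_window(seq, width):
    """Return all slices of `width` items, stepping by one; short input yields itself."""
    count = max(len(seq) - width + 1, 1)
    return [seq[i:i + width] for i in range(count)]


def toy_minhash_256(features, seed=0):
    """Generic MinHash over 32-bit integer features, packed into 32 bytes."""
    rng = random.Random(seed)  # fixed seed keeps the hash functions reproducible
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(NUM_HASH_FUNCS)]
    minima = [min(((a * f + b) % PRIME) & MAX_HASH for f in features) for a, b in params]
    bits = 0
    for m in minima:
        bits = (bits << 4) | (m & 0xF)  # keep the 4 least significant bits per minimum
    return bits.to_bytes(32, "big")


def toy_soft_hash_text(text):
    """Sliding window, then crc32 features (stand-in for xxh32), then MinHash."""
    ngrams = sliding_window(text, NGRAM_SIZE)
    features = [zlib.crc32(s.encode("utf-8")) for s in ngrams]
    return toy_minhash_256(features)


def hamming(x, y):
    """Number of differing bits between two equal-length byte strings."""
    return bin(int.from_bytes(x, "big") ^ int.from_bytes(y, "big")).count("1")


a = "thequickbrownfoxjumpsoverthelazydogandthenrunsbackintotheforestbeforenightfall"
b = "thequickbrownfoxjumpsoverthelazycatandthenrunsbackintotheforestbeforenightfall"
c = "loremipsumdolorsitametconsecteturadipiscingelitseddoeiusmodtemporincididuntut"
print(hamming(toy_soft_hash_text(a), toy_soft_hash_text(b)))  # near-duplicate pair
print(hamming(toy_soft_hash_text(a), toy_soft_hash_text(c)))  # unrelated pair
```

The near-duplicate pair shares most of its n-grams, so most of the 64 minima coincide and the bit distance is expected to be much smaller than for the unrelated pair. That is the sense in which the hash is similarity preserving.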
text_collapse(text)
Normalize and simplify text for similarity hashing.
- Decompose with NFD normalization.
- Remove all whitespace characters and convert text to lower case
- Filter control characters, marks (diacritics), and punctuation
- Recombine with NFKC normalization.
Note
See: Unicode normalization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Plain text to be collapsed. | required |
Returns:
Type | Description |
---|---|
str | Collapsed plain text. |
Source code in iscc_core\code_content_text.py
def text_collapse(text):
    # type: (str) -> str
    """
    Normalize and simplify text for similarity hashing.

    - Decompose with NFD normalization.
    - Remove all whitespace characters and convert text to lower case
    - Filter control characters, marks (diacritics), and punctuation
    - Recombine with NFKC normalization.

    !!! note
        See: [Unicode normalization](https://unicode.org/reports/tr15/).

    :param str text: Plain text to be collapsed.
    :return: Collapsed plain text.
    :rtype: str
    """
    # Decompose with NFD
    text = unicodedata.normalize("NFD", text)
    # Remove all whitespace and convert text to lower case
    text = "".join(text.split()).lower()
    # Filter control characters, marks (diacritics), and punctuation
    text = "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ic.core_opts.text_unicode_filter
    )
    # Recombine
    text = unicodedata.normalize("NFKC", text)
    return text
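For experimenting with the collapsing steps without `iscc_core` installed, here is a standalone sketch of the same four steps. The filter set `{"C", "M", "P"}` is an assumption inferred from the description (control characters, marks, punctuation) and stands in for `ic.core_opts.text_unicode_filter`; the name `text_collapse_sketch` is hypothetical.

```python
import unicodedata

# Assumption: major Unicode categories for control (C), mark (M), and punctuation (P)
TEXT_UNICODE_FILTER = frozenset({"C", "M", "P"})


def text_collapse_sketch(text):
    # type: (str) -> str
    """Standalone approximation of text_collapse using only the standard library."""
    # Decompose so that diacritics become separate combining marks
    text = unicodedata.normalize("NFD", text)
    # Remove all whitespace and convert to lower case
    text = "".join(text.split()).lower()
    # Drop characters whose major category is in the filter set
    text = "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in TEXT_UNICODE_FILTER
    )
    # Recombine with compatibility composition
    return unicodedata.normalize("NFKC", text)


collapsed = text_collapse_sketch("  H\u00e9llo,\n W\u00f6rld! ")
print(collapsed)       # helloworld
print(len(collapsed))  # 10
```

The length printed at the end corresponds to the `characters` value that `gen_text_code_v0` reports, since that count is taken after collapsing.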