ISCC - Text Code
A similarity preserving hash for plain-text content (soft hash).
The ISCC Text-Code is generated from plain text that has been extracted from a media asset.
Warning
Plain-text extraction from documents in various formats (especially PDF) may yield very different results depending on the extraction tools being used. The iscc-sdk uses Apache Tika to extract text from documents for Text-Code generation.
Algorithm overview
- Apply text_collapse function to text input
- Count characters of collapsed text
- Apply soft_hash_text_v0 to collapsed text
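The three steps above can be sketched as a small pipeline. This is an outline only, not the iscc-core implementation: the function name text_code_outline, the returned dictionary keys, the two placeholder helpers, and the idea that the bits parameter keeps the leading bytes of the digest are all assumptions made for illustration. The real function also encodes the result as an ISCC string, which is not reproduced here.

```python
import hashlib
from typing import Callable, Dict, Union


def text_code_outline(
    text: str,
    collapse: Callable[[str], str],
    soft_hash: Callable[[str], bytes],
    bits: int = 64,
) -> Dict[str, Union[int, bytes]]:
    """Run the collapse -> count -> hash steps with caller-supplied functions."""
    if bits <= 0 or bits % 8:
        raise ValueError("bits must be a positive multiple of 8")
    collapsed = collapse(text)        # step 1: normalize and simplify the text
    characters = len(collapsed)       # step 2: count characters of collapsed text
    digest = soft_hash(collapsed)     # step 3: hash the collapsed text
    # Assumption: a shorter code keeps only the leading bits of the digest.
    return {"characters": characters, "digest": digest[: bits // 8]}


def placeholder_collapse(text: str) -> str:
    """Hypothetical stand-in for text_collapse: strips whitespace, lowercases."""
    return "".join(text.split()).lower()


def placeholder_hash(text: str) -> bytes:
    """Hypothetical stand-in for soft_hash_text_v0.

    SHA-256 yields 256 bits but is NOT similarity preserving; it only
    fills the slot so that the outline can run.
    """
    return hashlib.sha256(text.encode("utf-8")).digest()


outline = text_code_outline("Hello World", placeholder_collapse, placeholder_hash)
print(outline["characters"])   # 10
print(len(outline["digest"]))  # 8
```

Passing the two steps in as callables keeps the outline independent of any particular normalization or hashing choice.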
gen_text_code_v0(text, bits = ic.core_opts.text_bits)
Create an ISCC Text-Code with algorithm v0.
Note
Any markup (like HTML tags or markdown) should be removed from the plain-text before passing it to this function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text for Text-Code creation | required |
bits | int | Bit-length of ISCC Code Hash (default 64) | ic.core_opts.text_bits |
Returns:
Type | Description |
---|---|
dict | ISCC schema instance with Text-Code and an additional property |
Source code in iscc_core\code_content_text.py
text_collapse(text)
Normalize and simplify text for similarity hashing.
- Decompose with NFD normalization.
- Remove all whitespace characters and convert text to lower case
- Filter control characters, marks (diacritics), and punctuation
- Recombine with NFKC normalization.
Note
See: Unicode normalization.
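The four normalization steps listed above can be approximated with Python's standard unicodedata module. This is a minimal sketch, not the library's source: the name collapse_text_sketch is hypothetical, and filtering by the major Unicode categories C, M, and P is an assumption about how "control characters, marks, and punctuation" are matched.

```python
import unicodedata

# Assumption: filter on major Unicode categories.
# C = control/other, M = marks (diacritics), P = punctuation.
FILTERED_CATEGORIES = frozenset({"C", "M", "P"})


def collapse_text_sketch(text: str) -> str:
    """Approximate the documented collapse steps with the standard library."""
    # 1. Decompose so that diacritics become separate combining marks.
    text = unicodedata.normalize("NFD", text)
    # 2. Remove all whitespace and convert to lower case.
    text = "".join(text.split()).lower()
    # 3. Drop control characters, marks, and punctuation.
    text = "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in FILTERED_CATEGORIES
    )
    # 4. Recombine with compatibility composition.
    return unicodedata.normalize("NFKC", text)


print(collapse_text_sketch("Hello,  Wörld!\n"))  # helloworld
```

Decomposing first is what lets step 3 strip the diaeresis from "ö" as a separate combining mark while keeping the base letter.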
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Plain text to be collapsed. | required |
Returns:
Type | Description |
---|---|
str | Collapsed plain text. |
Source code in iscc_core\code_content_text.py
soft_hash_text_v0(text)
Creates a 256-bit similarity preserving hash for text input with algorithm v0.
- Slide over text with a text_ngram_size wide window and create xxh32 hashes
- Create a minhash_256 from the hashes generated in the previous step.
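The first step, sliding a fixed-width window over the text and hashing each n-gram, can be sketched with the standard library alone. Several things here are assumptions: the helper names sliding_window and ngram_features are hypothetical, the default width of 13 is illustrative rather than a confirmed value of text_ngram_size, and zlib.crc32 is swapped in for xxh32 so that the example needs no third-party package. The minhash step is not reproduced.

```python
import zlib
from typing import Iterator, List


def sliding_window(text: str, width: int) -> Iterator[str]:
    """Yield each `width`-character slice of `text`, advancing one character."""
    if width < 2:
        raise ValueError("width must be at least 2")
    # Text shorter than the window is yielded once as a single short n-gram.
    window_count = max(len(text) - width + 1, 1)
    for start in range(window_count):
        yield text[start : start + width]


def ngram_features(text: str, width: int = 13) -> List[int]:
    """Hash every n-gram to an unsigned 32-bit integer.

    Stand-in: zlib.crc32 replaces xxh32 here, so the values will NOT
    match those of the documented algorithm.
    """
    return [zlib.crc32(ngram.encode("utf-8")) for ngram in sliding_window(text, width)]


print(list(sliding_window("abcdef", 4)))  # ['abcd', 'bcde', 'cdef']
print(len(ngram_features("helloworldhelloworld")))  # 8
```

Because neighbouring windows overlap by all but one character, a small edit to the text changes only a few n-grams, which is what allows the subsequent minhash to preserve similarity.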
Note
Before passing text to this function it must be:
- stripped of markup
- normalized
- stripped of whitespace
- lowercased
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Plain text to be hashed. | required |
Returns:
Type | Description |
---|---|
bytes | 256-bit similarity preserving byte hash. |