ISCC - Simhash

alg_simhash(hash_digests)

Creates a similarity-preserving hash from a sequence of equal-sized hash digests. Each bit of the result is set when at least half of the input digests have that bit set.

Parameters:

Name          Type  Description                               Default
hash_digests  list  A sequence of equally sized byte-hashes.  required

Returns:

Type   Description
bytes  Similarity byte-hash

Source code in iscc_core\simhash.py
from bitarray import bitarray


def alg_simhash(hash_digests):
    # type: (list[bytes]) -> bytes
    """
    Creates a similarity-preserving hash from a sequence of equal-sized hash digests.

    :param list hash_digests: A sequence of equally sized byte-hashes.
    :returns: Similarity byte-hash
    :rtype: bytes
    """

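    # The output size is taken from the first digest; all digests must match it.
    n_bytes = len(hash_digests[0])
    n_bits = n_bytes * 8
    # vector[i] counts how many digests have bit i set.
    vector = [0] * n_bits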

    for digest in hash_digests:
        h = bitarray()
        h.frombytes(digest)
        for i in range(n_bits):
            vector[i] += h[i]

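    # A result bit is set when at least half of the digests have it set.
    minfeatures = len(hash_digests) / 2
    shash = 0

    for i in range(n_bits):
        if vector[i] >= minfeatures:
            # Index 0 maps to the most significant bit of the output.
            shash |= 1 << (n_bits - 1 - i)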

    return shash.to_bytes(n_bytes, "big")
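
The bitwise majority vote performed above can be traced on a small input. The following is a minimal sketch, not the library's code: `simhash_sketch` is a hypothetical stdlib-only helper that reads bits with integer shifts instead of `bitarray`, assuming most-significant-bit-first ordering (the default for `bitarray`). The one-byte digests are made-up illustration values.

```python
def simhash_sketch(hash_digests):
    # type: (list[bytes]) -> bytes
    """Hypothetical stdlib-only illustration of the per-bit majority vote."""
    n_bytes = len(hash_digests[0])
    n_bits = n_bytes * 8
    counts = [0] * n_bits

    for digest in hash_digests:
        value = int.from_bytes(digest, "big")
        for i in range(n_bits):
            # Bit i is counted from the most significant bit.
            counts[i] += (value >> (n_bits - 1 - i)) & 1

    threshold = len(hash_digests) / 2
    result = 0
    for i in range(n_bits):
        if counts[i] >= threshold:
            result |= 1 << (n_bits - 1 - i)

    return result.to_bytes(n_bytes, "big")


# Three made-up one-byte digests, written in binary for readability.
digests = [bytes([0b11110000]), bytes([0b11001100]), bytes([0b10101010])]

# Per-bit counts are 3, 2, 2, 1, 2, 1, 1, 0 and the threshold is 1.5,
# so the bits at positions 0, 1, 2 and 4 are set: 0b11101000.
print(simhash_sketch(digests).hex())  # prints "e8"
```

Because the comparison is `>=`, a tie counts as a set bit: with the two digests `b"\xff"` and `b"\x00"`, every position has a count of 1 against a threshold of 1.0, so the result is `b"\xff"`.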