Hamming distance and pHash with PHP and mySQL

In today’s digital age, the volume of images generated and shared daily is enormous. As such, it becomes crucial to develop mechanisms for efficiently searching, comparing, and managing these images. One such mechanism, image fingerprinting, is a popular technique for identifying, categorising, and comparing digital images. At the heart of image fingerprinting and its comparative functionality is a concept known as the Hamming distance.

What is Image Fingerprinting?

Image fingerprinting, also known as perceptual hashing, is a technique used to identify similarities between different digital images. It involves creating a unique identifier or “fingerprint” for each image. This fingerprint is typically a string of numbers (a hash), generated such that it remains consistent even if the image undergoes minor modifications like resizing or lossy compression.

Hamming Distance

In the realm of digital information, the Hamming distance is a widely used metric to measure the difference between two strings of equal length. It is named after Richard Hamming, an American mathematician and computer scientist.

Simply put, the Hamming distance between two strings is the number of positions at which the corresponding symbols (or bits) are different. For instance, the Hamming distance between the strings “karolin” and “kathrin” is 3 (at positions 4, 5, and 6).

What is Perceptual Image Hashing (pHash)?

Perceptual image hashing involves generating a unique identifier, or ‘hash’, for an image, which is used to compare and find similar images. What makes perceptual hashing special is its robustness against image modifications. This means that even if an image is slightly altered, such as being resized or compressed, its hash remains the same or is very similar. This is different from cryptographic hashes, which drastically change even with minor alterations to the input.

Perceptual hashes are particularly useful in tasks such as image retrieval, deduplication, digital watermarking, and even copyright enforcement.

How Does pHash Work?

Perceptual image hashing generally follows a series of steps to generate a hash that captures the visual features of an image. These steps are designed to ensure that the hash remains invariant to common image modifications. Here’s a basic process for creating a perceptual hash:

  1. Preprocessing: The image is converted to grayscale and resized to a smaller size, often 8×8. This standardizes the image input and reduces the amount of data to process.
  2. Compute the DCT (Discrete Cosine Transform): The DCT separates the image into a collection of frequencies and scalars. The top-left corner represents the lowest frequencies.
  3. Reduce the DCT: Keep the top-left 8×8 pixels from the DCT. These represent the lowest frequencies in the picture.
  4. Compute the hash: Calculate the mean value of these 64 pixels. Assign ‘1’ to pixels with a value greater than or equal to the mean, and ‘0’ to pixels with a value less than the mean. This gives us a 64-bit hash value.

Hamming Distance in Image Fingerprinting

In the context of image fingerprinting, Hamming distance plays a vital role in comparing the fingerprints (hashes) of two images. It provides a simple, efficient means of determining how similar or dissimilar two image hashes are, thereby allowing us to gauge the similarity between the images themselves.

Here’s how it works in a nutshell:

  1. Generate the perceptual hash for each image.
  2. Once you have the hashes for both images, calculate the Hamming distance between them. This involves comparing the hashes bit by bit and counting the number of positions where they differ.
  3. The resulting Hamming distance serves as a measure of similarity between the two images. A smaller Hamming distance indicates greater similarity. In many applications, a Hamming distance of zero to five (out of a possible 64, if using a hash length of 64 bits) is often used to suggest that two images are near duplicates.

Advantages and Limitations of Hamming Distance

The use of Hamming distance in image fingerprinting offers several advantages. It’s computationally lightweight, making it quick and efficient even for large datasets. Additionally, it’s intuitively easy to understand and implement, providing a robust mechanism for image comparison.

However, Hamming distance does have limitations. It’s a binary metric that doesn’t take into account the degree of difference between mismatched bits. Furthermore, it is only applicable when comparing strings or hashes of equal length. This means that the method of hashing used must generate hashes of a consistent length, regardless of the input image size or aspect ratio.

Implementing pHash in PHP

If not already installed, you will need the GD library. You can check if it’s installed using phpinfo() and looking for the GD section. If it’s not installed, you can install it using the command:

cimba:~ $ sudo apt-get install php-gd

We begin by loading an image, converting it to grayscale, and resizing it to an 8×8 pixel image

$imagePath = 'path_to_your_image.jpg';

// Load the image
$image = imagecreatefromjpeg($imagePath);

// Convert image to grayscale
imagefilter($image, IMG_FILTER_GRAYSCALE);

// Resize the image to 8x8
$image = imagescale($image, 8, 8);

// Now $image is a grayscale 8x8 image

Finally, we’ll create a 64-bit hash based on whether each pixel’s color is above or below the average color:

$hash = '';

for ($x = 0; $x < 8; $x++) {
    for ($y = 0; $y < 8; $y++) {
        $pixelColor = imagecolorat($image, $x, $y);
        $hash .= ($pixelColor > $averageColor ? '1' : '0');
    }
}

// $hash is now a 64-bit hash of the image

To compare two images, you can simply compute the Hamming distance between their hashes. The Hamming distance is the number of positions at which the corresponding values are different:

function hammingDistance($hash1, $hash2) {
    $i = 0;
    $distance = 0;
    
    while (isset($hash1[$i])) {
        if ($hash1[$i] != $hash2[$i]) {
            $distance++;
        }
        $i++;
    }
    
    return $distance;
}

// Usage
$distance = hammingDistance($hash1, $hash2);

Real-life implementation of pHash in PHP

To avoid writing all this untested code, and create tests, experiment and eventually succeed you can use existing composer library from Jens Segers called imagehash

Besides just GD library, this one will use Imagick as preferred one.

Install the package by running

cimba:~ $ composer require jenssegers/imagehash

Then we can simply use this code snippet to get hash or hex value of the hash

use Jenssegers\ImageHash\ImageHash;
use Jenssegers\ImageHash\Implementations\DifferenceHash;

$hasher = new ImageHash(new DifferenceHash());
$hash = $hasher->hash('path/to/image.jpg');

return $hash->toInt();

The resulting Hash object, is a Integer image fingerprint that can be stored in your database once calculated. 

The Hash object can return the internal binary hash in a couple of different format:

echo $hash->toHex(); // 7878787c7c707c3c
echo $hash->toBits(); // 0111100001111000011110000111110001111100011100000111110000111100
echo $hash->toInt(); // 8680820757815655484
echo $hash->toBytes(); // "\x0F\x07ƒƒ\x03\x0F\x07\x00"

Choose your preference for storing your hashes in your database. If you want to reconstruct a Hash object from a previous calculated value, use:

$hash = Hash::fromHex('7878787c7c707c3c');
$hash = Hash::fromBin('0111100001111000011110000111110001111100011100000111110000111100');
$hash = Hash::fromInt('8680820757815655484');

Calculate Hamming distance in mySQL

Hamming distance can be computed in MySQL using bitwise operations. However, it’s important to note that MySQL doesn’t natively support calculating the Hamming distance directly. We need to build a function ourselves for this purpose.

Here’s a way to calculate the Hamming distance of two BIGINT values:

DELIMITER $$
CREATE FUNCTION HAMMING_DISTANCE(a BIGINT UNSIGNED, b BIGINT UNSIGNED)
RETURNS INT DETERMINISTIC
BEGIN
	DECLARE dist INT DEFAULT 0;
	DECLARE val BIGINT UNSIGNED;
	
	SET val = a ^ b;
	
	WHILE val != 0 DO
		SET dist = dist + (val & 1);
		SET val = val >> 1;
	END WHILE;
	
	RETURN dist;
END$$
DELIMITER ;

Assuming we have a HAMMING_DISTANCE function created , and we have a table named images with a phash column storing the perceptual hashes of given images, we can do something like the following:

SET @passed_value := 'your_hash_here';

SELECT *, HAMMING_DISTANCE(phash, @passed_value) as distance 
FROM images 
HAVING distance < 5;

Conclusion

This is a very simple implementation of the pHash algorithm in PHP. In practice, you may want to use a more advanced algorithm that considers more features of the image and is more robust to changes in size, aspect ratio, or color balance. Nevertheless, this basic version can still be surprisingly effective for many tasks.

Please note that the performance of mySQL hamming distance calculation can be slower on large datasets, as MySQL is not primarily designed for this kind of binary operation. Consider testing with your dataset to ensure performance is acceptable for your application.


Posted

in

About Nemanja

Full-stack WordPress developer from Belgrade, Serbia.

I can help you with cleaning bugs from your website, build custom WordPress or WooCommerce plugin tailed exactly to your requirement, or migrate your website from one hosting to another.

As an active member of the WordPress community since 2014, I keep involved in all levels of the project, be it helping in the WordPress support forums, organising and speaking and WordPress events, or translating plugins and themes.

Whether you need a custom theme or plugin, third-party API integration, or someone to supercharge a slow WordPress website, you can count on me to find a solution.


Latest posts

Comments

Leave a Reply