A Brief Introduction to Image Embedding
Have you ever wondered how a machine can see an image? How, in today's world, an AI can describe an image when we show it one? Alright fellas, before we dive deep into the AI, how about we learn the basics first?
Introduction
Do and Cheung (2016) define image embedding as a method for computers to map local features from an image to a higher-dimensional representation that is useful for image retrieval. For example, an image of a brick can be embedded as a vector of numbers with a length of 1024, such as [2, 0, 253, …, 253, 253, 255]. Each number in the array corresponds to an attribute or feature of the image: it could be a color, a shape, a texture, or something else. Basically, the vector captures the essence of an image so that we, using a computer, can compare it with other images using mathematical operations.
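To make that "mathematical operations" part concrete, here is a minimal sketch with made-up three-dimensional vectors (real embeddings would have 1024 dimensions): images with similar content should end up with vectors that are close together.
import numpy as np

# Hypothetical toy "embeddings" -- invented for illustration only.
cat_photo = np.array([0.9, 0.1, 0.3])
cat_drawing = np.array([0.8, 0.2, 0.4])
car_photo = np.array([0.1, 0.9, 0.7])

# Euclidean distance: a smaller value means the images are more similar.
print(np.linalg.norm(cat_photo - cat_drawing))  # small distance
print(np.linalg.norm(cat_photo - car_photo))    # large distance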
What’s the Point, Anyway?
Well, if we can convert an image into an array of numbers, your imagination is really the only limit. How about something simple first?
- Image analysis: Image embedding can help us identify patterns or irregularities in a wide range of data, making it easier to perform diagnoses or conduct research.
- Image generation: A vast collection of images produces an immense volume of data (in this case, arrays of numbers). Similar images tend to produce similar patterns in those arrays, so to recreate an image we can find a matching pattern based on its label, with the help of fancy algorithms such as diffusion models. Et voilà! You have an image generator at your disposal.
- Image recommendation: We can use image embeddings to recommend similar products based on the visual features of the items a user is viewing or has purchased (see the sketch right after this list).
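As an illustration of the recommendation idea, here is a hedged sketch: given a catalog of product embeddings, we recommend the item whose embedding is closest to what the user is looking at. The catalog, names, and vectors below are all made up for the example.
import numpy as np

# Hypothetical catalog: product name -> embedding (invented, 4-D for brevity).
catalog = {
    'red_sneaker':  np.array([0.9, 0.1, 0.2, 0.4]),
    'blue_sneaker': np.array([0.7, 0.25, 0.35, 0.55]),
    'leather_boot': np.array([0.1, 0.9, 0.8, 0.2]),
}

def recommend(query_embedding, catalog):
    # Return the catalog item with the smallest Euclidean distance.
    return min(catalog, key=lambda name: np.linalg.norm(catalog[name] - query_embedding))

# The user is viewing something that embeds close to a red sneaker.
viewing = np.array([0.85, 0.15, 0.25, 0.45])
print(recommend(viewing, catalog))  # -> 'red_sneaker'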
Sounds Cool, Now How Can We Do It?
To make things easier, you can open this notebook, which demonstrates how to use the MediaPipe Tasks Python API to compare two separate image files and determine their similarity. The similarity values range from -1 to 1, with 1 indicating identical images. This is achieved through a technique known as cosine similarity. Let’s get started!
Preparation
You can start by installing the necessary dependencies for your project. Here is the command to install MediaPipe:
!pip install -q mediapipe
The next step is to download the off-the-shelf model that will be used for image embedding. In this example, you will use MobileNet, but you may choose any other suitable model or a model you have built for your specific use cases with MediaPipe Tasks. Here’s how you can download and set up MobileNet for image embedding:
!wget -O embedder.tflite -q https://storage.googleapis.com/mediapipe-models/image_embedder/mobilenet_v3_small/float32/1/mobilenet_v3_small.tflite
For the final preparation step, you’ll need two separate images for comparison. You can use the following code to download the provided images.
import urllib.request

IMAGE_FILENAMES = ['burger.jpg', 'burger_crop.jpg']

# Download each sample image from the MediaPipe assets bucket.
for name in IMAGE_FILENAMES:
    url = f'https://storage.googleapis.com/mediapipe-assets/{name}'
    urllib.request.urlretrieve(url, name)
Alternatively, you can choose your own images from another source. Let’s say I want to find the similarity between me and Joey Brr (I wouldn’t be surprised if we were similar, duh).
Performing Image Embedding
Now that you have retrieved the two images for comparison, you can view them to ensure they appear as expected.
import cv2
from google.colab.patches import cv2_imshow
import math
import numpy as np
import matplotlib.pyplot as plt

DESIRED_HEIGHT = 480
DESIRED_WIDTH = 480

def resize_and_show(image):
    # Scale the image so its longer side matches the desired size.
    h, w = image.shape[:2]
    if h < w:
        img = cv2.resize(image, (DESIRED_WIDTH, math.floor(h/(w/DESIRED_WIDTH))))
    else:
        img = cv2.resize(image, (math.floor(w/(h/DESIRED_HEIGHT)), DESIRED_HEIGHT))
    cv2_imshow(img)

# Preview the images.
images = {name: cv2.imread(name) for name in IMAGE_FILENAMES}
for name, image in images.items():
    print(name)
    resize_and_show(image)
Now, you should see two separate images.
Image embeddings are numerical representations of images encoded into lower-dimensional vectors. Once everything is set up, you can start performing inference. Begin by creating the necessary options to associate your model with the Image Embedder, along with some customizations.
Next, create the Image Embedder and format your two images for MediaPipe to enable the use of cosine similarity for comparison.
Finally, you can display the cosine similarity value.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Create options for Image Embedder
base_options = python.BaseOptions(model_asset_path='embedder.tflite')
l2_normalize = True #@param {type:"boolean"}
quantize = True #@param {type:"boolean"}
options = vision.ImageEmbedderOptions(
    base_options=base_options, l2_normalize=l2_normalize, quantize=quantize)

# Create Image Embedder
with vision.ImageEmbedder.create_from_options(options) as embedder:
    # Format images for MediaPipe
    first_image = mp.Image.create_from_file(IMAGE_FILENAMES[0])
    second_image = mp.Image.create_from_file(IMAGE_FILENAMES[1])
    first_embedding_result = embedder.embed(first_image)
    second_embedding_result = embedder.embed(second_image)

    # Calculate and print similarity
    similarity = vision.ImageEmbedder.cosine_similarity(
        first_embedding_result.embeddings[0],
        second_embedding_result.embeddings[0])
    print(similarity)
Guess what the output is…
Well, well, well. I guess we were nothing alike at all.
Deep Dive Into Cosine Similarity and Image Embedding
The formula for cosine similarity between two vectors A and B is cos(θ) = (A · B) / (‖A‖‖B‖): the dot product of the vectors divided by the product of their magnitudes. Cosine similarity is a measure used to determine how similar two vectors are, and it is often used in various fields such as information retrieval, text mining, and recommendation systems.
Cosine similarity is a useful metric for comparing the orientation of two vectors, irrespective of their magnitude. It finds applications in various domains where similarity or correlation between datasets is of interest.
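To demystify the number MediaPipe printed, here is a minimal sketch of computing cosine similarity by hand with NumPy. This is my own re-implementation for illustration, not MediaPipe's internal code.
import numpy as np

def manual_cosine_similarity(a, b):
    # cos(theta) = (A · B) / (||A|| * ||B||)
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Identical directions give 1, orthogonal give 0, opposite give -1.
print(manual_cosine_similarity([1, 2, 3], [1, 2, 3]))  # 1.0
print(manual_cosine_similarity([1, 0], [0, 1]))        # 0.0
print(manual_cosine_similarity([1, 2], [-1, -2]))      # -1.0
Now, let’s look at the actual embedding arrays that MediaPipe produced.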
# Assume the embedding results are already defined as above
# Accessing the embedding array
embedding_array_1 = first_embedding_result.embeddings[0].embedding
embedding_array_2 = second_embedding_result.embeddings[0].embedding
# Display embedding values
print(embedding_array_1)
print(embedding_array_2)
This code retrieves the first embedding from each of the two results returned by the model and stores them as arrays of numbers in the variables embedding_array_1 and embedding_array_2. The last two lines print out the contents of these arrays so you can inspect them or use them further in your program.
Now that we can see the arrays, let’s check their length, shall we?
print(len(embedding_array_1))
print(len(embedding_array_2))
Oh, the lengths are 1024, which means each embedding can be reshaped into a 32x32 matrix (since 32 × 32 = 1024) for visualization. Now we beautify them, because why not.
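The plotting code below refers to embedding_matrix_1 and embedding_matrix_2; here is a minimal sketch of building them by reshaping each 1024-length array into a 32x32 NumPy matrix (this reshape step is an assumption on my part, as it was left implicit above).
# Reshape each 1024-length embedding into a 32x32 matrix for plotting.
embedding_matrix_1 = np.array(embedding_array_1).reshape(32, 32)
embedding_matrix_2 = np.array(embedding_array_2).reshape(32, 32)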
Finally, we plot each matrix to see the image representation as a grayscale heat map with a color bar, using this code
# Plotting the matrix
plt.imshow(embedding_matrix_1, cmap='gray', interpolation='nearest')
plt.colorbar() # Add a colorbar to show the intensity scale
plt.show()
and this
# Plotting the matrix
plt.imshow(embedding_matrix_2, cmap='gray', interpolation='nearest')
plt.colorbar() # Add a colorbar to show the intensity scale
plt.show()
Here are the representations, for Joey Brr and my face respectively.
Conclusion
In this blog post, we explored the concept of image embeddings, discussed their significance, and demonstrated how to generate them using Python alongside open-source tools such as MediaPipe, OpenCV, and MobileNet. Our aim was to provide an educational, comprehensible, and practical guide for you.