Zero-Shot Learning Explained Simply & How to use Zero-Shot Learning in Computer Vision

Zero-shot learning enables AI models to recognize unseen classes without prior training. This guide simplifies the concept and demonstrates practical applications in computer vision tasks.

Sean Dorje

Sep 26, 2023

4 min read

Imagine being able to identify a cat in a picture without ever having seen one before, solely based on a description you've read. This idea forms the basis of "zero-shot learning" in computer vision—a revolutionary technique that allows models to recognize objects without prior direct training on them.

In this guide tailored for beginners, we delve into:

  • The essence of Zero-Shot Learning in Computer Vision.

  • The mechanics behind Zero-Shot Learning.

  • Practical steps on how to utilize zero-shot learning in computer vision applications.

What is Zero-shot learning?

Zero-shot learning in computer vision is like recognizing something you've never seen before just from a description. Instead of relying on vast amounts of data for every new object, zero-shot learning allows models to understand and categorize new, unseen objects without specific training on them. It's all about making connections between what the model already knows and the new information it encounters.

At its core, zero-shot learning is a form of transfer learning, which involves taking knowledge from one task and applying it to another. Think of it like learning to ride a scooter and using that skill to learn skateboarding faster.

In the realm of computer vision, OpenAI's CLIP is a standout example of leveraging zero-shot learning to identify images.

Why does Zero-shot learning matter in computer vision?

Imagine wanting to identify something rare, like a unique dog breed, without having heaps of pictures of that specific breed. Zero-shot learning (ZSL) makes this possible. Instead of needing countless labeled examples of every object, ZSL lets computer vision systems recognize things they haven't been explicitly trained on.

Here's why it's a game-changer:

  • Saves Time and Money: Data labeling can be slow and pricey, especially when experts, like doctors for medical images, are needed.

  • Tackles Rare Items: For objects that are seldom seen, like specific defects in products, gathering enough data is tough. ZSL steps in here.

  • Handles Complexity: Think about the many dog breeds. Traditional supervised learning might struggle without labeled examples of every breed, but ZSL can discern even subtle differences.

  • Broad Applications: ZSL isn't just about identifying images. It aids in tracking objects, understanding language, adapting artistic styles, and much more.

In essence, zero-shot learning is invaluable when detailed labeled data is hard to get or just doesn't exist. It's like teaching a computer to make educated guesses, and often, those guesses are spot on.

How does Zero-Shot Learning Work?

Zero-Shot Learning (ZSL) is a transformative approach in machine learning, enabling a model to recognize and classify categories it's never seen during training.

Seen and Unseen Classes: At its core, ZSL operates with categories the model has been trained on (Seen Classes) and those it hasn't (Unseen Classes).

Auxiliary Information: This is the critical bridge. It's often semantic or textual data that helps the model associate its prior knowledge with new, unfamiliar categories.

How Does ZSL Function?

Two-Stage Process:

  1. Training: The model is introduced to a labeled set of known data samples.

  2. Inference: Using the knowledge gained and the auxiliary information, the model attempts to decipher and classify new categories.

Fun fact: Humans excel at zero-shot learning because of our extensive language knowledge. We use it to describe unseen objects and connect them to known visual concepts. Similarly, ZSL in computers relies on a labeled training set of seen classes, plus auxiliary descriptions of both seen and unseen classes. All of these classes are represented in a shared high-dimensional vector space called the semantic space, where knowledge from familiar categories can be transferred to unfamiliar ones.
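To make the idea of a semantic space more concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how CLIP or any particular library works internally: the three-dimensional attribute vectors, the class names, the `classify` helper, and the softmax temperature are all assumptions made up for this example. The point is only that an image mapped into the same space as the class descriptions can be matched to a class that had no training images.

```python
import math

# Hypothetical 3-dimensional semantic space: [has_stripes, has_mane, lives_in_water].
# Real systems use learned embeddings with hundreds of dimensions.
CLASS_DESCRIPTIONS = {
    "horse": [0.0, 1.0, 0.0],  # seen during training
    "tiger": [1.0, 0.0, 0.0],  # seen during training
    "zebra": [1.0, 1.0, 0.0],  # unseen: known only through its description
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values, temperature=0.1):
    """Turn similarity scores into probabilities that sum to 1."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_embedding, class_descriptions):
    """Rank every class, seen or unseen, by similarity to the image embedding."""
    labels = list(class_descriptions)
    similarities = [
        cosine_similarity(image_embedding, class_descriptions[label])
        for label in labels
    ]
    probabilities = softmax(similarities)
    ranked = sorted(zip(probabilities, labels), reverse=True)
    return [{"score": score, "label": label} for score, label in ranked]


# Pretend an image encoder mapped a zebra photo to this point in the space.
zebra_photo_embedding = [0.9, 0.8, 0.1]
results = classify(zebra_photo_embedding, CLASS_DESCRIPTIONS)
print(results[0]["label"])
```

In this toy setup the zebra description ends up closest to the photo's embedding, even though "zebra" is the class without training examples. The scores are probabilities over the listed classes only, so they always sum to 1.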

Methods in ZSL:

  • Classifier-based Approaches: Here, the model is trained to sort and classify using known data. Through various methods, it learns to map this understanding to unknown categories.

  • Instance-based Approaches: Instead of broad categories, the model focuses on specific examples. It might borrow knowledge from known categories or even create synthetic examples to understand the new ones.
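As a rough illustration of the instance-based idea of borrowing knowledge from known categories, the sketch below builds a stand-in visual prototype for an unseen class by averaging the prototypes of seen classes, weighted by how similar their descriptions are. Everything here is hypothetical: the attribute vectors, the two-dimensional visual features, and the helper names `synthesize_prototype` and `nearest_class` are invented for this example, and real methods are considerably more sophisticated.

```python
import math

# Hypothetical attribute vectors: [has_stripes, has_mane].
ATTRIBUTES = {
    "horse": [0.0, 1.0],
    "tiger": [1.0, 0.0],
    "zebra": [1.0, 1.0],  # unseen class: we only have its description
}

# Made-up average visual features for the seen classes only:
# [stripe_texture, horse_like_shape].
SEEN_PROTOTYPES = {
    "horse": [0.1, 0.9],
    "tiger": [0.9, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def synthesize_prototype(unseen_label):
    """Average the seen prototypes, weighted by semantic similarity."""
    weights = {
        seen: cosine(ATTRIBUTES[unseen_label], ATTRIBUTES[seen])
        for seen in SEEN_PROTOTYPES
    }
    total = sum(weights.values())
    dims = len(next(iter(SEEN_PROTOTYPES.values())))
    return [
        sum(weights[seen] * SEEN_PROTOTYPES[seen][d] for seen in SEEN_PROTOTYPES) / total
        for d in range(dims)
    ]


def nearest_class(features, prototypes):
    """Pick the class whose prototype is closest in visual feature space."""
    return min(prototypes, key=lambda label: math.dist(features, prototypes[label]))


prototypes = dict(SEEN_PROTOTYPES)
prototypes["zebra"] = synthesize_prototype("zebra")

# Made-up features for a new image with strong stripes and a horse-like shape.
predicted = nearest_class([0.8, 0.8], prototypes)
print(predicted)
```

Because the zebra description is equally similar to "horse" and "tiger" in this toy data, its synthesized prototype lands halfway between them, and the striped, horse-shaped image is matched to it.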

In essence, ZSL allows a computer vision model to make educated guesses about new data, similar to how we humans might describe something we've never seen based on what we know.

How to use Zero-Shot Learning in Computer Vision Applications

Here is a straightforward, step-by-step guide to using zero-shot learning for use cases like image classification. We’re going to use OpenAI's CLIP model.

Setting Up the Pipeline:

Firstly, you'll need to import the necessary functions and set up the zero-shot image classification pipeline. Here's how:


from transformers import pipeline

# There are various models available in the model hub.
model_name = "openai/clip-vit-large-patch14-336"
classifier = pipeline("zero-shot-image-classification", model=model_name)


Classifying Images:

Once your pipeline is set, you can classify images into any specified categories. And yes, you can specify multiple class labels!

image_to_classify = "path_to_cat_and_dog_image.jpeg"
labels_for_classification = ["cat and dog",
                             "lion and cheetah",
                             "rabbit and lion"]
scores = classifier(image_to_classify,
                    candidate_labels=labels_for_classification)


Interpreting the Results:

After the classification, you'll get a list of dictionaries with the results. For instance:

[{'score': 0.9950482249259949, 'label': 'cat and dog'},
 {'score': 0.004863627254962921, 'label': 'rabbit and lion'},
 {'score': 8.816882473183796e-05, 'label': 'lion and cheetah'}]

The top result (or the first dictionary in the list) gives you the label with the highest likelihood. You can display this as:

# The results are sorted by score, so the first entry is the best match.
top_result = scores[0]
print(f"The highest probability is {top_result['score']:.3f} "
      f"for the label {top_result['label']}")


This would output:

The highest probability is 0.995 for the label cat and dog
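One thing to keep in mind is that the scores are relative to the candidate labels you supplied, so a high top score only means "best among these options". A small helper like the one below can make that explicit by refusing to answer when no label is convincing. This is just a sketch: the name `top_prediction` and the 0.5 threshold are assumptions chosen for illustration, and it works on the list-of-dictionaries format shown above.

```python
def top_prediction(results, min_score=0.5):
    """Return the best (label, score) pair, or None if nothing is convincing."""
    if not results:
        return None
    best = max(results, key=lambda result: result["score"])
    if best["score"] < min_score:
        return None
    return best["label"], best["score"]


# The example output from above, reused as input.
example_results = [
    {"score": 0.9950482249259949, "label": "cat and dog"},
    {"score": 0.004863627254962921, "label": "rabbit and lion"},
    {"score": 8.816882473183796e-05, "label": "lion and cheetah"},
]

prediction = top_prediction(example_results)
if prediction is None:
    print("No label was a confident match")
else:
    label, score = prediction
    print(f"Best match: {label} ({score:.3f})")
```

A sensible threshold depends on how many candidate labels you pass in and how similar they are to each other, so treat 0.5 as a starting point to tune.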

With these steps, you're well on your way to harnessing the power of zero-shot learning in your computer vision applications!

However, if you want to skip the complicated steps and additional fine-tuning, you can simply use ezML’s API. We’ve designed our platform specifically to accommodate zero-shot learning.

Transform Your Business with Computer Vision

Experience the benefits of our advanced computer vision solutions.
