Google's new AI searches photos, videos, and audio all at once
Gemini Embedding 2 is Google's first AI model that can search across text, images, video, audio, and PDFs simultaneously — with a free tier and support for 100+ languages.
Google quietly released Gemini Embedding 2 Preview — the first AI model that can search across text, photos, videos, audio, and PDF documents all at once. Instead of building separate search systems for each type of content, this single model understands them all in the same "language."
Think of it this way: you could search your company's files by typing "quarterly revenue chart" and the AI would find matching photos, slides, video clips, and documents — even if none of them contain those exact words.
How "search everything at once" actually works
Traditional search matches keywords. AI-powered search works differently — it converts content into embeddings (long lists of numbers that capture the meaning of something). When two items have similar numbers, they're related — even if they look completely different on the surface.
What makes Gemini Embedding 2 special is that it puts all content types into the same number space. A photo of a sunset, a video of a sunset, and the words "beautiful sunset" all get similar numbers. This means you can search across every type of file with a single query.
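The "similar numbers" idea can be made concrete with a few lines of Python. This is a toy sketch: the 4-number vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), but the comparison function, cosine similarity, is the standard way search systems measure how related two embeddings are.

```python
import math

def cosine_similarity(a, b):
    """How aligned two embeddings are: close to 1.0 means related content."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings, invented for illustration only
sunset_photo = [0.9, 0.1, 0.3, 0.0]   # pretend this came from a photo
sunset_text  = [0.8, 0.2, 0.4, 0.1]   # pretend this came from the words "beautiful sunset"
invoice_pdf  = [0.0, 0.9, 0.0, 0.8]   # pretend this came from an unrelated PDF

print(cosine_similarity(sunset_photo, sunset_text))  # high score: related
print(cosine_similarity(sunset_photo, invoice_pdf))  # low score: unrelated
```

Because the model places all content types in one space, this same comparison works between a photo and a sentence, not just between two pieces of text.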
What it handles
📝 Text — up to 8,192 tokens (roughly 6,000 words), in 100+ languages
🖼️ Images — up to 6 photos per request (PNG, JPEG)
🎵 Audio — up to 80 seconds (MP3, WAV)
🎬 Video — up to 128 seconds (MP4, MOV)
📄 PDFs — up to 6 pages per request
The numbers: how it compares
Google's previous embedding model (gemini-embedding-001) was text-only with a 2,048-token limit. The new model is a massive leap:
4x more text capacity — from 2,048 to 8,192 tokens
5 content types — up from text-only
Flexible precision — adjust from broad scanning (128 dimensions) to pinpoint accuracy (3,072 dimensions)
MTEB score of 68.16 — a standard AI search quality benchmark, competitive with top models
The clever part: Google uses a technique called Matryoshka Representation Learning (named after Russian nesting dolls). You can truncate an embedding to fewer dimensions for cheaper storage and faster search without losing much quality: a 768-dimension embedding scores 67.99, barely below the full 3,072-dimension score of 68.16.
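In practice, Matryoshka-style shrinking just means keeping the first N values of the embedding and re-normalizing to unit length. Here is a minimal sketch with a made-up 8-dimension vector standing in for a real embedding:

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style shrink: keep the first `dims` values, re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4, 0.0, 0.6]  # toy stand-in for a full embedding
small = truncate_embedding(full, 4)

print(len(small))                  # 4 dimensions instead of 8
print(sum(x * x for x in small))   # still unit length, so similarity math keeps working
```

The training technique packs the most important information into the earliest dimensions, which is why simply cutting the tail loses so little quality.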
What it costs (there's a free tier)
Google offers a free tier for development and testing. For production use, pricing per 1 million tokens:
📝 Text: $0.20 per 1M tokens
🖼️ Images: $0.45 per 1M tokens (~$0.00012 per image)
🎵 Audio: $6.50 per 1M tokens (~$0.00016 per second)
🎬 Video: $12.00 per 1M tokens (~$0.00079 per frame)
For context: embedding a thousand product photos costs roughly 12 cents. Making your entire photo library AI-searchable is surprisingly affordable.
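The article's per-item figures make budgeting straightforward. This back-of-envelope calculator just multiplies out the approximate rates quoted above; the constants are the article's estimates, not official prices:

```python
# Approximate per-item rates from the pricing list above (estimates, not official prices)
COST_PER_IMAGE = 0.00012         # ~$0.45 per 1M tokens
COST_PER_AUDIO_SECOND = 0.00016  # ~$6.50 per 1M tokens
COST_PER_VIDEO_FRAME = 0.00079   # ~$12.00 per 1M tokens

def estimate_cost(images=0, audio_seconds=0, video_frames=0):
    """Rough dollar cost of embedding a batch of mixed media."""
    return (images * COST_PER_IMAGE
            + audio_seconds * COST_PER_AUDIO_SECOND
            + video_frames * COST_PER_VIDEO_FRAME)

print(f"${estimate_cost(images=1000):.2f}")  # 1,000 product photos: about $0.12
```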
Try it yourself
If you're a developer, you can start with a simple Python script:
```shell
pip install google-genai
```

```python
from google import genai
from google.genai import types

client = genai.Client()

# Embed text
result = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents='What is the meaning of life?'
)

# Embed an image
with open('photo.png', 'rb') as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents=[types.Part.from_bytes(
        data=image_bytes,
        mime_type='image/png'
    )]
)
```
You can even combine text and images in a single embedding — useful for things like social media posts with photos, where both the caption and the image carry meaning.
Who this is built for
App builders: If you're creating a search feature — for a photo library, knowledge base, or e-commerce catalog — this model handles every content type in one API call.
Companies with messy archives: Got thousands of PDFs, training videos, and product photos scattered across drives? This model can make all of it searchable from one search bar.
AI developers building RAG systems: RAG (Retrieval-Augmented Generation — the technique that lets chatbots answer questions using your own documents) now works across all media types, not just text.
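The retrieval half of a RAG pipeline is just "embed everything once, then rank by similarity at query time." This sketch uses tiny made-up vectors in place of real model output; in a real system each embedding would come from an `embed_content` call, and the top results would be handed to the chatbot as context:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend index of (file, embedding) pairs covering mixed media.
# The vectors are invented for illustration; real ones come from the model.
index = [
    ('q3-revenue-chart.png', [0.9, 0.1, 0.2]),
    ('onboarding-video.mp4', [0.1, 0.8, 0.3]),
    ('travel-policy.pdf',    [0.2, 0.2, 0.9]),
]

def retrieve(query_embedding, k=1):
    """Return the k files most similar to the query; these feed the chatbot's context."""
    ranked = sorted(index, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

query = [0.85, 0.15, 0.25]  # toy embedding for "quarterly revenue chart"
print(retrieve(query))      # the chart image ranks first, even though it contains no text
```

Because every file type lives in the same embedding space, the index can mix images, videos, and PDFs without any per-type special casing.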
One important caveat
If you already use Google's text-only embedding model, you can't mix old and new embeddings. The models use different number systems, so you'll need to re-process your existing content. Google is upfront about this — it's a one-time migration cost for a much more capable system.
Full documentation is available on the Google AI for Developers site. The model is in preview status, with general availability expected later this year.