Google's new AI searches photos, videos, and audio all at once
Gemini Embedding 2 is Google's first AI model that can search across text, images, video, audio, and PDFs simultaneously — with a free tier and support for 100+ languages.
Google quietly released Gemini Embedding 2 Preview — the first AI model that can search across text, photos, videos, audio, and PDF documents all at once. Instead of building separate search systems for each type of content, this single model understands them all in the same "language."
Think of it this way: you could search your company's files by typing "quarterly revenue chart" and the AI would find matching photos, slides, video clips, and documents — even if none of them contain those exact words.
How "search everything at once" actually works
Traditional search matches keywords. AI-powered search works differently — it converts content into embeddings (long lists of numbers that capture the meaning of something). When two items have similar numbers, they're related — even if they look completely different on the surface.
What makes Gemini Embedding 2 special is that it puts all content types into the same number space. A photo of a sunset, a video of a sunset, and the words "beautiful sunset" all get similar numbers. This means you can search across every type of file with a single query.
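The "similar numbers" idea can be made concrete with a few lines of Python. This is a toy sketch: the 4-number vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), but the comparison function, cosine similarity, is the standard way search systems measure how related two embeddings are.

```python
import math

def cosine_similarity(a, b):
    """How aligned two embeddings are: close to 1.0 means related content."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings, invented for illustration only
sunset_photo = [0.9, 0.1, 0.3, 0.0]   # pretend this came from a photo
sunset_text  = [0.8, 0.2, 0.4, 0.1]   # pretend this came from the words "beautiful sunset"
invoice_pdf  = [0.0, 0.9, 0.0, 0.8]   # pretend this came from an unrelated PDF

print(cosine_similarity(sunset_photo, sunset_text))  # high score: related
print(cosine_similarity(sunset_photo, invoice_pdf))  # low score: unrelated
```

Because the model places all content types in one space, this same comparison works between a photo and a sentence, not just between two pieces of text.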
What it handles
📝 Text — up to 8,192 tokens (roughly 6,000 words), in 100+ languages
🖼️ Images — up to 6 photos per request (PNG, JPEG)
🎵 Audio — up to 80 seconds (MP3, WAV)
🎬 Video — up to 128 seconds (MP4, MOV)
📄 PDFs — up to 6 pages per request
The numbers: how it compares
Google's previous embedding model (gemini-embedding-001) was text-only with a 2,048-token limit. The new model is a massive leap:
4x more text capacity — from 2,048 to 8,192 tokens
5 content types — up from text-only
Flexible precision — adjust from broad scanning (128 dimensions) to pinpoint accuracy (3,072 dimensions)
MTEB score of 68.16 — a standard AI search quality benchmark, competitive with top models
The clever part: Google uses a technique called Matryoshka Representation Learning (named after Russian nesting dolls). You can truncate an embedding to fewer dimensions for cheaper storage and faster search without losing much quality: a 768-dimension embedding scores 67.99, barely below the full 3,072-dimension score of 68.16.
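In practice, Matryoshka-style shrinking just means keeping the first N values of the embedding and re-normalizing to unit length. Here is a minimal sketch with a made-up 8-dimension vector standing in for a real embedding:

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style shrink: keep the first `dims` values, re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4, 0.0, 0.6]  # toy stand-in for a full embedding
small = truncate_embedding(full, 4)

print(len(small))                  # 4 dimensions instead of 8
print(sum(x * x for x in small))   # still unit length, so similarity math keeps working
```

The training technique packs the most important information into the earliest dimensions, which is why simply cutting the tail loses so little quality.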
What it costs (there's a free tier)
Google offers a free tier for development and testing. For production use, pricing per 1 million tokens:
📝 Text: $0.20 per 1M tokens
🖼️ Images: $0.45 per 1M tokens (~$0.00012 per image)
🎵 Audio: $6.50 per 1M tokens (~$0.00016 per second)
🎬 Video: $12.00 per 1M tokens (~$0.00079 per frame)
For context: embedding a thousand product photos costs roughly 12 cents. Making your entire photo library AI-searchable is surprisingly affordable.
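The article's per-item figures make budgeting straightforward. This back-of-envelope calculator just multiplies out the approximate rates quoted above; the constants are the article's estimates, not official prices:

```python
# Approximate per-item rates from the pricing list above (estimates, not official prices)
COST_PER_IMAGE = 0.00012         # ~$0.45 per 1M tokens
COST_PER_AUDIO_SECOND = 0.00016  # ~$6.50 per 1M tokens
COST_PER_VIDEO_FRAME = 0.00079   # ~$12.00 per 1M tokens

def estimate_cost(images=0, audio_seconds=0, video_frames=0):
    """Rough dollar cost of embedding a batch of mixed media."""
    return (images * COST_PER_IMAGE
            + audio_seconds * COST_PER_AUDIO_SECOND
            + video_frames * COST_PER_VIDEO_FRAME)

print(f"${estimate_cost(images=1000):.2f}")  # 1,000 product photos: about $0.12
```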
Try it yourself
If you're a developer, you can start with a simple Python script:
```shell
pip install google-genai
```

```python
from google import genai
from google.genai import types

client = genai.Client()

# Embed text
result = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents='What is the meaning of life?'
)

# Embed an image
with open('photo.png', 'rb') as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model='gemini-embedding-2-preview',
    contents=[types.Part.from_bytes(
        data=image_bytes,
        mime_type='image/png'
    )]
)
```
You can even combine text and images in a single embedding — useful for things like social media posts with photos, where both the caption and the image carry meaning.
Who this is built for
App builders: If you're creating a search feature — for a photo library, knowledge base, or e-commerce catalog — this model handles every content type in one API call.
Companies with messy archives: Got thousands of PDFs, training videos, and product photos scattered across drives? This model can make all of it searchable from one search bar.
AI developers building RAG systems: RAG (Retrieval-Augmented Generation — the technique that lets chatbots answer questions using your own documents) now works across all media types, not just text.
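The retrieval half of a RAG pipeline is just "embed everything once, then rank by similarity at query time." This sketch uses tiny made-up vectors in place of real model output; in a real system each embedding would come from an `embed_content` call, and the top results would be handed to the chatbot as context:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend index of (file, embedding) pairs covering mixed media.
# The vectors are invented for illustration; real ones come from the model.
index = [
    ('q3-revenue-chart.png', [0.9, 0.1, 0.2]),
    ('onboarding-video.mp4', [0.1, 0.8, 0.3]),
    ('travel-policy.pdf',    [0.2, 0.2, 0.9]),
]

def retrieve(query_embedding, k=1):
    """Return the k files most similar to the query; these feed the chatbot's context."""
    ranked = sorted(index, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

query = [0.85, 0.15, 0.25]  # toy embedding for "quarterly revenue chart"
print(retrieve(query))      # the chart image ranks first, even though it contains no text
```

Because every file type lives in the same embedding space, the index can mix images, videos, and PDFs without any per-type special casing.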
One important caveat
If you already use Google's text-only embedding model, you can't mix old and new embeddings. The models use different number systems, so you'll need to re-process your existing content. Google is upfront about this — it's a one-time migration cost for a much more capable system.
Full documentation is available on the Google AI for Developers site. The model is in preview status, with general availability expected later this year.