China Is Quietly Stealing From Other AI Models

June 4, 2025 Chad GPT

DeepSeek May Have Just Pulled an AI Heist on Google

When a new AI model drops with suspiciously “too good to be true” performance, the internet takes notice. And when that model might be riding on the back of one of Google’s most prized (and closed) assets? Well, that’s when things get spicy.

Let’s talk about DeepSeek-VL, the new vision-language model from China-based DeepSeek. It claims OpenAI-level performance. But the big AI nerds over on Hugging Face noticed something a little… off. Specifically, that it might’ve been trained on outputs scraped from Google’s ultra-proprietary Gemini model, or even on its weights. Awkward.

🚧 Wait—What’s DeepSeek Again?

DeepSeek is a Chinese research group making waves in the open-source AI world. Think of them as the mysterious new student in school who suddenly aces every test, throws down dunks in gym class, and claims they “just studied a lot.”

Their newest model, DeepSeek-VL, is what we call a multimodal model—it can interpret both images and text. Think ChatGPT meets Google Lens. It was released with the bold claim of achieving top-tier performance on popular benchmarks like MathVista and AI2D.

But when curious developers poked around the model, something started to smell like last week’s sushi.

🔍 So, What’s the Evidence?

This all started when users on Hugging Face and GitHub ran DeepSeek-VL through its paces. Turns out, DeepSeek-VL’s responses—particularly for visual reasoning—were suspiciously identical to Gemini 1.5 Pro’s outputs.

We’re talking word-for-word answers on complex image-text tasks. Not just similar formats. The exact same phrasing and logic. That’s not “great minds think alike”—that’s “we copied your homework and didn’t even change the font.”

The community did a little test. They compared answers from Gemini 1.5 Pro and DeepSeek-VL on obscure multimodal benchmarks. Results? Nearly indistinguishable in language and reasoning style.
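
Curious what a comparison like that could look like? Here's a minimal sketch in Python, assuming you already have two answers saved as plain text. It scores how many three-word phrases the two answers share (a Jaccard overlap). The function names and sample sentences are made up for illustration, and this is not the method the community actually used. It's just one simple way to put a number on "suspiciously similar."

```python
import re


def ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_overlap(a, b, n=3):
    """Share of n-grams the two texts have in common, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both texts too short to form any n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical answers from two models to the same image question.
answer_a = "the chart shows sales rising in march"
answer_b = "the chart shows sales falling in march"
score = jaccard_overlap(answer_a, answer_b)
print(f"overlap: {score:.2f}")
```

A score near 1.0 means near-identical wording. One score on one answer proves nothing, though, so you'd want to run this across many prompts and compare against a few unrelated models as a baseline.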

Even worse (for DeepSeek), the timing of the release lines up with Google’s own timeline for Gemini 1.5 access. Coincidence? Eh. Maybe. But probably not.

🧠 How Could They Have Done It?

Let’s explore the two spicy theories:

Training on Gemini Outputs:
If DeepSeek-VL was trained on a dataset that included lots of Gemini 1.5 Pro responses, it might’ve learned to mimic Gemini extremely closely. This practice is usually called distillation. It wouldn’t technically be “stealing weights,” but it’s still a no-no in AI land, especially if the data was scraped without permission.

Flat-Out Weight Theft:
The wilder theory: some are speculating DeepSeek somehow got hold of Gemini’s model weights. This would be like someone breaking into a Michelin-star kitchen, stealing the recipe, and starting their own restaurant across the street.
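
For the first theory, here's a rough sketch of what "training on another model's outputs" means in practice: you collect prompts, record the teacher model's answers, and save the pairs as fine-tuning data. Everything here is hypothetical, including the function names and the stand-in teacher, which just returns a canned string instead of calling any real API. Nothing is known about DeepSeek's actual pipeline.

```python
import json


def build_distillation_records(prompts, teacher_answer):
    """Pair each prompt with whatever the teacher model says in response."""
    return [{"prompt": p, "response": teacher_answer(p)} for p in prompts]


def to_jsonl(records):
    """Serialize records as JSON Lines, a common format for fine-tuning data."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt):
    """Stand-in for a real model call (assumption: no real API is used here)."""
    return f"[teacher answer to: {prompt}]"


records = build_distillation_records(
    ["What is in this image?", "Summarize this chart."],
    fake_teacher,
)
print(to_jsonl(records))
```

A student model fine-tuned on enough pairs like these tends to pick up the teacher's phrasing and habits, which is why matching outputs alone can't tell you which of the two theories is true.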

⚖️ Is This Legal? Ethical? Just Really Awkward?

We’re still in the gray zone here. If DeepSeek scraped responses from Gemini via API and used them to train their model, it’s arguably a violation of terms of service, but not necessarily illegal. If they obtained actual weights or proprietary training code, though… that’s a different story. That would look a lot more like intellectual property theft.

Google hasn’t commented yet. Neither has DeepSeek. But in the court of public opinion, the Reddit detectives and Hugging Face sleuths are already sharpening their pitchforks.

🧪 Why This Matters for the AI World

This isn’t just tech drama. It’s a case study in the wild west reality of generative AI. Everyone’s racing to build the best models, but the lines between inspiration, imitation, and IP theft are still pretty blurry.

And let’s be real—OpenAI, Google, Meta, and Anthropic all trained their models on huge chunks of the public internet, often without clear permission. So when someone else flips the script and mimics them, the hypocrisy gets… uncomfortable.

It also raises big questions for open-source AI. If one team can reverse-engineer a billion-dollar model by training on its outputs, do we still need to spend hundreds of millions to build these systems from scratch?

Or are we about to see a flood of Gemini clones with names like “Geminy” or “Twinsight”?

🤖 What Happens Next?

Best case scenario? This forces the industry to have some grown-up conversations about data usage, model training, and what “open-source” should actually mean in the AI era.

Worst case? We get lawsuits, API lockdowns, and even more walled gardens. Which sucks for researchers and small companies trying to innovate without Google-sized budgets.

But whatever happens, one thing’s clear: this isn’t the last time a model will mysteriously act like its fancier, richer cousin. And the AI drama? It’s just getting started.

