The Tech newsletter for Engineers who want to stay ahead

Tech moves fast. Still playing catch-up?

That's exactly why 100K+ engineers working at Google, Meta, and Apple read The Code twice a week.

Here's what you get:

  • Curated tech news that shapes your career - Filtered from thousands of sources so you know what's coming 6 months early.

  • Practical resources you can use immediately - Real tutorials and tools that solve actual engineering problems.

  • Research papers and insights decoded - We break down complex tech so you understand what matters.

All delivered twice a week in just 2 short emails.

New Research

DINO-world: Predicting the Future Without Generating Pixels

Meta FAIR built a video world model that works entirely in latent space — and it beats pixel-based approaches on physics understanding.

60M+ training videos · +6.3 mIoU improvement · +12% planning boost

The Big Idea

Video prediction models typically generate pixels — every frame, every detail. It's computationally expensive and often fails to capture real physics.

DINO-world takes a different approach: predict future frames in DINOv2's latent space. You get semantic understanding without pixel reconstruction. The model learns what matters — object permanence, gravity, causality — not what things look like.
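To make that concrete, here is a minimal training-step sketch of latent-space prediction: frames are embedded by a frozen encoder, a small transformer regresses the next frame's embedding, and the loss is computed on features rather than pixels. The predictor size, loss choice, and tensor shapes are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

B, T, N, D = 2, 8, 256, 768        # batch, frames, patch tokens per frame, feature dim

class LatentPredictor(nn.Module):
    """Toy temporal model that predicts the next frame's embedding from past ones.
    (A real model would use a causal attention mask; omitted here for brevity.)"""
    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, latents):                        # (B, T*N, D)
        return self.head(self.backbone(latents))

# Stand-in for per-frame DINOv2 patch embeddings from a frozen encoder
# (no gradients ever flow back into the encoder).
latents = torch.randn(B, T, N, D)

predictor = LatentPredictor(D)
inp = latents[:, :-1].reshape(B, (T - 1) * N, D)       # frames 0..T-2
target = latents[:, 1:].reshape(B, (T - 1) * N, D)     # frames 1..T-1
loss = nn.functional.mse_loss(predictor(inp), target)  # loss lives in latent space
loss.backward()                                        # only predictor weights get gradients
```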

Architecture Pipeline

🎬 Video → 🦕 DINOv2 → 🔮 Predictor → 🎯 Future

DINOv2 encoder stays frozen. All model capacity goes into learning temporal dynamics.
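"Frozen" here just means the encoder's weights never receive gradients and only the predictor is optimized. A minimal PyTorch sketch with stand-in modules:

```python
import torch
import torch.nn as nn

# Stand-ins: a DINOv2-style patch encoder and the temporal predictor.
# Names and sizes are illustrative, not the paper's.
encoder = nn.Sequential(nn.Conv2d(3, 768, kernel_size=14, stride=14), nn.Flatten(2))
predictor = nn.Linear(768, 768)

# Freeze the encoder: no gradients, no weight updates, fixed eval-mode behavior.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# Only the predictor's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(predictor.parameters(), lr=3e-4)
print(sum(p.requires_grad for p in encoder.parameters()))   # 0 trainable encoder params
```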

The Problem with Pixel Prediction

Models like Sora and COSMOS generate impressive videos, but they're computationally brutal. Worse, they often struggle with basic physics — objects pass through walls, gravity behaves strangely, cause and effect break down.

❌ PIXEL-SPACE MODELS

• Computationally expensive

• Often fail on physics

• Generate unnecessary detail

• Hard to use for planning

✅ DINO-WORLD (LATENT)

• Fast and efficient

• Strong physics understanding

• Semantic-level prediction

• Direct planning support

Why DINOv2 Features Work

DINOv2 is a self-supervised vision model from Meta that learns rich visual features without labels. These features turn out to be perfect for world modeling:

• Spatial Structure: patch-level features encode where objects are in the scene with high precision.

• Object-Centric: self-supervised training learns to represent what objects are and their properties.

• Cross-Domain Transfer: the same features work across driving scenes, indoor environments, and robotics tasks.
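The first of these points is easy to see directly: DINOv2 exposes a grid of per-patch features for every frame. A small sketch, assuming the public facebookresearch/dinov2 torch.hub entry point and its forward_features output key (verify both against the version you install):

```python
import torch

# Load a DINOv2 backbone from torch hub and extract patch-level features.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
encoder.eval()

frame = torch.randn(1, 3, 224, 224)              # one RGB frame (normalized in practice)
with torch.no_grad():
    feats = encoder.forward_features(frame)

patch_tokens = feats["x_norm_patchtokens"]       # (1, 256, 768): a 16x16 grid of patch features
# Each token describes one 14x14-pixel patch, which is what gives the world
# model its spatial structure: "what is where" in the frame.
print(patch_tokens.shape)
```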

Benchmark Results

• Segmentation Forecasting (VSPW mid-term prediction, ~0.5s ahead): +6.3 mIoU

• Intuitive Physics (IntPhys, GRASP, InfLevel benchmarks): SOTA

• Robot Planning (PushT, PointMaze, Wall environments): +10-12%

From World Model to Robot Planning

DINO-world isn't just for prediction — it's designed for control. The team adds lightweight "action blocks" after each transformer layer. These blocks are zero-initialized to preserve pre-trained knowledge, then fine-tuned on small trajectory datasets.
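A sketch of what such a zero-initialized action block could look like: the final projection starts at zero, so at initialization the block is a no-op and the pre-trained predictor's behavior is exactly preserved. The layer layout and sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ActionBlock(nn.Module):
    """Injects an action embedding into the token stream after a transformer layer.

    The output projection is zero-initialized, so at the start of fine-tuning the
    block contributes nothing and the pre-trained dynamics are left intact.
    """
    def __init__(self, dim=768, action_dim=7):
        super().__init__()
        self.embed = nn.Linear(action_dim, dim)
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, tokens, action):            # tokens: (B, N, D), action: (B, action_dim)
        a = self.embed(action).unsqueeze(1)       # (B, 1, D), broadcast over all tokens
        return tokens + self.out(torch.tanh(tokens + a))

block = ActionBlock()
tokens = torch.randn(2, 256, 768)
action = torch.randn(2, 7)
assert torch.allclose(block(tokens, action), tokens)   # identity at initialization
```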

🤖 PLANNING OBJECTIVE

minimize || predicted_state - goal_state ||²

Optimize action sequences to reach desired goals entirely in latent space
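In code, that objective can be as simple as scoring candidate action sequences by how close the predicted latent lands to the goal latent. The random-shooting planner and toy world model below are stand-ins for illustration; the paper's planner may use a different optimizer.

```python
import torch

def plan(world_model, state, goal, horizon=10, samples=512, action_dim=2):
    """Pick the action sequence whose predicted final latent is closest to the goal.

    world_model(state, actions) -> predicted latent after applying `actions`.
    Everything stays in latent space; no pixels are ever generated.
    """
    actions = torch.randn(samples, horizon, action_dim)          # candidate sequences
    with torch.no_grad():
        pred = world_model(state.expand(samples, -1), actions)   # (samples, latent_dim)
        cost = ((pred - goal) ** 2).sum(dim=-1)                  # || predicted - goal ||^2
    return actions[cost.argmin()]                                # lowest-cost sequence

# Toy stand-in world model: accumulates actions into the state (illustration only).
latent_dim = 4
toy_model = lambda s, a: s + torch.nn.functional.pad(a.sum(dim=1), (0, latent_dim - 2))
best = plan(toy_model, torch.zeros(1, latent_dim), torch.tensor([1.0, 1.0, 0.0, 0.0]))
print(best.shape)                                                # (10, 2)
```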

Key Takeaways

1. Frozen encoder + learned dynamics beats joint training for generalist world models.

2. Latent-space prediction understands physics better than pixel-space generation.

3. Web-scale pretraining transfers to robotics with minimal fine-tuning.

4. Modular action blocks preserve pre-trained knowledge while enabling control.

Want to dive deeper?

Read the Full Paper →

Project Page · GitHub

Found this breakdown useful? Share it with someone building physical AI.
