LDA-1B Explained: How “Garbage Data” Is Powering the Next Robot AI Breakthrough


The arms race around robot foundation models has just welcomed a new player. A joint team from Peking University, Tsinghua University, Galaxy General, and Zhiyuan Institute has introduced LDA-1B, pushing the parameter count straight to the billion scale.

Behind this number sits a more aggressive idea: stop focusing only on expert demonstration data. The “garbage data” that used to be tossed into the recycle bin may be exactly the nourishment robots need.

The traditional training path for robots is straightforward—find a skilled operator, record their actions, and let the robot learn by imitation. This behavior cloning approach has been widely used by OpenAI and Google DeepMind. But the problem is obvious: data utilization is painfully low.

A failed robot attempt? Discarded.
A casually recorded human operation video? Not high quality, discarded.
Data from a different robot platform? Incompatible format, still discarded.

The LDA-1B team asked a simple question: what happens if all that wasted data is actually used?


How LDA-1B Uses All Available Data

They assembled a dataset called EI-30k—30,000 hours of embodied interaction data, covering both human operations and robot trajectories.

This scale is already massive in robotics. For comparison, the previous largest dataset, Open X-Embodiment, had just over 1,000 hours.

But scale isn’t the key. The key is diversity.

The dataset includes:

  • Successful demonstrations and failed attempts
  • High-precision robot data and casually recorded human videos
  • Dual-arm manipulations and dexterous hand operations

By traditional standards, much of this data is inconsistent and would never make it into a training set.

LDA-1B takes a different approach: assign different roles to data of different quality.

High-quality expert demonstrations are used to learn the policy.
Lower-quality, “unqualified” data is used to learn the dynamics of the physical world.

A failed grasping video cannot be directly imitated—but it tells the model, “this way of grasping will fail.” That is dynamics knowledge.
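To make the split concrete, here is a minimal PyTorch-style sketch of how such a data-role division could be wired up. The tensor names and the plain MSE losses are assumptions for illustration, not the paper's actual objective: every sample contributes to the dynamics-prediction loss, but only expert demonstrations contribute to the action-imitation loss.

```python
import torch

def mixed_batch_loss(pred_next_latent, true_next_latent,
                     pred_action, true_action, is_expert):
    """Every sample teaches dynamics; only expert demos teach the policy.

    is_expert: (batch,) boolean tensor marking high-quality demonstrations.
    """
    # Dynamics / prediction loss: computed on ALL samples, including
    # failed attempts, human videos, and cross-platform trajectories.
    dynamics_loss = ((pred_next_latent - true_next_latent) ** 2).mean(dim=-1)

    # Policy (action imitation) loss: masked so that only expert
    # trajectories contribute a learning signal.
    policy_loss = ((pred_action - true_action) ** 2).mean(dim=-1)
    policy_loss = policy_loss * is_expert.float()

    # Clamp avoids division by zero in batches with no expert samples.
    n_expert = is_expert.float().sum().clamp(min=1.0)
    return dynamics_loss.mean() + policy_loss.sum() / n_expert
```

The design choice is simply that the mask sits on the policy term, never on the dynamics term, so a failed grasp still shapes the model's picture of the world.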

The idea sounds simple, but it raises a technical challenge:
how can LDA-1B learn both “what the next frame looks like” and “what action to take next” at the same time?


LDA-1B and Prediction in DINO Latent Space

The team’s solution is to move prediction tasks into the latent space of DINO.

DINO, a visual self-supervised model developed by Meta, compresses images into highly abstract feature representations. In this space, LDA-1B doesn’t need to care about surface details like “is the table wooden or white,” but instead focuses on core physical information like “where objects are” and “how they move.”

This design brings two advantages:

  • Much higher computational efficiency, avoiding pixel-level redundancy
  • Stronger generalization across environments, since LDA-1B learns abstract physical rules instead of scene-specific visual features
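As a rough illustration of what "predicting in DINO latent space" can look like in code, the backbone encodes frames into feature vectors and the prediction loss is measured in that feature space rather than on pixels. The exact DINO variant, the tiny predictor head, and the 7-dimensional action below are assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

# Public DINOv2 backbone used purely for illustration; the DINO variant
# inside LDA-1B is not specified in the article.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dino.eval()

# Hypothetical predictor: current latent + action -> next latent (384-d for ViT-S/14).
predictor = nn.Sequential(
    nn.Linear(384 + 7, 512), nn.GELU(), nn.Linear(512, 384)
)

def latent_prediction_loss(frame_t, frame_next, action_t):
    """Predict the next frame's DINO features instead of its pixels.

    frame_t, frame_next: (B, 3, 224, 224) image tensors; action_t: (B, 7).
    """
    with torch.no_grad():              # the visual backbone stays frozen
        z_t = dino(frame_t)            # (B, 384) current latent
        z_next = dino(frame_next)      # (B, 384) future latent, the target
    z_pred = predictor(torch.cat([z_t, action_t], dim=-1))
    return ((z_pred - z_next) ** 2).mean()   # loss in feature space, not pixel space
```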

Within a unified multi-modal diffusion Transformer framework, the model jointly denoises action chunks and future DINO sequences.
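A heavily simplified sketch of that joint denoising idea follows. The toy Transformer, dimensions, and token layout are assumptions made for illustration; the real model is a billion-parameter architecture. Noised action chunks and noised future DINO features enter one token sequence, and a single backbone predicts the noise for both modalities in one pass.

```python
import torch
import torch.nn as nn

class ToyJointDenoiser(nn.Module):
    """Toy stand-in for a multi-modal diffusion Transformer: one backbone
    denoises action-chunk tokens and future DINO-feature tokens together."""

    def __init__(self, action_dim=7, latent_dim=384, d_model=256):
        super().__init__()
        self.act_in = nn.Linear(action_dim, d_model)
        self.vis_in = nn.Linear(latent_dim, d_model)
        self.step_emb = nn.Linear(1, d_model)          # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.act_out = nn.Linear(d_model, action_dim)
        self.vis_out = nn.Linear(d_model, latent_dim)

    def forward(self, noisy_actions, noisy_future_latents, step):
        # noisy_actions: (B, Ta, action_dim); noisy_future_latents: (B, Tv, latent_dim)
        # step: (B,) diffusion timestep, scaled to [0, 1].
        a = self.act_in(noisy_actions)
        v = self.vis_in(noisy_future_latents)
        tokens = torch.cat([a, v], dim=1) + self.step_emb(step[:, None, None])
        h = self.backbone(tokens)
        n_act = noisy_actions.shape[1]
        # Predict the noise to remove from each modality in one pass.
        return self.act_out(h[:, :n_act]), self.vis_out(h[:, n_act:])
```

Training then follows the usual diffusion recipe: add scheduled Gaussian noise to both token groups and regress the network's outputs against that noise.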

Heterogeneous data plays unique and complementary roles across:

  • Visual prediction
  • Dynamics learning
  • Policy learning


Architecturally, LDA-1B uses a Multi-modal Diffusion Transformer. This allows it to handle asynchronous visual and action streams. In the real world, camera frame rates and robot control frequencies are often misaligned, which traditional models struggle to process.
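One common way to cope with this, sketched below under assumed frame rates, is to give every token its own timestamp and merge the streams into a single time-ordered sequence instead of forcing them onto a shared clock. This is a generic technique, not a confirmed detail of LDA-1B's internals:

```python
import numpy as np

def merge_async_streams(camera_ts, action_ts):
    """Merge a camera stream and an action stream, each at its native rate,
    into one time-sorted token list; every token keeps its modality tag
    and real timestamp for a continuous-time positional embedding."""
    tokens = [("image", float(t)) for t in camera_ts] + \
             [("action", float(t)) for t in action_ts]
    return sorted(tokens, key=lambda tok: tok[1])

# Example: one second of 30 Hz frames interleaved with 50 Hz control actions.
camera_ts = np.arange(0.0, 1.0, 1 / 30)
action_ts = np.arange(0.0, 1.0, 1 / 50)
stream = merge_async_streams(camera_ts, action_ts)   # 80 tokens, time-ordered
```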

The introduction of diffusion modeling also enables LDA-1B to train stably at the billion-parameter scale—something previously difficult for robot models.


LDA-1B Performance Across Three Task Categories

For evaluation, the team selected three representative scenarios:

Contact-Rich Tasks in LDA-1B

These test a robot’s perception and control of force—tasks like inserting a USB cable or tightening screws, where precise force feedback is essential.

LDA-1B outperforms the previous π0.5 model by 21%.

Dexterous Manipulation with LDA-1B

Even more challenging, these tasks require coordinated multi-finger control, such as rotating a Rubik’s Cube or using tools.

Here, LDA-1B shows an even larger advantage, improving performance by 48%.

Long-Horizon Planning in LDA-1B

These evaluate planning ability. The robot must complete a sequence of sub-tasks to achieve a final goal.

LDA-1B achieves a 23% improvement in this category.

More interesting are the fine-tuning experiments.

The team deliberately used “low-quality” data—failed cases and incomplete trajectories that would normally be discarded.

The result: using just 30% of this data improved performance by 10%.

This finding overturns a common industry belief:
so-called “garbage data” is not a burden—it can be a hidden asset for LDA-1B.


LDA-1B and a New Path to World Models

The technical approach behind LDA-1B responds to a bigger question: how should robot foundation models learn?

There are currently two dominant paradigms:

Behavior Cloning
Represented by OpenAI’s robotics work and Physical Intelligence’s π0 series. The idea is simple: watch experts and imitate them.

World Models
Represented by works like Genie and DIAMOND. The idea is to first understand how the physical world works, then decide how to act.

Behavior cloning suffers from low data efficiency—it only learns from successful cases.
World models have struggled with crude implementations—either focusing only on video prediction without actions, or relying on datasets too small for large-scale training.

LDA-1B takes a third path.

It unifies dynamics learning, policy learning, and visual prediction into a single framework, allowing data of different qualities to play different roles.

This idea—Unified World Model—has been proposed before in theory. But LDA-1B is the first to implement it at the billion-parameter scale with stable training.

From an engineering perspective, the biggest contribution of LDA-1B isn’t a single breakthrough. It’s proving one thing:

Robot foundation models can scale like language models—by “consuming” massive amounts of heterogeneous data.

Data that used to be wasted can be turned into knowledge inside LDA-1B.


Is LDA-1B the Cure for Data Hunger?

Robotics has long faced an awkward reality: data is expensive.

Recording one hour of high-quality robot demonstration data requires skilled operators, standardized environments, and precise sensors. The cost can reach thousands of dollars.

Even well-funded labs struggle to train models with data at the scale of language models.

LDA-1B offers a new direction.

Instead of spending heavily to collect perfect data, make use of imperfect data.

Human operation videos uploaded to YouTube, failed robot attempts during debugging, datasets collected across different labs and platforms—these previously ignored resources can now become training material for LDA-1B.

That said, some uncertainties remain.

The paper does not fully disclose the composition and sourcing of the EI-30k dataset, which creates a barrier for other teams attempting to reproduce LDA-1B at this scale.

There’s also the issue of deployment: a billion-parameter model comes with significant computational cost. Robots, unlike servers, cannot simply scale up compute.

Still, at this moment, LDA-1B sets a new reference point for robot foundation models:

  • Larger scale
  • More diverse data
  • More unified methods

Now the question is how the rest of the field will respond to LDA-1B.
