We see World Models as a grand challenge in robotics. They have the potential to solve general-purpose simulation and evaluation, enabling robots that are safe, reliable, and intelligent.
We’ve previously shared work on our robot world model, which imagines possible futures given action proposals. We also announced our first challenge: Compression, which focuses on minimizing training loss across a diverse robot dataset. The lower the loss, the better the model understands the training data. This challenge is still active, offering a $10k prize to the first submission that achieves a loss of 8.0 on our private test set. Our GitHub repo provides code and pretrained weights for Llama- and GENIE-based world models.
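The compression objective is, in essence, a next-token cross-entropy over tokenized robot data. As a purely illustrative sketch (array shapes and the function name are hypothetical; the official metric is defined in the challenge repo), the per-token loss can be computed like this:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean negative log-likelihood of the target tokens.

    logits:  (num_tokens, vocab_size) raw model outputs
    targets: (num_tokens,) integer token ids
    Shapes are illustrative; the challenge's exact metric lives
    in the official repo.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability the model assigned to each target token.
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(nll.mean())
```

With uniform (all-zero) logits over a vocabulary of size V, this returns log(V), the loss of a model that has learned nothing; lower values mean the model compresses the data better.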
Today, we are announcing the next phase of the World Model Challenge: Sampling.
Sampling focuses on generating realistic future outcomes in video sequences by predicting the next frame given a sequence of prior frames. The goal is to produce coherent, plausible continuations of the video that accurately reflect the dynamics of the scene. We encourage you to explore a variety of future-prediction methods beyond traditional next-token prediction: techniques such as Generative Adversarial Networks, Diffusion Models, and MaskGIT are all welcome for generating the next frame. To be competitive, submissions should achieve a PSNR of around 26.5 or above. We will open our evaluation server for submissions and release our metric in March 2025. The top entry will be announced in June 2025 and will receive a $10,000 prize.
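PSNR (peak signal-to-noise ratio) scores per-pixel fidelity between a predicted frame and the ground truth. Our exact metric will ship with the evaluation server; a minimal reference sketch, assuming standard 8-bit frames, looks like:

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two frames of equal shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_val**2 / mse))
```

Higher is better: a constant per-pixel error of 10 (out of 255) already scores about 28 dB, so clearing 26.5 dB requires predictions that track the scene closely.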
To help accelerate research in this direction, we’re releasing a new dataset of 100 hours of raw robot video, alongside the robot state sequences that enable world model training. Our raw videos will be shared under the CC-BY-NC-SA 4.0 license, and we will continue to share tokenized datasets under Apache 2.0.
We’re also thrilled to announce that we’re partnering with NVIDIA’s World Models team to further tokenize our video sequences with their newly announced Cosmos video tokenizer. NVIDIA’s work in visual tokenization and quantization creates highly compressed temporal representations of our robot data, well suited to world model research. The Cosmos-tokenized dataset can be found here.
On the horizon is our third challenge, Evaluation. This is our ultimate goal: can you predict how well a robot will perform before testing it in the real world? This challenge aims to assess the ability to evaluate and rank different robot policies using a world model, without the need for physical deployment.
The official details for the evaluation challenge have not been released yet—stay tuned for the announcement.
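While the official metric is unreleased, one natural way to score such a challenge (purely illustrative, not the announced protocol) is a rank correlation between world-model-predicted and real-world policy performance: a good evaluator orders policies the same way physical deployment would.

```python
import numpy as np

def spearman_rank_corr(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Spearman rank correlation between two score vectors (assumes no ties)."""
    def ranks(x: np.ndarray) -> np.ndarray:
        # Rank of each element: 0 for the smallest, n-1 for the largest.
        return np.argsort(np.argsort(x)).astype(np.float64)

    rp, ra = ranks(predicted), ranks(actual)
    rp -= rp.mean()
    ra -= ra.mean()
    return float((rp @ ra) / np.sqrt((rp @ rp) * (ra @ ra)))
```

A score of 1.0 means the world model ranks every policy exactly as real-world testing would; 0 means its rankings carry no information.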
Submit solutions to: challenge@1x.tech
GitHub - starter code, evals, baseline implementations
Discord - chat with our engineers
Posted in collaboration with NVIDIA. Read their update on Robot Learning