Cooking With NEO Beta and Nick DiGiovanni
A behind the scenes look into the production of Nick DiGiovanni and NEO Beta's cooking challenge including BTS content and technical details.
Our goal with this video was to create a fun and exciting glimpse into a not-so-distant future where humanoids cook delicious meals for us.
After connecting with Nick and his team, we brainstormed ideas for an engaging narrative. When Nick suggested a robot cook-off with NEO Beta as the “final boss,” we quickly set to work refining our cooking skills.
While cooking a steak dinner from start to finish was an impressive demonstration of NEO Beta's ability to navigate complex tasks, it is important to note that this showcase was done using teleoperation to ensure the successful execution of all tasks involved. That said, the only thing standing between NEO and a fully autonomous, medium-rare steak is the input data.
In the video, NEO Beta and Nick met up in Sunnyvale, California, near the 1X headquarters, for a home cooking showdown to see who could make the perfect medium-rare steak. The audience would ultimately decide the winner.
Nick generously brought NEO Beta a custom “Chef NEO” coat and stocked the fridges with more Wagyu beef than any humanoid could ever need.
Fun Fact: We (wrongly) assumed NEO Beta might mess up at least one steak since this was an entirely new skill. To be safe, we had multiple backup steaks on hand. Proving us wrong, NEO Beta nailed it on the first try, and the 1X team and Nick got to enjoy the extra steaks afterward.
The shoot was full of laughs, imperfections like knocking over the olive oil, and memorable moments like when NEO Beta pulled off butter basting—reminding everyone on set they were witnessing a landmark moment in the history of home robotics.
Another Fun Fact: Nick admitted that NEO Beta was a better chef than Gordon Ramsay and Uncle Roger. He asked us to keep it a secret, but we couldn’t resist sharing.
While videos like this are extremely fun and important for showing the masses that humanoid robots are here and the fun is just beginning, we find it equally important to be transparent about how we capture these shots to maintain trust in the robotics community and set the right expectations for our product.
NEO Beta Movements
All movements were powered by 1X’s VR Teleoperation App running on Meta Quest.
Dialogue
NEO Beta's lines were filmed using voice pass-through with a filter. Although NEO is equipped with real-time conversation capabilities via our GPT-4o voice integration, we opted for controlled dialogue to align with Nick’s vision for the video.
NEO Beta’s voice was designed in collaboration with Nick’s team to find a tone that was playful and appealing to both his audience and our vision of NEO as a friendly, home-assisting humanoid.
Disclaimers
Cooking Disclaimer: Although NEO Beta successfully cooked a meal alongside Nick, cooking won’t be an immediate feature available to the first NEO users. We want to ensure NEO gains experience with safer tasks before handling sharp or hot objects.
Authenticity Disclaimer: NEO Beta completed every cooking task end to end, from seasoning the steak to flipping it and removing it from the pan, but it did require assistance to turn on the burner. Nick and his team felt that including this scene was important for their audience, so we agreed to create this article for transparency.
We’re incredibly excited that NEO Beta was able to cook a full steak dinner with minimal assistance. Sooner than you think, NEO will be cooking similar meals—and more—in your kitchen.
Huge thanks to Nick, Tim and Zach for putting this all together.
We hope you enjoyed the video. Share it with friends and family and let us know what you all think on X or shoot us an email!
We see World Models as a grand challenge in robotics. They have the potential to solve general purpose simulation and evaluation, enabling robots that are safe, reliable, and intelligent.
We’ve previously shared work on our robot world model, which imagines possible futures given action proposals. We also announced our first challenge: compression, which focuses on minimizing training loss across a diverse robot dataset. The lower the loss, the better the model understands the training data. This challenge is still active, offering a $10k prize to the first submission that achieves a loss of 8.0 on our private test set. Our GitHub repo provides code and pretrained weights for Llama and GENIE-based world models.
Today, we are announcing the next phase of the World Model Challenge: Sampling.
Sampling focuses on generating realistic future outcomes in video sequences by predicting the next frame given a sequence of prior frames. The goal is to produce coherent and plausible continuations of the video, accurately reflecting the dynamics of the scene. We encourage you to explore a variety of future prediction methods beyond traditional next-logit prediction. Techniques such as Generative Adversarial Networks, Diffusion Models, and MaskGIT are all welcome for generating the next frame. To be competitive, submissions should achieve a PSNR of around 26.5 or above. We will open our evaluation server for submissions and release our metric in March 2025.
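For readers unfamiliar with PSNR, here is a minimal sketch of the standard definition, 10 · log10(MAX² / MSE), applied to one predicted frame and one ground-truth frame. It assumes 8-bit pixel values (a maximum of 255) stored as flat sequences; the function name `psnr` is our own illustration. The official challenge metric had not been released at the time of writing and may differ in its details.

```python
import math

def psnr(predicted, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    Assumes both sequences have the same length and that pixel values
    lie in [0, max_value] (255 for 8-bit images, an assumption here).
    """
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel off by 10 intensity levels gives roughly 28.13 dB.
print(round(psnr([10, 10, 10, 10], [0, 0, 0, 0]), 2))
```

Higher is better: a smaller mean squared error between the predicted and true frame yields a larger PSNR.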
To help accelerate research in this direction, we’re releasing a new dataset of 100 hours of raw robot video alongside our robot state sequences which enable world model training. Our raw videos will be shared under the CC-BY-NC-SA 4.0 license, and we will continue to share tokenized datasets under Apache 2.0.
We’re also thrilled to announce that we’re partnering with NVIDIA’s World Models team to further tokenize our video sequences with their newly-announced Cosmos video tokenizer. NVIDIA’s work in visual tokenization and quantization creates highly compressed, temporal representations of our robot data, optimized for such research. The Cosmos-tokenized dataset can be found here.
On the horizon is our third challenge, Evaluation. This is our ultimate goal: can you predict how well a robot will perform before testing it in the real world? This challenge aims to assess the ability to evaluate and rank different robot policies using a world model, without the need for physical deployment.
The official details for the evaluation challenge have not been released yet—stay tuned for the announcement.
Submit solutions to: challenge@1x.tech
GitHub - starter code, evals, baseline implementations
Discord - chat with our engineers
Posted in collaboration with NVIDIA. Read their update on Robot Learning
In machine learning, a world model is a computer program that can imagine how the world evolves in response to an agent’s behavior. Building on advancements in video generation and world models for autonomous vehicles, we have trained a world model that serves as a virtual simulator for our robots.
From the same starting image sequence, our world model can imagine multiple futures from different robot action proposals.
It can also predict non-trivial object interactions like rigid bodies, effects of dropping objects, partial observability, deformable objects (curtains, laundry), and articulated objects (doors, drawers, curtains, chairs).
In this post we’ll share why world models for robots are important, the capabilities and limitations of our current models, and a new dataset and public competition to encourage more research in this direction.
World models solve a very practical and yet often overlooked challenge when building general-purpose robots: evaluation. If you train a robot to perform 1000 unique tasks, it is very hard to know whether a new model has made the robot better at all 1000 tasks, compared to a prior model. Even the same model weights can experience a rapid degradation in performance in a matter of days due to subtle changes in the environment background or ambient lighting.
If the environment keeps changing over time, then old experiments performed in that environment are no longer reproducible because the old environment no longer exists! This problem gets worse if you are evaluating multi-task systems in a constantly-changing setting like the home or the office. This makes careful robotic science in the real world frustratingly hard.
Careful measurement of capabilities allows one to predict how capabilities will scale when one increases data, compute, and model size – these “scaling laws” defend the enormous investment that goes into general-purpose AI systems like ChatGPT. If robotics is to have its “ChatGPT moment”, we must first establish its “Scaling Laws”.
Physics-based simulators (Bullet, MuJoCo, Isaac Sim, Drake) are a reasonable way to quickly test robot policies. They are resettable and reproducible, allowing researchers to carefully compare different control algorithms. However, these simulators are mostly designed for rigid body dynamics and require a lot of manual asset authoring. How do you simulate robot hands opening a cardboard box of coffee filters, cutting fruit with a knife, unscrewing a frozen jar of preserves, or interacting with other intelligent agents like humans? Everyday objects and animals encountered in home environments are notoriously difficult to simulate, so simulation environments used in robotics tend to be visually sterile and lack the diversity of the real-world use case. Small-scale evaluation on a limited number of tasks in real or sim is not predictive of large-scale evaluation in the real world.
We’re taking a radically new approach to evaluation of general-purpose robots: learning a simulator directly from raw sensor data and using it to evaluate our policies across millions of scenarios. By learning a simulator directly from real data, you can absorb the full complexity of the real world without manual asset creation.
Over the last year, we’ve gathered thousands of hours of data on EVE humanoids doing diverse mobile manipulation tasks in homes and offices and interacting with people. We combined the video and action data to train a world model that can anticipate future video from observations and actions.
Our world model is capable of generating diverse outcomes based on different action commands. Below we show various generations conditioning the world model on four different trajectories, each of which starts from the same initial frames. As before, the examples shown are not included during training.
The main value of the world model comes from simulating object interactions. In the following generations, we provide the model the same initial frames and three different sets of actions to grasp boxes. In each scenario, the box(es) grasped are lifted and moved in accordance with the motion of the gripper, while the other boxes remain undisturbed.
Even when actions are not provided, the world model generates plausible video, such as learning that people and obstacles should be avoided when driving:
We can also generate long-horizon videos. The example below simulates a complete t-shirt folding demonstration. T-shirts and deformable objects tend to be difficult to implement in rigid body simulators.
Our model can fail to maintain the shape and color of objects during interaction, and at times, objects may completely disappear. Additionally, when objects are occluded or displayed at unfavorable angles, their appearance can become distorted throughout the generation.
The generation on the left demonstrates that our model has an emergent understanding of physical properties, as evidenced by the spoon falling to the table when released by the gripper. However, there are many instances where generations fail to adhere to physical laws, such as on the right where the plate remains suspended in the air.
We placed EVE in front of a mirror to see if generations would result in mirrored actions, but we did not see successful recognition or “self-understanding”.
As shown by the examples above, there is still much work to be done. World models have the potential to solve general purpose simulation and evaluation, enabling robots that are safe, reliable, and intelligent in a wide variety of scenarios. As such, we see this effort as a grand challenge in robotics that the community can work on solving together. To help accelerate progress towards solving world models for robotics, we are releasing over 100 hours of vector-quantized video (Apache 2.0), pretrained baseline models, and the 1X World Model Challenge, a three-stage challenge with cash prizes.
The first challenge, compression, is about how well one can minimize training loss on an extremely diverse robot dataset. The lower the loss, the better the model understands the training data. Even though there are many different ways to implement a world model, optimizing loss well is a general objective that underpins nearly all large-scale deep learning tasks. A $10k prize is awarded to the first submission that achieves a loss of 8.0 on our private test set. The GitHub repo provides code and pretrained weights for Llama and GENIE-based world models.
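To make "loss" concrete, here is a minimal sketch of one common choice for token-based models: the mean cross-entropy between predicted logits and the true next tokens. This is an assumption for illustration only. The exact loss definition used for the challenge leaderboard is specified in the challenge materials, not here, and the function names below are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood (in nats) of `target` under softmax(logits)."""
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mean_loss(batch_logits, targets):
    """Average cross-entropy over a batch of token predictions."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# A model that spreads probability uniformly over 4 tokens scores ln(4).
print(mean_loss([[0.0, 0.0, 0.0, 0.0]], [2]))
```

A model that assigns higher probability to the tokens that actually occur gets a lower value, which is the sense in which lower loss means the model explains the data better.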
The second challenge, sampling, is about how well and how quickly a model can generate videos of the future. Details of the Sampling Challenge will be announced soon, based on lessons learned from running the Stage 1 Challenge.
The third challenge, evaluation, is our holy grail: can you predict how well a robot performs before you test it in the real world? Details of the Evaluation Challenge will be announced after we’ve learned lessons from Stage 1 and Stage 2 Challenges.
Submit solutions to: challenge@1x.tech
If you’re excited about these directions, we have open roles on the 1X AI team. Internally, we have a large dataset of high resolution robot data across even more diverse scenarios. Our ambitions for world models go beyond just solving the general evaluation problem; once you can step an agent in this world model and perform evaluation, you can follow on with policy enhancement and policy training in a completely learned simulation.
GitHub - starter code, evals, baseline implementations
Discord - chat with our engineers
We have previously developed an autonomous model that can merge many tasks into a single goal-conditioned neural network. However, when multi-task models are small (<100M parameters), adding data to fix one task’s behavior often adversely affects behaviors on other tasks. Increasing the model parameter count can mitigate this forgetting problem, but larger models also take longer to train, which slows down our ability to find out what demonstrations we should gather to improve robot behavior.
How do we iterate quickly on the data while building a generalist robot that can do many tasks with a single neural network? We want to decouple our ability to quickly improve task performance from our ability to merge multiple capabilities into a single neural network. To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors shown in this video:
Although humans can do long horizon chores trivially, chaining multiple autonomous robot skills in a sequence is hard because the second skill has to generalize to all the slightly random starting positions that the robot finds itself in when the first skill finishes. This compounds with every successive skill - the third skill has to handle the variation in outcomes of the second skill, and so forth.
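A back-of-the-envelope sketch of this compounding: if we assume each skill succeeds independently with some probability given the state left by the previous skill, the chance of completing the whole chain is the product of those probabilities. The independence assumption and the 90% figure are illustrative, not measurements from our robots.

```python
import math

def chain_success(per_skill_success):
    """Probability that every skill in a chain succeeds.

    Assumes each entry is the success rate of one skill given the state
    left by the previous one, and that failures are independent.
    """
    return math.prod(per_skill_success)

# Five chained skills at a hypothetical 90% each complete about 59% of the time.
print(round(chain_success([0.9] * 5), 5))
```

Even individually reliable skills produce an unreliable chain as it grows longer, which is why robustness to the previous skill's end state matters so much.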
From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time. Single-task models also provide a good baseline to do shadow mode evaluations: comparing how a new model’s predictions differ from an existing baseline at test-time. Once the goal-conditioned model matches single-task model predictions well, we can switch over to a more powerful, unified model with no change to the user workflow.
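As a rough sketch of what a shadow mode comparison could look like, the snippet below runs a baseline policy and a candidate policy on the same observations and reports how far their predicted actions diverge. The function name, the per-observation maximum-difference metric, and the tolerance value are all hypothetical choices for illustration, not a description of our internal tooling.

```python
def shadow_eval(baseline, candidate, observations, tolerance=0.05):
    """Compare a candidate policy's actions against a baseline's.

    `baseline` and `candidate` are callables mapping an observation to a
    list of floats (an action vector). Only the baseline would actually
    control the robot; the candidate's outputs are logged and compared.
    """
    diffs = []
    for obs in observations:
        a, b = baseline(obs), candidate(obs)
        # Largest per-dimension disagreement for this observation.
        diffs.append(max(abs(x - y) for x, y in zip(a, b)))
    return {
        "mean_max_diff": sum(diffs) / len(diffs),
        "agreement": sum(d <= tolerance for d in diffs) / len(diffs),
    }

# Toy policies standing in for a single-task and a goal-conditioned model.
single_task = lambda obs: [obs, 2 * obs]
goal_conditioned = lambda obs: [obs + 0.01, 2 * obs]
print(shadow_eval(single_task, goal_conditioned, [0.0, 1.0]))
```

Once the agreement rate is high enough on held-out observations, switching to the unified model carries less risk of a visible change in behavior.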
Directing robots with this high-level language interface offers a new user experience for data collection. Instead of using VR to control a single robot, an operator can direct multiple robots with high level language and let the low-level policies execute low-level actions to realize those high-level goals. Because high-level actions are sent infrequently, operators can even control robots remotely, as shown below:
Note that the above video is not completely autonomous; humans are dictating when robots should switch tasks. Naturally, the next step after building a dataset of vision-to-natural language command pairs is to automate the prediction of high level actions using vision-language models like GPT-4o, VILA, and Gemini Vision.
Stay tuned!
Eric Jang
In the latest episode of the Venture Europe Podcast, Bernt Børnich, CEO of 1X, sits down with host Calin Fabri to explore the evolving world of humanoid robotics.
Bernt shares his journey from a curious child dismantling kitchen gadgets to founding and leading 1X. He gives insight into the development of NEO, 1X’s next-generation android designed to assist with everyday tasks at home. He discusses the importance of designing safe, compliant humanoids capable of working alongside people in their daily environments.
Bernt also discusses 1X's strategic expansion, with AI development centered in the San Francisco Bay Area and a new manufacturing facility built in Norway.
Throughout the episode, he explores the technical and ethical challenges of integrating androids into society, aiming to create an abundant supply of labor.
Listen on Apple Podcast
Listen on Google Podcast
Listen on Amazon Music
MOSS, NORWAY: 1X is currently developing its own production facility, actuator manufacturing, and robot assembly facility in Moss, Norway, right next to our campus and engineering team. This decision is more than just a matter of convenience: it's a commitment to keep building a vertically integrated company where every component of EVE and NEO is designed and produced in-house.
“The close proximity of both the actuator manufacturing, robot assembly, and testing site offers great advantages, especially for our team of creative engineers, brimming with fresh, yet untested ideas. Being adjacent to the manufacturing and assembly process allows them to quickly understand the practical aspects of transforming their creative concepts into feasible, efficient-to-manufacture products,” says VP of Manufacturing Operations & Engineering, Csaba Hartmann.
The manufacturing team consists of diverse professionals, including specialized manufacturing engineers and mechanical designers, process engineers, automation experts, quality engineers, supply chain experts, safety officers, and others. Each member plays a role in designing, trialing, and rolling out our large-scale manufacturing initiatives, contributing to enhancing scalability, rapid iterations, and safety at every stage of the manufacturing and assembly process.
“Enabling teams that work side by side with each other and thus can easily get and act on feedback, is crucial for us to evolve and improve our products rapidly”, says Hartmann.
All 1X androids are designed with a safety-first mindset, featuring gearless motors and a soft exterior. Our commitment to safety extends beyond design, incorporating measures throughout the assembly process to ensure products are built to specs: thorough testing, quality control, and precise assembly processes.
We’re adopting quality control measures inspired by the automotive industry. We conduct thorough Design Failure Mode and Effects Analysis (DFMEA) on each assembly component to proactively identify and mitigate potential safety risks.
“Our quality team interprets the results of the DFMEA and PFMEA and then defines the rigorous checks for the assembly process to ensure no safety aspect is overlooked,” says Hartmann.
The assembly process includes rigorous checks of critical quality parameters to ensure no safety aspect is overlooked. Precision in the use of testing and assembly tools is emphasized to maintain high standards of accuracy. All components, especially motors, undergo extensive testing at multiple stages of assembly to validate their performance and reliability.
"At 1X, we prioritize scalable, cost-efficient manufacturing by integrating engineering expertise and rigorous quality control. Our approach leverages advanced technologies and carefully selected materials to enhance production efficiency. Committed to scalability, we ensure every process is optimized for cost-effectiveness and growth", says 1X CEO Bernt Børnich.
If you find this work interesting, we’d like to call attention to a few roles that we are hiring for to accelerate our mission toward creating an abundant supply of labor via safe intelligent androids:
We also have other open roles across mechanical, electrical, and software disciplines. Follow 1x_tech on X for more updates, and join us in living in the future.
1X will be attending the NVIDIA GTC Conference on March 18th. Our involvement signifies 1X's dedication to advancing in the field of Embodied AI, showcasing our latest developments, and engaging with the global AI community.
The NVIDIA GTC Conference is renowned for being a pivotal event that gathers innovators, researchers, and industry leaders worldwide to explore the latest advancements in AI, machine learning, and related technologies. Attendees can look forward to a program full of insightful talks, dynamic workshops, and demonstrations.
For more information about the conference or to register:
NVIDIA GTC Conference Official Page
Conference Program
We look forward to connecting with professionals to share our passion for AI and robotics at the event. See you at NVIDIA GTC.