Robotics startup 1X Technologies has developed a new generative model that can make it much more efficient to train robotics systems in simulation. The model, which the company announced in a new blog post, addresses a key challenge in robotics: learning “world models” that can predict how the world changes in response to a robot’s actions.
Given the costs and risks of training robots directly in physical environments, roboticists usually train their control models in simulated environments before deploying them in the real world. However, differences between the simulation and the physical environment mean that behavior learned in simulation does not always carry over to the real robot.
“Roboticists typically hand-author scenes that are a ‘digital twin’ of the real world and use rigid body simulators like MuJoCo, Bullet, Isaac to simulate their dynamics,” Eric Jang, VP of AI at 1X Technologies, told VentureBeat. “However, the digital twin may have physics and geometric inaccuracies that lead to training on one environment and deploying on a different one, which causes the ‘sim2real gap.’ For example, the door model you download from the Internet is unlikely to have the same spring stiffness in the handle as the actual door you are testing the robot on.”
Generative world models
To bridge this gap, 1X’s new model learns to simulate the real world by being trained on raw sensor data collected directly from the robots. By viewing thousands of hours of video and actuator data collected from the company’s own robots, the model can look at the current observation of the world and predict what will happen if the robot takes certain actions.
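The idea of learning a simulator from logged robot data can be illustrated with a deliberately tiny sketch. This is not 1X's model, which is a generative video model trained on camera and actuator data. Here the "observation" is a single number (a position), the "action" is a velocity command, and the hypothetical helpers `collect_logs`, `fit_world_model`, `rollout` and `real_step` are assumptions made up for illustration. The model is fitted by least squares on logged (observation, action, next observation) triples and then used to predict what would happen under a chosen action sequence.

```python
import random


def collect_logs(step_fn, n, seed=0):
    """Log (observation, action, next_observation) triples from a system."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        obs = rng.uniform(-1.0, 1.0)
        act = rng.uniform(-1.0, 1.0)
        logs.append((obs, act, step_fn(obs, act)))
    return logs


def fit_world_model(logs):
    """Least-squares fit of next_obs ~ w_obs * obs + w_act * act.

    Solves the 2x2 normal equations with Cramer's rule.
    """
    soo = sum(o * o for o, a, n in logs)
    soa = sum(o * a for o, a, n in logs)
    saa = sum(a * a for o, a, n in logs)
    son = sum(o * n for o, a, n in logs)
    san = sum(a * n for o, a, n in logs)
    det = soo * saa - soa * soa
    w_obs = (son * saa - soa * san) / det
    w_act = (soo * san - soa * son) / det
    return w_obs, w_act


def rollout(model, obs, actions):
    """Predict a trajectory from a start observation and an action sequence."""
    w_obs, w_act = model
    trajectory = [obs]
    for act in actions:
        obs = w_obs * obs + w_act * act
        trajectory.append(obs)
    return trajectory


def real_step(obs, act):
    # Hypothetical stand-in for the real world: the position moves by
    # 0.1 times the commanded velocity. The learner never sees this rule.
    return obs + 0.1 * act


logs = collect_logs(real_step, 200)
model = fit_world_model(logs)
predicted = rollout(model, 0.0, [1.0, 1.0, -1.0])
print(model, predicted)
```

The learner only ever sees the logged triples, yet the fitted weights recover the hidden rule closely enough to answer "what happens if the robot takes these actions?" without touching the real system. A video world model has the same interface at a much larger scale: current frames and actions in, predicted future frames out.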
The data was collected from EVE humanoid robots doing diverse mobile manipulation tasks in homes and offices and interacting with people.
“We collected all of the data at our various 1X offices, and have a team of Android Operators who help with annotating and filtering the data,” Jang said. “By learning a simulator directly from the real data, the dynamics should more closely match the real world as the amount of interaction data increases.”
The learned world model is especially useful for simulating object interactions. The videos shared by the company show the model successfully predicting video sequences where the robot grasps boxes. The model can also predict “non-trivial object interactions like rigid bodies, effects of dropping objects, partial observability, deformable objects (curtains, laundry), and articulated objects (doors, drawers, curtains, chairs),” according to 1X.
Some of the videos show the model simulating complex long-horizon tasks with deformable objects such as folding shirts. The model also simulates how the scene changes as the robot moves through its environment, for example when it avoids obstacles or keeps a safe distance from people.
Challenges of generative models
Changes to the environment will remain a challenge. Like all simulators, the generative model will need to be updated as the environments where the robot operates change. The researchers believe that the way the model learns to simulate the world will make it easier to update it.
“The generative model itself might have a sim2real gap if its training data is stale,” Jang said. “But the idea is that because it is a completely learned simulator, feeding fresh data from the real world will fix the model without requiring hand-tuning a physics simulator.”
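A toy sketch of this point, using the door-handle example from earlier in the article. Everything here is an assumption made for illustration: the handle is modeled as an ideal spring (displacement = force / stiffness), the stiffness values are invented, and `collect`, `fit_compliance` and `mean_abs_error` are hypothetical helpers. It is not 1X's training procedure. The learned model goes stale when the door changes and is repaired simply by refitting on fresh data, with no hand-tuned physics parameter.

```python
import random


def collect(stiffness, n, seed):
    """Log (force, displacement) pairs from a handle with the given stiffness."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        force = rng.uniform(0.5, 2.0)
        data.append((force, force / stiffness))
    return data


def fit_compliance(data):
    """Least-squares fit of displacement ~ c * force (one learned parameter)."""
    return sum(f * d for f, d in data) / sum(f * f for f, d in data)


def mean_abs_error(c, data):
    """Average prediction error of the learned model on logged data."""
    return sum(abs(c * f - d) for f, d in data) / len(data)


# Old logs come from a soft handle; the door the robot now faces is stiffer.
old_logs = collect(stiffness=2.0, n=100, seed=0)
fresh_logs = collect(stiffness=5.0, n=100, seed=1)

stale_model = fit_compliance(old_logs)
gap_before = mean_abs_error(stale_model, fresh_logs)

# "Feeding fresh data" here just means refitting on the new logs.
refreshed_model = fit_compliance(fresh_logs)
gap_after = mean_abs_error(refreshed_model, fresh_logs)

print(gap_before, gap_after)
```

The stale model's error on the new door is large, and it drops to roughly zero after the refit. In a hand-authored simulator the equivalent fix would be someone measuring the new handle and editing the stiffness value by hand.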
1X’s new system is inspired by innovations such as OpenAI Sora and Runway, which have shown that with the right training data and techniques, generative models can learn some kind of world model and remain consistent through time.
However, while those models are designed to generate videos from text, 1X’s new model is part of a trend of generative systems that can react to actions during the generation phase. For example, researchers at Google recently used a similar technique to train a generative model that could simulate the game DOOM. Interactive generative models can open up numerous possibilities for training robotics control models and reinforcement learning systems.
However, some of the challenges inherent to generative models are still evident in the system presented by 1X. Since the model is not powered by an explicitly defined world simulator, it can sometimes generate unrealistic situations. In the examples shared by 1X, the model sometimes fails to predict that an object will fall if it is left hanging in the air. In other cases, an object might disappear from one frame to the next. Addressing these failures will still take considerable work.
One solution is to continue gathering more data and training better models. “We’ve seen dramatic progress in generative video modeling over the last couple of years, and results like OpenAI Sora suggest that scaling data and compute can go quite far,” Jang said.
At the same time, 1X is encouraging the community to get involved in the effort by releasing its models and weights. The company will also launch competitions to improve the models, with monetary prizes for the winners.
“We’re actively investigating multiple methods for world modeling and video generation,” Jang said.