From wiping up spills to serving up food, robots are being taught to carry out increasingly complicated household tasks. Many such home-bot trainees are learning through imitation; they are programmed to copy the motions that a human physically guides them through.

It turns out that robots are excellent mimics. But unless engineers also program them to adjust to every possible bump and nudge, robots don’t necessarily know how to handle these situations, short of starting their task from the top.

Now MIT engineers are aiming to give robots a bit of common sense when faced with situations that push them off their trained path. They’ve developed a method that connects robot motion data with the “common sense knowledge” of large language models, or LLMs.

Their approach enables a robot to logically parse any given household task into subtasks, and to physically adjust to disruptions within a subtask, so that the robot can move on without having to go back and start the task from scratch, and without engineers having to explicitly program fixes for every possible failure along the way.

A robotic hand tries to scoop up red marbles and put them into another bowl while a researcher’s hand frequently disrupts it. The robot eventually succeeds.
“Imitation learning is a mainstream approach enabling household robots. But if a robot is blindly mimicking a human’s motion trajectories, tiny errors can accumulate and eventually derail the rest of the execution,” says Yanwei Wang, a graduate student in MIT’s Department of Electrical Engineering and Computer Science (EECS). “With our method, a robot can self-correct execution errors and improve overall task success.”

Wang and his colleagues detail their new approach in a study they will present at the International Conference on Learning Representations (ICLR) in May. The study’s co-authors include EECS graduate students Tsun-Hsuan Wang and Jiayuan Mao, Michael Hagenow, a postdoc in MIT’s Department of Aeronautics and Astronautics (AeroAstro), and Julie Shah, the H.N. Slater Professor in Aeronautics and Astronautics at MIT.

Language task

The researchers illustrate their new approach with a simple chore: scooping marbles from one bowl and pouring them into another. To accomplish this task, engineers would typically move a robot through the motions of scooping and pouring — all in one fluid trajectory. They might do this multiple times, to give the robot a number of human demonstrations to mimic.

“But the human demonstration is one long, continuous trajectory,” Wang says.

The team realized that, while a human might demonstrate a single task in one go, that task depends on a sequence of subtasks, or trajectories. For instance, the robot has to first reach into a bowl before it can scoop, and it must scoop up marbles before moving to the empty bowl, and so forth. If a robot is pushed or nudged into making a mistake during any of these subtasks, its only recourse is to stop and start from the beginning. The alternative is for engineers to explicitly label each subtask and then program, or collect new demonstrations of, a recovery from each failure, so that the robot can self-correct in the moment.

“That level of planning is very tedious,” Wang says.

Instead, he and his colleagues found some of this work could be done automatically by LLMs. These deep learning models process immense libraries of text, which they use to establish connections between words, sentences, and paragraphs. Through these connections, an LLM can then generate new sentences based on what it has learned about the kind of word that is likely to follow the last.

For their part, the researchers found that in addition to sentences and paragraphs, an LLM can be prompted to produce a logical list of subtasks that would be involved in a given task. For instance, if queried to list the actions involved in scooping marbles from one bowl into another, an LLM might produce a sequence of verbs such as “reach,” “scoop,” “transport,” and “pour.”

“LLMs have a way to tell you how to do each step of a task, in natural language. A human’s continuous demonstration is the embodiment of those steps, in physical space,” Wang says. “And we wanted to connect the two, so that a robot would automatically know what stage it is in a task, and be able to replan and recover on its own.”

Mapping marbles

For their new approach, the team developed an algorithm to automatically connect an LLM’s natural language label for a particular subtask with a robot’s position in physical space or an image that encodes the robot state. Mapping a robot’s physical coordinates, or an image of the robot state, to a natural language label is known as “grounding.” The team’s new algorithm is designed to learn a grounding “classifier,” meaning that it learns to automatically identify what semantic subtask a robot is in — for example, “reach” versus “scoop” — given its physical coordinates or an image view.

“The grounding classifier facilitates this dialogue between what the robot is doing in the physical space and what the LLM knows about the subtasks, and the constraints you have to pay attention to within each subtask,” Wang explains.
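
To make the idea of a grounding classifier concrete, here is a minimal sketch in Python. It is not the team's algorithm: it assumes each recorded position already carries a subtask label and uses a simple nearest-centroid rule, and the coordinates, the `SAMPLES` table, and the function names are all invented for illustration.

```python
import math

# Hypothetical (x, y) gripper positions, each tagged with a subtask label.
# These values are made up; real data would come from demonstrations.
SAMPLES = {
    "reach": [(0.0, 0.5), (0.1, 0.4)],
    "scoop": [(0.2, 0.0), (0.3, 0.1)],
    "transport": [(0.6, 0.5), (0.7, 0.5)],
    "pour": [(1.0, 0.3), (1.0, 0.2)],
}


def fit_centroids(samples):
    """Average the positions recorded for each subtask label."""
    centroids = {}
    for label, points in samples.items():
        n = len(points)
        centroids[label] = (
            sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
        )
    return centroids


def classify(position, centroids):
    """Return the subtask label whose centroid is closest to the position."""
    return min(centroids, key=lambda label: math.dist(position, centroids[label]))


CENTROIDS = fit_centroids(SAMPLES)
```

With these made-up numbers, `classify((0.25, 0.05), CENTROIDS)` returns `"scoop"` and `classify((0.95, 0.3), CENTROIDS)` returns `"pour"`. A classifier working from camera images, as the article describes, would need a learned model rather than a distance rule.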

The team demonstrated the approach in experiments with a robotic arm that they trained on a marble-scooping task. Experimenters trained the robot by physically guiding it through the task of first reaching into a bowl, scooping up marbles, transporting them over an empty bowl, and pouring them in. After a few demonstrations, the team then used a pretrained LLM and asked the model to list the steps involved in scooping marbles from one bowl to another. The researchers then used their new algorithm to connect the LLM’s defined subtasks with the robot’s motion trajectory data. The algorithm automatically learned to map the robot’s physical coordinates in the trajectories and the corresponding image view to a given subtask.

The team then let the robot carry out the scooping task on its own, using the newly learned grounding classifiers. As the robot moved through the steps of the task, the experimenters pushed and nudged the bot off its path, and knocked marbles off its spoon at various points. Rather than stop and start from the beginning again, or continue blindly with no marbles on its spoon, the bot was able to self-correct, and completed each subtask before moving on to the next. (For instance, it would make sure that it successfully scooped marbles before transporting them to the empty bowl.)
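
The recovery behavior described above can be sketched as a small loop over subtasks. This is an illustrative toy, not the researchers' controller: the `disturbances` dictionary is a hypothetical stand-in for the grounding classifier reporting, at a given time step, that the robot has been knocked back into an earlier subtask.

```python
SUBTASKS = ["reach", "scoop", "transport", "pour"]


def run_task(subtasks, disturbances, max_steps=50):
    """Execute subtasks in order, resuming from wherever a disturbance leaves the robot.

    `disturbances` maps a time step to the subtask the robot finds itself in
    after being pushed at that step (an assumption made for this sketch).
    Returns the sequence of subtasks that were attempted.
    """
    index = 0
    trace = []
    for step in range(max_steps):
        if index == len(subtasks):
            break  # every subtask has been completed
        trace.append(subtasks[index])
        if step in disturbances:
            # Resume from the reported subtask instead of restarting the task.
            index = subtasks.index(disturbances[step])
        else:
            index += 1  # subtask finished, move on to the next one
    return trace
```

For example, `run_task(SUBTASKS, {2: "scoop"})` models marbles being knocked off the spoon during transport. The trace is reach, scoop, transport, scoop, transport, pour: the robot repeats the scoop rather than starting over or pouring from an empty spoon.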

“With our method, when the robot is making mistakes, we don’t need to ask humans to program or give extra demonstrations of how to recover from failures,” Wang says. “That’s super exciting because there’s a huge effort now toward training household robots with data collected on teleoperation systems. Our algorithm can now convert that training data into robust robot behavior that can do complex tasks, despite external perturbations.”


Anyone who has ever tried to pack a family-sized amount of luggage into a sedan-sized trunk knows this is a hard problem. Robots struggle with dense packing tasks, too.

For the robot, solving the packing problem involves satisfying many constraints, such as stacking luggage so suitcases don’t topple out of the trunk, heavy objects aren’t placed on top of lighter ones, and collisions between the robotic arm and the car’s bumper are avoided.

Some traditional methods tackle this problem sequentially, guessing a partial solution that meets one constraint at a time and then checking to see if any other constraints were violated. With a long sequence of actions to take, and a pile of luggage to pack, this process can be impractically time-consuming.

MIT researchers used a form of generative AI, called a diffusion model, to solve this problem more efficiently. Their method uses a collection of machine-learning models, each of which is trained to represent one specific type of constraint. These models are combined to generate global solutions to the packing problem, taking into account all constraints at once.

Their method was able to generate effective solutions faster than other techniques, and it produced a greater number of successful solutions in the same amount of time. Importantly, their technique was also able to solve problems with novel combinations of constraints and larger numbers of objects that the models did not see during training.

Due to this generalizability, their technique can be used to teach robots how to understand and meet the overall constraints of packing problems, such as the importance of avoiding collisions or a desire for one object to be next to another object. Robots trained in this way could be applied to a wide array of complex tasks in diverse environments, from order fulfillment in a warehouse to organizing a bookshelf in someone’s home.

“My vision is to push robots to do more complicated tasks that have many geometric constraints and more continuous decisions that need to be made — these are the kinds of problems service robots face in our unstructured and diverse human environments. With the powerful tool of compositional diffusion models, we can now solve these more complex problems and get great generalization results,” says Zhutian Yang, an electrical engineering and computer science graduate student and lead author of a paper on this new machine-learning technique.

Her co-authors include MIT graduate students Jiayuan Mao and Yilun Du; Jiajun Wu, an assistant professor of computer science at Stanford University; Joshua B. Tenenbaum, a professor in MIT’s Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Tomás Lozano-Pérez, an MIT professor of computer science and engineering and a member of CSAIL; and senior author Leslie Kaelbling, the Panasonic Professor of Computer Science and Engineering at MIT and a member of CSAIL. The research will be presented at the Conference on Robot Learning.

Constraint complications

Continuous constraint satisfaction problems are particularly challenging for robots. These problems appear in multistep robot manipulation tasks, like packing items into a box or setting a dinner table. They often involve satisfying a number of constraints, including geometric constraints, such as avoiding collisions between the robot arm and the environment; physical constraints, such as stacking objects so they are stable; and qualitative constraints, such as placing a spoon to the right of a knife.

There may be many constraints, and they vary across problems and environments depending on the geometry of objects and human-specified requirements.

To solve these problems efficiently, the MIT researchers developed a machine-learning technique called Diffusion-CCSP. Diffusion models learn to generate new data samples that resemble samples in a training dataset by iteratively refining their output.

To do this, diffusion models learn a procedure for making small improvements to a potential solution. To solve a problem, they start with a random, very poor solution and gradually improve it.

Using generative AI models, MIT researchers created a technique that could enable robots to efficiently solve continuous constraint satisfaction problems, such as packing objects into a box while avoiding collisions, as shown in this simulation.

For example, imagine randomly placing plates and utensils on a simulated table, allowing them to physically overlap. The collision-free constraints between objects will result in them nudging each other away, while qualitative constraints will drag the plate to the center, align the salad fork and dinner fork, etc.

Diffusion models are well-suited for this kind of continuous constraint-satisfaction problem because the influences from multiple models on the pose of one object can be composed to encourage the satisfaction of all constraints, Yang explains. By starting from a random initial guess each time, the models can obtain a diverse set of good solutions.
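
The composition idea can be illustrated with a toy example in one dimension. The sketch below is not Diffusion-CCSP and involves no trained diffusion model: the two hand-written "nudge" functions stand in for learned constraint models, there is no noise schedule, and `MIN_GAP`, the step count, and the update rate are arbitrary assumptions. It shows only how separate per-constraint influences on the same poses can be summed and applied repeatedly, starting from a random guess.

```python
import random

MIN_GAP = 1.0  # assumed minimum distance between the two object centres


def separation_nudge(poses):
    """Push the two objects apart while they are closer than MIN_GAP."""
    a, b = poses
    gap = b - a
    if abs(gap) >= MIN_GAP:
        return [0.0, 0.0]
    direction = 1.0 if gap >= 0 else -1.0
    push = (MIN_GAP - abs(gap)) / 2
    return [-direction * push, direction * push]


def centre_nudge(poses):
    """Pull the first object toward x = 0 and leave the second alone."""
    return [-poses[0], 0.0]


def refine(poses, nudges, steps=200, rate=0.2):
    """Repeatedly apply the summed nudges from every constraint."""
    poses = list(poses)
    for _ in range(steps):
        total = [0.0] * len(poses)
        for nudge in nudges:
            for i, delta in enumerate(nudge(poses)):
                total[i] += delta
        poses = [p + rate * t for p, t in zip(poses, total)]
    return poses


def solve(seed):
    """Start from a random guess and refine it under both constraints."""
    rng = random.Random(seed)
    start = [rng.uniform(-3, 3), rng.uniform(-3, 3)]
    return refine(start, [separation_nudge, centre_nudge])
```

Calling `solve` with different seeds starts from different random guesses. In each case the first object settles near the centre, and the second ends up roughly `MIN_GAP` or more away from it, on whichever side the random start favoured.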

Working together

For Diffusion-CCSP, the researchers wanted to capture the interconnectedness of the constraints. In packing, for instance, one constraint might require a certain object to be next to another object, while a second constraint might specify where one of those objects must be located.

Diffusion-CCSP learns a family of diffusion models, with one for each type of constraint. The models are trained together, so they share some knowledge, like the geometry of the objects to be packed.

The models then work together to find solutions, in this case locations for the objects to be placed, that jointly satisfy the constraints.

“We don’t always get to a solution at the first guess. But when you keep refining the solution and some violation happens, it should lead you to a better solution. You get guidance from getting something wrong,” she says.

Training individual models for each constraint type and then combining them to make predictions greatly reduces the amount of training data required, compared to other approaches.

However, training these models still requires a large amount of data that demonstrate solved problems. Humans would need to solve each problem with traditional slow methods, making the cost to generate such data prohibitive, Yang says.

Instead, the researchers reversed the process by coming up with solutions first. They used fast algorithms to generate segmented boxes and fit a diverse set of 3D objects into each segment, ensuring tight packing, stable poses, and collision-free solutions.

“With this process, data generation is almost instantaneous in simulation. We can generate tens of thousands of environments where we know the problems are solvable,” she says.
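
A rough sketch of this "solution first" data generation, reduced to two dimensions, might look like the following. It is an assumption-laden illustration rather than the researchers' generator: the recursive splitting rule, the `margin` parameter, and the use of plain rectangles instead of diverse 3D objects are all simplifications chosen here.

```python
import random


def split_box(x, y, w, h, depth, rng):
    """Recursively cut a rectangle into smaller, non-overlapping segments."""
    if depth == 0:
        return [(x, y, w, h)]
    cut = rng.uniform(0.3, 0.7)
    if w >= h:  # cut across the longer side to avoid thin slivers
        left = split_box(x, y, w * cut, h, depth - 1, rng)
        right = split_box(x + w * cut, y, w * (1 - cut), h, depth - 1, rng)
        return left + right
    bottom = split_box(x, y, w, h * cut, depth - 1, rng)
    top = split_box(x, y + h * cut, w, h * (1 - cut), depth - 1, rng)
    return bottom + top


def make_problem(seed, depth=3, margin=0.05):
    """Build a packing example that is solvable by construction.

    Each segment of a unit box receives one rectangle shrunk by `margin`
    on every side, so the objects fit in the box without colliding.
    """
    rng = random.Random(seed)
    objects = []
    for x, y, w, h in split_box(0.0, 0.0, 1.0, 1.0, depth, rng):
        objects.append(
            (x + margin * w, y + margin * h, w * (1 - 2 * margin), h * (1 - 2 * margin))
        )
    return objects


def overlaps(a, b):
    """Return True if two (x, y, w, h) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah
```

Because the placements are produced before the problem is posed, every generated example comes with a known valid answer, which is what makes it usable as training data without running a slow solver.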

Trained on these data, the diffusion models work together to determine where the robotic gripper should place objects so that the packing task is achieved while all of the constraints are met.

The researchers conducted feasibility studies, and then demonstrated Diffusion-CCSP with a real robot solving a number of difficult problems, including fitting 2D triangles into a box, packing 2D shapes with spatial relationship constraints, stacking 3D objects with stability constraints, and packing 3D objects with a robotic arm.

Their method outperformed other techniques in many experiments, generating a greater number of effective solutions that were both stable and collision-free.

In the future, Yang and her collaborators want to test Diffusion-CCSP in more complicated situations, such as with robots that can move around a room. They also want to enable Diffusion-CCSP to tackle problems in different domains without the need to be retrained on new data.

“Diffusion-CCSP is a machine-learning solution that builds on existing powerful generative models,” says Danfei Xu, an assistant professor in the School of Interactive Computing at the Georgia Institute of Technology and a Research Scientist at NVIDIA AI, who was not involved with this work. “It can quickly generate solutions that simultaneously satisfy multiple constraints by composing known individual constraint models. Although it’s still in the early phases of development, the ongoing advancements in this approach hold the promise of enabling more efficient, safe, and reliable autonomous systems in various applications.”

This research was funded, in part, by the National Science Foundation, the Air Force Office of Scientific Research, the Office of Naval Research, the MIT-IBM Watson AI Lab, the MIT Quest for Intelligence, the Center for Brains, Minds, and Machines, Boston Dynamics Artificial Intelligence Institute, the Stanford Institute for Human-Centered Artificial Intelligence, Analog Devices, JPMorgan Chase and Co., and Salesforce.


