06-20-2023

Robots can now learn household tasks by watching videos of people

Earth.com staff writer

Robots can now learn household chores simply by watching videos of humans performing them.

This eerily fascinating new breakthrough was made by researchers from the School of Computer Science at Carnegie Mellon University (CMU).

This advancement in robotics could significantly enhance the usefulness of robots in domestic settings, enabling them to assist with tasks such as cooking and cleaning.

How the team taught robots to learn by watching videos

The researchers successfully trained two robots to perform 12 varied tasks. These included opening a drawer, an oven door, and a lid; taking a pot off the stove; and picking up objects such as a telephone or a can of soup.

Deepak Pathak, an assistant professor at CMU’s Robotics Institute, explained, “The robot can learn where and how humans interact with different objects through watching videos.”

He further added that the knowledge acquired through these videos was instrumental in training a model that allowed the robots to perform similar tasks in different environments.

Conventionally, training a robot involves either humans manually demonstrating tasks or extensive training in a simulated environment. Both methods are laborious and prone to failure.

The WHIRL method vs. the VRB method

Previously, Pathak and his students had proposed a novel method in which robots could learn from observing humans complete tasks. This method, named In-the-Wild Human Imitating Robot Learning (WHIRL), necessitated that the human perform the task in the same environment as the robot.

Pathak’s latest research, termed the Vision-Robotics Bridge (VRB), expands upon and refines the WHIRL concept. The new model bypasses the need for human demonstrations. It also removes the requirement that the robot operate in an identical environment.

Much like WHIRL, however, the robot still needs practice to perfect a task. The researchers found that a robot could learn a new task in just 25 minutes.

Shikhar Bahl, a Ph.D. student in robotics, said, “We were able to take robots around campus and do all sorts of tasks.”

He added, “Robots can use this model to curiously explore the world around them. Instead of just flailing its arms, a robot can be more direct with how it interacts.”

The secret to teaching robots to learn through observation

The key to teaching the robots was applying the concept of affordances. This is an idea rooted in psychology that refers to what an environment offers an individual.

In the case of VRB, affordances were used to determine where and how a robot might interact with an object, based on observed human behavior.

For instance, if a robot watches a human opening a drawer, it identifies the point of contact (the handle) and the direction of the drawer’s movement (straight out from the starting location). After observing several videos of humans opening drawers, the robot can work out how to open any drawer.
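The drawer example can be made concrete with a small sketch. The code below is purely illustrative and is not the researchers' model, which is learned from video. It assumes, hypothetically, that each observed video has already been reduced to a contact point and a direction of motion in a shared object-centered frame, and it simply averages them. The names Affordance and average_affordance are invented for this illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Affordance:
    """Hypothetical summary of one observed interaction."""
    contact: tuple[float, float]    # where the hand touches the object
    direction: tuple[float, float]  # unit vector of motion after contact


def average_affordance(observations: list[Affordance]) -> Affordance:
    """Combine several observations into one estimate by simple averaging."""
    n = len(observations)
    cx = sum(o.contact[0] for o in observations) / n
    cy = sum(o.contact[1] for o in observations) / n
    dx = sum(o.direction[0] for o in observations) / n
    dy = sum(o.direction[1] for o in observations) / n
    # Re-normalize so the averaged direction is again a unit vector.
    norm = hypot(dx, dy) or 1.0
    return Affordance((cx, cy), (dx / norm, dy / norm))


# Two made-up observations of a drawer being pulled along the +y axis.
observations = [
    Affordance(contact=(100.0, 50.0), direction=(0.0, 1.0)),
    Affordance(contact=(104.0, 54.0), direction=(0.0, 1.0)),
]
result = average_affordance(observations)
print(result)
```

In this toy version, more observations only refine an average. A real system would have to predict the contact point and direction from the image of an unfamiliar drawer, which is the harder problem.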

To train the robots, the team utilized large datasets of videos, such as Ego4D and Epic Kitchens. The Ego4D dataset comprises almost 4,000 hours of first-person perspective videos of daily activities from around the world. Some of these were collected by CMU researchers.

Similarly, Epic Kitchens features videos showcasing cooking, cleaning, and other kitchen tasks. These datasets are typically used to train computer vision models.

“We are using these datasets in a new and different way,” Bahl shared. He concluded, “This work could enable robots to learn from the vast amount of internet and YouTube videos available.”

This forward-thinking application of existing datasets points to a future in which domestic robots can be trained to perform a wide variety of tasks, potentially making daily life easier and more efficient.

More about WHIRL and VRB

In-the-Wild Human Imitating Robot Learning (WHIRL) refers to a learning approach for robots where they learn by observing humans perform tasks in the same environment.

The approach is closely related to a concept in robotics called Learning from Demonstration (LfD), or imitation learning. In LfD, a human demonstrator performs a task and the robot learns to replicate that behavior.

This approach can be very effective. However, it requires the human to be in the same environment as the robot and perform the task exactly as they want the robot to do it. This method can be time-consuming and cumbersome.

The Vision-Robotics Bridge (VRB), by contrast, advances this concept so that the robot no longer requires the human to be in the same environment for learning to occur. The method involves a kind of transfer learning, in which knowledge gained in one context is applied to another.

In this case, VRB involves training a model on video data of humans performing tasks. The model is then transferred to a robot so that it can perform similar tasks in different environments.

The use of video data for robotic learning is a promising area of research. Large video datasets, such as Ego4D and Epic Kitchens, can be valuable resources for training robots. They can help robots learn how humans interact with objects and their environment, which can then guide the robots’ own interactions.