Utilising LLMs for robotic navigation

17th June 2024
Harry Fowle

Researchers have found a method that uses language-based inputs instead of costly visual data to direct a robot through a multistep navigation task.

In the future, when we all have our own personal home robot, you might want to give your robot helper a string of specific commands and objectives to complete. Perhaps you need your home shopping order put into a specific spot, for example. The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete this task.

For an AI agent, this task presents significant challenges. Traditional methods typically employ various bespoke machine-learning models to address different aspects of the task, necessitating considerable human effort and expertise in their development. These techniques, which rely on visual representations for navigation decisions, require vast amounts of visual data for training, which can be difficult to obtain.

To address these issues, researchers from MIT and the MIT-IBM Watson AI Lab developed a navigation technique that translates visual representations into language components, which are then processed by a single large language model capable of handling all stages of the multistep navigation process.

Instead of encoding visual features from a robot's surroundings as visual representations, which is resource-intensive, their approach generates textual descriptions of the robot’s perspective. A large language model then uses these text captions to predict the actions the robot should take to follow the user's language-based instructions.

Since this method relies solely on language-based representations, it can leverage a large language model to produce extensive synthetic training data efficiently.

Although this method does not surpass the performance of techniques using visual features, it is effective in scenarios with insufficient visual data for training. The researchers discovered that combining their language-based inputs with visual signals enhances navigation performance.

“By purely using language as the perceptual representation, ours is a more straightforward approach. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an EECS graduate student and lead author of a paper on this approach.

Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Phillip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.

Solving a vision problem with language

Since large language models are among the most capable machine-learning models available, the researchers aimed to integrate them into the complex task of vision-and-language navigation, Pan explained.

However, these models can only handle text-based inputs and cannot process visual data from a robot’s camera. Therefore, the team had to find a way to utilise language instead.

Their approach employed a simple captioning model to generate text descriptions of a robot’s visual observations. These captions were then combined with language-based instructions and fed into a large language model, which determined the next navigation step for the robot.

The large language model produced a caption of the scene the robot should observe after completing that step. This updated the trajectory history, enabling the robot to keep track of its movements.

The model repeated these processes to create a trajectory that guided the robot to its goal, step by step.
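The loop described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: the function names (`build_prompt`, `navigate`), the prompt wording, and the reply format (an option number or the word "stop") are all assumptions made for the example. The captioner, the language model and the robot's motion are passed in as plain callables so that any real components could be substituted. As a simplification, the caption of the chosen option stands in for the model's predicted next-scene caption when the trajectory history is updated.

```python
from typing import Callable, List, Sequence

STOP = "stop"  # hypothetical reply the model gives when the goal is reached


def build_prompt(instruction: str, history: Sequence[str], captions: Sequence[str]) -> str:
    """Combine the instruction, the trajectory so far and the current captions."""
    past = "\n".join(f"- {h}" for h in history) or "- (none yet)"
    options = "\n".join(f"({i}) {c}" for i, c in enumerate(captions))
    return (
        f"Instruction: {instruction}\n"
        f"Trajectory so far:\n{past}\n"
        f"Current options:\n{options}\n"
        f"Answer with an option number or '{STOP}'."
    )


def navigate(
    instruction: str,
    get_captions: Callable[[], List[str]],  # captioner output for the current views
    choose: Callable[[str], str],           # language model: prompt in, reply out
    move: Callable[[int], None],            # moves the robot towards the chosen option
    max_steps: int = 20,
) -> List[str]:
    """Repeat caption -> prompt -> choose -> move until the model says stop."""
    history: List[str] = []
    for _ in range(max_steps):
        captions = get_captions()
        reply = choose(build_prompt(instruction, history, captions)).strip()
        if reply == STOP:
            break
        index = int(reply)
        # Simplification: record the chosen caption as the new history entry.
        history.append(captions[index])
        move(index)
    return history
```

Because every step is plain text, the returned history doubles as a human-readable record of the route the agent took.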

To simplify the process, the researchers developed templates so observation information was presented to the model in a standard format, offering a series of choices the robot could make based on its surroundings.

For example, a caption might describe, “20 degrees to your right is a table with a lamp on it, directly ahead is a bookshelf filled with books,” and so on. The model then decides whether the robot should move towards the table or the bookshelf.
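A template of this kind might look something like the sketch below. The exact format the researchers used is not given here, so the `Observation` class, the signed-heading convention and the numbered-option layout are assumptions chosen to reproduce the wording of the example caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    heading_deg: int  # assumed convention: negative = left, positive = right, 0 = ahead
    caption: str      # text produced by the captioning model for that view


def describe_direction(heading_deg: int) -> str:
    """Turn a signed heading into the phrase used in the caption."""
    if heading_deg == 0:
        return "directly ahead"
    side = "right" if heading_deg > 0 else "left"
    return f"{abs(heading_deg)} degrees to your {side}"


def format_observations(observations: List[Observation]) -> str:
    """Render every view as one numbered option in a fixed, standard format."""
    return "\n".join(
        f"({i}) {describe_direction(obs.heading_deg)} is {obs.caption}"
        for i, obs in enumerate(observations)
    )


views = [
    Observation(20, "a table with a lamp on it"),
    Observation(0, "a bookshelf filled with books"),
]
print(format_observations(views))
```

Numbering the options gives the model a small, fixed set of answers to pick from, which keeps its reply easy to parse.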

“One of the biggest challenges was figuring out how to encode this kind of information into language in a proper way to make the agent understand what the task is and how they should respond,” Pan says.

Advantages of language

When they evaluated this approach, it did not exceed the performance of vision-based techniques but presented several notable benefits.

Firstly, since synthesising text requires less computational power than generating detailed image data, their method could quickly create synthetic training data. In one experiment, they generated 10,000 synthetic trajectories from just 10 real-world visual trajectories.

This method also addressed the common issue where an agent trained in a simulated environment struggles to perform in the real world. This difficulty often arises because computer-generated images can differ significantly from real-world scenes due to factors like lighting and colour variations. However, language descriptions of synthetic and real images are much more difficult to distinguish, as Pan pointed out.

Furthermore, the text-based representations used by their model are more accessible for human interpretation since they are written in natural language.

“If the agent fails to reach its goal, we can more easily determine where it failed and why it failed. Maybe the history information is not clear enough or the observation ignores some important details,” Pan says.

Additionally, their method could be more readily adapted to different tasks and environments since it relies on a single type of input. As long as the data can be translated into language, the same model can be used without modifications.

One downside, however, is that this approach inherently misses some details captured by vision-based models, such as depth information.

Nevertheless, the researchers were surprised to discover that integrating language-based representations with vision-based methods enhanced the agent's navigation capabilities.

“Maybe this means that language can capture some higher-level information that cannot be captured with pure vision features,” Pan says.

This is an area the researchers plan to further investigate. They also aim to develop a captioner specifically designed for navigation, which could enhance the method’s performance. Additionally, they intend to explore the capacity of large language models to demonstrate spatial awareness and determine how this could improve language-based navigation.

© Copyright 2024 Electronic Specifier