Master's Thesis on "Controlling a Robotic Arm with Instructions in Natural Language"
Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)
Advisors: Dr. Tsvetomila Mihaylova (tsvetomila.mihaylova@aalto.fi), Tran Nguyen Le (tran.nguyenle@aalto.fi)
Keywords: robotic manipulation, natural language processing
As robots become more capable and more integrated into society, there is a growing need for people-friendly communication between robots and humans. A natural way of giving instructions to an autonomous system is through language. A recent research direction is the integration of large pretrained vision-language models into robotic control, where a human gives a robot an instruction in natural language and the robot performs the corresponding manipulation task.
Project Description
The goal of this master's thesis is to integrate a large vision-language model (VLM) with a manipulation policy in order to control a robotic arm for predefined manipulation tasks, such as grasping or pushing.
The thesis includes a review of the latest work on the topic, the selection of suitable datasets, a VLM, and a manipulation model, and the configuration of an experimental setup. Initially, the experiments will be carried out in a simulator. An additional option is to run the experiments on a real robotic arm (Franka Emika Panda) and to identify the challenges of transferring from simulation to the real robot.
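As a rough illustration of the intended pipeline, the sketch below grounds a natural-language instruction with a pretrained VLM and hands the selected target to a manipulation policy. CLIP accessed via the Hugging Face transformers library is only an assumed model choice; the object crops, camera detections, and policy interface are hypothetical stand-ins, not part of the project specification.

    # Rough pipeline sketch (not the thesis implementation): ground a language
    # instruction with a pretrained VLM and hand the result to a manipulation
    # policy. CLIP via Hugging Face `transformers` is an assumed model choice;
    # the object crops and the policy interface are hypothetical stand-ins.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def ground_instruction(instruction: str, object_crops: list[Image.Image]) -> int:
        """Return the index of the object crop that best matches the instruction."""
        inputs = processor(text=[instruction], images=object_crops,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # logits_per_text has shape (1, num_crops): similarity of the
        # instruction to each candidate object crop.
        return int(outputs.logits_per_text.argmax(dim=-1))

    # Hypothetical downstream use: the selected target is passed to a
    # manipulation policy, e.g. a grasping policy trained in simulation.
    # target_idx = ground_instruction("pick up the red block", crops)
    # policy.execute(task="grasp", target=detections[target_idx])

Which VLM, grounding strategy, and manipulation policy to use is exactly what the literature review and the experimental work are expected to determine.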
Deliverables
- Literature review of the use of vision-language models for different manipulation tasks, and selection of a task to focus on based on this review
- Integration of a vision-language model for a selected manipulation task
- Execution of experiments in a simulator environment
- Execution of the experiments on a real robotic arm and identification of the gaps between the simulation and the real-world implementation
Practical Information
Prerequisites: Python, PyTorch, experience with robotic manipulation and machine learning
Simulators to be used: Isaac Sim (a minimal startup sketch is included below)
Start: Available immediately
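For orientation only, the sketch below shows one way to launch Isaac Sim headlessly from a standalone Python script and add a Franka arm to the scene. The module paths follow the standalone examples shipped with recent Isaac Sim releases and may differ in other versions; this is an assumed setup, not a prescribed one.

    # Minimal startup sketch for a standalone Isaac Sim script (assumed
    # module paths from the Isaac Sim standalone examples; adjust to the
    # installed release).
    from omni.isaac.kit import SimulationApp

    # SimulationApp must be created before importing other omni.isaac modules.
    simulation_app = SimulationApp({"headless": True})

    from omni.isaac.core import World
    from omni.isaac.franka import Franka

    world = World()
    world.scene.add_default_ground_plane()
    franka = world.scene.add(Franka(prim_path="/World/Franka", name="franka"))
    world.reset()

    # Step the physics simulation for a short episode.
    for _ in range(200):
        world.step(render=False)

    simulation_app.close()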
References
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation, https://arxiv.org/abs/2304.13826
- Instruction-Following Agents with Multimodal Transformer, https://arxiv.org/abs/2210.13431
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation, https://peract.github.io/paper/peract_corl2022.pdf
- Interactive Language: Talking to Robots in Real Time, https://arxiv.org/abs/2210.06407
- PaLM-E: An Embodied Multimodal Language Model, https://arxiv.org/abs/2303.03378
- Open-World Object Manipulation using Pre-Trained Vision-Language Models, https://robot-moo.github.io/
- Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control, https://grounded-decoding.github.io/
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, https://say-can.github.io/
- V2A – Vision to Action: Learning robotic arm actions based on vision and language, https://openaccess.thecvf.com/content/ACCV2020/papers/Nazarczuk_V2A_-_Vision_to_Action_Learning_robotic_arm_actions_based_ACCV_2020_paper.pdf