Master's Thesis on "Controlling a Robotic Arm with Instructions in Natural Language"
Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)
Advisors: Dr. Tsvetomila Mihaylova (tsvetomila.mihaylova@aalto.fi), Tran Nguyen Le (tran.nguyenle@aalto.fi)
Keywords: robotic manipulation, natural language processing
As robots become more capable and more integrated into society, there is a growing need for people-friendly communication between robots and humans. A natural way of giving instructions to an autonomous system is through language. A recent research direction is the integration of large pretrained vision-language models into robotic control, where a human gives a robot an instruction in natural language and the robot performs the corresponding manipulation task.
Project Description
The goal of this master's thesis is to integrate a large vision-language model (VLM) with a manipulation policy in order to control a robotic arm for predefined manipulation tasks, such as grasping or pushing.
The thesis includes a review of the latest work on the topic, the selection of suitable datasets, a VLM, and a manipulation model, and the configuration of an experimental setup. Initially, the experiments will be carried out in a simulator. An additional option is to run the experiments on a real robotic arm (Franka Emika Panda) and to identify the challenges of transferring from simulation to the real robot.
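As a rough illustration of the intended pipeline, the sketch below grounds a natural-language instruction with a pretrained VLM and hands the selected target to a manipulation policy. CLIP accessed via the Hugging Face transformers library is only an assumed model choice; the object crops, camera detections, and policy interface are hypothetical stand-ins, not part of the project specification.

    # Rough pipeline sketch (not the thesis implementation): ground a language
    # instruction with a pretrained VLM and hand the result to a manipulation
    # policy. CLIP via Hugging Face `transformers` is an assumed model choice;
    # the object crops and the policy interface are hypothetical stand-ins.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def ground_instruction(instruction: str, object_crops: list[Image.Image]) -> int:
        """Return the index of the object crop that best matches the instruction."""
        inputs = processor(text=[instruction], images=object_crops,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # logits_per_text has shape (1, num_crops): similarity of the
        # instruction to each candidate object crop.
        return int(outputs.logits_per_text.argmax(dim=-1))

    # Hypothetical downstream use: the selected target is passed to a
    # manipulation policy, e.g. a grasping policy trained in simulation.
    # target_idx = ground_instruction("pick up the red block", crops)
    # policy.execute(task="grasp", target=detections[target_idx])

Which VLM, grounding strategy, and manipulation policy to use is exactly what the literature review and the experimental work are expected to determine.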
Deliverables
- Literature review of the use of vision-language models for different manipulation tasks, and selection of a task to focus on based on this review
- Integration of a vision-language model for a selected manipulation task
- Execution of experiments in a simulator environment
- Execution of the experiments on a real robotic arm and identification of the gaps between the simulation and the real-world implementation
Practical Information
Prerequisites: Python, PyTorch, experience with robotic manipulation and machine learning
Simulators to be used: Isaac Sim (a minimal startup sketch is included below)
Start: Available immediately
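For orientation only, the sketch below shows one way to launch Isaac Sim headlessly from a standalone Python script and add a Franka arm to the scene. The module paths follow the standalone examples shipped with recent Isaac Sim releases and may differ in other versions; this is an assumed setup, not a prescribed one.

    # Minimal startup sketch for a standalone Isaac Sim script (assumed
    # module paths from the Isaac Sim standalone examples; adjust to the
    # installed release).
    from omni.isaac.kit import SimulationApp

    # SimulationApp must be created before importing other omni.isaac modules.
    simulation_app = SimulationApp({"headless": True})

    from omni.isaac.core import World
    from omni.isaac.franka import Franka

    world = World()
    world.scene.add_default_ground_plane()
    franka = world.scene.add(Franka(prim_path="/World/Franka", name="franka"))
    world.reset()

    # Step the physics simulation for a short episode.
    for _ in range(200):
        world.step(render=False)

    simulation_app.close()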
References
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation, https://arxiv.org/abs/2304.13826
- Instruction-Following Agents with Multimodal Transformer, https://arxiv.org/abs/2210.13431
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation, https://peract.github.io/paper/peract_corl2022.pdf
- Interactive Language: Talking to Robots in Real Time, https://arxiv.org/abs/2210.06407
- PaLM-E: An Embodied Multimodal Language Model, https://arxiv.org/abs/2303.03378
- Open-World Object Manipulation using Pre-Trained Vision-Language Models, https://robot-moo.github.io/
- Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control, https://grounded-decoding.github.io/
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, https://say-can.github.io/
- V2A – Vision to Action: Learning robotic arm actions based on vision and language, https://openaccess.thecvf.com/content/ACCV2020/papers/Nazarczuk_V2A_-_Vision_to_Action_Learning_robotic_arm_actions_based_ACCV_2020_paper.pdf