Master's Thesis on "Implementing and Evaluating a Vision-Language-Action Model for Robotic Manipulation"
Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)
Advisors: Eric Hannus (eric.hannus@aalto.fi), Dr. Tran Nguyen Le (tran.nguyenle@aalto.fi)
Keywords: robot manipulation, vision-language-action model
Project Description
The success of foundation models (deep-learning models trained on large quantities of data, yielding powerful models that are applicable to many downstream tasks [1]) has led to increased interest in utilizing such models in the robotics domain (see e.g. [2] for a review). For example, the common-sense knowledge of large language models has been used to generate high-level plans [3], and video-language models have been used to generate rewards for reinforcement learning [4]. Another promising class of models is Vision-Language-Action (VLA) models, which are trained on large and diverse datasets of real robot experience and learn an end-to-end policy mapping language instructions and visual observations to low-level robot actions [5-7]. The goal of this thesis is for the student to implement and evaluate a recent model of this type, OpenVLA [6], in a real robot laboratory setting.
Deliverables
- Implementation and configuration of the OpenVLA [6] model in our lab, and creation of a test environment suitable for zero-shot execution of the model, which has been pre-trained on existing datasets (a minimal inference sketch is given after this list).
- Evaluation of the zero-shot performance of OpenVLA in the replicated environment, and a study of how robust this zero-shot performance is when changes are induced in the environment.
- Potential use of the available fine-tuning capabilities to transfer the model to a novel environment.
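
As a rough illustration of the first deliverable, the sketch below loads the publicly released OpenVLA checkpoint through Hugging Face Transformers and queries it for a single action from one camera image and a language instruction, following the interface described in the OpenVLA release [6]. It is a minimal sketch under assumptions, not a definitive implementation: the camera image and the robot-side execution are placeholders that would be replaced by the lab's own camera and controller interfaces.

```python
# Minimal zero-shot inference sketch (assumptions: the public "openvla/openvla-7b"
# checkpoint and the Hugging Face interface described in the OpenVLA release;
# the camera image and robot-side execution are placeholders for the lab setup).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed name of the released checkpoint

# Load processor and model (the checkpoint ships custom model code, hence trust_remote_code).
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# In the lab, this image would come from the robot's camera; a blank image is
# used here only so the script can be smoke-tested without hardware.
image = Image.new("RGB", (224, 224))
instruction = "pick up the red block"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# predict_action returns a 7-DoF end-effector action (delta pose + gripper),
# de-normalized with the action statistics of the chosen pre-training dataset
# ("bridge_orig" refers to the BridgeData V2 subset of Open X-Embodiment [5]).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Placeholder: forward the action to the robot controller at each control step,
# e.g. robot.step(action) with the lab's own control interface.
print(action)
```

For the optional third deliverable, the OpenVLA release also documents fine-tuning support (including parameter-efficient LoRA fine-tuning) on demonstration data collected in the target environment; this would be explored only after the zero-shot evaluation.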
Practical Information
Prerequisites: Python, PyTorch, experience with robotic manipulation and machine learning
Start: Available immediately
References
- [1] "On the Opportunities and Risks of Foundation Models", https://arxiv.org/abs/2108.07258
- [2] "Language-conditioned Learning for Robotic Manipulation: A Survey", https://arxiv.org/abs/2312.10807
- [3] "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", https://arxiv.org/abs/2204.01691
- [4] "RoboCLIP: One Demonstration is Enough to Learn Robot Policies", https://proceedings.neurips.cc/paper_files/paper/2023/hash/ae54ce310476218f26dd48c1626d5187-Abstract-Conference.html
- [5] "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", https://arxiv.org/abs/2310.08864
- [6] "OpenVLA: An Open-Source Vision-Language-Action Model", https://arxiv.org/abs/2406.09246
- [7] "Octo: An Open-Source Generalist Robot Policy", https://arxiv.org/abs/2405.12213