Master's Thesis on “Implementing and Evaluating a Vision-Language-Action Model for Robotic Manipulation”

A Franka robot performing the tasks “Move Yellow Corn onto Plate” and “Flip Pot Upright” from https://openvla.github.io/.

Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)

Advisors: Eric Hannus (eric.hannus@aalto.fi), Dr. Tran Nguyen Le (tran.nguyenle@aalto.fi)  

Keywords: robot manipulation, vision-language-action model 

Project Description 

The success of foundation models (deep-learning models trained on large quantities of data, resulting in powerful models applicable to many downstream tasks [1]) has led to increased interest in applying such models in the robotics domain (see e.g. [2] for a review). For example, the common-sense knowledge of large language models has been used to generate high-level plans [3], and video-language models have been used to generate rewards for reinforcement learning [4]. Another promising class of models are Vision-Language-Action (VLA) models, which are trained on large and diverse datasets of real robot experience and learn an end-to-end policy mapping language instructions and visual observations to low-level robot actions [5-7]. The goal of this thesis is for the student to implement and evaluate a recent model of this type, OpenVLA [6], in a real robot-laboratory setting.
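For concreteness, the sketch below illustrates how a pre-trained OpenVLA checkpoint might be queried for an action from a camera image and a language instruction, following the usage pattern described in the OpenVLA repository [6]. The camera and robot calls (get_camera_image, send_to_robot) are placeholders for the lab's own Franka control stack, and the exact model identifier, prompt format, and un-normalization key should be verified against the repository.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the pre-trained OpenVLA checkpoint from the Hugging Face Hub
# (model id as listed in the OpenVLA repository; requires trust_remote_code).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Placeholders for the lab's own camera and robot interfaces.
image: Image.Image = get_camera_image()          # hypothetical camera call
instruction = "move the yellow corn onto the plate"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# Predict a 7-DoF action (end-effector deltas + gripper command); the
# un-normalization key selects the statistics of a pre-training dataset.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

send_to_robot(action)                            # hypothetical robot-control call
```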

Deliverables 

  • Implementation and configuration of the OpenVLA [6] model in our lab, and creation of a test environment suitable for zero-shot execution of the model, which has been pre-trained on existing datasets.
  • Evaluation of the zero-shot performance of OpenVLA in the replicated environment, and a study of how robust this performance remains when changes are introduced in the environment (see the evaluation sketch after this list).
  • Potentially, using the available fine-tuning capabilities to transfer the model to a novel environment.
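As a starting point for the evaluation deliverable, the sketch below shows one possible way to estimate zero-shot success rates per task and to repeat the measurement under environment perturbations. All environment-facing helpers (capture_rgb, execute, task_succeeded), the task list, episode length, and trial count are assumptions to be replaced by the lab's actual setup and evaluation protocol.

```python
import torch
from PIL import Image

# Hypothetical task list; replace with the tasks chosen for the replicated environment.
TASKS = ["move the yellow corn onto the plate", "flip the pot upright"]

def run_episode(vla, processor, instruction, max_steps=150):
    """Roll out the pre-trained policy zero-shot for a single instruction."""
    for _ in range(max_steps):
        image = Image.fromarray(capture_rgb())          # hypothetical camera call
        prompt = f"In: What action should the robot take to {instruction}?\nOut:"
        inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
        execute(action)                                 # hypothetical robot command
    return task_succeeded(instruction)                  # manual or scripted success check

def evaluate(vla, processor, n_trials=10):
    """Estimate the zero-shot success rate per task; rerun under controlled
    perturbations (lighting, distractors, camera pose) to study robustness."""
    return {
        task: sum(run_episode(vla, processor, task) for _ in range(n_trials)) / n_trials
        for task in TASKS
    }
```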

Practical Information 

Prerequisites: Python, PyTorch, experience with robotic manipulation, experience with machine learning

Start: Available immediately 

References 

  1. “On the Opportunities and Risks of Foundation Models”, https://arxiv.org/abs/2108.07258
  2. “Language-conditioned Learning for Robotic Manipulation: A Survey”, https://arxiv.org/abs/2312.10807
  3. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, https://arxiv.org/abs/2204.01691
  4. “RoboCLIP: One Demonstration is Enough to Learn Robot Policies”, https://proceedings.neurips.cc/paper_files/paper/2023/hash/ae54ce310476218f26dd48c1626d5187-Abstract-Conference.html
  5. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models”, https://arxiv.org/abs/2310.08864
  6. “OpenVLA: An Open-Source Vision-Language-Action Model”, https://arxiv.org/abs/2406.09246
  7. “Octo: An Open-Source Generalist Robot Policy”, https://arxiv.org/abs/2405.12213