Master's Thesis on "Implementing and Evaluating a Vision-Language-Action Model for Robotic Manipulation"
Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)
Advisors: Eric Hannus (eric.hannus@aalto.fi), Dr. Tran Nguyen Le (tran.nguyenle@aalto.fi)
Keywords: robot manipulation, vision-language-action model
Project Description
The success of foundation models (deep-learning models trained on large quantities of data, yielding powerful models that are applicable to many downstream tasks [1]) has led to increased interest in utilizing such models in the robotics domain (see e.g. [2] for a review). For example, the common-sense knowledge of large language models has been used to generate high-level plans [3], and video-language models have been used to generate rewards for reinforcement learning [4]. Another promising class of models is Vision-Language-Action (VLA) models, which are trained on large and diverse datasets of real robot experience and learn an end-to-end policy mapping language instructions and visual observations to low-level robot actions [5-7]. The goal of this thesis is for the student to implement and evaluate a recent model of this type, OpenVLA [6], in a real robot laboratory setting.
Deliverables
- Implementation and configuration of the OpenVLA [6] model in our lab, and creation of a test environment suitable for zero-shot execution of the model, which has been pre-trained on existing datasets (a minimal inference sketch is given after this list).
- Evaluation of the zero-shot performance of OpenVLA in the replicated environment, and a study of how robust this zero-shot performance is when changes are induced in the environment.
- Potential use of the available fine-tuning capabilities to transfer the model to a novel environment.
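
As a rough illustration of the first deliverable, the sketch below loads the publicly released OpenVLA checkpoint through Hugging Face Transformers and queries it for a single action from one camera image and a language instruction, following the interface described in the OpenVLA release [6]. It is a minimal sketch under assumptions, not a definitive implementation: the camera image and the robot-side execution are placeholders that would be replaced by the lab's own camera and controller interfaces.

```python
# Minimal zero-shot inference sketch (assumptions: the public "openvla/openvla-7b"
# checkpoint and the Hugging Face interface described in the OpenVLA release;
# the camera image and robot-side execution are placeholders for the lab setup).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed name of the released checkpoint

# Load processor and model (the checkpoint ships custom model code, hence trust_remote_code).
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# In the lab, this image would come from the robot's camera; a blank image is
# used here only so the script can be smoke-tested without hardware.
image = Image.new("RGB", (224, 224))
instruction = "pick up the red block"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# predict_action returns a 7-DoF end-effector action (delta pose + gripper),
# de-normalized with the action statistics of the chosen pre-training dataset
# ("bridge_orig" refers to the BridgeData V2 subset of Open X-Embodiment [5]).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Placeholder: forward the action to the robot controller at each control step,
# e.g. robot.step(action) with the lab's own control interface.
print(action)
```

For the optional third deliverable, the OpenVLA release also documents fine-tuning support (including parameter-efficient LoRA fine-tuning) on demonstration data collected in the target environment; this would be explored only after the zero-shot evaluation.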
Practical Information
Prerequisites: Python, PyTorch, experience with robotic manipulation and machine learning
Start: Available immediately
References
- [1] "On the Opportunities and Risks of Foundation Models", https://arxiv.org/abs/2108.07258
- [2] "Language-conditioned Learning for Robotic Manipulation: A Survey", https://arxiv.org/abs/2312.10807
- [3] "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances", https://arxiv.org/abs/2204.01691
- [4] "RoboCLIP: One Demonstration is Enough to Learn Robot Policies", https://proceedings.neurips.cc/paper_files/paper/2023/hash/ae54ce310476218f26dd48c1626d5187-Abstract-Conference.html
- [5] "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", https://arxiv.org/abs/2310.08864
- [6] "OpenVLA: An Open-Source Vision-Language-Action Model", https://arxiv.org/abs/2406.09246
- [7] "Octo: An Open-Source Generalist Robot Policy", https://arxiv.org/abs/2405.12213