Master's Thesis on “Learning Latent Action Policies for Autonomous Driving”

Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)
Advisor: Dr. Shoaib Azam (shoaib.azam@aalto.fi)

Keywords: Diffusion Models, World Models, Representation Learning, Autonomous Driving

Project Description

Autonomous driving requires understanding how visual scenes evolve in response to vehicle actions. Learning these dynamics directly from raw pixels is challenging due to high dimensionality and weak temporal supervision. Existing methods often rely on frame-based perception or contrastive learning, which overlook action-conditioned temporal dependencies. Meanwhile, diffusion-based trajectory models capture uncertainty but lack structured, interpretable representations of driving behavior. This reveals a gap for a latent, discrete, and action-aware world model that efficiently represents driving dynamics for policy learning.

This research addresses the gap by combining Latent Action Pretraining (LAPA) with a conditional diffusion model for generative policy learning. LAPA first learns discrete latent actions from observation pairs, capturing high-level driving semantics. The diffusion model then predicts or samples future latent actions conditioned on the current context, modeling multi-modal futures in latent space. A frozen LAPA decoder or control head maps these latent actions to visual predictions or low-level controls. This framework enables data-efficient, interpretable, and generative policy learning for autonomous driving across multiple datasets.
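The two stages described above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the toy codebook, the feature-difference "encoder", the linear noise schedule, and all function names are assumptions made for illustration. It shows (1) snapping a continuous encoding of an observation pair to its nearest discrete latent action, as in vector quantization, and (2) the standard forward noising step of a diffusion model applied to that latent action. The learned, context-conditioned denoiser that would reverse the noising is omitted.

```python
import math
import random

# Hypothetical toy codebook: K = 3 discrete latent actions, each a 2-D embedding.
# In practice the codebook would be learned during latent action pre-training.
CODEBOOK = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
]

def encode_pair(feat_t, feat_next):
    """Stand-in encoder (assumption): the change between two feature vectors."""
    return [b - a for a, b in zip(feat_t, feat_next)]

def quantize(z, codebook=CODEBOOK):
    """Return (index, embedding) of the nearest codebook entry (squared L2)."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]

def alpha_bar(t, num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Product of (1 - beta_s) over the first t steps of a linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def forward_noise(z0, t, rng, num_steps=100):
    """Sample z_t from q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    ab = alpha_bar(t, num_steps)
    return [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for z in z0]

# Usage: features of two consecutive frames (made-up numbers).
feat_t, feat_next = [0.2, 0.1], [1.1, 0.2]
idx, z0 = quantize(encode_pair(feat_t, feat_next))  # discrete latent action
rng = random.Random(0)
z_noisy = forward_noise(z0, t=50, rng=rng)          # noised latent for training
```

In a PyTorch implementation the same steps would operate on batched tensors, with the encoder, the codebook, and the denoiser as learned modules.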

Deliverables

  • Review of relevant literature
  • Pre-training of a latent action model
  • Development of a diffusion-based policy world model and decoder network (control or reconstruction)
  • Comparison of the implementation with state-of-the-art methods

Practical Information

Prerequisites: Python (high), Deep Learning (high)

Tools: PyTorch

Datasets and simulators: nuScenes, NAVSIM, CARLA

Start: Available immediately

References