Master’s thesis on “Learning interactive environment dynamics for active search”

Supervisor: Prof. Ville Kyrki (ville.kyrki@aalto.fi)
Advisor: Matti Pekkanen (matti.pekkanen@aalto.fi)
Keywords: Deep learning, Dynamic prediction, Robotic manipulation
Project Description
As robots move from factories into our homes, they need new capabilities to perceive and manipulate their environment in order to succeed in everyday tasks. In an active search task [1], the robot searches for an occluded object by observing and manipulating the objects in the environment to discover the target. The problem can be modeled as a partially observable Markov decision process (POMDP), where the available actions are to observe and to manipulate the environment. The observation actions can be selected, e.g., using next-best-view planning [2] that maximizes volumetric information gain, and the observations as a whole constitute a classic mapping problem.
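The next-best-view idea above can be illustrated with a minimal sketch: each candidate view is scored by the Shannon entropy of the currently uncertain voxels it would observe, and the view with the highest expected information gain is chosen. Everything here is a toy assumption for illustration (the 1-D voxel array, the precomputed visibility sets, the function names); a real system would obtain visibility via ray casting in a 3-D occupancy map.

```python
import numpy as np

def voxel_entropy(p):
    # Per-voxel binary Shannon entropy (in nats) of occupancy probabilities.
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def information_gain(occupancy, visible):
    # Score a view by the total entropy of the voxels it would observe:
    # observing a voxel is assumed to resolve its occupancy completely.
    return voxel_entropy(occupancy)[visible].sum()

def next_best_view(occupancy, candidate_views):
    # Pick the candidate whose visible voxel set carries the most uncertainty.
    gains = [information_gain(occupancy, v) for v in candidate_views]
    return int(np.argmax(gains))

# Toy map: voxels 0-1 are confidently free, voxels 2-4 are fully uncertain.
occ = np.array([0.01, 0.01, 0.5, 0.5, 0.5])
views = [np.array([0, 1]), np.array([2, 3]), np.array([3, 4])]
best = next_best_view(occ, views)  # → 1 (first of the two tied uncertain views)
```

The "observing resolves the voxel" assumption is the simplest possible sensor model; the formulation in [2] weights the gain by visibility and expected occupancy along each ray.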
After a manipulation action, however, the state of the environment changes. In fully observable scenarios, physical simulators or other analytical or numerical methods can be used to predict the state of the environment after the action, i.e., the forward dynamics. This is computationally expensive, however, and under partial observability not always feasible. For this reason, a neural network can be trained to predict the changes in the state of the environment [3, 4]. Many types of models can be applied to this task, such as convolutional neural networks [5], representation learning [6], graph neural networks [7], diffusion models [8], and physics-informed models [9]. There are also alternatives to a single end-to-end solution, such as a combination of analytical and learned models, or a combination of several learned models.
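Regardless of the architecture chosen, a learned forward dynamics model reduces to the same interface: given the current state and a push action, predict the next state. The PyTorch sketch below shows that interface with a plain MLP predicting a state residual; the dimensions, network shape, and random stand-in data are placeholders, not the project's actual state encoding (real targets would come from the simulator).

```python
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    # Predicts the change in environment state produced by a push action.
    # state_dim/action_dim are illustrative placeholders.
    def __init__(self, state_dim=12, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Predict a residual and add it to the current state, so the
        # network only has to model the *change* caused by the push.
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta

model = ForwardDynamics()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

state = torch.randn(32, 12)        # batch of current states
action = torch.randn(32, 4)        # batch of push actions
target_next = torch.randn(32, 12)  # stand-in for simulator ground truth

loss = nn.functional.mse_loss(model(state, action), target_next)
loss.backward()
opt.step()
```

Predicting a residual rather than the absolute next state is a common choice because most of the scene is unchanged by a single push, which keeps the regression target near zero.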
There is also the inverse problem: given a desired end state, what action should the robot take? This can be addressed with an inverse dynamics model, which can be learned jointly with the forward dynamics model [10, 11]. An inverse model is useful for many tasks, so its acquisition could be taken into account when choosing the approach to the forward problem.
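One common way to learn the two models jointly, in the spirit of [10], is a shared state encoder with a forward head (latent state + action → next state) and an inverse head (latent state pair → action), trained with a combined loss. The sketch below is a minimal assumption-laden illustration (all dimensions and layer choices are hypothetical), not the method to be developed in the thesis.

```python
import torch
import torch.nn as nn

class JointDynamics(nn.Module):
    # Shared encoder feeding a forward head and an inverse head,
    # so both dynamics directions shape the learned representation.
    def __init__(self, state_dim=12, action_dim=4, hidden=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.forward_head = nn.Linear(hidden + action_dim, state_dim)
        self.inverse_head = nn.Linear(2 * hidden, action_dim)

    def forward(self, s, a, s_next):
        h, h_next = self.encode(s), self.encode(s_next)
        # Forward direction: predict the next state from (latent state, action).
        pred_next = self.forward_head(torch.cat([h, a], dim=-1))
        # Inverse direction: recover the action from the latent state pair.
        pred_action = self.inverse_head(torch.cat([h, h_next], dim=-1))
        return pred_next, pred_action

model = JointDynamics()
s, a, s_next = torch.randn(8, 12), torch.randn(8, 4), torch.randn(8, 12)
pred_next, pred_action = model(s, a, s_next)

# Joint objective: forward prediction error plus inverse prediction error.
loss = (nn.functional.mse_loss(pred_next, s_next)
        + nn.functional.mse_loss(pred_action, a))
```

At inference time, only the relevant head is used: the forward head for predicting push outcomes, the inverse head for choosing an action that reaches a desired state.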
In this master’s thesis, the main task is to design a method that predicts changes in the state of the environment after a push action. The environment is a constrained, cluttered space containing rigid objects. The network can be evaluated in a pre-existing simulator environment. Once the method works in simulation, the second task is to deploy it in the real world and conduct experiments with a Franka robot arm. If the project proceeds favourably, there is the possibility of integrating the developed method into an existing active search POMDP algorithm to evaluate its performance in the active search task.
The majority of the software components required to simulate the environment, collect training data, and evaluate network performance are already available. However, they may require updating to accommodate the new network, as well as some bug fixing. These tasks, along with the real-robot deployment, require the student to be fluent in Python and in standard software development practices. Designing the network requires an understanding of the associated physics and mathematics, as well as the principles of modern deep learning. The network is trained with PyTorch, so familiarity with training deep networks in PyTorch is required.
Deliverables
- State-of-the-art literature review
- A prediction method for the state changes of the environment under pushing actions
- Real-world experimental validation of the method with a Franka robot arm
Prerequisites
- Solid understanding of engineering physics and mathematics
- Experience in Python
- Experience in deep learning and training neural networks with PyTorch
References
[1] H. Huang et al., “Mechanical Search on Shelves using Lateral Access X-RAY,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, Sep. 2021
[2] S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza, “An information gain formulation for active volumetric 3D reconstruction,” in IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, May 2016
[3] F. Paus, T. Huang, and T. Asfour, “Predicting Pushing Action Effects on Spatial Object Relations by Learning Internal Prediction Models,” in IEEE International Conference on Robotics and Automation (ICRA), Paris, France: IEEE, May 2020
[4] A. E. Tekden, A. Erdem, E. Erdem, T. Asfour, and E. Ugur, “Object and relation centric representations for push effect prediction,” Robotics and Autonomous Systems, vol. 174, p. 104632, Apr. 2024
[5] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman, “Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, Sep. 2019
[6] Z. Xu, J. Wu, A. Zeng, J. B. Tenenbaum, and S. Song, “DensePhysNet: Learning Dense Physical Object Representations via Multi-step Dynamic Interactions,” in Robotics: Science and Systems (RSS), Freiburg im Breisgau, Germany, Jun. 2019
[7] Y. Li, J. Wu, J.-Y. Zhu, J. B. Tenenbaum, A. Torralba, and R. Tedrake, “Propagation Networks for Model-Based Control Under Partial Observation,” in IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, May 2019
[8] R. Akkerman, H. Feng, M. J. Black, D. Tzionas, and V. F. Abrevaya, “InterDyn: Controllable Interactive Dynamics with Video Diffusion Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, Jun. 2025
[9] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum, “A Compositional Object-Based Approach to Learning Physical Dynamics,” International Conference on Learning Representations (ICLR), Toulon, France, Apr. 2017
[10] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to Poke by Poking: Experiential Learning of Intuitive Physics,” in Conference and Workshop on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, Dec. 2016
[11] H. Chen et al., “Learning Coordinated Bimanual Manipulation Policies Using State Diffusion and Inverse Dynamics Models,” in IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, May 2025