Since the emergence of the Internet of Things, sensors have been collecting an abundance of data. Combining this information (sensor fusion) into a meaningful picture of the system under observation currently requires considerable expert knowledge and manual engineering, which scales poorly with the growing number of available sensors. These sensors also vary widely in modality (multi-modality), ranging from complex depth cameras to various LIDARs and RADARs. Humans, by contrast, learn an approximate internal model of how the world works and use it to make sense of new experiences and to solve new tasks. In this research we will try to mimic this human way of understanding the world in order to achieve sensor fusion.