Description: When performing machine learning on multimodal data (i.e., data with multiple information channels; images, for example, carry not only visual information but also metadata, possibly annotations, social media comments, etc.), the often-used industry standard is late fusion: perform machine learning on each modality separately and then fuse the results (rankings) obtained from each (see e.g., ). This is in contrast to early fusion, which puts (selected) features from the individual modalities into one dataset and then performs machine learning on that.
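To make the contrast concrete, here is a minimal sketch of the two strategies on synthetic data. All names and data here are illustrative assumptions (two hypothetical modalities, logistic regression as a stand-in classifier, and simple score averaging as the late-fusion rule; rank-based fusion is also common):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Synthetic two-modality dataset: e.g., visual features and text/metadata features.
y = rng.integers(0, 2, n)
X_visual = rng.normal(0, 1, (n, 8)) + y[:, None] * 0.8
X_text = rng.normal(0, 1, (n, 5)) + y[:, None] * 0.5

# Early fusion: concatenate features from all modalities, train one model.
X_early = np.hstack([X_visual, X_text])
early_model = LogisticRegression().fit(X_early, y)
early_scores = early_model.predict_proba(X_early)[:, 1]

# Late fusion: train one model per modality, then fuse the per-modality
# scores (here by simple averaging).
m_visual = LogisticRegression().fit(X_visual, y)
m_text = LogisticRegression().fit(X_text, y)
late_scores = (m_visual.predict_proba(X_visual)[:, 1]
               + m_text.predict_proba(X_text)[:, 1]) / 2
```

Either way, the output is one score (or ranking) per item; the difference is whether the modalities are combined before or after learning.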
Is this still true for modern data features? The groundbreaking work that established late fusion as the standard (e.g., ) is 15 years old. Since modern features can often be interpreted as semantic labels of the data, this standard may need revisiting, and that is the topic of the project/thesis.