Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra
Abstract: Cross-modal alignment learning integrates information from different modalities, such as text, image, audio, and video, to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis.
Current techniques rely on large modality-specific encoders that must be fine-tuned or trained from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This approach has several limitations: (i) it is highly costly, as training large encoders on vast datasets demands substantial compute; (ii) it is hard to scale, since large, well-aligned paired datasets are difficult to obtain; and (iii) it is time-consuming, because introducing a new modality requires retraining the entire framework.
To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Initially, we train the lightweight Universal Projection (UP) module to align the image and text modalities. We then freeze the pretrained UP and progressively align each new modality to those already aligned. Thanks to its lightweight design, OneEncoder operates efficiently and cost-effectively even when vast aligned datasets are unavailable. Trained on small paired datasets, it achieves strong performance on tasks such as classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.
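The two-stage procedure described above (train a shared projection on image-text pairs, then freeze it and attach new modalities) can be illustrated with a minimal PyTorch sketch. All names, dimensions, and the contrastive loss variant below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Hypothetical sketch of OneEncoder-style progressive alignment (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniversalProjection(nn.Module):
    """Lightweight shared projection mapping any modality's features into a common space."""
    def __init__(self, dim_in: int = 768, dim_out: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_out), nn.GELU(), nn.Linear(dim_out, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings (assumed loss variant)."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


up = UniversalProjection()

# Stage 1: align image and text. The random tensors stand in for outputs of frozen
# pretrained modality-specific encoders; only the UP module is trained.
optimizer = torch.optim.AdamW(up.parameters(), lr=1e-4)
image_feats = torch.randn(32, 768)   # placeholder for frozen image-encoder outputs
text_feats = torch.randn(32, 768)    # placeholder for frozen text-encoder outputs
loss = contrastive_loss(up(image_feats), up(text_feats))
loss.backward()
optimizer.step()

# Stage 2: freeze the pretrained UP, then align a new modality (e.g., audio)
# to the already-aligned space through a small modality adapter.
for p in up.parameters():
    p.requires_grad_(False)

audio_adapter = nn.Linear(512, 768)  # hypothetical adapter for 512-d audio features
optimizer = torch.optim.AdamW(audio_adapter.parameters(), lr=1e-4)
audio_feats = torch.randn(32, 512)   # placeholder for frozen audio-encoder outputs
loss = contrastive_loss(up(audio_adapter(audio_feats)), up(text_feats))
loss.backward()
optimizer.step()
```

Under these assumptions, each new modality only requires training a small adapter against the frozen UP, which is what keeps the cost of adding modalities low.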
Read the paper