Bilal FAYE

PhD in Artificial Intelligence | Researcher | Educator | AI Vision & NLP Specialist.

📍 Saint Denis, France | ✉️ biljolefa@gmail.com | LinkedIn

Publications

International Conferences

Authors: Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Mustapha Lebbah, Hanane Azzag
Abstract: Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
Read the paper
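The per-token energy described in the abstract can be sketched in a few lines; the 4-neighbour pairwise term, the sigmoid gating, and all names below are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def energy_mask(features, unary_w=1.0, pairwise_w=0.5):
    """Illustrative sketch of an energy-regularized spatial mask.

    features: (H, W) map of token importance scores (e.g. channel-mean
    activations). Each token receives a scalar energy combining a unary
    importance cost with a pairwise coherence penalty against its
    4-neighbours; low-energy tokens are kept via a soft, differentiable
    sigmoid gate rather than a hard sparsity budget.
    """
    unary = -features  # strongly activated tokens are cheap to keep

    # Pairwise term: penalize tokens that disagree with their neighbours.
    padded = np.pad(features, 1, mode="edge")
    neigh_mean = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    pairwise = (features - neigh_mean) ** 2

    energy = unary_w * unary + pairwise_w * pairwise
    return 1.0 / (1.0 + np.exp(energy))  # soft mask in (0, 1)
```

Because the gate is a plain sigmoid of the energy, the mask stays differentiable end to end and the sparsity level can emerge from the data instead of being fixed in advance.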

Authors: Ilyes OUKID, Bilal FAYE, Hanane AZZAG, Mustapha Lebbah, Said Yacine BOULAHIA
Abstract: Multilingual ASR systems often fail to generalize to low-resource and linguistically diverse languages while remaining costly to scale. We introduce PUMA, a unified multilingual ASR model that improves low-resource performance with reduced model complexity. PUMA employs a Universal Language Projection (ULP) module that integrates a learnable language token with acoustic representations, enabling language-aware processing through shared parameters. Experiments on diverse African languages show consistent word error rate reductions over strong multilingual baselines, highlighting improved robustness and generalization.
Read the paper

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
Abstract: Diffusion models achieve state-of-the-art image generation but remain computationally costly due to iterative denoising. Latent-space models like Stable Diffusion reduce overhead yet lose fine detail, while retrieval-augmented methods improve efficiency but rely on large memory banks, static similarity models, and rigid infrastructures. We introduce the Prototype Diffusion Model (PDM), which embeds prototype learning into the diffusion process to provide adaptive, memory-free conditioning. Instead of retrieving references, PDM learns compact visual prototypes from clean features via contrastive learning, then aligns noisy representations with semantically relevant patterns during denoising. Experiments demonstrate that PDM sustains high generation quality while lowering computational and storage costs, offering a scalable alternative to retrieval-based conditioning.
Read the paper

Authors: Ilyes OUKID, Bilal FAYE, Hanane AZZAG, Mustapha Lebbah, Said Yacine BOULAHIA
Abstract: Multilingual automatic speech recognition (ASR) seeks to transcribe speech from multiple languages within a unified framework, yet existing approaches often suffer from poor scalability and high computational cost. Current state-of-the-art models typically rely on language-specific adapters or complex architectures, which increase the number of parameters as new languages are added and remain ineffective for low-resource and linguistically complex languages, particularly African languages. We propose LaTRO (Language-Token Routing Optimizer), a compact multilingual ASR model that addresses these limitations by (i) maintaining a constant number of trainable parameters regardless of the number of languages, (ii) introducing a learnable language token that guides the model through a shared parameter routing mechanism without language-specific modules, and (iii) supporting both simultaneous multilingual training and progressive language integration. The proposed approach is designed to better handle low-resource and under-explored African languages characterized by rich phonological and morphological variability. Experiments conducted on nine African languages show that LaTRO achieves competitive or superior performance compared to strong state-of-the-art multilingual ASR models, while significantly reducing computational and memory requirements.
Read the paper

Authors: Liza CHETOUANI, Bilal FAYE, Hanane AZZAG, Zaineb Chelly Dagdia, Mustapha Lebbah
Abstract: Tabular data generation is challenging due to complex feature dependencies, limited data availability, and privacy constraints. Most existing approaches rely on Multi-Layer Perceptron (MLP)-based generative models and perform unconditional sampling from the training distribution, which prevents adaptation to a specific data context or subset at inference time. We propose ProfileFormer, a Transformer-based generative framework that formulates tabular data generation as an in-context learning problem. To address the mismatch between Transformers and unordered tabular features, our method introduces (i) learnable profile queries that structure generation as a query-based process, (ii) context-conditioned cross-attention over reference profiles defined as real data subsets from the same class to capture inter-instance relations, and (iii) a class- and context-aware noise mechanism to enhance diversity while preserving consistency. Unlike prior methods, ProfileFormer enables context-aware generation at inference, allowing samples to be generated conditionally on reference subsets or directly from training data. We validate our approach on multiple medical tabular datasets, a particularly relevant domain due to data scarcity and the need for cohort-specific data synthesis. Results show competitive or superior performance compared to state-of-the-art baselines, while offering greater flexibility and controllability.
Read the paper

Authors: Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Mustapha Lebbah, Hanane Azzag
Abstract: Dropout is a widely used stochastic regularization technique, yet it overlooks structural dependencies within feature maps. We introduce PB-EDropout, an energy-based approach that preserves low-energy spatial patches within each channel while suppressing the rest. During training, candidate masks are sampled from Gibbs distributions and refined using genetic operators, and a running exponential moving average yields deterministic masks for inference. Experiments on shallow CNNs demonstrate that PB-EDropout consistently improves test accuracy over standard dropout, remains effective even with frozen masks, and generates interpretable visualizations of discriminative features. Code is available at https://github.com/Tom-Dvk/PB-EDropout/tree/main
Read the paper

Authors: Bilal Faye, Mustapha Lebbah, Hanane Azzag
Abstract: Batch Normalization (BN), a widely-used technique in neural networks, enhances generalization and expedites training by normalizing each mini-batch to the same mean and variance. However, its effectiveness diminishes when confronted with diverse data distributions. To address this challenge, we propose Supervised Batch Normalization (SBN), a pioneering approach. We expand normalization beyond traditional single mean and variance parameters, enabling the identification of data modes prior to training. This ensures effective normalization for samples sharing common features. We define contexts as modes, categorizing data with similar characteristics. These contexts are explicitly defined, such as domains in domain adaptation or modalities in multimodal systems, or implicitly defined through clustering algorithms based on data similarity. We illustrate the superiority of our approach over BN and other commonly employed normalization techniques through various experiments on both single and multi-task datasets. Integrating SBN with Vision Transformer results in a remarkable 15.13% accuracy enhancement on CIFAR-100. Additionally, in domain adaptation scenarios, employing AdaMatch demonstrates an impressive 22.25% accuracy improvement on MNIST and SVHN compared to BN.
Read the paper
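The per-context normalization described above can be sketched as follows; the function name, the explicit gamma/beta arrays, and the toy shapes are illustrative assumptions (a real layer would learn gamma and beta by backpropagation and track running statistics for inference).

```python
import numpy as np

def supervised_batch_norm(x, contexts, gamma, beta, eps=1e-5):
    """Sketch of context-wise normalization.

    x: (N, D) batch of activations; contexts: (N,) integer context id per
    sample (e.g. domain or modality labels, or cluster assignments).
    gamma, beta: (K, D) per-context scale/shift (plain arrays here; in a
    real layer they would be trained parameters). Each sample is
    normalized with the mean/variance of its own context instead of one
    batch-wide mean/variance.
    """
    out = np.empty_like(x)
    for c in np.unique(contexts):
        idx = contexts == c
        mu = x[idx].mean(axis=0)
        var = x[idx].var(axis=0)
        out[idx] = gamma[c] * (x[idx] - mu) / np.sqrt(var + eps) + beta[c]
    return out
```

The key difference from plain BN is visible in the loop: statistics are computed per context group, so samples sharing common features are normalized together.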

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra
Abstract: Deep neural network learning for image processing faces major challenges related to changes in distribution across layers, which disrupt model convergence and performance. Activation normalization methods, such as Batch Normalization (BN), have revolutionized this field, but they rely on the simplified assumption that the data distribution can be modeled by a single Gaussian distribution. To overcome these limitations, Mixture Normalization (MN) introduced an approach based on a Gaussian Mixture Model (GMM), assuming multiple components to model the data. However, this method entails substantial computational requirements associated with the use of the Expectation-Maximization algorithm to estimate the parameters of each Gaussian component. To address this issue, we introduce Adaptative Context Normalization (ACN), a novel supervised approach built on the concept of "context", which groups together data with similar characteristics. Data belonging to the same context are normalized using the same parameters, enabling local, context-based representation. For each context, the normalization parameters are learned during backpropagation, like the model weights. ACN not only ensures speed, convergence, and superior performance compared to BN and MN, but also offers a fresh perspective that underscores its particular efficacy in image processing.
Read the paper

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra
Abstract: Low-cost cross-modal representation learning is crucial for deriving semantic representations across diverse modalities such as text, audio, images, and video. Traditional approaches typically depend on large specialized models trained from scratch, requiring extensive datasets and resulting in high resource and time costs. To overcome these challenges, we introduce a novel approach named Lightweight Cross-Modal Representation Learning (LightCRL). This method uses a single neural network titled Deep Fusion Encoder (DFE), which projects data from multiple modalities into a shared latent representation space. This reduces the overall parameter count while still delivering robust performance comparable to more complex systems.
Read the paper

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Fangchen Feng
Abstract: Deep neural networks have become a staple in solving intricate problems, proving their mettle in a wide array of applications. However, their training process is often hampered by shifting activation distributions during backpropagation, resulting in unstable gradients. Batch Normalization (BN) addresses this issue by normalizing activations, which allows for the use of higher learning rates. Despite its benefits, BN is not without drawbacks, including its dependence on mini-batch size and the presumption of a uniform distribution of samples. To overcome this, several alternatives have been proposed, such as Layer Normalization, Group Normalization, and Mixture Normalization. These methods may still struggle to adapt to the dynamic distributions of neuron activations during the learning process. To bridge this gap, we introduce Unsupervised Adaptive Normalization (UAN), an innovative algorithm that seamlessly integrates clustering for normalization with deep neural network learning in a singular process. UAN executes clustering using the Gaussian mixture model, determining parameters for each identified cluster, by normalizing neuron activations. These parameters are concurrently updated as weights in the deep neural network, aligning with the specific requirements of the target task during backpropagation. This unified approach of clustering and normalization, underpinned by neuron activation normalization, fosters an adaptive data representation that is specifically tailored to the target task. This adaptive feature of UAN enhances gradient stability, resulting in faster learning and augmented neural network performance. UAN outperforms the classical methods by adapting to the target task and is effective in classification, and domain adaptation.
Read the paper

Authors: Nicolas Ballier, Dahn Cho, Bilal Faye, et al.
Abstract: This paper discusses the WMT 2021 terminology shared task from a "meta" perspective. We present the results of our experiments using the terminology dataset and the OpenNMT (Klein et al., 2017) and JoeyNMT (Kreutzer et al., 2019) toolkits for the language direction English to French. Our experiment 1 compares the predictions of the two toolkits. Experiment 2 uses OpenNMT to fine-tune the model. We report our results for the task with the evaluation script but mostly discuss the linguistic properties of the terminology dataset provided for the task. We provide evidence of the importance of text genres across scores, having replicated the evaluation scripts.
Read the paper

Journals

Authors: Djamel Bouchaffra, Fayçal Ykhlef, Bilal Faye, Hanane Azzag, Mustapha Lebbah
Abstract: We introduce a novel deep graphical representation that integrates game theory (GT) principles with the laws of statistical physics (SP), enabling feature extraction and pattern classification within a unified learning framework. In our approach, neurons in a network are analogous to players in a GT model. Each neuron, viewed as a classical particle governed by the laws of SP, corresponds to a set of actions that represent specific activation values. The feed-forward process in deep learning (DL) is interpreted as a sequential game with each game involving multiple players. During training, neurons are evaluated iteratively and filtered based on their contributions to a payoff function, which is quantified using the Shapley value driven by a Gaussian–Boltzmann energy model. To mitigate the computational burden of exact Shapley value computations, we employ Monte Carlo (MC) sampling, reducing the algorithmic complexity from exponential to polynomial. This approximation significantly improves scalability, making our framework suitable for larger networks. Neurons that significantly contribute to the payoff form strong coalitions, and only these neurons are allowed to propagate information to the next layers. Using the Shapley value, we devised a new model regularization technique, thereby improving overall performance. We applied this framework to facial age estimation and gender classification tasks. Experimental results show that our approach outperforms several traditional and recent machine learning models in terms of accuracy, precision, recall, and F1-score.
Read the paper

Authors: Djamel Bouchaffra, Fayçal Ykhlef, Bilal Faye, Hanane Azzag, Mustapha Lebbah
Abstract: We present a novel deep graphical representation that seamlessly merges principles of game theory with laws of statistical mechanics. It performs feature extraction, dimensionality reduction, and pattern classification within a single learning framework. Our approach draws an analogy between neurons in a network and players in a game theory model. Furthermore, each neuron viewed as a classical particle (subject to statistical physics' laws) is mapped to a set of actions representing specific activation value, and neural network layers are conceptualized as games in a sequential cooperative game theory setting. The feed-forward process in deep learning is interpreted as a sequential game, where each game comprises a set of players. During training, neurons are iteratively evaluated and filtered based on their contributions to a payoff function, which is quantified using the Shapley value driven by an energy function. Each set of neurons that significantly contributes to the payoff function forms a strong coalition. These neurons are the only ones permitted to propagate the information forward to the next layers. We applied this methodology to the task of facial age estimation and gender classification. Experimental results demonstrate that our approach outperforms both multi-layer perceptron and convolutional neural network models in terms of efficiency and accuracy.
Read the paper

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra
Abstract: Cross-modal alignment learning integrates information from different modalities such as text, image, audio, and video to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders, necessitating fine-tuning or training from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This approach has several limitations: (i) it is highly costly, as it requires training large encoders on vast datasets; (ii) it is hard to achieve, since large, well-aligned paired datasets are difficult to obtain; and (iii) it is time-consuming, since introducing new modalities requires retraining the entire framework. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Initially, we train a lightweight Universal Projection (UP) module to align image and text modalities. Then, we freeze the pretrained UP and progressively align future modalities to those already aligned. Thanks to its lightweight design, OneEncoder operates efficiently and cost-effectively even when vast aligned datasets are unavailable. Trained on small paired datasets, it shows strong performance in tasks like classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.
Read the paper

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Fangchen Feng
Abstract: Deep neural networks face challenges with distribution shifts across layers, affecting model convergence and performance. While Batch Normalization (BN) addresses these issues, its reliance on a single Gaussian distribution assumption limits adaptability. To overcome this, alternatives like Layer Normalization, Group Normalization, and Mixture Normalization emerged, yet struggle with dynamic activation distributions. We propose "Context Normalization" (CN), introducing contexts constructed from domain knowledge. CN normalizes data within the same context, enabling local representation. During backpropagation, CN learns normalized parameters and model weights for each context, ensuring efficient convergence and superior performance compared to BN and MN. This approach emphasizes context utilization, offering a fresh perspective on activation normalization in neural networks.
Read the paper

National Conferences

Authors: Ilyes Oukid, Bilal Faye, Hanane Azzag, Mustapha Lebbah, Said Yacine Boulahia
Abstract: Automatic Speech Recognition (ASR) converts spoken language into text and remains a major challenge. Recent models, such as Massively Multilingual Speech (MMS), cover hundreds of languages but require the addition of language-specific adapters, which increases parameter cost and hinders scalability, especially for low-resource languages. We introduce MonoASR, a frugal and unified multilingual system that avoids such adapters through a Universal Language Projection (ULP). ULP associates a learned language token with acoustic representations, enabling the same model and parameters to handle different languages. Evaluated on French (a high-resource language), Arabic, and Kabyle (underrepresented and complex languages), MonoASR achieves lower word error rates (WER) than MMS, demonstrating its robustness, generalization ability, and suitability for low-cost multilingual transcription.
Read the paper

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Fangchen Feng
Abstract: Neural network training faces major challenges related to layer-wise distribution shifts, which disrupt model convergence and performance. Batch Normalization (BN) has revolutionized this field, but relies on the simplified assumption of a single Gaussian component per batch. To address this, Mixture Normalization (MN) adopted an approach based on the Gaussian Mixture Model (GMM), but at a significant computational cost tied to the Expectation-Maximization (EM) algorithm used to determine the components. Our solution, Context Normalization (CN), groups similar observations into "contexts" for local representation, without requiring a construction algorithm for these contexts. The normalization parameters are learned in the same way as the model weights, ensuring speed, convergence, and superior performance compared to BN and MN.
Read the paper

Workshops

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Mohamed-Djallel Dilmi, Djamel Bouchaffra
Abstract: Deep neural networks (DNNs) have gained prominence in many areas such as computer vision (CV), natural language processing (NLP), robotics, and bioinformatics. While their deep and complex structure enables powerful representation and hierarchical learning, it poses serious challenges (e.g., internal covariate shift, vanishing/exploding gradients, overfitting, and computational complexity) during the training phase. Neuron activity normalization is an effective strategy for meeting these challenges: it promotes stability, balanced learning, better performance generalization, and efficient gradient flow. Traditional normalization methods often overlook inherent dataset relationships. For example, batch normalization (BN) estimates mean and standard deviation from randomly constructed mini-batches (composed of unrelated samples), so performance depends solely on mini-batch size, without accounting for data correlation within these batches. Techniques such as Layer Normalization, Instance Normalization, and Group Normalization estimate normalization parameters per instance, addressing mini-batch size issues. Mixture Normalization (MN) uses a two-step process: (i) training a Gaussian mixture model (GMM) to determine component parameters, and (ii) normalizing activations accordingly. MN outperforms BN but incurs computational overhead due to GMM usage. To overcome these limitations, we propose a novel methodology named "Context Normalization" (CN). Our approach assumes that the data distribution can be represented as a mixture of Gaussian components. However, unlike MN, which assumes a priori that data are partitioned with respect to a set of Gaussian distributions, CN introduces the notion of concept, which accounts for data relationships via a neural network classification scheme; samples gathered within a cluster define a context. The Gaussian component parameters are estimated through supervised, neural network-based concept classification. CN is more precise when clusters are dense rather than sparse. Extensive comparative experiments on various datasets demonstrate the superiority of CN over BN and MN in terms of convergence speed and performance generalization: CN outperforms BN and MN with a convergence speed margin of 5% and a performance margin of 10%. These results reveal the importance of capturing inherent data context when learning Gaussian component parameters. By harnessing data relationships, our approach enhances deep learning models across various applications.
Read the paper

Preprints

Authors: Bilal Faye, Abdoulaye Mbaye, Hanane Azzag, Mustapha Lebbah
Abstract: Transformers have become the dominant architecture across a wide range of domains, largely due to the effectiveness of multi-head attention in capturing diverse representation subspaces. However, standard multi-head attention activates all heads uniformly for every input, regardless of task requirements or input complexity. In many scenarios, particularly for coarse-grained tasks such as text classification, the relevant information is often global and does not require the full diversity of attention heads. As a consequence, using a fixed number of heads can introduce unnecessary computational cost or lead to suboptimal performance when the allocation does not match the input. To address this limitation, we introduce BudgetFormer, a Transformer architecture equipped with an adaptive multi-head attention mechanism that dynamically allocates computational resources. Our approach learns, for each input, both a head budget corresponding to the number of attention heads required, and a relevance distribution that selects the most informative heads. We also propose a training strategy based on an exploration and exploitation trade-off, allowing the model to discover effective head configurations before converging to efficient usage patterns. Experiments on text classification tasks of varying complexity show that our method reduces inference cost in terms of FLOPs and memory, while also achieving performance that can surpass standard full multi-head attention. These results highlight the potential of adaptive head allocation as a principled approach to improving both efficiency and effectiveness in Transformer models.
Read the paper
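A minimal sketch of the head-budgeting idea; the top-k selection, the softmax reweighting, and every name below are assumptions made for illustration (in the actual model both the budget and the relevance scores would be predicted per input and trained end to end).

```python
import numpy as np

def select_heads(head_outputs, relevance, budget):
    """Combine only the `budget` most relevant attention heads.

    head_outputs: (H, D) per-head outputs for one input; relevance: (H,)
    per-input head relevance scores; budget: number of heads allotted to
    this input. Unselected heads get exactly zero weight, so their
    computation could be skipped entirely in a real implementation.
    """
    keep = np.argsort(relevance)[::-1][:budget]   # indices of the top-budget heads
    mask = np.zeros(len(relevance), dtype=bool)
    mask[keep] = True
    scores = np.where(mask, relevance, -np.inf)   # drop unselected heads
    w = np.exp(scores - relevance[keep].max())    # stable softmax over kept heads
    w = w / w.sum()
    return w, w @ head_outputs                    # weights and combined (D,) output
```

Since the weights of unselected heads are exactly zero, the FLOP savings come from never computing those heads at all, not merely from masking their outputs.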

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
Abstract: Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it mirrors real-world human feedback, such as thumbs-up/down signals, and avoids the need for structured preference annotations. In contrast, pairwise preference-based methods like Direct Preference Optimization (DPO) rely on datasets with both preferred and dispreferred responses, which are harder to construct and less natural to collect. Among single-trajectory approaches, Direct Reward Optimization (DRO) has shown strong empirical performance due to its simplicity and stability. However, DRO requires approximating a value function, which introduces several limitations: high off-policy variance, coupling between policy and value learning, and a lack of absolute supervision on the policy itself. We introduce Reward Partitioning Optimization (RPO), a new method that resolves these limitations by removing the need to model the value function. Instead, RPO normalizes observed rewards using a partitioning approach estimated directly from data. This leads to a straightforward supervised learning objective on the policy, with no auxiliary models and no joint optimization. RPO provides direct and stable supervision on the policy, making it robust and easy to implement in practice. We validate RPO on scalar-feedback language modeling tasks using Flan-T5 encoder-decoder models. Our results demonstrate that RPO outperforms existing single-trajectory baselines such as DRO and Kahneman-Tversky Optimization (KTO). These findings confirm that RPO is a simple, effective, and theoretically grounded method for single-trajectory policy optimization.
Read the paper
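The reward-normalization step can be illustrated as follows; grouping examples by shared prompt and the standardization formula are assumptions made for this sketch, not the paper's exact partitioning scheme.

```python
import numpy as np

def rpo_weights(rewards, groups):
    """Standardize scalar rewards within data partitions.

    rewards: (N,) observed scalar rewards (e.g. thumbs-up/down scores);
    groups: (N,) partition id per example (here, examples sharing a
    prompt). Each reward is normalized against its own partition's
    statistics, yielding per-example weights for a plain supervised
    objective on the policy; no value function is modeled.
    """
    out = np.empty_like(rewards, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        mu, sd = rewards[idx].mean(), rewards[idx].std()
        out[idx] = (rewards[idx] - mu) / (sd + 1e-8)
    return out
```

Each normalized weight would then scale the log-likelihood of its response during ordinary supervised training, which is what removes the need for an auxiliary value model.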

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
Abstract: Object detection is a fundamental challenge in computer vision, centered on recognizing objects within images, with diverse applications in areas like image analysis, robotics, and autonomous vehicles. Although existing methods have achieved great success, they are often constrained by a fixed vocabulary of objects. To overcome this limitation, approaches like MDETR have redefined object detection by incorporating region-level vision-language pre-training, enabling open-vocabulary object detectors. However, these methods are computationally heavy due to the simultaneous training of large models for both vision and language representations. To address this, we introduce a lightweight framework that significantly reduces the number of parameters while preserving, or even improving, performance. Our solution is applied to MDETR, resulting in the development of Lightweight MDETR (LightMDETR), an optimized version of MDETR designed to enhance computational efficiency without sacrificing accuracy. The core of our approach involves freezing the MDETR backbone and training only the Universal Projection module (UP), which bridges vision and language representations. A learnable modality token parameter allows the UP to seamlessly switch between modalities. Evaluations on tasks like phrase grounding, referring expression comprehension, and segmentation show that LightMDETR not only reduces computational costs but also outperforms several state-of-the-art methods in terms of accuracy.
Read the paper