Transformer Module Networks for Systematic Generalization in Visual Question Answering

Research output: Contribution to journal › Article › peer-review

Abstract

Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, remain unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only the module composition but also the specialization of each module to its sub-task is key to this performance gain.
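
To make the architectural idea in the abstract concrete, below is a minimal sketch of a TMN-style forward pass in PyTorch: sub-task-specialized Transformer modules chained according to a per-question program. All class names, hyperparameters, and sub-task labels ("filter", "relate", "query") are illustrative assumptions, not the paper's implementation; how the program is decoded from the question and how the final answer is produced are omitted.

```python
import torch
import torch.nn as nn

class TransformerModule(nn.Module):
    """One sub-task-specialized module: a small stack of Transformer
    encoder layers applied to the running state (hypothetical sketch)."""
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, state):
        return self.encoder(state)

class TransformerModuleNetwork(nn.Module):
    """Composes sub-task modules according to a per-question program.
    The sub-task inventory here is an assumption for illustration."""
    def __init__(self, subtasks=("filter", "relate", "query"), d_model=256):
        super().__init__()
        self.modules_by_subtask = nn.ModuleDict(
            {name: TransformerModule(d_model) for name in subtasks}
        )

    def forward(self, visual_tokens, program):
        # program: list of sub-task names decoded from the question
        # (the decoding step itself is out of scope for this sketch)
        state = visual_tokens  # (batch, num_tokens, d_model)
        for subtask in program:
            state = self.modules_by_subtask[subtask](state)
        return state  # pooled/decoded into an answer downstream

# Usage: 8 questions, 49 visual tokens each, d_model = 256
tmn = TransformerModuleNetwork()
feats = torch.randn(8, 49, 256)
out = tmn(feats, program=["filter", "relate", "query"])
print(out.shape)  # torch.Size([8, 49, 256])
```

The key design point the sketch captures is that each sub-task gets its own parameters (module specialization) while the composition order varies per question (module composition); the abstract attributes the performance gain to both properties together.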

Original language: English
Pages (from-to): 10096-10105
Number of pages: 10
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 46
Issue number: 12
DOIs
State: Published - 2024

Bibliographical note

Publisher Copyright:
© 1979-2012 IEEE.

ASJC Scopus Subject Areas

  • Software
  • Computer Vision and Pattern Recognition
  • Computational Theory and Mathematics
  • Applied Mathematics
  • Artificial Intelligence

Keywords

  • Neural module network
  • systematic generalization
  • transformer
  • visual question answering
