Hierarchical Reasoning Based on Perception Action Cycle for Visual Question Answering

37 Pages, Posted: 13 Oct 2022


Safaa Abdullahi Moallim Mohamud
Kyungpook National University

Amin Jalali
Kyungpook National University

Minho Lee
Kyungpook National University

Abstract

Recent visual question answering (VQA) frameworks employ different combinations of attention techniques to derive the correct answer. In vision-language tasks, attention techniques have succeeded mostly by refining the local features of both modalities. Although attention as a concept is firmly grounded in human cognitive mechanisms, arbitrary combinations of attention techniques are not well supported as models of human cognition. Neural networks were originally inspired by the structure of the human brain, and many researchers have recently turned to brain-inspired frameworks, achieving high performance with them. We therefore seek a framework that draws on human biological and psychological concepts to attain a sound understanding of the vision and language modalities. To this end, we introduce the hierarchical reasoning based on perception action cycle (HIPA) framework to tackle VQA tasks. It integrates the multi-modal reasoning process with the perception action cycle (PAC), which describes how humans learn about the surrounding world. The framework comprehends the visual modality through three phases of reasoning: object-level attention, organization, and interpretation. It comprehends the language modality through word-level attention, interpretation, and conditioning. Vision and language are then interpreted interdependently, in a cyclic and hierarchical manner, throughout the entire framework. To assess the resulting visual and language features, we argue that image-question pairs with the same answer should eventually have similar visual and language features. We therefore evaluate the features with metrics such as the standard deviation of cosine similarity and of Manhattan distance, and we show that employing PAC in our framework improves these standard deviations compared with other VQA frameworks. We also test the proposed HIPA on visual relationship detection (VRD) tasks. The proposed method achieves state-of-the-art results on the TDIUC and VRD datasets and competitive results on the VQA 2.0 dataset.
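
To make the feature-consistency evaluation concrete, the Python sketch below (not the authors' code; the "features" and "answers" inputs are hypothetical placeholders) computes the standard deviation of pairwise cosine similarity and Manhattan distance within groups of image-question pairs that share the same ground-truth answer. Lower values suggest that pairs mapping to the same answer occupy a tighter region of feature space, which is the property the abstract reports PAC improves.

    # Illustrative sketch (not from the paper): feature-consistency evaluation.
    # Groups fused features by ground-truth answer, then reports the standard
    # deviation of pairwise cosine similarity and Manhattan (L1) distance
    # within each group, averaged over the groups.
    import numpy as np
    from itertools import combinations
    from collections import defaultdict

    def consistency_stats(features, answers):
        groups = defaultdict(list)
        for feat, ans in zip(features, answers):
            groups[ans].append(feat)

        cos_stds, l1_stds = [], []
        for feats in groups.values():
            if len(feats) < 3:  # need several pairs for a meaningful std
                continue
            cos_vals, l1_vals = [], []
            for a, b in combinations(feats, 2):
                cos_vals.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
                l1_vals.append(np.abs(a - b).sum())
            cos_stds.append(np.std(cos_vals))
            l1_stds.append(np.std(l1_vals))
        return float(np.mean(cos_stds)), float(np.mean(l1_stds))

    # Hypothetical usage with random features; a trained VQA model would
    # supply its fused vision-language features here instead.
    feats = np.random.randn(100, 512)
    labels = np.random.choice(["yes", "no", "red", "two"], size=100)
    print(consistency_stats(feats, labels))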

Keywords: Visual question answering, vision-language tasks, multi-modality fusion, attention, bilinear fusion


Suggested Citation

Abdullahi Moallim Mohamud, Safaa and Jalali, Amin and Lee, Minho, Hierarchical Reasoning Based on Perception Action Cycle for Visual Question Answering. Available at SSRN: https://ssrn.com/abstract=4247187 or http://dx.doi.org/10.2139/ssrn.4247187

Safaa Abdullahi Moallim Mohamud
Kyungpook National University
Korea, Republic of (South Korea)

Amin Jalali
Kyungpook National University
Korea, Republic of (South Korea)

Minho Lee (Contact Author)
Kyungpook National University
Korea, Republic of (South Korea)
