DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization
Chen, Rhuihan, Junpeng, Tan, Yang, Zhijing, Yang, Xiaojung, Dai, Quingyun, Cheng, Yongqiang and Lin, Liang (2024) DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization. IEEE Transactions on Multimedia. ISSN 1520-9210
Item Type: Article |
Abstract
Natural Language Video Localization (NLVL) has recently attracted much attention because of its practical significance. However, existing methods still face the following challenges: 1) When the models learn intra-modal semantic associations, temporal causal interaction information and contextual semantic discriminative information are ignored, resulting in a lack of intra-modal semantic context connection; 2) When learning fusion representations, existing cross-modal interaction modules lack a hierarchical attention function to extract inter-modal similarity information and intra-modal self-correlation information, resulting in insufficient cross-modal information interaction; 3) When the loss function is optimized, existing models ignore the correlation of causal inference between the start and end boundaries, resulting in inaccurate start and end boundary calibrations. To address these challenges, we proposed a novel NLVL model, called Discriminative Parallel and Hierarchical Attention Network (DPHANet). Specifically, we emphasized the importance of temporal causal interaction information and contextual semantic discriminative information and correspondingly proposed a Discriminative Parallel Attention Encoder (DPAE) module to infer and encode this critical information. In addition, to overcome the shortcomings of existing cross-modal interaction modules, we designed a Video-Query Hierarchical Attention (VQHA) module, which performs cross-modal interaction and intra-modal self-correlation modeling in a hierarchical manner. Furthermore, a novel deviation loss function was proposed to capture the correlation of causal inference between the start and end boundaries and to force the model to focus on continuity and temporal causality in the video. Finally, extensive experiments on three benchmark datasets demonstrated the superiority of the proposed DPHANet model, which achieved about 1.5% and 3.5% average performance improvement and about 2.5% and 7.5% maximum performance improvement on the Charades-STA and TACoS datasets, respectively.
PDF (Author Accepted Manuscript on publisher template / following requested formatting): FINAL VERSION - TMM.pdf (4MB) |
More Information
Additional Information: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |
Uncontrolled Keywords: Location awareness , Semantics , TV , Natural languages , Correlation , Glass , Electronic mail |
Depositing User: Yongqiang Cheng |
Identifiers
Item ID: 17612 |
Identification Number: https://doi.org/10.1109/TMM.2024.3395888 |
ISSN: 1520-9210 |
URI: http://sure.sunderland.ac.uk/id/eprint/17612 |
Official URL: https://ieeexplore.ieee.org/document/10517423 |
Catalogue record
Date Deposited: 21 Jun 2024 14:15 |
Last Modified: 21 Jun 2024 14:30 |
Author: Yongqiang Cheng |
Author: Rhuihan Chen |
Author: Tan Junpeng |
Author: Zhijing Yang |
Author: Xiaojung Yang |
Author: Quingyun Dai |
Author: Liang Lin |
University Divisions
Faculty of Technology > School of Computer Science
Subjects
Computing > Artificial Intelligence
Computing > Information Systems