Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Tang Pengjie; Wang Hanli; Li Qinyu

Abstract

Translating a video into natural-language sentences that describe its content is both interesting and challenging. In this work, an advanced framework is built to generate coherent sentences with rich semantic expressions for video captioning. First, a long short-term memory (LSTM) network with an improved factored way is developed, inspired by the LSTM with a conventional factored way and by the common practice of feeding multi-modal features into the LSTM at the first time step for visual description. Then, the LSTM network with the proposed improved factored way is combined with the un-factored way, and a voting strategy is used to predict candidate words. In addition, for robust and abstract visual and language representation, residuals, as learned from the residual network (ResNet), are employed to enhance the gradient signals, and a deeper LSTM network is constructed. Furthermore, features extracted from three convolutional neural networks, GoogLeNet, ResNet101, and ResNet152, are fused to capture more comprehensive and complementary visual information. Experiments on two benchmark datasets, MSVD and MSR-VTT2016, show that the proposed techniques achieve competitive performance compared with other state-of-the-art methods.
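
The fusion and voting steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the three CNN features are fused or how votes are counted, so concatenation and probability averaging are assumptions made here, and the function names `fuse_features` and `vote_next_word` and all numbers are hypothetical.

```python
from typing import Dict, List, Sequence


def fuse_features(*features: Sequence[float]) -> List[float]:
    """Fuse per-video feature vectors by concatenation (an assumed fusion rule)."""
    fused: List[float] = []
    for vec in features:
        fused.extend(vec)
    return fused


def vote_next_word(distributions: Sequence[Dict[str, float]]) -> str:
    """Pick the word with the highest mean probability across decoders.

    Averaging is an assumed voting rule; the abstract only states that a
    voting strategy is used to predict candidate words.
    """
    if not distributions:
        raise ValueError("need at least one distribution")
    vocab = set().union(*(d.keys() for d in distributions))
    scores = {
        word: sum(d.get(word, 0.0) for d in distributions) / len(distributions)
        for word in vocab
    }
    # Iterate over words in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=scores.get)


# Toy stand-ins for GoogLeNet, ResNet101, and ResNet152 features.
googlenet = [0.1, 0.2]
resnet101 = [0.3]
resnet152 = [0.4, 0.5]
fused = fuse_features(googlenet, resnet101, resnet152)

# Toy next-word distributions from a factored and an un-factored LSTM.
factored = {"man": 0.6, "dog": 0.4}
unfactored = {"man": 0.3, "dog": 0.7}
word = vote_next_word([factored, unfactored])
```

In this toy case the two decoders disagree on the top word, and averaging favors the word with the larger combined probability. A real system would apply such a vote at every decoding time step over the full vocabulary.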

Bibliographic record

Source
ACM Transactions on Multimedia Computing, Communications, and Applications | 2019, Issue 2 | pp. 31.1-31.23 | 23 pages
Authors
Tang Pengjie; Wang Hanli; Li Qinyu
Author affiliations

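Tongji University, Department of Computer Science and Technology, Shanghai 201804, People's Republic of China | Jinggangshan University, College of Mathematics and Physics, Ji'an 343009, Jiangxi, People's Republic of China

Tongji University, Department of Computer Science and Technology, Shanghai 201804, People's Republic of China

Tongji University, Department of Computer Science and Technology, Shanghai 201804, People's Republic of China | Lanzhou City University, Department of Computer Science, Lanzhou 730070, Gansu, People's Republic of China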

Original format: PDF
Language: English
Keywords
Video captioning; long short term memory; convolutional neural network; sequential voting; complementary features;
