Terms and Application Form of MMSS Dataset v1.0

1 Introduction

A Multi-modal Sentence Summarization (MMSS) system aims to automatically generate a textual summary from a sentence paired with an image related to a news event. Intuitively, readers can grasp the gist of an event more easily by scanning an image than by reading long sentences, so we believe that the image will also make it easier for a machine to understand a news event.

2 Dataset construction

Each sample in our corpus is a triple (sentence, image, summary) in which the sentence-summary pair comes from the annotated Gigaword corpus and the image is crawled from Yahoo! Image Search. The Gigaword corpus provides 3.8 million first-sentence-summary pairs. For each first sentence, we query Yahoo! Image Search and crawl the five top-ranked images. Next, we remove clearly trivial images such as portraits, thumbnails and advertisements. We then employ 10 graduate students to select the best-matching image for each sentence. As a result, we collect 66,000 samples. We randomly split the corpus into a training set of 62,000 samples, a test set of 2,000 samples and a development set of 2,000 samples. More details can be found in our IJCAI 2018 paper.
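
For readers who want to reproduce a split of the same sizes on their own data, the following Python snippet is a minimal sketch of a random 62,000 / 2,000 / 2,000 partition. It is not the script used to build the released dataset: the function name split_corpus, the fixed seed and the placeholder triples are assumptions made purely for illustration.

```python
import random


def split_corpus(samples, n_test=2000, n_dev=2000, seed=0):
    """Randomly partition samples into (train, test, dev) lists.

    Hypothetical helper: the name, the seed and the default sizes are
    illustrative assumptions, not part of the dataset release.
    """
    if len(samples) < n_test + n_dev:
        raise ValueError("not enough samples for the requested split")
    indices = list(range(len(samples)))
    # Shuffle a copy of the index list with a local, seeded generator
    # so the split is reproducible and the global RNG is untouched.
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    dev_idx = indices[n_test:n_test + n_dev]
    train_idx = indices[n_test + n_dev:]
    train = [samples[i] for i in train_idx]
    test = [samples[i] for i in test_idx]
    dev = [samples[i] for i in dev_idx]
    return train, test, dev


# Placeholder (sentence, image, summary) triples standing in for real data.
corpus = [(f"sentence {i}", f"image_{i}.jpg", f"summary {i}")
          for i in range(66000)]
train, test, dev = split_corpus(corpus)
print(len(train), len(test), len(dev))  # 62000 2000 2000
```

The released dataset already ships with its own fixed split, so a sketch like this is only relevant when experimenting with other data.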

3 Copyright

The copyright of this dataset belongs to the authors, and the dataset may be used for research purposes only. Display, reproduction, transmission, distribution or publication of this dataset is prohibited. If you are interested in our dataset, please fill out the application form below and email it to haoran.li@nlpr.ia.ac.cn. We will then send the applicant a download link for the dataset. If you have any questions, don't hesitate to contact us.

4 Application Form

The copyright of this dataset belongs to the authors of the paper (Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang and Chengqing Zong. Multi-modal Sentence Summarization with Modality Attention and Image Filtering. In Proc. of IJCAI-2018, pages 4152-4158). This dataset may be used for research purposes only. Display, reproduction, transmission, distribution or publication of this dataset is prohibited.

□ I have read the above terms, and accept them.







Author: Junnan Zhu

Created: 2018-07-22 Sun 18:39
