Terms and Application Form of MMS Dataset v1.0

1 Introduction

As multimedia data (including text, images, audio, and video) have increased dramatically in recent years, it has become difficult for users to obtain important information efficiently. A Multi-modal Summarization (MMS) system aims to automatically generate a textual summary from a set of documents, images, audio, and videos related to a specific topic.

1.1 Dataset description

Unlike movies, which contain synchronized voice, visuals, and captions, our dataset provides an asynchronous collection (i.e., images come without descriptions and videos without subtitles) of multi-modal information about a specific news topic, including multiple documents, images, and videos, from which a fixed-length textual summary is to be generated.

1.2 Dataset construction

We select 50 news topics from the most recent five years, 25 in English and 25 in Chinese. For each topic, we collect 20 documents and 5 to 10 videos from the same period. We provide manual reference summaries following the instructions of the Document Understanding Conference (DUC) and the Text Analysis Conference (TAC). Ten graduate students are employed to write reference summaries after reading the documents and watching the videos on the same topic. There are 3 reference summaries for each topic. The criteria for the summaries are: (1) retaining the important content of the input documents and videos; (2) avoiding redundant information; (3) maintaining good readability; (4) following the length limit. We set the length limit to 300 words for each English summary and 500 characters for each Chinese summary. Some examples of English news topics are: (a) Nepal earthquake; (b) Terror attack in Paris; (c) Train derailment in India; (d) Germanwings crash; (e) Refugee crisis in Europe.
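The length limits stated above (300 words for English, 500 characters for Chinese) can be checked with a few lines of Python. This is only a minimal sketch and not part of the dataset release: the function name within_length_limit and the language codes "en" and "zh" are hypothetical, and counting whitespace-separated tokens as words and non-whitespace characters as Chinese characters is an assumption about how the limits are measured.

```python
# Hypothetical helper; the limits come from the text above, the counting rules are assumptions.
ENGLISH_WORD_LIMIT = 300
CHINESE_CHAR_LIMIT = 500


def within_length_limit(summary: str, language: str) -> bool:
    """Return True if the summary respects the length limit for its language."""
    if language == "en":
        # Assumption: a word is any whitespace-separated token.
        return len(summary.split()) <= ENGLISH_WORD_LIMIT
    if language == "zh":
        # Assumption: every non-whitespace character counts, punctuation included.
        return sum(1 for ch in summary if not ch.isspace()) <= CHINESE_CHAR_LIMIT
    raise ValueError(f"unsupported language code: {language!r}")
```

A system summary that exceeds the limit would typically be truncated or regenerated before evaluation; which of the two is appropriate depends on the evaluation protocol being followed.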

More details can be found in our EMNLP 2017 paper.

2 Copyright

The copyright of this dataset belongs to the authors of our paper, and the dataset may be used for research purposes only. Display, reproduction, transmission, distribution, or publication of this dataset is prohibited. If you are interested in our dataset, please fill out the application form below and send it by email to {haoran.li, junnan.zhu}@nlpr.ia.ac.cn. We will send the download link for the dataset to the applicant. If you have any questions, please do not hesitate to contact us.

3 Application Form






I have read the above terms and accept them.


Author: Junnan Zhu

Created: 2017-11-06 Mon 13:54