Multi-modal Event Topic Model for Social Event Analysis

Shengsheng Qian, Tianzhu Zhang, Changsheng Xu and Jie Shao

Summary


With the massive growth of social events on the Internet, it has become increasingly difficult to find and organize interesting events from massive social media data, even though doing so would help users and governments browse, search and monitor social events. To deal with this problem, we propose a novel multi-modal social event tracking and evolution framework that not only effectively captures the multi-modal topics of social events, but also obtains their evolutionary trends and generates effective event summary details over time. To achieve this goal, we propose a novel multi-modal event topic model (mmETM), which can effectively model social media documents consisting of long text and related images, and learn the correlations between the textual and visual modalities to separate visual-representative topics from non-visual-representative topics. To apply the mmETM model to social event tracking, we adopt an incremental learning strategy, denoted incremental mmETM, which obtains informative textual and visual topics of social events over time and thus helps to understand these events and their evolutionary trends. To evaluate the effectiveness of the proposed algorithm, we collect a real-world dataset and conduct various experiments. Both qualitative and quantitative evaluations demonstrate that the proposed mmETM algorithm performs favorably against several state-of-the-art methods.

Framework


Figure 2: The multi-modal event tracking and evolution framework. The input is multi-modal data collected from Google News, including images and texts. Based on the input data, our algorithm learns multi-modal topics and tracks multiple events. After tracking, each event can be visualized with texts and images over time, and its semantic topics can be mined.

We propose a novel multi-modal social event tracking and evolution framework to obtain the evolutionary trends of social events and generate effective event summary details over time, as shown in Fig. 2. It consists of several modules: (1) The input is multimedia documents with time-ordered event data downloaded from Google News, including images and texts. Each social media document contains long text and its corresponding images. After pre-processing, they are fed into the mmETM model. (2) The multi-modal event topic mining module effectively models multi-modal social event documents by learning the correlations between the textual and visual modalities to separate visual-representative topics from non-visual-representative topics. (3) In multi-modal event topic visualization, we show the learned visual-representative and non-visual-representative topics, which helps understand the social events. (4) For event tracking and evolution, we adopt an incremental learning strategy to update the proposed mmETM model over time, denoted incremental mmETM. In our tracking algorithm, we first initialize an mmETM model for each social event. Then, each incoming event document at the next time step is assigned to an event through similarity computation. Finally, the assigned documents are added to their corresponding social events, and the mmETM is updated incrementally. In this way, we can track multi-modal social event documents over time and show the whole evolutionary process of events together with their topics.
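The tracking step in module (4) can be pictured with a short, self-contained sketch. The code below is not the authors' implementation: it assumes documents have already been reduced to topic-proportion vectors, uses cosine similarity with a made-up threshold, and updates a running-mean event profile instead of the mmETM model itself.

```python
import numpy as np

SIM_THRESHOLD = 0.6  # hypothetical cut-off for assigning a document to an event


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def track(event_profiles, doc_batches):
    """event_profiles: one topic vector per initialized event;
    doc_batches: per-epoch lists of document topic vectors (time-ordered)."""
    profiles = [p.astype(float) for p in event_profiles]
    counts = [1] * len(profiles)            # documents seen per event
    history = [[] for _ in profiles]        # documents assigned to each event

    for batch in doc_batches:               # process epochs in time order
        for doc in batch:
            scores = [cosine(doc, p) for p in profiles]
            best = int(np.argmax(scores))
            if scores[best] >= SIM_THRESHOLD:
                history[best].append(doc)
                counts[best] += 1
                # incremental update of the event profile (a running mean here;
                # the paper instead updates the mmETM model itself)
                profiles[best] += (doc - profiles[best]) / counts[best]
    return profiles, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    events = [np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8])]
    batches = [[rng.dirichlet([8, 1, 1]) for _ in range(5)],
               [rng.dirichlet([1, 1, 8]) for _ in range(5)]]
    profiles, history = track(events, batches)
    print("documents per event:", [len(h) for h in history])
```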

Multi-modal Event Topic Model


Figure 3: Graphical model of the Multi-modal Event Topic Model.

In the mmETM model, a document can be a tagged photo or a long news article with images. Figure 3 illustrates the graphical representation of mmETM. From the figure, we can see that the proposed model extends the traditional mm-LDA model by additionally considering non-visual-representative topics, which allows it to effectively model multi-modal social event documents. Each document is associated with two topic distributions: one over topics shared between the textual and visual modalities, and one over topics unique to the textual modality. Each kind of topic is a probability distribution over textual or visual words. In the model, a binary switch variable controls whether a textual word is generated from the visual-representative topic space or the non-visual-representative topic space, depending on the value the switch takes. We assume that all visual aspect words are generated from the visual-representative topic space, and we omit the switch variable in the plate of visual aspect words for simplicity. Given the multimedia documents in an epoch, our aim is to infer the two document-topic distributions, a set of visual-representative topics, and a set of non-visual-representative topics. The shared document-topic distribution expresses that the textual and visual information in a social event document follow the same document-specific distribution over topics, while the text-only distribution covers only the textual information in the document.
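The generative story above can be made concrete with a toy sketch. The code below is illustrative only: the topic counts, vocabulary sizes, and the Bernoulli switch probability are made up, and it merely samples documents; it does not perform the paper's inference.

```python
import numpy as np

rng = np.random.default_rng(0)
K_VR, K_NVR = 3, 2      # visual-representative / non-visual-representative topic counts (toy values)
V_TXT, V_IMG = 50, 30   # textual and visual vocabulary sizes (toy values)

# Topic-word distributions: VR topics emit both textual and visual words,
# NVR topics emit textual words only.
phi_txt_vr = rng.dirichlet(np.ones(V_TXT), size=K_VR)
phi_img_vr = rng.dirichlet(np.ones(V_IMG), size=K_VR)
phi_txt_nvr = rng.dirichlet(np.ones(V_TXT), size=K_NVR)


def generate_document(n_text=20, n_visual=10, switch_prob=0.7):
    theta_vr = rng.dirichlet(np.ones(K_VR))    # shared (VR) topic proportions
    theta_nvr = rng.dirichlet(np.ones(K_NVR))  # text-only (NVR) topic proportions
    text_words, visual_words = [], []
    for _ in range(n_text):
        if rng.random() < switch_prob:         # switch "on": draw from a VR topic
            z = rng.choice(K_VR, p=theta_vr)
            text_words.append(rng.choice(V_TXT, p=phi_txt_vr[z]))
        else:                                  # switch "off": draw from an NVR topic
            z = rng.choice(K_NVR, p=theta_nvr)
            text_words.append(rng.choice(V_TXT, p=phi_txt_nvr[z]))
    for _ in range(n_visual):                  # visual words always come from VR topics
        z = rng.choice(K_VR, p=theta_vr)
        visual_words.append(rng.choice(V_IMG, p=phi_img_vr[z]))
    return text_words, visual_words


print(generate_document())
```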

Algorithm 1: The online mmETM algorithm.

An overview of the proposed online mmETM algorithm is shown in Algorithm 1. The inputs of the algorithm are the fixed Dirichlet hyperparameters, which are used to initialize the topic priors at epoch 1, and the time-ordered multimedia documents of a social event, where the number of epochs corresponds to the number of stories along the evolution timeline of the event. The outputs of the algorithm are the generative models, including the visual-representative topics, the non-visual-representative topics, the two document-topic distributions, and the evolution matrices.
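The epoch-by-epoch structure of Algorithm 1 can be sketched as follows. This is a heavily simplified stand-in, not the paper's inference: a single random "responsibility" step replaces the real mmETM updates, the "evolution matrices" are approximated by per-epoch snapshots of the topic-word distributions, and all dimensions and hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 4, 100    # number of topics and textual vocabulary size (illustrative)
beta0 = 0.01     # fixed Dirichlet value used to initialize the topic-word prior at epoch 1


def fit_epoch(doc_word_counts, beta_prior):
    """Stand-in for one epoch of inference: a single random 'responsibility'
    step replaces real Gibbs/variational updates and returns topic-word
    distributions given this epoch's documents and the current prior."""
    resp = rng.dirichlet(np.ones(K), size=doc_word_counts.shape[0])   # docs x K
    counts = resp.T @ doc_word_counts                                 # K x V expected counts
    topic_word = counts + beta_prior
    return topic_word / topic_word.sum(axis=1, keepdims=True)


# fake per-epoch document-word matrices standing in for the event's stories
epochs = [rng.poisson(0.2, size=(30, V)) for _ in range(3)]

beta_prior = np.full((K, V), beta0)
evolution = []   # per-epoch topic-word snapshots (stand-in for the evolution matrices)

for docs in epochs:
    topic_word = fit_epoch(docs, beta_prior)
    evolution.append(topic_word)
    # epoch t's learned topics become (pseudo-count) priors for epoch t+1
    beta_prior = beta0 + 100.0 * topic_word

print(len(evolution), evolution[0].shape)
```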

Results

We show extensive experimental results on our collected dataset to demonstrate the effectiveness of our model.

A. Qualitative Evaluation


Figure 4: Illustration of samples of discovered topics for the events “Occupy Wall Street” and “United States Presidential Election” (VR denotes visual-representative topics and NVR denotes non-visual-representative topics).

We qualitatively demonstrate the effectiveness of the proposed model for social event analysis. For simplicity, we first visualize the learned visual-representative and non-visual-representative topics in the first epoch of each social event in Fig. 4, which validates the effectiveness of the proposed mmETM model. In Fig. 4, we show examples of the discovered topics for the events “United States Presidential Election” and “Occupy Wall Street”, respectively. By presenting multi-modal information, i.e., the representative textual and visual words, each associated topic offers an intuitive interpretation of the social event. For the visualization, we present the top-ranked textual words and visual patches for each visual-representative topic, and the top-ranked textual words for each non-visual-representative topic, which helps interpret the properties of the social events.
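Producing Fig. 4-style summaries amounts to listing the top-ranked words of each topic-word distribution, as in the small sketch below; the vocabulary and distributions here are random placeholders, not learned mmETM topics.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = [f"word_{i}" for i in range(200)]                 # placeholder textual vocabulary
topic_word = rng.dirichlet(np.ones(len(vocab)), size=5)   # 5 placeholder topics over the vocabulary


def top_words(topic_word, vocab, n=10):
    """Return the n highest-probability words of each topic."""
    return [[vocab[i] for i in np.argsort(-row)[:n]] for row in topic_word]


for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}:", ", ".join(words))
```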

B. Quantitative Evaluation


Figure 5: The purity scores of topic identification for different topic models on our collected dataset.


Figure 6: The text/image perplexity scores for different topic models on our collected dataset.

Based on the comparisons of soft clustering quality and text/image perplexity shown in Fig. 5 and Fig. 6, we can draw the following conclusions. (1) LDA shows inferior performance. This is because LDA models textual and visual aspect words indiscriminately and cannot capture the associations between textual and visual words. (2) The tr-mmLDA, Corr-LDA and mm-LDA models achieve better results by leveraging the information in both texts and images to capture the dependencies between textual and visual words. However, tr-mmLDA introduces a regression module to correlate the two sets of topics, and Corr-LDA and mm-LDA assume a one-to-one correspondence between the topics of each modality, so these methods require a strong correlation between the textual and visual information. (3) The proposed mmETM consistently and significantly outperforms the other state-of-the-art topic models. The major reason is that mmETM models the correlations between the two modalities more accurately by separating visual-representative topics from non-visual-representative topics, which yields better document representations in the latent semantic space.
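For reference, the two measures reported in Figs. 5 and 6 can be computed as in the sketch below. It uses the standard definitions of clustering purity (with hard assignments for simplicity) and held-out perplexity; the toy assignments and likelihoods are placeholders, not our experimental data.

```python
import numpy as np
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents whose cluster's majority true class matches their label."""
    total = 0
    for c in set(clusters):
        members = [lab for cl, lab in zip(clusters, labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)


def perplexity(log_likelihoods, num_words):
    """exp(- sum of held-out per-word log-likelihoods / number of held-out words)."""
    return float(np.exp(-np.sum(log_likelihoods) / num_words))


clusters = [0, 0, 1, 1, 1, 2, 2]                 # toy predicted cluster ids
labels   = ["a", "a", "b", "b", "a", "c", "c"]   # toy ground-truth event labels
print("purity:", purity(clusters, labels))

log_liks = np.log(np.full(1000, 1 / 500.0))      # e.g. a uniform model over a 500-word vocabulary
print("perplexity:", perplexity(log_liks, 1000))
```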

Publication

Multi-modal Event Topic Model for Social Event Analysis. [pdf][slides]

Shengsheng Qian, Tianzhu Zhang, Changsheng Xu and Jie Shao
IEEE Transactions on Multimedia, 2015, Accepted.