TextTruth KDD18 Summary

2021/11/09

Source paper: TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data (KDD '18)

Fundamental Principles of Truth Discovery

Challenges in text data

Keywords and Factors

Take the question “What are the symptoms of flu?” as a simple example. It may receive several answers, such as “One may feel very cold and exhausted”. From answers like this one, we can extract keywords such as “freezing”, “cold”, “tired”, “exhausted”, “runny nose”, and “congestion”. Among these keywords, it is obvious that different keywords can represent the same meaning; here we call that shared meaning a factor. For example, “cold” and “freezing” belong to the factor “chills”, while “tired” and “exhausted” belong to the factor “fatigue”.

More formally, answers, keywords, and factors form a hierarchy: each answer contains several keywords, and a factor behaves like a cluster of keywords that share similar semantic meanings.
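To make the keyword-to-factor grouping concrete, here is a minimal sketch. This is not the paper's actual inference procedure; the 2-D toy embeddings and the greedy cosine-similarity threshold are illustrative assumptions only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def cluster_keywords(embeddings, threshold=0.8):
    """Greedily group keywords whose embeddings point in similar
    directions; each resulting cluster plays the role of a 'factor'."""
    clusters = []  # list of (member_keywords, member_vectors)
    for word, vec in embeddings.items():
        for members, vecs in clusters:
            centroid = [sum(c) / len(vecs) for c in zip(*vecs)]
            if cosine(vec, centroid) >= threshold:
                members.append(word)
                vecs.append(vec)
                break
        else:
            clusters.append(([word], [vec]))
    return [members for members, _ in clusters]

# Toy 2-D "embeddings" for the flu example (hypothetical values).
toy = {
    "cold":      [1.0, 0.0],
    "freezing":  [0.9, 0.1],
    "tired":     [0.0, 1.0],
    "exhausted": [0.1, 0.9],
}
factors = cluster_keywords(toy)
print(factors)  # two factors: roughly "chills" and "fatigue"
```

With these toy vectors, “cold”/“freezing” land in one cluster and “tired”/“exhausted” in another, mirroring the chills/fatigue factors above.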

High-level Idea

This paper introduces a model named “TextTruth”, which takes the keywords in each answer as input and outputs a ranking of the answer candidates based on their trustworthiness. Specifically, for a given answer, they first extract its keywords. Then they compute vectorized representations of those keywords, which are only “answer-level” / coarse-grained representations due to the multifactorial challenge. Thus, they need to be converted into fine-grained factors. They achieve this conversion by clustering keywords with similar semantic meanings. Finally, they can estimate the trustworthiness of each answer factor instead of the whole answer, and infer the correctness of each factor in the answer.
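The final ranking step can be sketched as follows. Scoring an answer by the average trustworthiness of the factors it mentions is a simplification of the paper's actual inference, and the factor truth probabilities below are made-up numbers standing in for the model's inferred posteriors.

```python
# Hypothetical per-factor truth probabilities (illustrative values,
# standing in for what TextTruth would infer).
factor_truth = {"fever": 0.9, "chills": 0.8, "cough": 0.85,
                "rumor": 0.05, "fatigue": 0.75}

# Which factors each candidate answer mentions.
answers = {
    "answer_a": ["fever", "cough"],
    "answer_b": ["rumor"],
    "answer_c": ["fever", "chills", "fatigue"],
}

def score(factors):
    """Average trustworthiness of an answer's factors."""
    return sum(factor_truth[f] for f in factors) / len(factors)

ranking = sorted(answers, key=lambda a: score(answers[a]), reverse=True)
print(ranking)  # ['answer_a', 'answer_b', ...] ordered by score
```

Note how the answer built from a single untrustworthy factor drops to the bottom even though it is not judged as a whole string.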

Problem Formulation

Details

Overview

Recall the goal of this problem: to extract highly trustworthy answers, and the highly trustworthy key factors within those answers, for each question. The paper develops a generative probabilistic model to achieve this.

Generative Model

In detail, the model first uses a Dirichlet distribution to model the mixture of factors, $\mathbf{\pi}_q \sim Dirichlet(\mathbf{\beta})$ (as a prior). A Beta distribution is used to generate the prior truth probability $\gamma_{qk} \sim Beta(\alpha_1^{(a)},\alpha_0^{(a)})$ for the $k$-th factor under question $q$. Then, based on the prior $\gamma_{qk}$, the truth label is generated from a Bernoulli distribution as $t_{qk} \sim Bernoulli(\gamma_{qk})$. Finally, to model the semantic characteristics of each answer factor, the proposed model leverages the von Mises-Fisher (vMF) distribution to generate the keyword embedding vectors, with two parameters: the centroid parameter $\mathbf{\mu}_{qk}$ and the concentration parameter $\kappa_{qk}$. The variables and distributions are summarized below:

| Variable | Distribution | Hyper-parameters | Remark |
| --- | --- | --- | --- |
| Mixture of factors $\mathbf{\pi}_q$ | $\mathbf{\pi}_q \sim Dirichlet(\mathbf{\beta})$ | $\mathbf{\beta}$, a $K_q$-dimensional vector | The prior of the factors |
| Prior truth probability $\gamma_{qk}$ | $\gamma_{qk} \sim Beta(\alpha_1^{(a)},\alpha_0^{(a)})$ | $\alpha_1^{(a)},\alpha_0^{(a)}$ | |
| Binary truth label $t_{qk}$ | $t_{qk} \sim Bernoulli(\gamma_{qk})$ | $\gamma_{qk}$, sampled from $Beta(\alpha_1^{(a)},\alpha_0^{(a)})$ | True or false: whether the factor is trustworthy |
| Keyword embedding vector $\mathbf{v}_{qum}$ | $\mathbf{v}_{qum} \sim vMF(\mathbf{\mu}_{qk}, \kappa_{qk})$ | $\mathbf{\mu}_{qk}, \kappa_{qk}$ | $\mathbf{\mu}_{qk}$ defines the “semantic center”; $\kappa_{qk}$ is the concentration parameter |
| Factor parameters $\mathbf{\mu}_{qk}, \kappa_{qk}$ | $(\mathbf{\mu}_{qk}, \kappa_{qk}) \sim \Phi(\mathbf{\mu}_{qk}, \kappa_{qk}; \mathbf{m}_0, R_0, c)$ | $\mathbf{m}_0, R_0, c$ | The prior $\Phi$ over the vMF parameters |
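The generative process in the table can be sampled ancestrally. Below is a rough sketch using only the standard library: the vMF draw is replaced by a normalized Gaussian around the centroid with spread $\sim 1/\sqrt{\kappa}$ (a crude stand-in, reasonable only for large $\kappa$), $\kappa_{qk}$ is fixed rather than drawn from the prior $\Phi$, and all hyper-parameter values are illustrative.

```python
import math
import random

random.seed(0)

K = 3          # number of factors for this question
dim = 4        # embedding dimension
beta = [1.0] * K              # Dirichlet prior over factors
alpha1, alpha0 = 2.0, 1.0     # Beta prior on truth probability

# pi_q ~ Dirichlet(beta), sampled via normalized Gamma draws.
gammas = [random.gammavariate(b, 1.0) for b in beta]
pi_q = [g / sum(gammas) for g in gammas]

# Per-factor truth: gamma_qk ~ Beta, then t_qk ~ Bernoulli(gamma_qk).
gamma_q = [random.betavariate(alpha1, alpha0) for _ in range(K)]
t_q = [1 if random.random() < g else 0 for g in gamma_q]

def unit(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Factor centroids mu_qk on the unit sphere; kappa_qk fixed here
# instead of being drawn from the conjugate prior Phi.
mu_q = [unit([random.gauss(0, 1) for _ in range(dim)]) for _ in range(K)]
kappa_q = [50.0] * K

def sample_keyword():
    """Draw a factor index from pi_q, then an embedding near its
    centroid (normalized Gaussian as a stand-in for the vMF draw)."""
    r, k = random.random(), 0
    while r > pi_q[k]:
        r -= pi_q[k]
        k += 1
    spread = 1.0 / math.sqrt(kappa_q[k])
    return k, unit([m + random.gauss(0, spread) for m in mu_q[k]])

k, v = sample_keyword()
print(k, v)  # a factor index and a unit embedding vector near mu_q[k]
```

Inference in the paper runs this process in reverse: given the observed keyword embeddings, it recovers the factor assignments and the truth labels $t_{qk}$.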

  1. A question that can be answered with simple facts expressed in short text answers ↩︎

  2. The answer of a user may contain only part of the correct answer. For example, the correct answer should contain the following factors: fever, chills, cough, nasal symptoms, ache, and fatigue, but the user's answer contains only fever and cough. ↩︎