Source paper: TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data (KDD '18)
Fundamental Principles of Truth Discovery
- If a user provides a lot of trustworthy information (i.e., true answers), his/her reliability is high
- If an answer is supported by many reliable users, this answer is more likely to be true
Challenges in text data
- Unstructured and noisy
- For a factoid question¹, the answer may be multifactorial, and it is usually hard for a single text answer to cover all the factors. Such circumstances lead to the so-called partially correct phenomenon²
- Diversity of word usage. For example, one can use words such as "tired" or "exhausted" to describe the symptom of fatigue
Keywords and Factors
Take the question "What are the symptoms of flu?" as a simple example. It may have several answers, such as "One may feel very cold and exhausted". From answers like this one, we can extract keywords such as "freezing", "cold", "tired", "exhausted", "runny nose", and "congestion". Among these keywords, some clearly represent the same meaning; we call such a shared meaning a factor. For example, "cold" and "freezing" belong to the factor "chills", while "tired" and "exhausted" belong to the factor "fatigue".
More formally, answers, keywords, and factors form a hierarchy: a factor behaves like a cluster of keywords that share similar semantic meanings.
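The keyword-to-factor grouping can be sketched with a toy greedy clustering over hand-made embedding vectors. Everything below (the vectors, the similarity threshold, the greedy scheme) is an illustrative assumption, not the paper's actual embeddings or clustering method:

```python
import math

# Toy, hand-made 3-d "embeddings" (hypothetical values, for illustration only):
# keywords with similar meanings are given similar vectors.
embeddings = {
    "cold":       [0.9, 0.1, 0.0],
    "freezing":   [0.85, 0.15, 0.05],
    "tired":      [0.1, 0.9, 0.1],
    "exhausted":  [0.05, 0.95, 0.05],
    "congestion": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embs, threshold=0.9):
    """Greedy single-pass clustering: a keyword joins the first factor whose
    representative vector it is close enough to, otherwise it starts a new
    factor.  (Each factor is represented by its first member's vector.)"""
    factors = []  # list of (representative vector, [keywords])
    for word, vec in embs.items():
        for rep, members in factors:
            if cosine(rep, vec) >= threshold:
                members.append(word)
                break
        else:
            factors.append((vec, [word]))
    return [members for _, members in factors]

clusters = greedy_cluster(embeddings)
# -> [["cold", "freezing"], ["tired", "exhausted"], ["congestion"]]
```

With these toy vectors, "cold"/"freezing" land in one factor (chills), "tired"/"exhausted" in another (fatigue), and "congestion" in its own.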
High-level Idea
This paper introduces a model named "TextTruth", which takes the keywords in each answer as input and outputs a ranking of the answer candidates by trustworthiness. Specifically, for a given answer, they first extract its keywords. Then they compute vectorized representations of those keywords; because of the multifactorial challenge, these are only coarse-grained, "answer-level" representations, so they need to be converted into fine-grained factors. They achieve this conversion by clustering keywords with similar semantic meanings. Finally, they can estimate the trustworthiness of each answer factor instead of the whole answer and infer the correctness of each factor in the answer.
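The first pipeline stage, keyword extraction, can be approximated with a naive stopword filter. The paper extracts domain-specific content words; the tokenizer and stopword list below are stand-in assumptions:

```python
import re

# Hypothetical stopword list, just enough for the running example.
STOPWORDS = {"one", "may", "feel", "very", "and", "the", "a", "of", "are", "what"}

def extract_keywords(answer: str) -> list[str]:
    """Naive keyword extraction: lowercase, tokenize on letters, drop
    stopwords.  (A stand-in for the paper's domain-specific extraction.)"""
    tokens = re.findall(r"[a-z]+", answer.lower())
    return [t for t in tokens if t not in STOPWORDS]

extract_keywords("One may feel very cold and exhausted")
# -> ["cold", "exhausted"]
```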
Problem Formulation
- Question: A question $q$ contains $N_q$ words and can be answered by users.
- Answer: An answer given by user $u$ to question $q$ is denoted as $a_{qu}$
- Answer Keyword: Answer keywords are domain-specific content words / phrases in answers. The $m$-th answer keyword of the answer given by user $u$ to question $q$ is denoted as $x_{qum}$
- Answer Factor: Answer factors are the key points of the answers, which are represented as clusters of answer keywords. The $k$-th answer factor in the answers to question $q$ is denoted as $c_{qk}$
- Problem Definition: Given a set of users $\{u\}_1^{U}$, a set of questions $\{q\}_1^Q$, and a set of answers $\{a_{qu}\}_{q,u=1,1}^{Q,U}$, where $U$ denotes the number of users and $Q$ the number of questions, the goal is to extract highly trustworthy answers and highly trustworthy key factors in the answers to each question.
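The inputs above can be laid out as a simple nested mapping. All names and answer strings here are a hypothetical toy instance, purely for illustration:

```python
# answers[q][u] holds a_qu, user u's answer to question q (toy data).
answers = {
    "q1": {
        "u1": "One may feel very cold and exhausted",
        "u2": "Flu causes fever, chills and fatigue",
    },
}

Q = len(answers)                                           # number of questions
U = len({u for per_q in answers.values() for u in per_q})  # number of users
```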
Details
Overview
Recall that the goal is to extract highly trustworthy answers and highly trustworthy key factors in the answers to each question. The paper develops a generative probabilistic model to achieve this.
Generative Model
In detail, the model first uses a Dirichlet distribution to model the mixture of factors, $\boldsymbol{\pi}_q \sim Dirichlet(\boldsymbol{\beta})$ (as a prior). A Beta distribution generates the prior truth probability $\gamma_{qk} \sim Beta(\alpha_1^{(a)},\alpha_0^{(a)})$ for the $k$-th factor under question $q$. Then, based on this prior, the truth label is drawn from a Bernoulli distribution: $t_{qk} \sim Bernoulli(\gamma_{qk})$. Finally, to model the semantic characteristics of each answer factor, the model leverages the von Mises-Fisher (vMF) distribution to generate the keyword embedding vectors, parameterized by a centroid $\boldsymbol{\mu}_{qk}$ and a concentration parameter $\kappa_{qk}$, which are themselves drawn from a prior $\Phi(\boldsymbol{\mu}_{qk}, \kappa_{qk}; \mathbf{m}_0, R_0, c)$. These components are summarized below.
Variable | Distribution | Hyper-Parameters | Remark |
---|---|---|---|
Mixture of factors $\boldsymbol{\pi}_q$ | $\boldsymbol{\pi}_q \sim Dirichlet(\boldsymbol{\beta})$ | $\boldsymbol{\beta}$, a $K_q$-dimensional vector | The prior over the factors |
Prior truth probability $\gamma_{qk}$ | $\gamma_{qk} \sim Beta(\alpha_1^{(a)},\alpha_0^{(a)})$ | $\alpha_1^{(a)},\alpha_0^{(a)}$ | |
Binary truth label $t_{qk}$ | $t_{qk} \sim Bernoulli(\gamma_{qk})$ | $\gamma_{qk}$, sampled from $Beta(\alpha_1^{(a)},\alpha_0^{(a)})$ | True/false: whether the factor is trustworthy |
Keyword embedding vectors $\mathbf{v}_{qum}$ | $\mathbf{v}_{qum} \sim vMF(\boldsymbol{\mu}_{qk}, \kappa_{qk})$ | $\boldsymbol{\mu}_{qk}, \kappa_{qk}$ | $\boldsymbol{\mu}_{qk}$ defines the 'semantic center'; $\kappa_{qk}$ is the concentration parameter |
vMF parameters $\boldsymbol{\mu}_{qk}, \kappa_{qk}$ | $\boldsymbol{\mu}_{qk}, \kappa_{qk} \sim \Phi(\boldsymbol{\mu}_{qk}, \kappa_{qk}; \mathbf{m}_0, R_0, c)$ | $\mathbf{m}_0, R_0, c$ | Prior over the vMF parameters |
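The first three rows of the generative process can be sketched with the standard library alone (a Dirichlet sample is built from normalized Gamma draws). Sampling from a vMF distribution is nontrivial without extra libraries, so that row is omitted; the hyper-parameter values below are illustrative, not the paper's:

```python
import random

random.seed(0)  # deterministic for the example

def sample_dirichlet(beta):
    """Dirichlet sample via normalized Gamma draws (stdlib only)."""
    draws = [random.gammavariate(b, 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]

def generate_factor_priors(K_q, beta=1.0, alpha1=2.0, alpha0=2.0):
    """For one question q, sample: the factor mixture pi_q, the prior truth
    probability gamma_qk of each factor k, and its binary truth label t_qk.
    Hyper-parameter values are illustrative assumptions."""
    pi_q = sample_dirichlet([beta] * K_q)                     # pi_q ~ Dirichlet(beta)
    gammas = [random.betavariate(alpha1, alpha0)              # gamma_qk ~ Beta(a1, a0)
              for _ in range(K_q)]
    t = [1 if random.random() < g else 0 for g in gammas]     # t_qk ~ Bernoulli(gamma_qk)
    return pi_q, gammas, t

pi_q, gammas, t = generate_factor_priors(K_q=4)
```

`pi_q` sums to 1 (a point on the simplex), each `gammas[k]` lies in [0, 1], and each `t[k]` is a 0/1 truth label for factor $k$.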
- Answer Factor Modeling:
¹ A factoid question is a question that can be answered with simple facts expressed in short text answers.

² Partially correct: a user's answer contains only part of the correct answer. For example, the correct answer should contain the factors fever, chills, cough, nasal symptom, ache, and fatigue, but the given answer contains only fever and cough.