The theory of belief functions, also referred to as evidence theory or Dempster-Shafer theory (DST), is a general framework for reasoning with uncertainty, with understood connections to other frameworks such as probability, possibility and imprecise probability theories. First introduced by Arthur P. Dempster in the context of statistical inference, the theory was later developed by Glenn Shafer into a general framework for modeling epistemic uncertainty--a mathematical theory of evidence. The theory allows one to combine evidence from different sources and arrive at a degree of belief (represented by a mathematical object called belief function) that takes into account all the available evidence.
In a narrow sense, the term Dempster-Shafer theory refers to the original conception of the theory by Dempster and Shafer. However, it is more common to use the term in the wider sense of the same general approach, as adapted to specific kinds of situations. In particular, many authors have proposed different rules for combining evidence, often with a view to handling conflicts in evidence better. The early contributions have also been the starting points of many important developments, including the transferable belief model and the theory of hints.
Dempster-Shafer theory is a generalization of the Bayesian theory of subjective probability. Belief functions base degrees of belief (or confidence, or trust) for one question on the probabilities for a related question. The degrees of belief themselves may or may not have the mathematical properties of probabilities; how much they differ depends on how closely the two questions are related. Put another way, it is a way of representing epistemic plausibilities but it can yield answers that contradict those arrived at using probability theory.
Often used as a method of sensor fusion, Dempster-Shafer theory is based on two ideas: obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. In essence, the degree of belief in a proposition depends primarily upon the number of answers (to the related questions) containing the proposition, and the subjective probability of each answer. Also contributing are the rules of combination that reflect general assumptions about the data.
In this formalism a degree of belief (also referred to as a mass) is represented as a belief function rather than a Bayesian probability distribution. Probability values are assigned to sets of possibilities rather than single events: their appeal rests on the fact they naturally encode evidence in favor of propositions.
Dempster-Shafer theory assigns its masses to all of the subsets of the propositions that compose a system--in set-theoretic terms, the power set of the propositions. For instance, assume a situation where there are two related questions, or propositions, in a system. In this system, any belief function assigns mass to the first proposition, the second, both or neither.
Shafer's formalism starts from a set of possibilities under consideration, for instance numerical values of a variable, or pairs of linguistic variables like "date and place of origin of a relic" (asking whether it is antique or a recent fake). A hypothesis is represented by a subset of this frame of discernment, like "(Ming dynasty, China)", or "(19th century, Germany)".:p.35f.
Shafer's framework allows for belief about such propositions to be represented as intervals, bounded by two values, belief (or support) and plausibility:
In a first step, subjective probabilities (masses) are assigned to all subsets of the frame; usually, only a restricted number of sets will have non-zero mass (focal elements).:39f.Belief in a hypothesis is constituted by the sum of the masses of all subsets of the hypothesis-set. It is the amount of belief that directly supports either the given hypothesis or a more specific one, thus forming a lower bound on its probability. Belief (usually denoted Bel) measures the strength of the evidence in favor of a proposition p. It ranges from 0 (indicating no evidence) to 1 (denoting certainty). Plausibility is 1 minus the sum of the masses of all sets whose intersection with the hypothesis is empty. Or, it can be obtained as the sum of the masses of all sets whose intersection with the hypothesis is not empty. It is an upper bound on the possibility that the hypothesis could be true, i.e. it "could possibly be the true state of the system" up to that value, because there is only so much evidence that contradicts that hypothesis. Plausibility (denoted by Pl) is defined to be Pl(p) = 1 − Bel(~p). It also ranges from 0 to 1 and measures the extent to which evidence in favor of ~p leaves room for belief in p.
For example, suppose we have a belief of 0.5 for a proposition, say "the cat in the box is dead." This means that we have evidence that allows us to state strongly that the proposition is true with a confidence of 0.5. However, the evidence contrary to that hypothesis (i.e. "the cat is alive") only has a confidence of 0.2. The remaining mass of 0.3 (the gap between the 0.5 supporting evidence on the one hand, and the 0.2 contrary evidence on the other) is "indeterminate," meaning that the cat could either be dead or alive. This interval represents the level of uncertainty based on the evidence in the system.
|Null (neither alive nor dead)||0||0||0|
|Either (alive or dead)||0.3||1.0||1.0|
The null hypothesis is set to zero by definition (it corresponds to "no solution"). The orthogonal hypotheses "Alive" and "Dead" have probabilities of 0.2 and 0.5, respectively. This could correspond to "Live/Dead Cat Detector" signals, which have respective reliabilities of 0.2 and 0.5. Finally, the all-encompassing "Either" hypothesis (which simply acknowledges there is a cat in the box) picks up the slack so that the sum of the masses is 1. The belief for the "Alive" and "Dead" hypotheses matches their corresponding masses because they have no subsets; belief for "Either" consists of the sum of all three masses (Either, Alive, and Dead) because "Alive" and "Dead" are each subsets of "Either". The "Alive" plausibility is 1 - m (Dead): 0.8 and the "Dead" plausibility is 1 - m (Alive): 0.5. In other way, the "Alive" plausibility is m(Alive) + m (Either) and the "Dead" plausibility is m(Dead) + m(Either). Finally, the "Either" plausibility sums m(Alive) + m(Dead) + m(Either). The universal hypothesis ("Either") will always have 100% belief and plausibility--it acts as a checksum of sorts.
Here is a somewhat more elaborate example where the behavior of belief and plausibility begins to emerge. We're looking through a variety of detector systems at a single faraway signal light, which can only be coloured in one of three colours (red, yellow, or green):
|Red or Yellow||0.06||0.66||0.85|
|Red or Green||0.05||0.55||0.75|
|Yellow or Green||0.04||0.44||0.65|
Events of this kind would not be modeled as disjoint sets in probability space as they are here in mass assignment space. Rather the event "Red or Yellow" would be considered as the union of the events "Red" and "Yellow", and (see probability axioms) P(Red or Yellow) >= P(Yellow), and P(Any) = 1, where Any refers to Red or Yellow or Green. In DST the mass assigned to Any refers to the proportion of evidence that can not be assigned to any of the other states, which here means evidence that says there is a light but does not say anything about what color it is. In this example, the proportion of evidence that shows the light is either Red or Green is given a mass of 0.05. Such evidence might, for example, be obtained from a R/G color blind person. DST lets us extract the value of this sensor's evidence. Also, in DST the Null set is considered to have zero mass, meaning here that the signal light system exists and we are examining its possible states, not speculating as to whether it exists at all.
Beliefs from different sources can be combined with various fusion operators to model specific situations of belief fusion, e.g. with Dempster's rule of combination, which combines belief constraints that are dictated by independent belief sources, such as in the case of combining hints or combining preferences. Note that the probability masses from propositions that contradict each other can be used to obtain a measure of conflict between the independent belief sources. Other situations can be modeled with different fusion operators, such as cumulative fusion of beliefs from independent sources which can be modeled with the cumulative fusion operator.
Dempster's rule of combination is sometimes interpreted as an approximate generalisation of Bayes' rule. In this interpretation the priors and conditionals need not be specified, unlike traditional Bayesian methods, which often use a symmetry (minimax error) argument to assign prior probabilities to random variables (e.g. assigning 0.5 to binary values for which no information is available about which is more likely). However, any information contained in the missing priors and conditionals is not used in Dempster's rule of combination unless it can be obtained indirectly--and arguably is then available for calculation using Bayes equations.
Dempster-Shafer theory allows one to specify a degree of ignorance in this situation instead of being forced to supply prior probabilities that add to unity. This sort of situation, and whether there is a real distinction between risk and ignorance, has been extensively discussed by statisticians and economists. See, for example, the contrasting views of Daniel Ellsberg, Howard Raiffa, Kenneth Arrow and Frank Knight.
is the set of all subsets of X, including the empty set . For example, if:
The elements of the power set can be taken to represent propositions concerning the actual state of the system, by containing all and only the states in which the proposition is true.
The theory of evidence assigns a belief mass to each element of the power set. Formally, a function
is called a basic belief assignment (BBA), when it has two properties. First, the mass of the empty set is zero:
Second, the masses of all the members of the power set add up to a total of 1:
The mass m(A) of A, a given member of the power set, expresses the proportion of all relevant and available evidence that supports the claim that the actual state belongs to A but to no particular subset of A. The value of m(A) pertains only to the set A and makes no additional claims about any subsets of A, each of which have, by definition, their own mass.
From the mass assignments, the upper and lower bounds of a probability interval can be defined. This interval contains the precise probability of a set of interest (in the classical sense), and is bounded by two non-additive continuous measures called belief (or support) and plausibility:
The belief bel(A) for a set A is defined as the sum of all the masses of subsets of the set of interest:
The plausibility pl(A) is the sum of all the masses of the sets B that intersect the set of interest A:
The two measures are related to each other as follows:
And conversely, for finite A, given the belief measure bel(B) for all subsets B of A, we can find the masses m(A) with the following inverse function:
where |A - B| is the difference of the cardinalities of the two sets.
It follows from the last two equations that, for a finite set X, one needs to know only one of the three (mass, belief, or plausibility) to deduce the other two; though one may need to know the values for many sets in order to calculate one of the other values for a particular set. In the case of an infinite X, there can be well-defined belief and plausibility functions but no well-defined mass function.
The problem we now face is how to combine two independent sets of probability mass assignments in specific situations. In case different sources express their beliefs over the frame in terms of belief constraints such as in case of giving hints or in case of expressing preferences, then Dempster's rule of combination is the appropriate fusion operator. This rule derives common shared belief between multiple sources and ignores all the conflicting (non-shared) belief through a normalization factor. Use of that rule in other situations than that of combining belief constraints has come under serious criticism, such as in case of fusing separate belief estimates from multiple sources that are to be integrated in a cumulative manner, and not as constraints. Cumulative fusion means that all probability masses from the different sources are reflected in the derived belief, so no probability mass is ignored.
Specifically, the combination (called the joint mass) is calculated from the two sets of masses m1 and m2 in the following manner:
K is a measure of the amount of conflict between the two mass sets.
The normalization factor above, 1 - K, has the effect of completely ignoring conflict and attributing any mass associated with conflict to the null set. This combination rule for evidence can therefore produce counterintuitive results, as we show next.
The following example shows how Dempster's rule produces intuitive results when applied in a preference fusion situation, even when there is high conflict.
An example with exactly the same numerical values was introduced by Zadeh in 1979, to point out counter-intuitive results generated by Dempster's rule when there is a high degree of conflict. The example goes as follows:
Such result goes against the common sense since both doctors agree that there is a little chance that the patient has a meningitis. This example has been the starting point of many research works for trying to find a solid justification for Dempster's rule and for foundations of Dempster-Shafer Theory or to show the inconsistencies of this theory.
The following example shows where Dempster's rule produces a counter-intuitive result, even when there is low conflict.
This result implies complete support for the diagnosis of a brain tumor, which both doctors believed very likely. The agreement arises from the low degree of conflict between the two sets of evidence comprised by the two doctors' opinions.
In either case, it would be reasonable to expect that:
since the existence of non-zero belief probabilities for other diagnoses implies less than complete support for the brain tumour diagnosis.
As in Dempster-Shafer theory, a Bayesian belief function has the properties and . The third condition, however, is subsumed by, but relaxed in DS theory::p. 19
For example, a Bayesian would model the color of a car as a probability distribution over (red, green, blue), assigning one number to each color. Dempster-Shafer would assign numbers to each of (red, green, blue, (red or green), (red or blue), (green or blue), (red or green or blue)) which do not have to cohere, for example Bel(red)+Bel(green) != Bel(red or green). This may be computationally more efficient if a witness reports "I saw that the car was either blue or green" in which case the belief can be assigned in a single step rather than breaking down into values for two separate colors. However this can lead to irrational conclusions.
Equivalently, each of the following conditions defines the Bayesian special case of the DS theory::p. 37,45
Bayes' conditional probability is a special case of Dempster's rule of combination.:p. 19f.
It has been argued that DS theory provides a clearer distinction between epistemic uncertainty and physical uncertainty than Bayesian theory. For example, the height of an unobserved person from a population may have a Gaussian belief distribution with a high variance, but Bayesian theory obtains the same distribution in the case where all people are the same height but little data is available about what that height is, as in the case where there is a wide range of physically different heights in the population. Standard Bayesian theory may lead to suboptimal decisions if this difference is not accounted for using second-order probability and machinery to estimate utilities of information-gathering actions.
It has also been argued that DS theory is not a generalisation of Bayesian theory.
The Bayesian approximation [Voorbraak, 1989) reduces a given bpa to a (discrete) probability distribution, i.e. only singleton subsets of the frame of discernment are allowed to be focal elements of the approximated version Markup Renders as of :
It's useful for those who are only interested in the single state hypothesis.
We can perform it in the 'light' example.
|Red or Yellow||0.06||0.21||0.07||0||0||0|
|Red or Green||0.05||0.01||0.01||0||0||0|
|Yellow or Green||0.04||0.03||0.01||0||0||0|
Judea Pearl (1988a, chapter 9; 1988b and 1990) has argued that it is misleading to interpret belief functions as representing either "probabilities of an event," or "the confidence one has in the probabilities assigned to various outcomes," or "degrees of belief (or confidence, or trust) in a proposition," or "degree of ignorance in a situation." Instead, belief functions represent the probability that a given proposition is provable from a set of other propositions, to which probabilities are assigned. Confusing probabilities of truth with probabilities of provability may lead to counterintuitive results in reasoning tasks such as (1) representing incomplete knowledge, (2) belief-updating and (3) evidence pooling. He further demonstrated that, if partial knowledge is encoded and updated by belief function methods, the resulting beliefs cannot serve as a basis for rational decisions.
K?opotek and Wierzcho? proposed to interpret the Dempster-Shafer theory in terms of statistics of decision tables (of the rough set theory), whereby the operator of combining evidence should be seen as relational joining of decision tables. In another interpretation M. A. K?opotek and S. T. Wierzcho? propose to view this theory as describing destructive material processing (under loss of properties), e.g. like in some semiconductor production processes. Under both interpretations reasoning in DST gives correct results, contrary to the earlier probabilistic interpretations, criticized by Pearl in the cited papers and by other researchers.
Jøsang proved that Dempster's rule of combination actually is a method for fusing belief constraints. It only represents an approximate fusion operator in other situations, such as cumulative fusion of beliefs, but generally produces incorrect results in such situations. The confusion around the validity of Dempster's rule therefore originates in the failure of correctly interpreting the nature of situations to be modeled. Dempster's rule of combination always produces correct and intuitive results in situation of fusing belief constraints from different sources.
In considering preferences one might use the partial order of a lattice instead of the total order of the real line as found in Dempster-Schafer theory. Indeed, Gunther Schmidt has proposed this modification and outlined the method.
Given a set of criteria C and a lattice L with ordering E, Schmidt defines a relational measure μ from the power set on C into L that respects the order Ω on P(C): The tools of the calculus of relations, including composition of relations, are used to express this respect:
Schmidt compares μ with the belief function of Schafer, and he also considers a method of combining measures generalizing the approach of Dempster (when new evidence is combined with previously held evidence). He also introduces a relational integral and compares it to the Choquet integral and Sugeno integral. Any relation m between C and L may be introduced as a "direct valuation", then processed with the calculus of relations to obtain a possibility measure μ.