Imagine you were a computer programmer tasked with building not the Chess expert but the medical diagnostic expert. Unlike Chess, where all legal moves are known, almost all the elements in your medical diagnostic database will have probabilities attached to them, and very little absolute, deductive knowledge will be available. How could you begin to structure the data so that the large search space does not become overwhelming? The structure needs to be able to answer queries of the form: what is the most likely disease given this set of symptoms? The key feature of the required structure is that it quickly identify which items in the data are irrelevant, so that they can be safely ignored, and that it allow us to work efficiently with a large number of parameters.

It turns out that the solution is to take the metaphors I have mentioned literally. Remember the decision tree? Sketch it out as a graphical model. These trees consist of circular nodes, each of which represents one of the variables in our model. Each node also contains a numeric probability for each of the variable’s possible states, representing the weight of that evidence in the overall scheme of the model. The nodes are connected with lines, the branches of the tree. These lines express the conditional probability associations and are often directed, the arrows indicating the direction of influence of one variable on another. We say that the node at the tail end of an arrow is the parent of the node at the head of the arrow, its child node. What results is a tree of probability relationships known as a Bayesian Belief Network. To reason across the tree is a process of plugging in the values of the states that are known and summing across those nodes that are unknown. The great technical trick is that the tree structure allows us to consider only the immediate parents of a node as we traverse the relationships. Those immediate parents are the only factors that are relevant.

This is a simple, classic example of a graphical model. You have an automatic sprinkler system which includes a fairly efficient sensor to detect how cloudy it is, though it is not 100% accurate. You wake up and find the grass is wet. What caused it? There are two possible causes in the model: either it rained or the sprinklers were on last night. What influences whether it rained or the sprinklers were on is whether or not it was cloudy. This model is a graphical representation of a set of conditional probabilities. Once it is fully specified by including numeric values we can ask the model to answer numerous questions. Given that it was cloudy, what are the chances that the sprinkler system was on last night? Given that the grass is wet, what are the chances that it rained? If I know the sprinklers were on last night, what are the chances that it was cloudy? This is the power of the Bayesian Belief Network; it becomes a probability-speaking oracle. Before looking at the fully specified model, there are a number of observations to make about what this seemingly simple technique provides even before any numeric values are introduced.
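For readers who like to see things concretely, the structure just described, before any numbers are attached, amounts to nothing more than a list of parents for each node. Here is a minimal sketch in Python (the node names come from the example; the representation itself is my own illustrative choice):

```python
# The wet-grass model's structure alone, before any probabilities:
# each node lists its parents, and the arrows run parent -> child.
parents = {
    "cloudy": [],                        # root node: no parents
    "sprinkler": ["cloudy"],             # the sensor watches the sky
    "rain": ["cloudy"],
    "wet_grass": ["sprinkler", "rain"],
}

def children(node):
    """Invert the parent map to walk the arrows in the other direction."""
    return [n for n, ps in parents.items() if node in ps]
```

The point of the sketch is that the arrows are the whole structure; everything else in the network hangs off this parent map.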

The graphical representation is faithful to the way we often think about probabilistic circumstances. We are able to decompose a reasoning task into local, independent considerations from different domains (atmospheric, technological and perceptual in our model) and fit these results together into a global inference in stages. Observations in cognitive science of how people actually think lend weight to the graphical modeling approach: it seems that organizing our knowledge in such structures is how we too are able to access relevant information in a timely fashion. As we proceed through complex reasoning with many steps we naturally deal with the locally relevant items, then carry their results to the next step, and so on. This is exactly the algorithm used in machine intelligence when querying a Bayesian network. “It is this role which prompts us to posit that conditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy actively by organizing our knowledge in a specific way” (Pearl 1988, pg. 44). The world is full of numerous causes and effects interconnected in many different ways, but when we reason under uncertainty we only consider those factors that are directly relevant. If I am wondering what the chances are that my family will benefit if I take a new job offer, I consider items like salary and my well-being. These in turn depend on numerous other factors such as the sociological status of my chosen career, the local corporate tax policy, how macroeconomic inflation or recession is affecting the national economy, and so on. Although these additional factors are all real influences that bear on my estimation, their influences are accounted for by considering simply the offered salary. Cognitively we are able to chunk our information in this way and avoid being overwhelmed.
In the same way, graphical models assist the designers of expert systems as they interview specialists about their knowledge. Because the relevant relationships are made explicit, the soundness of the model can be checked by considering only those influences represented by the arrows. The medical diagnostic model with its thousands of nodes is only tractable because this graphical representation allows us to consider only those variables involved in direct influences. It scales up because of its modularity.

There is another way in which the graphical model represents something more fundamental than numeric values. It is difficult for us to intuitively provide estimates of factors that are probabilistically independent, but easy to perceive and estimate dependent relationships, where knowing the truth about one condition will influence our estimates of another. For example, if I know that it rained last night this immediately increases my belief that the grass will be wet. These notions of dependence that tell us which variables are relevant are much more basic to our reasoning than the assignment of numeric values to the probabilities. This important insight is what has revolutionized the use of Bayesian thought in machine intelligence. The structure of dependencies laid out in the graphic tree is more basic than the numeric values; it captures an invariant aspect of the problem domain. The relationships define a stable structural model that remains unchanged by the introduction of numbers, whatever their values. In the wet grass model we see that by capturing where the dependent influences occur we have automatically implied what is independent as well; the state of the clouds does not directly influence whether the grass is found to be wet or not, and whether or not the sprinklers are on has no direct influence on the probability of rain. This is just what the technical challenge needs: the identification of what variables do not need to be considered because they are irrelevant. The savings can be seen by comparing the number of terms that need to be calculated. The complete probability of such models is the joint probability of all the nodes in the tree, for this model p(cloudy, rain, sprinkler, wet grass), where a comma is being used to indicate AND. In the full model, using the product rule, the complete joint probability is given by

p(cloudy, rain, sprinkler, wet grass) =

p(cloudy) * p(rain | cloudy) * p(sprinkler | cloudy, rain) * p(wet grass | cloudy, rain, sprinkler)

by using the independence relationships just mentioned the same probabilities can be written as

p(cloudy, rain, sprinkler, wet grass) =

p(cloudy) * p(rain | cloudy) * p(sprinkler | cloudy) * p(wet grass | rain, sprinkler)

Why are both of these valid representations of the same model? The theorems of the Bayesian Belief Network prove that the probability of the variable in each node depends only on the probabilities of its parent nodes. We can see this intuitively in our model. The cloud-sensing sprinkler in the third term is independent of the state of rain given its parent, cloudy. What this means is that by knowing if it was cloudy or not I know the probability that the sprinklers were on; knowing if it rained or not adds no additional relevant information. Similarly, the state of the wet grass variable is independent of the cloudy condition given its parents, rain and sprinkler. Knowing if it rained and knowing if the sprinklers were on is all that we need to determine if the grass will be wet; knowing if it was cloudy or not adds no new information to the task of determining the wetness of the grass. In this sprinkler model the savings are minimal, but as models become more complex the savings can become substantial. For example, one famous medical diagnostic system is QMR-DT, where 600 diseases are mapped to 4000 symptoms, illustrated below.
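The savings can be made concrete by counting free parameters. Each Boolean node needs one number (the probability of being true) for every combination of its parents' states, so a node conditioned on k other variables needs 2 to the power k numbers. A quick sketch of the comparison between the two factorizations above:

```python
# Number of conditioning variables per term in each factorization of
# p(cloudy, rain, sprinkler, wet grass).
full_chain = {"cloudy": 0, "rain": 1, "sprinkler": 2, "wet grass": 3}
factored   = {"cloudy": 0, "rain": 1, "sprinkler": 1, "wet grass": 2}

def parameters(conditioners_per_term):
    # one free number per combination of conditioning states
    return sum(2 ** k for k in conditioners_per_term.values())

print(parameters(full_chain))  # 15 (= 1 + 2 + 4 + 8)
print(parameters(factored))    # 9  (= 1 + 2 + 2 + 4)
```

In a four-node model the difference is modest; in a model on the scale of QMR-DT it is the difference between the tractable and the impossible.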

Before the introduction of numeric values the graphic representation defines a whole class of models, all sharing the same structural characteristics. Instances of the model are defined when numeric values are assigned. An instance of the wet grass model follows. Each table assigns the probability of the node variable for each possible state of its parents. The top node in our model, cloudy, has no parent node so it is assigned a prior probability. This is how prior probabilities are introduced into Bayesian networks. Each variable in this model is Boolean; it can only take on the values of true or false, but this is not a requirement, just a simplification. The columns indicate the true or false value, so for example the table associated with the sprinkler node is read as indicating that the probability that the sprinkler was on (S=T) given it was cloudy is 0.1 or 10%, meaning that the sensor correctly keeps the sprinkler from turning on when it is cloudy 90% of the time. Notice that all the rows sum to 1, as they must to be valid probability assignments.
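The tables themselves are not reproduced here, but the following values are the classic assignment for this example (an assumption on my part, though one consistent with every posterior quoted later in the chapter):

```python
# The classic conditional probability tables for the wet-grass model.
# Each dict maps the parent state(s) to p(node = True); p(node = False)
# is the complement, so every row sums to 1 as required.
p_cloudy = 0.5                           # prior: cloudy has no parents
p_sprinkler = {True: 0.1, False: 0.5}    # keyed by cloudy
p_rain = {True: 0.8, False: 0.2}         # keyed by cloudy
p_wet_grass = {                          # keyed by (sprinkler, rain)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.90,
    (False, False): 0.00,
}
```

Reading the sprinkler table: p(S=T | cloudy) is 0.1, the 10% figure mentioned above; the false column of each row is simply the complement.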

“Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning (as in rule based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction.” (Pearl 2000) Let’s see how our model can be used to draw inferences, which will show how the flow of information does occur. The information can flow either top down or bottom up. When it flows bottom up it is diagnostic: we observe an effect and ask about the cause. When it flows top down it is predictive: we observe a cause and ask how probable an effect will be. The top down flow is also referred to as generative; it specifies how causes generate effects.

Reasoning with a Bayesian network is accomplished by setting up a query and using the tree to arrive at the final answer. Any combination of variables can be used to create the query. This is just what it is to have a probability model; it specifies the relationship between every elementary event in the language. In a fully specified model each atomic proposition, or its negation, appears once and, as we have seen, the probabilities sum to 1. Consider the diagnostic query in which we observe that the grass is wet and want to know which of the two possible causes is most probable. Translated into terms of probability this query is asking for p(sprinkler = true | wet grass = true) and p(rain = true | wet grass = true). The form is p(query | evidence). Using the Bayes equation we are able to compute the posterior probabilities. We find that the probability of the cause being the sprinkler is 0.430 and the probability of the cause being rain is 0.708. Rain is the more likely cause. On the other hand, if we observe that the grass is not wet we find that the posterior probability of the sprinkler being on is 0.062 and of it having rained 0.119.
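A minimal sketch of such a query engine, plain enumeration over the full joint with the classic table values assumed, reproduces the two posteriors just quoted:

```python
from itertools import product

# Classic conditional-table values for the sprinkler network, an
# assumption (the chapter's tables are not reproduced here) that
# matches the posteriors quoted in the text.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}             # p(sprinkler=T | cloudy)
P_R = {True: 0.8, False: 0.2}             # p(rain=T | cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}  # p(wet=T | sprinkler, rain)

def joint(c, s, r, w):
    """p(c, s, r, w) via the factored form of the network."""
    return ((P_C if c else 1 - P_C)
            * (P_S[c] if s else 1 - P_S[c])
            * (P_R[c] if r else 1 - P_R[c])
            * (P_W[(s, r)] if w else 1 - P_W[(s, r)]))

def query(target, evidence):
    """p(target = true | evidence): sum the joint over every world that
    agrees with the evidence, splitting off the worlds where the target
    is true, then normalize."""
    names = ("c", "s", "r", "w")
    num = den = 0.0
    for values in product((True, False), repeat=4):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(*values)
        den += p
        if world[target]:
            num += p
    return num / den

p_sprinkler = query("s", {"w": True})   # about 0.430
p_rain = query("r", {"w": True})        # about 0.708: rain is the likelier cause
```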

Mathematically this is simply an extension of the Bayesian equation to the case of multiple variables. The rest of this paragraph lays out how this is possible; it can be skipped without loss by those less curious about the actual mechanics of the calculations. The algorithm is a sum of products. For variables that are part of the query, the entry in the table corresponding to the queried setting is used. Each variable that is not part of the query is considered a ‘nuisance variable’ and is summed out: every entry in its table is used in turn, and the products of the conditional calculations for the nuisance variables are added together. When we are asking for the probability of the sprinkler being on given the grass is wet, the sum is over the cloudy and rain nodes. One product is found when all values are true. This is added to the rest of the set of products, created by leaving the evidence variables at the same setting but accounting for each of the non-evidence variables in every combination:

p(sprinkler = true | wet grass = true) ∝

(cloudy = true) * (sprinkler = true) * (rain = true) * (wet grass = true)    (all values set to true) +

(cloudy = false) * (sprinkler = true) * (rain = true) * (wet grass = true)    (change only cloudy to false) +

(cloudy = true) * (sprinkler = true) * (rain = false) * (wet grass = true)    (change only rain to false) +

(cloudy = false) * (sprinkler = true) * (rain = false) * (wet grass = true)    (change both cloudy and rain to false)

Each of these products is added together and the result normalized to arrive at the final answer. Notice that when asking for the probability of sprinkler = true given that wet grass = true we only consider the true values for the sprinkler and wet grass nodes. The result of this summation process is normalized by dividing by the evidence, in this case wet grass = true. This evidence value is found in the same way, by keeping wet grass set to true and incorporating each possible value of the remaining nodes. The same technique is used when we ask for the probability of it having rained given the grass is wet; the sum is then calculated over the cloudy and sprinkler nodes.
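With the classic table values assumed in these sketches, the four products can be written out numerically; their sum is the unnormalized posterior, and dividing by the evidence p(wet grass = true), found the same way, gives the 0.430 quoted earlier:

```python
# The four products spelled out above, with the classic table values
# (an assumption, as before). Each product is
# p(cloudy) * p(sprinkler=T|cloudy) * p(rain|cloudy) * p(wet=T|sprinkler=T, rain)
on_terms = [
    0.5 * 0.1 * 0.8 * 0.99,  # cloudy=T, rain=T   (all values true)
    0.5 * 0.5 * 0.2 * 0.99,  # cloudy=F, rain=T   (only cloudy flipped)
    0.5 * 0.1 * 0.2 * 0.90,  # cloudy=T, rain=F   (only rain flipped)
    0.5 * 0.5 * 0.8 * 0.90,  # cloudy=F, rain=F   (both flipped)
]
numerator = sum(on_terms)               # p(sprinkler=T, wet=T) = 0.2781

# The evidence p(wet=T) is found the same way, adding the sprinkler=F worlds:
off_terms = [
    0.5 * 0.9 * 0.8 * 0.90,  # cloudy=T, rain=T, sprinkler=F
    0.5 * 0.9 * 0.2 * 0.00,  # cloudy=T, rain=F, sprinkler=F
    0.5 * 0.5 * 0.2 * 0.90,  # cloudy=F, rain=T, sprinkler=F
    0.5 * 0.5 * 0.8 * 0.00,  # cloudy=F, rain=F, sprinkler=F
]
evidence = numerator + sum(off_terms)   # p(wet=T) = 0.6471
posterior = numerator / evidence        # about 0.430
```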

What would happen to the probability of the sprinkler having been on if we knew it rained last night and observe that the grass is wet? Now the query takes the form p(sprinkler = true | wet grass = true, rain = true). The posterior probability of the sprinkler being on drops from 0.430 to 0.195 with the addition of this new information. This is known as ‘explaining away’. There are two possible causes of the wet grass, rain or sprinkler, which are independent of one another in our model. There is no arrow between them, which is the graphical way to recognize that the sprinkler does not cause it to rain and the rain does not cause the sprinkler to turn on or off (remember the sprinkler sensor is sensitive to cloud cover, not humidity). However, when the observation of wet grass occurs they do become conditionally dependent in the sense that they are now competing explanations. When the additional observation that it has rained is added, this does have an effect on the probability of the sprinkler being on. The rain explains away the need for the sprinkler to have made the grass wet. This effect is even more evident if what we have observed about the rain is that it did not rain. Now the query is p(sprinkler = true | wet grass = true, rain = false) and the posterior probability of the sprinkler being on is 1.0 or 100%. This ability to properly handle the explaining away of competing causes, managed automatically in Bayesian networks, is not something the rule based or neural network approaches to machine intelligence are able to deal with in any natural way.
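The explaining-away calculation can be sketched directly, again assuming the classic table values used throughout these examples:

```python
from itertools import product

# Explaining away in the sprinkler network (classic table values assumed;
# they match the posteriors quoted in the text).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def p_sprinkler_given_wet(rain=None):
    """p(sprinkler=T | wet=T) with rain optionally also observed."""
    num = den = 0.0
    for c, s, r in product((True, False), repeat=3):
        if rain is not None and r != rain:
            continue                      # world disagrees with the evidence
        p = ((P_C if c else 1 - P_C) * (P_S[c] if s else 1 - P_S[c])
             * (P_R[c] if r else 1 - P_R[c]) * P_W[(s, r)])
        den += p
        if s:
            num += p
    return num / den

# p_sprinkler_given_wet()           -> about 0.430  (wet grass alone)
# p_sprinkler_given_wet(rain=True)  -> about 0.195  (rain explains the grass away)
# p_sprinkler_given_wet(rain=False) -> 1.0          (something had to wet the grass)
```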

We can also ask the sprinkler model predictive queries. What is the chance that it will rain given that it is cloudy, that is, p(rain = true | cloudy = true)? This can be read off the table directly: 0.8 or 80%. What if we ask what the chances are that the grass will be wet given that it is cloudy? The posterior for p(wet grass = true | cloudy = true) is 0.745, a little less than the chance of rain. If it seems counterintuitive that this is less than the chance of rain, consider all the factors that are now relevant. There is a 10% chance in the model that it can rain yet the grass will not be wet when we wake up and observe it. The model is also factoring in the sprinkler, with its 10% chance of running even though it is cloudy, and the 20% chance that it can be cloudy and not rain. The posterior p(wet grass = true | cloudy = true) = 0.745 takes all this into account. Even in this simple model we see how it provides a useful aid for assuring that our inferences abide by the axioms of probability theory.
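The 0.745 can be checked by hand: condition on cloudy and sum the sprinkler and rain possibilities out (table values assumed as in the other sketches):

```python
# p(wet grass = true | cloudy = true), reading the cloudy = true row
# of the assumed tables.
p_s = 0.1   # p(sprinkler=T | cloudy=T)
p_r = 0.8   # p(rain=T | cloudy=T)
p_w = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # p(wet=T | sprinkler, rain)

p_wet = sum((p_s if s else 1 - p_s) * (p_r if r else 1 - p_r) * p_w[(s, r)]
            for s in (True, False) for r in (True, False))
# p_wet is about 0.745, a little less than the 0.8 chance of rain alone
```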

In all these examples only one node has been involved in the query but this is not a necessary restriction. What are the chances that both the sprinkler was on and that it rained, given that the grass is wet? The model computes p(rain = true, sprinkler = true | wet grass = true) to be 0.138, or approximately 14% of the time. This time the result is larger than we might have intuitively expected. This can draw our attention to examine the relevant factors more carefully. In this case one interesting aspect of the model is that it allows, 20% of the time, that it rained even though it is no longer cloudy when we wake up to observe the wet grass. It is easy to see how the interaction of the relevant variables can lead to results that are unexpected yet on careful analysis are found to agree with common sense.
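The joint query is computed the same way: sum the numerator over cloudy with both causes set to true, and normalize by the evidence (table values assumed as before):

```python
from itertools import product

# p(rain = true, sprinkler = true | wet grass = true)
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def wet_joint(c, s, r):
    """p(c, s, r, wet = true)"""
    return ((P_C if c else 1 - P_C) * (P_S[c] if s else 1 - P_S[c])
            * (P_R[c] if r else 1 - P_R[c]) * P_W[(s, r)])

numerator = sum(wet_joint(c, True, True) for c in (True, False))
evidence = sum(wet_joint(c, s, r)
               for c, s, r in product((True, False), repeat=3))
both_causes = numerator / evidence    # about 0.138
```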

Bayesian networks provide our robot with its consistent technique for updating its beliefs in models complex and complete enough to be useful in real world applications. The structure of the graphical tree allows the necessary computations to be performed at each node, which then passes messages to its neighbors in an object oriented fashion. This has made practical implementations of these Bayesian networks possible in many circumstances. However, there are problems of inference in Bayesian trees that are NP-hard, which is computer science speak for being impractical to the point of impossible. In our example sprinkler model the variables were Boolean, the simplest possible. One complication arises when the variables can take on continuous values, such as measurements on the real number line. In this circumstance the summation step is replaced by integration and, as mentioned, not all the integrals that arise have analytic solutions. Another complication arises when the number of parents for a given node grows. As we have seen, it is the structure of the graph that makes the identification of independence easy, but this arises only when the graph has a tree structure. Not all models decompose into such structures. Finally, there are models that require loops in the tree structure due to feedback mechanisms. For each of these complicating cases sophisticated algorithms have been developed that rely on approximations and temporary restructuring of the tree. They work in practice but considerably complicate the picture as it has been presented here.

I have presented Bayesian networks as a technology for performing inference since the subject of Bayesian belief updating is the central concept we are exploring in this book. There are related capabilities worth touching on as well. Having an instance of the model of our belief network is an incredibly nifty thing, but how are the probabilities in those tables to be arrived at? Haven’t we just pushed the hard problem off to another level? One way to arrive at the probabilities is to use the knowledge gained from past studies. Another is to interview a number of experts, but there remains yet a third way. It is possible to set a computer the task of learning the correct probabilities from a set of data. The idea is to program the computer to examine the relationships among the variables as they are found in the data and set the probabilities appropriately. How do engineers know that the values found this way are appropriate? There are a number of techniques, but one of the cleverest is to take the existing data set and split it in two. One of the data sets is used to train the model, that is, to assign probabilities. Then the model is used to make predictions and these are compared to the data found in the second data set. A common example where this technique has been used successfully is in the spam filters that protect an email account. Spam filters are often built using a form of the Bayesian network we are considering known as a classifier. Their job is to separate spam email from non-spam, and they do so by examining the text of the email for combinations of keywords that are known to be included in spam with high probability. They use a network where the nodes are individual keywords and calculate the conditional probabilities of their combinations. When you indicate that a particular email that the filter did not keep out of your inbox is spam, the software re-analyzes its content and adjusts its probabilities given the new information. It learns.
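A toy version of such a classifier, in its simplest ‘naive Bayes’ form where every keyword node hangs off a single spam-or-not node, can be sketched in a few lines. The training messages and keywords here are invented for illustration; real filters train on thousands of messages:

```python
import math
from collections import Counter

# Hypothetical training data (real filters learn from user-labeled mail).
spam_msgs = ["win money now", "free money offer", "win free prize now"]
ham_msgs  = ["meeting at noon", "project status report", "lunch at noon"]

def word_counts(msgs):
    return Counter(w for m in msgs for w in m.split())

spam_counts = word_counts(spam_msgs)
ham_counts = word_counts(ham_msgs)
vocab = set(spam_counts) | set(ham_counts)

def log_score(msg, counts, prior):
    """log p(class) + sum of log p(word | class), Laplace-smoothed so an
    unseen word cannot zero out the whole product."""
    total = sum(counts.values())
    score = math.log(prior)
    for w in msg.split():
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def is_spam(msg):
    return (log_score(msg, spam_counts, 0.5)
            > log_score(msg, ham_counts, 0.5))
```

Retraining on a newly labeled message is just a matter of adding its words to the appropriate counter, which is the learning step described above.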
This is different from what we have seen these networks doing when answering a query, which was a process of belief updating. Now the network is involved in a process of belief commitment: it settles on the collection of settings most likely to explain the evidence.

This too can be seen as a way of pushing off the hard problem. It is all well and good that the spam filter is able to update the settings in its nodes, but where did the model initially come from that identified the keywords that should be the relevant variables? In many models the relevant variables are known and the whole point is to translate that knowledge into a form that is capable of automating reasoning; the basic medical expert systems are in this category. A spam filter on the other hand represents a whole different set of models where there is no unifying theory of cause and effect and no expert knowledge detailed enough to do the job. For these types of models the same approach can be taken: the computer can be programmed to learn the best model structure that accounts for the evidence. This exploratory data analysis has become a very widely used Bayesian technique because of its power to find relationships among large data sets that exceed the capacity of the unaided mind. One example should suffice to illustrate the types of problems that this data analysis has made tractable. In genetic analysis there are very often large collections of variables, thousands of genes and tens of thousands of inter-genetic signals. When researchers are investigating the possibility of a disease having a genetic component they use data sets of the genetic expression of those with and without the disease. They need to find what relationship among these genetic factors best predicts the disease occurrence. The process now consists of the computer exploring sets of classes of models, each with their own structure, to find those that explain the data most completely. The beauty of the probabilistic approach is that the results are all open to critical examination. Unlike neural networks, there are no inaccessible hidden nodes, nor are the influences between the nodes too far removed from the causes and effects of the initial problem domain.
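The core idea, scoring candidate structures against data, can be sketched in the smallest possible case: one ‘gene’ variable, one ‘disease’ variable, and two candidate structures (gene influences disease, versus the two being independent). The synthetic data and the BIC-style scoring rule below are my own illustrative choices, not a description of any particular genetics system:

```python
import math
import random

random.seed(0)
# Hypothetical synthetic data: a binary "gene" x and a binary "disease" y
# that truly depends on it (p(y) is 0.9 when x is present, 0.2 otherwise).
data = []
for _ in range(2000):
    x = random.random() < 0.5
    y = random.random() < (0.9 if x else 0.2)
    data.append((x, y))

def log_lik(data, dependent):
    """Maximum-likelihood log-likelihood under 'x -> y' or 'x, y independent'."""
    n = len(data)
    px = sum(x for x, _ in data) / n
    ll = sum(math.log(px if x else 1 - px) for x, _ in data)
    if dependent:
        for xv in (True, False):          # one p(y | x) table row per x value
            ys = [y for x, y in data if x == xv]
            py = sum(ys) / len(ys)
            ll += sum(math.log(py if y else 1 - py) for y in ys)
    else:
        py = sum(y for _, y in data) / n  # a single unconditional p(y)
        ll += sum(math.log(py if y else 1 - py) for _, y in data)
    return ll

def bic_score(data, dependent):
    k = 3 if dependent else 2             # free parameters in each structure
    return log_lik(data, dependent) - 0.5 * k * math.log(len(data))

# On data generated this way the dependent structure wins decisively,
# even after the penalty for its extra parameter.
```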
The probabilistic approach builds in safeguards to guide the challenge of interpreting the results, churning out low probabilities when appropriate and specifically indicating where more data is required.

The arrows in the sprinkler model all represent cause and effect relationships. This need not be the case; the links can also be used to represent logical associations. A model of an orchard’s crop yield might include the output of peaches and the output of cherries. Though there is no causal relationship between them making them dependent, there could be a logical one. New information about the cherry yield could influence what we expect from the peaches, as they both depend on shared environmental conditions. Proper inference always depends on the complete state of knowledge from which the questions are being asked. It is also possible that there is a causal relationship between nodes that the researcher does not know about; in these cases those nodes are going to be considered conditionally independent. As predictions and diagnostics are compared with the actual data, Bayesian networks provide a flexible way to explore the accuracy of our models. There is one more way to use these networks for causal investigations if the model is like our sprinkler, in which each association maps a cause and effect relationship. We can ask what would happen if we were to intervene, to reach in as it were and set the state of a node to a given value. This action is written as do(X=x); turning on the sprinkler would be written do(sprinkler=1). This has the effect of separating that node: it would cut the link from cloudy to sprinkler. If we turn the sprinkler on, it no longer depends on whether or not it is cloudy. This will change the results obtained from queries that are calculated against this new tree. This powerful technique allows us to predict what the impact of external interventions will be using only the data that we have collected before the intervention takes place (Pearl 1999, Sloman 2009).
By using this ‘do’ operation it is possible to explore the causes and effects of actions in circumstances in which it might be too dangerous, unethical, too expensive, or simply impractical to intervene. Intervention on targeted variables is also the essence of the experimental method. Investigation of the ‘do’ operator as a new means of conducting experiments wholly within the computer is an active area of research.
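The difference between seeing and doing can be computed directly in the sprinkler model (classic table values assumed, as in the earlier sketches). Observing that the sprinkler is on is evidence that it was probably not cloudy, and hence less likely to have rained; forcing it on with do(sprinkler=1) severs that link, so the two answers differ:

```python
from itertools import product

P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # p(sprinkler=T | cloudy): the link the 'do' cuts
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def p_wet_see_sprinkler():
    """Ordinary conditioning: p(wet=T | sprinkler=T)."""
    num = den = 0.0
    for c, r in product((True, False), repeat=2):
        p = (P_C if c else 1 - P_C) * P_S[c] * (P_R[c] if r else 1 - P_R[c])
        den += p
        num += p * P_W[(True, r)]
    return num / den

def p_wet_do_sprinkler():
    """Intervention: delete the cloudy -> sprinkler link, then set sprinkler=T."""
    return sum((P_C if c else 1 - P_C) * (P_R[c] if r else 1 - P_R[c])
               * P_W[(True, r)]
               for c, r in product((True, False), repeat=2))

# p_wet_see_sprinkler() is about 0.927: seeing the sprinkler on hints it
# was clear, so rain is less likely.  p_wet_do_sprinkler() is about 0.945:
# forcing it on tells us nothing about the sky.
```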

To conclude this whirlwind tour of the use of Bayesian thought in machine intelligence it is good to recall the first and most important rule of computer science – garbage in, garbage out. Nowhere is this more easily seen than in the use of Bayesian techniques; they will never fail to provide the correct answer to the exact question that was posed. The danger is that it is not always easy to understand the exact question. All mathematics is only able to deliver an answer based on the information provided. As we have seen, the sprinkler model was almost like an oracle, providing answers to every query we could ask. The value of the answers can only be as good as the assumptions that went into the model. As the machine intelligences around us continue to increase in sophistication it is increasingly important not to lose sight of these platitudes.