Learning Bayesian belief networks with neural network estimatorsy Stefano Monti Intelligent Systems Program University of Pittsburgh 901M CL, Pittsburgh, PA { 15260

; Gregory F. Cooper Center for Biomedical Informatics University of Pittsburgh 8084 Forbes Tower, Pittsburgh, PA { 15261

[email protected]

[email protected]

Technical Report ISSP-96-02 April 1996 (Revised January '97) Intelligent Systems Program University of Pittsburgh 901 CL, Pittsburgh, PA - 15260

Abstract

In this paper we propose a method for learning Bayesian belief networks from data. The method uses arti cial neural networks as probability estimators, thus avoiding the need for making prior assumptions on the nature of the probability distributions governing the relationships among the participating variables. This new method has the potential for being applied to domains containing both discrete and continuous variables arbitrarily distributed. We compare the learning performance of this new method with the performance of the method proposed by Cooper and Herskovits in [10]. The experimental results show that, although the learning scheme based on the use of ANN estimators is slower, the learning accuracy of the two methods is comparable.

y To

appear in Advances in Neural Information Processing Systems, 1996.

1

1 Introduction Bayesian belief networks (BBN), often referred to as probabilistic networks, are a powerful formalism for representing and reasoning under uncertainty. This representation has a solid theoretical foundation [19], and its practical value is suggested by the rapidly growing number of areas to which it is being applied. BBNs concisely represent the joint probability distribution over a set of random variables, by explicitly identifying the probabilistic dependencies and independencies between these variables. Their clear semantics make BBNs particularly suitable for being used in tasks such as diagnosis, planning, control, and explanation. Construction of probabilistic networks with domain experts often remains a dicult and time consuming task [12]. Knowledge acquisition from experts is dicult because the experts have problems in making their knowledge explicit. Furthermore, it is time consuming because the information needs to be collected manually. On the other hand, databases are becoming increasingly abundant in many areas. By exploiting databases, the construction time of probabilistic networks may be considerably decreased. The task of learning a BBN from data can usually be formulated as a search over the space of network structures, and as the subsequent search for an optimal parametrization of the discovered structure or structures. The task can be further complicated by extending the search to account for hidden variables and for the presence of data points with missing values. Dierent approaches have been successfully applied to the task of learning probabilistic networks from data [6, 9, 10, 16, 21, 22, 23, 24]. In all these approaches, simplifying assumptions are made to circumvent practical problems in the implementation of the theory. One common assumption that is made is that all variables are discrete, or that all variables are continuous and normally distributed. In this paper, we propose a novel method for learning BBNs from data that makes use of arti cial neural networks (ANN) as probability distribution estimators, thus avoiding the need for making prior assumptions on the nature of the probability distribution governing the relationships among the participating variables. The use of ANNs as probability distribution estimators is not new [3, 18, 20], and its application to the task of learning Bayesian belief networks from data has been recently explored in [15]. However, in [15] the ANN estimators were used in the parametrization of the belief network structure only, and cross validation was the method of choice for comparing dierent network structures. In our approach, the ANN estimators are an essential component of the scoring metric used to search over the BBN structure space. We ran several simulations to compare the performance of this new method with the learning method described in [10]. The results show that, although the learning scheme based on the use of ANN estimators is slower, the learning accuracy of the two methods is comparable. The rest of the paper is organized as follows. In Section 2 we brie y introduce the Bayesian belief network formalism and some basics of how to learn BBNs from data. In Section 3, we describe our learning method, and detail the use of arti cial neural networks as probability distribution estimators. In Section 4 we present some experimental results comparing the performance of this new method with the one proposed in [10]. We conclude the paper with some suggestions for further research.

2 Background A Bayesian belief network is de ned by a triple (G; ; P ), where G = (X ; E ) is a directed acyclic graph with a set of nodes X = fx1 ; : : : ; xn g representing domain variables, and with a set of arcs E = f(xi ; xj ) j xi ; xj 2 X ; xi 6= xj g representing probabilistic dependencies among domain variables; is the space of possible instantiations of the domain variables1; and P is a probability distribution over the instantiations in . Given a node x 2 X , we use x to denote the set of parents of x in X . The essential property of BBNs is summarized by the Markov condition, which asserts that each variable is independent of its non-descendants given its parents. This property allows for the representation of the multivariate joint probability distribution over X in terms of the univariate conditional distributions P (xi j i ; i ) of each variable xi given its parents 1

An instantiation ! of all n variables in X is an n-uple of values fx01 ; : : : ; x0n g such that xi = x0i for i = 1 : : : n.

2

i , with i the set of parameters needed to fully characterize the conditional probability. Application of the chain rule, together with the Markov condition, yields the following factorization of the joint probability of any particular instantiation of all n variables:

P (x01 ; : : : ; x0n ) =

Yn P (x0 j 0 ; ) :

i=1

i

(1)

xi i

2.1 Learning Bayesian belief networks 2 The task of learning BBNs involves learning the network structure and learning the parameters of the

conditional probability distributions. A well established set of learning methods is based on the de nition of a scoring metric measuring the tness of a network structure to the data, and on the search for high-scoring network structures based on the de ned scoring metric [6, 10, 14]. We focus on these methods, and in particular on the de nition of Bayesian scoring metrics. In a Bayesian framework, ideally classi cation and prediction would be performed by taking a weighted average over the inferences of every possible belief network containing the domain variables. Since this approach is in general computationally infeasible, often an attempt has been made to use a high scoring belief network for classi cation. We will assume this approach in the remainder of this paper. The basic idea of the Bayesian approach is to maximize the probability P (BS j D) = P (BS ; D)=P (D) of a network structure BS given a database of cases D. Because for all network structures the term P (D) is the same, for the purpose of model selection it suces to calculate P (BS ; D) for all BS . The Bayesian metrics developed so far all rely on the following assumptions: 1) given a BBN structure, all cases in D are drawn independently from the same distribution (multinomial sample); 2) there are no cases with missing values (complete database); 3) the parameters of the conditional probability distribution of each variable are independent (global parameter independence); and 4) the parameters associated with each instantiation of the parents of a variable are independent (local parameter independence). The application of these assumptions allows for the following factorization of the probability P (BS ; D)

P (BS ; D) = P (BS )P (D j BS ) = P (BS )

Yn s(x ; ; D) ;

i=1

(2)

i i

where n is the number of nodes in the network, and each s(xi ; i ; D) is a term measuring the contribution of xi and its parents i to the overall score of the network BS . The exact form of the terms s(xi i ; D) slightly diers in the Bayesian scoring metrics de ned so far, and for lack of space we refer the interested reader to the relevant literature [6, 10, 14]. By looking at Equation (2), it is clear that if we assume a uniform prior distribution over all network structures, the scoring metric is decomposable, in that it is just the product of the s(xi ; i ; D) over all xi times a constant P (BS ). Suppose that two network structures BS and BS dier only for the presence or absence of a given arc into xi . To compare their metrics, it suces to compute s(xi ; i ; D) for both structures, since the other terms are the same. Q Likewise, if we assume a decomposable prior distribution over network structures, of the form P (BS ) = i pi , as suggested in [14], the scoring metric is still decomposable, since we can include each pi into the corresponding s(xi ; i ; D). Once a scoring metric is de ned, a search for a high-scoring network structure can be carried out. This search task (in several forms) has been shown to be NP-hard [4, 8]. Various heuristics have been proposed to nd network structures with a high score. One such heuristic is known as K2 [10], and it implements a greedy search over the space of network structures. The algorithm assumes a given ordering on the variables. For simplicity, it also assumes that no prior information on the network is available, so the prior probability distribution over the network structures is uniform and can be ignored in comparing network structures. 0

2

For a comprehensive guide to the literature on learning probabilistic networks, see [7].

3

The Bayesian scoring metrics developed so far either assume discrete variables [6, 10, 14], or continuous variables normally distributed [13]. In the next section, we propose a possible generalization which allows for the inclusion of both discrete and continuous variables with arbitrary probability distributions.

3 An ANN-based scoring metric The main idea of this work is to use arti cial neural networks as probability estimators, to de ne a decomposable scoring metric for which no informative priors on the class, or classes, of the probability distributions of the participating variables are needed. The rst three of the four assumptions described in the previous section are still needed, namely, the assumption of a multinomial sample, the assumption of a complete database, and the assumption of global parameter independence. However, the use of ANN estimators allows for the elimination of the assumption of local parameter independence. In fact, the conditional probabilities corresponding to the dierent instantiations of the parents of a variable are represented by the same ANN, and they share the same network weights and the same training data. Let us denote with Dl fC1 ; : : : ; Cl?1 g the set of the rst l ? 1 cases in the database, and with x(il) and i(l) the instantiations of xi and i in the l-th case respectively. The joint probability P (BS ; D) can be written as:

P (BS ; D) = P (BS )P (D j BS ) = P (BS ) = P (BS )

Ym P (C j D ; B ) l=1

Ym Yn P (x(l) j (l); D ; B ) : l

i

i

l=1 i=1

l

l

S

=

S

Q

(3)

If we assume uninformative priors, or decomposable priors on network structures, of the form P (BS ) = i pi , the probability P (BS ; D) is decomposable. In fact, we can interchange the two products in Equation 3, so as to obtain

P (BS ; D) =

Yn [ p Ym P (x(l) j (l); D ; B ) ] = Yn s(x ; ; D) ;

i=1

i

i

l=1

l

i

S

i=1

(4)

i i

where s(xi ; i ; D) is the term between square brackets, and it is only a function of xi and its parents in the network structure BS (pi can be neglected if we assume a uniform prior over the network structures). Notice that the computation of Equation 4 corresponds to the application of the prequential method discussed by Dawid in [11]. The problem now lies with how to estimate each term P (xi j i ; Dl ; BS ). This can be done by means of neural network estimators. Several schemes are available for training a neural network to approximate a given probability distribution, or density. Notice that the calculation of each term s(xi ; i ; D) can be computationally very expensive. For each node xi , computing s(xi ; i ; D) requires the training of m ANNs, where m is the size of the database. To reduce this computational cost, we use the following approximation, which we call the t-invariance approximation: for any l 2 f1; : : :; mg, given the probability P (xi j i ; Dl ; BS ), at least t (1 t m ? l) new cases are needed in order to alter such probability. That is, for each positive integer h, such that h < t, we assume P (xi j i ; Dl+h ; BS ) = P (xi j i ; Dl ; BS ). Intuitively, this approximation implies the assumption that, given our present belief about the value of each P (xi j i ; Dl ; BS ), at least t new cases are needed to revise this belief. By making this approximation, we achieve a t-fold reduction in the computation needed, since we now need to build and train only m=t ANNs for each xi , instead of the original m. In fact, application of the t-invariance approximation to the computation of a given s(xi ; i ; D) yields:

s(xi ; i ; D) =

k+1) Ym P (x(l) j (l); D ; B ) = m=tY?1 t(Y P (x(l) j (l) ; D l=1

i

i

l

S

k=0 l=tk+1

4

i

i

tk ; BS ):

(5)

With regard to the choice of an appropriate value for t, we can either select a constant value for t, or we can choose to increment t as the size of the training database Dl increases. The second approach seems preferable. When estimating P (xi j i ; Dl ; BS ), this estimate will be very sensitive to the addition of new cases when l is small, but will become increasingly insensitive to the addition of new cases as l grows. For example, when l = 1, adding a new case to the training data set means doubling it, while if l = 10; 000 an additional case in the data set is very unlikely to make a signi cant dierence. A scheme for the incremental updating of t can be summarized in the equation t = dle, where l is the number of cases already seen (i.e., the cardinality of Dl ), and 0 < 1. For example, given a data set of 50 cases, the updating scheme t = d0:5le would require the training of the ANN estimators P (xi j i ; Dl ; BS ) for l = 1; 2; 3; 5; 8; 12; 18; 27; 41.

4 Evaluation In this section, we describe the experimental evaluation we conducted to test the feasibility of use of the ANN-based scoring metric developed in the previous section.

Methodology All the experiments are performed on the belief network ALARM, a multiply-connected network originally developed to model anesthesiology problems that may occur during surgery [2]. It contains 37 nodes/variables and 46 arcs. The variables are all discrete, and take between 2 and 4 distinct values. The database used in the experiments was generated from ALARM, and it is the same database used in [10]. In the experiments, we use a modi cation of the algorithm K2 [10]. The modi ed algorithm, which we call ANN-K2, replaces the closed-form scoring metric developed in [10] with the ANN-based scoring metric of Equation (5). The performance of ANN-K2 is measured in terms of accuracy of the recovered network structure, by counting the number of edges added and omitted with respect to the ALARM network; and in terms of the accuracy of the learned joint probability distribution, by computing its cross entropy with respect to the joint probability distribution of ALARM. The learning performance of ANN-K2 is also compared with the performance of K2. To train the ANNs, we used the conjugate-gradient search algorithm described in [17].

Architecture of the ANN estimators Since all the variables in the ALARM network are discrete, the ANN estimators to be included in the scoring metric are de ned based on the softmax model [5] that we now describe. Given a variable xi , with ni values and set of parents i , the conditional probability distribution P P (xi j i) is approximated by a neural network with ni output units, and n input units, where n = j2 (nj ? 1). The ni output units de ne the conditional probabilities P (xi = vk j i ); k = 1; : : : ; ni , as follows: i

f ( ) P (xi = vk j i ) = Pne ef ( ) ; k

i

j =1

i

j

i

where fk (i ) is the linear output of the k-th output unit for the input i . The use of n input units is due to the fact that the variables in ALARM are ordinal variables, with numerical values indicating qualitative measures, such as \low", \medium", \high". The representation of an ordinal variable taking nj values by means of nj ? 1 indicator variables is common practice in statistical regression. As a regularization technique, we augment the training set so as to induce a uniform conditional probability over the unseen instantiantions of the ANN input. In particular, given the probability P (xi j i ; Dl ) to be estimated, and assuming xi is a k-valued variable, for each instantiation i0 that does not appear in the 5

Data set 100 500 1000 2000 3000

Algo. ANN-K2 K2 ANN-K2 K2 ANN-K2 K2 ANN-K2 K2 ANN-K2 K2

arcs +

m 0.19 0.75 0.19 0.22 0.24 0.11 0.19 0.05 0.16 0.00

s.d. 0.40 1.28 0.40 0.42 0.49 0.31 0.40 0.23 0.37 0.00

arcs ?

m 0.62 0.22 0.22 0.11 0.22 0.03 0.11 0.03 0.05 0.03

cross entropy

s.d. 0.86 0.48 0.48 0.31 0.48 0.16 0.31 0.16 0.23 0.16

m 0.23 0.08 0.04 0.02 0.05 0.01 0.02 0.005 0.01 0.004

med .051 .070 .010 .010 .005 .006 .002 .002 .001 .001

s.d. 0.52 0.10 0.11 0.02 0.15 0.01 0.06 0.007 0.017 0.005

time (secs)

m 130 0.44 1077 0.13 6909 0.34 6458 0.46 11155 1.02

med 88 .06 480 .06 4866 .23 4155 .44 4672 .84

s.d. 159 1.48 1312 0.22 6718 0.46 7864 0.65 2136 1.11

Table 1: Comparison of the performance of ANN-K2 and of K2 in terms of number of arcs wrongly added (+), number of arcs wrongly omitted (?), cross entropy, and computation time. Each column reports the mean m, the

median med, and the standard deviation s.d. of the corresponding measure over the 37 nodes/variables of ALARM. The median for the number of arcs added and omitted is always 0, and is not reported.

database Dl , we generate k new cases, with i instantiated to i0 , and xi taking each of its k values. As a result, the neural network estimates P (xi j i0 ; Dl ) to be uniform, with P (xi j i0 ; Dl ) = 1=k for each of xi 's values x11 ; : : : ; x1k .

Results We ran simulations where we varied the size of the training data set (100, 500, 1000, 2000, and 3000 cases), and the value of in the updating scheme t = dle described in Section 3. We used the settings = 0:35, and = 0:5. For each run, we measured the number of arcs added, the number of arcs omitted, the cross entropy, and the computation time, for each variable in the network. That is, we considered each node, together with its parents, as a simple BBN, and collected the measures of interest for each of these BBNs. Table 1 reports mean and standard deviation of each measure over the 37 variables of ALARM, for both ANN-K2 and K2. The results for ANN-K2 shown in Table 1 correspond to the setting = 0:5, since their dierence from the results corresponding to the setting = 0:35 was not statistically signi cant, for any of the given measures. Standard t-tests were performed to assess the signi cance of the dierence between the measures for K2 and the measures for ANN-K2, for each data set cardinality. No technique to correct for multiple-testing was applied. Most measures show no statistically signi cant dierence, either at the 0.05 level or at the 0.01 level (most p values are well above 0.2). In the simulation with 100 cases, both the dierence between the mean number of arcs added and the dierence between the mean number of arcs omitted are statistically signi cant (p ' 0:01). However, these dierences cancel out, in that ANN-K2 adds fewer extra arcs than K2, and K2 omits fewer arcs than ANN-K2. This is re ected in the corresponding cross entropies, whose dierence is not statistically signi cant (p = 0:08). In the simulation with 1000 cases, only the dierence in the number of arcs omitted is statistically signi cant (p ' :03). Finally, in the simulation with 3000 cases, only the dierence in the number of arcs added is statistically signi cant (p ' :02). K2 misses a single arc, and does not add any extra arc, and this is the best result to date. By comparison, ANN-K2 omits 2 arcs, and adds 5 extra arcs. For the simulation with 3000 cases, we also computed Wilcoxon rank sum tests. The results were consistent with the t-test results, showing a statistically signi cant dierence only in the number of arcs added. Finally, as it can be noted in Table 1, the dierence in computation time is of several order of magnitude, thus making a statistical analysis super uous. A natural question to ask is how sensitive is the learning procedure to the order of the cases in the training set. Ideally, the procedure would be insensitive to this order. Since we are using ANN estimators, however, which perform a greedy search in the solution space, particular permutations of the training cases might cause the ANN estimators to be more susceptible to getting stuck in local maxima. We performed some preliminary experiments to test the sensitivity of the learning procedure to the order of the cases in the 6

training set. We ran few simulations in which we randomly changed the order of the cases. The recovered structure was identical in all simulations. Morevoer, the dierence in cross entropy for dierent orderings of the cases in the training set showed not to be statistically signi cant.

5 Conclusions In this paper we presented a novel method for learning Bayesian belief networks from data based on the use of arti cial neural networks as probability distribution estimators. As a preliminary evaluation, we have compared the performance of the new algorithm with the performance of K2, a well established learning algorithm for discrete domains, for which extensive empirical evaluation is available [1, 10]. With regard to the learning accuracy of the new method, the results are encouraging, being comparable to state-of-the-art results for the chosen domain. The next step is the application of this method to domains where current techniques for learning BBNs from data are not applicable, namely domains with continuous variables not normally distributed, and domains with mixtures of continuous and discrete variables. The main drawback of the new algorithm is its time requirements. However, in this preliminary evaluation, our main concern was the learning accuracy of the algorithm, and little eort was spent in trying to optimize its time requirements. We believe there is ample room for improvement in the time performance of the algorithm. More importantly, the scoring metric of Section 3 provides a general framework for experimenting with dierent classes of probability estimators. In this paper we used ANN estimators, but more ecient estimators can easily be adopted, especially if we assume the availability of prior information on the class of probability distributions to be used.

Acknowledgments This work was funded by grant IRI-9509792 from the National Science Foundation.

References [1] C. Aliferis and G. F. Cooper. An evaluation of an algorithm for inductive learning of bayesian belief networks using simulated data sets. In R. L. de Mantras and D. Poole, editors, Proceedings of the 10th Conference of Uncertainty in Arti cial Intelligence, pages 8{14, San Francisco, California, 1994. Morgan Kaumann Publishers. [2] I. Beinlich, H. Suermondt, H. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In 2nd Conference of AI in Medicine Europe, pages 247{256, London, England, 1989. [3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995. [4] R. Bouckaert. Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference of Uncertainty in Arti cial Intelligence, pages 102{109, San Francisco, California, 1994. Morgan Kaumann Publishers. [5] J. Bridle. Probabilistic interpretation of feedforward classi cation network outputs, with relationships to statistical pattern recognition. In Neuro-computing: Algorithms, Architectures and Applications. Springer Verlag, New York, 1989. [6] W. Buntine. Theory re nement on Bayesian networks. In Proceedings of the 7th Conference of Uncertainty in AI, pages 52{60, 1991. 7

[7] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996. [8] D. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: search methods and experimental results. In Proceedings of 5th Workshop on Arti cial Intelligence and Statistics, January 1995. [9] G. Cooper and E. Herskovits. A Bayesian method for constructing Bayesian belief networks from databases. In Proceedings of the 7th Conference of Uncertainty in Arti cial Intelligence, pages 86{94, Los Angeles, CA, 1991. [10] G. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, 9:309{347, 1992. [11] A. Dawid. Present position and potential developments: Some personal views. statistical theory. the prequential approach (with discussion). Journal of Royal Statistical Society A, 147:278{292, 1984. [12] M. Druzdzel, L. C. van der Gaag, M. Henrion, and F. Jensen, editors. Building probabilistic networks: where do the numbers come from?, IJCAI-95 Workshop, Montreal, Canada, 1995. [13] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Research, One Microsoft Way, Redmond, WA 98052, 1994. [14] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 1995. [15] R. Hofmann and V. Tresp. Discovering structure in continuous variables using Bayesian networks. In M. E. H. D. S. Touretzky, M. C. Mozer, editor, Advances in Neural Information Processing Systems 8. MIT Press, 1995. [16] W. Lam and F. Bacchus. Learning Bayesian belief networks { an approach based on the MDL principle. Computational Intelligence, 10(4), 1994. [17] M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525{ 533, 1993. [18] R. Neuneier, F. Hergert, W. Finno, and D. Ormoneit. Estimation of conditional densities: A comparison of neural network approaches. In Proc. of Int. Conference on Arti cial NNs, volume 1, pages 689{692, New York, 1994. [19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: networks of plausible inference. Morgan Kaufman Publishers, Inc., 1988. [20] M. Richard and R. Lippman. Neural network classi ers estimate Bayesian a-posteriori probabilities. Neural Computation, 3:461{483, 1991. [21] D. Spiegelhalter and S. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579{605, 1990. [22] P. Spirtes, C. Glymour, and R. Scheines. Causality, prediction and search. Springer{Verlag, 1992. [23] J. Suzuky. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings of the 9th Conference of Uncertainty in Arti cial Intelligence, pages 266{273, 1993. [24] T. Verma and J. Pearl. An algorithm for deciding if a set of observerd independencies has a causal explanation. In Proceedings of 8th Conference of Uncertainty in AI, pages 323{330, 1992.

8

; Gregory F. Cooper Center for Biomedical Informatics University of Pittsburgh 8084 Forbes Tower, Pittsburgh, PA { 15261

[email protected]

[email protected]

Technical Report ISSP-96-02 April 1996 (Revised January '97) Intelligent Systems Program University of Pittsburgh 901 CL, Pittsburgh, PA - 15260

Abstract

In this paper we propose a method for learning Bayesian belief networks from data. The method uses arti cial neural networks as probability estimators, thus avoiding the need for making prior assumptions on the nature of the probability distributions governing the relationships among the participating variables. This new method has the potential for being applied to domains containing both discrete and continuous variables arbitrarily distributed. We compare the learning performance of this new method with the performance of the method proposed by Cooper and Herskovits in [10]. The experimental results show that, although the learning scheme based on the use of ANN estimators is slower, the learning accuracy of the two methods is comparable.

y To

appear in Advances in Neural Information Processing Systems, 1996.

1

1 Introduction Bayesian belief networks (BBN), often referred to as probabilistic networks, are a powerful formalism for representing and reasoning under uncertainty. This representation has a solid theoretical foundation [19], and its practical value is suggested by the rapidly growing number of areas to which it is being applied. BBNs concisely represent the joint probability distribution over a set of random variables, by explicitly identifying the probabilistic dependencies and independencies between these variables. Their clear semantics make BBNs particularly suitable for being used in tasks such as diagnosis, planning, control, and explanation. Construction of probabilistic networks with domain experts often remains a dicult and time consuming task [12]. Knowledge acquisition from experts is dicult because the experts have problems in making their knowledge explicit. Furthermore, it is time consuming because the information needs to be collected manually. On the other hand, databases are becoming increasingly abundant in many areas. By exploiting databases, the construction time of probabilistic networks may be considerably decreased. The task of learning a BBN from data can usually be formulated as a search over the space of network structures, and as the subsequent search for an optimal parametrization of the discovered structure or structures. The task can be further complicated by extending the search to account for hidden variables and for the presence of data points with missing values. Dierent approaches have been successfully applied to the task of learning probabilistic networks from data [6, 9, 10, 16, 21, 22, 23, 24]. In all these approaches, simplifying assumptions are made to circumvent practical problems in the implementation of the theory. One common assumption that is made is that all variables are discrete, or that all variables are continuous and normally distributed. In this paper, we propose a novel method for learning BBNs from data that makes use of arti cial neural networks (ANN) as probability distribution estimators, thus avoiding the need for making prior assumptions on the nature of the probability distribution governing the relationships among the participating variables. The use of ANNs as probability distribution estimators is not new [3, 18, 20], and its application to the task of learning Bayesian belief networks from data has been recently explored in [15]. However, in [15] the ANN estimators were used in the parametrization of the belief network structure only, and cross validation was the method of choice for comparing dierent network structures. In our approach, the ANN estimators are an essential component of the scoring metric used to search over the BBN structure space. We ran several simulations to compare the performance of this new method with the learning method described in [10]. The results show that, although the learning scheme based on the use of ANN estimators is slower, the learning accuracy of the two methods is comparable. The rest of the paper is organized as follows. In Section 2 we brie y introduce the Bayesian belief network formalism and some basics of how to learn BBNs from data. In Section 3, we describe our learning method, and detail the use of arti cial neural networks as probability distribution estimators. In Section 4 we present some experimental results comparing the performance of this new method with the one proposed in [10]. We conclude the paper with some suggestions for further research.

2 Background A Bayesian belief network is de ned by a triple (G; ; P ), where G = (X ; E ) is a directed acyclic graph with a set of nodes X = fx1 ; : : : ; xn g representing domain variables, and with a set of arcs E = f(xi ; xj ) j xi ; xj 2 X ; xi 6= xj g representing probabilistic dependencies among domain variables; is the space of possible instantiations of the domain variables1; and P is a probability distribution over the instantiations in . Given a node x 2 X , we use x to denote the set of parents of x in X . The essential property of BBNs is summarized by the Markov condition, which asserts that each variable is independent of its non-descendants given its parents. This property allows for the representation of the multivariate joint probability distribution over X in terms of the univariate conditional distributions P (xi j i ; i ) of each variable xi given its parents 1

An instantiation ! of all n variables in X is an n-uple of values fx01 ; : : : ; x0n g such that xi = x0i for i = 1 : : : n.

2

i , with i the set of parameters needed to fully characterize the conditional probability. Application of the chain rule, together with the Markov condition, yields the following factorization of the joint probability of any particular instantiation of all n variables:

P (x01 ; : : : ; x0n ) =

Yn P (x0 j 0 ; ) :

i=1

i

(1)

xi i

2.1 Learning Bayesian belief networks 2 The task of learning BBNs involves learning the network structure and learning the parameters of the

conditional probability distributions. A well established set of learning methods is based on the de nition of a scoring metric measuring the tness of a network structure to the data, and on the search for high-scoring network structures based on the de ned scoring metric [6, 10, 14]. We focus on these methods, and in particular on the de nition of Bayesian scoring metrics. In a Bayesian framework, ideally classi cation and prediction would be performed by taking a weighted average over the inferences of every possible belief network containing the domain variables. Since this approach is in general computationally infeasible, often an attempt has been made to use a high scoring belief network for classi cation. We will assume this approach in the remainder of this paper. The basic idea of the Bayesian approach is to maximize the probability P (BS j D) = P (BS ; D)=P (D) of a network structure BS given a database of cases D. Because for all network structures the term P (D) is the same, for the purpose of model selection it suces to calculate P (BS ; D) for all BS . The Bayesian metrics developed so far all rely on the following assumptions: 1) given a BBN structure, all cases in D are drawn independently from the same distribution (multinomial sample); 2) there are no cases with missing values (complete database); 3) the parameters of the conditional probability distribution of each variable are independent (global parameter independence); and 4) the parameters associated with each instantiation of the parents of a variable are independent (local parameter independence). The application of these assumptions allows for the following factorization of the probability P (BS ; D)

P (BS ; D) = P (BS )P (D j BS ) = P (BS )

Yn s(x ; ; D) ;

i=1

(2)

i i

where n is the number of nodes in the network, and each s(xi ; i ; D) is a term measuring the contribution of xi and its parents i to the overall score of the network BS . The exact form of the terms s(xi i ; D) slightly diers in the Bayesian scoring metrics de ned so far, and for lack of space we refer the interested reader to the relevant literature [6, 10, 14]. By looking at Equation (2), it is clear that if we assume a uniform prior distribution over all network structures, the scoring metric is decomposable, in that it is just the product of the s(xi ; i ; D) over all xi times a constant P (BS ). Suppose that two network structures BS and BS dier only for the presence or absence of a given arc into xi . To compare their metrics, it suces to compute s(xi ; i ; D) for both structures, since the other terms are the same. Q Likewise, if we assume a decomposable prior distribution over network structures, of the form P (BS ) = i pi , as suggested in [14], the scoring metric is still decomposable, since we can include each pi into the corresponding s(xi ; i ; D). Once a scoring metric is de ned, a search for a high-scoring network structure can be carried out. This search task (in several forms) has been shown to be NP-hard [4, 8]. Various heuristics have been proposed to nd network structures with a high score. One such heuristic is known as K2 [10], and it implements a greedy search over the space of network structures. The algorithm assumes a given ordering on the variables. For simplicity, it also assumes that no prior information on the network is available, so the prior probability distribution over the network structures is uniform and can be ignored in comparing network structures. 0

2

For a comprehensive guide to the literature on learning probabilistic networks, see [7].

3

The Bayesian scoring metrics developed so far either assume discrete variables [6, 10, 14], or continuous variables normally distributed [13]. In the next section, we propose a possible generalization which allows for the inclusion of both discrete and continuous variables with arbitrary probability distributions.

3 An ANN-based scoring metric The main idea of this work is to use arti cial neural networks as probability estimators, to de ne a decomposable scoring metric for which no informative priors on the class, or classes, of the probability distributions of the participating variables are needed. The rst three of the four assumptions described in the previous section are still needed, namely, the assumption of a multinomial sample, the assumption of a complete database, and the assumption of global parameter independence. However, the use of ANN estimators allows for the elimination of the assumption of local parameter independence. In fact, the conditional probabilities corresponding to the dierent instantiations of the parents of a variable are represented by the same ANN, and they share the same network weights and the same training data. Let us denote with Dl fC1 ; : : : ; Cl?1 g the set of the rst l ? 1 cases in the database, and with x(il) and i(l) the instantiations of xi and i in the l-th case respectively. The joint probability P (BS ; D) can be written as:

P (BS ; D) = P (BS )P (D j BS ) = P (BS ) = P (BS )

Ym P (C j D ; B ) l=1

Ym Yn P (x(l) j (l); D ; B ) : l

i

i

l=1 i=1

l

l

S

=

S

Q

(3)

If we assume uninformative priors, or decomposable priors on network structures, of the form P (BS ) = i pi , the probability P (BS ; D) is decomposable. In fact, we can interchange the two products in Equation 3, so as to obtain

P (BS ; D) =

Yn [ p Ym P (x(l) j (l); D ; B ) ] = Yn s(x ; ; D) ;

i=1

i

i

l=1

l

i

S

i=1

(4)

i i

where s(xi ; i ; D) is the term between square brackets, and it is only a function of xi and its parents in the network structure BS (pi can be neglected if we assume a uniform prior over the network structures). Notice that the computation of Equation 4 corresponds to the application of the prequential method discussed by Dawid in [11]. The problem now lies with how to estimate each term P (xi j i ; Dl ; BS ). This can be done by means of neural network estimators. Several schemes are available for training a neural network to approximate a given probability distribution, or density. Notice that the calculation of each term s(xi ; i ; D) can be computationally very expensive. For each node xi , computing s(xi ; i ; D) requires the training of m ANNs, where m is the size of the database. To reduce this computational cost, we use the following approximation, which we call the t-invariance approximation: for any l 2 f1; : : :; mg, given the probability P (xi j i ; Dl ; BS ), at least t (1 t m ? l) new cases are needed in order to alter such probability. That is, for each positive integer h, such that h < t, we assume P (xi j i ; Dl+h ; BS ) = P (xi j i ; Dl ; BS ). Intuitively, this approximation implies the assumption that, given our present belief about the value of each P (xi j i ; Dl ; BS ), at least t new cases are needed to revise this belief. By making this approximation, we achieve a t-fold reduction in the computation needed, since we now need to build and train only m=t ANNs for each xi , instead of the original m. In fact, application of the t-invariance approximation to the computation of a given s(xi ; i ; D) yields:

s(xi ; i ; D) =

k+1) Ym P (x(l) j (l); D ; B ) = m=tY?1 t(Y P (x(l) j (l) ; D l=1

i

i

l

S

k=0 l=tk+1

4

i

i

tk ; BS ):

(5)

With regard to the choice of an appropriate value for t, we can either select a constant value for t, or we can choose to increment t as the size of the training database Dl increases. The second approach seems preferable. When estimating P (xi j i ; Dl ; BS ), this estimate will be very sensitive to the addition of new cases when l is small, but will become increasingly insensitive to the addition of new cases as l grows. For example, when l = 1, adding a new case to the training data set means doubling it, while if l = 10; 000 an additional case in the data set is very unlikely to make a signi cant dierence. A scheme for the incremental updating of t can be summarized in the equation t = dle, where l is the number of cases already seen (i.e., the cardinality of Dl ), and 0 < 1. For example, given a data set of 50 cases, the updating scheme t = d0:5le would require the training of the ANN estimators P (xi j i ; Dl ; BS ) for l = 1; 2; 3; 5; 8; 12; 18; 27; 41.

4 Evaluation In this section, we describe the experimental evaluation we conducted to test the feasibility of use of the ANN-based scoring metric developed in the previous section.

Methodology All the experiments are performed on the belief network ALARM, a multiply-connected network originally developed to model anesthesiology problems that may occur during surgery [2]. It contains 37 nodes/variables and 46 arcs. The variables are all discrete, and take between 2 and 4 distinct values. The database used in the experiments was generated from ALARM, and it is the same database used in [10]. In the experiments, we use a modi cation of the algorithm K2 [10]. The modi ed algorithm, which we call ANN-K2, replaces the closed-form scoring metric developed in [10] with the ANN-based scoring metric of Equation (5). The performance of ANN-K2 is measured in terms of accuracy of the recovered network structure, by counting the number of edges added and omitted with respect to the ALARM network; and in terms of the accuracy of the learned joint probability distribution, by computing its cross entropy with respect to the joint probability distribution of ALARM. The learning performance of ANN-K2 is also compared with the performance of K2. To train the ANNs, we used the conjugate-gradient search algorithm described in [17].

Architecture of the ANN estimators Since all the variables in the ALARM network are discrete, the ANN estimators to be included in the scoring metric are de ned based on the softmax model [5] that we now describe. Given a variable xi , with ni values and set of parents i , the conditional probability distribution P P (xi j i) is approximated by a neural network with ni output units, and n input units, where n = j2 (nj ? 1). The ni output units de ne the conditional probabilities P (xi = vk j i ); k = 1; : : : ; ni , as follows: i

f ( ) P (xi = vk j i ) = Pne ef ( ) ; k

i

j =1

i

j

i

where fk (i ) is the linear output of the k-th output unit for the input i . The use of n input units is due to the fact that the variables in ALARM are ordinal variables, with numerical values indicating qualitative measures, such as \low", \medium", \high". The representation of an ordinal variable taking nj values by means of nj ? 1 indicator variables is common practice in statistical regression. As a regularization technique, we augment the training set so as to induce a uniform conditional probability over the unseen instantiantions of the ANN input. In particular, given the probability P (xi j i ; Dl ) to be estimated, and assuming xi is a k-valued variable, for each instantiation i0 that does not appear in the 5

Data set 100 500 1000 2000 3000

Algo. ANN-K2 K2 ANN-K2 K2 ANN-K2 K2 ANN-K2 K2 ANN-K2 K2

arcs +

m 0.19 0.75 0.19 0.22 0.24 0.11 0.19 0.05 0.16 0.00

s.d. 0.40 1.28 0.40 0.42 0.49 0.31 0.40 0.23 0.37 0.00

arcs ?

m 0.62 0.22 0.22 0.11 0.22 0.03 0.11 0.03 0.05 0.03

cross entropy

s.d. 0.86 0.48 0.48 0.31 0.48 0.16 0.31 0.16 0.23 0.16

m 0.23 0.08 0.04 0.02 0.05 0.01 0.02 0.005 0.01 0.004

med .051 .070 .010 .010 .005 .006 .002 .002 .001 .001

s.d. 0.52 0.10 0.11 0.02 0.15 0.01 0.06 0.007 0.017 0.005

time (secs)

m 130 0.44 1077 0.13 6909 0.34 6458 0.46 11155 1.02

med 88 .06 480 .06 4866 .23 4155 .44 4672 .84

s.d. 159 1.48 1312 0.22 6718 0.46 7864 0.65 2136 1.11

Table 1: Comparison of the performance of ANN-K2 and of K2 in terms of number of arcs wrongly added (+), number of arcs wrongly omitted (?), cross entropy, and computation time. Each column reports the mean m, the

median med, and the standard deviation s.d. of the corresponding measure over the 37 nodes/variables of ALARM. The median for the number of arcs added and omitted is always 0, and is not reported.

database Dl , we generate k new cases, with i instantiated to i0 , and xi taking each of its k values. As a result, the neural network estimates P (xi j i0 ; Dl ) to be uniform, with P (xi j i0 ; Dl ) = 1=k for each of xi 's values x11 ; : : : ; x1k .

Results We ran simulations where we varied the size of the training data set (100, 500, 1000, 2000, and 3000 cases), and the value of in the updating scheme t = dle described in Section 3. We used the settings = 0:35, and = 0:5. For each run, we measured the number of arcs added, the number of arcs omitted, the cross entropy, and the computation time, for each variable in the network. That is, we considered each node, together with its parents, as a simple BBN, and collected the measures of interest for each of these BBNs. Table 1 reports mean and standard deviation of each measure over the 37 variables of ALARM, for both ANN-K2 and K2. The results for ANN-K2 shown in Table 1 correspond to the setting = 0:5, since their dierence from the results corresponding to the setting = 0:35 was not statistically signi cant, for any of the given measures. Standard t-tests were performed to assess the signi cance of the dierence between the measures for K2 and the measures for ANN-K2, for each data set cardinality. No technique to correct for multiple-testing was applied. Most measures show no statistically signi cant dierence, either at the 0.05 level or at the 0.01 level (most p values are well above 0.2). In the simulation with 100 cases, both the dierence between the mean number of arcs added and the dierence between the mean number of arcs omitted are statistically signi cant (p ' 0:01). However, these dierences cancel out, in that ANN-K2 adds fewer extra arcs than K2, and K2 omits fewer arcs than ANN-K2. This is re ected in the corresponding cross entropies, whose dierence is not statistically signi cant (p = 0:08). In the simulation with 1000 cases, only the dierence in the number of arcs omitted is statistically signi cant (p ' :03). Finally, in the simulation with 3000 cases, only the dierence in the number of arcs added is statistically signi cant (p ' :02). K2 misses a single arc, and does not add any extra arc, and this is the best result to date. By comparison, ANN-K2 omits 2 arcs, and adds 5 extra arcs. For the simulation with 3000 cases, we also computed Wilcoxon rank sum tests. The results were consistent with the t-test results, showing a statistically signi cant dierence only in the number of arcs added. Finally, as it can be noted in Table 1, the dierence in computation time is of several order of magnitude, thus making a statistical analysis super uous. A natural question to ask is how sensitive is the learning procedure to the order of the cases in the training set. Ideally, the procedure would be insensitive to this order. Since we are using ANN estimators, however, which perform a greedy search in the solution space, particular permutations of the training cases might cause the ANN estimators to be more susceptible to getting stuck in local maxima. We performed some preliminary experiments to test the sensitivity of the learning procedure to the order of the cases in the 6

training set. We ran few simulations in which we randomly changed the order of the cases. The recovered structure was identical in all simulations. Morevoer, the dierence in cross entropy for dierent orderings of the cases in the training set showed not to be statistically signi cant.

5 Conclusions In this paper we presented a novel method for learning Bayesian belief networks from data based on the use of arti cial neural networks as probability distribution estimators. As a preliminary evaluation, we have compared the performance of the new algorithm with the performance of K2, a well established learning algorithm for discrete domains, for which extensive empirical evaluation is available [1, 10]. With regard to the learning accuracy of the new method, the results are encouraging, being comparable to state-of-the-art results for the chosen domain. The next step is the application of this method to domains where current techniques for learning BBNs from data are not applicable, namely domains with continuous variables not normally distributed, and domains with mixtures of continuous and discrete variables. The main drawback of the new algorithm is its time requirements. However, in this preliminary evaluation, our main concern was the learning accuracy of the algorithm, and little eort was spent in trying to optimize its time requirements. We believe there is ample room for improvement in the time performance of the algorithm. More importantly, the scoring metric of Section 3 provides a general framework for experimenting with dierent classes of probability estimators. In this paper we used ANN estimators, but more ecient estimators can easily be adopted, especially if we assume the availability of prior information on the class of probability distributions to be used.

Acknowledgments This work was funded by grant IRI-9509792 from the National Science Foundation.

References [1] C. Aliferis and G. F. Cooper. An evaluation of an algorithm for inductive learning of bayesian belief networks using simulated data sets. In R. L. de Mantras and D. Poole, editors, Proceedings of the 10th Conference of Uncertainty in Arti cial Intelligence, pages 8{14, San Francisco, California, 1994. Morgan Kaumann Publishers. [2] I. Beinlich, H. Suermondt, H. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In 2nd Conference of AI in Medicine Europe, pages 247{256, London, England, 1989. [3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995. [4] R. Bouckaert. Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference of Uncertainty in Arti cial Intelligence, pages 102{109, San Francisco, California, 1994. Morgan Kaumann Publishers. [5] J. Bridle. Probabilistic interpretation of feedforward classi cation network outputs, with relationships to statistical pattern recognition. In Neuro-computing: Algorithms, Architectures and Applications. Springer Verlag, New York, 1989. [6] W. Buntine. Theory re nement on Bayesian networks. In Proceedings of the 7th Conference of Uncertainty in AI, pages 52{60, 1991. 7

[7] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996. [8] D. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: search methods and experimental results. In Proceedings of 5th Workshop on Arti cial Intelligence and Statistics, January 1995. [9] G. Cooper and E. Herskovits. A Bayesian method for constructing Bayesian belief networks from databases. In Proceedings of the 7th Conference of Uncertainty in Arti cial Intelligence, pages 86{94, Los Angeles, CA, 1991. [10] G. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, 9:309{347, 1992. [11] A. Dawid. Present position and potential developments: Some personal views. statistical theory. the prequential approach (with discussion). Journal of Royal Statistical Society A, 147:278{292, 1984. [12] M. Druzdzel, L. C. van der Gaag, M. Henrion, and F. Jensen, editors. Building probabilistic networks: where do the numbers come from?, IJCAI-95 Workshop, Montreal, Canada, 1995. [13] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Research, One Microsoft Way, Redmond, WA 98052, 1994. [14] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 1995. [15] R. Hofmann and V. Tresp. Discovering structure in continuous variables using Bayesian networks. In M. E. H. D. S. Touretzky, M. C. Mozer, editor, Advances in Neural Information Processing Systems 8. MIT Press, 1995. [16] W. Lam and F. Bacchus. Learning Bayesian belief networks { an approach based on the MDL principle. Computational Intelligence, 10(4), 1994. [17] M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525{ 533, 1993. [18] R. Neuneier, F. Hergert, W. Finno, and D. Ormoneit. Estimation of conditional densities: A comparison of neural network approaches. In Proc. of Int. Conference on Arti cial NNs, volume 1, pages 689{692, New York, 1994. [19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: networks of plausible inference. Morgan Kaufman Publishers, Inc., 1988. [20] M. Richard and R. Lippman. Neural network classi ers estimate Bayesian a-posteriori probabilities. Neural Computation, 3:461{483, 1991. [21] D. Spiegelhalter and S. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579{605, 1990. [22] P. Spirtes, C. Glymour, and R. Scheines. Causality, prediction and search. Springer{Verlag, 1992. [23] J. Suzuky. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings of the 9th Conference of Uncertainty in Arti cial Intelligence, pages 266{273, 1993. [24] T. Verma and J. Pearl. An algorithm for deciding if a set of observerd independencies has a causal explanation. In Proceedings of 8th Conference of Uncertainty in AI, pages 323{330, 1992.

8