WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Speech recognition with mixtures of bayesian networks    
United States Patent6336108   
Link to this pagehttp://www.wikipatents.com/6336108.html
Inventor(s)Thiesson; Bo (Woodinville, WA); Meek; Christopher A. (Kirkland, WA); Chickering; David Maxwell (Redmond, WA); Heckerman; David Earl (Bellevue, WA); Alleva; Fileno A. (Redmond, WA); Hwang; Mei-Yuh (Redmond, WA)
AbstractThe invention performs speech recognition using an array of mixtures of Bayesian networks. A mixture of Bayesian networks (MBN) consists of plural hypothesis-specific Bayesian networks (HSBNs) having possibly hidden and observed variables. A common external hidden variable is associated with the MBN, but is not included in any of the HSBNs. The number of HSBNs in the MBN corresponds to the number of states of the common external hidden variable, and each HSBN models the world under the hypothesis that the common external hidden variable is in a corresponding one of those states. In accordance with the invention, the MBNs encode the probabilities of observing the sets of acoustic observations given the utterance of a respective one of said parts of speech. Each of the HSBNs encodes the probabilities of observing the sets of acoustic observations given the utterance of a respective one of the parts of speech and given a hidden common variable being in a particular state. Each HSBN has nodes corresponding to the elements of the acoustic observations. These nodes store probability parameters corresponding to the probabilities with causal links representing dependencies between ones of said nodes.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 6336108
Speech recognition with mixtures of bayesian networks - US Patent 6336108 Drawing
Speech recognition with mixtures of bayesian networks
Inventor     Thiesson; Bo (Woodinville, WA); Meek; Christopher A. (Kirkland, WA); Chickering; David Maxwell (Redmond, WA); Heckerman; David Earl (Bellevue, WA); Alleva; Fileno A. (Redmond, WA); Hwang; Mei-Yuh (Redmond, WA)
Owner/Assignee     Microsoft Corporation (Redmond, WA)
Patent assignment
All assignments
Publication Date     January 1, 2002
Application Number     09/220,197
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     December 23, 1998
US Classification     706/20 704/256 704/256.4
Int'l Classification     G06F 015/18 G10L 015/00
Examiner     Dam; Tuan Q.
Assistant Examiner     Booker; Kelvin
Attorney/Law Firm     Wallace, Michaelson; Peter L. Michaelson & Wallace; Robert M. ,
Address
Parent Case     CROSS-REFERENCE TO RELATED APPLICATION This application is a continuation-in-part of U.S. application Ser. No. 08/985,114 filed Dec. 4, 1997 by Bo Thiesson et al and entitled "Mixtures of Bayesian Networks".
Priority Data    
USPTO Field of Search     706/20 706/16 706/17 706/18 704/231 704/232 704/256
Patent Tags     speech recognition mixtures bayesian networks
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
6163769
Acero

Dec,2000

[0 after 0 votes]
6141641
Hwang

Oct,2000

[0 after 0 votes]
6076056
Huang

Jun,2000

[0 after 0 votes]
5963903
Hon
704/254
Oct,1999

[0 after 0 votes]
5937384
Huang
704/256
Aug,1999

[0 after 0 votes]
5913193
Huang
704/258
Jun,1999

[0 after 0 votes]
5864810
Digalakis
704/255
Jan,1999

[0 after 0 votes]
5825978
Digalakis
704/256
Oct,1998

[0 after 0 votes]
5812974
Hemphill

Sep,1998

[0 after 0 votes]
5812975
Komori

Sep,1998

[0 after 0 votes]
5802256
Heckerman
706/59
Sep,1998

[0 after 0 votes]
5794197
Alleva

Aug,1998

[0 after 0 votes]
5748850
Sakurai
706/50
May,1998

[0 after 0 votes]
5715367
Gillick
704/254
Feb,1998

[0 after 0 votes]
5710866
Alleva
704/256.4
Jan,1998

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A speech recognition network for inferring parts of speech from acoustic observations having n elements, with a common hidden variable having plural discrete states, comprising:

a plurality of mixtures of Bayesian networks (MBNs), each of said MBNs encoding the probabilities of observing the sets of acoustic observations given the utterance of a respective one of said parts of speech;

each of said MBNs comprising:

a plurality of hypothesis-specific Bayesian networks (HSBNs), each of said HSBNs encoding the probabilities of observing the sets of acoustic observations given the utterance of a respective one of said parts of speech and given the hidden common variable being in a respective one of its states;

a combiner which combines outputs of said HSBNs to produce an MBN output of said MBN;

wherein each one of said HSBNs comprises:

plural nodes, each of said nodes corresponding to one of said n elements of the acoustic observations,

at least some of said plural nodes having dependencies with others of said plural nodes within the one HSBN,

a combiner connected to outputs of said nodes, said nodes receiving at their inputs the state of a respective one of the n elements of a current one of the acoustic observations.

2. The mixture of Bayesian networks of claim 1 wherein said nodes have dependencies with others of said nodes in that each of said nodes stores a set of probability parameters representing causal relationships with causal links between states of respective ones of said variables.

3. The mixture of Bayesian networks of claim 2 wherein the probability parameters stored in a respective one of the nodes correspond to causal links between said acoustic observation element of said one variable and the respective part of speech.

4. The mixture of Bayesian networks of claim 3 wherein said probability parameters stored in said one node define how states of said acoustic observation element influences the likelihood of said respective part of speech.

5. The mixture of Bayesian networks of claim 1 wherein part of speech is a senone.

6. The mixture of Bayesian networks of claim 5 wherein said acoustic observation is a Cepstrum parameter.

7. The mixture of Bayesian networks of claim 6 wherein:

there are 33 Cepstrum parameters, 6000 senones and said common hidden variable has 20 states.

8. The mixture of Bayesian networks of claim 1 wherein said common hidden variable is external in that it is not represented by any one of the nodes in said mixture of Bayesian networks.

9. The mixture of Bayesian networks of claim 1 wherein each HSBN is associated with an HSBN score, wherein;

each of said HSBNs further comprises an inference input defining observed data corresponding to said acoustic observations and an inference output corresponding to the likelihood of an acoustic observation given the utterance of a corresponding part of speech and given said common hidden variable being in one of its states corresponding to said HSBN; and

said mixture of Bayesian networks further comprises a weight multiplier which weights the inference output of each HSBN by a corresponding HSBN score and combines the weighted HSBN inference outputs into a single inference output of said mixture of Bayesian networks.

10. The mixture of Bayesian networks of claim 9 wherein said HSBN score corresponds to the likelihood of said common external hidden variable being in the corresponding one of the states of said common hidden variable.

11. The mixture of Bayesian networks of claim 10 wherein said HSBN score reflects the goodness of the corresponding HSBN at predicting observed data representing states of said observed variables.

12. The mixture of Bayesian networks of claim 11 wherein said HSBN score is computed by said mixture of Bayesian networks.

13. The mixture of Bayesian networks of claim 1 wherein the number of said HSBNs in said mixture of Bayesian network is selected to optimize the goodness of said mixture of Bayesian network at predicting observed data representing states of said observed variables.

14. The mixture of Bayesian networks of claim 1 wherein the nodes of different ones of said HSBNs represent the same set of hidden and observed variables.

15. The mixture of Bayesian networks of claim 14 wherein said probability parameters in different ones of said HSBNs differ to reflect different states of said common external hidden variable represented by the different ones of said HSBNs.

16. The mixture of Bayesian networks of claim 15 wherein the causal links in different ones of said HSBNs differ in reflecting different states of said common external hidden variable represented by the different ones of said HSBNs.

17. A method using a set of observed data for training a speech recognition network including a plurality of mixtures of Bayesian networks (MBNs), each of said MBNs encoding the probabilities of observing the sets of acoustic observations given the utterance of a respective one of said parts of speech and having a plurality of hypothesis-specific Bayesian networks (HSBNs), each of said HSBNs encoding the probabilities of observing the sets of acoustic observations given the utterance of a respective one of said parts of speech and given a hidden common variable being in a respective one of its states, each one of said HSBNs having plural nodes corresponding to the plural elements of the acoustic observations and storing probability parameters corresponding to said probabilities with causal links representing dependencies between ones of said nodes, said method of training comprising:

for each one of said HSBNs conducting a parameter search for a set of changes in said probability parameters which improves the goodness of said one HSBN in predicting said observed data, and modifying the probability parameters of said one HSBN accordingly;

for each one of said HSBNs, computing a structure score of said one HSBN reflecting the goodness of said one HSBN in predicting said observed data, conducting a structure search for a change in said causal links which improves said structure search score, and modifying the causal links of said one HSBN accordingly.

18. The method of claim 17 wherein the step of computing a structure score of said one HSBN comprises:

computing from said observed data expected complete model sufficient statistics (ECMSS);

computing from said ECMSS sufficient statistics for said one HSBN;

computing said structure score from said sufficient statistics.

19. The method of claim 18 wherein the step of computing said ECMSS comprises:

computing the probability of each combination of the states of the discrete hidden and observed variables;

forming a vector for each observed case in said set of observed data, each entry in said vector corresponding to a particular one of the combinations of the states of said discrete variables; and

summing the vectors over plural cases of said observed data.

20. The method of claim 19 wherein the step of forming a vector is such that each entry in said vector is formed to have plural sub-entries comprising:

(a) the probability of the one combination of the states of the discrete variables,

(b) sub-entry vectors representing the states of the continuous variables.

21. The method of claim 20 wherein each sub-entry is formed such that said sub-entry vector has a vector multiplier corresponding to the probability of the one combination of the states of the discrete variables.

22. The method of claim 21 wherein the step of computing sufficient statistics from said ECMSS comprises computing from said ECMSS the following:

(a) mean,

(b) scatter,

(c) sample size.

23. The method of claim 20 wherein the probability of the one combination of the states of the discrete variables is computed by inference in said mixture of Bayesian networks.

24. The method of claim 17 wherein the steps of conducting a parameter search and modifying said probability parameters are repeated consecutively until a parameter search convergence criteria is met.

25. The method of claim 23 further comprising:

repeating the steps of conducting a parameter search, computing the structure score and conducting a structure search until a structure search convergence criteria is met.

26. The method of claim 25 wherein said structure search convergence criteria comprises a determination of whether the structure score has worsened since a prior repetition of said structure search step.

27. The method of claim 25 wherein said structure search criteria comprises a determination of whether a current performance of the structure search has changed any of said causal links in the one HSBN.

28. The method of claim 17 further comprising:

repeating the steps of conducting a parameter search, computing the structure score and conducting a structure search until a structure search convergence criteria is met.

29. The method of claim 28 wherein said parameter search convergence criteria is a determination of whether the parameter search has converged at a local optimum.

30. The method of claim 28 wherein said parameter search convergence criteria is a determination of whether the parameter search has been repeated a certain number of times.

31. The method of claim 30 wherein said certain number of times is a set number.

32. The method of claim 30 wherein said certain number of times is a function of the number of times the structure search has been repeated.

33. The method of claim 30 wherein said parameter search convergence criteria limits the repetition of said parameter search to a limited number of repetitions and wherein said parameter search is repeated after convergence of said structure search.

34. The method of claim 17 wherein the step of conducting a structure search comprises:

attempting different modifications to said causal links at each node of said one HSBN;

for each one of said different modifications, computing the structure score of the one HSBN;

saving those modifications providing improvements to said structure score.

35. The method of claim 17 further comprising computing a combined score of said mixture of Bayesian networks from the structure scores of the individual HSBNs.

36. The method of claim 35 further comprising associating said mixture of Bayesian networks with said combined score.

37. The method of claim 36 further comprising choosing a different number of states of said discrete hidden and observed variables and repeating said parameter and structure search steps, to generate a different mixtures of Bayesian networks and scores thereof for different numbers of states of said discrete variables.

38. The method of claim 37 further comprising choosing the mixture of Bayesian networks having the highest score.

39. The method of claim 37 further comprising weighting inference outputs of the different mixtures of Bayesian networks in accordance with their individual scores.

40. The method of claim 17 wherein said parameter search is repeated whenever a performance of said structure search results in a change in the structure of said causal links.

41. The method of claim 40 wherein the parameter search is repeated a limited number of times while the structure search is always carried out to convergence.

42. The method of claim 40 wherein the parameter search is repeated to convergence and thereafter the structure search is repeated to convergence.

43. The method of claim 40 wherein the parameter search is repeated by a number of times which is a function of the number of times the structure search as been repeated.

44. The method of claim 40 wherein said parameter search is repeated a fixed number of times and said structure search is repeated a fixed number of times.

45. The method of claim 40 wherein the parameter search is repeated to convergence while the structure search is repeated a limited number of times.

46. The method of claim 40 wherein said parameter search is repeated a number of times which is a function of the number of structure searches performed thus fare while the structure search is repeated a fixed number of times.

47. The method of claim 17 further comprising repeating the steps of performing said parameter search and said structure search and interleaving repetitions of said parameter search and said structure search.

48. The method of claim 17 wherein the step of initializing said HSBNs comprises, for each HSBN:

defining a causal link from each hidden variable node to each continuous observed variable node;

initialize the probability parameters in each node.

49. The method of claim 48 wherein the step of initializing the probability parameters employs the same initial probability parameters from node to node.

50. The method of claim 48 wherein the step of defining a causal link further comprises defining a causal link from each discrete hidden variable node to each discrete observed variable node.

51. The method of claim 17 wherein the step of performing the parameter search comprises searching for a change in the probability parameters in each node which improves the performance of said one HSBN in predicting said observed data.

52. The method of claim 17 wherein one of said hidden variables is a common external discrete hidden variable not represented by any node in said mixture of Bayesian networks, and wherein the number of HSBNs in said mixture of Bayesian networks is equal to the number of states of said common external discrete hidden variable.

53. The method of claim 17 further comprising, for each HSBN, determining an optimum number m of HSBNs in the MBN, whereby m is different for each MBN.

54. A computer-readable medium storing computer-readable instructions for carrying out the steps of claim 17.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

The present invention relates generally to data processing systems and, more particularly, to the generation of Bayesian networks.

BACKGROUND OF THE INVENTION

The advent of artificial intelligence within computer science has brought an abundance of decision-support systems. Decision-support systems are computer systems in which decisions, typically rendered by humans, are recommended and sometimes made. In creating decision-support systems, computer scientists seek to provide decisions with the greatest possible accuracy. Thus, computer scientists strive to create decision-support systems that are equivalent to or more accurate than a human expert. Applications of decision-support systems include medical diagnosis, troubleshooting computer networks, or other systems wherein a decision is based upon identifiable criteria.

One of the most promising new areas for research in decision-support systems is Bayesian networks. A Bayesian network is a representation of the probabilistic relationships among distinctions about the world. Each distinction, sometimes called a variable, can take on one of a mutually exclusive and exhaustive set of possible states. A Bayesian network is expressed as an acyclic-directed graph where the variables correspond to nodes and the relationships between the nodes correspond to arcs. FIG. 1 depicts an examplary Bayesian network 101. In FIG. 1 there are three variables, X.sub.1, X.sub.2, and X.sub.3, which are represented by nodes 102, 106 and 110, respectively. This Bayesian network contains two arcs 104 and 108. Associated with each variable in a Bayesian network is a set of probability distributions. Using conditional probability notation, the set of probability distributions for a variable can be denoted by p(x.sub.i.vertline..PI..sub.i, .xi.), where "p" refers to the probability distribution, where ".PI..sub.i " denotes the parents of variable X.sub.i and where ".xi." denotes the knowledge of the expert. The Greek letter ".xi." indicates that the Bayesian network reflects the knowledge of an expert in a given field. Thus, this expression reads as follows: the probability distribution for variable X.sub.i given the parents of X.sub.i and the knowledge of the expert. For example, X.sub.1 is the parent of X.sub.2. The probability distributions specify the strength of the relationships between variables. For instance, if X.sub.1 has two states (true and false), then associated with X.sub.1 is a single probability distribution p(x.sub.1.vertline..xi.) and associated with X.sub.2 are two probability distributions p(x.sub.2.vertline.x.sub.1 =t, .xi.) and p(x.sub.2.vertline.x.sub.1 =f, .xi.). In the remainder of this specification, .xi. is not specifically mentioned.

The arcs in a Bayesian network convey dependence between nodes. When there is an arc between two nodes, the probability distribution of the first node depends upon the value of the second node when the direction of the arc points from the second node to the first node. For example, node 106 depends upon node 102. Therefore, nodes 102 and 106 are said to be conditionally dependent. Missing arcs in a Bayesian network convey conditional independencies. For example, node 102 and node 110 are conditionally independent given node 106. However, two variables indirectly connected through intermediate variables are conditionally dependent given lack of knowledge of the values ("states" ) of the intermediate variables. Therefore, if the value for node 106 is known, node 102 and node 110 are conditionally dependent.

In other words, sets of variables X and Y are said to be conditionally independent, given a set of variables Z, if the probability distribution for X given Z does not depend on Y. If Z is empty, however, X and Y are said to be "independent" as opposed to conditionally independent. If X and Y are not conditionally independent, given Z, then X and Y are said to be conditionally dependent given Z.

The variables used for each node may be of different types. Specifically, variables may be of two types: discrete or continuous. A discrete variable is a variable that has a finite or countable number of states, whereas a continuous variable is a variable that has an uncountably infinite number of states. All discrete variables considered in this specification have a finite number of states. An example of a discrete variable is a Boolean variable. Such a variable can assume only one of two states: "true" or "false." An example of a continuous variable is a variable that may assume any real value between -1 and 1. Discrete variables have an associated probability distribution. Continuous variables, however, have an associated probability density function ("density"). Where an event is a set of possible outcomes, the density p(x) for a variable "x" and events "a" and "b" is defined as: ##EQU1##

where p(a.ltoreq.x.ltoreq.b) is the probability that x lies between a and b. Conventional systems for generating Bayesian networks cannot use continuous variables in their nodes.

FIG. 2 depicts an example Bayesian network for troubleshooting automobile problems. The Bayesian network of FIG. 2 contains many variables 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, and 234, relating to whether an automobile will work properly, and arcs 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 268. A few examples of the relationships between the variables follow. For the radio 214 to work properly, there must be battery power 212 (arc 246). Battery power 212, in turn, depends upon the battery working properly 208 and a charge 210 (arcs 242 and 244). The battery working properly 208 depends upon the battery age 202 (arc 236). The charge 210 of the battery depends upon the alternator 204 working properly (arc 238) and the fan belt 206 being intact (arc 240). The battery age variable 202, whose values lie from zero to infinity, is an example of a continuous variable that can contain an infinite number of values. However, the battery variable 208 reflecting the correct operations of the battery is a discrete variable being either true or false.

The automobile troubleshooting Bayesian network also provides a number of examples of conditional independence and conditional dependence. The nodes operation of the lights 216 and battery power 212 are dependent, and the nodes operation of the lights 216 and operation of the radio 214 are conditionally independent given battery power 212. However, the operation of the radio 214 and the operation of the lights 216 are conditionally dependent. The concept of conditional dependence and conditional independence can be expressed using conditional probability notation. For example, the operation of the lights 216 is conditionally dependent on battery power 212 and conditionally independent of the radio 214 given the battery power 212. Therefore, the probability of the lights working properly 216 given both the battery power 212 and the radio 214 is equivalent to the probability of the lights working properly given the battery power alone, P(Lights.vertline.Battery Power, Radio)=P(Lights.vertline.Battery Power). An example of a conditional dependence relationship is the probability of the lights working properly 216 given the battery power 212 which is not equivalent to the probability of the lights working properly given no information. That is, p(Lights.vertline.Battery Power).noteq.p(Lights).

There are two conventional approaches for constructing Bayesian networks. Using the first approach ("the knowledge-based approach"), a person known as a knowledge engineer interviews an expert in a given field to obtain the knowledge of the expert about the field of expertise of the expert. The knowledge engineer and expert first determine the distinctions of the world that are important for decision making in the field of the expert. These distinctions correspond to the variables of the domain of the Bayesian network. The "domain" of a Bayesian network is the set of all variables in the Bayesian network. The knowledge engineer and the expert next determine the dependencies among the variables (the arcs) and the probability distributions that quantify the strengths of the dependencies.

In the second approach ("called the data-based approach"), the knowledge engineer and the expert first determine the variables of the domain. Next, data is accumulated for those variables, and an algorithm is applied that creates a Bayesian network from this data. The accumulated data comes from real world instances of the domain. That is, real world instances of decision making in a given field. Conventionally, this second approach exists for domains containing only discrete variables.

After the Bayesian network has been created, the Bayesian network becomes the engine for a decision-support system. The Bayesian network is converted into a computer-readable form, such as a file and input into a computer system. Then, the computer system uses the Bayesian network to determine the probabilities of variable states given observations, determine the benefits of performing tests, and ultimately recommend or render a decision. Consider an example where a decision-support system uses the Bayesian network of FIG. 2 to troubleshoot automobile problems. If the engine for an automobile did not start, the decision-based system could request an observation of whether there was gas 224, whether the fuel pump 226 was in working order by possibly performing a test, whether the fuel line 228 was obstructed, whether the distributor 230 was working, and whether the spark plugs 232 were working. While the observations and tests are being performed, the Bayesian network assists in determining which variable should be observed next.

U.S. application Ser. No. 08/240,019 filed May 9, 1994 entitled "Generating Improved Belief Networks" describes an improved system and method for generating Bayesian networks (also known as "belief networks") that utilize both expert data received from an expert ("expert knowledge") and data received from real world instances of decisions made ("empirical data"). By utilizing both expert knowledge and empirical data, the network generator provides an improved Bayesian network that is more accurate than conventional Bayesian networks. In addition, the exemplary embodiment facilitates the use of continuous variables in Bayesian networks and handles missing data in the empirical data that is used to construct Bayesian networks.

Expert knowledge consists of two components: an equivalent sample size or sizes ("sample size"), and the prior probabilities of all possible Bayesian-network structures ("priors on structures"). The effective sample size is the effective number of times that the expert has rendered a specific decision. For example, a doctor with 20 years of experience diagnosing a specific illness may have an effective sample size in the hundreds. The priors on structures refers to the confidence of the expert that there is a relationship between variables (e.g., the expert is 70 percent sure that two variables are related). The priors on structures can be decomposed for each variable-parent pair known as the "prior probability" of the variable-parent pair. Empirical data is typically stored in a database. An example of acquiring empirical data can be given relative to the Bayesian network of FIG. 2. If, at a service station, a log is maintained for all automobiles brought in for repair, the log constitutes empirical data. The log entry for each automobile may contain a list of the observed state of some or all of the variables in the Bayesian network. Each log entry constitutes a case. When one or more variables are unobserved in a case, the case containing the unobserved variable is said to have "missing data." Thus, missing data refers to when there are cases in the empirical data database that contain no observed value for one or more of the variables in the domain. An assignment of one state to each variable in a set of variables is called an "instance" of that set of variables. Thus, a "case" is an instance of the domain. The "database" is the collection of all cases.

An example of a case can more clearly be described relative to the Bayesian network of FIG. 2. A case may consist of the battery age 202 being 2.132 years old, the battery working properly 208 being true, the alternator working properly 204 being true, the fan belt being intact 206 being true, the charge 210 being sufficient, the battery power 212 being sufficient, the starter working properly 220 being true, the engine turning over 218 being true, the amount of gas 224 being equal to 5.3 gallons, the fuel pump working properly 226 being true, the fuel line working properly 228 being true, the distributor working properly 230 being false, the spark plugs working properly 232 being true and the engine starting 234 being false. In addition, the variables for the gas gauge 222, the radio working properly 214 and the lights working properly 216 may be unobserved. Thus, the above-described case contains missing data.

BACKGROUND RELATIVE TO DECISION GRAPHS

Although Bayesian networks are quite useful in decision-support systems, Bayesian networks require a significant amount of storage. For example, in the Bayesian network 300 of FIG. 3A, the value of nodes X and Y causally influences the value of node Z. In this example, nodes X, Y, and Z have binary values of either 0 or 1. As such, node Z maintains a set of four probabilities, one probability for each combination of the values of X and Y, and stores these probabilities into a table 320 as shown in FIG. 3B. When performing probabilistic inference, it is the probabilities in table 320 that are accessed. As can be seen from table 320, only the probabilities for Z equaling 0 are stored; the probabilities for Z equaling 1 need not be stored as they are easily derived by subtracting the probability of when Z equals 0 from 1. As the number of parents of a node increases, the table in the node that stores the probabilities becomes multiplicatively large and requires a significant amount of storage. For example, a node having binary values with 10 parents that also have binary values requires a table consisting of 1,024 entries. And, if either the node or one of its parents has more values than a binary variable, the number of probabilities in the table increases multiplicatively. To improve the storage of probabilities in a Bayesian network node, some conventional systems use a tree data structure. A tree data structure is an acyclic, undirected graph where each vertex is connected to each other vertex via a single path. The graph is acyclic in that there is no path that both emanates from a vertex and returns to the same vertex, where each edge in the path is traversed only once. FIG. 3C depicts an example tree data structure 330 that stores into its leaf vertices 336-342 the probabilities shown in table 320 of FIG. 3B. Assuming that a decision-support system performs probabilistic inference with X's value being 0 and Y's value being 1, the following steps occur to access the appropriate probability in the tree data structure 330: First, the root vertex 332, vertex X, is accessed, and its value determines the edge or branch to be traversed. In this example, X's value is 0, so edge 344 is traversed to vertex 334, which is vertex Y. Second, after reaching vertex Y, the value for this vertex determines which edge is traversed to the next vertex. In this example, the value for vertex Y is 1, so edge 346 is traversed to vertex 338, which is a leaf vertex. Finally, after reaching the leaf vertex 338, which stores the probability for Z equaling 0 when X=0 and Y =1, the appropriate probability can be accessed. As compared to a table, a tree is a more efficient way of storing probabilities in a node of a Bayesian network, because it requires less space. However, tree data structures are inflexible in the sense that they can not adequately represent relationships between probabilities. For example, because of the acyclic nature of tree data structures, a tree cannot be used to indicate some types of equality relationships where multiple combinations of the values of the parent vertices have the same probability (i.e., refer to the same leaf vertex). This inflexibility requires that multiple vertices must sometimes store the same probabilities, which is wasteful. It is thus desirable to improve Bayesian networks with tree distributions.

BACKGROUND RELATIVE TO COLLABORATIVE FILTERING

Collaborative filtering systems have been developed that predict the preferences of a user. The term "collaborative filtering" refers to predicting the preferences of a user based on known attributes of the user, as well as known attributes of other users. For example, a preference of a user may be whether they would like to watch the television show "I Love Lucy" and the attributes of the user may include their age, gender, and income. In addition, the attributes may contain one or more of the user's known preferences, such as their dislike of another television show. A user's preference can be predicted based on the similarity of that user's attributes to other users. For example, if all users over the age of 50 with a known preference happen to like "I Love Lucy" and if that user is also over 50, then that user may be predicted to also like "I Love Lucy" with a high degree of confidence. One conventional collaborative filtering system has been developed that receives a database as input. The database contains attribute-value pairs for a number of users. An attribute is a variable or distinction, such as a user's age, gender or income, for predicting user preferences. A value is an instance of the variable. For example, the attribute age may have a value of 23. Each preference contains a numeric value indicating whether the user likes or dislikes the preference (e.g., 0 for dislike and 1 for like). The data in the database is obtained by collecting attributes of the users and their preferences. It should be noted that conventional collaborative filtering systems can typically only utilize numerical attributes. As such, the values for non-numeric