WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System and method for mining generalized association rules in databases    
United States Patent5615341   
Link to this pagehttp://www.wikipatents.com/5615341.html
Inventor(s)Agrawal; Rakesh (San Jose, CA); Srikant; Ramakrishnan (San Jose, CA)
AbstractA system and method for discovering consumer purchasing tendencies includes a computer-implemented program which identifies consumer transaction itemsets that are stored in a database and which appear in the database a user-defined minimum number of times, referred to as minimum support. The itemsets contain items that are characterized by a hierarchical taxonomy. Then, the system discovers association rules, potentially across different levels of the taxonomy, in the itemsets by comparing the number of times each of the large itemsets appears in the database to the number of times particular subsets of the itemset appear in the database. When the relationship exceeds a predetermined minimum confidence value, the system outputs a generalized association rule which is representative of purchasing tendencies of consumers. The set of generalized association rules can be pruned of uninteresting rules, i.e., association rules which do not occur at a frequency that is significantly different than what is expected based upon the frequency of occurrence of the rule's ancestors.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Inventor     Agrawal; Rakesh (San Jose, CA); Srikant; Ramakrishnan (San Jose, CA)
Owner/Assignee     International Business Machines Corporation (Armonk, NY)
Patent assignment
All assignments
Publication Date     March 25, 1997
Application Number     08/436,794
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     May 8, 1995
US Classification    
Int'l Classification    
Examiner     McElheny Jr.; Donald E.
Assistant Examiner    
Attorney/Law Firm     Baker, Maxham, Jester & Meador
Address
Parent Case    
Priority Data    
USPTO Field of Search    
Patent Tags     mining generalized association rules databases
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5515270
Weinblatt
705/14
May,1996

[0 after 0 votes]
5459306
Stein
235/383
Oct,1995

[0 after 0 votes]
5430644
Deaton
705/14
Jul,1995

[0 after 0 votes]
5369571
Metts
705/10
Nov,1994

[0 after 0 votes]
5056019
Schultz
705/14
Oct,1991

[0 after 0 votes]
4949256
Humble
705/14
Aug,1990

[0 after 0 votes]
5173851
Off
705/14
Dec,1969

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


We claim:

1. A computer program device comprising:

a computer program storage device readable by a digital processing apparatus; and

a program means on the program storage device and including instructions executable by the digital processing apparatus for performing method steps for identifying association rules in itemsets with a hierarchical taxonomy on items of the itemsets, the taxonomy defining descendant and ancestor relationships between the items, the method steps comprising:

entering an itemset into a set of large itemsets when the number of times the itemset is present in a database of transactions establishes a support value that exceeds a predefined minimum support value;

for at least some of the itemsets in the set of large itemsets, determining the number of times selected subsets of the itemsets appear in transactions in the database; and

(d) outputting an association rule when the number of times a selected subset appears in the database bears a predetermined relationship to the number of times the associated itemset appears in the database and thereby satisfies a minimum confidence constraint.

2. The computer program device of claim 1, further comprising the steps of:

concatenating itemsets in the set of large itemsets in accordance with a predetermined concatenation regime to generate a next set of candidate large itemsets and discarding all candidate large itemsets whose subsets are not large itemsets;

comparing each itemset in the next set of candidate large itemsets to the itemsets in the database to determine the number of times the candidate large itemset is present in the database: and

entering a candidate large itemset into a next forward set of large itemsets only when the number of times the candidate large itemset is present in the database is greater than the minimum support value.

3. The computer program device of claim 2, wherein the taxonomy is a directed acyclic graph (DAG) taxonomy, and at least some of the itemsets contain items that are descendant items and ancestor items in the DAG taxonomy.

4. The computer program device of claim 3, wherein ancestors of an item are entered into transactions that include the item only when the ancestor appears in one of the itemsets in the next set of candidate large itemsets, and the method further comprises the steps of:

accessing the DAG taxonomy to predetermine the ancestors of each item prior to entering the ancestors into the set of large itemsets; and

when an itemset in the next set of candidate large itemsets contains an item and an ancestor of the item deleting the itemset from the next set of candidate large itemsets.

5. The computer program device of claim 4, further comprising the steps of entering ancestors of an item into transactions that include the item and deleting duplicate item entries from the transactions, prior to determining the number of times the associated itemset is present in the database.

6. The computer program device of claim 4, wherein the taxonomy is derived on the itemsets and is characterized by a hierarchical order of levels, the levels ranging from highest level to lowest level successively represented by integers 0 through n, and itemsets containing items at level 0 are accessed for entry into the set of large itemsets before itemsets containing items at other levels.

7. The computer program device of claim 6, wherein an itemset at level 1 is deleted if the itemset is a descendant of an ancestor itemset in level 0 when the number of times the ancestor itemset is present in the database does not exceed the predefined minimum support value, the deleting step being performed before itemsets at level 1 are accessed for entry into the set of large itemsets.

8. The computer program device of claim 4, wherein the taxonomy is characterized by a hierarchical order of levels, the levels ranging from highest level to lowest level successively represented by integers 0 through n, the method further comprising the steps of:

accessing a selected sample portion of the database to estimate candidate large itemsets;

entering itemsets estimated to be large into the next set of candidate large itemsets: and

entering itemsets not estimated to be large but having ancestor itemsets all of which are estimated to be large into the next set of candidate large itemsets.

9. The computer program device of claim 8, further comprising the step of determining the number of times descendant itemsets of an itemset that was not estimated to be large are present in the database when the next set of candidate large itemsets is counted.

10. The computer program storage device of claim 1, further comprising the steps of:

determining expected confidence and support values for an association rule based upon the confidence and support of an ancestor of the association rule; and

identifying the association rule as interesting when the confidence and support values of the association rule exceeds the expected confidence and support values by a predetermined factor.

11. A computer program product for use with a computer system, a central processing unit and means coupled to the central processing unit for storing a database to identify association rules in itemsets of transactions which are stored in the database, the itemsets being characterized by items in a hierarchical taxonomy, comprising:

a data storage device including a computer usable medium having computer readable program means for identifying association rules in itemsets having items anywhere in the hierarchy of the taxonomy, the computer usable code means having:

computer readable code means for accessing an itemset;

computer readable code means for entering the itemset into a set of large itemsets when the number of times the itemset is present in the database exceeds a predefined minimum support value:

computer readable code means for determining for at least some of the itemsets in the set of large itemsets, the number of times selected subsets of the itemsets appear in transactions in the database: and

computer readable code means for outputting an association rule when the number of times a selected subset appears in the database bears a predetermined minimum confidence relationship to the number of times the associated itemset appears in the database, thereby satisfying a minimum confidence constraint.

12. The computer program product of claim 11, wherein the taxonomy is a directed acyclic graph (DAG) taxonomy, and at least some of the itemsets contain items that are descendant items or ancestor items in the DAG taxonomy.

13. The computer program product of claim 12, wherein ancestors of an item are entered into transactions that include the item only when the ancestor appears in one of the itemsets in the next set of candidate large itemsets, and the computer program product further comprises:

computer readable code means for accessing the DAG taxonomy to predetermine the ancestors of each item prior to entering the ancestors into the set of large itemsets; and

computer readable code means for deleting, when an itemset in the next set of candidate large itemsets contains an item and an ancestor of the item, the itemset from the next set of candidate large itemsets.

14. The computer program product of claim 13, further comprising computer readable code means for entering ancestors of an item into a transaction that includes the item and deleting duplicate item entries from the transaction, prior to determining the number of times the associated itemset is present in the database.

15. The computer program product of claim 13, wherein the taxonomy derived on the candidate itemsets is characterized by a hierarchical order of levels, the levels ranging from highest level to lowest level successively represented by integers 0 through n, and the computer readable code means accesses itemsets at level 0 for entry into the set of large itemsets before itemsets at other levels are accessed.

16. The computer program product of claim 15, wherein the computer readable code means deletes an itemset at level 1 if the itemset is a descendant of an ancestor itemset in level 0. When the number of times the ancestor itemset is present in the database does not exceed the predefined minimum support value, before itemsets at level 1 are accessed for entry into the set of large itemsets.

17. The computer program product of claim 13, wherein the taxonomy is characterized by a hierarchical order of levels, the levels ranging from highest level to lowest level successively represented by integers 0 through n, and the computer program product further comprises:

computer readable code means for accessing a selected sample portion of the database to estimate candidate large itemsets;

computer readable code means for entering itemsets estimated to be large into the next set of candidate large itemsets; and

computer readable code means for entering itemsets not estimated to be large but having all ancestor itemsets estimated to be large into the next set of candidate large itemsets.

18. The computer program product of claim 17, further comprising computer readable code means for determining the number of times descendant itemsets of an itemset that was not estimated to be large are present in the database when the next set of candidate large itemsets is counted.

19. The computer program storage device of claim 11, further comprising:

computer readable code means for determining expected confidence and support values for an association rule based upon the confidence and support of an ancestor of the association rule; and

computer readable code means for identifying the association rule as interesting when the confidence and support values of the association rule exceeds the expected confidence and support values by a predetermined factor.

20. A program storage device readable by a digital processing apparatus and tangibly embodying a program of instructions executable by the digital processing apparatus to perform method steps for identifying generalized association rules of itemsets in transactions stored in a database and having more than one item, the items being characterized by a taxonomic structure, so as to discover customer purchasing tendencies, the method steps comprising:

identifying as large itemsets those itemsets having items located anywhere in the taxonomic structure and recurring with at least a user-defined minimum support;

discovering association rules between the large itemsets and subsets thereof when the subset recurrence bears a predetermined relationship to itemset recurrence; and

outputting the association rules as representative of customer purchasing tendencies.

21. The program storage device of claim 20, comprising the step of generating a next set of candidate large itemsets, wherein ancestors of an item are entered into at least one transaction that includes the item only when the ancestor appears in one of the itemsets in the next set of candidate large itemsets, and the method steps further comprise:

accessing the taxonomic structure to predetermine ancestors of each item prior to entering the ancestors into the set of large itemsets; and

when an itemset in the next set of candidate large itemsets contains an item and an ancestor of the item, deleting the itemset from the next set of candidate large itemsets.

22. The program storage device of claim 21, further comprising the steps of:

determining expected confidence and support values for an association rule based upon the confidence and support of an ancestor of the association rule; and

identifying the association rule as interesting when the confidence and support values of the association rule exceeds the expected confidence and support values by a predetermined factor.

23. A database mining system for discovering association rules in itemsets having items stored in a taxonomically structured database, comprising:

a large itemset generator for generating large itemsets when the itemsets have a support in a transaction database at least equal to a user-defined minimum support value;

an association rule generator for receiving the large itemsets and outputting an association rule when an itemset bears a confidence relationship to at least one of its subsets equal to or greater than a predetermined confidence relationship; and

a rule pruner for identifying an association rule as interesting when the support and the confidence relationship respectively exceed an expected support and an expected confidence relationship by a preselected factor.

24. A computer-based system for discovering purchasing tendencies of consumers by identifying association rules between itemsets of transactions and subsets of the itemsets, the subsets including one or more items, comprising:

a multi-level taxonomic structure accessible by the computer for storing the items in a hierarchical relationship;

a large itemset generator accessing the taxonomic structure and the transactions for determining a first number of times an itemset appears in the transactions and for designating the itemset as a large itemset when the first number of times exceeds a minimum support value; and

an association rule discoverer accessing the large itemset generator for determining a second number of times at least one subset of an itemset appears in the transactions, the association rule discoverer outputting an association rule representative of purchasing tendencies of consumers when the first number of times bears a predetermined minimum confidence relationship to the second number of times.
 Description Submit all comments and votes
 


CROSS REFERENCE TO RELATED APPLICATIONS

This application contains material related to the following co-pending U.S. patent applications, which are commonly assigned with this application:

U.S. patent application Ser. No. 08/227,428, filed Apr. 14, 1994, for "SYSTEM AND METHOD FOR QUERY OPTIMIZATION USING QUANTILE VALUES OF A LARGE UNORDERED DATA SET";

U.S. patent application Ser. No. 08/398,640, filed Mar. 3, 1995, for "SYSTEM AND METHOD FOR MINING SEQUENTIAL PATTERNS IN A LARGE DATABASE"; and

U.S. patent application Ser. No. 08/415,006, filed Mar. 31, 1995, for "SYSTEM AND METHOD FOR QUICKLY MINING ASSOCIATION RULES IN A DATABASE".

The above-referenced U.S. patent applications Ser. Nos. 08/398,640 and 08/415,006, are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to data processing, and more particularly to "computer database mining" in which generalized association rules between significant transactions that are recorded in a database are discovered. In particular, the invention concerns mining a large database of sales transactions.

2. Description of the Related Art

Customer purchasing habits can provide invaluable marketing information for a wide variety of applications. For example, retailers can create more effective store displays and more effectively control inventory than otherwise would be possible if they know that, given a consumer's purchase of a first set of items, the same consumer can be expected, with some degree of probability, to purchase a particular second set of items along with the first set. In other words, it would be helpful from a marketing standpoint to know association rules between itemsets in a transaction. To illustrate, it would be helpful for a retailer of automotive parts and supplies to be aware of an association rule expressing the fact that 90% of the consumers who purchase automobile batteries and battery cables also purchase battery post brushes and battery post cleanser (referred to as the "consequent" in the terminology of the present invention).

It will be appreciated that advertisers too can benefit from a thorough knowledge of such consumer purchasing tendencies. Still further, catalogue companies can conduct more effective mass mailings if they know the tendencies of consumers to purchase particular sets of items with other sets of items. It is to be understood, however, that although this discussion focusses on the marketing applications of the present invention, database mining and, hence, the principles of the present invention, are useful in many other areas, e.g., business and science.

It happens that until recently, building large, detailed databases that could chronicle thousands, and from a statistical view preferably millions, of consumer transactions, much less deriving useful information from the databases (i.e., mining the databases), was highly impractical. Consequently, marketing and advertising strategies have been based upon anecdotal evidence of purchasing habits, if any at all, and thus have been susceptible to inefficiencies in consumer targeting that have been difficult if not impossible to overcome.

With the advent of modern technology, however, building large databases of consumer transactions has become possible. The ubiquitous bar-code reader can almost instantaneously read so-called basket data, i.e., when a particular item from a particular lot was purchased by a consumer, how many items the consumer purchased, and so on, for automatic electronic storage of the basket data. Further, when the purchase is made with, for example, a credit card, the identity of the purchaser can be almost instantaneously known, recorded, and stored along with the basket data. Still further, vastly improved data storage media have made it possible to electronically store vast amounts of such information for future use.

As alluded to above, however, building a transaction database is only part of the marketing challenge. Another important part is the mining of the database for useful information. Such database mining becomes increasingly problematic as the size of databases expands into the gigabyte and indeed the terabyte range.

Not surprisingly, purchasing tendencies, and, hence, particular regimes of database mining, can be classified many ways. In the above-referenced U.S. patent application Ser. No. 08/415,006, for "SYSTEM AND METHOD FOR QUICKLY MINING ASSOCIATION RULES IN A DATABASE", for example, an effective system is disclosed for quickly mining association rules that indicate purchasing habits during single transactions, i.e., rules that indicate, with user-defined degrees of confidence, which frequently-recurring itemsets are likely to purchased along with other frequently-recurring itemsets in a transaction. In accordance with the present invention, an itemset "frequently occurs" in a database and is referred to as being "large" if it appears in the database with at least a user-defined regularity, referred to herein as "minimum support".

Previous database mining systems, however, including the invention disclosed in the parent application, do not consider mining association rules across different levels of a taxonomy, but instead restricted the items in the mined rules to leaf nodes in databases. Thus, for example, in the case of an itemset taxonomy in which the item "jacket" hierarchicly depends from the item "outerwear", which hierarchicly depends from the item "clothes", the parent invention might generate an association rule that indicates that people who purchase jackets tend to purchase hiking boots at the same time, but it is unable to generate more generalized rules that, e.g., people who purchase outerwear or clothing tend to purchase hiking boots. And, because the support for an item in a taxonomy is not necessarily equal to the sum of the supports of its children, rules cannot be inferred for items at higher levels of taxonomies from rules for items at leaves.

Unfortunately, when association rules are restricted to just the leaves of a taxonomy, many significant associations might escape detection. For example, few consumers might purchase hiking boots with jackets, but many people might perhaps purchase hiking boots with outerwear in general, without previous mining systems so discovering. Moreover, a rule stating that consumers who purchase jackets tend to purchase hiking boots might be discovered by the parent invention, but it can happen that such a rule is not nearly as interesting, from a marketing standpoint, as the fact that consumers who purchase outerwear in general tend to purchase hiking boots. Consequently, by not considering taxonomies, previous systems are unable to prune out non-interesting and redundant rules. It is therefore the focus of the present invention to consider taxonomies and thereby discover generalized association rules which also satisfy; a user-defined interest criterium.

Accordingly, it is an object of the present invention to provide a system and method for mining large databases to discover generalized association rules. Another object of the present invention is to provide a system and method for discovering generalized association rules in itemsets that are stored in a transaction database, based on an item taxonomy. Still another object of the present invention is to provide a system and method for finding interesting association rules which repeat with a user-defined degree of regularity, which satisfy a user-defined degree of confidence, and which satisfy a user-defined interest criterium. Yet another object of the present invention is to provide a system and method for quickly mining large databases which is easy to use and cost-effective.

SUMMARY OF THE INVENTION

The invention concerns a procedure to identify association rules in itemsets, also referred to as transactions, that arc stored in a large database with a taxonomy on the items in the itemsets.

This invention is realized in a critical machine component that causes a digital processing apparatus to perform method steps for identifying association rules in itemsets with a hierarchical taxonomy on items of the itemsets, the taxonomy defining descendant and ancestor relationships between the items. Hereinafter, the machine component is referred to as a "computer program product".

In accordance with the present invention, the method steps include accessing an itemset and entering the itemset into a set of large itemsets when the number of times the itemset is present in a database of transactions establishes a support value that exceeds a predefined minimum support value. Then, for at least some of the itemsets in the set of large itemsets, the number of times selected subsets of the itemsets appear in transactions in the database is determined. An association rule is output when the number of times a selected subset appears in the database bears a predetermined relationship to the number of times the associated itemset appears in the database and thereby satisfies a minimum confidence constraint.

Preferably, the method includes concatenating itemsets in the set of large itemsets in accordance with a predetermined concatenation regime to generate a next set of candidate large itemsets and discarding all candidate large itemsets whose subsets are not large itemsets. Also, each itemset in the next set of candidate large itemsets is compared with the itemsets in the database to determine the number of times the candidate large itemset is present in the database. The method steps include entering a candidate large itemset into a next forward set of large itemsets only when the number of times the candidate large itemset is present in the database is greater than the minimum support value.

In the preferred embodiment, the taxonomy is a directed acyclic graph (DAG) taxonomy, and at least some of the itemsets contain items that are descendant items and ancestor items in the DAG taxonomy. Ancestors of an item are entered into transactions that include the item only when the ancestor appears in one of the itemsets in the next set of candidate large itemsets. Optimally, the method steps include accessing the DAG taxonomy to predetermine the ancestors of each item prior to entering the ancestors into the set of large itemsets, and when an itemset in the next set of candidate large itemsets contains an item and an ancestor of the item, deleting the itemset from the next set of candidate large itemsets. In still a further optimization, the method steps include entering ancestors of an item into transactions that include the item and deleting duplicate item entries from the transactions, prior to determining the number of times the associated itemset is present in the database.

In another preferred embodiment, the taxonomy is derived on the itemsets and is characterized by a hierarchical order of levels. The levels range from highest level to lowest level successively represented by integers 0 through n, and itemsets containing items at level 0 are accessed for entry into the set of large itemsets before itemsets containing items at other levels.

In this so-called "stratify" embodiment, an itemset at level 1 is deleted if the itemset is a descendant of an ancestor itemset in level 0 when the number of times the ancestor itemset is present in the database does not exceed the predefined minimum support value. As envisioned herein, the deleting step is performed before itemsets at level 1 are accessed for entry into the set of large itemsets.

In yet another preferred embodiment, the taxonomy is characterized by a hierarchical order of levels, and the levels range from highest level to lowest level successively represented by integers 0 through n. In this so-called "estimate" embodiment, the method further includes accessing a selected sample portion of the database to estimate candidate large itemsets. Itemsets that are estimated to be large are entered into the next set of candidate large itemsets. Also, itemsets that are not estimated to be large but that have ancestor itemsets all of which are estimated to be large are entered into the next set of candidate large itemsets. In a so-called "estmerge" embodiment, the method steps include determining the number of times descendant itemsets of an itemset that was not estimated to be large are present in the database when the next set of candidate large itemsets is counted.

If desired, association rules generated by the computer program storage device can be pruned of non-interesting rules by determining expected confidence and support values for an association rule based upon the confidence and support of an ancestor of the association rule. The association rule is identified as interesting when the confidence and support values of the association rule exceeds the expected confidence and support values by a predetermined factor.

In another aspect of the present invention, a computer program product is disclosed which is readable by a digital processing apparatus and which tangibly embodies a computer program. The computer program product combines a computer readable medium with program code elements that identify association rules of itemsets of a database, each itemset having more than one item arranged in a taxonomy, so as to discover generalized customer purchasing tendencies.

In this invention, the code elements are embodied in a program stored on the computer readable medium. These code elements access an itemset and enter it into a set of large itemsets when the number of times the itemset is present in the database exceeds a predefined minimum support value. Further, the code elements determine, for at least some of the itemsets in the set of large itemsets, the number of times selected subsets of the itemsets appear in transactions in the database. Additionally, the code elements output an association rule when the number of times a selected subset appears in the database bears a predetermined minimum confidence relationship to the number of times the associated itemset appears in the database, thereby satisfying a minimum confidence constraint.

In yet another aspect, a program storage device is readable by a digital processing apparatus and tangibly embodies a program of instructions which are executable by the digital processing apparatus. The digital processing apparatus executes the program to perform method steps for identifying generalized association rules of itemsets in transactions stored in a database and having more than one item, the items being characterized by a taxonomic structure, so as to discover customer purchasing tendencies

The method steps include identifying as large itemsets those itemsets having items located anywhere in the taxonomic structure and recurring with at least a user-defined minimum support. Further, the method steps include discovering association rules between the large itemsets and subsets thereof when the subset recurrence bears a predetermined relationship to itemset recurrence. Also, the method steps include outputting the association rules as representative of customer purchasing tendencies.

In still another aspect of the present invention, a database mining system is disclosed for discovering association rules in itemsets having items stored in a taxonomically structured database. The mining system includes a large itemset generator for generating large itemsets when the itemsets have a support in a transaction database at least equal to a user-defined minimum support value. Also, the mining system includes an association rule generator for receiving the large itemsets and outputting an association rule when an itemset bears a confidence relationship to at least one of its subsets equal to or greater than a predetermined confidence relationship. Moreover, the system includes a rule pruner for identifying an association rule as interesting when the support and the confidence relationship respectively exceed an expected support and an expected confidence relationship by a preselected factor.

A computer-based system is disclosed in yet another aspect for discovering purchasing tendencies of consumers by identifying association rules between itemsets of transactions and subsets of the itemsets, the subsets including one or more items. The system includes a multilevel taxonomic structure accessible by the computer for storing the items in a hierarchical relationship. A large itemset generator is provided for accessing the taxonomic structure and the transactions for determining a first number of times an itemset appears in the transactions and for designating the itemset as a large itemset when the first number of times exceeds a minimum support value.

An association rule discoverer accesses the large itemset generator for determining a second number of times at least one subset of an itemset appears in the transactions. In accordance with the present invention, the association rule discoverer outputs an association rule that is representative of purchasing tendencies of consumers when the first number of times bears a predetermined minimum confidence relationship to the second number of times.

The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:

BRIEF DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a functional block diagram of the system for mining generalized association rules of the present invention;

FIG. 1A illustrates a machine component embodying the invention, with portions cut away for illustration;

FIG. 2 is a flow chart showing the overall operation of the present invention;

FIG. 3 is a flow chart showing the operation of the Basic embodiment of the present invention in identifying large itemsets;

FIG. 4 is a flow chart showing the candidate generation of the present invention;

FIG. 5 is a schematic diagram showing the data structure used by the Basic embodiment:

FIG. 6 is a flow chart showing the data management of the Basic embodiment;

FIG. 7 is a flow chart showing the buffer management of the Basic embodiment;

FIG. 8 is a flow chart showing the operation of the Cumulate embodiment of the present invention in identifying large itemsets;

FIG. 9 is a flow chart showing the operation of the Stratify embodiment of the present invention in identifying large itemsets;

FIG. 10 is a flow chart showing the operation of the Estimate embodiment of the present invention in identifying large itemsets;

FIG. 11 is a flow chart showing the operation of the EstMerge embodiment of the present invention in identifying large itemsets;

FIG. 12 is a flow chart showing a simple method for determining association rules in large itemsets across a taxonomy;

FIG. 13 is a flow chart showing a comparatively fast method for determining association rules in large itemsets across a taxonomy;

FIG. 14 is a flow chart showing the details of the method shown in FIG. 13; and

FIG. 15 is a flow chart showing the method by which non-interesting association rules are pruned.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring initially to FIG. 1, a system for mining databases for generalized association rules is shown, generally designated 10. In the particular architecture shown, the system 10 includes one or more digital processing apparatus, such as a client computer 12 and a server computer 14. In one intended embodiment, the server computer 14 may be a mainframe computer made by IBM Corp. of Armonk, N.Y., and use an operating system sold under trademarks such as MVS. Or, the server computer 14 may be a Unix computer, or OS/2 server, or Windows NT server, or IBM RS/6000 250 workstation with 128 MB of main memory running AIX 3.2.5. The server computer 14 may own a database system, such as DB2 or ORACLE, or it may have data on files on some data storage medium such as disk, e.g., a 2 GB SCSI 3.5" drive, or tape. It is to be understood that architectures other than the one shown may be used. For example, the functions of the client computer 12 may be incorporated into the server computer 14, and vice versa.

As shown, the operating system of the server computer 14 includes a mining kernel 16 which may be executed by a processor within the server computer 14 as a series of computer-executable instructions. These instructions may reside, for example, in RAM of the computer 14.

Alternatively, the instructions may be contained on a data storage device with a computer readable medium, such as a computer diskette 15 shown in FIG. 1A. Or, the instructions may be stored on a DASD array, magnetic tape, conventional hard disk drive, electronic read-only memory, optical storage device, or other appropriate data storage device. In an illustrative embodiment of the invention, the computer-executable instructions may be lines of compiled C.sup.++ language code.

FIGS. 2-15 illustrate the structure of such instructions as embodied in a computer program. Those skilled in the art will appreciate that FIGS. 2-15 illustrate the structures of computer program code elements that function according to this invention. Manifestly, the invention is practiced in its essential embodiment by a machine component that renders the computer program code elements in a form that instructs a digital processing apparatus (that is, a computer) to perform a sequence of function steps corresponding to those shown in the Figures. The machine component is shown in FIG. 1A as a combination of program code elements A-E in computer readable form that are embodied in a computer-usable data medium 17, on the computer diskette 15. As mentioned above, however, such media can also be found in semiconductor devices, on magnetic tape, and on optical disks.

Each of the code elements A-E is for directing a digital processing apparatus to facilitate some portion of the method by which this invention is practiced. Even when no single code clement A-E includes the complete method, two or more of the code elements together may comprise all of the program means necessary to facilitate the practice of the invention.

FIG. 1 shows that, through appropriate data access programs and utilities 18, the mining kernel 16 accesses one or more databases 20 and/or flat files (i.e., text files) 22 which contain data chronicling transactions. Alter executing the steps described below, the mining kernel outputs association rules it discovers to a mining results repository 24, which can be accessed by the client computer 12.

Additionally, FIG. 1 shows that the client computer 12 can include a mining kernel interface 26 which, like the mining kernel 16, may be implemented in suitable computer code. Among other things, the interface 26 functions as an input mechanism for establishing certain variables, including the minimum support value, minimum confidence value, and interesting factor R defined below. Further, the client computer 12 preferably includes an output module 28 for outputting/displaying the mining results on a graphic display 30, print mechanism 32, or data storage medium 34.

FIG. 2 shows the overall method of the present invention. Beginning with block 36, the system 10 identifies large itemsets in the database 20. Thus, block 36 is essentially a large itemset generator. As more fully disclosed below, by "large itemset" is meant a set of one or more items which are purchased in a user-defined percentage of all transactions in the database 20, i.e., itemsets which are appear in the database 20 in a user-defined "minimum support" percentage of transactions. Stated differently, at block 36 the database 20 is accessed for determining a first number of times an itemset appears in the database and for designating the itemset as a large itemset when the first number of times exceeds a minimum support value. In contrast, "small" itemsets are itemsets that do not fulfill the minimum support criteria.

Moreover, as disclosed below the items contained in the itemsets of transactions that are stored in the database 20 are characterized by a hierarchical taxonomy. Thus, as defined by the taxonomy some items are ancestors (also herein referred to as "parents") of other items. i.e., some items are located at higher levels of the taxonomy than other items which are located at lower levels and lie in a path from the ancestor item. Such items located at lower levels of the taxonomy are said to be descendant items of the higher level items in the taxonomy from which the lower level items branch. As used, herein, a first itemset is an ancestor to a second itemset if the first itemset contains an item that is an ancestor of an item contained in the second, itemset. Accordingly, by "large itemset" is further meant such an ancestor itemset that satisfies the minimum support constraint by virtue of the number of times its descendant itemsets appear in the aggregate in the database 20.

In other words, the items that appear in transaction itemsets in the database 20 are characterized by a taxonomical structure, i.e., a data structure wherein stored elements are arranged in one or more hierarchical taxonomics. For example, the items in the database 20 can be characterized by a taxonomical structure by item type category. Also, the items in the database 20 can be characterized by a taxonomical structure by item cost category. When multiple taxonomies are present, they may be combined in a directed acyclic graph (DAG) taxonomic structure, such as the structure schematically shown in FIG. 2.

The taxonomic structure of the present invention can be better understood in reference to the following example. Entry "E" could be, e.g., "clothes", representing an interior node of a taxonomy based upon item type. Entry "F", in contrast, could be, e.g., "articles costing less than $10", representing an interior node of a taxonomy based upon item cost. Both entries "E" and "F" are said to be at the highest level, termed herein zero level, of the DAG structure shown.

Then, entry "G" could be "socks", indicating an entry in the level immediately below the zero level ("level one") in both taxonomies. As indicated by DAG edges EG and FG, entry G descends from both entries "E" and "F".

Entries "H" and "M" could be "athletic socks" and "children's socks", respectively, indicating entries at level two of the DAG which, as respectively indicated by DAG edges GH and GM descend from entry "G" and, hence, from entries "E" and "F". In accordance with the present invention, entries "E" and "F" are close ancestors of entry "G" and ancestors of entries "H" and "M", but because the DAG is directed, the converse is not true in a DAG structure ("E" and "F" do not descend from "G", and "G" does not descend from "H" and "M"). Entry "G" is a descendant of entries "E" and "F", and entries "H" and "M" are descendants of entry "G" and, hence, are descendants of entries "E" and "F". Typically, only items at the lowest, or leaf level of the taxonomy will recorded in transactions. Support in the database 20 for items at higher levels of the taxonomy is determined from the support of the corresponding descendant items at the leaf level.

After the large itemsets have been identified at block 36, the process moves to block 38. In accordance with the invention disclosed in detail below, block 38 is an association rule discoverer which accesses the large itemset generator established by block 36 for determining a second number of times at least one subset of an itemset appears in the database 20. Then, the association rule discoverer at block 38 outputs an association rule representative of purchasing tendencies of consumers when the first number of times bears a predetermined relationship to the second number of times, i.e., when a predetermined or user specified minimum confidence value is satisfied.

As intended by the present invention, the association rule generator outputs generalized association rules that can span levels in the DAG, and that have the form

XY, wherein X, Y are itemsets, X .andgate. Y=.PHI., (i.e., X .andgate. Y is null), no item in Y is an ancestor to any item in X, and X, Y can contain items from any level in the taxonomy.

As an example, the present invention might access a database which chronicles transactions in an automobile parts, supplies, and service center. An association rule that can be discovered by the present invention might be that 98% of all customers who purchase tires along with tire pressure gages also purchase wheel balancing services during the transaction. Stated differently, in the example given it can be said with a 98% confidence level that wheel balancing services are not purchased separately from tires and tire pressure gages. As recognized by the present invention, the implications of association rules which are discovered between itemsets can produce substantial economic value to retailers, advertisers, marketing executives and indeed in a wide variety of applications.

At block 39, certain association rules discovered at block 38 might be pruned, on the basis that such rules might not be interesting. Stated differently, block 39 is a rule pruner for identifying an association rule as interesting when the support and the confidence relationships determined at blocks 36 and 38 respectively exceed an expected support and an expected confidence relationship multiplied by a preselected factor R. As more fully described below, the expected support and confidence of each rule is determined based upon the support and confidence of close ancestors, if any, of the rule.

Accordingly, any rule that has no ancestors is interesting, and any rule that has a close ancestor but for which the actual support and confidence of the rule fall outside a user-defined expected boundary that is based upon the ancestor is interesting. In contrast, any rule that has a close ancestor and for which the actual support and confidence of the rule fall within a user-defined expected boundary that is based upon the ancestor is not interesting.

The present invention recognizes that it need consider only close ancestors of rules when determining whether a test rule is interesting. More particularly, for a test rule to be truly interesting, it must be interesting with respect to its closest ancestor. If a test rule is otherwise interesting with respect to a relatively distant ancestor, but not a relatively closer ancestor to the test rule, it is actually the closer ancestor that is interesting with respect to the distant ancestor, and the test rule actually is not interesting in and of itself.

FIG. 3 shows one embodiment of the process of the large itemset generator established by block 36 in FIG. 2, referred to herein as "Basic". In contrast to the present invention, the prior procedures mentioned above generate and count candidate large sequences only by considering items at leaf nodes in a database, i.e., by not considering that useful association rules might span various levels in a taxonomy. F