Method and apparatus for extracting structured data from HTML pages
   
Document Number
US Patent 7073122
Issued Date
July 4, 2006
Inventors
Abstract
A method and apparatus for extracting structured data from HTML pages whereby an HTML file belonging to a pre-determined class of HTML files can be transformed into an instance tree (142). Other than the HTML file, there are two other inputs to the extraction procedure: a set of constraints (134), and a structure template (140). The steps in the process include: parsing the HTML file, thereby creating a parse tree (126); annotating the parse tree, thereby creating an annotated parse tree (130); creating an array of nodes from the annotated parse tree using a set of constraints (134); and generating an instance tree (142) from the array of nodes using the structure template (140). The instance tree (142) encodes, in a form that may be used by other computer programs, all the relevant information in the HTML file as prescribed by the set of constraints (134) and makes explicit the structure of this information.
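The four steps in the abstract can be pictured with a minimal Python sketch. It is not the patented implementation: the `Node` class, the path-based annotation, the predicate-style constraints, and the dictionary-shaped template are all assumptions made for illustration, since the abstract does not specify how constraints (134) or the structure template (140) are represented.

```python
from html.parser import HTMLParser


class Node:
    """One node of the parse tree: an element or a '#text' leaf."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []
        self.annotations = {}


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse(html):
    """Step 1: HTML file -> parse tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def annotate(node, parent_path=""):
    """Step 2 (assumed annotation): record each node's tag path from the root."""
    node.annotations["path"] = f"{parent_path}/{node.tag}"
    for child in node.children:
        annotate(child, node.annotations["path"])
    return node


def select_nodes(node, constraints):
    """Step 3: flatten the tree, in document order, to the nodes passing a constraint."""
    selected = [(label, node) for label, test in constraints.items() if test(node)]
    for child in node.children:
        selected.extend(select_nodes(child, constraints))
    return selected


def build_instance_tree(nodes, template):
    """Step 4: group the flat array into records; a repeated field starts a new record."""
    records = []
    for label, node in nodes:
        if label not in template["fields"]:
            continue
        if not records or label in records[-1]:
            records.append({})
        records[-1][label] = node.text
    return {template["root"]: records}


def extract(html, constraints, template):
    tree = annotate(parse(html))
    return build_instance_tree(select_nodes(tree, constraints), template)


# Hypothetical inputs for one "class" of pages: a list of items with a title and a price.
SAMPLE_HTML = "<ul><li><b>Widget</b><i>$5</i></li><li><b>Gadget</b><i>$7</i></li></ul>"
CONSTRAINTS = {
    "title": lambda n: n.annotations["path"].endswith("/b/#text"),
    "price": lambda n: n.annotations["path"].endswith("/i/#text"),
}
TEMPLATE = {"root": "catalog", "fields": ["title", "price"]}

if __name__ == "__main__":
    print(extract(SAMPLE_HTML, CONSTRAINTS, TEMPLATE))
```

The constraints decide which nodes are relevant, and the template decides how the flat array of nodes is regrouped into a nested result. Keeping the two separate mirrors the abstract's distinction between the two inputs.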
Number of Claims:
18
Owner
Published
July 4, 2006
Application Number
10/363,880
Filed
September 8, 2000
US Classification
715/513; 715/501.1
Int'l Classification
G06F 15/00 (20060101)
Examiner
Attorney/Law Firm
USPTO Field of Search
715/513; 715/514; 715/501.1
Related Patents
7555480 - Comparatively crawling web page data records relative to a template - Owned by Microsoft Corporation (Redmond, WA)

The invention provides a method of interactively crawling data records on a web page. Users may select various data records of interest on a web page to generate templates to search for similar data items on the same web page or on different web pages. A tree matching algorithm may be used to compare and extract data matching the generated template.
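The summary above does not say which tree matching algorithm the related patent uses. As one illustration of the general idea, the sketch below implements Simple Tree Matching, a well-known dynamic-programming algorithm that counts the largest order-preserving set of matching nodes between two trees. The `(label, children)` tuple representation and the sample trees are assumptions made for this example.

```python
def simple_tree_matching(a, b):
    """Return the size of the largest order-preserving match between two trees.

    Each tree is a (label, children) tuple, where children is a list of trees.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # Roots differ, so nothing below them is matched.
    rows, cols = len(kids_a), len(kids_b)
    # best[i][j] = best match count using the first i children of a and first j of b.
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[rows][cols] + 1  # +1 for the matched roots.


# Hypothetical template built from a user-selected record, and a candidate subtree.
template = ("tr", [("td", []), ("td", [])])
candidate = ("tr", [("td", []), ("a", [])])

if __name__ == "__main__":
    print(simple_tree_matching(template, template))
    print(simple_tree_matching(template, candidate))
```

A candidate whose score is close to the template's own size can be treated as a similar record. Any threshold for "close" would be a tuning choice, not something stated in the summary.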
