WikiPatents - Community Patent Review
Evaluation of media content in media files
United States Patent     5983176
Link to this page     http://www.wikipatents.com/5983176.html
Inventor(s)     Hoffert; Eric M. (San Francisco, CA), Cremin; Karl (Mt. View, CA), Degen; Leo (Petaluma, CA)
Abstract     A method and apparatus for searching for multimedia files in a distributed database and for displaying results of the search based on the context and content of the multimedia files.



Owner/Assignee     Magnifi, Inc. (Cupertino, CA)
Publication Date     November 9, 1999
Application Number     08/848,357
Filing Date     April 30, 1997
US Classification     704/233 704/231 704/236
Examiner     Dorvil; Richemond
Assistant Examiner     Sax; Robert Louis
Attorney/Law Firm     Blakely, Sokoloff, Taylor & Zafman LLP
Parent Case     RELATED APPLICATIONS

This application claims benefit of the following co-pending U.S. Provisional Applications:

1) Method and Apparatus for Processing Context and Content of Multimedia Files When Creating Searchable Indices of Multimedia Content on Large, Distributed Networks; Ser. No.: 60/018,312; Filed: May 24, 1996;

2) Method and Apparatus for Display of Results of a Search Queries for Multimedia Files; Ser. No.: 60/018,311; Filed: May 24, 1996;

3) Method for Increasing Overall Performance of Obtaining Search Results When Searching on a Large, Distributed Database By Prioritizing Database Segments to be Searched; Ser. No.: 60/018,238; Filed: May 24, 1996;

4) Method for Processing Audio Files to Compute Estimates of Music-Speech Content and Volume Levels to Enable Enhanced Searching of Multimedia Databases; Ser. No.: 60/021,452; Filed: Jul. 10, 1996;

5) Method for Searching for Copyrighted Works on Large, Distributed Networks; Ser. No.: 60/021,515; Filed: Jul. 10, 1996;

6) Method for Processing Video Files to Compute Estimates of Motion Content, Brightness, Contrast and Color to Enable Enhanced Searching of Multimedia Databases; Ser. No.: 60/021,517; Filed: Jul. 10, 1996;

7) Method and Apparatus for Displaying Results of Search Queries for Multimedia Files; Ser. No.: 60/021,466; Filed: Jul. 10, 1996;

8) A Method for Indexing Stored Streaming Multimedia Content When Creating Searchable Indices of Multimedia Content on Large, Distributed Networks; Ser. No.: 60/023,634; Filed: Aug. 9, 1996;

9) An Algorithm for Exploiting Lexical Proximity When Performing Searches of Multimedia Content on Large, Distributed Networks; Ser. No.: 60/023,633; Filed: Aug. 9, 1996;

10) A Method for Synthesizing Descriptive Summaries of Media Content When Creating Searchable Indices of Multimedia Content on Large, Distributed Networks; Ser. No.: 60/023,836; Filed: Aug. 12, 1996.
USPTO Field of Search     704/233 704/231 704/236

References

U.S. References

5298674, Yun, Mar. 1994

4829578, Roberts, May 1989
Claims
 


What is claimed is:

1. A method of determining if an audio file comprises music or speech comprising the steps:

examining time slices of said audio file to determine amplitude changes with time during each time slice;

classifying said audio file as music or speech based on said amplitude changes.

2. The method as recited by claim 1 wherein said step of examining time slices comprise computing normalized amplitude deviation values for each time slice.

3. The method as recited by claim 2 wherein said step of examining time slices further comprises averaging the normalized amplitude deviation values for all time slices to compute a music-speech metric.

4. The method as recited by claim 2 wherein said step of computing a normalized deviation value comprises computing maximum amplitude for all time slices, computing a deviation from the maximum amplitude for a particular time slice and normalizing the deviation value.

5. A method of analyzing audio file content comprising:

a) dividing said audio file into time segments of a predetermined size;

b) computing values of normalized amplitude deviation with time during each time segment;

c) averaging the normalized amplitude deviation values to compute a music speech metric; and

d) assessing whether the audio file represents music or speech based on the music-speech metric value.

6. The method as recited by claim 5 wherein said step of computing a normalized deviation value comprises computing an average maximum amplitude for each time slice, computing a deviation from the average maximum amplitude for a particular time slice and normalizing the deviation value.

7. The method as recited by claim 5 wherein the step of normalizing the deviation value comprises:

a) computing a value MAX-DEV as the absolute value of the difference between the maximum amplitude of a time slice and the average maximum amplitude for all time;

b) normalizing MAX-DEV over a reference value range by computing a value NORMALIZED-MAX-DEV=MAX-DEV*(REF-VAL/MAX) where REF-VAL is the maximum reference value in the range and MAX is the maximum amplitude for all time slices.
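The computation recited in claims 5 through 7 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the claims: samples are plain numbers, "amplitude" is taken to be the absolute sample value, slices are fixed-length runs of samples, and the default `ref_val` of 100 is arbitrary. The function names are hypothetical. No classification threshold is shown, because the claims do not give one.

```python
from typing import List, Sequence


def slice_maxima(samples: Sequence[float], slice_len: int) -> List[float]:
    """Peak absolute amplitude of each fixed-length time slice (assumed slicing)."""
    return [
        max(abs(s) for s in samples[i:i + slice_len])
        for i in range(0, len(samples), slice_len)
    ]


def music_speech_metric(samples: Sequence[float], slice_len: int,
                        ref_val: float = 100.0) -> float:
    """Average of NORMALIZED-MAX-DEV over all time slices."""
    maxima = slice_maxima(samples, slice_len)
    if not maxima:
        return 0.0
    overall_max = max(maxima)              # MAX: maximum amplitude for all slices
    if overall_max == 0:
        return 0.0                         # silent input, nothing to normalize
    avg_max = sum(maxima) / len(maxima)    # average maximum amplitude
    normalized = [
        abs(m - avg_max) * (ref_val / overall_max)   # MAX-DEV * (REF-VAL / MAX)
        for m in maxima
    ]
    return sum(normalized) / len(normalized)
```

For example, `music_speech_metric([0, 2, 0, 4], slice_len=2)` has slice maxima 2 and 4, an average maximum of 3, and a normalized deviation of 25 for each slice, so the metric is 25.0.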

8. A method of assessing the volume of an audio file comprising the steps of:

a) dividing said audio file into time slices;

b) computing the average amplitude AVG-AMPLITUDE for each time slice;

c) computing the average of the AVG-AMPLITUDE to provide a metric;

d) assessing a volume level based on said metric.

9. The method as recited by claim 8 wherein said value AVG-AMPLITUDE is computed by summing amplitude values in each time slice and dividing.
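The volume metric of claims 8 and 9 might be sketched as follows. This is an illustration under assumptions: amplitude is taken as the absolute sample value, slices are fixed-length, and the divisor in claim 9 is assumed to be the number of samples in the slice (the claim says only "dividing"). The function name is hypothetical.

```python
from typing import Sequence


def volume_metric(samples: Sequence[float], slice_len: int) -> float:
    """Average of the per-slice AVG-AMPLITUDE values."""
    slice_averages = []
    for i in range(0, len(samples), slice_len):
        chunk = samples[i:i + slice_len]
        # AVG-AMPLITUDE: sum of amplitudes in the slice divided by the
        # number of samples in it (assumed divisor).
        slice_averages.append(sum(abs(s) for s in chunk) / len(chunk))
    if not slice_averages:
        return 0.0
    return sum(slice_averages) / len(slice_averages)
```

With `volume_metric([1, -1, 3, -3], slice_len=2)` the slice averages are 1 and 3, giving a metric of 2.0. Mapping the metric to a volume level would need thresholds that the claims leave open.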

10. A method of analyzing content of an audio file comprising the steps of:

a) assessing whether said audio file represents music or speech by examining slices of said audio file to determine amplitude changes with time during the time slice and classifying said audio file as music or speech based on said amplitude changes;

b) assessing a volume level of said audio file.

11. The method as recited by claim 10 wherein said step of examining time slices comprise computing normalized amplitude deviation values for each time slice.

12. The method as recited by claim 11 wherein said step of examining time slices further comprises averaging the normalized amplitude deviation values for all time slices to compute a music-speech metric.

13. The method as recited by claim 11 wherein said step of computing a normalized deviation value comprises computing an average maximum amplitude for each time slice, computing a deviation from the average maximum amplitude for a particular time slice and normalizing the deviation value.

14. The method of claim 10 wherein said step assessing the volume of an audio file further comprises the steps of:

a) dividing said audio file into time slices;

b) computing the average amplitude AVG-AMPLITUDE for each time slice;

c) computing the average of the AVG-AMPLITUDES to provide a metric;

d) assessing a volume level based on said metric.

15. The method as recited by claim 14 wherein said value AVG-AMPLITUDE is computed by summing amplitude values in each time slice and dividing.
Description
 


BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to the field of networking, specifically to the field of searching for and retrieval of information on a network.

Description of the Related Art

Wouldn't it be nice to be able to log onto your local internet service provider, access the worldwide web, and search for some simple information, like "Please find me action movies with John Wayne which are in color", "Please find me audio files of Madonna talking", or "I would like black and white photos of the Kennedy assassination"? Or, how about even "Please find me an action movie starring Michael Douglas and show me a preview of portions of the movie where he is speaking loudly"? Perhaps, instead of searching the entire worldwide web, a company may want to implement this searching capability on its intranet.

Unfortunately, text based search algorithms cannot answer such queries. Yet, text based search tools are the predominant search tools available on the internet today. Even if text based search algorithms are enhanced to examine files for file type and, therefore, are able to detect whether a file is an audio, video or other multimedia file, little if any information is available about the content of the file beyond its file type.

Still further, what if the search returns a number of files? Which one is right? Can the user tell from looking at the title of the document or some brief text contained in the document as is done by many present day search engines? In the case of relatively small text files, downloading one or two or three "wrong" files, when searching for the right file, is not a major problem. However, when downloading relatively large multimedia files, it may be problematic to download the files without having a degree of assurance that the correct file has been found.

SUMMARY OF THE INVENTION

It is desirable to provide a search engine which is capable of searching the internet, or other large distributed network, for multimedia information. It is also desirable that the search engine provide for analysis of the content of files found in the search and for display of previews of the information.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an overall diagram of a media search and retrieval system as may implement the present inventions.

FIGS. 2A-C illustrate a flow diagram of a method of media crawling and indexing as may utilize the present inventions.

FIG. 3A illustrates an overall diagram showing analysis of digital audio files.

FIGS. 3B, 3C and 3D illustrate waveforms.

FIGS. 3E-H illustrate a flow diagram of a method of analyzing content of digital audio files.

FIG. 4A illustrates a user interface showing search results.

FIG. 4B illustrates components of a preview.

FIGS. 4C-4E illustrate a flow diagram of a method of providing for previews.

For ease of reference, it might be pointed out that reference numerals in all of the accompanying drawings typically are in the form "drawing number" followed by two digits, xx; for example, reference numerals on FIG. 1 may be numbered 1xx; on FIG. 3, reference numerals may be numbered 3xx. In certain cases, a reference numeral may be introduced on one drawing and the same reference numeral may be utilized on other drawings to refer to the same item.

DETAILED DESCRIPTION OF THE EMBODIMENTS

What is described herein is a method and apparatus for searching for, indexing and retrieving information in a large, distributed network.

1.0 Overview

FIG. 1 provides an overview of a system implementing various aspects of the present invention. As was stated above, it is desirable to provide a system which will allow searching of media files on a distributed network such as the internet or, alternatively, on intranets. It would be desirable if such a system were capable of crawling the network, indexing media files, examining and analyzing the media file's content, and presenting summaries to users of the system of the content of the media files to assist the user in selection of a desired media file.

The embodiment described herein may be broken down into 3 key components: (1) crawling and indexing of the network to discover multimedia files and to index them 100; (2) examining the media files for content (101-105); and (3) building previews which allow a user to easily identify media objects of interest 106. Each of these phases of the embodiment provides, as will be appreciated, for unique methods and apparatus for allowing advanced media queries.

2.0 Media Crawling and Indexing

FIGS. 2A-2C provide a description of a method for crawling and indexing a network to identify and index media files. Hypertext markup language (HTML) in the network is crawled to locate media files, block 201. Lexical information (i.e., textual descriptions) is located describing the media files, block 202, and a media index is generated, block 203. The media index is then weighted, block 204, and data is stored for each media object, block 205. Each of these steps will be described in greater detail below.

2.1 Crawl HTML to Locate Media Files

The method of the described embodiment for crawling HTML to locate media files is illustrated in greater detail by FIG. 2B. Generally, a process as used by the present invention may be described as follows:

The crawler starts with a seed of multimedia specific URL sites to begin its search. Each seed site is handled by a separate thread for use in a multithreaded environment. Each thread parses HTML pages (using a tokenizer with lexical analysis) and follows outgoing links from a page to search for new references to media files. Outgoing links from an HTML page are either absolute or relative references. Relative references are concatenated with the base URL to generate an absolute pathname. Each new page which is parsed is searched for media file references.

When a new site is found by the crawler, there is a check against the internal database to ensure that the site has not already been visited (within a small period of time); this guarantees that the crawler only indexes unique sites within its database, and does not index the same site repeatedly. A hash table scheme is used to guarantee that only unique new URLs are added to the database. The URL of a link is mapped into a single bit in a storage area which can contain up to approximately ten million URLs. If a newly found URL hashes to a bit position that is already set, then the URL is not added to the list of URLs for processing.

As the crawler crawls the web, those pages which contain media references receive a higher priority for processing than those pages which do not reference media. As a result, pages linked to media specific pages will be visited by the crawler first in an attempt to index media related pages more quickly than through conventional crawler techniques.
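The one-bit-per-URL scheme described above might be sketched as follows. This is an illustration only: the choice of SHA-1 as the hash, the class name `UrlBitmap`, and the method name `add_if_new` are assumptions, since the text does not name a hash function or an interface.

```python
import hashlib


class UrlBitmap:
    """Tracks visited URLs with one bit per hash position (hypothetical helper)."""

    def __init__(self, num_bits: int = 10_000_000) -> None:
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _position(self, url: str) -> int:
        # Assumed hash: the first 8 bytes of SHA-1, reduced to a bit index.
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_new(self, url: str) -> bool:
        """Set the URL's bit; return False if that bit was already set."""
        pos = self._position(url)
        byte_index, mask = pos // 8, 1 << (pos % 8)
        if self.bits[byte_index] & mask:
            return False   # treated as already seen, so not queued again
        self.bits[byte_index] |= mask
        return True
```

A consequence of using a single bit per URL is that two different URLs which hash to the same position are indistinguishable, so the second one is skipped. The scheme trades a small number of missed pages for a fixed, compact memory footprint.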

When entering a new site, the crawler scans for a robot exclusion protocol file. If the file is present, it indicates those directories which should not be scanned for information. The crawler will not index material which is disallowed by the optional robot exclusion file. On a per directory basis, there is proposed to be stored a media description file (termed for purposes of this application the mediaX file). The general format of this file for the described embodiment is provided in Appendix A. This file contains a series of records of textual information for each media file within the current directory. As will be discussed in greater detail below, the crawler scans for the media description file in each directory at a web site, and adds the text based information stored there into the index being created by the crawler. The mediaX file allows for storage of information such as additional keywords, abstract and classification data. Since the mediaX file is stored directly within the directory where the media file resides, it ensures an implicit authentication process whereby the content provider can enhance the searchable aspects of the multimedia information and can do so in a secure manner.
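The robot exclusion check described above can be sketched with Python's standard `urllib.robotparser` module. The sketch parses exclusion rules that have already been fetched as text; the helper name `build_exclusion_checker` and the user agent string `media-crawler` are hypothetical. The mediaX file is not shown because its format is given in Appendix A, which is not reproduced here.

```python
from urllib.robotparser import RobotFileParser


def build_exclusion_checker(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robot exclusion file into a reusable checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser
```

Given the rules `User-agent: *` and `Disallow: /private/`, calling `can_fetch("media-crawler", url)` on the returned checker reports a URL under `/private/` as disallowed and a URL under another directory as allowed, so the crawler can skip the excluded directories before indexing.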

The crawler can be constrained to operate completely within a single parent URL. In this case, the user inputs a single URL corresponding to a single web site. The crawler will then only follow outgoing links which are relative to the base URL for the site. No absolute links will be followed. By following only those links which are relative to the base URL, only those web pages which are within a single web site will be visited, resulting in a search and indexing pass of a single web site. This allows for the crawling and indexing of a single media-rich web site. Once a single web site has had an index created, then users may submit queries to find content located only at the web site of interest. This scheme will work for what is commonly referred to as "Intranet" sites, where a media-rich web site is located behind a corporate firewall, or for commercial web sites containing large multimedia datasets.
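A minimal sketch of the single-site constraint follows. It assumes that a link counts as absolute when it carries a scheme or a host, and it uses the standard `urllib.parse` functions to resolve the relative links against the base URL. The function name `links_to_follow` is hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def links_to_follow(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Keep only relative links, resolved against the base URL."""
    followed = []
    for href in hrefs:
        parts = urlparse(href)
        if parts.scheme or parts.netloc:
            continue   # absolute reference: not followed in single-site mode
        followed.append(urljoin(base_url, href))
    return followed
```

For a base of `http://example.com/media/index.html`, the links `clips/a.mov` and `../about.html` resolve to `http://example.com/media/clips/a.mov` and `http://example.com/about.html`, while `http://other.example/b.mov` is dropped.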

2.1.1 Scan Page for Predetermined HTML Tag Types

Each HTML page is scanned for predetermined types of HTML tags, block 211. In this embodiment, the following tags are scanned for:

tables (single row and multi-row)

lists (ordered and unordered)

headings

JavaScript

client side image maps

server side image maps

header separators

2.1.2 Determine if There is a Media URL

Next, it is determined whether the page contains a media uniform resource locator (URL), block 212. If there is a media URL, then the media URL is located and stored. However, in the described embodiment, certain media URLs may be excluded. For example, an embodiment may choose not to index URLs having certain keywords in the URL, certain prefixes, certain suffixes or particular selected URLs.
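The exclusion rules mentioned above could be expressed as a small filter. The sketch below is an assumption about how such rules might be combined (any single match excludes the URL, and matching is case-insensitive); the function name and parameters are hypothetical.

```python
from typing import Collection, Sequence


def is_excluded(url: str,
                keywords: Sequence[str] = (),
                prefixes: Sequence[str] = (),
                suffixes: Sequence[str] = (),
                blocked: Collection[str] = frozenset()) -> bool:
    """Return True if the media URL matches any configured exclusion rule."""
    lowered = url.lower()
    return (
        url in blocked                                     # particular selected URLs
        or any(k.lower() in lowered for k in keywords)     # keywords in the URL
        or lowered.startswith(tuple(p.lower() for p in prefixes))
        or lowered.endswith(tuple(s.lower() for s in suffixes))
    )
```

For example, `is_excluded("http://example.com/ads/banner.gif", keywords=["/ads/"])` is true, while the same call for `http://example.com/clips/launch.mov` is false.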

2.1.3 Locating Relevant Lexical Information

Next, relevant lexical information (text) is selected for each URL. Often a web page which references a media file provides significant description of the media file as textual information on the web page. When indexing a media file, the present invention has recognized that it would be useful to utilize this textual information. However, certain web pages may reference only a single media file, while other web pages may reference a plurality of media files. In addition, certain lexical information on the web page may be more relevant than other information to categorizing the media for later searching.

It has been observed that relevant textual information may be directly surrounding the media reference on a web page, or it may be far from the media reference. However, it has been found that more often than not, the relevant text is very close (in lexical distance) to the media reference. Therefore, the following general rules are applied when associating lexical information with a media file:

1) if the media file reference is found within a table, store the text within the table element as associated with the media file;

2) if the media file reference is found within a list, store the text within the list element as associated with the media file;

3) store the text in the heading as associated with the media file. In addition, in some embodiments, the text within higher level headings may also be stored.

4) if there is JavaScript, store the text associated with the JavaScript tag;

5) for client and server side image maps, if there is no relevant text, store only the URL. In addition, the image maps may be parsed to obtain all unique URLs and these may also be stored.

In some embodiments, a special tag may be stored within the indexed text where the media reference occurs in the web page. When queries are posed to the full-text database of the stored HTML pages which reference media, the distance of the keyword text from the media reference tag can be used to determine if there is a relevant match. The standard distance from media reference to matching keyword utilized is ten words in each direction outwards from the media reference. The word distance metric is called "lexical proximity". For standard web pages, where text surrounding media is generally relevant, this is an appropriate value.

If the results of a search using lexical proximity are not satisfactory to a user, the user needs a mechanism by which to broaden or narrow the search, based on the relevance which is found by the default lexical proximity. Users can employ an expand and narrow search button to change the default lexical proximity. The expand function will produce more and more search results for a given query, as the lexical proximity value is increased. A typical expand function will increase the lexical proximity value by a factor of two each time it is selected. When the expand function is used, more text will be examined which is located near the media reference to see if there is a keyword match. Expanding the search repeatedly will decrease precision and increase recall.

The narrow search button will do the reverse, by decreasing the lexical proximity value more and more. A typical narrow function will decrease the lexical proximity value by a factor of two each time it is selected. The narrow search button will reduce the number of search results, and home in on the text which directly surrounds the media reference. Narrowing the search will increase precision and decrease recall. The relevance of the returned results should be quite high, on average, as a search is narrowed using this method.

When a database is limited in depth of entries, and is generated with a fixed lexical proximity value, a search query may often produce a search result list with zero hits. In order to increase the number of search results for the case of zero hits with fixed lexical proximity, a method is employed which will iterate on the lexical proximity value until a set of ten search results are returned. The algorithm is as follows:

perform the search query

look at the number of returned hits

if the number of returned hits is less than ten, then

perform a new search with the lexical proximity value doubled

continue the above process until ten search results are returned
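The iteration above can be sketched as a short loop. The `run_query` callback stands in for whatever search backend is used and is hypothetical, as are the parameter names. The `max_proximity` cap is an added assumption: as written, the steps above would never terminate when fewer than ten matching entries exist at any proximity.

```python
from typing import Callable, List, Tuple


def search_with_widening(run_query: Callable[[str, int], List[str]],
                         query: str,
                         proximity: int = 10,
                         wanted: int = 10,
                         max_proximity: int = 1280) -> Tuple[List[str], int]:
    """Double the lexical proximity until enough hits are returned."""
    hits = run_query(query, proximity)
    while len(hits) < wanted and proximity < max_proximity:
        proximity *= 2                    # widen the window around the media reference
        hits = run_query(query, proximity)
    return hits, proximity
```

The function returns both the hits and the proximity value that produced them, so a caller can report how far the search had to be widened.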

Users should be able to specify the usage of lexical proximity to enhance the indexing of their search material. For example, if the web page author knows that all words which are ten words in front of the media reference are valid and relevant, then the author should specify a lexical proximity value which is only negative ten (i.e., look only in the reverse direction from the media URL by ten words). If the web page author knows that all words which are ten words after the media reference are valid and relevant, then the author should specify a lexical proximity value which is only positive ten. Finally, if the web author knows that both ten words ahead, and ten words behind the media reference are relevant, then the lexical proximity value should be set to positive/negative ten. Similarly, if the web author knows that the entire page contains relevant text for a single media file, then the lexical proximity value should be set to include all text on a page as relevant.
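A directional lexical proximity test of the kind described above might look like the following sketch. It assumes the page text has already been split into words and that the position of the media reference tag is known; the `<MEDIA>` token, the function name, and the separate `before` and `after` parameters are illustrative choices, not part of the described embodiment.

```python
from typing import Sequence


def keyword_near_media(words: Sequence[str], media_index: int, keyword: str,
                       before: int = 10, after: int = 10) -> bool:
    """True if keyword occurs within the given word distances of the media tag."""
    start = max(0, media_index - before)
    end = min(len(words), media_index + after + 1)
    # Words preceding and following the tag, excluding the tag itself.
    window = list(words[start:media_index]) + list(words[media_index + 1:end])
    return keyword.lower() in (w.lower() for w in window)
```

Setting `after=0` corresponds to a proximity of negative ten only (text ahead of the reference), `before=0` to positive ten only, and the defaults to ten words in both directions. Passing values at least as large as the page length treats all text on the page as relevant.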

In addition to the above-described processes for locating relevant lexical information, in the described embodiment, certain information is generally stored for all media URL's. In particular, the following information is stored:

the name of the media file

URL of the media file

text string which is associated with the media file anchor reference

title of the HTML document containing the media file

keywords associated with the HTML document

URL for the HTML document containing the media file reference

keywords embedded in the media file

textual annotations in the media file

script dialogue, closed captioning and lyric data in the media file

auxiliary data in the media file (copyright, author, producer, etc.)

auxiliary data located within the media reference in the HTML document

auxiliary data located in an associated media description file

2.1.4 Streaming Files

Media content of files may be stored as downloadable files or as streaming files. Downloadable content is indexed by connecting to an HTTP server, downloading the media file, and then analyzing the file for the purposes of building a media rich index.

In the case of streaming multimedia content, block 214, an HTTP server stores, not the content itself, but instead a reference to the media file. Therefore, the process of indexing such a file is not as straightforward as for a downloadable file which is stored on the HTTP server and may be downloaded from the server.

In the case of streaming media files certain information is gathered, block 215, as will be described with reference to FIG. 2C.

Below is described a method for indexing streaming files to index audio content and to index video content:

download the media file reference corresponding to the correct streaming media type

for each URL listed in the media file reference, perform the following operation:

connect directly to the media file on the media server where it resides, block 221

commence streaming of the media on the appropriate TCP socket, block 222

query the streaming media to obtain appropriate content attributes and header data, block 223

add all relevant content attributes and header information into the media rich index, block 224 (header information to be queried and indexed includes title, author, copyright; in the case of a video media file, additional information indexed may also include duration, video resolution, frame rate, etc.)

determine if streaming text or synchronized multimedia information is included, block 225.

if it is, then stream the entire media clip, and index all text within the synchronized media track of the media file

if possible, store the time code for each block of text which occurs with the streaming media

This method can be applied to any streaming technology, including both streaming sound and video. The media data which is indexed includes information which is resident in the file header (e.g., title, author, copyright), and information which can be computed or analyzed based on information in the media file (e.g., sound volume level, video color and brightness).

The latter category of information includes content attributes which can be computed while the media is streaming, or after the media has completed streaming from a server. It should be noted that once the streaming media has been queried and received results back from the server, the streaming process can conclude as the indexing is complete.

2.2 Generate and Weight a Media Index

As the network is crawled, a media index is generated by storing the information which has been discussed above in an index format. The media index is weighted to provide for increased accuracy in the searching capabilities. In the described embodiment, the weighting scheme applies a weighting factor to each of the following text items:

______________________________________________________________________
ITEM                                                  WEIGHTING FACTOR
______________________________________________________________________
URL of the media file                                        10
Keywords embedded in the media file                          10
Textual annotations in the media file                        10
Script dialogue, lyrics, and closed captioning
  in the media file                                          10
Text strings associated with the media file
  anchor reference                                            9
Text surrounding the media file reference                     7
Title of the HTML document containing the media file          6
Keywords and meta-tags associated with the HTML document      6
URL for the HTML document containing the media file
  reference                                                   5
______________________________________________________________________

In other embodiments, alternative weighting factors may be utilized without departure from the present invention.
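One way the weighting factors listed above could be applied at query time is sketched below. The text does not say how the factors are combined, so summing the factors of every field that contains the keyword is an assumption, and the field names and the `score` function are hypothetical.

```python
from typing import Dict

# Weighting factors from the described embodiment; the keys are made-up field names.
FIELD_WEIGHTS: Dict[str, int] = {
    "media_url": 10,
    "embedded_keywords": 10,
    "annotations": 10,
    "dialogue_lyrics_captions": 10,
    "anchor_text": 9,
    "surrounding_text": 7,
    "document_title": 6,
    "document_keywords": 6,
    "document_url": 5,
}


def score(record: Dict[str, str], keyword: str) -> int:
    """Sum the weights of all fields in which the keyword appears (assumed rule)."""
    needle = keyword.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if needle in record.get(field, "").lower()
    )
```

A record whose media URL and anchor text both contain "apollo" would score 10 + 9 = 19 for that keyword, ranking it above a record that matches only in the document title (6).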

2.3 Store Data for Each Media Object

Finally, data is stored for each media object. In the described embodiment, the following data is stored:

Relevant text

HTML document title

HTML meta tags

Media specific text (e.g., closed captioning, annotations, etc.)

Media URL

Anchor text

Content previews (discussed below)

Content attributes (such as brightness, color or B/W, contrast, speech v. music and volume level. In addition, sampling rate, frame rate, number of tracks, data rate, size may be stored).

Of course, in alternative embodiments a subset or superset of these fields may be used.

3.0 Content Analysis

As was briefly mentioned above, it is desirable to not only search the lexical content surrounding a media file, but also to search the content of the media file itself in order to provide a more meaningful database of information to search.

As was shown in FIG. 1, the present invention is generally concerned with indexing two types of media files: (i) audio 102 and (ii) video 103.

3.1 Video Content

The present invention discloses an algorithm used to predict the likelihood that a given video file contains a low, medium or high degree of motion. In the described embodiment, the likelihood is computed as a single scalar value, which maps into one of N buckets of classification. The value associated with the motion likelihood is called the "motion" metric. A method for determining and classifying the brightness, contrast and color of the same video signal is also described. The combination of the motion metric along with brightness, contrast and color estimates enhances the ability of users to locate a specific piece of digital video.
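The mapping of a scalar motion metric into one of N buckets might be sketched as follows. The value range of the metric and the bucket boundaries are not given in this passage, so the sketch assumes a metric between `lo` and `hi` (defaulting to 0 and 1) and equal-width buckets; the function name is hypothetical.

```python
def motion_bucket(metric: float, num_buckets: int = 3,
                  lo: float = 0.0, hi: float = 1.0) -> int:
    """Map a scalar metric onto a bucket index in 0..num_buckets-1."""
    if hi <= lo or num_buckets < 1:
        raise ValueError("need hi > lo and at least one bucket")
    clamped = min(max(metric, lo), hi)           # keep out-of-range values in bounds
    index = int((clamped - lo) / (hi - lo) * num_buckets)
    return min(index, num_buckets - 1)           # the top edge falls in the last bucket
```

With three buckets, indices 0, 1 and 2 could be labeled low, medium and high motion, so a metric of 0.5 would be classified as medium under these assumptions.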

Once a motion estimate and brightness, contrast and color estimate exist for all video files located in an index of multimedia content, it is possible for users to execute search queries such as:

"find me all action packed videos"

"find me all dramas and talk shows"

If the digital video information is indexed in a database together with auxiliary text-based information, then it is possible to execute queries such as:

"find me all action packed videos of James Bond from 1967"

"find me all talk shows with Bill Clinton and Larry King from 1993"

Combining motion with other associated video file parameters, users can execute queries such as:

"find me all slow moving, black and white movies made by Martin Scorcese"

"find me all dark action movies filmed in Zimbabwe"

The described method for estimating motion content and brightness, contrast and color can be used together with the described algorithm for searching the worldwide Internet in order to index and intelligently tag digital multimedia content. The described method allows for powerful searching based on information signals stored inside the content within very large multimedia databases. Once an index of multimedia information exists which includes a motion metric and brightness, contrast and color estimate, users can perform field based sorting of multimedia databases. For example, a user could execute the query: find me all video, from slow moving to fast, by Steven Spielberg, and the database engine would return a list of search results, ordered from slowest to fastest within the requested motion range. In addition, if the digital video file is associated with a digital audio sequence, then an analysis of the digital audio can occur. An analysis of digital audio could determine if the audio is either music or speech. It can also determine if the speaker is male or female, and other information. This type of information could then be used to allow a user query such as:

"find me all fast video clips which contain loud music";

"find me all action packed movies starring Sylvester Stallone and show me a preview of a portion of the movie where Stallone is talking".

This type of powerful searching of content will become increasingly important, as vast quantities of multimedia information become digitized and moved onto digital networks which are accessible to large numbers of consumer and business users.

The described method, in its preferred embodiment, is relatively fast to compute. Historically, most systems for analyzing video signals have operated in the frequency domain. Frequency domain processing, although potentially more accurate than image based analysis, has the disadvantage of being compute intensive, making it difficult to scan and index a network for multimedia information in a rapid manner.

The described approach of low-cost computation applied to an analysis of motion and brightness, contrast and color has been found to be useful for rapid indexing of large quantities of digital video information when building searchable multimedia databases. Coupled with low-cost computation is the fact that most video files on large distributed networks (such as the Internet) are generally of limited duration. Hence the algorithms described herein can typically be applied to short duration video files in such a way that they can be represented as a single scalar value. This simplifies presentation to the user.

In addition to the image space method described here, an algorithm is presented which works on digital video (such as MPEG) which has already been transformed into a frequency domain representation. In this case, the processing can be done solely by analyzing the frequency domain and motion vector data, without needing to perform the computation moving the images into frequency space.

3.1.1 Degree of Motion Algorithm Details (Image Space)

In order to determine if a given video file contains low, medium or high amounts of motion, it is disclosed to derive a single valued scalar which represents the video data file to a reasonable degree of accuracy. The scalar value, called the motion metric, is an estimate of the type of content found in the video file. The method described here is appropriate for those video files which may be in a variety of different coding formats (such as Vector Quantization, Block Truncation Coding, Intraframe DCT coded), and need to be analyzed in a uniform uncompressed representation. In fact, it is disclosed to decode the video into a uniform representation, since it may be coded in either an intraframe or an interframe coded format. If the video has been coded as intraframe, then the method described here is a scheme for determining the average frame difference for a pixel in a sequence of video. Likewise, for interframe coded sequences, the same metric is determined. This is desirable, even though the interframe coded video has some information about frame-to-frame differences. The reason that the interframe coded video is uncompressed and then analyzed is that different coding schemes produce different types of interframe patterns which may be non-uniform. The disclosed invention is based on three discoveries:

time periods can be compressed into buckets which average visual change activity

the averaged rate of change of image activity gives an indication of overall change

an indication of overall change rate is correlated with types of video content

The indication of overall change has been found to be highly correlated with the type of video information stored in a video file. It has been found through empirical examination that

slow moving video is typically comprised of small frame differences

moderate motion video is typically comprised of medium frame differences

fast moving video is typically comprised of large frame differences

and that,

video content such as talking heads and talk shows are comprised of slow moving video

video content such as newscasts and commercials are comprised of moderate speed video

video content such as sports and action films are comprised of fast moving video

The disclosed method operates generally by accessing a multimedia file and evaluating the video data to determine the visual change activity. An algorithm to compute the motion metric operates as follows:

A. Motion Estimator

if the number of samples N exceeds a threshold T, then repeat the Motion Estimator algorithm below for a set of time periods P=N/T. The value Z computed for each period P is then listed in a table of values.

as an optional preprocessing step, employ an adaptive noise reduction algorithm to remove noise. Apply either a flat field (mean) filter or a stray pixel (median) filter to reduce mild or severe noise, respectively.

if the video file contains RGB samples, then run the algorithm and average the results into a single scalar value to represent the entire sequence

B. Motion Estimator

determine a fixed sampling grid in time consisting of X video frames

if video samples are compressed, then decompress the samples

decompress all video samples into a uniform decoded representation

adjust RGB for contrast (low/med/high)

compute the RGB frame differences for each frame X with its nearest neighbor

sum up all RGB frame differences for each pixel in each frame X

compute the average RGB frame difference for each pixel for each frame X

sum and then average RGB frame differences for all pixels in all frames in a sequence.

the resulting value is the motion metric Z. The motion metric Z is normalized by taking Z-NORMAL=Z*(REF-VAL/MAX-DIFFERENCE) where MAX-DIFFERENCE is the maximum difference for all frames.

map the value Z into one of the following categories

low degree of motion

moderate degree of motion

high degree of motion

very high degree of motion

Using a typical RGB range of 0-255, the categories for the scalar Z map to:

0-20, motion content, low

20-40, motion content, moderate

40-60, motion content, high

60 and above, motion content, very high

A specific example, using actual values, is as follows:

number of video frames X=1000

sample size is 8 bits per pixel, 24 bits for RGB

average frame difference per frame is 15

the sequence is characterized as low motion
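
The image-space steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the described embodiment: frames are assumed to be already decoded into nested lists of (R, G, B) tuples in the 0-255 range, the names `motion_metric` and `motion_category` are hypothetical, and the optional noise reduction, contrast adjustment and normalization steps are omitted. Because the listed ranges share their end points, the sketch assigns a boundary value to the higher category, which is also an assumption.

```python
def motion_metric(frames):
    """Average absolute RGB difference per pixel channel between
    consecutive frames.

    `frames` is a list of frames; each frame is a list of rows; each row
    is a list of (R, G, B) tuples with components in 0-255 (assumed layout).
    """
    if len(frames) < 2:
        return 0.0
    total = 0
    count = 0
    # Compare each frame with its nearest (next) neighbor.
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for px_prev, px_curr in zip(row_prev, row_curr):
                for a, b in zip(px_prev, px_curr):
                    total += abs(a - b)
                    count += 1
    return total / count


def motion_category(z):
    """Map the scalar Z onto the categories listed for a 0-255 RGB range.

    Boundary values (20, 40, 60) go to the higher category (assumption).
    """
    if z < 20:
        return "low"
    if z < 40:
        return "moderate"
    if z < 60:
        return "high"
    return "very high"


# Two 1x2 frames in which every channel differs by 15.
frame_a = [[(10, 10, 10), (100, 100, 100)]]
frame_b = [[(25, 25, 25), (115, 115, 115)]]
z = motion_metric([frame_a, frame_b])
print(z, motion_category(z))  # 15.0 low
```

With an average frame difference of 15, the sketch reproduces the "low motion" characterization of the example above.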

Note that when the number of video frames exceeds the threshold T, then the percentage of each type of motion metric category is displayed. For example, for a video sequence which is one hour long, which may consist of different periods of low, moderate and high motion, the resulting characterization of the video file would appear as follows:

40%, motion content low

10%, motion content moderate

50%, motion content high
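
The per-period characterization above can be illustrated with a short Python sketch. It assumes that a table of Z values, one per time period, has already been computed; the names `classify_motion` and `category_percentages` and the sample values are hypothetical, and boundary values are assigned to the higher category as an assumption.

```python
from collections import Counter


def classify_motion(z):
    """Bucket a per-period motion metric using the 0-255 ranges given above."""
    if z < 20:
        return "low"
    if z < 40:
        return "moderate"
    if z < 60:
        return "high"
    return "very high"


def category_percentages(period_values):
    """Return the percentage of periods that fall into each motion category."""
    if not period_values:
        return {}
    counts = Counter(classify_motion(z) for z in period_values)
    return {cat: 100.0 * n / len(period_values) for cat, n in counts.items()}


# Ten hypothetical per-period Z values for a long sequence.
table = [10, 12, 15, 18, 30, 45, 50, 55, 41, 59]
print(category_percentages(table))
# {'low': 40.0, 'moderate': 10.0, 'high': 50.0}
```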

Once the degree of motion has been computed, it is stored in the index of a multimedia database. This facilitates user queries and searches based on the degree of motion for a sequence, including the ability to provide field based sorting of video clips based on motion estimates.

3.1.2 Degree of Motion Algorithm Details (Frequency Domain)

The method described above is appropriate for those video files which may be in a variety of different coding formats (such as Vector Quantization, Block Truncation Coding, Intraframe DCT coded), and need to be analyzed in a uniform uncompressed representation. The coded representation is decoded and then an analysis is applied in the image space domain on the uncompressed pixel samples. However, some coding formats (such as MPEG) already exist in the frequency domain and can provide useful information regarding motion, without a need to decode the digital video sequence and perform frame differencing averages. In the case of a coding scheme such as MPEG, the data in its native form already contains estimates of motion implicitly (indeed, the representation itself is called motion estimation). The method described here uses the motion estimation data to derive an estimate of motion for a full sequence of video in a computationally efficient manner.

In order to determine if a given video file contains low, medium or high amounts of motion, it is necessary to derive a single valued scalar which represents the video data file to a reasonable degree of accuracy. The scalar value, called the motion metric, is an estimate of the type of content found in the video file. The idea, when applied to MPEG coded sequences, is based on four key principles:

the MPEG coded data contains both motion vectors and motion vector lengths

the number of non-zero motion vectors is a measure of how many image blocks are moving

the length of motion vectors is a measure of how far image blocks are moving

averaging the number and length of motion vectors per frame indicates degrees of motion

The indication of overall motion has been found to be correlated with the type of video information stored in a video file. It has been found through empirical examination that

slow moving video is comprised of few motion vectors and small vector lengths

moderate video is comprised of moderate motion vectors and moderate vector lengths

fast moving video is comprised of many motion vectors and large vector lengths

and that,

video content such as talking heads and talk shows are comprised of slow moving video

video content such as newscasts and commercials are comprised of moderate speed video

video content such as sports and action films are comprised of fast moving video

An algorithm to compute the motion metric may operate as follows:

Motion Estimator (Frequency Domain)

if the number of frames N exceeds a threshold T, then repeat the Motion Estimator algorithm below for a set of time periods P=N/T. The value Z computed for each period P is then listed in a table of values.

Motion Estimator Algorithm

determine a fixed sampling grid in time consisting of X video frames

determine the total number of non-zero motion vectors for each video frame

determine the average number of non-zero motion vectors per coded block

determine the average length of motion vectors per coded block

sum and average the number of non-zero motion vectors per block in a sequence as A

sum and average the length of non-zero motion vectors per block in a sequence as B

compute a weighted average of the two averaged values as Z=W1*A+W2*B

the resulting value is the motion metric Z

map the value Z into one of the following categories

low degree of motion

moderate degree of motion

high degree of motion

very high degree of motion

Note that when the number of video frames exceeds the threshold T, then the percentage of each type of motion metric category is displayed. For example, for a video sequence which is one hour long, which may consist of different periods of low, moderate and high motion, the resulting characterization of the video file would appear as follows:

40%, motion content low

10%, motion content moderate

50%, motion content high
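
The frequency domain computation Z=W1*A+W2*B can be sketched in Python as follows. This is a hedged illustration only: it assumes the motion vectors have already been extracted from the coded stream by some decoder and are supplied as one list of (dx, dy) pairs per frame, one pair per coded block. The function name, the input layout and the equal default weights are assumptions; the source does not give values for W1 and W2 or thresholds for the categories, so no category mapping is shown.

```python
import math


def frequency_domain_motion_metric(frames, w1=0.5, w2=0.5):
    """Compute Z = W1*A + W2*B from per-block motion vectors.

    `frames` is a list of frames; each frame is a list of (dx, dy) motion
    vectors, one per coded block (assumed input layout).
    A: non-zero motion vectors per block, averaged over the frames.
    B: motion vector length per block, averaged over the frames.
    """
    per_frame_counts = []
    per_frame_lengths = []
    for vectors in frames:
        blocks = len(vectors)
        if blocks == 0:
            continue  # skip frames that carry no coded blocks
        lengths = [math.hypot(dx, dy) for dx, dy in vectors]
        nonzero = sum(1 for length in lengths if length > 0)
        per_frame_counts.append(nonzero / blocks)
        per_frame_lengths.append(sum(lengths) / blocks)
    if not per_frame_counts:
        return 0.0
    a = sum(per_frame_counts) / len(per_frame_counts)
    b = sum(per_frame_lengths) / len(per_frame_lengths)
    return w1 * a + w2 * b


# One frame with two blocks: one moving by (3, 4), one static.
moving = [[(3, 4), (0, 0)]]
static = [[(0, 0), (0, 0)]]
print(frequency_domain_motion_metric(moving))  # 1.5
print(frequency_domain_motion_metric(static))  # 0.0
```

In the first example A is 0.5 (one of two blocks moves) and B is 2.5 (a length of 5 averaged over two blocks), giving Z = 0.5*0.5 + 0.5*2.5 = 1.5.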

3.1.3 Brightness, Contrast and Color Algorithm Details

In order to determine if a given video file contains dark, moderate or bright intensities, it is necessary to derive a single valued scalar which represents the brightness information in the video data file to a reasonable degree of accuracy. The scalar value, called the brightness metric, is an estimate of the brightness of content found in the video file. The idea is based on two key principles:

time periods can be compressed into buckets which average brightness activity

the buckets can be averaged to derive an overall estimate of brightness level

By computing the luminance term for every pixel in a frame, and then for all frames in a sequence, and averaging this value, we end up with an average luminance for a sequence.

The same method above can be applied to determining a metric for contrast and color, resulting in a scalar value which represents an average contrast and color for a sequence.
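
The average luminance computation can be sketched in Python as follows. The source does not state how the luminance term is derived from RGB, so this sketch assumes the ITU-R BT.601 luma weights (0.299, 0.587, 0.114); the function name `brightness_metric` and the frame layout (nested lists of (R, G, B) tuples in 0-255) are likewise assumptions.

```python
def brightness_metric(frames):
    """Average luminance over every pixel of every frame in a sequence.

    Luminance per pixel uses BT.601 luma weights, an assumption not
    stated in the source text.
    """
    total = 0.0
    count = 0
    for frame in frames:
        for row in frame:
            for r, g, b in row:
                total += 0.299 * r + 0.587 * g + 0.114 * b
                count += 1
    return total / count if count else 0.0


# A two-frame sequence: one all-black frame and one all-white frame.
black = [[(0, 0, 0), (0, 0, 0)]]
white = [[(255, 255, 255), (255, 255, 255)]]
print(brightness_metric([black]))         # 0.0
print(round(brightness_metric([black, white]), 1))
```

The second call averages a black and a white frame and so lands near the middle of the 0-255 range; bucketing the result into dark, moderate or bright would need thresholds that the source does not specify.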

3.1.4 Search Results Display

Once the motion and brightness level estimates have been determined, the values are displayed to the user in tabular or graphical form. The tabular format would appear as shown below:

Degree of motion: high

Video intensity: bright

The end result is a simple display of two pieces of textual information. This information is very low bandwidth, and yet encapsulates extensive processing and computation on the data set. As a result, users can find the multimedia information more quickly.

3.2 Audio Content

Before reviewing an algorithm used by the disclosed embodiment for analyzing audio files in detail, it is worthwhile to briefly turn to FIG. 3A which provides an overview of the process. A digital audio file is initially analyzed 301 and an initial determination is made whether the file is speech 307 or music 302. If the file is determined to be music, in one embodiment, if the file is "noisy", a noise reduction filter may be applied and the analysis repeated 303. This is because a noisy speech file may be misinterpreted as music. If the file is music, an analysis may be done to determine if the music is fast or slow 304 and an analysis may be done to determine if the music is bass or treble 305 based on a pitch analysis. In the case of speech, an analysis might be done to determine if the speech 308 is fast or slow based on frequency and whether it is male or female 309 based on pitch. By way of example, knowing that a portion of an audio track for a movie starring Sylvester Stallone has a fast, male voice, may be interpreted by retrieval software as indicating that portion of the audio track is an action scene involving Sylvester Stallone. In addition, in certain embodiments, it may be desirable to perform voice recognition analysis to recognize the voice into text 310. In some embodiments, the voice recognition capability may be limited to only recognizing a known voice, while in other more advanced embodiments, omni-voice recognition capability may be added. In either event, the recognized text may be added to the stored information for the media file and be used for searching and retrieval.

3.2.1 Computation of a Music-speech Metric

In order to determine if a given audio file contains music, speech, or a combination of both types of audio, it is disclosed in one embodiment to derive a single valued scalar which represents the audio data file to a reasonable degree of accuracy. The scalar value, called the music-speech metric, is an estimate of the type of content found in the audio file. The idea is based on three key principles:

time periods can be compressed into buckets which average amplitude activity

the averaged rate of change of amplitude activity gives an indication of overall change

an indication of overall amplitude change rate is correlated with types of audio content

The indication of overall change has been found to be highly correlated with the type of audio information stored in an audio file. It has been found through empirical examination that

music is typically comprised of a continuous amplitude signal

speech is typically comprised of a discontinuous amplitude signal

sound effects are typically comprised of a discontinuous amplitude signal

and that,

music signals are typically found to have low rates of change in amplitude activity

speech signals are typically found to have high rates of change in amplitude activity

sound effects are typically found to have high rates of change in amplitude activity

audio comprised of music and speech has moderate rates of change in amplitude activity

Continuous signals are characterized by low rates of change. Various types of music, including rock, classical and jazz are often relatively continuous in nature with respect to the amplitude signal. Rarely does music jump from a minimum to a maximum amplitude. This is illustrated by FIG. 3C which illustrates a typical amplitude signal 330 for music.

Similarly, it is rare that speech results in a continuous amplitude signal with only small changes in amplitude. Discontinuous signals are characterized by high rates of change. For speech, there are often bursty periods of large amplitude interspersed with extended periods of silence of low amplitude. This is illustrated by FIG. 3B which illustrates a typical amplitude signal 320 for speech.

Sometimes speech will be interspersed with music, for example if there is talk over a song. This is illustrated by FIG. 3D which illustrates signal 340 having period 341 which would be interpreted as music, period 342 which would be speech, period 343 music, period 344 speech, period 345 music and period 346 speech.

For sound effects, there are often bursty periods of large amplitude interspersed with bursty periods of low amplitude.

Turning now to FIG. 3E, if the audio file is a compressed file (which may be in any of a number of known compression formats), it is first decompressed using any known decompression algorithm, block 351. An amplitude analysis is then performed on the audio track to provide a music-speech metric value. The amplitude analysis is performed as follows:

The audio track is divided into time segments of a predetermined length, block 352. In the described embodiment, each time segment is 50 ms. However, in alternate embodiments, the time segments may be of a greater or lesser length.

For each segment, a normalized amplitude deviation is computed, block 356. This is described in greater detail with reference to FIG. 3F. First, for each time segment, the maximum amplitude and minimum amplitude are determined, block 351. In the example of FIG. 3B, values range from 0 to 256 (in an alternative embodiment, the values may be based on floating point calculations and may range from 0 to 1.0). For the first interval 321, the maximum amplitude value is shown as 160; for the second interval 322, it is 158; and for the third interval 323, it is 156. Then, the average maximum amplitude and average minimum amplitude are computed for all time intervals, block 352. Again, using the example in FIG. 3B, the average maximum amplitude will be 158. Next, a value MAX-DEV is computed for each interval as the absolute value of the maximum amplitude for the interval minus the average maximum, block 353. For the first interval of FIG. 3B, the MAX-DEV will be 2, for the second interval it will be 0, and for the third interval it will be 2. Finally, the MAX-DEV is normalized by computing MAX-DEV*(REF-VALUE/MAX), where the reference value is 256 in the described embodiment (and may be 1.0 in a floating point embodiment) and MAX is the maximum amplitude for all of the intervals. Thus, for the first interval, the normalized value for MAX-DEV will be 2*(256/160)=3.2. Normalizing the deviation value provides for removing dependencies based on volume differences in the audio files and allows for comparison of files recorded at different volumes.

Finally, the normalized MAX-DEV values for each segment are averaged together, block 357, to determine a music-speech metric. High values tend to indicate speech, low values tend to indicate music and medium values tend to indicate a combination, block 358.
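
The amplitude analysis can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the audio is assumed to be already decompressed into a flat list of non-negative amplitude values in the 0-256 range, the segment length is given in samples rather than milliseconds, and the name `music_speech_metric` is hypothetical. Only the maximum-amplitude deviation is used, as in the MAX-DEV description; the cut-off values separating music, speech and mixed content are not given in the source and are therefore not shown.

```python
def music_speech_metric(samples, segment_len, ref_value=256):
    """Average normalized MAX-DEV over fixed-length segments.

    `samples` is a flat list of non-negative amplitudes (assumed layout);
    `segment_len` is the segment length in samples (the described
    embodiment uses 50 ms segments).
    """
    # Divide the track into segments of a predetermined length.
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    if not segments:
        return 0.0
    maxima = [max(seg) for seg in segments]
    overall_max = max(maxima)
    if overall_max == 0:
        return 0.0  # a silent track has no amplitude deviation
    avg_max = sum(maxima) / len(maxima)
    # MAX-DEV per segment, normalized as MAX-DEV * (REF-VALUE / MAX).
    normalized = [abs(m - avg_max) * (ref_value / overall_max)
                  for m in maxima]
    return sum(normalized) / len(normalized)


# Three two-sample segments whose maxima are 160, 158 and 156.
samples = [160, 10, 158, 20, 156, 5]
print(round(music_speech_metric(samples, segment_len=2), 3))  # 2.133
```

The sample values reuse the maxima 160, 158 and 156 from the FIG. 3B example: the deviations 2, 0 and 2 scale to 3.2, 0 and 3.2, whose average is about 2.133.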

It should be noted that, for efficiency, only a portion of the audio file may be analyzed. For example, N seconds of the audio file may be randomly chosen for analysis. Also, if the audio file contains stereo or quadraphonic samples, then the algorithm described above may be run on each channel, and the results averaged into a single scalar value to represent the entire sequence.

Note also that when the number of samples exceeds the threshold T, then the percentage of each type of music-speech metric category may be computed and displayed. For example, for a soundtrack which is one hour long, which may consist of different periods of silence, music, speech and sound effects, the resulting characterization of the audio file would appear as follows:

40%, music content: high, speech content: low

10%, music content: high, speech content: medium

10%, music content: medium, speech content: medium

10%, music content: medium, speech content: high

30%, music content: low, speech content: high

3.2.2 Volume Algorithm Details

In order to determine if a given audio file contains quiet, soft or loud audio information, it is disclosed to derive a single valued scalar which represents the volume information in the audio data file to a reasonable degree of accuracy. The scalar value, called the volume level metric, is an estimate of the volume of content found in the audio file. The idea is based on three key principles:

time periods can be compressed into buckets which average volu