Claims
We claim:
1. A method for generating an interactive, animated character in the user interface of a computer using a client-server architecture, the method comprising:
in response to a request from a client, creating an instance of a character and displaying the character in the user interface;
in the server, receiving from a client a set of client-specified user input commands that the character will respond to, the set comprising cursor input from a cursor control device;
in the server, monitoring for the specified user input commands;
in the server, when one of the user input commands is detected, sending a notification to the client;
in the server, receiving from the client a request that is conditioned upon the notification from the server; and
in response to the request from the client, playing back a client-specified sequence of animation output to animate the character in the user interface.
2. The method of claim 1 wherein the set of client-specified user input commands further comprises speech input received through a speech recognition engine, the method further comprising:
in response to a request from the client, playing back a client-specified sequence of animation to animate the character in the user interface and generating speech output lip-synched to animation representing a mouth of the character.
3. The method of claim 1 further including:
queuing requests to animate the character from the client when the character is currently playing back an animation;
immediately returning control to the client making the request to animate the character after determining that the character is busy; and
deferring processing of the request until the current animation is complete.
4. The method of claim 1 further including: arbitrating requests to control the character from more than one client.
5. The method of claim 1 wherein the step of creating an instance of a character further comprises registering the client with the server, the method further comprising:
registering a second client with the server;
in the server, receiving from the second client a second set of client-specified user input commands that the character will respond to, the second set comprising cursor input from the cursor control device;
keeping track of clients that have registered with the server;
arbitrating requests to control the character from more than one client; and
terminating the character when no clients are currently registered with the server.
6. The method of claim 1 wherein the step of creating an instance of the character comprises:
starting execution of the server in response to a request from the client;
in the server, registering a notification interface for the client in response to a request from the client; and
in the server, receiving from the client a request telling the server which character to create.
7. The method of claim 1 including:
synchronizing execution of the client with the execution of the character by allowing the client to post notification requests with the server in a first in first out queue used to store animation requests while the character is currently being
animated, and sending a notification from the server to the client when the first notification request is at the top of the queue.
8. The method of claim 7 wherein the notification requests are embedded with text that is synthesized into speech output by the server so that the client can synchronize itself to individual words in the speech output.
9. A computer readable medium on which is stored software for performing the method of claim 1.
10. A client-server animation system for generating interactive animated characters, the system comprising:
an animation server for receiving requests from clients to create a character on the user interface, for controlling playback of a sequence of frames of animation and lip synched speech output from the character on the user interface in response
to requests from the clients, for receiving an identification of cursor device and speech input commands, and for notifying the clients when the server determines that the cursor device input and the speech input commands have been provided by a user;
a speech recognition engine in communication with an audio input device for receiving speech input from the user and for analyzing the speech input to identify the speech input commands; and in communication with the server for sending
notification messages to the server when the speech input commands are detected; and
a speech synthesis engine in communication with an audio output device for generating speech output, and in communication with the server for receiving requests to generate audio output corresponding to a text string provided by the clients via
the server, and for notifying the server when a tag is detected in the text string so that the server can synchronize display of text in the text string with the speech output.
11. The animation system of claim 10 wherein the server includes a queue for queuing requests from clients to play specified sequences of animation of the character; and wherein the server keeps track of which of the clients is currently active
and processes the requests in the queue corresponding to an active client.
12. The animation system of claim 10 wherein the server includes a mouth animation module for receiving notifications from the speech synthesis engine synchronized with speech output of phonemes, and wherein the mouth animation module is
operable to play a frame of animation of a mouth of the character that corresponds to a current phoneme such that animation of the mouth is synchronized with the speech output.
13. The animation system of claim 10 wherein the animation server includes a parser for parsing speech input commands provided by the clients and passing parsed speech input commands to the speech recognition engine.
14. The animation system of claim 10 wherein the animation server includes a regionizer for scanning an animation frame and computing a non-rectangular bounding region for a non-transparent portion of the animation frame in real time as the
sequence of constructed animation frames is played in the user interface on the display monitor; and wherein the animation system includes a region window controller for receiving the non-rectangular bounding region from the regionizer, for creating a
region window on a display screen independent of any other window on the display screen and having a screen boundary in the user interface defined by the non-rectangular bounding region, and for clipping the constructed animation frame to the
non-rectangular bounding region.
15. The animation system of claim 10 including a web browser for retrieving a web page from secondary storage of a local computer or from a remote computer, for parsing the web page to identify an embedded agent object tag, and for starting the
server in response to detecting the embedded agent object tag; wherein the server is responsive to a first script command embedded in the web page to play a first sequence of frames of animation and lip synched speech output from the character on the
user interface, and wherein the server is responsive to a second script command for receiving an identification of a speech input command and for sending notification to a local client representing the web script when the server detects the speech input
command.
16. The system of claim 15 wherein the animation system includes a runtime compiler in communication with the web browser for compiling and executing a script program including the first and second script commands.
17. A method for generating an interactive, animated character in the user interface of a computer using a client-server architecture, the method comprising:
in response to a request from a client, creating an instance of a character and displaying the character in the user interface;
in the server, receiving from a client a set of client-specified user input commands that the character will respond to, the set comprising a speech input command;
in the server, monitoring for the specified user input commands;
in the server, sending a notification to the client when one of the user input commands is detected;
in the server, receiving from the client a request that is conditioned upon the notification from the server; and
in response to the request from the client, playing back a client-specified sequence of animation and speech output to animate the character in the user interface.
18. The method of claim 17 wherein the client is a script embedded in a web page, wherein the script includes a first script command specifying text of the speech input command, wherein the server sends a notification to the client when the
server detects that an end user has spoken the speech input command; and wherein the script includes a second script command requesting lip synched speech output from the server.
19. The method of claim 17 further including:
parsing a web page to identify an embedded script; and
compiling the script to create the client.
20. The method of claim 19 further including:
in the server, processing requests to animate the character and play lip-synched output from the web script client.
21. A computer readable medium on which is stored software for performing the method of claim 17.
Description
TECHNICAL FIELD
The invention relates to user interface design in computers and more specifically relates to animated user interfaces.
BACKGROUND
One way to make the user interface of a computer more user friendly is to incorporate natural aspects of human dialog into the user interface design. User interfaces that attempt to simulate social interaction are referred to as social
interfaces.
An example of this type of interface is the user interface of a program called Bob from Microsoft Corporation. Bob uses a social interface with animated characters that assist the user by providing helpful tips as the user navigates through the
user interface. The Bob program exposes a number of user interface services to application programs including an actor service, a speech balloon service, a tracking service and a tip service.
The actor service plays animated characters in response to an animation request from an application. This service allows applications to play animated characters to get the user's attention and help the user navigate through the user interface.
To make the character appear as if it is conversing with the user, the application can use the speech balloon service to display text messages in a graphical object that looks like a cartoon-like speech balloon. Applications can use the speech balloon
service to display a special kind of text message called a "tip" that gives the user information about how to operate the program. In the Bob user interface environment, the application program is responsible for monitoring for user input events that
trigger tips. In response to detecting an event, the application passes it to the tracking service, which determines whether a tip should be displayed. One function of the tracking service is to avoid bothering the user by displaying too many tips. To
prevent this, the tracking service counts the number of occurrences of an event and prevents the display of a tip after a given number of occurrences. The tracking service tells the tip service whether to initiate the display of a tip. When a tip is to
be displayed, the tip service provides information about the tip to the application so that it can display an appropriate text message in a speech balloon.
While the Bob program does provide a number of helpful user interface features, it has a number of limitations. One of the significant limitations is that the animated characters must be displayed within the window of a single host application.
Specifically, the animation must be displayed within the window of a host Bob application program where the background image of the window is known. This is a significant limitation because the animation is confined within the window of a single
application program.
Another important limitation of the animated characters in the Bob program is that they have no speech input or output capability. Speech input and output capability makes a user interface much more engaging to the user.
Speech synthesis and recognition software is commercially available. Microsoft Corporation has defined an application programming interface (API) called SAPI (Speech Application Programming Interface), and a number of companies have created
implementations of this interface. The purpose of SAPI is to provide speech services that application developers can incorporate into their programs by invoking functions in SAPI.
Despite the availability of speech services provided in SAPI compliant speech engines, there are a number of difficult design issues in developing interactive user interface characters that support speech input and output. One difficulty is
determining how the interactive animation services will be exposed to application programs. In many applications with interactive animation, such as games for example, the application must provide and control its own user interface. This increases the
complexity of the application program and prevents sharing of animation and input/output services among application programs.
A related difficulty with interactive animation is determining how to incorporate it into Internet applications. The content of a web page preferably should be small in size so that it is easy to download, it should be secure, and it should be
portable. These design issues make it difficult to develop interactive animation for Web pages on the Internet.
SUMMARY OF THE INVENTION
The invention provides a client-server animation system used to display interactive, animated user interface characters with speech input and output capability. One aspect of the invention is an animation server that makes a number of animation
and speech input and output services available to clients (e.g., application programs, Web page scripts, etc.). Another aspect of the invention is the way in which the clients can specify input commands including both speech and cursor device input for
the character, and can request the server to play animation and speech output to animate the character. The animated output can combine both speech and animation such that the mouth of a user interface character is lip-synched to the speech output. The
animation server exposes these services through an application programming interface accessible to applications written in conventional programming languages such as C and C++, and through a high level interface accessible through script languages. This
high level interface enables programmers to embed interactive animation with speech input and output capability in Web pages.
One implementation of the animation system comprises an animation server, speech synthesis engine, and a speech recognition engine. The speech synthesis engine converts text to digital audio output in response to requests from the animation
server. The speech recognition engine analyzes digitized audio input to identify words or phrases selected by the animation server.
The animation server exposes its animation and speech input/output services to clients through a programming interface. The server's interface includes methods such as Play(name of animation) or Speak(text string) that enable the clients to make
requests to animate a user interface character. The server constructs each frame of animation and controls the display of the animation in the user interface. To support lip-synched speech output, the server includes a mouth animation module that
receives notification from the speech synthesis engine when it is about to output a phoneme. In response to this notification, it maps a frame of animation representing the character's mouth position to the phoneme that is about to be played back.
Clients specify the speech or cursor input that a character will respond to through a command method in the server's interface. The server monitors input from the operating system (cursor device input) and the speech recognition engine (speech
input) for this input. When it detects input from the user that a client has requested notification of, it sends a notification to that client. This feature enables the client to tell the server how to animate the user interface character in response
to specific types of input. The server enables multiple clients to control a single user interface character by allowing one client to be active at a time. The end user and clients can make themselves active.
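The request and notification flow summarized above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name `AnimationServer`, the method names, and the `"speech:..."` / `"cursor:..."` input labels are hypothetical, and the sketch assumes that only the active client is notified and that the first client to register becomes active by default.

```python
from collections import defaultdict


class AnimationServer:
    """Toy model of the server's client-facing behavior (names are illustrative)."""

    def __init__(self):
        self.commands = defaultdict(set)  # client id -> inputs it asked about
        self.active_client = None         # only one client is active at a time
        self.notifications = []           # (client id, input) pairs sent so far
        self.played = []                  # record of Play/Speak requests received

    def register_commands(self, client, inputs):
        """A client names the speech or cursor input it wants to hear about."""
        self.commands[client].update(inputs)
        if self.active_client is None:    # assumption: first client starts active
            self.active_client = client

    def activate(self, client):
        """The end user or a client makes a client active."""
        self.active_client = client

    def on_user_input(self, user_input):
        """Called when the OS or the speech recognition engine reports input."""
        client = self.active_client
        if client is not None and user_input in self.commands[client]:
            self.notifications.append((client, user_input))
            return client
        return None

    def play(self, client, animation):
        self.played.append((client, "Play", animation))

    def speak(self, client, text):
        self.played.append((client, "Speak", text))


server = AnimationServer()
server.register_commands("tour_guide", {"speech:hello", "cursor:click"})
server.register_commands("help_app", {"speech:help"})

# The client reacts to the notification by requesting animation and speech.
notified = server.on_user_input("speech:hello")
if notified == "tour_guide":
    server.play("tour_guide", "Wave")
    server.speak("tour_guide", "Hello there!")
```

In this sketch the server never decides what the character does. It only reports input that a client asked about, and the client chooses the animation in response.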
The animation system outlined above has a number of advantages. It enables one or more clients to create an engaging user interface character that actually converses with the user and responds to specific input specified by the client. Clients
do not have to have complex code to create animation and make an interactive interface character because the server exposes services in a high level interface. This is advantageous for web pages because a web page can include an interactive character
simply by adding a reference to the agent server and high level script commands that specify input for the character and request playback of animation and lip-synched speech to animate the character.
Further features and advantages of the invention will become apparent from the following detailed description and accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a general block diagram of a computer that serves as an operating environment for the invention.
FIG. 2 is a screen shot illustrating an example of an animated character located on top of the user interface in a windowing environment.
FIG. 3 is a diagram illustrating the architecture of an animation system in one implementation of the invention.
FIG. 4 is a flow diagram illustrating how the animation server in FIG. 3 plays an animation.
FIG. 5 illustrates an example of the animation file structure.
FIG. 6 is a flow diagram illustrating a method used to retrieve image data to construct a current frame of animation.
FIG. 7 is a flow diagram illustrating the process for obtaining the bounding region of an arbitrary shaped animation.
FIG. 8 is a diagram illustrating an example of a COM server and its relationship with an instance of object data.
FIG. 9 is a conceptual diagram illustrating the relationship between a COM object and a user of the object (such as a client program).
FIG. 10 illustrates the relationship among the different types of objects supported in the animation server.
FIG. 11 is a diagram of a web browsing environment illustrating how interactive, animated user interface characters can be activated from Web pages.
DETAILED DESCRIPTION
Computer Overview
FIG. 1 is a general block diagram of a computer system that serves as an operating environment for the invention. The computer system 20 includes as its basic elements a computer 22, one or more input devices 28, including a cursor control
device, and one or more output devices 30, including a display monitor. The computer 22 has at least one high speed processing unit (CPU) 24 and a memory system 26. The input and output devices, memory system and CPU are interconnected and communicate
through at least one bus structure 32.
The CPU 24 has a conventional design and includes an ALU 34 for performing computations, a collection of registers 36 for temporary storage of data and instructions, and a control unit 38 for controlling operation of the system 20. The CPU 24
may be a processor having any of a variety of architectures, including Alpha from Digital; MIPS from MIPS Technology, NEC, IDT, Siemens, and others; x86 from Intel and others, including Cyrix, AMD, and Nexgen; and the PowerPC from IBM and Motorola.
The memory system 26 generally includes high-speed main memory 40 in the form of a medium such as random access memory (RAM) and read only memory (ROM) semiconductor devices, and secondary storage 42 in the form of long term storage mediums such
as floppy disks, hard disks, tape, CD-ROM, flash memory, etc. and other devices that store data using electrical, magnetic, optical or other recording media. The main memory 40 also can include video display memory for displaying images through a
display device. The memory 26 can comprise a variety of alternative components having a variety of storage capacities.
The input and output devices 28, 30 are conventional peripheral devices coupled to or installed within the computer. The input device 28 can comprise a keyboard, a cursor control device such as a mouse or trackball, a physical transducer (e.g.,
a microphone), etc. The output device 30 shown in FIG. 1 generally represents a variety of conventional output devices typically provided with a computer system such as a display monitor, a printer, a transducer (e.g., a speaker), etc. Since the
invention relates to computer generated animation and speech input and output services, the computer must have some form of display monitor for displaying this animation, a microphone and analog to digital converter circuitry for converting sound to
digitized audio, and speakers and digital to analog converter circuitry for converting digitized audio output to analog sound waves.
For some devices, the input and output devices actually reside within a single peripheral. Examples of these devices include a network adapter card and a modem, which operate as input and output devices.
It should be understood that FIG. 1 is a block diagram illustrating the basic elements of a computer system; the figure is not intended to illustrate a specific architecture for a computer system 20. For example, no particular bus structure is
shown because various bus structures known in the field of computer design may be used to interconnect the elements of the computer system in a number of ways, as desired. CPU 24 may be comprised of a discrete ALU 34, registers 36 and control unit 38 or
may be a single device in which one or more of these parts of the CPU are integrated together, such as in a microprocessor. Moreover, the number and arrangement of the elements of the computer system may be varied from what is shown and described in
ways known in the computer industry.
Animation System Overview
FIG. 2 is a screen shot illustrating an example of an animated character located on top of the user interface in a windowing environment. This screen shot illustrates one example of how an implementation of the invention creates arbitrary shaped
animation that is not confined to the window of a hosting application. The animated character 60 can move anywhere in the user interface. In this windowing environment, the user interface, referred to as the "desktop," includes the shell 62 of the
operating system as well as a couple of windows 64, 66 associated with currently running application programs. Specifically, this example includes an Internet browser application running in one window 64 and a word processor application running in a
second window 66 on the desktop of the Windows.RTM. 95 Operating System.
The animated character moves on top of the desktop and each of the windows of the executing applications. As the character moves about the screen, the animation system computes the bounding region of the non-transparent portion of the animation
and generates a new window with a shape to match this bounding region. This gives the appearance that the character is independent from the user interface and each of the other windows.
To generate an animation like this, the animation system performs the following steps:
1) loads the bitmap(s) for the current frame of animation;
2) constructs a frame of animation from these bitmaps (optional depending on whether the frame is already constructed at authoring time);
3) computes the bounding region of the constructed frame in real time;
4) sets a window region to the bounding region of the frame; and
5) draws the frame into the region window.
The bounding region defines the non-transparent portions of a frame of animation. A frame in an animation is represented as a rectangular area that encloses an arbitrary shaped animation. The pixels that are located within this rectangular area but do
not form part of the arbitrary-shaped animation are transparent in the sense that they will not occlude or alter the color of the corresponding pixels in the background bitmap (such as the desktop in the Windows.RTM. Operating System) when combined with
it. The pixels located in the arbitrary animation are non-transparent and are drawn to the display screen so that the animation is visible in the foreground.
The bounding region defines the area occupied by non-transparent pixels within the frame, whether they are a contiguous group of pixels or disjoint groups of contiguous pixels. For example, if the animation were in the shape of a red doughnut
with a transparent center, the bounding region would define the red pixels of the doughnut as groups of contiguous pixels that comprise the doughnut, excluding the transparent center. If the animation comprised a football and goalposts, the bounding
region would define the football as one or more groups of contiguous pixels and the goalposts as one or more groups of contiguous pixels. The bounding region is capable of defining non-rectangular shaped animation including one or more transparent holes
and including more than one disjoint group of pixels.
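As a rough illustration of how a bounding region with holes and disjoint pieces might be computed, the sketch below scans a frame row by row and emits one rectangle per horizontal run of non-transparent pixels. It assumes a frame is a list of rows of pixel values, that the value `0` marks a transparent pixel, and that a region can be represented as a list of rectangles. The function names are hypothetical, and this is not necessarily the scanning method the implementation uses.

```python
def bounding_region(frame, transparent=0):
    """Return (left, top, right, bottom) rectangles, one per horizontal run of
    non-transparent pixels. The right and bottom edges are exclusive."""
    rects = []
    for y, row in enumerate(frame):
        x = 0
        while x < len(row):
            if row[x] == transparent:
                x += 1
                continue
            start = x
            while x < len(row) and row[x] != transparent:
                x += 1
            rects.append((start, y, x, y + 1))
    return rects


def covers(rects, x, y):
    """True if pixel (x, y) lies inside any rectangle of the region."""
    return any(l <= x < r and t <= y < b for l, t, r, b in rects)


# A small "doughnut": non-transparent ring around a transparent center.
doughnut = [
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 1, 0],
]
doughnut_region = bounding_region(doughnut)

# Two disjoint shapes on one row, like the football and the goalposts.
two_shapes = [[1, 0, 0, 1]]
two_shapes_region = bounding_region(two_shapes)
```

The doughnut's center pixel at (2, 1) falls outside every rectangle, so the hole stays transparent, and the two disjoint shapes simply become separate rectangles in the same list.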
Once computed, the bounding region can be used to set a region window, a non-rectangular window capable of clipping input and output to the non-transparent pixels defined by the bounding region. Region windows can be implemented as a module of
the operating system or as a module outside of the operating system. Preferably, the software module implementing region windows should have access to input events from the keyboard and cursor positioning device and to the other programs using the
display screen so that it can clip input and output to the bounding region for each frame. The Windows.RTM. Operating System supports the clipping of input and output to region windows as explained further below.
The method outlined above for drawing non-rectangular animation can be implemented in a variety of different types of computer systems. Below we describe an implementation of the invention in a client-server animation system. However the basic
principles of the invention can be applied to different software architectures as well.
FIG. 3 is a general block diagram illustrating the architecture of a client-server animation system. The animation system includes an animation server 100, which controls the playback of animation, and one or more clients 102-106, which request
animation services from the server. During playback of the animation, the server relies on graphic support software in the underlying operating system 120 to create windows, post messages for windows, and paint windows.
In this specific implementation, the operating system creates and clips input to non-rectangular windows ("region windows"). To show this in FIG. 3, part of the operating system is labeled, "region window controller" (see item 122). This is the
part of the operating system that manages region windows. The region window controller 122 creates a region window having a boundary matching the boundary of the current frame of animation. When the system wants to update the shape of a region window,
the regionizer specifies the bounding region of the current frame to the operating system. The operating system monitors input and notifies the server of input events relating to the animation.
The services related to the playback of animation are implemented in four modules: 1) the sequencer 108; 2) the loader 110; 3) the regionizer 112; and 4) the mouth animation module 114. The sequencer module 108 is responsible for determining which
bitmap to display at any given time along with its position relative to some fixed point on the display.
The loader module 110 is responsible for reading the frame's bitmap from some input source (either a computer disk file or a computer network via a modem or network adapter) into memory. In cases where the bitmap is compressed, the loader module
is also responsible for decompressing the bitmap into its native format. There are a variety of known still image compression formats, and the decompression method, therefore, depends on the format of the compressed bitmap.
The regionizer module 112 is responsible for generating the bounding region of the frame, setting it as the clipping region of the frame's hosting region window and then drawing the frame into the region. In slower computers, it is not feasible
to generate the bounding region as frames are constructed and played back. Therefore, in this implementation the regionizer also supports the loading of bounding region information in cases where it is precomputed and stored along with the frame data in
the animation file.
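The regionizer's two paths (use a stored region when the animation file supplies one, otherwise compute it) and its clipped drawing step can be sketched as follows. This is a minimal model under stated assumptions: a frame is a dictionary with a `"pixels"` grid and an optional `"region"` list of rectangles, `0` marks a transparent pixel, and the "window" is just a grid of background pixels. All names are hypothetical.

```python
def compute_region(pixels, transparent=0):
    """Fallback path: one 1x1 rectangle per non-transparent pixel (simple, not fast)."""
    return [(x, y, x + 1, y + 1)
            for y, row in enumerate(pixels)
            for x, value in enumerate(row)
            if value != transparent]


def regionize(frame):
    """Use the region stored with the frame if there is one; otherwise compute it."""
    stored = frame.get("region")
    return stored if stored is not None else compute_region(frame["pixels"])


def draw_clipped(frame, background, left, top):
    """Copy frame pixels into the background, but only inside the frame's region."""
    for l, t, r, b in regionize(frame):
        for y in range(t, b):
            for x in range(l, r):
                background[top + y][left + x] = frame["pixels"][y][x]
    return background


# Same pixels twice: once with a region precomputed at authoring time, once without.
authored = {"pixels": [[0, 7], [7, 7]], "region": [(1, 0, 2, 1), (0, 1, 2, 2)]}
computed = {"pixels": [[0, 7], [7, 7]]}

desktop = [[9] * 3 for _ in range(3)]
draw_clipped(computed, desktop, left=1, top=1)
```

Because drawing is limited to the region, the transparent corner of the frame leaves the background pixel underneath it untouched.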
The mouth animation module 114 is responsible for coordinating speech output with the animation representing a user interface character's mouth. The mouth animation module receives a message from a speech synthesis engine 116 whenever a specific
phoneme is about to be spoken. When the mouth animation module receives this message, it performs a mapping of the specified phoneme to image data stored in an animation mouth data file that corresponds to the phoneme. It is responsible for loading,
decompressing, and controlling the playback of the animation representing the character's mouth.
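The phoneme-to-mouth mapping can be pictured with a small lookup table. The sketch below is illustrative only: the phoneme labels, mouth shape names, and file names are invented for the example, and falling back to a closed mouth for unknown phonemes is an assumption rather than something the description specifies.

```python
# Hypothetical mouth images, keyed by mouth shape.
MOUTH_FRAMES = {
    "closed": "mouth_closed.bmp",
    "open_wide": "mouth_open_wide.bmp",
    "rounded": "mouth_rounded.bmp",
}

# Hypothetical mapping from phoneme labels to mouth shapes.
PHONEME_TO_MOUTH = {
    "m": "closed", "b": "closed", "p": "closed",
    "aa": "open_wide",
    "ow": "rounded", "uw": "rounded",
}


class MouthAnimator:
    """Stand-in for the mouth animation module."""

    def __init__(self):
        self.shown = []  # mouth frames displayed so far, in order

    def on_phoneme(self, phoneme):
        """Called when the speech synthesis engine is about to speak a phoneme."""
        shape = PHONEME_TO_MOUTH.get(phoneme, "closed")  # assumed default
        frame = MOUTH_FRAMES[shape]
        self.shown.append(frame)
        return frame


animator = MouthAnimator()
for phoneme in ["m", "aa", "m", "aa"]:  # roughly the word "mama"
    animator.on_phoneme(phoneme)
```

Because the notification arrives just before each phoneme is output, showing the mapped frame at that moment keeps the mouth in step with the audio.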
The speech synthesis engine 116 is responsible for generating speech output from text. In this implementation, the speech synthesis engine 116 is a SAPI compliant text to speech generator from Centigram Communications Corp., San Jose, Calif.
Other SAPI compliant text to speech generators can be used as well. For example, Lernout and Hauspie of Belgium also makes a SAPI compliant text to speech generator.
The speech recognition engine 118 is responsible for analyzing digitized audio input to identify significant words or phrases selected by the animation server. The animation server defines these words or phrases by defining a grammar of
acceptable phrases. The client specifies this grammar by specifying sequences of words that it wants the system to detect in a text string format. The server also supports a command language that includes boolean operators and allows alternative words. This command language enables the client to specify a word or phrase along with a number of possible alternative or option words to look for in the speech input. The syntax of the command language is described in more detail below.
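To make the idea of alternative and optional words concrete, the sketch below models a command as a sequence of word slots and expands it into the set of phrases it accepts. It deliberately does not use the server's command language syntax. The slot representation, the function names, and the plain text comparison (standing in for a real speech recognition engine) are all assumptions made for illustration.

```python
from itertools import product


def expand(command):
    """Expand a command into every phrase it accepts.

    command is a list of slots; each slot is (alternatives, optional), where
    alternatives is a list of words and optional says the slot may be skipped.
    """
    choices = []
    for alternatives, optional in command:
        options = list(alternatives)
        if optional:
            options.append("")  # empty string means "slot skipped"
        choices.append(options)
    return {" ".join(word for word in combo if word) for combo in product(*choices)}


def match(spoken, commands):
    """Return the name of the first command that accepts the spoken text, if any."""
    text = " ".join(spoken.lower().split())
    for name, command in commands.items():
        if text in expand(command):
            return name
    return None


commands = {
    "open_file": [
        (["please"], True),            # optional word
        (["open", "load"], False),     # alternative words
        (["the"], True),
        (["file", "document"], False),
    ],
}
phrases = expand(commands["open_file"])
```

With this command, both "please open the file" and "load document" map to `open_file`, while "close file" matches nothing.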
The speech recognition engine used in this implementation is a SAPI compliant speech recognition engine made by Microsoft Corporation. A suitable alternative speech recognition engine is available from Lernout and Hauspie of Belgium.
The operating system in this implementation is the Windows.RTM. 95 operating system from Microsoft Corporation. The application programming interface for the operating system includes two functions used to create and control region windows.
These functions are:
1) SetWindowRgn; and
2) GetWindowRgn
SetWindowRgn
The SetWindowRgn function sets the window region of a rectangular host window. The window region is an arbitrary shaped region on the display screen defined by an array of rectangles. These rectangles describe the rectangular regions of pixels
in the host window that the window region covers.
The window region determines the area within the host window where the operating system permits drawing. The operating system does not display any portion of a window that lies outside of the window region.
______________________________________
int SetWindowRgn(
  HWND hWnd,     // handle to window whose window region is to be set
  HRGN hRgn,     // handle to region
  BOOL bRedraw   // window redraw flag
);
______________________________________
Parameters
hWnd
Handle to the window whose window region is to be set.
hRgn
Handle to a region. The function sets the window region of the window to this region. If hRgn is NULL, the function sets the window region to NULL.
bRedraw
Boolean value that specifies whether the operating system redraws the window after setting the window region. If bRedraw is TRUE, the operating system does so; otherwise, it does not.
Typically, the program using region windows will set bRedraw to TRUE if the window is visible.
Return Values
If the function succeeds, the return value is nonzero.
If the function fails, the return value is zero.
Remarks
If the bRedraw parameter is TRUE, the system sends the WM_WINDOWPOSCHANGING and WM_WINDOWPOSCHANGED messages to the window.
The coordinates of a window's window region are relative to the upper-left corner of the window, not the client area of the window. After a successful call to SetWindowRgn, the operating system owns the region specified by the region handle hRgn. The operating system does not make a copy of the region. Thus, the program using region windows should not make any further function calls with this region handle. In particular, it should not close this region handle.
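The notion of a window region as an array of rectangles that bounds where drawing is permitted can be illustrated with a minimal, portable C sketch. The `Rect` and `WindowRegion` types and the `region_contains` function are hypothetical names introduced here for illustration; they are not the operating system's region types or functions, and the sketch assumes that each rectangle's right and bottom edges are exclusive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not the operating system's types. */
typedef struct {
    int left, top, right, bottom;   /* assumed: right and bottom are exclusive */
} Rect;

typedef struct {
    const Rect *rects;   /* rectangles relative to the window's upper-left corner */
    size_t count;        /* number of rectangles in the array */
} WindowRegion;

/* Returns 1 if pixel (x, y) lies inside any rectangle of the region,
 * i.e. a location where drawing would be permitted; returns 0 otherwise. */
int region_contains(const WindowRegion *region, int x, int y)
{
    assert(region != NULL);
    for (size_t i = 0; i < region->count; i++) {
        const Rect *r = &region->rects[i];
        if (x >= r->left && x < r->right && y >= r->top && y < r->bottom)
            return 1;
    }
    return 0;
}
```

A pixel outside every rectangle of the region corresponds to a portion of the host window that is not displayed.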
GetWindowRgn
The GetWindowRgn function obtains a copy of the window region of a window. The window region of a window is set by calling the SetWindowRgn function.
______________________________________
int GetWindowRgn(
    HWND hWnd,   // handle to window whose window region is to be obtained
    HRGN hRgn    // handle to region that receives a copy of the window region
);
______________________________________
Parameters
hWnd
Handle to the window whose window region is to be obtained.
hRgn
Handle to a region. This region receives a copy of the window region.
Return Values
The return value specifies the type of the region that the function obtains. It can be one of the following values:
______________________________________
Value           Meaning
______________________________________
NULLREGION      The region is empty.
SIMPLEREGION    The region is a single rectangle.
COMPLEXREGION   The region is more than one rectangle.
ERROR           An error occurred; the region is unaffected.
______________________________________
Comments
The coordinates of a window's window region are relative to the upper-left corner of the window, not the client area of the window.
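The copy semantics and return values described for GetWindowRgn can be modeled with a short, portable C sketch. This is not the operating system's implementation: `copy_region`, the `RGN_`-prefixed enumerators, and the `Rect` type are hypothetical names chosen for illustration, and classifying the region purely by its rectangle count is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the region types listed in the table above. */
typedef enum {
    RGN_ERROR = 0,
    RGN_NULLREGION,
    RGN_SIMPLEREGION,
    RGN_COMPLEXREGION
} RegionType;

typedef struct { int left, top, right, bottom; } Rect;

/* Copies src_count rectangles into dst and reports the kind of region copied.
 * On error, dst and *dst_count are left unaffected. */
RegionType copy_region(const Rect *src, size_t src_count,
                       Rect *dst, size_t dst_capacity, size_t *dst_count)
{
    assert(dst_count != NULL);
    if ((src == NULL && src_count > 0) || dst == NULL || src_count > dst_capacity)
        return RGN_ERROR;
    for (size_t i = 0; i < src_count; i++)
        dst[i] = src[i];
    *dst_count = src_count;
    if (src_count == 0)
        return RGN_NULLREGION;      /* the region is empty */
    if (src_count == 1)
        return RGN_SIMPLEREGION;    /* a single rectangle */
    return RGN_COMPLEXREGION;       /* more than one rectangle */
}
```

A caller would typically check for the error value before using the copied rectangles, since the destination is unchanged in that case.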
The region window controller shown in FIG. 3 corresponds to the software in the operating system that supports the creation of region windows and the handling of messages that correspond to region windows.
In this implementation, the speech recognition engine and the speech synthesis engine communicate with an audio input and output device such as a sound card according to the SAPI specification from Microsoft. In compliance with SAPI, these engines interact with an audio device through software representations of the audio device referred to as multimedia audio objects, audio sources (which provide input to the speech recognition engine) and audio destinations (which mediate output from the speech synthesis engine). The structure and operation of this software representation are described in detail in the SAPI specification available from Microsoft.
In the next two sections, we describe two alternative implementations of the animation system shown in FIG. 3. Both implementations generate arbitrary shaped animation and can compute the arbitrary shaped region occupied by non-transparent pixels of a frame in real time. However, the manner in which each system computes and stores this region data varies. Specifically, since it is not computationally efficient to re-compute the region data for every frame, these systems use varying methods for caching region data. The advantages of each approach are summarized following the description of the second implementation.
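The general idea of deriving a region from the non-transparent pixels of a frame, and caching the result so that it is not re-computed each time the frame is shown, can be sketched in portable C as follows. This is only an illustrative sketch and does not represent either implementation described in the following sections: the row-by-row run scan, the per-frame cache keyed by frame index, and all names (`compute_region`, `cached_region`, `RegionCache`, `MAX_FRAMES`) are assumptions, as is the use of 8-bit pixels with a single transparent value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int left, top, right, bottom; } Rect;  /* right/bottom exclusive */
typedef struct { Rect *rects; size_t count; } Region;

/* Scans a w x h frame of 8-bit pixels. Each horizontal run of pixels that
 * differ from `transparent` becomes a rectangle one pixel high.
 * Returns 0 on success, -1 if memory cannot be allocated. */
int compute_region(const unsigned char *pixels, int w, int h,
                   unsigned char transparent, Region *out)
{
    assert(pixels != NULL && out != NULL && w > 0 && h > 0);
    /* Worst case: alternating pixels give (w + 1) / 2 runs per row. */
    size_t max_rects = (size_t)h * (size_t)((w + 1) / 2);
    Rect *rects = malloc(max_rects * sizeof *rects);
    if (rects == NULL)
        return -1;
    size_t n = 0;
    for (int y = 0; y < h; y++) {
        int x = 0;
        while (x < w) {
            while (x < w && pixels[y * w + x] == transparent)
                x++;                              /* skip transparent pixels */
            if (x == w)
                break;
            int start = x;
            while (x < w && pixels[y * w + x] != transparent)
                x++;                              /* extend the opaque run */
            rects[n++] = (Rect){ start, y, x, y + 1 };
        }
    }
    out->rects = rects;
    out->count = n;
    return 0;
}

#define MAX_FRAMES 64   /* hypothetical limit for this sketch */

typedef struct {
    Region regions[MAX_FRAMES];
    int valid[MAX_FRAMES];   /* nonzero once a frame's region is cached */
    int computations;        /* number of times a region was computed */
} RegionCache;

/* Returns the cached region for `frame`, computing it on first use only.
 * Returns NULL if the region could not be computed. */
const Region *cached_region(RegionCache *cache, int frame,
                            const unsigned char *pixels, int w, int h,
                            unsigned char transparent)
{
    assert(cache != NULL && frame >= 0 && frame < MAX_FRAMES);
    if (!cache->valid[frame]) {
        if (compute_region(pixels, w, h, transparent, &cache->regions[frame]) != 0)
            return NULL;
        cache->valid[frame] = 1;
        cache->computations++;
    }
    return &cache->regions[frame];
}
```

In this sketch a second request for the same frame returns the stored rectangles without rescanning the bitmap, which is the cost that caching is meant to avoid.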
First Implementation of the Animation System
FIG. 4 is a flow diagram illustrating how the animation server plays an animation. First, the animation data file is opened via the computer's operating system as shown in step 150. The animation data file includes an animation header block and a series of bitmaps that make up each of the frames in the animation. Once the operating system has opened the file, the loader module 108 reads the animation header block to get all