WikiPatents - Community Patent Review
Apparatus for integrally controlling audio and video signals in real time and multi-site communication control method    
United States Patent 5548346
Link to this page: http://www.wikipatents.com/5548346.html
Inventor(s): Mimura; Itaru (Sayama, JP); Ueda; Hirotada (Kokubunji, JP); Sumino; Shigeo (Chofu, JP); Ikezawa; Mitsuru (Kodaira, JP); Suzuki; Toshiaki (Musashino, JP); Kinoshita; Taizo (Tachikawa, JP); Tada; Katsumi (Yokohama, JP)
Abstract: A real time visual communication system capable of improving a correspondence between a received video signal and a received audio signal in real time and improving reality. An AV signal is separated into a video signal and an audio signal, and the output state of the video or audio signal is controlled by the characteristics of the audio or video signal. For example, the sound field, reverberation, and the like are controlled in accordance with the characteristics of the video signal. A suitable image pickup unit is selected in accordance with the characteristics of the audio signal to make the sights of conversation participants coincide with each other. It is possible to reproduce sounds of audio signals well matching video signals and to provide visual communication having good reality because of the combination of matched audio and video signals.



Owner/Assignee     Hitachi, Ltd. (Tokyo, JP)
Publication Date     August 20, 1996
Application Number     08/336,646
Filing Date     November 4, 1994
US Classification     348/738 348/14.1 348/462 348/484 348/722
Int'l Classification     H04N 005/60
Examiner     Metjahic; Safet
Assistant Examiner     Hsia; Sherrie
Attorney/Law Firm     Antonelli, Terry, Stout & Kraus
Priority Data     Nov 05, 1993 [JP] 5-276477; Dec 06, 1993 [JP] 5-305129
USPTO Field of Search     348/15 348/462 348/465 348/473 348/484 348/485 348/480 348/481 348/482 348/483 348/722 348/738 348/632 348/633 348/515
Patent Tags     integrally controlling audio video signals real time multi-site communication control
   
 References
 U.S. References
5389976     Miyagawa     Feb, 1995
4907082     Richards     348/485     Mar, 1990
4964162     McAdam     380/215     Dec, 1969
 Claims
 


What is claimed is:

1. An apparatus for integrally controlling in real time an audio signal and a video signal transmitted in real time, comprising:

a separator for receiving a video signal and an audio signal synchronous with said video signal and separating the received signals into said audio and video signals;

a display unit for displaying said video signal;

a sound output unit for outputting a sound of said audio signal; and

a control means for processing and controlling the output state of said audio signal in accordance with said video signal, said processing and controlling being conducted before said outputting of said audio signal.

2. An apparatus according to claim 1, wherein said control means includes a video analyzer for analyzing said video signal, and a table for storing a relationship between an output from said video analyzer and the output state of said audio signal, whereby said sound output unit is controlled by an output from said table.

3. An apparatus according to claim 1, wherein said control means includes a video analyzer for analyzing said video signal and detecting a discontinuity of said video signal, and a table for storing a relationship between an output from said video analyzer and the output state of said audio signal, whereby said sound output unit is controlled by an output from said table.

4. An apparatus according to claim 1, wherein said control means includes an audio signal analyzer for analyzing said audio signal, and a table for storing a relationship between an output from said audio signal analyzer and the output state of said video signal, whereby said display unit is controlled by an output from said table.

5. An apparatus according to claim 1, wherein said control means includes a means for controlling to change said video signal to an icon and display said icon on said display unit, and a table for storing a relationship between a level of said audio signal and a display size of said icon.

6. An apparatus according to claim 1, wherein said control means includes a means for controlling to change said video signal to an icon and display said icon on said display unit, and a table for storing a relationship between a level of said audio signal and a display color of said icon.

7. An apparatus for integrally controlling an audio signal and a video signal in real time, comprising:

a separator for receiving a video signal and an audio signal synchronous with said video signal and separating the received signals into said audio and video signals;

a display unit for displaying said video signal;

a sound output unit for outputting a sound of said audio signal;

a control means for controlling the output state of one of said audio and video signals in accordance with the other of said audio and video signals;

a microphone; and

an image pickup means;

wherein a composite signal of said video signal and said audio signal synchronous with said video signal is received via a network interconnecting communication terminals at other sites, and said control means includes a correlation analyzing means for analyzing a correlation between said audio signal supplied from said network and said audio signal obtained from said microphone, and controls said display unit, said sound output unit, and said image pickup means, in accordance with an output from said correlation analyzing means.

8. An apparatus according to claim 7, wherein said control means controls an image pickup angle of said image pickup means.

9. An apparatus according to claim 7, further comprising a plurality of sound output units, wherein said control means controls a balance of reproduced sounds of the plurality of sound output units to orientate a sound field to a display screen area at which said video signal synchronizing said audio signal having a largest correlation is displayed.

10. An apparatus according to claim 9, wherein said display unit displays said composite signal of said video signal and said audio signal synchronous with said video signal in a window, and said control means controls the balance of reproduced sounds of the plurality of sound output units and controls to display, in a different manner from an ordinary state, the window in which said video signal synchronizing said audio signal having a largest correlation is displayed.

11. A multi-site communication method for a multi-site communication system having a plurality of communication terminals at different sites interconnected by a communication network for transmitting an audio signal and a video signal between the communication terminals, wherein correlations between an audio signal generated at one communication terminal and audio signals generated at other communication terminals are analyzed, and a conversation partner of said one communication terminal is identified from said other communication terminals in accordance with a result of an analyzed correlation.

12. A multi-site communication method according to claim 11, wherein said one communication terminal includes a display unit for displaying images received from said other communication terminals at predetermined display positions and a plurality of cameras disposed near said predetermined display positions for taking images of participants at said other communication terminals, and wherein said video signal recorded by a camera near a predetermined display position corresponding to said identified conversation partner is selected and transmitted at least to said communication terminal of said identified conversation partner.

13. A multi-site communication method according to claim 12, wherein a display state of an image is controlled in accordance with identification of said conversation partner.

14. A multi-site communication method according to claim 13, wherein contents of image decoding are controlled in accordance with identification of said conversation partner.

15. A multi-site communication method according to claim 13, wherein contents of image encoding are controlled in accordance with identification of said conversation partner.

16. A multi-site communication method according to claim 14, wherein the image decoding includes a hierarchical decoding scheme, and a hierarchy thereof is changed in accordance with said identification of conversation partner.

17. A multi-site communication method according to claim 11, wherein a reproduction state of audio signal sounds is controlled in accordance with identification of said conversation partner.

18. A communication terminal connected to a plurality of other communication terminals at different sites via a communication network for transmitting and receiving an audio signal and a video signal to and from the plurality of other communication terminals, comprising:

a correlation analyzing means for analyzing correlation between said audio signal to be transmitted to another communication terminal at another site and audio signals received from said other communication terminals at the different sites; and

a conversation partner identifying means for identifying the another communication terminal of a conversation partner in accordance with an output of said correlation analyzing means.

19. A communication terminal according to claim 18, further comprising:

a display unit for displaying images received from said other communication terminals at predetermined display positions;

a plurality of cameras disposed near said predetermined display positions for taking the images of participants at said other communication terminals; and

a video signal selecting means for selecting said video signal recorded by a camera near a predetermined display position corresponding to an identified conversation partner.

20. A communication terminal according to claim 18, further comprising a video controlling means for controlling a display state of an image in accordance with identification of said conversation partner.

21. A communication terminal according to claim 18, further comprising a conversation partner identified result transmitting means for transmitting identification of said conversation partner to said other communication terminals via said communication network.

22. A communication terminal connected via said communication network to a plurality of communication terminals including the communication terminal recited in claim 21, further comprising:

a conversation partner identified result receiving means for receiving the identification of said conversation partner from said communication network; and

a video signal decoding control means for controlling contents of decoding said video signal in accordance with the identification received by said result receiving means.

23. A communication terminal according to claim 18, further comprising a video signal encoding control means for controlling contents of encoding said video signal in accordance with identification of said conversation partner.

24. A communication terminal according to claim 22, wherein said video signal decoding control means changes a hierarchy of a hierarchical decoding scheme in accordance with the identification of said conversation partner.

25. A communication terminal according to claim 18, further comprising a sound controlling means for controlling a sound reproduction in accordance with identification of said conversation partner.
 Description
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an apparatus for integrally controlling audio and video signals for systems such as TV conferencing systems and visual telemetry systems in which audio and video signals transmitted from a spatially remote site are used to reproduce scenes rich in reality. More particularly, the invention relates to an apparatus for integrally controlling audio and video signals by analyzing received video signals and controlling audio signal processing parameters in accordance with the analyzed results.

2. Description of the Related Art

Movies and television have long been in practical use as video systems for transmitting audio and video signals from a spatially remote site. Their techniques are well known, so the details are omitted and only the effects of combining audio and video signals are described here. Basic sound signals for a movie or a television program are recorded at the same time a scene is shot. After shooting, the basic sound signals are repeatedly edited and processed while viewing the scenes to generate audio signals matching the scenes. Editing and processing include adding sound effects and new sounds after recording and adjusting the quality and volume of recorded sounds. An object of editing is to improve reality. It is well known that reality improves if high quality audio signals matching the contents of scenes are used. For example, a movie with a surround stereophonic sound system, in which sound images move following the motion of scene images, provides better reality than a movie with a monophonic sound system.

When audio and video signals of a movie or a television program are transmitted in real time from a spatially remote site, the audio signals cannot be repeatedly edited or processed, so the excellent reality described above cannot be provided.

As full-duplex visual communication systems, TV conferencing systems have been in practical use. In a TV conferencing system, audio and video signals recorded by a microphone and a camera (hereinafter a video signal containing an audio signal is represented by an AV signal where applicable) are transmitted to a remote site via communication networks, and images and sounds of scenes are reproduced on a display unit and from a loudspeaker. Microphones, cameras, display units, and loudspeakers are prepared at respective communication sites which are interconnected by communication networks to realize full-duplex and multi-site communications. As simplex visual communication systems, there are a visual telemetry system, in which scenes at a remote site are monitored by using AV signals, and a telepresence system, in which a user has a virtual experience as if present at a remote site by looking at images and listening to sounds from the remote site. Such TV conferencing systems, visual telemetry systems, and telepresence systems are real-time visual communication systems by which present events are recorded by a TV camera and a microphone and transmitted to a destination with high fidelity. Recently, an easy-to-use system called computer supported cooperative work (CSCW) has become available in which images transmitted in real time and computer graphics generated by a computer are displayed at the same time.

FIG. 37 is a schematic diagram showing an example of a conventional multi-site, individual-type TV conferencing system.

In this multi-site TV conferencing system S51, AV signals are transmitted among TV conferencing sites (A to E) 3751 to 3755 via a communication network 3756, each site being equipped with a TV conferencing apparatus for each of participants A to E.

FIG. 38 is a schematic diagram showing the configuration of, for example, the TV conferencing apparatus at E site 3755.

The TV conferencing apparatus at E site 3755 has a camera 3862, a microphone 3869, a display unit 3801, and loudspeakers 3860 and 3861.

The camera 3862 takes an image of the participant E at the TV conferencing site E and its video signal is transmitted to the other TV conferencing sites (A to D) 3751 to 3754. The microphone 3869 records voices of the participant E and its audio signal is transmitted to the other TV conferencing sites (A to D) 3751 to 3754.

In windows 2564 to 2567 of the display unit 3801, the images of the participants A to D at the other TV conferencing sites (A to D) 3751 to 3754 are displayed. Voices of the participants A to D at the other TV conferencing sites (A to D) 3751 to 3754 are synthesized and reproduced from the loudspeakers 3860 and 3861.

With conventional TV conferencing systems and visual telemetry systems, the correspondence between audio and video signals becomes poor in some cases because a conference room, or a space in which an object to be monitored is located, does not always satisfy the sound recording conditions matching scene images. For example, consider zooming in on the image of a speaker in a TV conferencing system. In order to realize a good correspondence between audio and video signals during an image zoom-up operation, it is necessary, for example, for a microphone to move and record speech near the speaker at the same time as the camera is moved for the zoom-up operation, and for the sound recording area to coincide with the image taking area. However, in practice, it is impossible for a conventional system to move a microphone near a speaker. Therefore, even if the image of a speaker is zoomed up, the sound volume does not change and an AV signal having a poor correspondence is transmitted to the communication partner. Such an AV signal reproduced at the destination provides low reality, hindering the smooth progress of a conference. For example, if a conference always proceeds with voices heard as if from a far field, it is easy to imagine that the conference becomes unattractive and its smooth progress is difficult.

In addition to a poor correspondence between audio and video signals, there is a poor correspondence between video signals. This will be explained in the following.

FIGS. 39A and 39B are schematic diagrams explaining the states at the TV conferencing sites (E and A) 3755 and 3751 of the conventional TV conferencing system S51 while the participants E and A at those sites have a conversation.

As shown in FIG. 39A, at the TV conferencing site E 3755, the participant A is displayed in the leftside window 2564 of the display unit 3801 and the participant E looks at the window 2564. Therefore, an angle θ between a sight of the participant E and the optical axis of the camera 3862 becomes large.

As shown in FIG. 39B, at the TV conferencing site A 3751, the participant E is displayed in the rightside window 2567 of the display unit 3801 and the participant A looks at the window 2567. Therefore, an angle θ between a sight of the participant A and the optical axis of the camera 3862 becomes large.

The participants E and A therefore each feel that the partner is not looking at him or her, and the reality of a discussion in a conference room is lost.

As described above, with the conventional TV conferencing system S51, conversation partners (speakers and listeners) are not displayed clearly and distinguishably and reality cannot be produced.

JP-A-61-10381 discloses a technique of selectively transmitting only an image of a participant not speaking.

JP-A-60-203086 discloses a technique of displaying an enlarged image of a participant now speaking.

JP-A-63-77282 discloses a technique of changing the direction of a camera toward a participant now speaking.

These conventional techniques are related to application techniques of apparatuses on the speaker side. In a TV conference, reality can be obtained if conversation partners (speakers and listeners) are displayed clearly and distinguishably. None of the conventional techniques can display conversation partners clearly and distinguishably, so none can provide sufficient reality.

If the correspondence between audio and video signals is poor in a monitor operation of a visual telemetry system (e.g., if audio signals unnecessary for the video signals are reproduced), these unnecessary audio signals may cause an operator to overlook an instrument or to misjudge whether an event has occurred.

As is apparent from the description of editing sounds of a television program or a movie, editing and processing of sounds are performed in order to improve the correspondence between audio and video signals and improve reality. However, conventional real-time visual communication systems such as TV conferencing systems and visual telemetry systems do not re-record or reprocess sounds and images once they have been recorded and processed, and so cannot provide a conference with good reality or a correct and speedy monitor operation.

SUMMARY OF THE INVENTION

It is a first object of the present invention to provide an apparatus for a real-time visual communication system such as TV conferencing systems, visual telemetry systems, and telepresence systems, capable of improving a correspondence between audio and video signals and realizing AV communication with good reality.

It is a second object of the present invention to provide an excellent and easy-to-use user interface by processing audio signals contained in video signals.

It is a third object of the present invention to provide a multi-site communication method and a communication terminal capable of clearly and distinguishably displaying conversation partners (speakers and listeners) and improving reality.

In order to achieve the above objects of the invention, a video signal is analyzed and an audio signal is processed in real time in accordance with the analyzed results. An AV communication system of this invention includes means for analyzing a video signal and deriving characteristics of an image, database means for storing audio signal processing parameters corresponding to the image characteristics, and audio signal processing means for controlling an audio signal in accordance with parameters read from the database.

Specifically, according to the present invention, the apparatus for integrally controlling an audio signal and a video signal in real time, is realized by: a separator for receiving a video signal and an audio signal synchronous with the video signal and separating the received signals into the audio and video signals; a display unit for displaying the video signal; a sound output unit for outputting a sound of the audio signal; and control means for controlling the output state of one of the audio and video signals in accordance with the other of the audio and video signals.

The control means includes a video analyzer for analyzing the video signal, and a table for storing the relationship between an output from the video analyzer and the output state of the audio signal, whereby the sound output unit is controlled by an output from the table.

The control means includes an audio signal analyzer for analyzing the audio signal, and a table for storing the relationship between an output from the audio signal analyzer and the output state of the video signal, whereby the display unit is controlled by an output from the table.

The control means includes means for controlling to change the video signal to an icon and display the icon on the display unit, and a table for storing the relationship between a level of the audio signal and a display size of the icon.

The control means includes means for controlling to change the video signal to an icon and display the icon on the display unit, and a table for storing the relationship between a level of the audio signal and a display color of the icon.
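
As an illustration only, the table relating the audio signal level to the display size and display color of the icon could be sketched as follows in Python. The level thresholds, pixel sizes, color names, and the function name icon_appearance are assumptions made for this sketch; none of them appears in the specification.

```python
import bisect

# Hypothetical table: audio level boundaries in dBFS, in ascending order.
LEVEL_THRESHOLDS_DB = [-40.0, -20.0, -10.0]
# One entry per level band (len(thresholds) + 1 bands), quietest band first.
ICON_SIZES_PX = [16, 32, 48, 64]
ICON_COLORS = ["gray", "green", "yellow", "red"]


def icon_appearance(level_db):
    """Look up the icon size and color for a given audio level."""
    # bisect_right returns the index of the band the level falls into.
    band = bisect.bisect_right(LEVEL_THRESHOLDS_DB, level_db)
    return ICON_SIZES_PX[band], ICON_COLORS[band]


if __name__ == "__main__":
    for level in (-50.0, -15.0, -5.0):
        print(level, icon_appearance(level))
```

A louder participant is thus shown with a larger, more conspicuous icon, while a silent participant shrinks to a small gray one.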

In applying the invention to a multi-site TV conferencing system, the apparatus further includes a microphone and image pickup means, wherein a composite signal of the video signal and the audio signal synchronous with the video signal is received via a network interconnecting communication terminals at other sites, and the control means includes correlation analyzing means for analyzing a correlation between the audio signal supplied from the network and the audio signal obtained from the microphone, and controls the display unit, the sound output unit, and the image pickup means, in accordance with an output from the correlation analyzing means.

The control means controls an image pickup angle of the image pickup means.

The apparatus further includes a plurality of sound output units, wherein the control means controls the balance of reproduced sounds of the plurality of sound output units to orientate a sound field to a display screen area at which the video signal synchronizing the audio signal having a largest correlation is displayed.

The display unit displays the composite signal of the video signal and the audio signal synchronous with the video signal in a window, and the control means controls the balance of reproduced sounds of the plurality of sound output units and controls to display, in a different manner from an ordinary state, a window in which the video signal synchronizing the audio signal having a largest correlation is displayed.
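
The balance control described above can be sketched, under stated assumptions, for two loudspeakers. The sketch picks the window whose audio signal has the largest correlation and derives left and right gains from the horizontal position of that window. Constant-power panning is a common audio technique chosen here for illustration and is not taken from the specification; the function names and the pixel coordinates are likewise hypothetical.

```python
import math


def stereo_gains(window_center_x, screen_width):
    """Constant-power left/right gains for a window at a horizontal position."""
    # Normalize the position to 0.0 (left edge) .. 1.0 (right edge).
    pos = min(max(window_center_x / screen_width, 0.0), 1.0)
    angle = pos * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


def orient_sound_field(correlations, window_centers, screen_width):
    """Pick the site with the largest correlation and pan toward its window."""
    site = max(correlations, key=correlations.get)
    return site, stereo_gains(window_centers[site], screen_width)


if __name__ == "__main__":
    correlations = {"A": 0.9, "B": 0.2}      # assumed correlation scores
    windows = {"A": 0.0, "B": 1920.0}        # assumed window centers in pixels
    print(orient_sound_field(correlations, windows, 1920.0))
```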

A multi-site communication system having a good correspondence between video signals is realized by the following methods and apparatuses.

The invention provides a multi-site communication method for a multi-site communication system having a plurality of communication terminals at different sites interconnected by a communication network for transmitting an audio signal and a video signal between the communication terminals, wherein correlations between the audio signal generated at one communication terminal and the audio signals generated at other communication terminals are analyzed, and a conversation partner of the one communication terminal is identified from the other communication terminals in accordance with the correlation analyzed result.

The invention provides the multi-site communication method, wherein the one communication terminal includes a display unit for displaying images received from the other communication terminals at predetermined display positions and a plurality of cameras disposed near the predetermined display positions for taking the images of participants at the other communication terminals, and wherein the video signal recorded by the camera near the predetermined display position corresponding to the identified conversation partner is selected and transmitted at least to the communication terminal of the identified conversation partner.
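
The camera selection in the paragraph above can be sketched as a nearest-position lookup. This is a minimal sketch, assuming that each window and each camera is described by a single horizontal coordinate; the names select_camera, window_centers, and camera_positions and all coordinate values are hypothetical.

```python
def select_camera(partner_site, window_centers, camera_positions):
    """Return the camera closest to the window showing the identified partner."""
    target_x = window_centers[partner_site]
    return min(camera_positions, key=lambda cam: abs(camera_positions[cam] - target_x))


if __name__ == "__main__":
    # Assumed layout: four windows side by side, one camera above each.
    windows = {"A": 240, "B": 720, "C": 1200, "D": 1680}
    cameras = {"cam1": 240, "cam2": 720, "cam3": 1200, "cam4": 1680}
    print(select_camera("A", windows, cameras))
```

Because the selected camera sits near the window the local participant is looking at, the angle between the participant's line of sight and the optical axis of the camera stays small.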

The invention provides the multi-site communication method, wherein the display state of an image is controlled in accordance with the conversation partner identified result.

The invention provides a communication terminal connected to a plurality of other communication terminals at different sites via a communication network for transmitting and receiving an audio signal and a video signal to and from the plurality of other communication terminals. The communication terminal includes: correlation analyzing means for analyzing correlations between the audio signal to be transmitted to another communication terminal at another site and the audio signal received from another communication terminal at another site; and conversation partner identifying means for identifying another communication terminal of a conversation partner in accordance with the correlation analyzed result.

The invention provides the communication terminal, further including: a display unit for displaying images received from the other communication terminals at predetermined display positions; a plurality of cameras disposed near the predetermined display positions for taking the images of participants at the other communication terminals; and video signal selecting means for selecting the video signal recorded by the camera near the predetermined display position corresponding to the identified conversation partner.

The invention provides the communication terminal further including video controlling means for controlling the display state of an image in accordance with the conversation partner identified result.

The invention provides the communication terminal further including conversation partner identified result transmitting means for transmitting the conversation partner identified result to the other communication terminals via the communication network.

The invention provides the communication terminal connected via the communication network to a plurality of communication terminals including the communication terminal recited just above further including: conversation partner identified result receiving means for receiving the conversation partner identified result from the communication network; and video signal decoding control means for controlling the contents of decoding the video signal in accordance with the received identified result.

The invention provides the communication terminal recited just above further including video signal encoding control means for controlling the contents of encoding the video signal in accordance with the conversation partner identified result.

The apparatus of this invention analyzes the characteristics of an input video signal such as chrominance, frequency distribution, luminance histogram, motion quantity per unit time, and motion direction. In accordance with these analyzed characteristics, the contents of a subject image are predefined. The predefined contents and derived characteristics of video signals are stored in the database as a search key. Also stored in the database are audio signal processing parameters in correspondence with the contents and characteristics of video signals. Processing parameters suitable for an image are read from the database and supplied to the audio signal processor which changes its processing characteristics in accordance with the parameters to change the audio signal. For example, the sound field is controlled to reproduce an acoustic space suitable for an image, by changing the sound volume, right and left balance, frequency characteristics, reverberation, and the like. Audio signal processing parameters suitable for improving reality are stored in advance in the database, and parameters suitable for each image are read therefrom. It is therefore possible to always reproduce sounds matching each image, providing TV conferencing systems, visual telemetry systems, and the like which are excellent in reality. If parameters like those used by professional acoustic operators are stored in the database, the same effects of real time acoustic editing can be obtained.
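As an illustration of two of the characteristics named above (luminance histogram and motion quantity per unit time), the following is a minimal sketch only. It assumes a frame is a list of rows of 8-bit luminance values, and it takes the mean absolute frame difference as the motion measure; the function names, the bin count, and that motion measure are assumptions for illustration and are not taken from the patent.

```python
def luminance_histogram(frame, bins=8, max_level=256):
    """Count pixels per luminance bin (frame: rows of 0..max_level-1 values)."""
    hist = [0] * bins
    for row in frame:
        for y in row:
            hist[y * bins // max_level] += 1
    return hist


def motion_quantity(prev_frame, curr_frame):
    """Mean absolute luminance difference between two equal-size frames.

    Assumption: this simple frame difference stands in for the
    'motion quantity per unit time'; the patent does not fix a formula here.
    """
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev_frame, curr_frame):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    return total / count
```

Values computed this way could serve as part of the search key under which processing parameters are stored.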

The audio signal processor may be controlled in accordance with not only the video signal characteristics but also user preference. Audio signal processing parameters may be controlled through a user interface unit of a computer system.

In a multi-site communication system, video signals are used for the operations described in the following.

Since a conversation progresses with some delay between the partners, there is a large correlation between states of audio signals given by conversation partners, whereas there is a small correlation between states of audio signals given by partners not participating in the conversation. Conversation participants (a speaker and his or her partner) can be identified by analyzing these correlations.
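The idea can be sketched roughly as follows, under stated assumptions: each site's audio is reduced to an average power value per frame, and the local power signal is compared with each remote power signal shifted by a short delay. The Pearson correlation, the lag search, and all function names are assumptions chosen for illustration; the patent's own correlation detector is described with reference to FIGS. 26 to 31.

```python
import math


def average_power(samples, frame):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame]) / frame
            for i in range(0, len(samples) - frame + 1, frame)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def lagged_correlation(local, remote, max_lag):
    """Best correlation between local power and remote power delayed by 1..max_lag frames."""
    best = 0.0
    for lag in range(1, max_lag + 1):
        best = max(best, pearson(local[:-lag], remote[lag:]))
    return best


def identify_partner(local_power, remote_powers, max_lag):
    """Return the site with the highest lagged correlation, plus all scores."""
    scores = {site: lagged_correlation(local_power, power, max_lag)
              for site, power in remote_powers.items()}
    return max(scores, key=scores.get), scores
```

A site whose speech activity follows the local speaker's activity after a short delay scores high, while a site with unrelated activity scores low.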

An image of a speaker is taken by a camera positioned near the window displaying the image of a partner (listener) so that the sights of both participants coincide with each other and reality can be improved.

Reality can further be improved by displaying the images of conversation participants differently from other persons.

A conversation partner identified result is transmitted over the communication network to another communication terminal. Therefore, even at a communication terminal not participating in the conversation, conversation participants can be identified. By using the conversation partner identified result, it is possible to display the images of conversation participants differently from other persons, further improving reality.

According to the present invention, an audio signal is processed properly by analyzing the characteristics of a video signal, thereby forming an audio-video signal space excellent in reality. By adding to a video signal an audio signal that matches it, it is possible to configure an audio-video system that not only has improved reality but is also easy to use.

In a multi-site communication system to which the invention is applied, conversation participants can be identified, thereby giving the impression that the participants are holding a discussion in the same conference room.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the structure of an embodiment of the invention.

FIG. 2 is a block diagram showing a modification of the embodiment shown in FIG. 1.

FIG. 3 is a block diagram showing the structure of a video signal analyzer.

FIG. 4 is a schematic diagram explaining the operation of the video signal analyzer.

FIG. 5 is a block diagram showing the structure of an embodiment with a scene change detector according to the invention.

FIG. 6 is a block diagram showing the structure of the scene change detector.

FIG. 7 is a block diagram showing the structure of a color change detector.

FIG. 8 is a block diagram showing the structure of another video signal analyzer.

FIG. 9 is a block diagram showing the structure of an embodiment with a user interface unit.

FIG. 10 is a schematic diagram showing the outline of the system structure according to an embodiment of the invention.

FIG. 11 is a block diagram showing the structure of an embodiment with an audio signal analyzer.

FIG. 12 is a schematic diagram showing an example of a screen with icon sizes being controlled by a sound volume.

FIG. 13 is a block diagram showing the structure of an embodiment with icon sizes being controlled by a sound volume.

FIG. 14 is a schematic diagram showing an example of a screen with icon sizes being controlled by a tone of an audio signal.

FIG. 15 is a block diagram showing the structure of an embodiment with icon size being controlled by a tone of an audio signal.

FIGS. 16A and 16B show examples of icons displayed on a screen.

FIG. 17 is a block diagram showing the structure of an embodiment of the invention.

FIG. 18 shows a display screen explaining the operation of the embodiment shown in FIG. 17.

FIG. 19 shows graphs of the sound volume control characteristics relative to sound image motions.

FIG. 20 is a block diagram showing the structure of an audio signal processing digital filter.

FIG. 21 is a block diagram of a loudspeaker signal processor providing a sound image orientation.

FIG. 22 is a block diagram showing the structure of an embodiment of the invention.

FIG. 23 is a schematic diagram showing the structure of an image pickup unit.

FIG. 24 is a schematic diagram showing the layout of loudspeakers.

FIG. 25 is a schematic diagram showing the layout of windows on a screen.

FIG. 26 is a block diagram showing the structure of a correlation analyzer.

FIG. 27 is a block diagram showing the structure of a speech monitor.

FIG. 28A shows an example of an audio signal waveform, and FIG. 28B shows an average sound power signal.

FIG. 29 is a block diagram showing the structure of a correlation detector.

FIGS. 30A and 30B show average audio power signals relative to time, and FIG. 30C shows integrated values of the average audio power signals.

FIG. 31 is a diagram showing a relationship between audio signals, average audio power signals, and correlations.

FIGS. 32A and 32B are schematic diagrams showing an agreement of sights of users.

FIG. 33 is a schematic diagram explaining a sound field control.

FIG. 34 is a block diagram showing the structure of an embodiment of the invention.

FIG. 35 is a block diagram showing the structure of a video display controller.

FIGS. 36A, 36B, and 36C are schematic diagrams showing an agreement of sights of users.

FIG. 37 is a schematic diagram showing the structure of a conventional multi-site TV conferencing system.

FIG. 38 is a schematic diagram showing a conventional TV conferencing apparatus.

FIGS. 39A and 39B are schematic diagrams showing a disagreement of sights of users.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the invention will be described with reference to the accompanying drawings.

FIG. 1 is a system block diagram showing an embodiment of the present invention. The apparatus of this embodiment is constituted by an AV (audio/video) separator 1 for separating a video signal and an audio signal synchronized with the former, a video signal characteristic analyzer 2 for analyzing the characteristics of a video signal, an audio signal processor 3 for processing an audio signal, a display unit 5 for displaying images represented by a video signal, and loudspeakers 4, 4' for reproducing processed audio signals. Next, the flow of AV signals and operation of the embodiment apparatus will be described. An AV signal transmitted from a remote site is inputted to the AV separator 1 which separates it into an audio signal and a video signal. The separated audio signal is supplied to the audio signal processor 3, and the separated video signal is supplied to the video signal analyzer 2 and to the video display unit 5. The video signal analyzer 2 analyzes the characteristics of an inputted video signal, and in accordance with the analyzed characteristics, generates a control signal for controlling the audio signal processor 3. The operation of the audio signal processor 3 to be later described includes, for example, an operation of improving a clarity of voices if a transmitted image indicates a conversation between participants, an operation of providing an expansion of sounds if an image shows a broad outdoors, an operation of adding reverberation signals if an image shows a broad indoors such as a hall, and an operation of orientating sound images following a motion of an image if the image is a moving image. These effects can be realized as follows. For improving a speech clarity, a balance between higher and lower frequencies of an audio signal is adjusted by a filter. For adding reverberation, a convolution calculation for calculating desired reverberation times is performed. 
For localizing a sound image following a moving image, a sound volume balance between a plurality of loudspeakers and a balance between direct sounds (sounds directly received by a listener without reflection from a wall or the like) and reflected sounds (sounds reflected by a wall or the like and having a phase delay and frequency change) are adjusted following a motion of a sound generating object in the image.
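The volume-balance part of this sound image localization might be sketched as below. This is only an illustration: it assumes two loudspeakers, a known horizontal position of the sound generating object in the image, and a constant-power (sine/cosine) pan law. The pan law and the names `pan_gains`, `x`, and `width` are assumptions; the characteristics actually used are those shown in FIG. 19.

```python
import math


def pan_gains(x, width):
    """Left/right loudspeaker gains for an object at horizontal position x.

    x = 0 is the left edge of the image and x = width the right edge.
    Assumption: constant-power panning, so left**2 + right**2 == 1.
    """
    p = min(max(x / width, 0.0), 1.0)   # normalized position, clamped to 0..1
    theta = p * math.pi / 2
    return math.cos(theta), math.sin(theta)
```

Recomputing the gains for each frame as the object moves makes the reproduced sound follow the motion in the image.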

FIG. 2 is a block diagram showing a modification of the embodiment shown in FIG. 1. The modification shown in FIG. 2 differs from the embodiment shown in FIG. 1 in that it has a database 12 (hereinafter called an AV database) which stores combinations of the video signal characteristics and corresponding audio signal processing parameters. The AV database 12 is accessed to read the parameters matching the characteristics of a video signal analyzed by the video signal analyzer. A database is used for the control of an audio signal because there are cases in which the correspondence between the video signal characteristics and the audio signal control characteristics cannot be properly calculated. For this reason, instead of programming a characteristic control sequence for acoustic editing, a database is used which stores relationships between video signals and corresponding audio signal processing parameters determined from the empirical rules of acoustic editors. For audio signal processing that matches human perception, it is more realistic and effective to use the empirical rules of professional editors.

In correspondence with video signal characteristics, the AV database 12 stores, for example:

(1) a volume of sound to be reproduced;

(2) a balance between sounds reproduced by loudspeakers;

(3) the frequency characteristics of an audio signal to be reproduced (equalizing characteristics);

(4) the characteristics of a reverberation signal to be added to an audio signal by the audio signal processor (e.g., an impulse response used by a convolution calculation); and

(5) the amplitudes of, balance and transmission time difference between, direct sounds (sounds directly received by a listener without reflection by a wall or the like) and reflected sounds.
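One possible in-memory layout for such a database is sketched below. The record fields follow items (1) to (5) above, but the field names, the scene labels used as search keys, and every numeric value are hypothetical placeholders, not data from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioParams:
    volume_gain: float               # (1) reproduced sound volume
    balance: float                   # (2) -1.0 = left only, +1.0 = right only
    eq_coefficients: tuple           # (3) equalizing filter coefficients
    reverb_impulse_response: tuple   # (4) impulse response for convolution
    direct_gain: float               # (5) amplitude of direct sound
    reflected_gain: float            # (5) amplitude of reflected sound
    reflected_delay_samples: int     # (5) transmission time difference


# Hypothetical entries keyed by an analyzed scene label (placeholder values).
AV_DATABASE = {
    "conversation": AudioParams(1.0, 0.0, (1.0, -0.9), (1.0,), 1.0, 0.1, 40),
    "hall": AudioParams(0.8, 0.0, (1.0,), (1.0, 0.6, 0.4, 0.2), 0.7, 0.6, 400),
}

DEFAULT_PARAMS = AudioParams(1.0, 0.0, (1.0,), (1.0,), 1.0, 0.0, 0)


def lookup(scene_label):
    """Read the parameters stored for a scene, or neutral defaults if none."""
    return AV_DATABASE.get(scene_label, DEFAULT_PARAMS)
```

Keeping the parameters as data rather than code reflects the stated reason for using a database: entries can be tuned by an acoustic editor without reprogramming the control sequence.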

These stored parameters will now be described in more detail. The parameters (1) and (2) are associated with the volume of an audio signal and can be applied by adjusting the gain of an audio signal output amplifier. The parameters (1) and (2) are therefore gain data of audio signal output amplifiers. The parameters (3) to (5) are associated with a digital signal convolution calculation. For these parameters, a digital filter can be configured, for example as shown in FIG. 20, by delay elements, multipliers, and an adder. In this digital filter, an input digital audio signal is delayed by the delay elements 150 to 154, each by an integer multiple of the sampling time, and each delayed output is supplied to the corresponding multiplier 155 to 161. Each multiplier 155 to 161 multiplies its input audio signal data by a preset coefficient (coefficients l to m) and outputs the result to the adder 162. The adder 162 adds all the outputs from the multipliers 155 to 161 to produce the final output of the filter shown in FIG. 20. With the digital filter of the embodiment shown in FIG. 20, the equalizing and reverberation characteristics of an audio signal can be adjusted by changing the coefficients of the multipliers. For example, if the clarity of speech is to be improved by using the digital filter, the filter coefficients are set to give low-frequency cut-off characteristics so as to stop the low frequency components that lower speech clarity. If reverberation signals are to be added, the filter coefficients for a low-pass filter are set so as to prolong the duration of the impulse response.
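The delay/multiply/add structure described above is that of a finite impulse response (FIR) filter. The following minimal sketch shows the general technique; it is not a transcription of FIG. 20, and the function name, the tap count, and the assumption that the first tap uses the undelayed input are all illustrative.

```python
def fir_filter(samples, coefficients):
    """Filter samples with a tapped delay line, one multiplier per tap.

    Output: y[n] = sum over k of coefficients[k] * x[n - k].
    """
    delay_line = [0.0] * len(coefficients)
    output = []
    for x in samples:
        # Shift the delay line by one sampling period and insert the new sample.
        delay_line = [x] + delay_line[:-1]
        # Multiply each tap by its coefficient and add all products.
        output.append(sum(c * d for c, d in zip(coefficients, delay_line)))
    return output
```

Feeding a unit impulse returns the coefficients themselves, which is why changing the coefficients changes the equalizing and reverberation behavior. For instance, the two-tap coefficients `[1.0, -1.0]` remove a constant (zero-frequency) component entirely, while a long, slowly decaying coefficient sequence produces a reverberation tail.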

FIG. 21 shows an embodiment of a circuit for processing direct and indirect sounds and determining a sound image orientation. This circuit includes right and left signal processors 190 and 191 for performing a stereophonic process for a single series of an audio signal, and a signal distributor 170. The right and left signal processors 190 and 191 have the same structure. Therefore, the structure and operation of only the right signal processor 190 will be described by way of example. The signal processor has digital filters 171 to 173, gain controllers 174 to 176, and an adder 177. Use of a plurality of digital filters makes it possible to generate a direct sound and indirect sounds and to adjust the frequency characteristics of an output signal and a mixing ratio of a direct sound to indirect sounds. By stereophonically reproducing right and left channel signals generated in this manner, it becomes possible to generate an audio signal excellent in sound image localization. For the control of sound image localization, digital filter coefficients and a mixing ratio (particularly, gain values) of a direct sound to indirect sounds are stored in advance in the database.
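A rough sketch of this arrangement follows, assuming each channel processor is a set of parallel branches, each consisting of one digital filter followed by one gain controller, whose outputs are summed by an adder. The names `fir`, `channel_processor`, and `stereo_process`, and the representation of a branch as a (coefficients, gain) pair, are assumptions made for illustration.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum of coeffs[k] * x[n - k]."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]


def channel_processor(samples, branches):
    """Sum the gain-scaled outputs of several filter branches.

    branches: list of (filter_coefficients, gain) pairs. A branch with
    coefficients (1.0,) passes the direct sound; a branch whose leading
    coefficients are zero acts as a delayed, indirect sound.
    """
    out = [0.0] * len(samples)
    for coeffs, gain in branches:
        for n, y in enumerate(fir(samples, coeffs)):
            out[n] += gain * y
    return out


def stereo_process(mono, left_branches, right_branches):
    """Distribute one audio signal to the left and right channel processors."""
    return (channel_processor(mono, left_branches),
            channel_processor(mono, right_branches))
```

Giving the two channels different direct gains and different reflection delays shifts the perceived position of the sound image, which is the effect the stored coefficients and gain values are meant to control.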

Next, an example of using stored audio signal processing data will be briefly described. Consider for example that a video signal transmitted from a partner communication site contains a human image and the image of its mouth is changing. In such a case, the video signal analyzer 11 shown in FIG. 2 judges that the person in the image is speaking. The audio signal transmitted with the video signal represents sounds spoken by the person. Therefore, in order to improve a clarity of the audio signal, coefficients of the digital filters suitable for suppressing low frequency components are read from the database and the audio signal is inputted to the digital filters. The upper limit of a high frequency range of a human voice is about 7 kHz. Therefore, the filter coefficients are also adjusted to cut frequencies of 7 kHz or higher to elimin