Claims
What is claimed is:
1. An apparatus for integrally controlling in real time an audio signal and
a video signal transmitted in real time, comprising:
a separator for receiving a video signal and an audio signal synchronous
with said video signal and separating the received signals into said audio
and video signals;
a display unit for displaying said video signal;
a sound output unit for outputting a sound of said audio signal; and
a control means for processing and controlling the output state of said
audio signal in accordance with said video signal, said processing and
controlling being conducted before said outputting of said audio signal.
2. An apparatus according to claim 1, wherein said control means includes a
video analyzer for analyzing said video signal, and a table for storing a
relationship between an output from said video analyzer and the output
state of said audio signal, whereby said sound output unit is controlled
by an output from said table.
3. An apparatus according to claim 1, wherein said control means includes a
video analyzer for analyzing said video signal and detecting a
discontinuity of said video signal, and a table for storing a relationship
between an output from said video analyzer and the output state of said
audio signal, whereby said sound output unit is controlled by an output
from said table.
4. An apparatus according to claim 1, wherein said control means includes
an audio signal analyzer for analyzing said audio signal, and a table for
storing a relationship between an output from said audio signal analyzer
and the output state of said video signal, whereby said display unit is
controlled by an output from said table.
5. An apparatus according to claim 1, wherein said control means includes a
means for controlling to change said video signal to an icon and display
said icon on said display unit, and a table for storing a relationship
between a level of said audio signal and a display size of said icon.
6. An apparatus according to claim 1, wherein said control means includes a
means for controlling to change said video signal to an icon and display
said icon on said display unit, and a table for storing a relationship
between a level of said audio signal and a display color of said icon.
7. An apparatus for integrally controlling an audio signal and a video
signal in real time, comprising:
a separator for receiving a video signal and an audio signal synchronous
with said video signal and separating the received signals into said audio
and video signals;
a display unit for displaying said video signal;
a sound output unit for outputting a sound of said audio signal;
a control means for controlling the output state of one of said audio and
video signals in accordance with the other of said audio and video
signals;
a microphone; and
an image pickup means;
wherein a composite signal of said video signal and said audio signal
synchronous with said video signal is received via a network
interconnecting communication terminals at other sites, and said control
means includes a correlation analyzing means for analyzing a correlation
between said audio signal supplied from said network and said audio signal
obtained from said microphone, and controls said display unit, said sound
output unit, and said image pickup means, in accordance with an output
from said correlation analyzing means.
8. An apparatus according to claim 7, wherein said control means controls
an image pickup angle of said image pickup means.
9. An apparatus according to claim 7, further comprising a plurality of
sound output units, wherein said control means controls a balance of
reproduced sounds of the plurality of sound output units to orientate a
sound field to a display screen area at which said video signal
synchronizing said audio signal having a largest correlation is displayed.
10. An apparatus according to claim 9, wherein said display unit displays
said composite signal of said video signal and said audio signal
synchronous with said video signal in a window, and said control means
controls the balance of reproduced sounds of the plurality of sound output
units and controls to display, in a different manner from an ordinary
state, the window in which said video signal synchronizing said audio
signal having a largest correlation is displayed.
11. A multi-site communication method for a multi-site communication system
having a plurality of communication terminals at different sites
interconnected by a communication network for transmitting an audio signal
and a video signal between the communication terminals, wherein
correlations between an audio signal generated at one communication
terminal and audio signals generated at other communication terminals are
analyzed, and a conversation partner of said one communication terminal is
identified from said other communication terminals in accordance with a
result of an analyzed correlation.
12. A multi-site communication method according to claim 11, wherein said
one communication terminal includes a display unit for displaying images
received from said other communication terminals at predetermined display
positions and a plurality of cameras disposed near said predetermined
display positions for taking images of participants at said other
communication terminals, and wherein said video signal recorded by a
camera near a predetermined display position corresponding to said
identified conversation partner is selected and transmitted at least to
said communication terminal of said identified conversation partner.
13. A multi-site communication method according to claim 12, wherein a
display state of an image is controlled in accordance with identification
of said conversation partner.
14. A multi-site communication method according to claim 13, wherein
contents of image decoding are controlled in accordance with
identification of said conversation partner.
15. A multi-site communication method according to claim 13, wherein
contents of image encoding are controlled in accordance with
identification of said conversation partner.
16. A multi-site communication method according to claim 14, wherein the
image decoding includes a hierarchical decoding scheme, and a hierarchy
thereof is changed in accordance with said identification of conversation
partner.
17. A multi-site communication method according to claim 11, wherein a
reproduction state of audio signal sounds is controlled in accordance with
identification of said conversation partner.
18. A communication terminal connected to a plurality of other
communication terminals at different sites via a communication network for
transmitting and receiving an audio signal and a video signal to and from
the plurality of other communication terminals, comprising:
a correlation analyzing means for analyzing correlation between said audio
signal to be transmitted to another communication terminal at another site
and audio signals received from said other communication terminals at the
different sites; and
a conversation partner identifying means for identifying the another
communication terminal of a conversation partner in accordance with an
output of said correlation analyzing means.
19. A communication terminal according to claim 18, further comprising:
a display unit for displaying images received from said other communication
terminals at predetermined display positions;
a plurality of cameras disposed near said predetermined display positions
for taking the images of participants at said other communication
terminals; and
a video signal selecting means for selecting said video signal recorded by
a camera near a predetermined display position corresponding to an
identified conversation partner.
20. A communication terminal according to claim 18, further comprising a
video controlling means for controlling a display state of an image in
accordance with identification of said conversation partner.
21. A communication terminal according to claim 18, further comprising a
conversation partner identified result transmitting means for transmitting
identification of said conversation partner to said other communication
terminals via said communication network.
22. A communication terminal connected via said communication network to a
plurality of communication terminals including the communication terminal
recited in claim 21, further comprising:
a conversation partner identified result receiving means for receiving the
identification of said conversation partner from said communication
network; and
a video signal decoding control means for controlling contents of decoding
said video signal in accordance with the identification received by said
result receiving means.
23. A communication terminal according to claim 18, further comprising a
video signal encoding control means for controlling contents of encoding
said video signal in accordance with identification of said conversation
partner.
24. A communication terminal according to claim 22, wherein said video
signal decoding control means changes a hierarchy of a hierarchical
decoding scheme in accordance with the identification of said conversation
partner.
25. A communication terminal according to claim 18, further comprising a
sound controlling means for controlling a sound reproduction in accordance
with identification of said conversation partner.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an apparatus for integrally controlling
audio and video signals for systems such as TV conferencing systems and
visual telemetry systems in which audio and video signals transmitted from
a spatially remote site are used to reproduce scenes rich in reality. More
particularly, the invention relates to an apparatus for integrally
controlling audio and video signals by analyzing received video signals
and controlling audio signal processing parameters in accordance with the
analyzed results.
2. Description of the Related Art
Movies and television, which have long been in practical use, are known
video systems for transmitting audio and video signals from a spatially
remote site. Techniques of movies and television are well known and
the details thereof are omitted. Only the effects of a combination of
audio and video signals are given herein. Basic sound signals for a movie
or a television are recorded simultaneously when a scene is taken. After
scenes are taken, the basic sound signals are repetitively edited and
processed while looking at the scenes to generate audio signals matching
the scenes. Editing and processing include an addition of effect sounds
and new sounds after recording and an adjustment of quality and volume of
recorded sounds. An object of editing is to improve reality. It is well
known that reality improves if high quality audio signals matching the
contents of scenes are used. For example, a movie with a surround
stereophonic sound system, in which sound images follow the motion of
scene images, provides better reality than a movie with a monophonic
sound system.
When audio and video signals of a movie or a television program are
transmitted in real time from a spatially remote site, however, the audio
signals cannot be repetitively edited or processed, so the excellent
reality described above cannot be provided.
As full-duplex visual communication systems, TV conferencing systems have
been in practical use. In a TV conferencing system, audio and video
signals recorded by a microphone and a camera (hereinafter a video signal
containing an audio signal is represented by an AV signal where
applicable) are transmitted to a remote site via communication networks,
and images and sounds of scenes are reproduced on a display unit and from
a loudspeaker. Microphones, cameras, display units, and loudspeakers are
prepared at respective communication sites which are interconnected by
communication networks to realize full-duplex and multi-site
communications. Simplex visual communication systems include a
visual telemetry system in which scenes at a remote site are monitored by
using AV signals and a telepresence system in which a user has a virtual
experience as if present at a remote site by looking at images and
listening to sounds at the remote site. Such TV conferencing systems,
visual telemetry systems, and telepresence systems are real-time visual
communication systems by which present events are recorded by a TV camera
and a microphone and transmitted to a destination with high fidelity.
Recently, easy-to-use systems for computer supported cooperative work
(CSCW) have become available in which images transmitted in real time
and computer graphics generated by a computer are displayed at the same
time.
FIG. 37 is a schematic diagram showing an example of a conventional
multi-site, individual-type TV conferencing system.
In this multi-site TV conferencing system S51, AV signals are transmitted
among TV conferencing sites (A to E) 3751 to 3755 via a communication
network 3756, each site being equipped with a TV conferencing apparatus
for each of participants A to E.
FIG. 38 is a schematic diagram showing the configuration of, for example,
the TV conferencing apparatus at E site 3755.
The TV conferencing apparatus at E site 3755 has a camera 3862, a
microphone 3869, a display unit 3801, and loudspeakers 3860 and 3861.
The camera 3862 takes an image of the participant E at the TV conferencing
site E and its video signal is transmitted to the other TV conferencing
sites (A to D) 3751 to 3754. The microphone 3869 records voices of the
participant E and its audio signal is transmitted to the other TV
conferencing sites (A to D) 3751 to 3754.
In windows 2564 to 2567 of the display unit 3801, the images of the
participants A to D at the other TV conferencing sites (A to D) 3751 to
3754 are displayed. Voices of the participants A to D at the other TV
conferencing sites (A to D) 3751 to 3754 are synthesized and reproduced
from the loudspeakers 3860 and 3861.
With conventional TV conferencing systems and visual telemetry systems, a
correspondence between audio and video signals becomes poor in some cases
because a conference room or a space containing an object to be monitored
does not always satisfy the sound recording conditions matching scene
images. For example, consider zoom-up of the image of a speaker at a TV
conferencing system. In order to realize a good correspondence between
audio and video signals during an image zoom-up operation, it is
necessary, for example, for a microphone to move and record speech near
the speaker at the same time as a camera is moved for the zoom-up
operation, and for a sound recording area to coincide with an image taking
area. However, in practice, it is impossible for a conventional system to
move a microphone near to a speaker. Therefore, even if the image of a
speaker is zoomed up, the sound volume does not change and the AV signal
having a poor correspondence is transmitted to a communication partner.
Such an AV signal, when reproduced at the destination, provides low reality
and hinders the smooth progress of a conference. For example, if a
conference always proceeds with voices from a far field, the conference is
unlikely to be engaging and its smooth progress is difficult.
In addition to a poor correspondence between audio and video signals, there
is a poor correspondence between video signals. This will be explained in
the following.
FIGS. 39A and 39B are schematic diagrams explaining the states at the TV
conferencing sites (E and A) 3755 and 3751 of the conventional TV
conferencing system S51 in which participants E and A at those sites have
a conversation.
As shown in FIG. 39A, at the TV conferencing site E 3755, the participant A
is displayed in the left-side window 2564 of the display unit 3801 and the
participant E looks at the window 2564. Therefore, an angle θ
between the line of sight of the participant E and the optical axis of the
camera 3862 becomes large.
As shown in FIG. 39B, at the TV conferencing site A 3751, the participant E
is displayed in the right-side window 2567 of the display unit 3801 and the
participant A looks at the window 2567. Therefore, an angle θ
between the line of sight of the participant A and the optical axis of the
camera 3862 becomes large.
The participants E and A therefore each feel that the partner is not
looking at him or her, and the reality of a discussion held in the same
conference room is lost.
As described above, with the conventional TV conferencing system S51,
conversation partners (speakers and listeners) are not displayed clearly
and distinguishably and reality cannot be produced.
JP-A-61-10381 discloses a technique of selectively transmitting only an
image of a participant not speaking.
JP-A-60-203086 discloses a technique of displaying an enlarged image of a
participant now speaking.
JP-A-63-77282 discloses a technique of changing the direction of a camera
toward a participant now speaking.
These conventional techniques are related to application techniques of
apparatuses on the speaker side. In a TV conference, reality can be
obtained if conversation partners (speakers and listeners) are displayed
clearly and distinguishably. None of the conventional techniques can
display conversation partners clearly and distinguishably, and none
provides sufficient reality.
If a correspondence between audio and video signals is poor in a monitor
operation of a visual telemetry system (e.g., if audio signals unnecessary
for video signals are reproduced), these unnecessary audio signals may
cause an instrument to be overlooked or an erroneous decision that an event
has occurred.
As apparent from the description of editing sounds of a television or a
movie, editing and processing of sounds are performed in order to improve
the correspondence between audio and video signals and improve reality.
However, conventional real-time visual communication systems such as TV
conferencing systems and visual telemetry systems do not re-record or
re-process sounds and images once they have been recorded and processed,
and are therefore unable to provide a conference with good reality or a
correct and speedy monitor operation.
SUMMARY OF THE INVENTION
It is a first object of the present invention to provide an apparatus for a
real-time visual communication system such as TV conferencing systems,
visual telemetry systems, and telepresence systems, capable of improving a
correspondence between audio and video signals and realizing AV
communication with good reality.
It is a second object of the present invention to provide an excellent and
easy-to-use user interface by processing audio signals contained in video
signals.
It is a third object of the present invention to provide a multi-site
communication method and a communication terminal capable of clearly and
distinguishably displaying conversation partners (speakers and listeners)
and improving reality.
In order to achieve the above objects of the invention, a video signal is
analyzed and an audio signal is processed in real time in accordance with
the analyzed results. An AV communication system of this invention
includes means for analyzing a video signal and deriving characteristics
of an image, database means for storing audio signal processing parameters
corresponding to the image characteristics, and audio signal processing
means for controlling an audio signal in accordance with parameters read
from the database.
Specifically, according to the present invention, the apparatus for
integrally controlling an audio signal and a video signal in real time, is
realized by: a separator for receiving a video signal and an audio signal
synchronous with the video signal and separating the received signals into
the audio and video signals; a display unit for displaying the video
signal; a sound output unit for outputting a sound of the audio signal;
and control means for controlling the output state of one of the audio and
video signals in accordance with the other of the audio and video signals.
The control means includes a video analyzer for analyzing the video signal,
and a table for storing the relationship between an output from the video
analyzer and the output state of the audio signal, whereby the sound
output unit is controlled by an output from the table.
The control means includes an audio signal analyzer for analyzing the audio
signal, and a table for storing the relationship between an output from
the audio signal analyzer and the output state of the video signal,
whereby the display unit is controlled by an output from the table.
The control means includes means for controlling to change the video signal
to an icon and display the icon on the display unit, and a table for
storing the relationship between a level of the audio signal and a display
size of the icon.
The control means includes means for controlling to change the video signal
to an icon and display the icon on the display unit, and a table for
storing the relationship between a level of the audio signal and a display
color of the icon.
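As a minimal illustrative sketch of such tables, assuming hypothetical
level thresholds, icon sizes, and colors that are not given in this
specification, the relationship between the level of the audio signal and
the display state of the icon might be held as follows. The names
LEVEL_BOUNDS_DB, ICON_SIZES_PX, and ICON_COLORS are assumptions introduced
only for illustration.

```python
import bisect

# Hypothetical table (assumed values, not taken from the specification).
# LEVEL_BOUNDS_DB holds the boundaries between audio level ranges in dB;
# each range maps to one icon size and one icon color.
LEVEL_BOUNDS_DB = [-40.0, -25.0, -10.0]
ICON_SIZES_PX = [16, 32, 48, 64]                 # one entry per level range
ICON_COLORS = ["gray", "green", "yellow", "red"]  # one entry per level range


def icon_size_for_level(level_db):
    """Look up the icon display size for a given audio signal level."""
    return ICON_SIZES_PX[bisect.bisect_right(LEVEL_BOUNDS_DB, level_db)]


def icon_color_for_level(level_db):
    """Look up the icon display color for a given audio signal level."""
    return ICON_COLORS[bisect.bisect_right(LEVEL_BOUNDS_DB, level_db)]
```

In this sketch a louder audio signal selects a later table entry, so the
icon grows and changes color as the level rises.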
In applying the invention to a multi-site TV conferencing system, the
apparatus further includes a microphone and image pickup means, wherein a
composite signal of the video signal and the audio signal synchronous with
the video signal is received via a network interconnecting communication
terminals at other sites, and the control means includes correlation
analyzing means for analyzing a correlation between the audio signal
supplied from the network and the audio signal obtained from the
microphone, and controls the display unit, the sound output unit, and the
image pickup means, in accordance with an output from the correlation
analyzing means.
The control means controls an image pickup angle of the image pickup means.
The apparatus further includes a plurality of sound output units, wherein
the control means controls the balance of reproduced sounds of the
plurality of sound output units to orientate a sound field to a display
screen area at which the video signal synchronizing the audio signal
having a largest correlation is displayed.
The display unit displays the composite signal of the video signal and the
audio signal synchronous with the video signal in a window, and the
control means controls the balance of reproduced sounds of the plurality
of sound output units and controls to display, in a different manner from
an ordinary state, a window in which the video signal synchronizing the
audio signal having a largest correlation is displayed.
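The balance control summarized above may be pictured with the following
sketch. It assumes two loudspeakers at the left and right of the display
and uses a constant-power panning law, which is a common technique chosen
here for illustration and is not stated in this specification. The
function names and the per-window correlation dictionary are hypothetical.

```python
import math


def select_window(correlations):
    """Return the window whose audio signal has the largest correlation.

    correlations maps a window identifier to a correlation value.
    """
    return max(correlations, key=correlations.get)


def balance_gains(window_center_x, screen_width):
    """Return (left_gain, right_gain) orienting the sound toward a window.

    Constant-power panning (an assumed law): the squared gains always
    sum to 1, so the perceived loudness stays roughly constant.
    """
    position = min(max(window_center_x / screen_width, 0.0), 1.0)
    angle = position * math.pi / 2.0
    return math.cos(angle), math.sin(angle)
```

For example, a window centered at the left edge of the screen yields gains
of approximately (1, 0), and a window at the center yields equal gains for
both loudspeakers.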
A multi-site communication system having a good correspondence between
video signals is realized by the following methods and apparatuses.
The invention provides a multi-site communication method for a multi-site
communication system having a plurality of communication terminals at
different sites interconnected by a communication network for transmitting
an audio signal and a video signal between the communication terminals,
wherein correlations between the audio signal generated at one
communication terminal and the audio signals generated at other
communication terminals are analyzed, and a conversation partner of the
one communication terminal is identified from the other communication
terminals in accordance with the correlation analyzed result.
The invention provides the multi-site communication method, wherein the one
communication terminal includes a display unit for displaying images
received from the other communication terminals at predetermined display
positions and a plurality of cameras disposed near the predetermined
display positions for taking the images of participants at the other
communication terminals, and wherein the video signal recorded by the
camera near the predetermined display position corresponding to the
identified conversation partner is selected and transmitted at least to
the communication terminal of the identified conversation partner.
The invention provides the multi-site communication method, wherein the
display state of an image is controlled in accordance with the
conversation partner identified result.
The invention provides a communication terminal connected to a plurality of
other communication terminals at different sites via a communication
network for transmitting and receiving an audio signal and a video signal
to and from the plurality of other communication terminals. The
communication terminal includes: correlation analyzing means for analyzing
correlations between the audio signal to be transmitted to another
communication terminal at another site and the audio signal received from
another communication terminal at another site; and conversation partner
identifying means for identifying another communication terminal of a
conversation partner in accordance with the correlation analyzed result.
The invention provides the communication terminal, further including: a
display unit for displaying images received from the other communication
terminals at predetermined display positions; a plurality of cameras
disposed near the predetermined display positions for taking the images
of participants at the other communication terminals; and video signal
selecting means for selecting the video signal recorded by the camera near
the predetermined display position corresponding to the identified
conversation partner.
The invention provides the communication terminal further including video
controlling means for controlling the display state of an image in
accordance with the conversation partner identified result.
The invention provides the communication terminal further including
conversation partner identified result transmitting means for transmitting
the conversation partner identified result to the other communication
terminals via the communication network.
The invention provides the communication terminal connected via the
communication network to a plurality of communication terminals including
the communication terminal recited just above further including:
conversation partner identified result receiving means for receiving the
conversation partner identified result from the communication network; and
video signal decoding control means for controlling the contents of
decoding the video signal in accordance with the received identified
result.
The invention provides the communication terminal recited just above
further including video signal encoding control means for controlling the
contents of encoding the video signal in accordance with the conversation
partner identified result.
The apparatus of this invention analyzes the characteristics of an input
video signal such as chrominance, frequency distribution, luminance
histogram, motion quantity per unit time, and motion direction. In
accordance with these analyzed characteristics, the contents of a subject
image are predefined. The predefined contents and derived characteristics
of video signals are stored in the database as a search key. Also stored
in the database are audio signal processing parameters in correspondence
with the contents and characteristics of video signals. Processing
parameters suitable for an image are read from the database and supplied
to the audio signal processor which changes its processing characteristics
in accordance with the parameters to change the audio signal. For example,
the sound field is controlled to reproduce an acoustic space suitable for
an image, by changing the sound volume, right and left balance, frequency
characteristics, reverberation, and the like. Audio signal processing
parameters suitable for improving reality are stored in advance in the
database, and parameters suitable for each image are read therefrom. It is
therefore possible to always reproduce sounds matching each image,
providing TV conferencing systems, visual telemetry systems, and the like
which are excellent in reality. If parameters like those used by
professional acoustic operators are stored in the database, the same
effects of real time acoustic editing can be obtained.
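The database lookup described above can be sketched as follows. This is
only an illustration under stated assumptions: the search keys ("static",
"moving_left", "moving_right"), the threshold, and all parameter values
are hypothetical, and only the motion quantity and motion direction are
used out of the characteristics listed above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioParams:
    volume_db: float   # gain applied to the audio signal
    balance: float     # -1.0 (full left) to +1.0 (full right)
    reverb_s: float    # reverberation time in seconds


# Hypothetical database: image contents (search key) -> audio parameters.
PARAMETER_DB = {
    "static": AudioParams(volume_db=0.0, balance=0.0, reverb_s=0.5),
    "moving_left": AudioParams(volume_db=0.0, balance=-0.6, reverb_s=0.5),
    "moving_right": AudioParams(volume_db=0.0, balance=0.6, reverb_s=0.5),
}
MOTION_THRESHOLD = 1.0  # assumed motion quantity per unit time


def contents_key(motion_quantity, motion_direction_deg):
    """Derive a search key from analyzed video characteristics.

    The direction is measured counterclockwise from the rightward axis,
    so a positive cosine means the subject moves to the right.
    """
    if motion_quantity < MOTION_THRESHOLD:
        return "static"
    if math.cos(math.radians(motion_direction_deg)) > 0.0:
        return "moving_right"
    return "moving_left"


def params_for_frame(motion_quantity, motion_direction_deg):
    """Read the audio processing parameters matching the current image."""
    return PARAMETER_DB[contents_key(motion_quantity, motion_direction_deg)]
```

The parameters returned by params_for_frame would then be supplied to the
audio signal processor, for instance to shift the right and left balance
so that the sound image follows the motion in the picture.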
The audio signal processor may be controlled in accordance with not only
the video signal characteristics but also user preference. Audio signal
processing parameters may be controlled through a user interface unit of a
computer system.
In a multi-site communication system, video signals are used for the
operations described in the following.
Since a conversation progresses with some delay between each partner, there
is a large correlation between states of audio signals given by
conversation partners, whereas there is a small correlation between states
of audio signals given by partners not participating in the conversation.
Conversation participants (a speaker and the speaker's partner) can be
identified by
analyzing correlations.
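The idea of identifying a conversation partner from delayed correlations
can be illustrated by the following sketch. It is not the correlation
detector of the embodiments; it assumes that each audio signal has already
been reduced to a per-frame speech activity sequence (1 while speaking,
0 while silent) of equal length, and the function names, threshold, and
lag range are hypothetical.

```python
def average_power(samples, frame_len):
    """Average sound power of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_activity(power, threshold):
    """Convert average power into a 0/1 speech activity sequence."""
    return [1 if p > threshold else 0 for p in power]


def lagged_correlation(a, b, max_lag):
    """Largest normalized overlap of a and b when one is delayed.

    Conversation partners tend to speak in turn, so one party's activity
    matches the other's activity shifted by some delay. max_lag must be
    smaller than the sequence length.
    """
    n = len(a)
    best = 0.0
    for lag in range(1, max_lag + 1):
        a_then_b = sum(a[i] * b[i + lag] for i in range(n - lag))
        b_then_a = sum(b[i] * a[i + lag] for i in range(n - lag))
        best = max(best, a_then_b / (n - lag), b_then_a / (n - lag))
    return best


def identify_partner(local, remotes, max_lag):
    """Return the remote site whose activity correlates best with ours."""
    return max(
        remotes,
        key=lambda site: lagged_correlation(local, remotes[site], max_lag),
    )
```

With a local activity sequence of [1, 1, 0, 0, 1, 1, 0, 0], a site that
answers in the silent frames, [0, 0, 1, 1, 0, 0, 1, 1], scores higher than
a site that stays silent, and is therefore selected as the partner.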
An image of a speaker is taken by a camera positioned near the window
displaying the image of a partner (listener) so that the sights of both
participants coincide with each other and reality can be improved.
Reality can further be improved by displaying the images of conversation
participants differently from other persons.
A conversation partner identified result is transmitted over the
communication network to another communication terminal. Therefore, even
at a communication terminal not participating in the conversation,
conversation participants can be identified. By using the conversation
partner identified result, it is possible to display the images of
conversation participants differently from other persons, further
improving reality.
According to the present invention, an audio signal is processed properly
by analyzing the characteristics of a video signal, thereby forming an
audio-video signal space excellent in reality. By adding an audio signal
matching a video signal to the latter, it is possible to configure an
audio-video system not only having improved reality but also being
easy-to-use.
In a multi-site communication system to which the invention is applied,
conversation participants can be identified, thereby providing reality as
if the participants were holding a discussion in the same conference room.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the structure of an embodiment of the
invention.
FIG. 2 is a block diagram showing a modification of the embodiment shown in
FIG. 1.
FIG. 3 is a block diagram showing the structure of a video signal analyzer.
FIG. 4 is a schematic diagram explaining the operation of the video signal
analyzer.
FIG. 5 is a block diagram showing the structure of an embodiment with a
scene change detector according to the invention.
FIG. 6 is a block diagram showing the structure of the scene change
detector.
FIG. 7 is a block diagram showing the structure of a color change detector.
FIG. 8 is a block diagram showing the structure of another video signal
analyzer.
FIG. 9 is a block diagram showing the structure of an embodiment with a
user interface unit.
FIG. 10 is a schematic diagram showing the outline of the system structure
according to an embodiment of the invention.
FIG. 11 is a block diagram showing the structure of an embodiment with an
audio signal analyzer.
FIG. 12 is a schematic diagram showing an example of a screen with icon
sizes being controlled by a sound volume.
FIG. 13 is a block diagram showing the structure of an embodiment with icon
sizes being controlled by a sound volume.
FIG. 14 is a schematic diagram showing an example of a screen with icon
sizes being controlled by a tone of an audio signal.
FIG. 15 is a block diagram showing the structure of an embodiment with icon
size being controlled by a tone of an audio signal.
FIGS. 16A and 16B show examples of icons displayed on a screen.
FIG. 17 is a block diagram showing the structure of an embodiment of the
invention.
FIG. 18 shows a display screen explaining the operation of the embodiment
shown in FIG. 17.
FIG. 19 shows graphs of the sound volume control characteristics
relative to sound image motions.
FIG. 20 is a block diagram showing the structure of an audio signal
processing digital filter.
FIG. 21 is a block diagram of a loudspeaker signal processor providing a
sound image orientation.
FIG. 22 is a block diagram showing the structure of an embodiment of the
invention.
FIG. 23 is a schematic diagram showing the structure of an image pickup
unit.
FIG. 24 is a schematic diagram showing the layout of loudspeakers.
FIG. 25 is a schematic diagram showing the layout of windows on a screen.
FIG. 26 is a block diagram showing the structure of a correlation analyzer.
FIG. 27 is a block diagram showing the structure of a speech monitor.
FIG. 28A shows an example of an audio signal waveform, and FIG. 28B shows
an average sound power signal.
FIG. 29 is a block diagram showing the structure of a correlation detector.
FIGS. 30A and 30B show average audio power signals relative to time, and
FIG. 30C shows integrated values of the average audio power signals.
FIG. 31 is a diagram showing a relationship between audio signals, average
audio power signals, and correlations.
FIGS. 32A and 32B are schematic diagrams showing an agreement of sights of
users.
FIG. 33 is a schematic diagram explaining a sound field control.
FIG. 34 is a block diagram showing the structure of an embodiment of the
invention.
FIG. 35 is a block diagram showing the structure of a video display
controller.
FIGS. 36A, 36B, and 36C are schematic diagrams showing an agreement of
sights of users.
FIG. 37 is a schematic diagram showing the structure of a conventional
multi-site TV conferencing system.
FIG. 38 is a schematic diagram showing a conventional TV conferencing
apparatus.
FIGS. 39A and 39B are schematic diagrams showing a disagreement of sights
of users.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Embodiments of the invention will be described with reference to the
accompanying drawings.
FIG. 1 is a system block diagram showing an embodiment of the present
invention. The apparatus of this embodiment is constituted by an AV
(audio/video) separator 1 for separating a video signal and an audio
signal synchronized with the former, a video signal characteristic
analyzer 2 for analyzing the characteristics of a video signal, an audio
signal processor 3 for processing an audio signal, a display unit 5 for
displaying images represented by a video signal, and loudspeakers 4, 4'
for reproducing processed audio signals. Next, the flow of AV signals and
operation of the embodiment apparatus will be described. An AV signal
transmitted from a remote site is inputted to the AV separator 1 which
separates it into an audio signal and a video signal. The separated audio
signal is supplied to the audio signal processor 3, and the separated
video signal is supplied to the video signal analyzer 2 and to the video
display unit 5. The video signal analyzer 2 analyzes the characteristics
of an inputted video signal, and in accordance with the analyzed
characteristics, generates a control signal for controlling the audio
signal processor 3. The operations of the audio signal processor 3,
described later, include, for example, improving the clarity of voices if
a transmitted image shows a conversation between participants, adding a
sense of spaciousness to sounds if an image shows a wide outdoor scene,
adding reverberation signals if an image shows a large indoor space such
as a hall, and localizing sound images to follow the motion of an image if
the image is a moving image. These effects can be realized as follows. To
improve speech clarity, the balance between higher and lower frequencies
of an audio signal is adjusted by a filter. To add reverberation, a
convolution calculation that yields the desired reverberation time is
performed. To localize a sound image following a moving image, the sound
volume balance between a plurality of loudspeakers and the balance between
direct sounds (sounds directly received by a listener without reflection
from a wall or the like) and reflected sounds (sounds reflected by a wall
or the like and having a phase delay and frequency change) are adjusted
following the motion of a sound generating object in the image.
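As an illustration only, the adjustment of the sound volume balance
between two loudspeakers following the horizontal position of an object
might be sketched as below. The constant-power pan law, the function
names pan_gains and apply_pan, and the normalized position x are
assumptions made for this sketch and are not taken from the embodiment.

```python
import math
from typing import List, Sequence, Tuple


def pan_gains(x: float) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for a normalized position x.

    x = 0.0 is the left edge of the image and x = 1.0 the right edge
    (hypothetical convention). A constant-power pan law is assumed.
    """
    x = min(max(x, 0.0), 1.0)  # clamp to the image width
    angle = x * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


def apply_pan(samples: Sequence[float], x: float) -> Tuple[List[float], List[float]]:
    """Scale one audio signal into left and right loudspeaker signals."""
    left_gain, right_gain = pan_gains(x)
    left = [left_gain * s for s in samples]
    right = [right_gain * s for s in samples]
    return left, right
```

In such a sketch, x would be updated for each video frame from the
position reported by the video signal analyzer, so that the sound image
follows the object.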
FIG. 2 is a block diagram showing a modification of the embodiment shown in
FIG. 1. The modification shown in FIG. 2 differs from the embodiment shown
in FIG. 1 in that it has a database 12 (hereinafter called an AV database)
which stores combinations of video signal characteristics and
corresponding audio signal processing parameters. The AV database 12 is
accessed to read the parameters matching the characteristics of a video
signal analyzed by the video signal analyzer. A database is used for the
control of an audio signal because in some cases the correspondence
between the video signal characteristics and the audio signal control
characteristics cannot be properly calculated. For this reason, instead of
programming a characteristic control sequence for the acoustic editing, a
database is used which stores relationships between video signals and
corresponding audio signal processing parameters determined from the
practical experience of acoustic editors. For audio signal processing that
matches the listener's senses, it is more realistic and effective to rely
on the experience of professional editors.
In correspondence with video signal characteristics, the AV database 12
stores, for example:
(1) a volume of sound to be reproduced;
(2) a balance between sounds reproduced by loudspeakers;
(3) the frequency characteristics of an audio signal to be reproduced
(equalizing characteristics);
(4) the characteristics of a reverberation signal to be added to an audio
signal by the audio signal processor (e.g., an impulse response used by a
convolution calculation); and
(5) the amplitudes of, balance and transmission time difference between,
direct sounds (sounds directly received by a listener without reflection
by a wall or the like) and reflected sounds.
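The stored items (1) to (5) can be pictured as one record per video
signal characteristic. The following is a minimal sketch under stated
assumptions: the class name AudioParams, the field names, the scene
labels ("conversation", "hall", "default"), and all numeric values are
hypothetical placeholders and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AudioParams:
    volume_gain: float                  # (1) volume of the reproduced sound
    balance: Tuple[float, float]        # (2) left/right loudspeaker gains
    equalizer_taps: Tuple[float, ...]   # (3) equalizing filter coefficients
    reverb_impulse: Tuple[float, ...]   # (4) impulse response for convolution
    direct_gain: float                  # (5) amplitude of the direct sound
    reflected_gain: float               # (5) amplitude of the reflected sound
    reflected_delay_samples: int        # (5) transmission time difference


# Hypothetical entries; real values would be chosen by acoustic editors.
AV_DATABASE: Dict[str, AudioParams] = {
    "default": AudioParams(1.0, (1.0, 1.0), (1.0,), (1.0,), 1.0, 0.0, 0),
    "conversation": AudioParams(1.0, (1.0, 1.0), (1.0, -0.9), (1.0,), 1.0, 0.0, 0),
    "hall": AudioParams(0.8, (1.0, 1.0), (1.0,), (1.0, 0.0, 0.5, 0.25), 1.0, 0.5, 2),
}


def lookup_params(video_class: str) -> AudioParams:
    """Return the parameters stored for a video characteristic label."""
    return AV_DATABASE.get(video_class, AV_DATABASE["default"])
```

A label that is not stored falls back to the "default" entry in this
sketch; how an unmatched characteristic is actually handled is not stated
in the description.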
These stored parameters will now be described in more detail. The
parameters (1) and (2) are associated with the volume of an audio signal
and can be applied by adjusting the gain of an audio signal output
amplifier. The parameters (1) and (2) are therefore gain data of audio
signal output amplifiers. The parameters (3) to (5) are applied by a
digital signal convolution calculation. For these parameters, a digital
filter can be configured, for example as shown in FIG. 20, by delay
elements, multipliers, and an adder. In this digital filter, an inputted
digital audio signal is delayed by the delay elements 150 to 154 by
integer multiples of the sampling time, and each delayed output is then
supplied to the corresponding one of the multipliers 155 to 161. Each of
the multipliers 155 to 161 multiplies its input audio signal data by a
preset coefficient (coefficients l to m) and outputs the result to the
adder 162. The adder 162 adds all the outputs from the multipliers 155 to
161 to produce the final output of the filter shown in FIG. 20. With the
digital filter of the embodiment shown in FIG. 20, the equalizing and
reverberation characteristics of an audio signal can be adjusted by
changing the coefficients of the multipliers. For example, if the clarity
of speech is to be improved by using the digital filter, the filter
coefficients for the cut-off characteristics of a low frequency range are
set so as to stop low frequency components, which lower speech clarity.
If reverberation signals are to be added, the filter coefficients for a
low-pass filter are set so as to prolong the duration of the impulse
response.
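The delay, multiply, and add structure described above corresponds to a
tapped delay line (a finite impulse response filter). A minimal sketch
follows, assuming that the first tap acts on the undelayed input and that
the number of taps equals the number of coefficients supplied; the
function name fir_filter is hypothetical.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[float], coefficients: Sequence[float]) -> List[float]:
    """Tapped delay line: y[n] = sum over k of coefficients[k] * x[n - k]."""
    # delay_line[k] holds the input delayed by k sampling periods.
    delay_line = [0.0] * len(coefficients)
    output = []
    for x in samples:
        # Shift every stored sample by one sampling period (the delay elements).
        delay_line = [x] + delay_line[:-1]
        # Multiply each tap by its coefficient (the multipliers) and sum (the adder).
        output.append(sum(c * d for c, d in zip(coefficients, delay_line)))
    return output
```

Feeding a unit impulse into this sketch returns the coefficients
themselves, which is why changing the coefficients changes the impulse
response and hence the equalizing and reverberation characteristics.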
FIG. 21 shows an embodiment of a circuit for processing direct and indirect
sounds and determining a sound image orientation. This circuit includes
right and left signal processors 190 and 191 for performing a stereophonic
process for a single series of an audio signal, and a signal distributor
170. The right and left signal processors 190 and 191 have the same
structure. Therefore, the structure and operation of only the right signal
processor 190 will be described by way of example. The signal processor
has digital filters 171 to 173, gain controllers 174 to 176, and an adder
177. Using a plurality of digital filters makes it possible to generate a
direct sound and indirect sounds and to adjust the frequency
characteristics of an output signal and the mixing ratio of a direct sound
to indirect sounds. By stereophonically reproducing right and left channel
signals generated in this manner, it becomes possible to generate an audio
signal with excellent sound image localization. For the control of sound
image localization,
digital filter coefficients and a mixing ratio (particularly, gain values)
of a direct sound to indirect sounds are stored in advance in the
database.
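One channel of such a signal processor (filters, gain controllers, and an
adder) might be sketched as follows. This is an illustrative sketch only:
the function names fir and channel_processor, the representation of each
branch as a (filter taps, gain) pair, and the modeling of an indirect
sound as a pure delay are assumptions, not details of the circuit shown in
FIG. 21.

```python
from typing import List, Sequence, Tuple

Branch = Tuple[Sequence[float], float]  # (filter taps, gain) for one branch


def fir(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * x[n - k]."""
    return [
        sum(taps[k] * samples[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(samples))
    ]


def channel_processor(samples: Sequence[float], branches: Sequence[Branch]) -> List[float]:
    """Filter the input in each branch, scale it by the branch gain, and add."""
    mixed = [0.0] * len(samples)
    for taps, gain in branches:
        filtered = fir(samples, taps)  # one digital filter per branch
        mixed = [m + gain * f for m, f in zip(mixed, filtered)]  # gain, then adder
    return mixed


# Hypothetical settings: one direct sound and two delayed indirect sounds.
RIGHT_BRANCHES: List[Branch] = [
    ([1.0], 1.0),                  # direct sound, no delay
    ([0.0, 0.0, 1.0], 0.5),        # indirect sound delayed by 2 samples
    ([0.0, 0.0, 0.0, 1.0], 0.25),  # indirect sound delayed by 3 samples
]
```

The left channel would use the same function with its own branch
settings, and the taps and gains would be the values read from the
database.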
Next, an example of using stored audio signal processing data will be
briefly described. Consider, for example, a video signal transmitted
from a partner communication site that contains an image of a person
whose mouth is moving. In such a case, the video signal analyzer 11 shown
in FIG. 2 judges that the person in the image is speaking. The audio
signal transmitted with the video signal represents the speech of the
person. Therefore, in order to improve the clarity of the audio signal,
coefficients of the digital filters suitable for suppressing low frequency
components are read from the database and the audio signal is inputted to
the digital filters. The upper limit of a high frequency range of a human
voice is about 7 kHz. Therefore, the filter coefficients are also adjusted
to cut frequencies of 7 kHz or higher to elimin