BACKGROUND OF THE INVENTION
This invention relates in general to a word pronunciation voice verifying
system for automatically verifying a plurality of words uttered by a
speaker, and particularly to such a system for extracting the most
recognizable characteristic feature from the voice.
The characteristic feature or parameter which has heretofore been employed
in such a voice verifying system is extracted by a logic circuit based on
the following formula:
##EQU1##
where, when
##EQU2##
In the formula given above, x.sub.i is a characteristic parameter
corresponding to a specified vocal sound x.sub.i ; F.sub.j is the output
of a band pass filter for extracting a desired voice frequency;
.gamma..sub.i is a threshold value; and .alpha..sub.j, .beta..sub.j are
weights applied to the band pass filter outputs. The component .SIGMA..beta..sub.j
F.sub.j characterizes vocal sounds which are liable to be confused with a
vocal sound x.sub.i and produce an error, and such component is therefore
subtracted from the component .SIGMA..alpha..sub.j F.sub.j characterizing
the vocal sound x.sub.i. If the value thus obtained is larger than a
predetermined threshold value .gamma..sub.i, it is used as a
characteristic parameter which characterizes the vocal sound x.sub.i so
that a high verification accuracy can be obtained. The characteristic
parameter extracted by formulas (1) and (2) can provide a highly stable
verification for a specified individual for whom the weights
.alpha..sub.j, .beta..sub.j and the threshold value .gamma..sub.i are
previously set, but lacks stability when the speaker is replaced by
another person. Thus, the prior art system is not suitable for the
verification of a number of different speakers.
SUMMARY OF THE INVENTION
The object of the present invention is to eliminate the above-mentioned
disadvantages by providing a word pronunciation voice verifying system
having a unit for extracting a highly stable characteristic parameter for
a number of speakers.
The system of the present invention comprises, as shown by the block
diagram of FIG. 1, a unit 1 for standardizing the level of the input voice
uttered by a speaker whose voice is to be verified, a frequency analyzing
unit 2 for analyzing the standardized voice signal by a plurality of
parallel channels each having a different band pass frequency, a sample
and hold unit 3 for detecting the output level of each of the plurality of
channels and holding the maximum value during a sampling time for each of
the channels, a characteristic feature extracting unit 4 for adding
various weights to the output F.sub.j of each of the channels in the
detected frequency bands to extract vocal sound information required for
the verification, a gate unit 5, responsive to a signal from a parameter
discriminating unit 13, for selecting the parameter to be transmitted out
of the time series patterns for each of the input voices extracted by the
characteristic feature extracting unit 4, a memory unit 6 for storing the
characteristic parameter pattern which has passed through the selective
gate unit 5, a binary reference parameter memory unit 12 for storing the
time series pattern of the characteristic parameter previously extracted
from the voice of a reference speaker "A," a first resemblance calculator
unit 11 for calculating the resemblance between the parameters of the
patterns stored in the memory unit 6 and the reference parameter memory
unit 12, respectively, a discriminating unit 13 for selecting a suitable
parameter corresponding to a maximum resemblance based on the result of
said resemblance calculation, sending out a selective signal relative to
the suitable parameter to the selective gate unit 5 and storing said
parameter selective signal therein, a second resemblance calculator unit 7
for calculating, by a pattern matching process, the resemblance between
the unknown input pattern to be verified, stored in the memory unit 6, and
the reference parameter, stored in the memory unit 12, and a
discriminating and output unit 8 for discriminating the vocabulary by
regarding the reference pattern, consisting of reference parameters
corresponding to the maximum resemblance derived from the result of said
resemblance calculation, as an input pattern and sending out the result of
the discrimination as an output.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings:
FIG. 1 is a block diagram of a voice verifying system according to a first
embodiment of the present invention,
FIG. 2 is a more detailed block diagram of the voice level standardizing
unit shown in FIG. 1,
FIG. 3 is a block diagram of a single analog filter which may be used in
the frequency analyzer unit 2, shown in FIG. 1,
FIGS. 4, 5 and 6 are block diagrams of the characteristic feature
extracting unit, and
FIG. 7 is a block diagram showing the principal components of another
embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The verification system shown in FIG. 1 and broadly cataloged above will
now be described in detail with reference to FIGS. 2-6.
The voice level standardizing unit 1 consists of an AGC circuit adapted to
standardize the voice input level of a speaker to be verified and transmit
the standardized voice signal to the frequency analyzer unit 2. It must
have a sufficiently fast response characteristic to follow even abrupt
changes in the level of the input voice wave. FIG. 2 is a block diagram of
one example of such a unit showing the waveform patterns at various
stages. An input voice wave applied to a terminal 21 is fed to a full wave
rectifier 22 and a delay circuit 23. A variable time constant circuit 24
produces an approximate envelope of the output waveform from the rectifier
22. A low pass filter (LPF) 25 eliminates any unevenness still remaining
in the output of the variable time constant circuit 24, thereby producing
a more smoothly shaped envelope. An adder circuit 26 adds a direct current
component to the envelope output so that no zero level signal is used as a
divisor in the divider circuit 27. The delay circuit 23 provides the input
voice wave with a time delay equal to that of the envelope output whereby
the time patterns of the two signals correspond to each other. The delayed
input voice wave is fed to the divider circuit 27, and a standardized output
is obtained at output terminal 28 in which the peak value of the input
voice wave is confined to the envelope and kept at a nearly constant
level.
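By way of illustration only, the signal path of FIG. 2 can be sketched in discrete time. The one-pole smoothing, the attack and release coefficients, and the value of the direct current offset are assumptions made for this sketch and are not taken from the specification. The delay circuit 23 is omitted because the sketch divides each sample by the envelope value available at the same instant.

```python
def standardize_level(samples, attack=0.5, release=0.05, smooth=0.2, dc_offset=0.01):
    """Hypothetical discrete-time analogue of the level standardizing unit 1."""
    envelope = 0.0   # stands in for the variable time constant circuit 24
    smoothed = 0.0   # stands in for the low pass filter 25
    out = []
    for s in samples:
        rectified = abs(s)  # full wave rectifier 22
        # Assumed behaviour: follow rising levels quickly, falling levels slowly.
        coeff = attack if rectified > envelope else release
        envelope += coeff * (rectified - envelope)
        smoothed += smooth * (envelope - smoothed)
        divisor = smoothed + dc_offset  # adder 26 keeps the divisor above zero
        out.append(s / divisor)         # divider circuit 27
    return out


# Two inputs whose amplitudes differ by a factor of ten.
loud = standardize_level([1.0 if n % 2 == 0 else -1.0 for n in range(200)])
quiet = standardize_level([0.1 if n % 2 == 0 else -0.1 for n in range(200)])
```

In this sketch the two standardized outputs settle at nearly the same level, and a silent input produces no division by zero because of the added offset.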
The frequency analyzer unit 2 consists of a group of active filters
forming a plurality of channels (13 channels in this embodiment), and is
adapted to spectrum analyze the standardized voice signal. FIG. 3 is a
block diagram showing the arrangement of an analog filter particularly
suitable for the present invention. Reference numeral 31 denotes an input
terminal, 32 an adder, 33, 34 integrators, 35, 36 potentiometers, and 37
an output terminal. Its transmission coefficient G(S) is given by the
following formula:
##EQU3##
where T is the time constant of the integrators, and .alpha. and .beta.
are the potentiometer coefficients.
Such an analog filter is simple in construction as compared with a digital
filter, and spectrum analysis can be easily effected thereby. Needless to
say, however, any other type of filter can be used in the present
invention.
The output level detector (sample and hold unit) 3 samples the output of each of the channels of
the frequency analyzer unit 2 in a proper time interval, for example, a
sampling period of 10 ms, and holds the peak value thereof. With a
sampling period of about 10 ms, the output of the filter can be easily
sampled even in the range of consonant sounds.
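The peak holding operation for one channel may be sketched as follows. The function name, the frame length expressed in samples, and the assumption that the channel levels are already non-negative (rectified) are hypothetical choices for this example.

```python
def peak_hold(channel_levels, samples_per_frame):
    """Return the maximum level held in each consecutive sampling period."""
    peaks = []
    for start in range(0, len(channel_levels), samples_per_frame):
        frame = channel_levels[start:start + samples_per_frame]
        peaks.append(max(frame))  # value held until the end of the period
    return peaks


# Six levels of one channel, grouped into periods of three samples each.
held = peak_hold([0.1, 0.5, 0.3, 0.2, 0.9, 0.4], 3)
```

With a 10 ms sampling period, `samples_per_frame` would be the number of input samples falling within 10 ms at whatever rate the channel output is observed.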
The characteristic feature extracting unit 4 is arranged to render the
weights .alpha..sub.j and .beta..sub.j given in formula (1) variable to
suit the individual speaker in order to extract the most suitable
parameter exactly corresponding to a deviation in characteristic
parameters due to individual differences between speakers. Accordingly, p
sets of different characteristic parameters are prepared for one vocal
sound x.sub.i. More specifically, formula (1) can be expanded as follows:
##EQU4##
where x.sub.ik is the k th of the p sets of characteristic parameters
produced for the vocal sound x.sub.i, and can be expressed by the
following formulas:
##EQU5##
wherein,
when x.sub.ik > 0 then x.sub.ik = 1, and
when x.sub.ik .ltoreq. 0 then x.sub.ik = 0. (5)
In formulas (3) to (5) above, "i" is the number of vocal sounds "x" to be
verified, "n" is the number of output channels of the output level
detector 3, and "p" is the number of characteristic parameters prepared
for one vocal sound depending on the required accuracy rate of the
verification and the number of speakers to be verified. .alpha..sub.jk and
.beta..sub.jk are weights for the k th characteristic parameter x.sub.ik
selected by the same method as the weights .alpha..sub.j and .beta..sub.j
in order to emphasize the characteristic feature of the vocal sound
x.sub.i, thereby enabling verification to be made easily and accurately.
.gamma..sub.ik is a threshold value of the parameter x.sub.ik. Stated more
specifically, according to the present invention, the assembly of p sets
of characteristic parameters as shown on the right side of formula (3);
that is {x.sub.i0, x.sub.i1, x.sub.i2 . . . x.sub.ik . . . x.sub.ip-1 },
can be obtained by permitting the value of the weights .alpha..sub.j and
.beta..sub.j given in formula (1) to slowly change. x.sub.ik is the k th
parameter as counted from the parameter x.sub.i0 in the above-mentioned
assembly.
In formula (4), expanded in the same manner as formula (1),
##EQU6##
is a component characterizing the vocal sound x.sub.i, while
##EQU7##
is a component which is liable to be confused with
##EQU8##
and cause an erroneous verification.
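The threshold logic described above may be sketched as follows. Because the formulas themselves are reproduced only as figures, the sketch assumes that the k th parameter is the weighted sum characterizing the vocal sound, less the weighted sum of the confusable component, less the threshold value, and that the result is then made binary according to formula (5). All names and numerical values are hypothetical.

```python
def characteristic_parameter(F, alpha_k, beta_k, gamma_k):
    """Binary parameter for one vocal sound from the held channel outputs F."""
    emphasis = sum(a * f for a, f in zip(alpha_k, F))   # component characterizing the sound
    confusion = sum(b * f for b, f in zip(beta_k, F))   # component liable to be confused with it
    value = emphasis - confusion - gamma_k              # assumed form of formula (4)
    return 1 if value > 0 else 0                        # binarization of formula (5)


# Three channels; the first is assumed to carry the sound of interest.
alpha = [1.0, 0.0, 0.0]
beta = [0.0, 1.0, 1.0]
hit = characteristic_parameter([0.8, 0.1, 0.05], alpha, beta, 0.05)
miss = characteristic_parameter([0.1, 0.6, 0.3], alpha, beta, 0.05)
```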
Assume that the input vocabulary has characteristic parameters
x.sub.1, x.sub.2, . . . x.sub.5 given by formula (1) corresponding to the
five vowels .vertline.a.vertline., .vertline.i.vertline.,
.vertline.u.vertline., .vertline.e.vertline., .vertline.o.vertline., and
that the values of the weights .alpha..sub.j, .beta..sub.j of these
parameters are allowed to change slowly for a speaker "A" who is selected
as a reference, as indicated in Table 1, to provide a number of additional
characteristic parameters. With p=2 and .gamma.=0.05, it was found that,
for each of fifty arbitrarily selected adult men, the characteristic
parameters additionally provided in this way include parameters suitable
for that speaker.
This clearly indicates that the method of the invention is very effective
for a great many speakers. Further, when the speaker is a woman or a
child, parameters suitable for the voice of women and children can be
obtained by somewhat changing the connections between the characteristic
feature extraction unit 4 and the filter output channels. Thus, Table 1 is
shown by way of example only and is not intended to limit the scope of the
present invention.
Table 1
______________________________________
Changes of Weights .alpha..sub.j, .beta..sub.j of Characteristic Parameters

characteristic      changes of                                             changes of
parameter x.sub.i   weight .alpha..sub.jk                                  weight .beta..sub.jk
______________________________________
x.sub.1             .alpha..sub.1k = .alpha..sub.1 .+-. Kr.alpha..sub.1    .beta..sub.1k = .beta..sub.1 .-+. Kr.beta..sub.1
x.sub.2             .alpha..sub.2k = .alpha..sub.2 .+-. Kr.alpha..sub.2    .beta..sub.2k = .beta..sub.2 .-+. Kr.beta..sub.2
x.sub.3             .alpha..sub.3k = .alpha..sub.3 .+-. Kr.alpha..sub.3    .beta..sub.3k = .beta..sub.3 .-+. Kr.beta..sub.3
x.sub.4             .alpha..sub.4k = .alpha..sub.4 .+-. Kr.alpha..sub.4    .beta..sub.4k = .beta..sub.4 .-+. Kr.beta..sub.4
x.sub.5             .alpha..sub.5k = .alpha..sub.5 .+-. Kr.alpha..sub.5    .beta..sub.5k = .beta..sub.5 .-+. Kr.beta..sub.5
______________________________________
K = 0, 1, 2, . . . , P - 1
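The weight changes of Table 1 may be sketched as follows. The quantity r is not defined in the text; the sketch assumes it to be a small step ratio, and treats the upper signs (.alpha. increased, .beta. decreased) and the lower signs (.alpha. decreased, .beta. increased) as two alternative directions selected by a hypothetical `sign` argument.

```python
def weight_variants(alpha, beta, r, p, sign=+1):
    """Return p pairs (alpha_k, beta_k) following the pattern of Table 1.

    alpha_k = alpha + sign*K*r*alpha and beta_k = beta - sign*K*r*beta
    for K = 0, 1, ..., p - 1, where r is an assumed step ratio.
    """
    variants = []
    for k in range(p):
        scale_alpha = 1 + sign * k * r
        scale_beta = 1 - sign * k * r
        variants.append(([a * scale_alpha for a in alpha],
                         [b * scale_beta for b in beta]))
    return variants


upper = weight_variants([1.0], [2.0], r=0.5, p=2)            # upper signs of Table 1
lower = weight_variants([1.0], [2.0], r=0.5, p=2, sign=-1)   # lower signs of Table 1
```

The entry for K = 0 always reproduces the reference weights, so the original parameter of the reference speaker remains among the candidates.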
As may be seen from formula (4), the characteristic parameters extracted by
formulas (4) and (5) can be made linear, and a suitable adder circuit can
be easily constituted by analog operational elements and a Schmitt trigger
circuit. Further, as is clear from formula (5), the output of this circuit
is a binary signal, whereby it is extremely convenient for digital
processing.
FIG. 4 is a block diagram showing one embodiment of a threshold value logic
circuit constituting the characteristic feature extracting unit 4. Such a
circuit comprises two adders 41, 42, consisting of analog operational
elements, and a Schmitt trigger circuit 43 fed by the output of adder 42,
whereby the characteristic parameter x.sub.ik is obtained at the output.
Accordingly, i times p sets of similar threshold value logic circuits are
required for the characteristic feature extracting unit 4.
In FIG. 4, when an error occurs in which the parameter x.sub.ik of the
output signal produces not only an output in response to a predetermined
vocal sound X, but also an output in response to another vocal sound Y, or
when it is difficult to produce an output in response to the predetermined
vocal sound X, then the component of the vocal sound Y or the component of
the vocal sound X can be separately extracted by using the additional
adders 44, 45 as shown in FIG. 5, and the extracted component can be
supplied to the adder 41 for .beta..sub.j F.sub.j or the adder 42 for
.alpha..sub.j F.sub.j, thereby correcting or intensifying the output.
Further, as shown in FIG. 6, circuits 41a, 42a, 43a having the same
arrangement as those of FIG. 4 but with different weights can be provided,
and the outputs supplied to an AND circuit 46, thereby extracting only the
parameter for the vocal sound X. More specifically, even if an error is
caused between the vocal sounds .vertline.a.vertline. and
.vertline.i.vertline. in circuits 41, 42, 43, while an error also occurs
between the vocal sounds .vertline.a.vertline. and .vertline.e.vertline.
in circuits 41a, 42a, 43a, only the vocal sound .vertline.a.vertline.
appears at the output of the AND circuit 46, whereby the occurrence of
errors can be completely prevented.
The selective gate unit 5 consists of an AND circuit. The characteristic
parameter x.sub.ik obtained by formula (4) is fed in as one input, and a
selective signal (described in detail below) relative to the
characteristic parameter is applied as another input. Therefore, only the
characteristic parameters which correspond to the selective signal are
passed through the selective gate unit and transmitted to the memory unit
6 where they are stored.
The memory unit 6 consists of a RAM (Random Access Memory) adapted to
store binary time series patterns which have passed through the selective
gate unit 5.
The reference parameter memory unit 12 consists of a PROM (Programmable
Read Only Memory) adapted to store time series reference parameter
patterns extracted by the characteristic feature extracting unit 4, to
which the voice uttered by a reference speaker "A" is fed as a reference
voice input.
The first resemblance calculator unit 11 is adapted to calculate the
resemblance between the characteristic parameter extracted by the unit 4,
to which the voice uttered by a speaker "a" to be verified is fed as an
input, and the reference parameter, for the purpose of selecting a
characteristic parameter most suitable for speaker "a". Since both
parameters are binary signals, the Hamming distance process can be used to
calculate the resemblance. When the reference speaker "A" utters the same
words as speaker "a", who is to be verified, the extracted characteristic
parameters x.sub.1, x.sub.2, . . . x.sub.i form a binary time series
pattern for the respective vocabulary Y.sub.m. The pattern can be
expressed by the following formula:
Y.sub.m = {y.sub.m (t).vertline.x.sub.1 (t.sub.ym), x.sub.2 (t.sub.ym),
. . . , x.sub.i (t.sub.ym)} (6)
where
m = 1, 2, . . . , q
q = the number of words in the input vocabulary or recognition phrase, and
i = the number of vocal sounds included in the vocabulary Y.sub.m.
In formula (6) above, Y.sub.m is the input vocabulary to be verified, for
example, Y.sub.1: /one/, Y.sub.2: /two/, Y.sub.3: /three/, . . . , Y.sub.q:
/multiple/, etc. Y.sub.m is a time series pattern consisting of the
characteristic parameters x.sub.1 (t.sub.ym), x.sub.2 (t.sub.ym), . . . ,
x.sub.i (t.sub.ym) corresponding to the vocabulary Y.sub.m, and is thus
also a pattern in time "t".
When speaker "a" utters the reference vocabulary, the characteristic
parameter extracted by unit 4 is given by the right half of formula (3),
and it can be detected by the following formula:
Y.sub.m = {y'.sub.m (t).vertline.x.sub.10 (t.sub.y'm), x.sub.11
(t.sub.y'm), . . . , x.sub.1k (t.sub.y'm), . . . , x.sub.1p-1
(t.sub.y'm),x.sub.20 (t.sub.y'm), x.sub.21 (t.sub.y'm), . . . , x.sub.2k
(t.sub.y'm), . . . , x.sub.2p-1 (t.sub.y'm), . . . , x.sub.i0 (t.sub.y'm),
x.sub.i1 (t.sub.y'm), . . . , x.sub.ik (t.sub.y'm), . . . , x.sub.ip-1
(t.sub.y'm)} (7)
where m = 1, 2, . . . , q
In formula (7), Y.sub.m (t) is a time series pattern corresponding to the
vocabulary Y.sub.m which includes the characteristic parameter suitable
for verifying the words uttered by speaker "a" with minimum error. More
specifically, each of the parameter assemblies expressed by {x.sub.10
(t.sub.y'm), x.sub.11 (t.sub.y'm), . . . , x.sub.1p-1 (t.sub.y'm)}
{x.sub.20 (t.sub.y'm), x.sub.21 (t.sub.y'm), . . . , x.sub.2p-1
(t.sub.y'm) } . . . {x.sub.i0 (t.sub.y'm), x.sub.i1 (t.sub.y'm), . . . ,
x.sub.ip-1 (t.sub.y'm) } should include at least one such suitable
parameter. In order to select these suitable parameters, the resemblance
or degree of correspondence between the reference parameters of speaker
"A" and the characteristic parameter of speaker "a" can be calculated as
follows. First, in order to select a suitable parameter out of the
assembly {x.sub.10 (t.sub.y'm), x.sub.11 (t.sub.y'm), . . . , x.sub.1p-1
(t.sub.y'm)}, it is necessary to find a Hamming distance between x.sub.1k
(t.sub.y'm) and x.sub.1 (t.sub.ym). For example, the digits forming each
parameter can be applied to a NAND circuit, and their sum can be expressed
as a Hamming distance S.sub.1k as follows:
##EQU9##
where .phi.(x.sub.10 (t.sub.y'm), x.sub.1 (t.sub.ym)) is the resemblance
between the parameters x.sub.10 (t.sub.y'm) and x.sub.1 (t.sub.ym) in the
input vocabulary Y.sub.m. Accordingly, S.sub.10 is the sum of the
resemblances in the respective vocabulary. Further, .phi.(x.sub.11
(t.sub.y'm), x.sub.1 (t.sub.ym)) is the resemblance between the parameters
x.sub.11 (t.sub.y'm) and x.sub.1 (t.sub.ym), and S.sub.11 is the sum
thereof. Similarly, S.sub.1p-1 shows the sum of the resemblances between
the parameters x.sub.1p-1 (t.sub.y'm) and x.sub.1 (t.sub.ym).
If the maximum resemblance sum among the total sums S.sub.10, S.sub.11, . .
. , S.sub.1p-1 in formula (8) is assumed to be S.sub.1k, then the
parameter of S.sub.1k is the most suitable one for verifying the voice of
speaker "a" with minimum error. The selected parameter represents the most
recognizable characteristic in the entire input vocabulary, whereby a
highly stable characteristic parameter can be selected for a given input
vocabulary. In a similar manner, regarding {x.sub.20 (t.sub.y'm), x.sub.21
(t.sub.y'm), . . . , x.sub.2p-1 (t.sub.y'm)}, . . . , {x.sub.k0
(t.sub.y'm), x.sub.k1 (t.sub.y'm), . . . , x.sub.kp-1 (t.sub.y'm)}, . . .
, {x.sub.i0 (t.sub.y'm), x.sub.i1 (t.sub.y'm), . . . , x.sub.ip-1
(t.sub.y'm)}, a suitable parameter can be selected among each of the
parameter assemblies by determining the resemblances between them and
x.sub.2 (t.sub.ym), . . . , x.sub.k (t.sub.ym), . . . , x.sub.i
(t.sub.ym).
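The selection of a suitable parameter for one vocal sound may be sketched as follows. The sketch assumes that the resemblance of two binary time series is the number of positions at which they agree (that is, their length less the Hamming distance), and that the series being compared have already been brought to equal length. The function names and the sample data are hypothetical.

```python
def resemblance(a, b):
    """Count the positions at which two equal-length binary series agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def select_parameter(candidates, reference):
    """Pick the candidate k whose resemblance sum over all words is largest.

    candidates[k][m]: series of the k-th candidate parameter for word m
                      (speaker to be verified).
    reference[m]:     series of the reference parameter for word m
                      (reference speaker).
    """
    sums = [sum(resemblance(cand[m], reference[m]) for m in range(len(reference)))
            for cand in candidates]
    best_k = max(range(len(sums)), key=sums.__getitem__)
    return best_k, sums


reference = [[1, 1, 0, 0], [0, 1, 1, 0]]        # two words
candidates = [
    [[1, 0, 0, 0], [0, 0, 1, 0]],               # k = 0
    [[1, 1, 0, 0], [0, 1, 1, 1]],               # k = 1
]
best_k, sums = select_parameter(candidates, reference)
```

The same routine would be applied once for each vocal sound, yielding one selected index per parameter assembly.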
The suitable parameter discriminating unit 13 is adapted to discriminate
the parameter having the maximum resemblance sum among all of the
resemblances calculated by the first resemblance calculator unit 11 and
store the latter therein. The discriminating unit 13 sends out a code
signal corresponding to the suitable parameter as a parameter selective
signal to the selective gate unit 5.
In discriminating a suitable parameter, it is not always necessary to use
all of the input vocabulary words. For example, if only 10 out of 50 words
are used, a suitable parameter can be obtained.
If the characteristic pattern is recomposed in relation to formula (7)
considering only the suitable parameter, the following formula results:
Y.sub.m = {y".sub.m (t) .vertline.x.sub.1K (t.sub.y'm), x.sub.2K
(t.sub.y'm), . . . , x.sub.iK (t.sub.y'm)} (9)
where m = 1, 2, . . . , q
In formula (9), "q" is the number of words in the input vocabulary, "i" is
the number of vocal sounds included in the m th word Y.sub.m, and x.sub.1K
(t.sub.y'm), x.sub.2K (t.sub.y'm), . . . , x.sub.iK (t.sub.y'm) are
suitable parameters. The pattern obtained with formula (9) can be used as
a parameter selective signal for the speaker "a."
The second resemblance calculator unit 7 calculates the resemblance between
the reference pattern, consisting of reference parameters stored in the
reference parameter memory unit 12, and an unknown pattern extracted from
the voice of speaker "a" to find a Hamming distance between the two
patterns, and transmits the calculated resemblance to a discriminating and
output unit 8. The latter discriminates the reference pattern
corresponding to the maximum resemblance, and sends out a word
corresponding to the latter as an output.
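The discrimination of a word at the final stage may be sketched as follows. The sketch assumes that each reference pattern and the unknown pattern have been flattened into binary sequences of equal length, and that resemblance is counted as the number of agreeing positions. The dictionary of reference patterns is hypothetical sample data.

```python
def discriminate_word(unknown, reference_patterns):
    """Return the word whose reference pattern agrees best with the unknown pattern."""
    def agreement(a, b):
        return sum(1 for x, y in zip(a, b) if x == y)

    return max(reference_patterns,
               key=lambda word: agreement(unknown, reference_patterns[word]))


reference_patterns = {
    "one": [1, 1, 0, 0, 1, 0],
    "two": [0, 0, 1, 1, 0, 1],
}
word = discriminate_word([1, 1, 0, 1, 1, 0], reference_patterns)
```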
In verifying words or vocabularies by the system of the present invention,
a learning function is first necessary. More specifically, when speaker
"a" initially utters all or part of the words or vocabularies to be
verified, his voice input is applied to the voice level standardizing unit
so that the time series patterns of characteristic parameters expressed by
formula (7) can be obtained by the characteristic feature extracting unit.
At that time the selective gate unit 5 is fully opened so that the time
series patterns are all sent to the memory unit 6, and then transmitted to
the first resemblance calculator unit 11 where a suitable parameter given
by formula (9) is determined. A parameter selective signal corresponding
to the suitable parameter is stored in the suitable parameter
discriminating unit, thereby completing the learning function.
In verifying vocabularies, when speaker "a" subsequently utters a
vocabulary, the time series pattern of the characteristic parameter
expressed by formula (7) is transmitted to the selective gate unit 5 as
before. At that time, a parameter selective signal from the suitable
parameter discriminating unit 13 is applied to the gate unit 5, and
therefore only the gate corresponding to the reference pattern expressed
by formula (9) is opened. Therefore, only the most suitable or
recognizable parameter for speaker "a" passes through the gate unit 5 and
reaches the second resemblance calculator unit 7, where it is compared
with the reference pattern to thereby enable the words or vocabulary
uttered by speaker "a" to be verified.
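The two states of the selective gate unit 5 may be sketched as follows. The data layout, in which the candidates for each vocal sound are held in a nested list and the stored selective signal is a list of indices, is an assumption made for this example.

```python
def selective_gate(patterns, selection=None):
    """Pass candidate series through the gate.

    patterns[i][k]: binary series of the k-th candidate for vocal sound i.
    selection[i]:   index k stored during learning for vocal sound i.
    With no selection the gate is fully open and every candidate passes.
    """
    if selection is None:
        return patterns                      # learning: all candidates reach the memory
    return [candidates[k] for candidates, k in zip(patterns, selection)]


patterns = [
    [[1, 0], [1, 1]],   # vocal sound 0, candidates k = 0 and k = 1
    [[0, 0], [0, 1]],   # vocal sound 1, candidates k = 0 and k = 1
]
learning_output = selective_gate(patterns)
verifying_output = selective_gate(patterns, selection=[1, 0])
```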
With the invention constructed as developed above, the system can be
implemented by a simple circuit arrangement, and a high verification
accuracy rate can be obtained by extracting a stable characteristic
parameter for a great many speakers.
In particular, when effecting pattern matching, a Hamming distance process
can be applied by coding characteristic parameters into binary signals so
that the processing mechanism for discriminating the vocabulary at the
final stage can be greatly simplified. Further, in obtaining an optimum
pattern match, the time axis and poke operation can be increased and
reduced as with previous pattern processing.
A suitable parameter can usually be selected by having the speaker utter
only part of the complete vocabulary selected at the time of learning, and
therefore the learning time can be considerably reduced. According to
experiments, it has been found that uttering only 10 out of 50 vocabulary
words enables the use of the system.
As is obvious from the foregoing description, the first and second
resemblance calculator units 11, 7 have similar functions, so that, as
shown in FIG. 7, the two units can be replaced by a single resemblance
calculator unit 14 whose output is selectively switched to the
discriminating units 8 and 13. In FIG. 7, the same reference numerals
employed in FIG. 1 indicate the same components.
* * * * *