Description
BACKGROUND OF THE INVENTION
Our invention relates to voice analysis and, more particularly, to speaker
verification and identification arrangements.
It is often desirable to identify an individual or verify an asserted
identity from voice characteristics. Commercial transactions conducted
over telephone facilities are expedited when a party can be identified
immediately without resorting to documents or prearranged codes.
Similarly, controlled access to secured premises is facilitated by the use
of voice identification techniques. Prior automatic speaker recognition
systems have been based on the comparison of a predetermined spoken
message with a previously stored reference of the same or similar message,
or a comparison of selected speech parameters of particular utterances
with stored parameters of a corresponding utterance. Combinations of pitch
period, intensity, formant and other speech characteristics have been
utilized for speaker recognition.
In one type of system such as disclosed in U.S. Pat. No. 3,466,394 issued
to W. K. French on Sept. 9, 1969, selected peaks and valleys of successive
pitch periods are used to obtain characteristic coordinates of the voiced
input of an unknown speaker. These coordinates are selectively compared to
previously stored reference coordinates. As a result of the comparison, a
decision is made as to the identity of the unknown speaker. This
arrangement as well as others relying on particular speech characteristics
require that the characteristic coordinates be normalized to prevent
errors due to variations in the individual's speech pattern.
Another type of arrangement, such as disclosed in U.S. Pat. No. 3,700,815
issued Oct. 24, 1972 to G. R. Doddington, et al and assigned to the same
assignee, compares the characteristic way an individual utters a test
sentence with a previously stored utterance of the same sentence. The
comparison is restricted to a prescribed sentence and requires that the
two utterances be temporally aligned by time warping so that a valid
comparison may be made.
U.S. Pat. No. 4,032,711 issued on June 28, 1977 to M. R. Sambur and
assigned to the same assignee, discloses an arrangement in which each
utterance is filtered to obtain parameters that are highly indicative of
the individual but are independent of the content of the utterance.
Consequently, it is no longer required to compare utterances of the same
phrase for speaker recognition. The statistical parameters that are
utilized, however, are not useful for recognition of the contents of the
utterance.
U.S. Pat. No. 4,181,821, issued to Frank C. Pirz and Lawrence R. Rabiner on
Jan. 1, 1980 and assigned to the same assignee, discloses a word
recognition system in which speech patterns of many individuals are
clustered to derive a small number of templates for each word. The set of
templates is representative of the general population so that
utterances from a broad range of individuals can be recognized. The
linear prediction template parameters utilized for speaker-independent
recognition are adapted to recognize the information in speech patterns
applied thereto. In many applications, it is important to simultaneously
determine both the speaker and the utterance that is spoken. In telephone
credit card transactions, for example, identification of the speaker on
the basis of his voice characteristics assures that the transaction being
recognized by an automatic word analyzer is properly authorized. The
concurrent use of the same speaker independent speech parameters for word
recognition and speaker identification or verification improves the
service rendered and makes the speaker recognition more economical. It is
an object of the invention to provide improved speaker recognition in
combination with spoken word analysis systems.
BRIEF SUMMARY OF THE INVENTION
The invention is directed to a speaker recognition arrangement in which a
plurality of templates representative of utterances of a prescribed
reference word are stored. Jointly responsive to the utterances of each
reference word by an identified speaker and the stored templates for the
reference word, a set of signals characteristic of the identified speaker
are produced. An utterance of an unknown speaker is analyzed and the
utterance is identified as one or more reference words. Signals
characteristic of the unknown speaker are generated responsive to the
unknown speaker's utterance and the stored templates of the identified
reference words. The signals characteristic of the unknown speaker are
compared to the signals characteristic of the identified speakers for the
recognized reference words to select an identity for the unknown speaker.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 depicts a general block diagram of a speaker recognizer illustrative
of the invention;
FIG. 2 depicts a detailed block diagram of a speaker identification circuit
illustrative of the invention;
FIG. 3 shows a detailed block diagram of a minimum detector circuit useful
in the circuit of FIG. 2;
FIG. 4 shows a more detailed block diagram of the PLA arithmetic circuit of
FIG. 2;
FIG. 5 shows a detailed block diagram of a quantizer circuit useful in the
distance processor of FIG. 2;
FIG. 6 shows a more detailed block diagram of the distance processor of
FIG. 2;
FIG. 7 shows a more detailed block diagram of the memory address circuit of
FIG. 2;
FIG. 8 shows a more detailed block diagram of the controller of FIG. 2;
FIGS. 9-12 show flow diagrams illustrating the speaker identification
process of the invention;
FIG. 13 shows a more detailed block diagram of the threshold generator
circuit of FIG. 2;
FIG. 14 shows a detailed block diagram of an initial threshold generation
circuit that may be used in the circuit of FIG. 2; and
FIG. 15 shows a flow diagram illustrating the initial threshold generation
process for FIG. 14.
GENERAL DESCRIPTION
FIG. 1 shows a general block diagram of a speaker recognition arrangement
illustrative of the invention. Recognizer 105 is adapted to receive a
speech signal from electroacoustic transducer 101 and to identify the
speech signal as one or more words. Recognizer 105 may comprise the
recognition system disclosed in U.S. Pat. No. 4,181,821 issued to F. C.
Pirz and L. R. Rabiner Jan. 1, 1980 and assigned to the same assignee or
similar arrangement utilizing multiple templates for each reference word.
As described in U.S. Pat. No. 4,181,821, the feature signals of many
utterances of each reference word by a large number of speakers are
clustered into groups. A reference word template is generated for each
group. The multiple templates can then be utilized to recognize the
utterances of speakers from the general population by comparing the group
representative template feature signals to those of any speaker. During
the recognition process, a signal representative of the correspondence
between the features of each group representative template and the speaker
utterance features is generated for every reference word. Clustering
arrangements for word recognition are described in the article "Speaker
Independent Recognition of Isolated Words Using Clustering Techniques" by
L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon, IEEE
Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-27, No.
4, pp. 236-249, August, 1979.
Each recognition template in recognizer 105 is characteristic of a distinct
group of speakers with similar speech patterns for a word. We have found
that the distribution of the correspondence signals is consistent for
individual speakers and varies characteristically from speaker to speaker.
In accordance with the invention, the same speech correspondence signals
obtained from recognition of the content of the speech pattern are used
concurrently to recognize the speaker. In the recognition arrangement of
U.S. Pat. No. 4,181,821, the acoustic features are linear prediction
parameters and the correspondence signals represent the distances between
vectors generated from the linear prediction parameters on a frame
sequence basis. The utilization of linear prediction parameters in speech
recognition by distance processing is described in the article "Minimum
Prediction Residual Principle Applied to Speech Recognition" by F.
Itakura, IEEE Transactions on Acoustics, Speech and Signal Processing,
Vol. ASSP-23, pp. 57-72, February, 1975 and the article "Considerations in
Dynamic Time Warping for Discrete Word Recognition" by L. R. Rabiner, A.
E. Rosenberg, and S. E. Levinson, IEEE Transactions On Acoustics, Speech
and Signal Processing, Vol. ASSP-26, No. 6, pp. 575-582, December, 1978.
It is to be understood, however, that spectral, formant or other speech
parameters may be used.
Recognizer 105 provides a signal I which identifies the word corresponding
to the utterance and a set of signals d.sub.Ij representative of the
distance between the j.sup.th stored template vector (j=1,2, . . . J) for
the word I and the speech feature vector corresponding to the spoken
utterance. The J distance signals are supplied to distance signal
processor 110 which is operative to normalize and quantize the distance
signals. The normalization includes selecting the minimum distance signal
d.sub.Ijmin of the d.sub.I1, d.sub.I2, . . . , d.sub.Ij, . . . , d.sub.IJ
signals and forming a set of normalized signals
$$d'_{Ij} = d_{Ij} - d_{Ij\min} \qquad (1)$$
The resulting normalized signals are representative of the vector distances
with biases removed. The normalized signals d'.sub.Ij are then quantized
into approximately equally populated groups in accordance with
$$X_{Ij} = \begin{cases}
0, & 0 \le d'_{Ij} < 0.1 \\
1, & 0.1 \le d'_{Ij} < 0.2 \\
2, & 0.2 \le d'_{Ij} < 0.3 \\
3, & 0.3 \le d'_{Ij} < 0.4 \\
4, & 0.4 \le d'_{Ij}
\end{cases} \qquad (2)$$
The outputs of distance processor 110, X.sub.I1, X.sub.I2, . . . X.sub.IJ,
are representative of the correspondence between the speaker's utterance
of word I and the J group representative templates for the reference word
I stored in recognizer 105.
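By way of illustration only, the normalization of Equation 1 and the quantization of Equation 2 may be sketched in software as follows. The function names and the sample distance values are assumptions made for this sketch and are not part of the disclosed circuit.

```python
from typing import List


def normalize(distances: List[float]) -> List[float]:
    """Equation 1: subtract the minimum distance of the set from each distance."""
    d_min = min(distances)
    return [d - d_min for d in distances]


def quantize(d_norm: float) -> int:
    """Equation 2: map a normalized distance to one of five levels (0 through 4)."""
    if d_norm < 0:
        raise ValueError("a normalized distance cannot be negative")
    for level, upper in enumerate((0.1, 0.2, 0.3, 0.4)):
        if d_norm < upper:
            return level
    return 4


# Hypothetical distance signals d_I1 .. d_I5 for one recognized word I.
distances = [0.50, 0.55, 0.75, 0.98, 0.62]
levels = [quantize(d) for d in normalize(distances)]
print(levels)
```

Because only differences from the set minimum are retained, a constant bias added to every distance of a set leaves the quantized levels unchanged.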
Initially recognizer 105 is used in a training mode to generate reference
signals R.sub.Ijk characteristic of the speakers who will use the system.
Each identified speaker 1.ltoreq.k.ltoreq.K utters the reference words
into transducer 101. The d.sub.Ijk distance signals from the recognizer are
transformed by distance signal processor 110 into reference signals
R.sub.Ijk which reference signals are stored in identified speaker
characteristics store 120. Store 120 then contains a set of signals
R.sub.I1k, R.sub.I2k, . . . R.sub.IJk for each reference word I spoken by
speaker k. R.sub.Ijk signals for additional speakers may be added and the
R.sub.Ijk characteristic for any speaker may be deleted or revised at a
later time.
When the circuit of FIG. 1 is used to identify a speaker, the speaker's
utterance is recognized as a series of words I.sup.1, I.sup.2, . . .
I.sup.m, . . . I.sup.M. For each word I.sup.m, distance processor 110
transforms the d.sbsb.I.sub.m.sbsb.j signals from recognizer 105 into
quantized normalized signals T.sbsb.I.sub.m.sbsb.1, T.sbsb.I.sub.m.sbsb.2,
. . . T.sbsb.I.sub.m.sbsb.J. The output sequence from distance processor
110 is then inserted into input speaker characteristics store 130. The
reference signals for the first speaker (k=1) in identified speaker
characteristics store 120 are then retrieved and sequentially applied to
one input of comparison logic 140. Similarly, the input speaker signals in
store 130 are applied to the other input of comparison logic 140. Logic
circuit 140 is adapted to form the distance signal
##EQU1##
which is a measure of the correspondence between the unknown speaker's
characteristics and the first identified speaker's characteristics based
on the stored templates for word I.sup.m. The overall correspondence
signal
##EQU2##
for the first identified speaker is accumulated in arithmetic circuit 150
and stored in selector circuit 160 along with the speaker identification
signal k=1. The comparison process is then repeated to obtain overall
distance signal D.sub.s2 for identified speaker k=2. Signal D.sub.s2 is
compared to signal D.sub.s1 in selector 160 which stores the smaller
overall distance signal and the speaker identification signal
corresponding thereto. In general, comparison logic 140 forms a distance
signal
##EQU3##
for each speaker. The overall distance signal for speaker k
##EQU4##
is accumulated in circuit 150. The minimum of the D.sub.sk signals for
k=1,2, . . . K as well as the corresponding speaker identification signal
k are stored in selector 160 after the comparison operations for the last
speaker (K) are completed.
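The selection of the identified speaker having the minimum overall distance may be sketched as follows. The per-word distance used here, a sum of absolute differences between quantized levels, is an assumption made solely for illustration, since the equations forming the distance signals are not reproduced in this text. All names and sample values are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Reference store: (word I, speaker k) -> quantized levels R_I1k .. R_IJk.
ReferenceStore = Dict[Tuple[int, int], List[int]]


def word_distance(t: Sequence[int], r: Sequence[int]) -> int:
    """Assumed per-word distance: sum of absolute level differences."""
    return sum(abs(a - b) for a, b in zip(t, r))


def identify(words: Sequence[int],
             unknown: Sequence[Sequence[int]],
             references: ReferenceStore,
             speakers: Sequence[int]) -> Tuple[int, int]:
    """Return (k, D_sk) for the speaker with the smallest overall distance."""
    best_k, best_d = -1, -1
    for k in speakers:
        # Accumulate the distance over the recognized words I^1 .. I^M.
        d_sk = sum(word_distance(unknown[m], references[(word, k)])
                   for m, word in enumerate(words))
        # Keep only the smaller overall distance and its speaker identity.
        if best_k == -1 or d_sk < best_d:
            best_k, best_d = k, d_sk
    return best_k, best_d


words = [3, 7]                      # recognized word sequence
unknown = [[0, 2, 4], [1, 0, 3]]    # T signals, one list per recognized word
references = {
    (3, 1): [0, 2, 3], (7, 1): [1, 0, 4],
    (3, 2): [4, 0, 1], (7, 2): [3, 3, 0],
}
result = identify(words, unknown, references, speakers=[1, 2])
print(result)
```

Only the running minimum and its speaker index are retained between speakers, which mirrors the role described for selector 160.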
The circuit of FIG. 1 may also be modified to verify the identity asserted
by a speaker. For verification, only the asserted identity (k) locations
of identified speaker characteristics store 120 for the recognized word
series I.sup.1, I.sup.2, . . . I.sup.m are addressed after the input
speaker characteristics T.sbsb.I.sub.m.sbsb.1, T.sbsb.I.sub.m.sbsb.2, . .
. T.sbsb.I.sub.m.sbsb.J are inserted into input speaker characteristics
store 130. The overall distance signal D.sub.sk for speaker k is
accumulated in circuit 150. A verification threshold signal is produced in
threshold circuit 170 as is well known in the art. The D.sub.sk signal
from arithmetic circuit 150 is then compared to the verification threshold
signal TH in comparator 180. The verified identity signal is obtained from
comparator 180 only if D.sub.sk .ltoreq.TH.
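The verification decision may be sketched in the same manner. As before, the sum of absolute level differences is an assumed stand-in for the distance equations, and the threshold value shown is arbitrary.

```python
from typing import Sequence


def overall_distance(unknown: Sequence[Sequence[int]],
                     reference: Sequence[Sequence[int]]) -> int:
    """Accumulate an assumed per-word distance over all recognized words."""
    return sum(abs(t - r)
               for t_word, r_word in zip(unknown, reference)
               for t, r in zip(t_word, r_word))


def verify(unknown: Sequence[Sequence[int]],
           reference: Sequence[Sequence[int]],
           threshold: int) -> bool:
    """Accept the asserted identity only if D_sk <= TH."""
    return overall_distance(unknown, reference) <= threshold


unknown = [[0, 2, 4], [1, 0, 3]]    # T signals of the speaker to be verified
asserted = [[0, 2, 3], [1, 0, 4]]   # R signals stored for the asserted identity k
print(verify(unknown, asserted, threshold=3))
```

Unlike identification, only the reference signals of the asserted identity are read, so a single accumulation and a single comparison suffice.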
Speaker recognition threshold principles are described in the articles
"Evaluation of an Automatic Speaker Verification System Over Telephone Lines" by
A. E. Rosenberg, Bell System Technical Journal, Vol. 55, pp. 723-744,
July-August 1976 and "Speaker Recognition by Computer" by E. Bunge,
Philips Technical Review, Vol. 37, No. 8, pp. 207-219, 1977.
DETAILED DESCRIPTION
FIG. 2 shows a detailed block diagram of a speaker recognizer illustrative
of the invention. Word recognizer 205 includes utterance analyzer and
utterance feature signal store 208, reference word template store 206 and
feature signal processor 209. Template store 206 includes J templates for
each reference word in the recognition vocabulary. Each template is
representative of linear prediction acoustic features of utterances of the
reference word by a distinct group of speakers. The template is obtained
by clustering a large number of utterance feature signals from a general
population. The clustering provides a small number of templates that may
be used in speaker independent recognition. For purposes of illustration,
it is assumed that the reference word set is limited to the digits 0
through 9 and that 12 distinct group templates j=1,2, . . . , 12 are
stored for each digit.
Utterance analyzer 208 receives a speech signal from microphone 201 that
corresponds to a sequence of M (e.g., 9) digits. The analyzer converts
the speech signal into linear prediction acoustic features which are
stored therein. Feature signal recognizer 209 is adapted to compare the
feature signals of each successive word from analyzer 208 to the templates
from template store 206. For each reference word, the utterance features
are compared with the 1.ltoreq.j.ltoreq.12 templates. After all templates
of every reference word are compared to the utterance, feature signal
recognizer 209 provides a digit identification signal I. I corresponds to the
reference word having one or more group templates that closely match the
word feature signals of the input speaker.
When each reference word template set is processed, a set of distance
signals d.sub.i1, d.sub.i2, . . . d.sub.ij, . . . d.sub.i,12 are
generated. Signal d.sub.ij is representative of the overall correspondence
of the input word feature signals from analyzer 208 to the feature signals
of template j for reference word i. Signal d.sub.ij is the distance
between the vector of input word feature signals and the vector of the
j.sup.th template feature signals for word i as is well known in the art.
The recognized word identification signal I obtained from feature signal
recognizer 209 is placed in latch 212. The sequence of distance signals
d.sub.I1,d.sub.I2, . . . d.sub.Ij, . . . d.sub.I,12 for the recognized
digit I are sequentially supplied to distance signal processor 210 shown
in greater detail in FIG. 6. Processor 210 is operative to transform the
recognized word distance signals into a set of quantized normalized
signals X.sub.I1,X.sub.I2, . . . X.sub.Ij, . . . X.sub.I,12. Each signal
X.sub.Ij represents the correspondence between the utterance of the input
speaker and a distinct group template.
The speaker recognition circuit of FIG. 2 is operative in both a training
mode and an identification mode. During the training mode, the distance
signal processor receives the distance signals of several utterances of an
identified word I by an identified speaker and provides a set of
quantized normalized signals X.sub.Ij representative of the average
correspondence of the identified speaker's feature signals to the template
feature signals of word I. An acceptance threshold signal TH.sub.Ik is
also developed which is indicative of the acceptable variations of the
quantized normalized correspondence signals for word I spoken by
identified speaker k. In the identification mode, an unknown speaker's
distance signals for the identified word are normalized and quantized to
provide correspondence signals representative of his speech. The unknown
speaker's correspondence signals are stored and compared to the
correspondence signals of the identified speaker.
In order to provide a set of reference correspondence signals for
comparison with speakers to be recognized, the circuit of FIG. 2 is set to
its training mode in which each speaker repeats each reference word n,
e.g. five, times. The train mode is initiated by the generation of signal
TR in controller 290.
Each of controllers 803, 805, and 807 is a microcomputer such as described
in the article "Let a Bipolar Processor Do Your Control and Take Advantage
of Its High Speed" by Steven Y. Lau appearing in Electronic Design, 4,
Feb. 15, 1979 on pages 128-139. As is well known in the art, a controller
of this type produces a sequence of selected output signals responsive to
the states of the input signals applied thereto. Each control circuit
incorporates a read only memory containing a permanently stored
instruction set adapted to provide the control signal sequence therefrom.
The instructions for the controllers are shown in FORTRAN language in
Appendix A.
Referring to FIG. 8 which shows the controller in greater detail, input
device 801 provides signal TR responsive to a manual command. Device 801
may comprise a keyboard encoder or other arrangement. When the circuit of
FIG. 2 is placed in the train mode, signal TR identifying the mode and signal
k.sub.M identifying the speaker are produced. FIG. 9 shows a flow diagram
illustrating the training mode process. The TR signal initiates the
operation of train controller 803 which first produces signals GR, JSO,
and NSO. Signal GR presets the shift registers and latches of FIG. 2 to
their initial states as per box 901 in FIG. 9. Signal JSO resets counters
715 and 730 in FIG. 7 to their zero states as indicated in index setting
box 905 and signal NSO resets counter 501 in the quantizer circuit of FIG.
5 to its zero state (index setting box 910). Speaker identification signal
k.sub.r is set to k.sub.M by input device 801 (index box 915). Signal RW
is then produced by controller 803 to enable recognizer 205 in FIG. 2 as
per operation box 920.
As a result of the operation of word recognizer 205, five sets of distance
signals are sequentially supplied to distance signal processor 210 shown
in detail in FIG. 6 and the identified word I is placed in latch 212. Upon
completion of the recognition operation, recognizer 205 sends signal RE to
controller 803. Distance processor 210 is then adapted to normalize and
quantize the distance signals in accordance with Equations 1 and 2 by
controller 803. Referring to FIG. 6, the d.sub.ij distance signals are
supplied to one input of Adder 603 and the input of minimum detector 601.
Minimum detector 601 shown in detail in FIG. 3 is operative to select the
minimum distance signal of each set (d.sub.Ijmin) which minimum is applied
to Adder 607. Latch 609 is initially cleared to zero by control signal GR
and the combination of Adder 607 and latch 609 functions as an accumulator
which forms the signal
##EQU5##
representative of the sum of the five minimum distance signals responsive
to the succession of shift pulses IJ1 from controller 803 (operation box
925).
Referring to FIG. 3, latch 303 is preset to the largest possible code LPN
by control signal GR prior to the minimum detector operation. The input
signal is applied to the B input of comparator 302 via line 305. The
output of latch 303 is supplied to the A input of comparator 302. The B<A
output of the comparator is enabled only if the B input signal is smaller
in value than the A input signal from latch 303. AND-gate 301 provides an
enabling output on line 307 when the B<A output of comparator 302 is
enabled concurrently with each successive control signal IJ1 on line 308.
Responsive to the enabled output of gate 301, the signal on line 305 is
inserted into latch 303. After a sequence of input signals to comparator
302, the minimum valued input signal is stored in latch 303.
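The latch and comparator arrangement of FIG. 3 behaves as a running minimum. A minimal software sketch follows, in which the class name, the method names, and the use of floating-point infinity for the largest possible code LPN are assumptions made for illustration.

```python
LPN = float("inf")  # assumed stand-in for the largest possible code


class MinimumDetector:
    """Software analogue of latch 303, comparator 302 and AND-gate 301."""

    def __init__(self) -> None:
        self.latch = LPN

    def reset(self) -> None:
        """Control signal GR: preset the latch to the largest possible code."""
        self.latch = LPN

    def clock(self, b: float) -> float:
        """Control signal IJ1: load the input into the latch only if B < A."""
        if b < self.latch:
            self.latch = b
        return self.latch


detector = MinimumDetector()
for d in [0.62, 0.50, 0.75, 0.55]:
    detector.clock(d)
print(detector.latch)
```

Presetting the latch to the largest code guarantees that the first input is always loaded, so no separate first-sample case is needed.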
Register 605 comprises 12 stages, one for each successive distance signal
of a set. The shift register is initially cleared to zero by signal GR.
Adder 603 and shift register 605 function as an accumulator for each of
the 12 distance signals of the sets. Responsive to the first set of
distance signals shift register 605 contains the succession of signals
d.sub.I11, d.sub.I12, . . . , d.sub.I1,12. Each successive distance signal
set is then added to the sums for the previous sets in register 605. After
the fifth set is applied to Adder 603, shift register 605 contains the set
of signals
##EQU6##
The summing operation is indicated in operation box 925.
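The two accumulations described above, one per template position in shift register 605 and one for the set minima in latch 609, may be sketched as follows. The function name and the sample values (three templates and two repetitions rather than twelve and five) are assumptions chosen to keep the example short.

```python
from typing import List, Sequence, Tuple


def accumulate(sets: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Sum the distance sets position by position and sum the set minima."""
    register = [0.0] * len(sets[0])   # shift register 605, cleared by GR
    latch = 0.0                       # latch 609, cleared by GR
    for distance_set in sets:
        latch += min(distance_set)    # adder 607 with latch 609
        register = [acc + d for acc, d in zip(register, distance_set)]  # adder 603
    return register, latch


# Two hypothetical repetitions of one word, three templates each.
sets = [[1.0, 0.5, 2.0],
        [0.25, 1.5, 1.0]]
summed, summed_min = accumulate(sets)
print(summed, summed_min)
```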
When each distance signal set is processed, counter 501 is incremented by
control signal IN1 as per index box 930. Subsequent to the formation of
the 12 summed signals of Equation 7 in decision box 935, signals HN, HI,
and HA are obtained from controller 803 and the d.sub.Ij distance signals
from word recognizer 205 are supplied to threshold signal generator 215
shown in detail in FIG. 13 (operation box 940). After the threshold signal
TH.sub.Ik is formed, signal JSO resets counter 715 (operation box 942) and
the threshold signal is inserted into store 220 (operation box 945). The
threshold signal generator develops a threshold signal TH.sub.Ik
representative of the range of distance signals for valid identifications.
The threshold range signal is a function of the statistical distribution
of the distance signals from recognizer 205 or may be precalculated and
stored in initial threshold store 1310.
The summed minimum signal
##EQU7##
in latch 609 is subtracted from each successive output of shift register
605 in subtractor 611. The output of subtractor 611 is proportional to the
normalized distance signals d'.sub.ij of Equation 1. The 12 successive
outputs of subtractor 611 are then applied to the input of quantizer 615
to form the X.sub.Ij signals as indicated in the loop including operation
box 953 and 955, index box 960 and decision box 965. Quantizer 615 is
shown in detail in FIG. 5.
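The proportionality noted above may be seen from a short worked identity. The superscript (n) indexing the N repetitions of the word is notation assumed here for illustration and is not taken from the equations of the specification.

```latex
\sum_{n=1}^{N} d_{Ij}^{(n)} - \sum_{n=1}^{N} d_{Ij\min}^{(n)}
  = \sum_{n=1}^{N} \left( d_{Ij}^{(n)} - d_{Ij\min}^{(n)} \right)
  = \sum_{n=1}^{N} d_{Ij}^{\prime (n)}
  = N \, \overline{d'_{Ij}}
```

With N=5, the output of subtractor 611 is therefore five times the average, over the five repetitions, of the normalized distance signal of Equation 1.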
Referring to FIG. 5, each normalized summed signal is supplied to the inputs
of comparators 507, 517, 527, 537, and 547. Counter 501 was incremented for
each set of distance signals by signal IN1 (index box 930) and its output
is five, corresponding to the five repetitions of the reference word I. The
"five" signal is transferred to the inputs of multipliers 505, 515, 525,
and 535. The outputs of multipliers 505, 515, 525, and 535 are 2.0, 1.5,
1.0, and 0.5 respectively. As a result of the operation of comparators
507, 517, 527, 537, and 547, a five bit coded signal X.sub.ij is obtained
from the outputs of OR-gates 509, 519, 529, 539, and 549 (operation box
953). In this way each signal
##EQU8##
is classified into one of five groups. If signal
##EQU9##
is greater than or equal to 2.0, the greater than or equal output of
comparator 507 is enabled and X.sub.Ij =10000. For the signal on line 560
greater than 1.5 and less than 2.0, the less than output of comparator 507
and the greater than output of comparator 517 are enabled whereby X.sub.Ij
is set at 01000. The same X.sub.Ij code is obtained if the signal on line
560 equals 1.5 since the equal output of comparator 517 is enabled.
Similarly, X.sub.Ij is 00100 if the signal on line 560 is equal to or
greater than 1.0 but less than 1.5. X.sub.Ij is 00010 for the signal on
line 560 equal to or greater than 0.5 but less than 1.0. X.sub.Ij is 00001
when the signal on line 560 is equal to or greater than 0.0 but less than
0.5. The sequence of signals X.sub.I1, X.sub.I2, . . . X.sub.Ij . . .
X.sub.I,12 from the quantizer of FIG. 5 represents the correspondence
between the k.sub.M identified speaker's utterance and the 12 templates
for the identified word I stored in reference word template store 206.
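The classification performed by the quantizer of FIG. 5 may be sketched as follows. The multiplier constants 0.4, 0.3, 0.2 and 0.1 are inferred from the stated multiplier outputs for a count of five, and the function name is an assumption made for this sketch.

```python
def quantize_one_hot(signal: float, n: int) -> str:
    """Return the five bit code for a summed normalized signal and count n."""
    if signal < 0 or n < 1:
        raise ValueError("signal must be non-negative and n at least 1")
    # Multiplier outputs 0.4n, 0.3n, 0.2n and 0.1n, written as k*n/10.
    thresholds = [k * n / 10 for k in (4, 3, 2, 1)]
    codes = ["10000", "01000", "00100", "00010"]
    for threshold, code in zip(thresholds, codes):
        if signal >= threshold:   # greater-than-or-equal output enabled
            return code
    return "00001"                # 0.0 <= signal < 0.1n


# Training mode, n = 5: thresholds are 2.0, 1.5, 1.0 and 0.5.
training_codes = [quantize_one_hot(s, 5) for s in (2.3, 1.5, 1.2, 0.7, 0.2)]
print(training_codes)
```

Scaling the thresholds by the count lets the same comparators serve a sum of five sets during training and a single set during identification, where n=1 yields thresholds of 0.4, 0.3, 0.2 and 0.1.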
Identified speaker correspondence signal store 220 is adapted to store the
X.sub.Ij and TH.sub.Ik outputs of distance signal processor 210 and
threshold signal generator 215 for every identified word and every
speaker. Memory address generator 280 shown in detail in FIG. 7 supplies
the address signals needed to store the X.sub.Ij correspondence signals
and the TH.sub.Ik threshold signal obtained from the utterance of each
identified word by a speaker k.
Referring to FIG. 7, store 220 is addressed by the k.sub.r output of
selector 705, the I.sub.r output of selector 710 and the j.sub.r output of
counter 715. In the training mode, a path is established between the
k.sub.M input and the output of selector 705 responsive to signal TR.
Signal TR causes the I input of selector 710 to be connected to its
I.sub.r output. Thus, signal I corresponding to the identified word is
supplied to one address input of store 220. Signal k.sub.M corresponding
to the speaker identity is supplied to another address input of store 220.
When counter 715 is in its zero state after being cleared by control signal
JSO (operation box 942), the TH.sub.Ik signal from threshold signal
generator 215 is inserted into the I,k.sub.M, j.sub.r =0 location of store
220 by write pulse WTR from training control 803 (operation box 945).
Counter 715 is successively incremented by signal IJ1 (operation box 950).
The X.sub.Ij outputs of distance signal processor 210 are then inserted
into store 220 by signals IJ1 from controller 803. The distance signals
are thereby successively inserted into the I,k.sub.M locations of store
220 by the write pulses WTR. The insertion of X.sub.Ij signals follows the
loop including operation boxes 953 and 955, index box 960, and decision
box 965.
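The addressing of store 220 by word, speaker and template index, with location j.sub.r =0 holding the threshold signal, may be sketched as follows. The dictionary-based class, its method names, and the sample threshold are assumptions made for illustration and do not describe the memory actually used.

```python
from typing import Dict, List, Tuple


class IdentifiedSpeakerStore:
    """Software analogue of store 220, addressed by (I_r, k_r, j_r)."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[int, int, int], object] = {}

    def write_word(self, word: int, speaker: int,
                   threshold: float, codes: List[str]) -> None:
        """Write TH_Ik at j_r = 0, then X_I1 .. X_IJ at j_r = 1 .. J."""
        self._cells[(word, speaker, 0)] = threshold
        for j, code in enumerate(codes, start=1):
            self._cells[(word, speaker, j)] = code

    def read_threshold(self, word: int, speaker: int) -> object:
        return self._cells[(word, speaker, 0)]

    def read_code(self, word: int, speaker: int, j: int) -> object:
        return self._cells[(word, speaker, j)]


store = IdentifiedSpeakerStore()
store.write_word(word=3, speaker=1, threshold=6.0,
                 codes=["00001", "00100", "10000"])
print(store.read_threshold(3, 1), store.read_code(3, 1, 2))
```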
Upon termination of the storage of correspondence signals for identified
word I of speaker k.sub.M, the circuit of FIG. 2 is reset to its initial
state by signal ETR from controller 803 so that correspondence signals for
additional words can be obtained from the same speaker or from other
speakers of the identified speaker set. The training is completed when
store 220 contains a set of correspondence signals and a threshold signal
for every reference word spoken by each identified speaker.
The recognizer of FIG. 2 may be switched into its identification mode after
a sufficient number of identified speaker correspondence and threshold
signals have been placed in store 220. In the identification mode, an
unknown speaker utters a sequence of reference words such as a personal
identification number.
The identification mode is started by the generation of signal ID in input
device 801 of FIG. 8. When the circuit of FIG. 2 is placed in the
identification mode, signal ID initiates the operation of identification
signal storage controller 805. Controller 805 first produces signals GR,
MS1, JSO, KS1, and NS1. Signal GR presets the registers and latches of
FIG. 2 to their initial states as per box 1001 in the flow diagram of FIG.
10. Control signal MS1 sets counter 720 to its m.sub.r =1 state. Control
signal JSO sets counters 715 and 730 to their zero states. Control signal
KS1 sets counter 701 to its k.sub.r =1 state and control signal NS1 sets
counter 501 to its n=1 state. Signal RW is then applied to recognizer 205
to initiate the recognition of the utterance of the unknown speaker.
Responsive to the speech signal of the unknown speaker from microphone
201, utterance analyzer 208 generates and stores the feature signals for
the successive digits. Each successive digit is recognized in feature
signal recognizer 209 which recognizer provides recognized word
identification signals I.sup.1,I.sup.2, . . . I.sup.M and a set of
distance signals d.sub.Ij representative of the distance between the
reference word templates for recognized words I and the feature signals of
the unknown speaker as per operation box 1004. The single set of distance
signals d.sub.Ij for each reference word is supplied to the inputs of
distance signal processor 210 and threshold signal generator 215.
In the distance signal processor, minimum detector 601 is operative to
determine the d.sub.Ijmin code from the single set of 12 distance signals
as described with respect to the training mode (operation box 1005). The
d.sub.Ijmin code is placed in latch 609. Shift register 605 is initially
cleared by control signal GR and the succession of distance signals
d.sub.I1,d.sub.I2, . . . d.sub.Ij, . . . d.sub.I,12 is transferred into
the shift register via Adder 603. As each successive d.sub.Ij signal
appears at the output of the shift register, subtractor 611 is operative
to form the difference signal of Equation 1 (operation box 1010).
The normalized distance signals d'.sub.Ij for the unknown speaker from
subtractor 611 are successively supplied to quantizer 615 in which the
X.sub.Ij correspondence signals are formed (operation box 1022). Referring
to FIG. 5, counter 501 was placed in its first state responsive to control
signal NS1. Consequently, the outputs of multipliers 505, 515, 525, and
535 are 0.4, 0.3, 0.2 and 0.1, respectively. Comparators 507, 517, 527,
537, and 547 are operative to form an X.sub.Ij code for each normalized
distance signal applied thereto. As aforementioned with respect to the
training mode, each successive normalized distance signal is assigned to a
group for which there is a unique quantized code X.sub.Ij.
The X.sub.Ij correspondence signals from distance processor 210 relating
the unknown speaker's features to the reference templates for the
identified word I are supplied to input speaker correspondence store 230
together with the word identification signal I.
Store 230 is addressed by signals m.su