WikiPatents - Community Patent Review
Multiple parameter speaker recognition system and methods    
United States Patent 4837830
Link to this page: http://www.wikipatents.com/4837830.html
Inventor(s): Wrench, Jr.; Edwin H. (San Diego, CA); Wohlford; Robert (Germantown, MD); Naylor; Joe (San Diego, CA)
Abstract: An apparatus operates to identify the speech signal of an unknown speaker as one of a finite number of speakers. Each speaker is modeled and recognized with any example of their speech. The input to the system is analog speech and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system. The system includes front end processing means which is responsive to the speech signal to provide digitized samples of the speech signal at an output which are stored in a memory. The stored digitized samples are then retrieved and divided into frames. The frames are processed to provide a series of speech parameters indicative of the nature of the speech content in each of the frames. The processor for producing the speech parameters is coupled to either a speaker modeling means, whereby a model for each speaker is provided and consequently stored, or a speaker recognition mode, whereby the speech parameters are again processed with current parameters and compared with the stored parameters during each speech frame. The comparison is accomplished over a predetermined number of frames whereby a favorable comparison is indicative of a known speaker for which a model is stored.
   














Multiple parameter speaker recognition system and methods
Inventor     Wrench, Jr.; Edwin H. (San Diego, CA); Wohlford; Robert (Germantown, MD); Naylor; Joe (San Diego, CA)
Owner/Assignee     ITT Defense Communications, A Division of ITT Corporation (Nutley, NJ)
Publication Date     June 6, 1989
Application Number     07/003,971
Filing Date     January 16, 1987
Examiner     Shoop Jr.; William M.
Assistant Examiner     Young; Brian
Attorney/Law Firm     Walsh; Robert A. Twomey; Thomas N. Werner; Mary C.
U.S. References

Patent No.   Inventor   U.S. Class   Date
4718093      Brown      704/243      Jan. 1988
4624008      Vensko     704/253      Nov. 1986
4405838      Nitta      704/254      Sep. 1983
4092493      Rabiner    704/237      May 1978
4032711      Sambur     704/246      Jun. 1977
Claims
 


What is claimed is:

1. Speaker recognition apparatus for identifying a speaker by identifying the speech signal of an unknown speaker as one of a finite number of speakers comprising:

front end processing means responsive to said speech signal to provide digitized samples of said speech at an output, said front end processing means including lowpass filter means responsive to said speech signal to limit the band width thereof to about 3 KHz at an output of said filter means,

storage means coupled to said processing means and having a first plurality of storage locations for storing said digitized samples,

means included in said front end processing means and coupled to said storage means and responsive to said stored digitized samples for dividing said samples into frames, each frame containing a given number of samples,

signal processing means included in said front end processing means and coupled to said logic means and responsive to said samples in said frames to provide at an output a series of speech parameters indicative of the nature of said speech content in each of said frames and including means for determining which of said frames contain speech by providing a smoothed histogram of the input energy in each of said frames to determine which of said frames contain speech according to said input energy,

speaker modeling means coupled to said output of said signal processing means in a first selectable mode and operative to provide a model of speech characteristics for said speaker in said first mode, said signal modeling means including processor means responsive to said speech parameters within each speech frame to provide a covariance matrix indicative of said speech parameters and coupled to said storage means to store at a second plurality of locations said matrix to employ said matrix as a model during a second selectable mode of operation,

speaker recognition means coupled to the output of said signal processing means in a second selectable mode operative to identify the speaker from the model which has been stored in said first mode and responsive to said parameters including comparison means for comparing the average current parameter with said stored speaker models during said speech frames as provided by said front end processing means, over a predetermined number of frames whereby a favorable comparison is indicative of a known speaker for which a model is stored, and

means coupled to said output of said signal processing means for selecting either said first or second modes.

2. The speaker recognition system according to claim 1, further including analog-to-digital converter means having an input coupled to said low pass filter for providing at an output said digitized samples.

3. The speaker recognition system according to claim 2, wherein said samples are digitized at 8K samples per second with 16 bits per sample.

4. The speaker recognition system according to claim 3, wherein said storage means is a disk storage for storing said digitized samples.

5. The speaker recognition system according to claim 3, wherein said logic means includes means for reading said stored samples from said disk to provide a frame for a given number of stored samples and including Hamming window means providing a given number of samples for each frame.

6. The speaker recognition apparatus according to claim 1, wherein said signal processing means includes auto-correlation means responsive to said samples in said frames to provide a multi-point FAST FOURIER TRANSFORM (FFT) for each frame, including means for multiplying said FFT with a given transfer function to provide a power spectrum at the output and means responsive to said power spectrum to provide an inverse FFT, indicative of auto-correlation coefficients, a linear predictive code analyzer(LPC) means responsive to said auto-correlation coefficients for providing a first given number of said speech parameters indicative of reflection coefficients and a second given number of said speech parameters indicative of cepstral coefficients.

7. The speaker recognition apparatus according to claim 6, wherein said linear predictive code analyzer includes means for implementing an algorithm to provide ten reflection coefficients with means for recursively deriving each cepstral coefficient from said derived reflection coefficients.

8. Speaker recognition apparatus according to claim 1, wherein said comparison means includes means for calculating the Mahalanobis distance from said parameters and said stored parameters and to output a given number of low distances indicative of a speaker model as stored.

9. Speaker recognition apparatus according to claim 1, further including digital-to-analog converter means coupled to said front end processing means and operative to convert a digital speech signal to an analog speech signal for application as a speech signal to said processing means to enable processing of the same in either said first or second modes.

10. Speaker recognition apparatus according to claim 1, wherein said storage means, said logic means, and said signal processing means are coupled via a main processor bus.

11. A method of providing a model of the speech signal of a user to enable said model to be used subsequently to identify said speaker via said speaker's speech signal, comprising steps of:

digitizing said speech signal to provide at an output a plurality of digitized samples of said signal,

storing said digitized samples,

selecting a series of frames of said samples as stored,

computing auto-correlation coefficients for said samples in each of said frames, including

providing a multi point fast fourier transform (FFT) from said samples in each of said frames,

multiplying said FFT by a subband filter spectrum,

calculating a power spectrum from said multiplied subband filter spectrum,

providing an inverse FFT from said calculated power spectrum,

deriving linear predictive code reflection coefficients from said auto-correlation samples,

recursively deriving cepstral coefficients from said reflection coefficients,

calculating a covariance matrix from said reflection and cepstral coefficients, and

storing said matrix as a model of said speaker.

12. The method according to claim 11, wherein the step of digitizing said speech includes the steps of:

first passing said speech through a lowpass filter,

then applying said passed speech to an analog-to-digital converter to obtain digitized samples.

13. The method according to claim 11, wherein the step of storing said digitized samples includes storing said samples on a disk memory.

14. The method according to claim 11, wherein the step of deriving said linear predictive code reflection coefficients includes providing ten coefficients using an algorithm for linear predictive coding.

15. The method according to claim 11, further including the step of:

detecting the energy content of each of said frames as stored to determine speech frames by providing a frame energy histogram for each frame.

16. The method according to claim 11, further including the step of:

calculating the Mahalanobis distance between said speech parameters and each of said matrixes as stored to determine the identity of a speaker from said distance and according to said model as stored,

providing an output when said calculated distance is a lowest value for one of said matrixes as stored.

17. The method according to claim 16, further including the step of:

converting said distance to a speaker confidence level calculated according to said distance and having a value greater than 0.7.

18. The method according to claim 11, wherein said multi-point FFT is a 512 point zero filled FFT as calculated for each frame.
Description
 


BACKGROUND OF THE INVENTION

This invention relates to a speaker recognition system and more particularly to a system which is capable of identifying an unknown talker or speaker as being one of a finite number of speakers.

As one will understand, the art of speech recognition in general has developed considerably within the last few years, and speech recognition systems have been employed in many forms. The concept of speech recognition rests on the fact that the information contained in the spoken sound can be utilized directly to activate a computer or other means.

Essentially, the prior art understood that a key element in recognizing information in a spoken sound is the distribution of the energy with frequency. The formant frequencies, which are those at which the energy peaks, are particularly important. The formant frequencies are the acoustic resonances of the mouth cavity and are controlled by the tongue, jaw and lips. For a human listener the determination of the first two or three formant frequencies is usually enough to characterize the sound. In this manner machine recognizers of the prior art included some means of determining the amplitude spectrum of the incoming speech signal. This first step in speech recognition is referred to as preprocessing, as it transforms the speech signal into features or parameters that are recognizable and reduces the data flow to manageable proportions.

One means of accomplishing this is the measurement of the zero crossing rate of the signal in several broad frequency bands to give an estimate of the formant frequencies in these bands. Another means is representing the speech signal in terms of the parameters of the filter whose spectrum best fits that of the input speech signal. This technique is known as linear predictive coding (LPC). LPC has gained popularity because of its efficiency, accuracy and simplicity. The recognition features extracted from speech are typically averaged over 10 to 20 milliseconds and then sampled 50 to 100 times per second.

At this point, the data is digitized and the ensuing recognition steps are performed by a programmable digital processor. There are many problems associated with recognizing the information content of speech, and the general problem of speech recognition has been described in many articles and patents. Apart from the problem of recognizing speech in general, another major concern is to recognize or verify a speaker. Speaker recognition is a generic term which refers to a system which discriminates between speakers according to their voice characteristics. Speaker recognition can involve speaker identification or speaker verification. Speaker identification refers to a system which can classify an unlabeled voice as belonging to one of a set of N reference speakers. Speaker verification implies the determination that an unlabeled voice belongs to a specific reference speaker. For a description of both speaker recognition systems and speech recognition systems, reference is made to the November, 1985 issue of the Proceedings of the I.E.E.E., Volume 73, No. 11, pages 1537-1696, and in particular to an article entitled "Speaker Recognition-Identifying People By Their Voices", by G. R. Doddington. See also Linear Prediction of Speech, Springer-Verlag (1976), by J. D. Markel and A. H. Gray for additional background. In this respect a system which can identify unknown speakers in real time using a small sample of their speech has great applicability.

Essentially, the applicability or usefulness of such a system should be apparent in regard to military systems whereby only authorized or identified speakers would be allowed to communicate with certain other authorized or identified individuals. In such a system an operator will be able to specify those speakers who are of interest at a particular time. Such a system could then route to the operator only speech that it identifies as spoken by specified talkers.

Such systems may also be used in security applications, such as recognizing certain individuals' voices to grant access to premises, identification and so on. Essentially, as one can ascertain, any such system will, prior to executing a recognition task, have to obtain samples of the speech from each of the talkers that may later be recognized.

A major requirement for any such system is that it shall correctly identify speakers whose training data has been preprocessed, using only a small percentage of time to accomplish such recognition. Thus there is application for speaker recognition in many different systems that attempt to identify the users of the system by their voices. In certain applications a system which can identify particular speakers would identify the speakers currently using a communications channel and then selectively route speech from selected authorized talkers to the user.

In this manner the system will serve to automatically identify and recognize individual speakers and therefore, under certain considerations, to indicate either that the speaker is authorized to use a certain communication channel or that the speaker is one whose presence in a conference or conversation is authorized. Hence, as one can ascertain, there are many uses for speaker recognition systems. As one can also ascertain, individual speaker recognition is a substantial problem, and while there have been many attempts to solve it in the prior art, none of these attempts has been successful, in that such systems have been extremely complicated and have exhibited low accuracy.

It is therefore an object of the present invention to provide an improved multiple parameter speaker recognition system which system exhibits a high accuracy and which system is capable of identifying any one of a plurality of finite authorized speakers to thereby afford speaker recognition to authorized system users.

A further object of this invention is to provide apparatus and methods used to identify an unknown talker as one of a finite number of speakers. The apparatus and methods allow the speaker to be modeled and recognized with any examples of their speech as the speakers do not have to repeat a particular phrase in order to achieve recognition.

Hence a further object of the present invention is to therefore provide a text independent speaker recognition system.

BRIEF DESCRIPTION OF THE PREFERRED EMBODIMENT

Speaker recognition apparatus for identifying the speech signal of an unknown speaker as one of a finite number of speakers to thereafter enable the identification of said speaker comprising

front end processing means responsive to said speech signal to provide digitized samples of said speech at an output,

storage means coupled to said processing means and having a first plurality of storage locations for storing said digitized samples,

logic means included in said front end processing means and coupled to said storage means and responsive to said stored digitized samples to divide said samples into frames each frame containing a given number of samples,

signal processing means included in said front end processing means and coupled to said logic means and responsive to said samples in said frames to provide at an output a series of speech parameters indicative of the nature of said speech content in each of said frames and including means for determining which of said frames contain speech by providing a smoothed histogram of the input energy in each of said frames to determine which of said frames contain speech according to said input energy,

speaker modeling means coupled to said output of said signal processing means in a first selectable mode and operative to provide a model for said speaker in said first mode, said signal modeling means including processor means responsive to said speech parameters within each speech frame to provide a covariance matrix indicative of said speech parameters and coupled to said storage means to store at a second plurality of locations said matrix to employ said matrix as a model during a second selectable mode of operation,

speaker recognition means coupled to the output of said signal processing means in a second selectable mode operative to identify a speaker whose model has been stored in said first mode and responsive to said parameters including comparison means for comparing the average current parameter with said stored speaker models during said speech frames as provided by said front end processing means, over a predetermined number of frames whereby a favorable comparison is indicative of a known speaker for which a model is stored, and

means coupled to said output of said signal processing means for selecting either said first or second modes.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a simple block diagram depicting a multiple parameter speaker recognition system according to this invention.

FIG. 2 is a simplified block diagram showing a front end processing circuit according to this invention.

FIG. 3 is a block diagram depicting an auto-correlation module employed in FIG. 2.

FIG. 4 is a diagram depicting the determination of a speech energy threshold from a smoothed frame energy histogram.

FIG. 5 is a detailed block diagram showing the speaker authentication system according to this invention.

FIG. 6 is a block diagram in flow form depicting the storing of digitized speech on a disk.

FIG. 7 is a block diagram in flow chart form depicting initialization of an analog-to-digital converter by an I/O controller.

FIG. 8 is a block diagram showing the initialization of an analog-to-digital converter clock by the I/O controller.

FIG. 9 is a flow chart showing the transfer of analog-to-digital data to a disk by the I/O controller.

FIG. 10 is a flow chart depicting the playback of digitized speech from a disk.

FIG. 11 is a flow chart depicting the initialization of a digital-to-analog converter by the I/O controller.

FIG. 12 is a flow chart depicting the transfer of disk data to the digital-to-analog converters by the I/O controller.

FIG. 13 is a flow chart depicting the processes required for recognition from live speech.

FIG. 14 is a flow chart depicting the I/O processor control for speaker recognition.

FIG. 15 is a flow chart depicting signal processor operation during speaker recognition.

FIG. 16 is a flow chart in block diagram form showing speaker recognition from external digital data.

FIG. 17 is a flow chart depicting model generation in order to enable the system to implement speaker recognition.

FIG. 18 is a flow chart depicting signal processor operation during model generation.

FIG. 19 is a flow chart showing the storing of speech data on a disk from an external source.

FIG. 20 is a block diagram of the digital-to-analog converter apparatus employed in this invention.

FIG. 21 is a block diagram of the analog-to-digital converter employed in this invention.

FIG. 22 is a block diagram depicting an analog conditioning board according to this invention.

DETAILED DESCRIPTION OF THE FIGURES

Referring to FIG. 1, there is shown the three main components which are necessary in implementing a speaker recognition system according to this invention.

As one can ascertain from FIG. 1, analog speech is directed to a front end processing circuit 10 whereby the speech, as will be explained, is processed according to particular algorithms which serve to determine or recognize speech. As seen schematically in FIG. 1, the output of the front end processing unit 10 is coupled to a switch 11. The switch 11 is capable of being positioned in a first position or mode designated as MODEL or switched to a second position or mode designated as RECOGNIZE. As one will ascertain, in the MODEL position processed output speech from processor 10 is directed to a speaker modeling system 12, which functions to provide various characteristics or a model associated with a particular speaker and to store the model in memory for further utilization by the system. The system also contains a speaker recognition module 14 whereby, when the output from the front end processing unit 10 is coupled to the RECOGNIZE input, the system operates to determine a speaker's identity.

The first step to be performed by the front end processing circuit 10 employed in the speaker recognition system is to digitize the input analog speech and to produce frames of speech parameters. Essentially, this function is performed by the front end processing unit 10 of FIG. 1.

Referring to FIG. 2, there is shown a more detailed block diagram of a typical front end processing unit which, as one will ascertain, will be defined in greater detail in the specification. Essentially, analog speech is applied to the input of a lowpass filter 15 having an upper frequency cutoff of 3.3 kHz. The lowpass-filtered analog speech is then digitized at 8,000 samples per second with 16 bits per sample (consistent with claim 3). Analog-to-digital converters which can convert the analog speech into digital samples in this way are known in the art. Once digitized, the digital data is then stored on a disk storage 16 for use by either the speaker modeling unit 12 or the speaker recognition unit 14, as will be explained. For either model generation or recognition, previously digitized speech samples are read from the disk 16 and processed to produce frames of speech parameters.

In order to do this, the speech samples from the disk are divided into predetermined frames with a new frame starting every 100 samples. Each frame consists of 200 samples and is subjected to a Hamming window operation as evidenced by module 17. As one will ascertain, the Hamming window approach is a well known technique which is utilized in speech recognition systems in general, as can be seen from the above-noted references in the I.E.E.E. publication.

Essentially, the function of the Hamming window is to take frames of speech and to provide smooth transitions. Since speech samples in a frame are indicative of a short interval, the Hamming window serves to multiply the speech data to achieve smooth rise and fall times. This is typically done by the use of a sine wave or other smooth transition waveform to enable one to obtain a smooth transition at the start and the end of a given length speech sample. The Hamming window technique, as will be further explained, is utilized in conjunction with a fast Fourier transform (FFT) technique as well as in conjunction with a linear predictive coding (LPC) algorithm, all of which are well known to those skilled in the art.
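The framing and windowing step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the frame length (200 samples) and frame step (100 samples) come from the text, while the function names and the use of the standard Hamming coefficients 0.54 and 0.46 are assumptions, since the specification does not spell out the window formula.

```python
import math

FRAME_LEN = 200   # samples per frame, from the text
FRAME_STEP = 100  # a new frame starts every 100 samples, from the text

def hamming(n_points):
    """Standard Hamming window (assumed form): 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def windowed_frames(samples, frame_len=FRAME_LEN, step=FRAME_STEP):
    """Split samples into overlapping frames and taper each frame with the window."""
    window = hamming(frame_len)
    frames = []
    # Only full frames are produced; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

With these values, consecutive frames overlap by half their length, so each sample (apart from those at the very ends) contributes to two frames.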

Thus a speech sample which is subjected to the Hamming window process is then applied to an auto-correlation module 18.

The function of the auto-correlation module 18 is shown in more detail in FIG. 3. In any event, the output from the auto-correlation module is used to derive speech parameters for each input frame. Essentially, the output from the auto-correlation circuit 18 is directed to an LPC analyzer module 20. The function of the LPC analyzer is to perform a linear predictive coding analysis on the coefficients from the auto-correlation circuit. In this manner the LPC circuit 20 operates according to a given algorithm which may utilize a 10th order LPC analysis. Thus the LPC analyzer 20 produces 10 reflection coefficients at one output. The output of the LPC analyzer 20 is also directed to a cepstral analyzer 21 which essentially provides 10 cepstral coefficients which are derived from the reflection coefficients. These techniques are well known in the art.

Referring to FIG. 3, there is shown a more detailed block diagram of the function of the auto-correlation circuit 18 of FIG. 2. As shown in FIG. 3, the windowed speech samples are applied to a 512-point zero-filled fast Fourier transform (FFT) analyzer 30 where the FFT for each frame is calculated. The resulting spectrum obtained from the analyzer 30 is multiplied in a multiplier 31 by the transfer function of a stored subband filter. This transfer function, which is stored in module 32, is used to eliminate out-of-band components of the spectrum and, as indicated in the Figure, covers the frequencies from 350 to 2,800 Hz. The power spectrum is then derived from the complex spectrum obtained from the multiplier 31: the magnitude of the spectrum is squared in a circuit 33, and an inverse FFT is then generated in module 34 to provide the auto-correlation coefficients. It is these coefficients which are sent to the LPC analyzer 20 to determine the reflection coefficients as well as the cepstral coefficients as explained.
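As a rough illustration of the FIG. 3 signal path (zero-filled FFT, subband filter, squared magnitude, inverse FFT), the following Python sketch computes auto-correlation coefficients for one windowed frame. Several details are assumptions rather than anything stated in the specification: the stored subband transfer function is modeled as an ideal pass band with unit gain between 350 and 2,800 Hz, the function names are hypothetical, and 11 lags (r0 through r10) are returned because a 10th order LPC analysis needs them.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def autocorrelation(frame, n_fft=512, fs=8000.0, band=(350.0, 2800.0), n_lags=11):
    """Auto-correlation of a band-limited frame via the power spectrum."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))  # zero fill to n_fft points
    spectrum = fft(padded)
    power = []
    for k, value in enumerate(spectrum):
        # Frequency of bin k, folding the upper half of the FFT onto 0..fs/2.
        freq = min(k, n_fft - k) * fs / n_fft
        gain = 1.0 if band[0] <= freq <= band[1] else 0.0  # ideal pass band (assumption)
        power.append(abs(value * gain) ** 2)
    lags = fft(power, inverse=True)
    # Divide by n_fft to complete the inverse transform; the result is real.
    return [lags[m].real / n_fft for m in range(n_lags)]
```

Because the frame holds 200 samples and the transform has 512 points, the zero fill keeps the circular correlation from wrapping around for the low-order lags used here.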

As one can ascertain, the techniques of producing fast Fourier transforms in regard to windowed speech samples are also known as well as the technique for producing the inverse FFT. Essentially, the next step in regard to the processing technique is to derive speech parameters for each input speech frame. In order to accomplish this, one utilizes an algorithm. In this case a 10th order LPC analysis is implemented in module 20 whereby one obtains 10 reflection coefficients and 10 cepstral coefficients which are recursively derived from the reflection coefficients as seen in FIG. 2 and accomplished by module 21.

The auto-correlation coefficients are used to calculate LPC reflection coefficients by using one of many available algorithms. A particularly useful algorithm is Levinson's recursive algorithm. This is a well-known algorithm in the speech processing art. Essentially, the 10 cepstral coefficients are derived recursively from the reflection coefficients, as will be shown mathematically.

The calculation starts with auto-correlation coefficients {r.sub.1 } and proceeds in two steps. First the reflection coefficients {k.sub.1 } and scaled filter coefficients {a.sub.1 } are found using Levinson's recursion. The energy of the prediction residual is also obtained in the first step. In the second step the cepstral coefficients {c.sub.1 } are found using their recursive relation to the scaled filter coefficients. The mathematics are given below.

Step 1 - Reflection coefficients and scaled filter coefficients from auto-correlation coefficients.

A. Initialize ##EQU1##

B. Levinson Recursion. Do for m=1 . . . M-1 ##EQU2##

Step 2. Cepstral coefficients from scaled filter coefficients

A. Initialize

Do for m.sub.1 =2 . . . M ##EQU3##
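The equations themselves are not reproduced in this text, so the following Python sketch uses the common textbook forms of the two steps rather than the exact expressions of the specification. It assumes the prediction filter is written A(z) = 1 + a1 z^-1 + ... + aM z^-M, that the reflection coefficients take the sign convention this implies, and that the cepstrum is that of 1/A(z); the function names, the indexing and the absence of any scaling of the filter coefficients are all assumptions.

```python
def levinson(r, order):
    """Levinson recursion: returns (reflection coeffs, filter coeffs a[0..order], residual energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]                      # residual energy before any prediction
    ks = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err              # m-th reflection coefficient
        ks.append(k)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)        # residual energy shrinks at each order
    return ks, a, err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of 1/A(z), found recursively from a[0..p]."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(1, n) if n - k <= p)
        a_n = a[n] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]
```

For the 10th order analysis described in the text, one would call `levinson(r, 10)` on the eleven auto-correlation values r0 through r10 and then `lpc_to_cepstrum(a, 10)` on the resulting filter coefficients.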

The final processing performed by the front end module is to determine if the current input frame contains speech. This is done using a simple adaptive energy thresholding technique. The speech energy threshold is estimated from a smoothed histogram of the input frame energy. An ad hoc algorithm is used to determine this threshold. The first low energy peak in the histogram which is at least 20 percent as large as the largest histogram peak is assumed to contain the non-speech frames. The speech energy threshold is then set equal to the first minimum after the non-speech peak. This can be ascertained by referring to FIG. 4, which shows a graph of frame energy against frequency of occurrence and the determination of the speech energy threshold from the smoothed frame energy histogram.
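The threshold rule just described can be sketched as follows. This is one possible reading of the text, not the specification's own procedure: a peak is taken to be an interior bin that is higher than its left neighbor and at least as high as its right neighbor, a minimum is defined symmetrically, and the function name and the return of a bin index (rather than an energy value) are assumptions.

```python
def speech_threshold_bin(hist, min_ratio=0.2):
    """Return the index of the first minimum after the non-speech peak, or None."""
    # Interior local peaks, scanned from low energy to high energy.
    peaks = [i for i in range(1, len(hist) - 1)
             if hist[i - 1] < hist[i] >= hist[i + 1]]
    if not peaks:
        return None
    tallest = max(hist[i] for i in peaks)
    # First peak that is at least 20 percent as large as the largest peak.
    noise_peak = next(i for i in peaks if hist[i] >= min_ratio * tallest)
    # First local minimum at a higher energy than the non-speech peak.
    for i in range(noise_peak + 1, len(hist) - 1):
        if hist[i - 1] > hist[i] <= hist[i + 1]:
            return i
    return None
```

Frames whose energy falls in a bin above the returned index would then be treated as speech, and the rest as non-speech.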

The 100 point frame energy histogram is continuously updated with each new input frame. Each bin in the histogram is passed through a lowpass filter that causes the values in the histogram to decay towards zero as a function of time. The lowpass filters for each histogram bin are implemented as single pole digital recursive filters with a time constant of approximately 2 seconds as defined below. ##EQU4## where:

$\mathrm{Histbin}[i]_t$ is the $i$th histogram bin at time $t$; $k$ is 1/frame-rate = 0.01 seconds; $T$ is the time constant of the filter = 2 seconds; and Val is 1 if the current frame energy falls in Histbin[i], 0 otherwise.

The histogram is then smoothed using a 3-point smoothing kernel.
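The histogram update, smoothing and threshold selection described above can be sketched as follows. This is only an illustration under stated assumptions: the filter equation of the specification is not reproduced in this text, so a common single pole form with coefficient k/T is assumed; the 3-point kernel is assumed to have equal weights with the end bins repeated; and the function names are hypothetical.

```python
def update_histogram(hist, energy_bin, k=0.01, t_const=2.0):
    """Decay every bin toward zero and reinforce the bin hit by the current frame.

    Assumed single pole form: h[t] = (1 - k/T) * h[t-1] + (k/T) * val.
    """
    alpha = k / t_const
    return [(1.0 - alpha) * h + alpha * (1.0 if i == energy_bin else 0.0)
            for i, h in enumerate(hist)]


def smooth3(hist):
    """3-point smoothing with equal weights (assumed); end bins are repeated."""
    n = len(hist)
    return [(hist[max(i - 1, 0)] + hist[i] + hist[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def speech_threshold_bin(hist):
    """Return the bin of the first minimum after the non-speech peak.

    The non-speech peak is the first local peak that is at least 20 percent
    as large as the largest bin in the histogram.
    """
    n = len(hist)
    floor = 0.2 * max(hist)
    for i in range(n):
        left = hist[i - 1] if i > 0 else float("-inf")
        right = hist[i + 1] if i < n - 1 else float("-inf")
        if hist[i] >= floor and hist[i] > left and hist[i] >= right:
            j = i
            while j + 1 < n and hist[j + 1] < hist[j]:
                j += 1  # walk downhill to the first minimum
            return j
    return n - 1  # no qualifying peak found


threshold = speech_threshold_bin([0, 1, 5, 2, 1, 3, 10, 4])
```

A frame whose energy falls in a bin above the returned index would then be marked as a speech frame.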

SPEAKER MODELING IN MODULE 12

Speaker recognition models are generated by collecting statistics over the coefficients in the model data. The front end processor 10 identifies which frames of the data contain speech, as described above. Coefficients from the speech frames (reflection and cepstral) are accumulated and the means and a covariance matrix are calculated. These statistics are used in the Mahalanobis distance computation during recognition.
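By way of illustration only, the accumulation of a mean vector and covariance matrix over the speech frames might be sketched as below. The function name `build_model` is hypothetical, and the n - 1 normalization of the covariance is an assumption, since the specification does not state which normalization is used.

```python
def build_model(speech_frames):
    """Return (mean vector, covariance matrix) over a list of parameter vectors.

    Each frame is a list of coefficients (for example 10 reflection plus
    10 cepstral values). At least two frames are required.
    """
    n = len(speech_frames)
    dim = len(speech_frames[0])
    mean = [sum(f[j] for f in speech_frames) / n for j in range(dim)]
    cov = [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in speech_frames) / (n - 1)
            for b in range(dim)]
           for a in range(dim)]
    return mean, cov


mean, cov = build_model([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
```

Only frames already marked as speech by the front end would be passed to such a routine.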

Thus, as one can ascertain, models are generated in the speaker modeling module 12 of FIG. 1. The speaker recognition module 14 of FIG. 1 implements recognition of speakers. The recognition module 14 makes use of both speech and non-speech frames. The speech frames are used to characterize the talker for recognition and the non-speech frames are used to detect possible changes in talkers. Recognition is performed by comparing the current average parameter vector as derived from the coefficients with each of the active speaker models as stored. Once per second the identities of the three models that are closest to the speech being recognized are output with their corresponding scores. The current average parameter vector is the average over the last N seconds of speech. Each second the frames from the last second are accumulated and added to the average. At the same time, frames for the Nth second in the past are eliminated from the average.
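The running average over the last N seconds can be pictured with the following minimal sketch, which keeps one accumulator per second and lets the oldest one fall out of the window. The class name `SlidingAverage` and its methods are hypothetical and are not taken from the specification.

```python
from collections import deque


class SlidingAverage:
    """Average parameter vector over the most recent n_seconds of speech frames."""

    def __init__(self, n_seconds, dim):
        # Each entry holds (per-coefficient sums, frame count) for one second.
        self.window = deque(maxlen=n_seconds)
        self.dim = dim

    def push_second(self, frames):
        """Add the accumulator for the last second; the oldest one is dropped."""
        sums = [sum(f[j] for f in frames) for j in range(self.dim)]
        self.window.append((sums, len(frames)))

    def average(self):
        """Return the current average parameter vector, or None if empty."""
        total = sum(count for _, count in self.window)
        if total == 0:
            return None
        return [sum(sums[j] for sums, _ in self.window) / total
                for j in range(self.dim)]

    def clear(self):
        """Discard all accumulated data."""
        self.window.clear()


avg = SlidingAverage(n_seconds=2, dim=1)
avg.push_second([[1.0], [3.0]])
avg.push_second([[5.0]])
```

Storing sums and counts per second, rather than individual frames, keeps the update cost independent of the frame rate.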

The distance is then computed using the Mahalanobis metric:

where

$D$ is the Mahalanobis distance

$X$ is the input parameter vector

$M_i$ is the parameter vector from the $i$th model, and

$C_i$ is the covariance matrix for the $i$th model.
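The equation itself is not reproduced in this text. With the symbols defined above, the usual form of the Mahalanobis metric is $D = (X - M_i)^T C_i^{-1} (X - M_i)$; whether the specification uses this squared form or its square root cannot be recovered here, so the sketch below assumes the squared form. It solves a linear system rather than forming the inverse explicitly, which is a design choice of this illustration, and the function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (X - M)^T C^-1 (X - M) (assumed form)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    y = solve(cov, diff)  # y = C^-1 (X - M)
    return sum(d * yi for d, yi in zip(diff, y))


d = mahalanobis_sq([2.0, 1.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

A system that stores the inverse matrix with each model, as described for the modeling step, would replace the call to `solve` with a matrix-vector product.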

The recognition module also monitors non-speech frames to detect pauses in the input speech that are associated with possible changes in talkers. When a non-speech frame is input, the recognition module ignores the frame but increments the silence-frames-in-a-row counter. This counter is cleared any time a speech frame is input. If the silence-frames-in-a-row counter exceeds a silence threshold (user selectable, with a default value of 0.5 seconds), the recognition module signals a possible change in talker. The data in the current average parameter vector is then zeroed so that any further recognitions will be based only on data received after the silence gap.
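The silence counter logic just described might be sketched as follows. The class name `SilenceGapDetector` is hypothetical; the 0.01 second frame period is taken from the frame rate given for the histogram filter, and converting the 0.5 second threshold to a frame count by rounding is an assumption.

```python
class SilenceGapDetector:
    """Counts consecutive non-speech frames and flags a possible talker change."""

    def __init__(self, threshold_seconds=0.5, frame_period=0.01):
        # Threshold expressed in frames (50 frames for the default values).
        self.limit = round(threshold_seconds / frame_period)
        self.count = 0

    def update(self, is_speech):
        """Process one frame; return True when the silence threshold is exceeded."""
        if is_speech:
            self.count = 0  # any speech frame clears the counter
            return False
        self.count += 1
        return self.count > self.limit


detector = SilenceGapDetector()
flags = [detector.update(False) for _ in range(51)]
```

When `update` returns True, the caller would zero the current average parameter vector so that later recognitions use only data received after the gap.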

The distances are converted to speaker confidences using the following equation. ##EQU5## where

$a = 7.0$ (empirically determined)

$\beta = \min(\theta, a)$

$\theta = \max(d, 0.0)$

d=Mahalanobis distance

$\delta = \min(f, mf) / mf$

f=frames used in recognition

mf=150 (empirically determined)

For the system a low confidence was defined to be a confidence value less than 0.7.

Again, briefly summarizing the above and referring again to FIG. 1, the front end processing circuit 10, which as will be explained may include a digital computer, operates to digitize and buffer the input analog speech. In the front end the speech is lowpass filtered at 3.3 kHz via the lowpass filter 15 of FIG. 2. It is sampled at 8,000 samples per second and is converted into 16 bit samples by means of a linear analog-to-digital converter. The suitable parameters are extracted by utilizing a 200 point Hamming window which is overlapped by 50 percent. The output of the Hamming window analyzer is directed to an auto-correlation circuit whereby a 512 point fast Fourier transform is provided. The transform output is multiplied by an input spectrum utilizing the subband filter spectrum.

This is squared in order to calculate a power spectrum and then an inverse FFT is formed. From the inverse FFT which emanates from the auto-correlation circuit 18, one now derives the reflection coefficients by using linear predictive coding. This is implemented by means of the Levinson recursion algorithm. From these reflection coefficients, the cepstral coefficients, as for example 1 to 10, are recursively derived. For speech frame detection, if the frame energy is greater than the current speech energy threshold, that frame is marked as a speech frame. As will be explained, the current speech energy threshold is updated. This is accomplished by updating the frame energy histogram, and one then estimates the current speech energy threshold from the histogram. The histogram is that shown for example in FIG. 4. In regard to speaker modeling, all non-speech frames are ignored and the speech frame parameters are averaged until the end of the model data file. Once the end of the model data file is reached, one calculates the covariance matrix, which is then inverted, and one then stores the average parameters and the inverse matrix as the model in memory.
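The framing and auto-correlation steps of the summary can be illustrated with the following simplified sketch. It computes the auto-correlation directly in the time domain, which gives the same lags as the inverse FFT of the power spectrum when the 200 point frame is zero padded to 512 points, and it omits the subband filter weighting entirely. The function names and the standard 0.54/0.46 Hamming coefficients are assumptions of this illustration.

```python
import math


def hamming(n):
    """Standard Hamming window of length n (assumed 0.54 / 0.46 coefficients)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(samples, size=200, overlap=0.5):
    """Yield Hamming-windowed frames of `size` samples with the given overlap."""
    step = int(size * (1.0 - overlap))  # 100 samples for a 50 percent overlap
    window = hamming(size)
    for start in range(0, len(samples) - size + 1, step):
        yield [s * w for s, w in zip(samples[start:start + size], window)]


def autocorrelation(frame, max_lag=10):
    """Auto-correlation values r[0..max_lag] of one frame (direct summation)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


# 500 samples at 8,000 samples per second yield four overlapping frames.
frames = list(windowed_frames([0.0] * 500))
```

At 8,000 samples per second, a step of 100 samples corresponds to a new frame every 12.5 milliseconds in this sketch.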

In order to achieve recognition of speakers, the following occurs. For all non-speech frames obtained from the front end processing, one increments the number of non-speech frames which occur in a row. If the number of non-speech frames in a row is greater than the silence threshold, one then clears all one second parameter accumulators. For all speech frames obtained from the front end processor, one sets the number of non-speech frames in a row to zero. One then increments the speech frame counter. If the speech frame counter is greater than the number of frames in one second, the current one second parameter accumulator is saved and one then initializes a new current one second parameter accumulator. One then operates to average the past N one second parameter accumulators. The Mahalanobis distance between the average parameters and each of the active speaker models is then calculated.

The system then operates to output the lowest three distances and the corresponding speaker numbers while adding the current frame to the current parameter accumulator. In this manner one can recognize each speaker by means of the measured distances and furthermore one can do this without regard to any speaker being required to utter a predetermined pattern. Thus, as will be explained, this technique scales frame parameters as a function of the frame power. Hence in this technique all available speech frames are accumulated but those frames having low power are deemphasized as not being speech frames.

The complete hardware implementation of the system will be described in greater detail.

Referring to FIG. 5, there is shown a complete block diagram of a speaker authentication system according to this invention. As one can see from FIG. 5, there is a main processor unit 40 designated as a CPU. The main processor unit 40 has a bidirectional bus 41 connected to a main processor bus 42 which essentially enables the main processor unit to control all modules that are connected to the main processor bus 42 as well as to enable the various modules to communicate with the CPU as will be further explained. The individual modules as well as the CPU 40 are coupled to the bus via multibus interface logic modules, which are supplied by many companies.

The CPU or main processor unit coordinates the activities of the major subsystems and serves to provide the proper interface between the operator and the authentication system. The CPU 40 contains the operating system software and enables interaction with an operator. As can be seen from FIG. 5, the CPU is connected via a typical fiber optic link or communications link to a CRT keyboard terminal 43 which for example may be an operator's terminal.

Essentially, as will be described and as indicated above, the main function of the CPU 40 is to schedule all processes required to implement the various recognition algorithms as discussed. The CPU also serves to provide access to mass storage elements that are required to store digitized speech as well as speaker models and recognition results. As indicated, the operator interacts with the system through the CRT and keyboard terminal 43 which is coupled to the CPU. This terminal may be part of an operator's console associated with the system which requires speaker identification.

Also shown coupled to the main processor bus 42 is a disk drive or disk subsystem 45. The disk is a memory which is available from many suppliers and operates to store system software as well as digitized speech and speaker models. The disk system 45 is a relatively rapid system to enable and accommodate high speed data transfer rates which are associated with real time digitizing and playing of speech. The disk subsystem 45 is used to store all the digital speech necessary to produce the above-described speaker models.

Hence the entire operating system for the speaker recognition system is stored on the disk subsystem 45. This can include all the necessary compilers, assemblers and so on necessary to generate the proper operation of software for each of the subsystems included in the main system. The disk memory 45 also stores the system source code which is employed throughout the system. Also shown coupled to the main processor bus 42 via a bidirectional bus is a tape subsystem 46. The system 46 is a conventional magnetic tape system and is employed to provide backup for the critical information stored on the system disk. This provides protection against loss of speech data and software due to hardware failure or operator error. It also provides storage of speech or model data which is not needed in the system on a daily basis.

Also shown coupled to the main processor bus 42 is a recognition algorithm front end system or a signal processor subsystem 50. Essentially, the recognition algorithm front end system 50 is a dedicated processor which functions to execute a large portion of the speaker recognition algorithms. The major computational tasks, such as converting the input speech wave into the LPC coefficient or parameter representation of the speech and comparing the speech input parameters with the stored speaker models, are accomplished in the recognition algorithm front end module 50.

Also shown coupled to the main processor bus 42 is a random access memory 52. The random access memory 52 may for example be a 1 megabyte memory and is utilized for peripheral storage of data and also operates in conjunction with the disk memory 45 and the magnetic tape system 46.

As seen in FIG. 5, there is shown an analog conditioning board 60. The analog board 60 as indicated receives audio at its various inputs or speech to be processed and can direct output audio or process speech from the output terminals. Hence as seen, there is a remote audio IN which consists of a series of terminals and an audio output section which also consists of a plurality of terminals. The analog conditioning board 60 interfaces with three-channel analog-to-digital converters 62 and also interfaces with three-channel digital-to-analog converters 63. Both the analog-to-digital converter 62 and the digital-to-analog converters 63 are coupled to the main processor bus via bidirectional buses as shown in the diagram. Further coupled or connected to the main processor bus is an input/output (I/O) controller 61 and an interface controller 64. The main function of the analog conditioning board 60 in conjunction with the analog-to-digital converters 62 and the digital-to-analog converters 63 is to perform analog-to-digital and digital-to-analog conversion. The board may also contain appropriate filters, amplifiers and automatic gain control circuitry in order to assure that the signal levels for the system are proper.

The function of the input/output controller or I/O controller 61 is to interface with the digital data from the analog-to-digital converters 62 and to enable the transfer of digital data to the digital-to-analog converters 63. The I/O controller 61 assures the rapid movement of large amounts of data. As can be seen, the main processor unit or CPU 40 accomplishes data movement via the main processor bus 42. However, given the huge amount of data to be moved, a substantial portion of the CPU 40 time would be diverted, making it unavailable to respond to operator requests in a timely manner.

Hence the I/O controller 61 is provided to allow all data transfers to the digital-to-analog converters 63 to occur while further coordinating the movement of data from the analog-to-digital converters 62 to the signal processing subsystems. The I/O controller 61 typically includes the data buffers which are required to store digitized speech prior to recognition and serves to control the transfer of speech to the operator via the digital-to-analog converters 63. The I/O processor 61 also interacts with the disk subsystem 45 via the main processor bus 42, enabling it to transfer data to and from the analog-to-digital and digital-to-analog converters 62 and 63.

As indicated above, an interface controller 64 is also bidirectionally coupled to the main processor bus 42. Speech to be transferred includes new model material to be stored on the disk subsystem 45 for later use by the system to generate or update speaker models; speech to be identified is also stored on this disk subsystem. The interface controller 64 provides high speed digital data paths between the disk system and the recorder systems to enable the high speed requirements to be implemented.

As one can ascertain from the block diagram of FIG. 5 and the relatively simple explanation given thereof, all of the components depicted in FIG. 5 are conventional, commercially available components, and descriptions of suitable types of components will be given subsequently in this specification. As one can ascertain from FIG. 5, the majority of all interactions between the CPU 40 and the subsystems are coordinated by the CPU through the main processor bus 42, which enables the CPU 40 to interface with the various system modules. The CPU 40 control software writes commands into registers in the desired subsystems and in addition reads status registers to monitor the status and progress of the subsystems. The major functions which are implemented by the system include (1) digitizing and storing data on the disk 45, (2) storing of speech data on the disk 45, (3) playing back digitized speech from the disk 45, (4) recognizing speakers from live speech, (5) recognizing speakers from stored digital data, (6) generating models.

In order to explain each of the above-noted processes and to further determine how they are implemented, a series of flow diagrams will be given showing the implementation of the above-described operations. Numbers in parentheses indicate the logic module employed in the description.

Referring to FIG. 6, there is shown a flow diagram depicting the process of storing digitized speech on the disk subsystem 45. Essentially, as will be ascertained and again referring to FIG. 5, storing digitized speech data on the disk 45 involves the control of the disk subsystem 45, the I/O control processor 61 and the A/D converter system 62. The direct control of the A/D converters 62 is provided by the I/O control processor 61.

FIG. 6 is again a block diagram in flow chart form showing the digitizing process. As indicated by module 70, the CPU 40 sends a digitize command to the I/O controller 61. Essentially, the control software of the CPU writes commands to the mail box registers in the I/O control processor 61 instructing it to begin digitizing a particular channel of A/D data and to store the data on the system disk 45 in specified blocks. As shown by module 70, in order to accomplish this, the CPU has to specify the particular A/D channel, specify the number of bytes required and also specify the disk address. The terminology utilized in module 70 is sufficient for one skilled in the art.

After receiving the command from the CPU, the I/O control processor 61 interprets the command, sets a busy flag in a selected mail box register contained in the I/O processor and begins processing the command. The operation of an I/O processor such as processor 61 is also well known in the art. The I/O control processor 61 accesses the control and status register of the A/D converter board to clear the input registers, which for example are first-in, first-out (FIFO) registers, of any old data. This process is briefly shown and described in FIG. 7.
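The mail box handshake described above can be pictured with the following toy model. It is a software illustration only: the specification does not give the register layout, so the class names, field names and methods below are all hypothetical, and the real mechanism is implemented in hardware registers on the I/O control processor.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DigitizeCommand:
    """Parameters the CPU must specify (hypothetical field names)."""
    channel: int       # A/D channel to digitize
    num_bytes: int     # number of bytes required
    disk_address: int  # where the data is stored on the system disk


class IOControllerModel:
    """Toy model of the I/O control processor's mail box and input FIFO."""

    def __init__(self):
        self.busy = False
        self.command = None
        self.fifo = deque([0x1234])  # stale data left over from earlier use

    def write_mailbox(self, command):
        """CPU side: post a command; refused while the busy flag is set."""
        if self.busy:
            raise RuntimeError("I/O controller is busy")
        self.command = command
        self.busy = True   # busy flag set on accepting the command
        self.fifo.clear()  # flush old data from the input FIFO

    def finish(self):
        """Controller side: clear the busy flag when the transfer is done."""
        self.busy = False


controller = IOControllerModel()
controller.write_mailbox(DigitizeCommand(channel=0, num_bytes=16000, disk_address=0))
```

In such a model the CPU would poll the busy flag to monitor progress, which mirrors the reading of status registers described for the main processor.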

Thus, referring to FIG. 7, there is shown the initialization of the A/D board by the I/O controller. The I/O controller sends the A/D board a digitize command 78 which is acknowledged as received by the board and commands the A/D channels to flush or reset the registers 79 which are normally first-in, first-out or FIFO devices. After implementing this instruction, the A/D converters start to sample data via the sample rate clock as evidenced by module 80 of FIG. 7. The sample rate clock on the I/O control processor board, as indicated, is started and an on-board counter is used to count the number of sample clock pulses which are issued. This procedure is briefly shown in FIG. 8.

As seen in FIG. 8, the I/O processor 61 starts its clock via the command as evidenced by module 81. This start control signal is directed to a clock 82 which commences to produce output sa