WikiPatents - Community Patent Review
Recognition apparatus using articulation positions for recognizing a voice    
United States Patent 5175793
Link to this page: http://www.wikipatents.com/5175793.html
Inventor(s): Sakamoto; Kenji (Nara, JP); Yamaguchi; Kouichi (Tenri, JP)
Abstract: A first voice recognition apparatus includes a device for analyzing frequencies of the input voice and a device coupled to the analyzing unit for determining vowel zones and consonant zones of the analyzed input voice. The apparatus further includes a device for determining positions of articulation of an input voice determined from the vowel zones by calculating from frequency components of the input voice in accordance with a predetermined algorithm based on frequency components of monophthongs having known phonation contents and positions of articulation. A second voice recognition apparatus includes a device for analyzing frequencies of the input voice so as to derive acoustic parameters from the input voice. A pattern converting unit is coupled to the analyzing unit and uses a neural network for converting the acoustic parameters to articulatory vectors. The neural network is capable of learning, by the error back propagation method using target data produced by a predetermined sequence based on the acoustic parameters, to create rules for converting the acoustic parameters of the input voice to articulatory vectors having at least two vector elements. A recognizing unit is coupled to the pattern converting unit for recognizing the input voice by comparing a feature pattern of the analyzed input voice having the articulatory vector with reference feature patterns in a predetermined sequence. A storage unit is coupled to the recognizing unit for storing the reference feature patterns having the articulatory vectors created by the pattern converting unit.

Owner/Assignee: Sharp Kabushiki Kaisha (Osaka, JP)
Publication Date: December 29, 1992
Application Number: 07/473,238
Filing Date: January 31, 1990
US Classification: 704/200
Int'l Classification: G10L 009/10
Examiner: Fleming; Michael R.
Assistant Examiner: Doerrler; Michelle
Attorney/Law Firm: O'Connell; Robert F.
Priority Data: Feb 01, 1989 [JP] 1-23377; Feb 02, 1989 [JP] 1-26033
USPTO Field of Search: 381/41, 381/42, 381/43, 381/44, 381/45, 364/513, 364/513.5

References

U.S. References
Patent No.  Inventor     US Class  Date
4980917     Hutchins     704/254   Dec. 1990
4975961     Sakoe        704/232   Dec. 1990
4876731     Loris        382/229   Oct. 1989
4856067     Yamada       704/234   Aug. 1989
4829572     Kong         704/249   May 1989
4805225     Clark        382/161   Feb. 1989
4802103     Faggin       706/38    Jan. 1989
4760604     Cooper       382/155   Jul. 1988
4712243     Ninomiya     704/250   Dec. 1987
4624010     Takebayashi  704/249   Nov. 1986
4087632     Hafer        704/251   May 1978

Claims
 


What is claimed is:

1. A speech recognition apparatus using position of articulation for recognizing an input voice, said apparatus comprising:

means for analyzing frequency components of said input voice;

means for determining a vowel zone and a consonant zone of said input voice from said frequency components analyzed by said analyzing means; and

means for calculating a position of articulation of said input voice by using said vowel zone and said consonant zone determined by said determining means by using a calculation process based on frequency components obtained from monophthongs whose phonation contents and articulation positions are known; and

means for recognizing said input voice in accordance with said position of articulation calculated by said calculating means so as to output a recognized result.

2. An apparatus according to claim 1, wherein said calculating means has a neural network capable of self-producing rules by learning processes using an error back propagation method, and said positions of articulation of said input voice are calculated by said neural network from said frequency components in accordance with said self-produced rules.

3. An apparatus according to claim 2, wherein said apparatus further comprises means for obtaining consonant patterns from said consonant zone of said input voice, and comparing means having a storage means for storing reference consonant patterns and means for matching consonants patterns of said input voice determined from said consonant zone with said reference consonant patterns stored in said storage means.

4. An apparatus according to claim 3, wherein said storage means further stores reference articulation positions and said matching means is coupled to said calculating means for further comparing articulation positions calculated by said calculating means with said stored reference articulation positions in said stored means so as to calculate a similarity therebetween, and said apparatus further comprises a display means for displaying said calculated similarity obtained by said matching means.
Description
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a voice recognition apparatus that can extract feature variables from an input voice independently of speaker and language, absorb speaker-dependent fluctuations, and effectively reduce the amount of calculation required for matching in voice recognition.

2. Description of the Related Art

Voice recognition apparatuses are generally divided into two systems. One is the word voice recognition system, in which word voices are recognized through matching against reference patterns whose units are words. The other is the phoneme recognition system, in which the phonemes of a word are recognized through matching against standard patterns whose units are phonemes or syllables, which are smaller than words.

The word voice recognition system is free from false recognition caused by articulatory coupling and can provide a high recognition rate. However, the number of reference patterns grows with the size of the vocabulary, which requires a large memory capacity and a large amount of calculation in matching. In particular, when many unspecified speakers are to be recognized, a plurality of reference patterns (multi-templates) are needed for each word, because voices fluctuate greatly from speaker to speaker. Such voice fluctuations are attributable to various factors. Speakers differ in physiological factors such as sex, age, and vocal tract length, so voices fluctuate as speakers change. Even for a single speaker, voice fluctuations arise if the speaker phonates in a different manner (loudness, speaking rate, etc.) depending on circumstances, or if the surrounding noise varies.

Therefore, the problem arising from an increased vocabulary has been dealt with as follows. In order to reduce the number of reference patterns used in matching, the reference patterns are preliminarily selected before the principal matching is executed, based on the intermediate result of DP matching among the reference patterns, durations, global features and local features of the input voices.

However, no approach has yet been found that completely eliminates voice fluctuations due to a change of speakers.

Applicants know that, to some extent, the sound source characteristics among speaker-dependent fluctuations can be compensated by passing voices through primary to tertiary adaptive inverted filters of the critical damping type. Attempts have also been made to normalize differences between individual speakers by applying a simple conversion to the voice signal using the first through third formants.

When a voice recognition apparatus of the phoneme recognition system is used, a feature extracting device first frequency-analyzes voice signals in advance to extract several feature variables for each phoneme to be recognized. These feature variables are stored in a storage section as reference patterns for the respective phonemes. Each word is then expressed as a series of such phoneme reference patterns, and the resulting series are stored word by word in a storage device, in association with the phoneme series of the words, as a word dictionary. When an unknown voice is input, the feature extracting device extracts feature variables from the input voice for each frame in the same manner. The similarity between the extracted feature variables of each frame and the phoneme reference patterns stored in the storage section is then checked, and the phoneme corresponding to the reference pattern with the maximum similarity is determined as the phoneme of that frame. Phonemes of subsequent frames are determined successively in the same way, so that the unknown voice is expressed as a series of phonemes. Afterward, the similarity between the phoneme series obtained from the unknown voice and the series of phoneme reference patterns for each word in the word dictionary is checked, and the word corresponding to the series with the maximum similarity is determined as the word of the input voice.
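The conventional phoneme-based procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: similarity is taken as the negative Euclidean distance, repeated frame labels are merged into one phoneme, and the phoneme series are compared by edit distance. None of these choices, nor the names `label_frames`, `collapse`, `edit_distance`, and `recognize_word`, nor the toy two-dimensional feature values, come from the patent.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def label_frames(frames, phoneme_refs):
    """Give each frame the phoneme whose reference pattern is closest."""
    return [min(phoneme_refs, key=lambda p: dist(frame, phoneme_refs[p]))
            for frame in frames]

def collapse(labels):
    """Merge runs of identical frame labels into one phoneme series."""
    series = []
    for label in labels:
        if not series or series[-1] != label:
            series.append(label)
    return series

def edit_distance(a, b):
    """Levenshtein distance between two phoneme series."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def recognize_word(frames, phoneme_refs, dictionary):
    """Pick the dictionary word whose phoneme series is closest to the input."""
    series = collapse(label_frames(frames, phoneme_refs))
    return min(dictionary, key=lambda w: edit_distance(series, dictionary[w]))

# Toy data (hypothetical feature values, for illustration only).
phoneme_refs = {"a": (1.0, 0.0), "k": (0.0, 1.0), "i": (1.0, 1.0)}
dictionary = {"aka": ["a", "k", "a"], "aki": ["a", "k", "i"]}
frames = [(0.9, 0.1), (1.0, 0.0), (0.1, 0.9), (0.9, 1.1), (1.0, 0.9)]
print(recognize_word(frames, phoneme_refs, dictionary))
```

Because every frame is labeled independently, a single frame pushed toward another phoneme by speaker or context changes the phoneme series, which is the weakness the description goes on to discuss.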

In acoustic analysis and feature extraction, a voice can be expressed with a small number of parameters through linear prediction analysis (LPC) by assuming the voice to follow an all-pole model. An attempt has been proposed to use such a model approach to directly express the structure of the articulatory organs and their motional characteristics, thereby effectively describing the cross-sectional area function of the vocal tract with the aid of a model. This is called an articulatory model using an articulatory parameter x (Shirai and Honda: "Estimation of Articulatory Parameters from Speech Waves", Trans. IECE Japan, 61-A, 5, pp. 409-416, 1978). The articulatory parameter x is composed of an opening/closing angle of the lower jaw: X(J), an antero-posterior (longitudinal) deformation of the tongue surface: X(T1), a vertical deformation of the tongue: X(T2), an opening area/extension of the lip: X(L), a shape of the glottis: X(G), and an opening of the velum (degree of nasalization): X(N). Thus, the articulatory parameter can be expressed by: x=[X(T1), X(T2), X(J), X(L), X(G), X(N)]. Assuming that a non-linear articulatory model for converting the articulatory parameter x to an acoustic parameter is given, the articulatory parameter x can be derived in reverse from the acoustic parameter by solving a non-linear optimization problem. While the number of parameter dimensions is normally 12-20 in the aforementioned LPC, the number of dimensions of the articulatory parameter x is 6. This means that when the articulatory parameter x is used, information is compressed to half or less of that of the LPC parameter.
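The inverse estimation mentioned above can be written compactly as follows. The symbols f (the non-linear articulatory model), a (the observed acoustic parameter), and the hat notation are introduced here for illustration, and the squared-error criterion is an assumption; the text states only that a non-linear optimization problem is solved.

```latex
\hat{x} = \arg\min_{x \in \mathbb{R}^{6}} \bigl\| a - f(x) \bigr\|^{2},
\qquad
x = [X(T1),\, X(T2),\, X(J),\, X(L),\, X(G),\, X(N)]

% Dimension ratio relative to LPC with 12 to 20 parameters:
\frac{6}{12} = 0.5, \qquad \frac{6}{20} = 0.3
```

The ratio lies between 0.3 and 0.5, which matches the statement that the information is compressed to half or less.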

Meanwhile, the narrow degree C at a point of articulation in the vocal tract is difficult to express with high accuracy using the articulatory parameter x, yet it is deeply related to the type of articulation, such as vowel, fricative, and closure. For this reason, the narrow degree is extracted separately from the articulatory parameter x and the coordinates (x, y) of the narrowed position so that it can be used for voice recognition and the like. Further, both the narrow degree C and the vector (x, y) of the narrowed position can be calculated simply from the acoustic parameter by using a neural network, which avoids the non-linear optimization problem associated with the articulatory parameter x.

However, the conventional voice recognition apparatuses described above have the following problems. In the method based on phoneme reference patterns, the feature variable of a phoneme extracted by the feature extracting device may differ even when the voice is produced to express the same phoneme symbol. The difference depends not only on physiological differences between individual speakers (e.g., a difference in vocal tract length) but also, for vowels in a word, on the influence of articulatory coupling in the surrounding phonemic environment. In other words, if voice recognition uses the feature variable of a phoneme, even a voice produced to express the same phoneme symbol may be determined as a different phoneme, so that it is rejected or incorrectly recognized. Accordingly, a high recognition ability cannot be obtained. This problem is attributable to the fact that voice recognition is performed using feature variables of phonemes that fluctuate depending on the speaker and the phonemic environment.

Speaker-independent word recognition has the problem, as mentioned above, that the amount of calculation necessary for matching between the feature patterns of an input voice and the reference patterns increases.

Further, the method of predicting an articulatory parameter from an acoustic parameter using an articulatory model is problematic in that a non-linear optimization problem must be solved, which is disadvantageous in terms of the amount of calculation and the stability of convergence. To avoid this problem, several methods have been attempted, such as taking into account the fluctuation range and continuity of the parameter or using a table lookup. However, the amount of calculation remains essentially large. Another problem is that the articulatory parameter x is directed to specified speakers, and prediction succeeds only in the steady portion of a vowel.

Various methods using formant frequencies have also been proposed for adaptation to many unspecified speakers, but no decisive method has been found.

SUMMARY OF THE INVENTION

An object of this invention is to provide a voice recognition apparatus which can extract a position of articulation (i.e., a narrowed position formed in a vocal tract) as a feature variable specific to phonation of a voice, which position is not dependent on speakers and languages.

Another object of this invention is to provide a voice recognition apparatus for speaker-independent voice recognition in which feature variables of smaller dimension, capable of removing voice fluctuations caused by physiological differences between individual speakers, are set by further developing the feature variables of the narrow degree C and the coordinates (x, y) of a point of articulation, thereby reducing the number of reference patterns and the amount of calculation necessary for matching.

The object of the invention can be achieved by a voice recognition apparatus for analyzing frequencies of an input voice inputted from an input device and for extracting feature variables of the input voice from the analyzed frequencies to recognize the input voice, the apparatus including:

a unit for analyzing frequencies of the input voice;

a unit coupled to the analyzing unit for determining a vowel zone and a consonant zone of the analyzed input voice; and

a unit for determining a position of articulation of a portion of the input voice determined as a vowel zone by calculating from frequency components of the input voice in accordance with a predetermined algorithm based on frequency components of monophthongs having known phonation contents and positions of articulation.

Preferably, the vowel position of articulation determining unit has a neural network capable of self-producing rules by learning processes using the error back propagation method, and the position of articulation of the vowel is calculated from the frequency components of the portion in accordance with the rules.

The vowel position of articulation determining unit further has a storage unit for storing transform equations used for the algorithm so as to calculate the vowel position of articulation.

The recognition apparatus may further include a unit coupled to the vowel/consonant zone determining unit for comparing the input voice determined as a consonant zone with memorized consonant patterns so as to match consonants.

Preferably, the comparing unit includes a storage unit for storing the consonant patterns.

The voice recognition apparatus may further include a unit coupled to the determining unit for discriminating the determined input voice outputted from the determining unit by comparing it with reference patterns stored therein.

The discriminating unit preferably includes a matching unit coupled to the determining unit for comparing the input voice outputted from the determining unit with reference patterns memorized therein so as to calculate the similarity therebetween, and a display unit for displaying the calculated similarity obtained by the matching unit.

The analyzing unit is preferably formed of either a group of band-pass filters or a fast Fourier transformer.

In a voice feature extracting system of the first invention, a vowel/consonant zone determining device and a position of articulation extracting device are provided. Based on frequency components of a plurality of monophthongs whose phonation contents and positions of articulation are known, the position of articulation extracting device processes the input phonation content that the vowel/consonant zone determining device has determined to be a vowel zone, and calculates the position of articulation of the vowel in question from its frequency components. This makes it possible to derive the position of articulation as a feature variable of the voice independent of speakers and languages. In accordance with the present invention, therefore, the position of articulation can be extracted with high accuracy through simple processing.

Alternatively, a vowel/consonant zone determining device and a position of articulation extracting device which includes a neural network are provided. Based on frequency characteristics of the zones that the vowel/consonant zone determining device has determined to be vowel zones, the neural network creates by itself, through learning, rules to derive positions of articulation of vowels. In accordance with those rules, the position of articulation extracting device derives the position of articulation of an input vowel from frequency components in the zone determined as a vowel zone. This makes it possible to derive the position of articulation as a feature variable of the voice independent of speakers and languages, based on the frequency components of the vowel. With the present invention, therefore, the position of articulation can be extracted with high accuracy through simple processing.

The other object of the invention can be achieved by a voice recognition apparatus for analyzing frequencies of an input voice inputted from an input device, for determining feature variables of the input voice from the analyzed frequencies to recognize the input voice, and for indicating the recognized input voice, the apparatus including:

a unit for analyzing frequencies of the input voice so as to derive acoustic parameters from the input voice;

a pattern converting unit coupled to the analyzing unit and having a neural network for converting the acoustic parameters to articulatory vectors, the neural network being capable of learning by the error back propagation method using target data produced by a predetermined sequence based on the acoustic parameters so as to create rules for converting the acoustic parameters of the input voice to the articulatory vector having at least two vector elements;

a recognizing unit coupled to the pattern converting unit for recognizing the input voice by comparing a feature pattern of the analyzed input voice having the articulatory vector with reference patterns in a predetermined sequence; and

a storage unit coupled to the recognizing unit for storing the reference patterns having the articulatory vectors created by the pattern converting unit.

Preferably, the two vector elements are selected from among the respective positions of a point of articulation in the antero-posterior and vertical directions, a degree of narrowness of the vocal tract at the point of articulation, presence or absence of vibration of the vocal cords, a degree of nasalization, and a rounded degree.

Furthermore, the pattern converting unit converts the acoustic parameters to the articulatory vector frame by frame.

The neural network is preferably a multi-layered perceptron.

The voice recognition apparatus further includes a unit coupled to the recognizing unit for displaying the similarity between the analyzed input voice and the reference patterns obtained by the recognizing unit.

Preferably, the storage unit memorizes the reference patterns, each of which has a time series of the articulatory vector created by the pattern converting unit for an acoustic sample.

Furthermore, the recognizing unit includes a preliminary selecting unit coupled to the pattern converting unit for selecting patterns from the reference patterns, and a discriminating unit coupled to the preliminary selecting unit for discriminating the input voice in accordance with a distance between the respective time series of the articulatory vectors for the feature pattern and the reference pattern.

The analyzing unit is preferably formed of either a group of band-pass filters or a fast Fourier transformer.

In a voice recognition system of a second invention, an acoustic parameter is first derived from an input voice signal by an acoustic analyzing device. From this acoustic parameter, an articulatory vector is then produced whose elements include at least two of the following: the antero-posterior and vertical positions of a point of articulation, a narrow degree of the vocal tract, presence or absence of vibration of the vocal cords, a degree of nasalization, and a rounded degree (or degree of labialization). A feature pattern of the input voice, represented by a time series of the articulatory vector, and a reference pattern of a voice sample, represented by a time series of its articulatory vector, are matched in a discriminating section by using a distance between the two time series. Expressing a voice by the articulatory vector therefore makes it possible to eliminate voice fluctuations caused by physiological differences between individual speakers and to reduce the number of templates for the reference patterns, thereby reducing the amount of calculation necessary for matching when voice recognition is performed for many unspecified speakers.
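One common way to obtain a distance between two vector time series of different lengths is dynamic time warping (DTW), a form of the DP matching mentioned in the related art. The sketch below uses plain DTW with Euclidean frame distances as an assumed stand-in; this passage does not specify the distance actually used, and the names `dtw_distance` and `recognize` are hypothetical.

```python
from math import dist, inf

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping distance between two sequences of vectors."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best accumulated distance aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(feature, references):
    """Return the name of the reference pattern closest to the feature pattern."""
    return min(references, key=lambda name: dtw_distance(feature, references[name]))

# Toy articulatory-vector series with two elements per frame (made-up values).
references = {
    "word_1": [(0.0, 0.0), (1.0, 1.0)],
    "word_2": [(1.0, 0.0), (0.0, 1.0)],
}
slow_word_1 = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]  # same word, spoken more slowly
print(recognize(slow_word_1, references))
```

Because the warping absorbs differences in speaking rate, a single reference series per word can match utterances of different durations in this toy setting.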

When the above articulatory vector is produced, a parameter converting device learns by the error back propagation method using target data formed in advance through a predetermined sequence, so that the acoustic parameter of the input voice is converted to the articulatory vector by a neural network that has itself created the rules for this conversion. The articulatory vector can therefore be produced through simple multiplication and summation operations alone, and the amount of calculation required to produce it can be reduced. In addition, by training the neural network with target data obtained from voice samples of many speakers, the articulatory vector can be produced stably for all sorts of input voices.
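A minimal sketch of such a network is given below, assuming a single hidden layer, sigmoid units, a squared-error loss, and articulatory targets scaled to the range 0 to 1. The layer sizes, learning rate, and the class name `TinyMLP` are illustrative assumptions, not details from the patent. The forward pass consists only of multiply-and-sum steps followed by a squashing function.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMLP:
    """One-hidden-layer perceptron: acoustic parameters -> articulatory vector."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    @staticmethod
    def _layer(weights, inputs):
        ext = inputs + [1.0]  # constant input for the bias weight
        return [sigmoid(sum(w * v for w, v in zip(row, ext))) for row in weights]

    def forward(self, x):
        hidden = self._layer(self.w1, list(x))
        return hidden, self._layer(self.w2, hidden)

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update; returns the squared error before it."""
        x = list(x)
        hidden, out = self.forward(x)
        # Output deltas: error times the sigmoid derivative.
        d_out = [(o - t) * o * (1 - o) for o, t in zip(out, target)]
        # Hidden deltas: propagate the output deltas back through w2.
        d_hid = [h * (1 - h) * sum(d * self.w2[k][j] for k, d in enumerate(d_out))
                 for j, h in enumerate(hidden)]
        for k, d in enumerate(d_out):
            for j, v in enumerate(hidden + [1.0]):
                self.w2[k][j] -= lr * d * v
        for j, d in enumerate(d_hid):
            for i, v in enumerate(x + [1.0]):
                self.w1[j][i] -= lr * d * v
        return sum((o - t) ** 2 for o, t in zip(out, target))

# Toy usage: 3 acoustic parameters in, 2 articulatory elements out (made-up values).
net = TinyMLP(3, 4, 2)
acoustic, target = [0.1, 0.9, 0.4], [0.2, 0.8]
errors = [net.train_step(acoustic, target) for _ in range(200)]
print(errors[0], errors[-1])
```

Once trained, converting a frame requires only the `forward` call, which is why the conversion is cheap compared with solving an optimization problem for every frame.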

Further objects and advantages of the present invention will be apparent from the following description, reference being had to the accompanying drawings wherein preferred embodiments of the present invention are clearly shown.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of one embodiment of a voice recognition apparatus according to this invention;

FIG. 2 is a representation showing positions of articulation of various vowels;

FIG. 3 is a representation showing the relationship between a first formant frequency and a second formant frequency for Japanese vowels;

FIG. 4 is a representation showing the relationship between a first formant frequency and a second formant frequency for other various vowels phonated by some speaker;

FIG. 5 is a flowchart of the position of articulation calculating operation of one vowel in a word;

FIG. 6 is a flowchart of a position of articulation calculating routine for a vowel /a/ in the word of FIG. 5;

FIG. 7 is a flowchart of a position of articulation calculating routine for a vowel /i/ in the word of FIG. 5;

FIG. 8 is a flowchart of a position of articulation calculating routine for a vowel /u/ in the word of FIG. 5;

FIG. 9 is a flowchart of a position of articulation calculating routine for a vowel /e/ in the word of FIG. 5;

FIG. 10 is a flowchart of a position of articulation calculating routine for a vowel /o/ in the word of FIG. 5;

FIG. 11 is an illustration for explaining the structure of a neural network;

FIG. 12 is a block diagram showing another embodiment of a voice recognition apparatus according to the present invention;

FIG. 13a is a chart showing one example of the waveform of an input voice;

FIG. 13b is a chart showing a time series of phoneme symbols corresponding to the voice waveform of FIG. 13a;

FIG. 13c is a chart showing a time series of an element C of the articulatory vector corresponding to the voice waveform of FIG. 13a;

FIG. 13d is a group of charts showing time series of elements x, y, n, g and l of the articulatory vector corresponding to the voice waveform of FIG. 13a;

FIG. 14 is an illustration showing a neural network;

FIG. 15a is a sectional view of the mouth (or oral cavity) of a human being for explaining points of articulation of consonants; and

FIG. 15b is a view showing values of the element x for consonants corresponding to points of articulation in the section of the mouth shown in FIG. 15a.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, the invention will be described in detail with reference to the illustrated embodiments.

FIG. 1 is a block diagram of a voice recognition apparatus according to a first invention. A voice signal input from a microphone 1 is amplified by an amplifier 2 and applied to an acoustic analyzing device 3. The acoustic analyzing device 3 performs frequency analysis of the input voice signal through a group of band-pass filters (hereinafter referred to as BPF) or a fast Fourier transform (applied to the values obtained by multiplying the voice waveform data by a window).
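The windowed frequency analysis can be illustrated as follows. This sketch assumes a Hamming window and, for brevity, evaluates the discrete Fourier transform directly rather than with a fast algorithm; the text says only that the waveform is multiplied by "a window" before the transform. The function names are hypothetical.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n (the window choice is an assumption)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def power_spectrum(frame):
    """Power at each frequency bin 0..n/2 of the windowed frame (direct DFT)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

# A 64-sample sinusoid completing exactly 8 cycles should peak at bin 8.
frame = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
spectrum = power_spectrum(frame)
print(max(range(len(spectrum)), key=spectrum.__getitem__))
```

A production analyzer would use an FFT routine, which computes the same values in O(n log n) operations instead of O(n^2).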

A vowel/consonant zone determining device 4 determines whether the input voice signal is in a vowel zone or a consonant zone. This determination is carried out by referring to changes in the power and spectra of the input voice, or the like. If the input voice is determined to be a vowel zone, the position of articulation of the vowel is extracted by a position of articulation extracting device 5. This extraction is carried out by reading, from a transform equation storage device 6, the transform equations and rules necessary for calculating the position of articulation from frequency components of the voice, and then applying them. On the other hand, if the input voice is determined to be a consonant zone by the vowel/consonant zone determining device 4, a consonant pattern converting device 7 matches frequency components of the consonant zone against consonant patterns stored in a consonant pattern storage device 8 and outputs a candidate consonant pattern. In this manner, the input voice is converted to a time series of positions of articulation of the vowels and consonant patterns.

A pattern matching device 9 calculates the similarity between the time series of positions of articulation of the vowels and consonant patterns derived from the input voice as described above and each of the reference patterns for words, which are derived in the same manner for respective known words and stored in a reference pattern storage device 10. Based on the result of the similarity calculation, the word is recognized and the recognition result is indicated on a result display section 11.
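As a rough illustration of zone determination, the sketch below labels each frame by its mean power alone and groups consecutive frames with the same label into zones. This is a deliberate simplification: the description says that changes in power and spectra "or the like" are consulted, and neither the fixed threshold nor the function names here come from the patent.

```python
from itertools import groupby

def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def label_zones(frames, threshold):
    """Return (label, start, end) zones over frame indices; end is exclusive.

    Frames at or above the power threshold are labeled "V" (vowel-like),
    the others "C" (consonant-like). The threshold is a hypothetical parameter.
    """
    labels = ["V" if frame_power(f) >= threshold else "C" for f in frames]
    zones, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        zones.append((label, start, start + length))
        start += length
    return zones

# Toy frames of two samples each (made-up amplitudes).
frames = [[0.1, -0.1], [0.9, -0.8], [0.8, -0.9], [0.05, 0.02]]
print(label_zones(frames, threshold=0.25))
```

Each "V" zone would then go to the position of articulation extracting device and each "C" zone to consonant pattern matching, giving the mixed time series described above.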

A first embodiment of the first invention will be described below in detail.

This embodiment is concerned with the position of articulation extracting device 5 for vowels, which calculates positions of articulation of vowels in a word from frequency components of the voice in accordance with a predetermined algorithm, based on frequency components of monophthongs whose phonation contents and positions of articulation are known. In this embodiment, the Japanese vowels (/a/, /i/, /u/, /e/, /o/) are used as monophthongs whose positions of articulation are known. FIG. 2 is a representation showing positions of articulation of various vowels. In FIG. 2, x represents the position of articulation in an antero-posterior (longitudinal) direction, with a larger value being nearer to the anterior (front) end. Also, y represents the position of articulation in a vertical direction, with a larger value being nearer to the lower side. In FIG. 2, each of the aforesaid Japanese vowels is indicated by a Katakana character enclosed in a circle. The position of articulation is now represented by the coordinates (x, y) within the following range:

1.ltoreq.x.ltoreq.7, 1.ltoreq.y.ltoreq.7

where x, y: integers

In this embodiment, it is assumed that each position of articulation is located on a lattice point of the coordinates. This is reasonable from the standpoint of auditory accuracy of a human being. Specifically, it is here assumed that the monophthong /a/ has a position of articulation (2, 7), the monophthong /i/ has a position of articulation (6, 2), the monophthong /u/ has a position of articulation (2, 2), the monophthong /e/ has a position of articulation (5, 4), and the monophthong /o/ has a position of articulation (1, 4). Based on the positions of articulation of the monophthongs thus set, positions of articulation of vowels in a word are each expressed by the coordinates (x, y).
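As an illustration only, the lattice and the five monophthong positions stated above can be written down as a small data structure. The names MONOPHTHONG_POSITIONS and is_lattice_point are hypothetical and do not appear in the specification; the coordinates are those given in the text.

```python
# Positions of articulation (x, y) of the Japanese monophthongs, as stated
# in the text. x: larger is nearer the front; y: larger is nearer the lower side.
MONOPHTHONG_POSITIONS = {
    "a": (2, 7),
    "i": (6, 2),
    "u": (2, 2),
    "e": (5, 4),
    "o": (1, 4),
}

GRID_MIN, GRID_MAX = 1, 7  # 1 <= x <= 7, 1 <= y <= 7


def is_lattice_point(x, y):
    """Return True if (x, y) is an integer point inside the 7 x 7 lattice."""
    return (
        isinstance(x, int)
        and isinstance(y, int)
        and GRID_MIN <= x <= GRID_MAX
        and GRID_MIN <= y <= GRID_MAX
    )


# Every monophthong position lies on the lattice.
assert all(is_lattice_point(x, y) for x, y in MONOPHTHONG_POSITIONS.values())
```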

The relationship between positions of articulation of vowels and frequency components of vowels phonated at those positions of articulation will now be described. FIG. 3 is a representation which indicates respective ranges of a first formant frequency (hereinafter expressed by F(1)) and a second formant frequency (hereinafter expressed by F(2)) of the Japanese vowels shown in FIG. 2 for males and females. FIG. 4 is a representation showing the relationship between F(1) and F(2) for various vowels other than the Japanese vowels, which are phonated by a specific speaker. From FIGS. 2, 3 and 4, it is found that the relationship between the formant frequencies and the positions of articulation is generally given by proportional relations between F(1) and y and between F(2) and x. For some vowels (/i/ and /e/), the values of x, y are also affected by increases and decreases in a third formant frequency (hereinafter expressed by F(3)). Based on those relationships, positions of articulation of vowels in a word are predicted from frequency components of the vowels in the word.

Next, a method of predicting positions of articulation of vowels in a word will be explained in more detail.

As described in connection with FIG. 1, the waveform of an input voice is first sectioned into vowel zones and consonant zones and labeled by the acoustic analyzing device 3 and the vowel/consonant zone determining device 4, and is also subjected to acoustic analysis to extract formant frequencies. In this embodiment, the formant frequencies of the phoneme zones thus labeled as vowels are used for prediction.

FIG. 5 is a flowchart of the position of articulation calculating operation of one vowel in a word to be executed in the position of articulation extracting device 5 of FIG. 1.

In step S1, the formant frequencies of the zone determined by the vowel/consonant zone determining device 4 as a vowel zone are input and the kind of label (i.e., the phonation content) which is added to the formant frequencies of the input vowel zone is determined. In accordance with the label determined, the process goes to any one of steps S2, S3, S4, S5 and S6.

In step S2, a position of articulation calculating routine for a vowel /a/, described later in detail, is executed to complete the position of articulation calculating operation for one vowel.

In step S3, a position of articulation calculating routine for a vowel /i/, described later in detail, is executed to complete the position of articulation calculating operation for one vowel.

In step S4, a position of articulation calculating routine for a vowel /u/, described later in detail, is executed to complete the position of articulation calculating operation for one vowel.

In step S5, a position of articulation calculating routine for a vowel /e/, described later in detail, is executed to complete the position of articulation calculating operation for one vowel.

In step S6, a position of articulation calculating routine for a vowel /o/, described later in detail, is executed to complete the position of articulation calculating operation for one vowel.
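The branching of steps S1 to S6 amounts to selecting one routine by vowel label. A minimal sketch follows, assuming the label is available as a one-letter string and each routine takes a sequence of formant frequencies and returns (x, y). The function dispatch_vowel and the placeholder routines are hypothetical; the placeholders simply return the monophthong positions and do not reproduce the actual routines of the specification.

```python
from typing import Callable, Dict, Sequence, Tuple

Position = Tuple[int, int]
Routine = Callable[[Sequence[float]], Position]


def dispatch_vowel(label: str, formants: Sequence[float],
                   routines: Dict[str, Routine]) -> Position:
    """Step S1: pick the routine matching the label, then run it (S2 to S6)."""
    try:
        routine = routines[label]
    except KeyError:
        raise ValueError(f"unknown vowel label: {label!r}") from None
    return routine(formants)


# Hypothetical stand-ins for the five routines: each ignores its input and
# returns the lattice position of the corresponding monophthong.
MONOPHTHONG_POSITIONS = {
    "a": (2, 7), "i": (6, 2), "u": (2, 2), "e": (5, 4), "o": (1, 4),
}
PLACEHOLDER_ROUTINES: Dict[str, Routine] = {
    vowel: (lambda formants, pos=pos: pos)
    for vowel, pos in MONOPHTHONG_POSITIONS.items()
}
```

With these placeholders, dispatch_vowel("e", [500.0, 1900.0, 2500.0], PLACEHOLDER_ROUTINES) selects the /e/ branch; the formant values in that call are made up for illustration.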

The position of articulation calculating routines for respective vowels executed in steps S2 to S6 will be explained below in more detail.

(A) Position of Articulation Calculating Routine for Vowel /a/

In the vicinity of the position of articulation of the monophthong, F(1) and F(2) vary non-linearly with changes in the position of articulation. Therefore, a table for directly converting the values of F(1), F(2) of the vowel /a/ in the word to a position of articulation (hereinafter referred to as a conversion table) is prepared (one example shown in the following Table 1) and stored in the transform equation storage device 6.

TABLE 1
______________________________________________________________________
x     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
______________________________________________________________________
 1   23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28
 2   23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28
 3   30, 30, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28
 4   30, 30, 31, 31, 31, 32, 32, 33, 33, 34, 34, 34, 34, 35, 35, 28
 5   30, 30, 31, 31, 31, 32, 32, 33, 33, 34, 34, 34, 34, 35, 35, 35
 6   30, 30, 31, 31, 31, 39, 39, 40, 40, 40, 41, 41, 41, 42, 35, 35
 7   30, 30, 38, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 35
 8   37, 37, 38, 38, 38, 39, 39, 40, 40, 48, 48, 48, 42, 42, 42, 35
 9   37, 37, 38, 38, 38, 38, 39, 40, 47, 47, 48, 48, 42, 42, 42, 42
10   37, 37, 38, 38, 38, 38, 39, 47, 47, 47, 48, 48, 48, 49, 49, 42
11   45, 45, 45, 45, 45, 45, 45, 46, 47, 47, 48, 48, 48, 48, 49, 49
12   45, 45, 45, 45, 45, 45, 45, 45, 46, 47, 47, 48, 48, 48, 48, 49
______________________________________________________________________

This conversion table was prepared by asking many speakers to produce voices at various positions of articulation, and then considering the relationship between the positions of articulation and the formant frequencies. The coordinates of the monophthongs on the conversion table (hereinafter referred to as table positions) are expressed by (I, J) as follows. The table position of the monophthong /a/ is given by (8, 11), the table position of the monophthong /e/ is given by (2, 4), and the table position of the monophthong /o/ is given by (2, 15). I increases and decreases with F(1) (i.e., with y), while J increases and decreases with F(2) (i.e., with x).

The positions of articulation of vowels in a word are calculated from their F(1), F(2) using the above conversion table in the following manner. When F(2) of the vowel /a/ in the word is higher than F(2) of the monophthong /a/, the former's position of articulation is shifted toward the position of articulation of the monophthong /e/. Accordingly, F(1) of the vowel /a/ in the word is normalized from F(1) of the monophthong /a/ and F(1) of the monophthong /e/ to derive I of the table position (I, J) of the vowel /a/ in the word. Then, F(2) of the vowel /a/ in the word is normalized from F(2) of the monophthong /a/ and F(2) of the monophthong /e/ to derive J of the table position (I, J) of the vowel /a/ in the word. The table position (I, J) of the vowel /a/ in the word is thus calculated. When F(2) of the vowel /a/ in the word is lower than F(2) of the monophthong /a/, the former's position of articulation is shifted toward the position of articulation of the monophthong /o/. Accordingly, F(1) of the vowel /a/ in the word is normalized from F(1) of the monophthong /a/ and F(1) of the monophthong /o/ to derive I of the table position (I, J) of the vowel /a/ in the word. Then, F(2) of the vowel /a/ in the word is normalized from F(2) of the monophthong /a/ and F(2) of the monophthong /o/ to derive J of the table position (I, J) of the vowel /a/ in the word. The table position (I, J) of the vowel /a/ in the word is thus calculated.
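One possible reading of "normalized" in the paragraph above is linear interpolation between the table position of the monophthong /a/ and that of the neighboring monophthong (/e/ or /o/). The sketch below makes that assumption explicit. The exact normalization formula is not given at this point in the text, and the function names, the rounding and clamping, and the formant values in the usage note are all hypothetical.

```python
# Table positions (I, J) of the monophthongs, as stated in the text.
TABLE_POS = {"a": (8, 11), "e": (2, 4), "o": (2, 15)}
TABLE_ROWS, TABLE_COLS = 12, 16  # size of Table 1


def interpolate(f, f_from, f_to, t_from, t_to):
    """Map f linearly so that f_from -> t_from and f_to -> t_to (assumed form)."""
    ratio = (f - f_from) / (f_to - f_from)
    return t_from + ratio * (t_to - t_from)


def table_position_for_a(f1, f2, mono):
    """Table position (I, J) for a vowel /a/ in a word.

    mono maps a vowel label to the (F1, F2) pair of that monophthong.
    F2 above the monophthong /a/ selects /e/ as the neighbor; otherwise /o/
    is used (the text does not say which branch handles equality).
    """
    neighbor = "e" if f2 > mono["a"][1] else "o"
    i_a, j_a = TABLE_POS["a"]
    i_n, j_n = TABLE_POS[neighbor]
    i = interpolate(f1, mono["a"][0], mono[neighbor][0], i_a, i_n)
    j = interpolate(f2, mono["a"][1], mono[neighbor][1], j_a, j_n)
    # Rounding and clamping to the table bounds are assumptions of this sketch.
    i = min(max(round(i), 1), TABLE_ROWS)
    j = min(max(round(j), 1), TABLE_COLS)
    return i, j
```

For example, with made-up monophthong formants mono = {"a": (750.0, 1200.0), "e": (500.0, 1900.0), "o": (500.0, 800.0)}, an input equal to the monophthong /a/ maps to its own table position (8, 11), and an input equal to the monophthong /e/ maps to (2, 4).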

Afterward, the value at the thus-calculated table position (I, J) on the conversion table (hereinafter referred to as TABLE (I, J)) is read from the conversion table. Based on TABLE (I, J), the position of articulation (x, y) of the vowel /a/ in the word is calculated using the relational equation (1) between TABLE (I, J) and the position of articulation (x, y) below: ##EQU1## where [N] is the maximum integer not exceeding N, and N = [(TABLE (I, J) - 1)/7].
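The lookup of TABLE (I, J) and the computation of N can be sketched as follows. This assumes that I indexes the rows and J the columns of Table 1, both starting at 1. Equation (1) itself is not reproduced in this text, so the sketch stops at N and does not derive (x, y). The names TABLE_1, table_value and floor_index are hypothetical.

```python
# Table 1 of the text, 12 rows by 16 columns.
TABLE_1 = [
    [23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28],
    [23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28],
    [30, 30, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28],
    [30, 30, 31, 31, 31, 32, 32, 33, 33, 34, 34, 34, 34, 35, 35, 28],
    [30, 30, 31, 31, 31, 32, 32, 33, 33, 34, 34, 34, 34, 35, 35, 35],
    [30, 30, 31, 31, 31, 39, 39, 40, 40, 40, 41, 41, 41, 42, 35, 35],
    [30, 30, 38, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 35],
    [37, 37, 38, 38, 38, 39, 39, 40, 40, 48, 48, 48, 42, 42, 42, 35],
    [37, 37, 38, 38, 38, 38, 39, 40, 47, 47, 48, 48, 42, 42, 42, 42],
    [37, 37, 38, 38, 38, 38, 39, 47, 47, 47, 48, 48, 48, 49, 49, 42],
    [45, 45, 45, 45, 45, 45, 45, 46, 47, 47, 48, 48, 48, 48, 49, 49],
    [45, 45, 45, 45, 45, 45, 45, 45, 46, 47, 47, 48, 48, 48, 48, 49],
]


def table_value(i, j):
    """Return TABLE(I, J), assuming 1-based row I and 1-based column J."""
    if not (1 <= i <= len(TABLE_1) and 1 <= j <= len(TABLE_1[0])):
        raise IndexError(f"table position out of range: ({i}, {j})")
    return TABLE_1[i - 1][j - 1]


def floor_index(value):
    """N = [(TABLE(I, J) - 1) / 7], the largest integer not exceeding the quotient."""
    return (value - 1) // 7
```

Under the row and column assumption, the table position (8, 11) stated for the monophthong /a/ holds the value 48, for which N is 6.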

FIG. 6 is a flowchart of a position of articulation calculating routine for a vowel /a/ in the word employed in the flowchart of FIG. 5. The variables used in the following explanation of the position of articulation calculating routines for respective vowels are defined below:

F^V(n) (V = a, i, u, e, o; n = 1, 2, 3) . . . n-th formant frequency of a vowel V in the word

F^V_lV(n) (V = a, i, u, e, o; n = 1, 2, 3) . . . n-th formant frequency of a monophthong V

(I, J) (V = a, i, u, e, o) . . . table position of the vowel V in the word

(I_V, J_V) (V = a, i, u, e, o) . . . table position of the monophthong V

The position of articulation calculating routine for a vowel /a/ in the word will be described below in more detail with reference to FIG. 6.

Step S11 determines whether