Claims
We claim:
1. A speech coding apparatus comprising:
means for storing a plurality of classes each having an identifier
represented by at least two of a plurality of prototypes, each of the
plurality of prototypes having at least one prototype value;
transducer means for extracting from an utterance a feature vector signal
having at least one feature value;
means for establishing a match between the feature vector signal and at
least one of the classes by selecting from the plurality of prototypes at
least one prototype having a prototype value that best matches the feature
value of the feature vector signal; and
means for coding the feature vector signal with the identifier of the class
represented by the selected at least one prototype vector.
2. Speech coding apparatus of claim 1, wherein the prototype value of the
at least one prototype is computed from at least means, variances and a
priori probabilities of a set of acoustic feature vectors associated with
the prototype.
3. Speech coding apparatus of claim 1, wherein the prototype value of the
at least one prototype is computed by associating location of the feature
value of the one feature vector signal on a probability distribution
function of the prototype.
4. Speech coding apparatus of claim 1, wherein each class of the plurality
of classes is represented by a plurality of prototypes whose respective
prototype values are considered as a whole against the feature value of
the feature vector signal to determine whether the feature vector signal
corresponds to the class.
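The matching and coding of claims 1 to 4 can be illustrated with a minimal sketch. It assumes, beyond what the claims state, that each prototype is a diagonal-covariance Gaussian whose score is its a priori probability times its likelihood, and that a class is scored by summing the scores of its prototypes. The names `prototype_score`, `code_feature_vector` and the toy inventory `CLASSES` are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def prototype_score(x, proto):
    """Prior-weighted likelihood of x under one prototype (assumed scoring rule)."""
    return proto["prior"] * math.exp(log_gaussian(x, proto["mean"], proto["var"]))

def code_feature_vector(x, classes):
    """Return the identifier of the class whose prototypes, taken as a whole, best match x."""
    best_id, best_score = None, -1.0
    for class_id, protos in classes.items():
        score = sum(prototype_score(x, p) for p in protos)
        if score > best_score:
            best_id, best_score = class_id, score
    return best_id

# Hypothetical toy inventory: two classes, each represented by two prototypes.
CLASSES = {
    "AA": [
        {"mean": [0.0, 0.0], "var": [1.0, 1.0], "prior": 0.3},
        {"mean": [1.0, 0.5], "var": [1.0, 1.0], "prior": 0.2},
    ],
    "IY": [
        {"mean": [5.0, 5.0], "var": [1.0, 1.0], "prior": 0.25},
        {"mean": [6.0, 4.0], "var": [1.0, 1.0], "prior": 0.25},
    ],
}

label = code_feature_vector([0.4, 0.2], CLASSES)
```

Summing over the prototypes of a class, rather than taking only the single best prototype, is one way to read "considered as a whole" in claim 4; selecting the single best-scoring prototype would correspond more closely to claim 1.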
5. Speech coding apparatus of claim 1, further comprising:
means for storing a plurality of training classes;
means for measuring and transforming training utterances into a series of
training feature vectors each having a feature value; and
means for correlating each of the series of training feature vectors with
one of the training classes to generate the plurality of stored classes.
6. Speech coding apparatus of claim 5, further comprising:
means for measuring and extracting from utterances over successive
predetermined time periods corresponding successive sets of feature
vectors, each feature vector of the successive sets of feature vectors
having a dimensionality of at least one feature value;
means for merging the feature vectors in each of the successive sets of
feature vectors to form a plurality of consolidated feature vectors whose
respective dimensionalities are the sum of the dimensionalities of the
corresponding merged feature vectors, the consolidated feature vectors
being more adaptable for discrimination between the stored training
classes; and
means for spatially reorienting the consolidated feature vectors to reduce
their dimensionality to thereby effect easier manipulation thereof.
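The merging and reorienting of claim 6 can be sketched as follows. The claim does not name the transformation, so the sketch assumes a fixed linear projection applied to each consolidated vector; the helper names `splice` and `project` and the matrix `PROJECTION` are hypothetical.

```python
def splice(frames, context):
    """Concatenate each frame with `context` neighbours on each side (edges are clamped)."""
    spliced = []
    last = len(frames) - 1
    for t in range(len(frames)):
        merged = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)
            merged.extend(frames[idx])
        spliced.append(merged)
    return spliced

def project(vector, matrix):
    """Apply a linear map given as a list of rows; the output has len(matrix) values."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Each consolidated vector has 3 * 2 = 6 values: the sum of the merged dimensionalities.
spliced = splice(frames, context=1)

# Hypothetical 2 x 6 projection; in practice the rows might come from a
# principal component or discriminant analysis of the training data.
PROJECTION = [
    [1 / 3, 0, 1 / 3, 0, 1 / 3, 0],
    [0, 1 / 3, 0, 1 / 3, 0, 1 / 3],
]
reduced = [project(v, PROJECTION) for v in spliced]
```

Here the consolidated vectors grow from 2 to 6 values and the projection brings them back to 2, which is the "easier manipulation" the claim refers to.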
7. Speech coding apparatus of claim 6, wherein each of the training classes
is divided into training subclasses, further comprising:
means for configuring the training subclasses as respective training
distribution functions having corresponding means, variances and a priori
probabilities; and
means for storing the training distribution functions, each of the training
distribution functions representing a training prototype.
8. Speech coding apparatus of claim 7, wherein each of the stored classes
has at least one subcomponent; and
wherein the correlating means correlates the series of feature vectors with
the at least one subcomponent to generate a plurality of stored component
classes.
9. Speech coding apparatus of claim 8, wherein the configuring means
further configures the plurality of component classes as respective
distribution functions each having corresponding means, variances and a
priori probabilities; further comprising:
means for storing the distribution functions representing the component
classes, each of the distribution functions of the component classes
representing a prototype.
10. Speech coding apparatus of claim 1, wherein the coding means comprises:
a quantizing means for outputting a label corresponding to the coded
feature vector signal.
11. Speech coding apparatus of claim 1, wherein the establishing means
comprises:
means for grouping a plurality of speech feature vectors into a
predetermined number of prototypes each having respective means, variances
and a priori probabilities; and
means for dividing each of the predetermined number of prototypes into at
least two sub-prototypes to better differentiate the feature vector signal
from other feature vector signals.
12. A speech coding apparatus comprising:
means for storing a plurality of prototypes representative of a plurality
of classes, each class having an identifier represented by at least two of
the plurality of prototypes, each of the plurality of prototypes having at
least one prototype value;
transducer means for extracting from an utterance a feature vector signal
having at least one feature value;
means for establishing a match between the feature vector signal and at
least one class by comparing the feature value of the feature vector
signal against the respective prototype values of the prototypes;
means for coding the feature vector signal with the identifier of the class
represented by any of the prototypes having a prototype value most closely
matching the feature value of the feature vector signal.
13. Speech coding apparatus of claim 12, wherein each class is represented
by a number of prototypes of the plurality of prototypes, the respective
prototype values of the prototypes of each class being considered as a
whole against the feature value of the feature vector signal to determine
which class of the plurality of classes the feature vector signal best
corresponds to.
14. Speech coding apparatus of claim 12, wherein the prototype value of
each prototype is computed from at least means, variances and a priori
probabilities of a set of acoustic feature vectors associated with the
prototype.
15. Speech coding apparatus of claim 12, wherein each prototype has a score
value computed by associating location of the feature value of the one
feature vector signal on a probability distribution function of the
prototype.
16. Speech coding apparatus of claim 12, further comprising:
means for storing a plurality of training classes;
means for measuring and transforming training utterances into a series of
training feature vectors each having a feature value; and
means for correlating each of the series of training feature vectors with
one of the training classes to generate the plurality of stored classes.
17. Speech coding apparatus of claim 16, further comprising:
means for measuring and extracting from utterances over successive
predetermined time periods corresponding successive sets of feature
vectors, each feature vector of the successive sets of feature vectors
having a dimensionality and at least one feature value;
means for merging the feature vectors in each of the successive sets of
feature vectors to form a plurality of consolidated feature vectors whose
respective dimensionalities are the sum of the dimensionalities of the
corresponding merged feature vectors, the consolidated feature vectors
being more adaptable for discrimination between the stored training
classes; and
means for spatially reorienting the consolidated feature vectors to reduce
their dimensionality to thereby afford easier manipulation thereof.
18. Speech coding apparatus of claim 17, wherein each of the training
classes is divided into training subclasses, further comprising:
means for configuring the training subclasses as respective training
distribution functions having corresponding means, variances and a priori
probabilities; and
means for storing the training distribution functions, each of the training
distribution functions representing a training prototype.
19. Speech coding apparatus of claim 18, wherein each of the stored classes
has at least one subcomponent; and
wherein the correlating means correlates the series of feature vectors with
the at least one subcomponent to generate a plurality of stored component
classes.
20. Speech coding apparatus of claim 19, wherein the configuring means
further configures the plurality of component classes as respective
distribution functions each having corresponding means, variances and a
priori probabilities; further comprising:
means for storing the distribution functions representing the component
classes, each of the distribution functions of the component classes
representing a prototype.
21. Speech coding apparatus of claim 12, wherein the coding means
comprises:
a quantizing means for outputting a label corresponding to the coded
feature vector signal.
22. Speech coding apparatus of claim 12, wherein the establishing means
comprises:
means for grouping a plurality of speech feature vectors into a
predetermined number of prototypes each having respective means, variances
and a priori probabilities; and
means for dividing each of the predetermined number of prototypes into at
least two sub-prototypes to better differentiate the feature vector signal
from other feature vector signals.
23. A method of coding speech comprising the steps of:
(a) storing in a memory means a plurality of classes each having an
identifier represented by at least two of a plurality of prototypes, each
of the plurality of prototypes having at least one prototype value;
(b) using transducer means to extract from an utterance a feature vector
signal having at least one feature value;
(c) establishing a correspondence between the feature vector signal and at
least one class of the plurality of classes by selecting from among a
plurality of prototypes at least one prototype whose prototype value most
closely matches the feature value of the feature vector signal; and
(d) coding the feature vector signal with the identifier of the class
represented by the selected at least one prototype.
24. The method of claim 23, wherein prior to step (a), the method further
comprising the steps of:
establishing an inventory of training classes;
extracting training feature vectors from a string of training text; and
correlating each of the feature vectors with one of the training classes.
25. The method of claim 24, further comprising the steps of:
measuring and extracting from utterances over successive predetermined
periods of time corresponding successive sets of feature vectors, each
feature vector of the successive sets of feature vectors having a
dimensionality of at least one feature value;
merging the feature vectors in each of the successive sets of feature
vectors to form a plurality of consolidated feature vectors whose
respective dimensionalities are the sum of the dimensionalities of the
corresponding merged feature vectors, the consolidated feature vectors
being more adaptable for discrimination between the stored training
classes; and
spatially reorienting the consolidated feature vectors to reduce their
dimensionalities to thereby effect easier manipulation thereof.
26. The method of claim 25, further comprising the steps of:
establishing the number of prototypes required to provide adequate
representation of a class; and wherein for each of the training classes,
the method further comprising the steps of:
selecting a number of training prototypes;
calculating respective new training prototypes by averaging the respective
values of feature vectors situated proximate to each of the training
prototypes until the average distance between the feature vectors remains
substantially constant; and
successively replacing the two closest new training prototypes with another
new training prototype whose value is the average of the values of the
replaced training prototypes until a predetermined number of another
training prototypes remains.
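The clustering steps of claim 26 can be sketched as a re-averaging loop followed by pairwise merging. The sketch assumes squared Euclidean distance, assignment of each vector to its nearest prototype, and a small tolerance `tol` standing in for "substantially constant"; the function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise arithmetic average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refine(vectors, prototypes, tol=1e-6, max_iter=100):
    """Re-average the vectors nearest each prototype until the mean distance settles."""
    previous = float("inf")
    for _ in range(max_iter):
        groups = [[] for _ in prototypes]
        total = 0.0
        for v in vectors:
            dists = [sq_dist(v, p) for p in prototypes]
            k = dists.index(min(dists))
            groups[k].append(v)
            total += dists[k]
        # A prototype that attracted no vectors is left where it was.
        prototypes = [mean_vector(g) if g else p for g, p in zip(groups, prototypes)]
        average = total / len(vectors)
        if abs(previous - average) <= tol:
            break
        previous = average
    return prototypes

def merge_closest(prototypes, target):
    """Replace the two closest prototypes by their average until `target` remain."""
    prototypes = [list(p) for p in prototypes]
    while len(prototypes) > target:
        pairs = [(sq_dist(prototypes[i], prototypes[j]), i, j)
                 for i in range(len(prototypes))
                 for j in range(i + 1, len(prototypes))]
        _, i, j = min(pairs)
        merged = mean_vector([prototypes[i], prototypes[j]])
        prototypes = [p for k, p in enumerate(prototypes) if k not in (i, j)] + [merged]
    return prototypes

data = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0], [9.0, 9.0], [9.2, 9.0]]
seeds = [data[0], data[2], data[4]]   # hypothetical initial selection
refined = refine(data, seeds)
final = merge_closest(refined, target=2)
```

On this toy data the two nearby clusters are merged into one prototype and the distant cluster keeps its own, leaving the predetermined number of two.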
27. The method of claim 26, further comprising the steps of:
using a distribution analysis on the predetermined number of training
prototypes to calculate a corresponding set of new training prototypes
each having estimated means, variances and a priori probabilities; and
dividing each new training prototype into corresponding additional training
prototypes.
28. The method of claim 24, wherein the correlating step comprises
utilizing a Viterbi alignment technique.
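The claim names Viterbi alignment without fixing a model, so the following is only an illustrative sketch. It assumes a left-to-right sequence of models, each summarised by a single mean vector, with a local cost equal to squared distance (equivalent to a unit-variance Gaussian up to constants) and no transition costs. The function name `viterbi_align` and the toy data are hypothetical, and the sketch requires at least as many frames as models.

```python
def viterbi_align(frames, model_means):
    """Assign each frame to one model in a left-to-right sequence,
    minimising the total squared distance along the path."""
    T, S = len(frames), len(model_means)
    INF = float("inf")
    cost = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    def local(t, s):
        return sum((x - m) ** 2 for x, m in zip(frames[t], model_means[s]))

    cost[0][0] = local(0, 0)  # the alignment must start in the first model
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            advance = cost[t - 1][s - 1] if s > 0 else INF
            if advance < stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best < INF:
                cost[t][s] = best + local(t, s)

    # Trace back from the last model at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

frames = [[0.1], [0.0], [1.1], [0.9], [2.0]]
model_means = [[0.0], [1.0], [2.0]]
alignment = viterbi_align(frames, model_means)
```

The returned list gives, for each training feature vector, the index of the model it is correlated with, which is the correspondence the correlating step needs.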
29. The method of claim 23, wherein step (c) further comprises the steps
of:
establishing the number of prototypes required to provide adequate
representation for a class; and wherein for each of the classes, the
method further comprising the steps of:
selecting a number of prototypes;
calculating respective new prototypes by averaging the respective values of
feature vectors situated proximate to each of the prototypes until the
average distance between the feature vectors remains substantially
constant; and
successively replacing the two closest new prototypes with another new
prototype whose value is the average of the values of the replaced
prototypes until a predetermined number of another prototypes remains.
30. The method of claim 29, further comprising the steps of:
using a distribution analysis on the predetermined number of the another
prototypes to calculate a corresponding set of prototypes each having
estimated means, variances and a priori probabilities;
dividing each prototype having the estimated means, variances and a priori
probabilities into additional prototypes to provide a greater number of
prototypes for comparison with the feature vector signal.
31. The method of claim 23, wherein the prototype value of the at least one
prototype is computed from means, variances and a priori probabilities of
a set of acoustic feature vectors associated with the prototype.
32. The method of claim 23, wherein the prototype value of the at least one
prototype is computed by associating the location of the feature value of
the one feature vector signal on a probability distribution function of
the prototype.
33. A method of coding speech comprising the steps of:
(a) storing in a memory means a plurality of prototype vectors
representative of a plurality of classes, each class having an identifier
represented by at least one of the plurality of prototype vectors, each of
the plurality of prototype vectors having at least one prototype value;
(b) using transducer means to extract from an utterance a feature vector
signal having a feature value;
(c) establishing a correspondence between the feature vector signal and at
least one class by comparing the feature value of the feature vector
signal against the respective prototype values of the prototype vectors;
(d) coding the feature vector signal with the identifier of the class
represented by any of the prototype vectors having a prototype value that
most closely matches the feature value of the feature vector signal.
34. The method of claim 33, wherein each class is represented by a number
of prototype vectors of the plurality of prototype vectors, and wherein
the method further comprising the step of:
considering the respective prototype values of the prototype vectors of
each class as a whole against the feature value of the feature vector
signal to determine which class of the plurality of classes the feature
vector signal best corresponds to.
35. The method of claim 33, wherein prior to step (a), the method further
comprising the steps of:
establishing an inventory of training classes;
extracting training feature vectors from a string of training text; and
correlating each of the feature vectors with one of the training classes.
36. The method of claim 35, further comprising the steps of:
measuring and extracting from utterances over successive predetermined
periods of time corresponding successive sets of feature vectors, each
feature vector of the successive sets of feature vectors having a
dimensionality of at least one feature value;
merging the feature vectors in each of the successive sets of feature
vectors to form a plurality of consolidated feature vectors whose
respective dimensionalities are the sum of the dimensionalities of the
corresponding merged feature vectors, the consolidated feature vectors
being more adaptable for discrimination between the stored training
classes; and
spatially reorienting the consolidated feature vectors to reduce their
dimensionalities to thereby effect easier manipulation thereof.
37. The method of claim 36, further comprising the steps of:
establishing the number of prototype vectors required to provide adequate
representation of a class; and wherein for each of the training classes,
the method further comprising the steps of:
selecting a number of training prototype vectors;
calculating respective new training prototype vectors by averaging the
respective values of feature vectors situated proximate to each of the
training prototype vectors until the average distance between the feature
vectors remains substantially constant; and
successively replacing the two closest new training prototype vectors with
another new training prototype vector whose value is the average of the
values of the replaced training prototype vectors until a predetermined
number of another training prototype vectors remains.
38. The method of claim 37, further comprising the steps of:
using a distribution analysis on the predetermined number of training
prototype vectors to calculate a corresponding set of new training
prototype vectors each having estimated means, variances and a priori
probabilities; and
dividing each new training prototype vector into corresponding additional
training prototype vectors.
39. The method of claim 33, wherein step (c) further comprises the steps
of:
establishing the number of prototype vectors required to provide adequate
representation for a class; and wherein for each of the classes, the
method further comprising the steps of:
selecting a number of prototype vectors;
calculating respective new prototype vectors by averaging the respective
values of feature vectors situated proximate to each of the prototype
vectors until the average distance between the feature vectors remains
substantially constant; and
successively replacing the two closest new prototype vectors with another
new prototype vector whose value is the average of the values of the
replaced prototype vectors until a predetermined number of another
prototype vectors remains.
40. The method of claim 39, further comprising the steps of:
using a distribution analysis on the predetermined number of the another
prototype vectors to calculate a corresponding set of prototype vectors
each having estimated means, variances and a priori probabilities;
dividing each prototype vector having the estimated means, variances and a
priori probabilities into additional prototype vectors to provide a
greater number of prototype vectors for comparison with the feature vector
signal.
41. The method of claim 33, wherein the correlating step comprises
utilizing a Viterbi alignment technique.
42. The method of claim 33, wherein the prototype value of the at least one
prototype vector is computed from means, variances and a priori
probabilities of a set of acoustic feature vectors associated with the
prototype.
43. The method of claim 33, wherein the prototype value of the at least one
prototype vector is computed by associating location of the feature value
of the one feature vector signal on a probability distribution function of
the prototype vector.
44. A speech coding apparatus comprising:
means for storing two or more prototype vector signals, each prototype
vector signal representing a prototype vector having an identifier and at
least two partitions, each partition having at least one partition value;
transducer means for measuring value of at least one feature of an
utterance during a time interval to produce a feature vector signal
representing the value of the at least one feature of the utterance;
means for calculating a match score for each partition, each partition
match score representing the value of a match between the partition value
of the partition and the feature value of the feature vector signal;
means for calculating a prototype match score for each prototype vector,
each prototype match score representing a function of the partition match
scores for all partitions in the prototype vector; and
means for coding the feature vector signal with the identifier of the
prototype vector signal having a best prototype match score.
45. An apparatus as claimed in claim 44, characterized in that:
each partition match score is proportional to the joint probability of
occurrence of the feature value of the feature vector signal and the
partition value of the partition; and
the prototype match score represents the sum of the partition match scores
for all partitions in the prototype vector.
46. An apparatus as claimed in claim 45, further comprising means for
generating prototype vector signals, said prototype vector signal
generating means comprising:
means for measuring the value of at least one feature of a training
utterance during each of a series of successive first time intervals to
produce a series of training feature vector signals, each training feature
vector signal corresponding to a first time interval, each
training feature vector signal representing the value of at least one
feature of the training utterance during a second time interval containing
the corresponding first time interval, each second time interval being
greater than or equal to the corresponding first time interval;
means for providing a network of elemental models corresponding to the
training utterance;
means for correlating the training feature vector signals in the series of
training feature vector signals to the elemental models in the network of
elemental models corresponding to the training utterance so that each
training feature vector signal in the series of training feature vector
signals corresponds to one elemental model in the network of elemental
models corresponding to the training utterance;
means for selecting a fundamental set of all training feature vector
signals which correspond to all occurrences of a first elemental model in
the network of elemental models corresponding to the training utterance;
means for selecting at least first and second different subsets of the
fundamental set of training feature vector signals to form a first label
set of training feature vector signals;
means for calculating centroid of the feature values of the training
feature vector signals of each of the first and second subsets of the
fundamental set; and
means for storing a first prototype vector signal corresponding to the
first label set of training feature vector signals, said first prototype
vector signal representing a first prototype vector having at least first
and second partitions, each partition having at least one partition value,
the first partition having a partition value equal to the value of the
centroid of the feature values of the training feature vector signals in
the first subset of the fundamental set, the second partition having a
partition value equal to the value of the centroid of the feature values
of the training feature vector signals in the second subset of the
fundamental set.
47. An apparatus as claimed in claim 46, characterized in that the centroid
is arithmetic average.
48. An apparatus as claimed in claim 47, characterized in that the network
of elemental models is a series of elemental models.
49. An apparatus as claimed in claim 48, characterized in that:
the fundamental set of training feature vector signals is divided into at
least first, second and third subsets of training feature vector signals;
the calculating means further calculates the centroid of the feature values
of the training feature vector signals in the third subset; and
the apparatus further comprises means for storing a second prototype vector
signal, said second prototype vector signal representing the value of the
centroid of the feature values of the training feature vector signals in
the third subset of the fundamental set.
50. An apparatus as claimed in claim 49, characterized in that:
the feature values of the training feature vector signals in each subset of
the fundamental set have a feature value variance and an a priori
probability;
the apparatus further comprises means for calculating the variance and a
priori probability of the feature values of the training feature vector
signals in each subset of the fundamental set;
the first partition of the first prototype vector has a further partition
value equal to the value of the variance and a priori probability of the
feature values of the training feature vector signals in the first subset
of the fundamental set;
the second partition of the first prototype vector has a further partition
value equal to the value of the variance and a priori probability of the
feature values of the training feature vector signals in the second subset
of the fundamental set; and
the second prototype signal represents the value of the variance and a
priori probability of the feature values of the training feature vector
signals in the third subset of the fundamental set.
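The per-subset quantities named in claims 46 to 50 can be computed as in the sketch below. It takes the centroid as the arithmetic average (as in claim 47) and assumes, beyond the claim text, that the variance is the per-dimension population variance and that the a priori probability is the subset's share of all training vectors. The function name `subset_statistics` and the toy subsets are hypothetical.

```python
def subset_statistics(subsets):
    """For each subset of training vectors, return its centroid,
    per-dimension variance and a priori probability."""
    total = sum(len(s) for s in subsets)
    stats = []
    for vectors in subsets:
        n = len(vectors)
        columns = list(zip(*vectors))
        centroid = [sum(col) / n for col in columns]
        variance = [sum((x - c) ** 2 for x in col) / n
                    for col, c in zip(columns, centroid)]
        stats.append({
            "centroid": centroid,
            "variance": variance,
            "prior": n / total,   # assumed: relative frequency of the subset
        })
    return stats

# Hypothetical first, second and third subsets of a fundamental set.
subsets = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[4.0, 4.0], [6.0, 8.0], [5.0, 6.0]],
    [[10.0, 10.0]],
]
stats = subset_statistics(subsets)
```

A subset with a single vector yields zero variance, so a practical system would likely need a variance floor before using these values in a probability density; that detail is not addressed by the claims.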
51. An apparatus as claimed in claim 50, characterized in that:
the apparatus further comprises means for estimating conditional
probability of occurrence of each subset of the fundamental set of
training feature vector signals given the occurrence of the first label
set;
the apparatus further comprises means for estimating the probability of
occurrence of the first label set of training feature vector signals;
the first prototype vector further represents the estimated probability of
occurrence of the first label set of training feature vector signals;
the first partition of the first prototype vector has a further partition
value equal to the estimated conditional probability of occurrence of the
first subset of the fundamental set of training feature vector signals
given the occurrence of the first label set; and
the second partition of the first prototype vector has a further partition
value equal to the estimated conditional probability of occurrence of the
second subset of the fundamental set of training feature vector signals
given the occurrence of the first label set.
52. An apparatus as claimed in claim 51, characterized in that:
each second time interval is equal to at least two first time intervals;
and
each feature vector signal comprises at least two feature values of the
utterance at two different times.
53. An apparatus as claimed in claim 52, characterized in that each feature
vector signal represents values of m features, where m is an integer
greater than or equal to two;
each partition has n partition values, where n is less than m; and
the apparatus further comprises means for transforming the m values of each
feature vector signal to n values prior to calculating the centroids, and
variances and a priori probability of the subsets.
54. An apparatus as claimed in claim 53, characterized in that:
the elemental models are elemental probabilistic models;
the correlating means comprises means for aligning the feature vector
signals and the elemental probabilistic models.
55. A speech coding method comprising the steps of:
storing two or more prototype vector signals, each prototype vector signal
representing a prototype vector having an identifier and at least two
partitions, each partition having at least one partition value;
using transducer means to measure a value of at least one feature of an
utterance during a time interval to produce a feature vector signal
representing the value of the at least one feature of the utterance;
calculating a match score for each partition, each partition match score
representing the value of a match between the partition value of the
partition and the feature value of the feature vector signal;
calculating a prototype match score for each prototype vector, each
prototype match score representing a function of the partition match
scores for all partitions in the prototype vector; and
coding the feature vector signal with the identifier of the prototype
vector signal having the best prototype match score.
56. A method as claimed in claim 55, characterized in that:
each partition match score is proportional to the joint probability of
occurrence of the feature value of the feature vector signal and the
partition value of the partition; and
the prototype match score represents the sum of the partition match scores
for all partitions in the prototype vector.
57. A method as claimed in claim 56, further comprising a method of
generating prototype vector signals, said prototype vector signal
generating method comprising:
measuring the value of at least one feature of a training utterance during
each of a series of successive first time intervals to produce a series of
training feature vector signals, each training feature vector signal
corresponding to a first time interval, each training feature vector
signal representing the value of at least one feature of the training
utterance during a second time interval containing the corresponding first
time interval, each second time interval being greater than or equal to
the corresponding first time interval;
providing a network of elemental models corresponding to the training
utterance;
correlating the training feature vector signals in the series of training
feature vector signals to the elemental models in the network of elemental
models corresponding to the training utterance so that each training
feature vector signal in the series of training feature vector signals
corresponds to one elemental model in the network of elemental models
corresponding to the training utterance;
selecting a fundamental set of all training feature vector signals which
correspond to all occurrences of a first elemental model in the network of
elemental models corresponding to the training utterance;
selecting at least first and second different subsets of the fundamental
set of training feature vector signals to form a first label set of
training feature vector signals;
calculating centroid of the feature values of the training feature vector
signals of each of the first and second subsets of the fundamental set;
and
storing a first prototype vector signal corresponding to the first label
set of training feature vector signals, said first prototype vector signal
representing a first prototype vector having at least first and second
partitions, each partition having at least one partition value, the first
partition having a partition value equal to the value of the centroid of
the feature values of the training feature vector signals in the first
subset of the fundamental set, the second partition having a partition
value equal to the value of the centroid of the feature values of the
training feature vector signals in the second subset of the fundamental
set.
58. A method as claimed in claim 57, characterized in that the centroid is
arithmetic average.
59. A method as claimed in claim 58, characterized in that the network of
elemental models is a series of elemental models.
60. A method as claimed in claim 59, characterized in that:
the fundamental set of training feature vector signals is divided into at
least first, second and third subsets of training feature vector signals;
the calculating step further calculates the centroid of the feature values
of the training feature vector signals in the third subset; and
the method further comprises the step of storing a second prototype vector
signal, said second prototype vector signal representing the value of the
centroid of the feature values of the training feature vector signals in
the third subset of the fundamental set.
61. A method as claimed in claim 60, characterized in that:
the feature values of the training feature vector signals in each subset of
the fundamental set have a feature value variance and a priori
probability;
the method further comprises the step of calculating the variance and a
priori probability of the feature values of the training feature vector
signals in each subset of the fundamental set;
the first prototype signal represents the values of the variance and a
priori probability of the feature values of the training feature vector
signals in the first and second subsets of the fundamental set; and
the second prototype signal represents the value of the variance and a
priori probability of the feature values of the training feature vector
signals in the third subset of the fundamental set.
62. A method as claimed in claim 61, characterized in that:
the method further comprises the step of estimating conditional probability
of occurrence of each subset of the fundamental set of training feature
vector signals given the occurrence of the first label set;
the method further comprises the step of estimating the probability of
occurrence of the first label set of training feature vector signals;
the first prototype vector further represents the estimated probability of
occurrence of the first label set of training feature vector signals;
the first partition of the first prototype vector has a further partition
value equal to the estimated conditional probability of occurrence of the
first subset of the fundamental set of training feature vector signals
given the occurrence of the first label set; and
the second partition of the first prototype vector has a further partition
value equal to the estimated conditional probability of occurrence of the
second subset of the fundamental set of training feature vector signals
given the occurrence of the first label set.
63. A method as claimed in claim 62, characterized in that:
each second time interval is equal to at least two first time intervals;
and
each feature vector signal comprises at least two feature values of the
utterance at two different times.
64. A method as claimed in claim 63, characterized in that:
each feature vector signal represents values of m features, where m is an
integer greater than or equal to two;
each partition has n partition values, where n is less than m; and
the method further comprises the step of transforming the m values of each
feature vector signal to n values prior to calculating the centroids and
variance and a priori probability of the subsets.
65. A method as claimed in claim 64, characte