The present invention relates to the management of voice data. Voice messages left on a recipient's answerphone or delivered via a voicemail system are a popular form of person-to-person communication. Such voice messages are quick to generate for the sender but are relatively difficult to review for the recipient; speech is slow to listen to and, unlike inherently visual forms of messages such as electronic mail or handwritten notes, cannot be quickly scanned for the relevant information. The present invention aims to make it easier for users to find relevant information in voice messages, and other kinds of voice record, such as recordings of meetings and recorded dictation. According to the present invention we provide a method of speech segmentation comprising processing speech data so as to detect putative pauses and characterised by forming speech block boundaries at a selected subset of the pauses, said selection being based on a preselected target speech block length. The invention may be applied in an application where speech is represented visually.
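The selection of block boundaries from detected pauses can be illustrated with a minimal sketch. The abstract does not state the selection rule, so the greedy nearest-pause heuristic, the function name `select_block_boundaries`, and the 1.5 stopping factor below are all assumptions made for illustration only.

```python
from typing import List


def select_block_boundaries(pauses: List[float], total: float, target: float) -> List[float]:
    """Pick a subset of pause times (seconds) as speech block boundaries.

    Hypothetical greedy rule: from the start of the current block, choose the
    pause closest to `start + target`, then repeat from that pause.
    """
    candidates = sorted(p for p in pauses if 0.0 < p < total)
    boundaries: List[float] = []
    start = 0.0
    # Assumed stopping rule: stop once the remaining speech is short enough
    # to form a final block on its own (under 1.5x the target length).
    while total - start > 1.5 * target:
        later = [p for p in candidates if p > start]
        if not later:
            break
        ideal = start + target
        best = min(later, key=lambda p: abs(p - ideal))
        boundaries.append(best)
        start = best
    return boundaries


if __name__ == "__main__":
    pauses = [1.0, 4.8, 5.5, 9.7, 11.0, 15.2, 19.0]
    print(select_block_boundaries(pauses, total=20.0, target=5.0))
```

With a target of 5 seconds, only the pauses nearest each 5-second step become boundaries; the remaining pauses stay inside blocks.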
The invention describes a graphical method for detecting and correcting audio overload conditions. The graphical user interface gives a user complete playback control of several audio tracks, detection of overload conditions such as audio clipping, and graphical methods to correct the overload conditions. The graphical interface provides drag handles that the user can use to adjust various characteristics of an audio file. The characteristics, such as amplitude and tempo, may be adjusted as a function of time.
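As a rough illustration of the two ideas in this abstract, the sketch below flags runs of clipped samples and applies a time-varying gain defined by handle positions. The abstract gives no algorithm; the threshold `limit`, the minimum run length `min_run`, the linear interpolation between handles, and both function names are assumptions.

```python
from typing import List, Tuple


def find_clipped_runs(samples: List[float], limit: float = 1.0,
                      min_run: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs at or above `limit`."""
    runs: List[Tuple[int, int]] = []
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) >= limit:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the signal.
    if run_start is not None and len(samples) - run_start >= min_run:
        runs.append((run_start, len(samples)))
    return runs


def gain_at(i: int, handles: List[Tuple[int, float]]) -> float:
    """Gain at sample i, linearly interpolated between (index, gain) handles.

    Assumes `handles` is non-empty with strictly increasing indices.
    """
    if i <= handles[0][0]:
        return handles[0][1]
    if i >= handles[-1][0]:
        return handles[-1][1]
    for (i0, g0), (i1, g1) in zip(handles, handles[1:]):
        if i0 <= i <= i1:
            t = (i - i0) / (i1 - i0)
            return g0 + t * (g1 - g0)
    return handles[-1][1]


def apply_gain_envelope(samples: List[float],
                        handles: List[Tuple[int, float]]) -> List[float]:
    """Scale each sample by the interpolated gain at its position."""
    return [s * gain_at(i, handles) for i, s in enumerate(samples)]
```

Each handle here stands in for one drag handle in the interface: moving it changes the gain applied around that point in time.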
A method (200) and apparatus (100) for classifying a homogeneous audio segment are disclosed. The homogeneous audio segment comprises a sequence of audio samples (x(n)). The method (200) starts by forming a sequence of frames (701-704) along the sequence of audio samples (x(n)), each frame (701-704) comprising a plurality of the audio samples (x(n)). The homogeneous audio segment is next divided (206) into a plurality of audio clips (711-714), with each audio clip being associated with a plurality of the frames (701-704). The method (200) then extracts (208) at least one frame feature for each clip (711-714). A clip feature vector (f) is next extracted from frame features of frames associated with the audio clip (711-714). Finally, the segment is classified based on a continuous function defining the distribution of the clip feature vectors (f).
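The frame, clip, and clip-feature-vector steps can be sketched as follows. The abstract does not name the frame features or the statistics used, so the choice of short-time energy and zero-crossing rate, the mean and standard deviation per clip, and all function names are assumptions; the final classification step is not shown.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def make_frames(x: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split the sample sequence x(n) into frames of `frame_len` samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def frame_features(frame: Sequence[float]) -> Tuple[float, float]:
    """Assumed frame features: mean energy and zero-crossing rate (frame_len >= 2)."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)


def clip_feature_vectors(x: Sequence[float], frame_len: int, hop: int,
                         frames_per_clip: int) -> List[Tuple[float, float, float, float]]:
    """Group consecutive frames into clips and summarise each clip as a vector f."""
    feats = [frame_features(f) for f in make_frames(x, frame_len, hop)]
    vectors = []
    for i in range(0, len(feats) - frames_per_clip + 1, frames_per_clip):
        clip = feats[i:i + frames_per_clip]
        energies = [e for e, _ in clip]
        zcrs = [z for _, z in clip]
        vectors.append((mean(energies), pstdev(energies), mean(zcrs), pstdev(zcrs)))
    return vectors
```

A classifier would then fit or evaluate a distribution over the returned vectors, one vector per clip.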
A system and method for managing voicemails using metadata is presented. The metadata includes an audible introduction, which a user records and associates with a voicemail. The user is also able to associate other types of metadata with a voicemail, such as a description data flag, a reminder flag, and a retention flag. Once the user associates the metadata with a voicemail, the user is able to retrieve the audible introduction and the voicemail in a sequential manner or through search criteria. The user is able to customize metadata based upon the user's requirements.
A voicemail system includes a voicemail bookmarking procedure that permits users to bookmark voicemail messages during message playback. Upon receiving a bookmark request from a user, the procedure generates a bookmark pointer defining a starting point for subsequent playback of the message. The bookmark pointer can be based, in part, on a timing offset value entered by the user while making the bookmark request. The timing offset value defines a user-selected playback starting point that occurs before the message time at which the bookmark request was made. The value of the timing offset can be user selected.
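The bookmark pointer arithmetic can be shown in a few lines. This is a minimal sketch, assuming times are measured in seconds from the start of the message and that the pointer is clamped at the start of the message; the function name `bookmark_pointer` is hypothetical.

```python
def bookmark_pointer(request_time_s: float, offset_s: float = 0.0) -> float:
    """Return the playback starting point for a bookmark.

    `request_time_s` is the message time at which the user made the request;
    `offset_s` is the user-selected timing offset, which moves the starting
    point earlier. The result is clamped so it never precedes the message start.
    """
    if offset_s < 0:
        raise ValueError("timing offset must not be negative")
    return max(0.0, request_time_s - offset_s)


if __name__ == "__main__":
    # Request made 42 s into the message with a 5 s offset.
    print(bookmark_pointer(42.0, 5.0))
```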
In a method for performing a segmentation operation upon a synthesizing speech signal and an input speech signal, a synthesized speech signal and a speech element duration signal are generated from the synthesizing speech signal. A first feature parameter is extracted from the synthesized speech signal, and a second feature parameter is extracted from the input speech signal. A dynamic programming matching operation is performed upon the second feature parameter with reference to the first feature parameter and the speech element duration signal to obtain segmentation points of the input speech signal.
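The matching step can be illustrated with a textbook dynamic time warping alignment, used here as a stand-in for the dynamic programming matching the abstract describes. The sketch assumes one scalar feature per frame, an absolute-difference local cost, and durations given in frames of the synthesized signal that sum to its length; the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dtw_path(ref: Sequence[float], inp: Sequence[float]) -> List[Tuple[int, int]]:
    """Align ref (synthesized features) with inp (input features) by DTW."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path


def segment_input(ref: Sequence[float], durations: Sequence[int],
                  inp: Sequence[float]) -> List[int]:
    """Map speech element boundaries of ref onto frame indices of inp."""
    first_input: Dict[int, int] = {}
    for r, j in dtw_path(ref, inp):
        first_input.setdefault(r, j)  # first input frame aligned to ref frame r
    points = []
    boundary = 0
    for d in durations[:-1]:
        boundary += d
        points.append(first_input[boundary])
    return points
```

Because the element boundaries of the synthesized signal are known from the duration signal, the alignment path transfers them to the input signal as segmentation points.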