A vehicle system for classifying a spoken utterance within a vehicle cabin as one of system-directed and non-system-directed. The system may include at least one microphone configured to detect at least one acoustic utterance from at least one occupant of a vehicle, at least one sensor to detect user behavior data indicative of user behavior, and a processor programmed to: receive the acoustic utterance, classify the acoustic utterance as one of a system-directed utterance and a non-system-directed utterance, determine whether the acoustic utterance was properly classified based on user behavior observed via data received from the sensor after the classification, and apply a mitigating adjustment to classifications of subsequent acoustic utterances based on an improper classification.
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialog
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use for comparison or discrimination for estimating an emotional state
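The feedback loop described in the abstract above lends itself to a small illustration. Below is a minimal Python sketch, assuming a score threshold that is nudged whenever post-classification behavior (a repeated command or a cancelled response) suggests the decision was wrong; the scoring model, the behavior signals, and the step size are illustrative assumptions, not details from the patent.

```python
import numpy as np

class UtteranceClassifier:
    """Toy system-directed / non-system-directed classifier with a mitigating
    adjustment loop (illustrative assumption, not the patented design)."""

    def __init__(self, threshold: float = 0.5, step: float = 0.05):
        self.threshold = threshold   # score above which an utterance counts as system-directed
        self.step = step             # how far the threshold moves after a misclassification

    def score(self, features: np.ndarray) -> float:
        # Placeholder scoring model: a logistic function over a feature sum.
        return float(1.0 / (1.0 + np.exp(-features.sum())))

    def classify(self, features: np.ndarray) -> bool:
        return self.score(features) >= self.threshold

    def apply_feedback(self, predicted_system_directed: bool,
                       user_repeated_command: bool, user_cancelled_response: bool) -> None:
        # Behavior observed *after* the classification decides whether it was proper.
        if predicted_system_directed and user_cancelled_response:
            # False accept: demand a higher score for subsequent utterances.
            self.threshold = min(0.95, self.threshold + self.step)
        elif not predicted_system_directed and user_repeated_command:
            # False reject: relax the threshold for subsequent utterances.
            self.threshold = max(0.05, self.threshold - self.step)
```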
2.
System and method for temporal and power based zone detection in speaker dependent microphone environments
A method, computer program product, and computer system for receiving, by a computing device, a speech signal from a speaker via a plurality of microphone zones. A temporal cue based confidence may be determined for at least a portion of the plurality of microphone zones. A power cue based confidence may be determined for at least a portion of the plurality of microphone zones. A microphone zone of the plurality of microphone zones from which to use an output signal of the speaker may be identified based upon, at least in part, a combination of the temporal cue based confidence and the power cue based confidence.
H04R 1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialog
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being power information
G10L 25/78 - Detection of presence or absence of voice signals
G10L 25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
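The zone-selection step above combines two per-zone confidences. A minimal sketch of one plausible fusion, assuming the temporal cue is derived from relative arrival delays, the power cue from per-zone levels, and a simple weighted sum does the combining; all of these specifics are assumptions for illustration.

```python
import numpy as np

def estimate_temporal_confidence(delays_ms: np.ndarray) -> np.ndarray:
    """Earlier arrival relative to the other zones -> higher confidence (assumed cue)."""
    rel = delays_ms.max() - delays_ms
    return rel / rel.sum() if rel.sum() > 0 else np.full_like(delays_ms, 1.0 / len(delays_ms))

def estimate_power_confidence(powers_db: np.ndarray) -> np.ndarray:
    """Higher in-zone power relative to the other zones -> higher confidence (assumed cue)."""
    lin = 10.0 ** (powers_db / 10.0)
    return lin / lin.sum()

def select_zone(delays_ms: np.ndarray, powers_db: np.ndarray, w_temporal: float = 0.5) -> int:
    """Pick the microphone zone whose fused temporal/power confidence is largest."""
    fused = (w_temporal * estimate_temporal_confidence(delays_ms)
             + (1.0 - w_temporal) * estimate_power_confidence(powers_db))
    return int(np.argmax(fused))

# Example: four zones; zone 2 has the earliest arrival and the strongest signal.
print(select_zone(np.array([4.0, 3.5, 0.5, 5.0]), np.array([-32.0, -30.0, -18.0, -35.0])))  # 2
```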
3.
System and method for combined non-linear and late echo suppression
A method, computer program product, and computer system for receiving, by a computing device, an input signal. A first power spectral density estimate may be generated for a linear reverberant component associated with the input signal. A second power spectral density estimate may be generated for a non-linear reverberant component associated with the input signal. A power spectral density estimate may be generated by combining the first power spectral density estimate for the linear reverberant component and the second power spectral density estimate for the non-linear reverberant component. One or more parameters for at least one of the linear reverberant component and the non-linear reverberant component may be updated. One or more undesired signal components in an output signal resulting from the input signal may be reduced via residual echo suppression based upon, at least in part, updating the one or more parameters.
A method for residual echo suppression is provided. Embodiments may include receiving an original reference signal and applying a distortion function to the original reference signal to generate a second signal. Embodiments may include generating a non-linear signal from the distortion function that does not include linear components of the original reference signal. Embodiments may also include calculating a residual echo power of a linear component and a non-linear component, wherein the linear component is based upon the original reference signal and the non-linear component is based upon the non-linear signal. Embodiments may further include applying a room model to each of the original reference signal and the non-linear signal and estimating a power associated with the original reference signal and the non-linear signal. Embodiments may include calculating a combined echo power estimate as a weighted sum of a weighted original reference signal power and a weighted non-linear signal power.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
G10L 21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
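The second echo-suppression abstract above ends with a combined echo power estimate formed as a weighted sum of the linear and non-linear component powers. A short sketch of that combination, plus a Wiener-style suppression gain built on top of it; the fixed weights and the gain floor are assumptions, not values from the patents.

```python
import numpy as np

def combined_echo_power(psd_linear: np.ndarray, psd_nonlinear: np.ndarray,
                        w_linear: float = 0.7, w_nonlinear: float = 0.3) -> np.ndarray:
    """Combined residual-echo power estimate per frequency bin as a weighted sum
    of the linear and non-linear reverberant PSD estimates (weights are assumed)."""
    return w_linear * psd_linear + w_nonlinear * psd_nonlinear

def suppression_gain(psd_mic: np.ndarray, psd_echo: np.ndarray, floor: float = 0.1) -> np.ndarray:
    """Wiener-style residual echo suppression gain derived from the combined estimate."""
    gain = 1.0 - psd_echo / np.maximum(psd_mic, 1e-12)
    return np.clip(gain, floor, 1.0)
```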
A system and method for speech enhancement of a portable electronic device. Embodiments may include receiving an audio signal at a portable electronic device having a first microphone and a second microphone. Embodiments may also include receiving an input from a proximity detector associated with the portable electronic device and controlling a processing component associated with at least one of the first microphone and the second microphone based upon, at least in part, the input from the proximity detector.
The invention relates to a system and method for integrating domain information into state transitions of a Finite State Transducer (“FST”) for natural language processing. A system may integrate semantic parsing and information retrieval from an information domain to generate an FST parser that represents the information domain. The FST parser may include a plurality of FST paths, at least one of which may be used to generate a meaning representation from a natural language input. As such, the system may perform domain-based semantic parsing of a natural language input, generating more robust meaning representations using domain information. The system may be applied to a wide range of natural language applications that use natural language input from a user such as, for example, natural language interfaces to computing systems, communication with robots in natural language, personalized digital assistants, question-answer query systems, and/or other natural language processing applications.
A system and method of tagging utterances with Named Entity Recognition (“NER”) labels using unmanaged crowds is provided. The system may generate various annotation jobs in which a user, among a crowd, is asked to tag which parts of an utterance, if any, relate to various entities associated with a domain. For a given domain that is associated with a number of entities that exceeds a threshold N value, multiple batches of jobs (each batch having jobs that have a limited number of entities for tagging) may be used to tag a given utterance from that domain. This reduces the cognitive load imposed on a user, and prevents the user from having to tag more than N entities. As such, a domain with a large number of entities may be tagged efficiently by crowd participants without overloading each crowd participant with too many entities to tag.
A method, computer program product, and computer system for addressing acoustic signal reverberation is provided. Embodiments may include receiving, at one or more microphones, a first audio signal and a reverberation audio signal. Embodiments may further include processing at least one of the first audio signal and the reverberation audio signal. Embodiments may also include limiting a model based reverberation equalizer using a temporal constraint for direct sound distortions, the model based reverberation equalizer configured to generate one or more outputs, based upon, at least in part, at least one of the first audio signal and the reverberation audio signal.
G10K 15/08 - Arrangements for producing a reverberation or echo sound
H04B 3/20 - Reducing echo effects or singing; Opening or closing transmitting path; Conditioning for transmission in one direction or the other
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
A system and method is provided of disambiguating natural language processing requests based on smart matching, request confirmations that are used until ambiguities are resolved, and machine learning. Smart matching may match entities (e.g., contact names, place names, etc.) based on user information such as call logs, user preferences, etc. If multiple matches are found and disambiguation has not yet been learned by the system, the system may request that the user identify the intended entity. On the other hand, if disambiguation has been learned by the system, the system may execute the request without confirmations. The system may use a record of confirmations and/or other information to continuously learn a user's inputs in order to reduce ambiguities and no longer prompt for confirmations.
In certain implementations, follow-up responses may be provided for prior natural language inputs of a user. As an example, a natural language input associated with a user may be received at a computer system. A determination of whether information sufficient for providing an adequate response to the natural language input is currently accessible to the computer system may be effectuated. A first response to the natural language input (that indicates that a follow-up response will be provided) may be provided based on a determination that information sufficient for providing an adequate response to the natural language input is not currently accessible. Information sufficient for providing an adequate response to the natural language input may be received. A second response to the natural language input may then be provided based on the received sufficient information.
Methods and apparatus for estimating the power spectral density (PSD) of a residual interference having first and second components after adaptive interference cancellation (AIC). The first component can be estimated using a real-valued FIR filter operating on a time series of PSD estimates of a reference signal, and the second component can be estimated using an exponential decay over time corresponding to a reverberation time using the PSD of the reference signal.
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being power information
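The two-component estimate described above maps naturally onto a few lines of code. The sketch below assumes the FIR coefficients are supplied from elsewhere (e.g. adapted offline) and that the late component's power decays by 60 dB over the reverberation time; both are assumptions used only to make the structure concrete.

```python
import numpy as np

def residual_interference_psd(ref_psd_history: np.ndarray, fir_taps: np.ndarray,
                              t60_s: float, frame_shift_s: float) -> float:
    """Estimate the residual interference PSD after AIC for one frequency bin.

    ref_psd_history : recent reference-signal PSD estimates, newest last (at least len(fir_taps) long).
    fir_taps        : real-valued FIR coefficients for the first component (assumed to be adapted elsewhere).
    t60_s           : reverberation time governing the second, exponentially decaying component.
    """
    # First component: real-valued FIR filter over the time series of reference PSD estimates.
    n = len(fir_taps)
    early = float(np.dot(fir_taps, ref_psd_history[-n:][::-1]))

    # Second component: exponential decay over time corresponding to the reverberation time
    # (power assumed to fall by 60 dB over T60).
    decay_per_frame = 10.0 ** (-6.0 * frame_shift_s / t60_s)
    frames_back = np.arange(1, len(ref_psd_history) + 1)
    late = float(np.sum(ref_psd_history[::-1] * decay_per_frame ** frames_back))

    return early + late
```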
A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
A multi-mode speech communication system is described that has different operating modes for different speech applications. A signal processing module is in communication with the speech applications and includes an input processing module and an output processing module. The input processing module processes microphone input signals to produce a set of user input signals for each speech application that are limited to currently active system users for that speech application. The output processing module processes application output communications from the speech applications to produce loudspeaker output signals to the system users, wherein for each different speech application, the loudspeaker output signals are directed only to system users currently active in that speech application.
An automotive text display arrangement is described which includes a driver text display positioned directly in front of an automobile driver and displaying a limited amount of text to the driver without impairing forward visual attention of the driver. The arrangement may include a boundary insertion mode wherein when the active text position is an active text boundary, new text is inserted between the text items separated by the active text boundary, and when the active text position is an active text item, new text replaces the active text item. In addition or alternatively, there may be a multifunctional text control knob offering multiple different user movements, each performing an associated text processing function.
G06F 3/0481 - Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
15.
System and method of recording utterances using unmanaged crowds for natural language processing
A system and method of recording utterances for building Named Entity Recognition (“NER”) models, which are used to build dialog systems in which a computer listens and responds to human voice dialog. Utterances to be uttered may be provided to users through their mobile devices, which may record the user uttering (e.g., verbalizing, speaking, etc.) the utterances and upload the recording to a computer for processing. The use of the user's mobile device, which is programmed with an utterance collection application (e.g., configured as a mobile app), facilitates the use of crowd-sourcing human intelligence tasking for widespread collection of utterances from a population of users. As such, obtaining large datasets for building NER models may be facilitated by the system and method disclosed herein.
Systems and methods for gathering text commands in response to a command context using a first crowdsourced job are discussed herein. A command context for a natural language processing system may be identified, where the command context is associated with a command context condition to provide commands to the natural language processing system. One or more command creators associated with one or more command creation devices may be selected. A first application on the one or more command creation devices may be configured to display command creation instructions for each of the one or more command creators to provide text commands that satisfy the command context, and to display a field for capturing a user-generated text entry to satisfy the command creation condition in accordance with the command creation instructions. Systems and methods for reviewing the text commands using second crowdsourced jobs are also presented herein.
The invention relates to a system and method of automatically distinguishing between computers and humans based on responses to enhanced Completely Automated Public Turing test to tell Computers and Humans Apart (“e-captcha”) challenges that do not merely challenge the user to recognize skewed or stylized text. A given e-captcha challenge may be specific to a particular knowledge domain. Accordingly, e-captchas may be used not only to distinguish between computers and humans, but also to determine whether a respondent has demonstrated knowledge in the particular knowledge domain. For instance, participants in crowd-sourced tasks, in which unmanaged crowds are asked to perform tasks, may be screened using an e-captcha challenge. This not only validates that a participant is a human (and not a bot, for example, attempting to game the crowd-sourced task), but also screens the participant based on whether they can successfully respond to the e-captcha challenge.
A system and method of tagging utterances with Named Entity Recognition (“NER”) labels using unmanaged crowds is provided. The system may generate various annotation jobs in which a user, among a crowd, is asked to tag which parts of an utterance, if any, relate to various entities associated with a domain. For a given domain that is associated with a number of entities that exceeds a threshold N value, multiple batches of jobs (each batch having jobs that have a limited number of entities for tagging) may be used to tag a given utterance from that domain. This reduces the cognitive load imposed on a user, and prevents the user from having to tag more than N entities. As such, a domain with a large number of entities may be tagged efficiently by crowd participants without overloading each crowd participant with too many entities to tag.
A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, coarticulation parameter, and speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to get a same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
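The final search over the video frame lattice described above is a shortest-path problem over the summed target and joint costs. Below is a generic Viterbi-style dynamic program, offered as an assumed stand-in for the lattice search named in the abstract rather than the patented procedure itself.

```python
import numpy as np

def best_path(target_cost: np.ndarray, joint_cost: list) -> list:
    """Find the candidate sequence minimising the sum of target and joint costs.

    target_cost : (T, K) array, cost of candidate k at target position t.
    joint_cost  : list of (K, K) arrays; joint_cost[t][i, j] is the cost of following
                  candidate i at position t with candidate j at position t + 1.
    """
    T, K = target_cost.shape
    cost = target_cost[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        trans = cost[:, None] + joint_cost[t - 1]          # prev cumulative cost + joint cost
        back[t] = np.argmin(trans, axis=0)                 # best predecessor for each candidate
        cost = trans[back[t], np.arange(K)] + target_cost[t]
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):                          # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Example: 3 target positions, 2 candidates each; switching candidates is expensive.
tc = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
jc = [np.array([[0.0, 2.0], [2.0, 0.0]])] * 2
print(best_path(tc, jc))   # [0, 0, 0]: the smooth path beats locally cheaper jumps
```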
Methods and apparatus for broadening the beamwidth of beamforming and postfiltering using a plurality of beamformers and signal and power spectral density mixing, and controlling a postfilter based on spatial activity detection such that de-reverberation or noise reduction is performed when a speech source is between the first and second beams.
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being power information
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
21.
System and method of providing and validating enhanced CAPTCHAs
The invention relates to a system and method of automatically distinguishing between computers and humans based on responses to enhanced Completely Automated Public Turing test to tell Computers and Humans Apart (“e-captcha”) challenges that do not merely challenge the user to recognize skewed or stylized text. A given e-captcha challenge may be specific to a particular knowledge domain. Accordingly, e-captchas may be used not only to distinguish between computers and humans, but also to determine whether a respondent has demonstrated knowledge in the particular knowledge domain. For instance, participants in crowd-sourced tasks, in which unmanaged crowds are asked to perform tasks, may be screened using an e-captcha challenge. This not only validates that a participant is a human (and not a bot, for example, attempting to game the crowd-sourced task), but also screens the participant based on whether they can successfully respond to the e-captcha challenge.
A system and method configured for use in a text-to-speech (TTS) system is provided. Embodiments may include identifying, using one or more processors, a word or phrase as a named entity and identifying a language of origin associated with the named entity. Embodiments may further include transliterating the named entity to a script associated with the language of origin. If the TTS system is operating in the language of origin, embodiments may include passing the transliterated script to the TTS system. If the TTS system is not operating in the language of origin, embodiments may include generating a phoneme sequence in the language of origin using a grapheme to phoneme (G2P) converter.
G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G06F 17/27 - Automatic analysis, e.g. parsing, orthograph correction
G06F 17/28 - Processing or translating of natural language
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
23.
System and method for processing out of vocabulary compound words
A system and method for out-of-vocabulary compound word handling is provided. Embodiments may include storing a plurality of compound word rules and compound word dictionaries in a database. Embodiments may also include evaluating membership criteria associated with a received compound word, wherein membership criteria includes at least one of dictionary based or part of speech (POS) based criteria. Embodiments may further include applying one or more filtering rules to the received compound word.
A system and method for concatenative speech synthesis is provided. Embodiments may include accessing, using one or more computing devices, a plurality of speech synthesis units from a speech database and determining a similarity between the plurality of speech synthesis units. Embodiments may further include retrieving two or more speech synthesis units having the similarity and pruning at least one of the two or more speech synthesis units based upon, at least in part, the similarity.
A system and method of recording utterances for building Named Entity Recognition (“NER”) models, which are used to build dialog systems in which a computer listens and responds to human voice dialog. Utterances to be uttered may be provided to users through their mobile devices, which may record the user uttering (e.g., verbalizing, speaking, etc.) the utterances and upload the recording to a computer for processing. The use of the user's mobile device, which is programmed with an utterance collection application (e.g., configured as a mobile app), facilitates the use of crowd-sourcing human intelligence tasking for widespread collection of utterances from a population of users. As such, obtaining large datasets for building NER models may be facilitated by the system and method disclosed herein.
Method and apparatus to determine a speaker activity detection measure from energy-based characteristics of signals from a plurality of speaker-dedicated microphones, detect acoustic events using power spectra for the microphone signals, and determine a robust speaker activity detection measure from the speaker activity measure and the detected acoustic events.
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being power information
G10L 25/78 - Detection of presence or absence of voice signals
Systems and methods of validating transcriptions of natural language content using crowdsourced validation jobs are provided herein. In various implementations, a transcription pair comprising natural language content and text corresponding to a transcription of the natural language content may be gathered. A first group of validation devices may be selected for reviewing the transcription pair. A first crowdsourced validation job may be created for the first group of validation devices. The first crowdsourced validation job may be provided to the first group of validation devices. A vote representing whether or not the text accurately represents the natural language content may be received from each of the first group of validation devices. A validation score may be assigned to the transcription pair based, at least in part, on the votes from each of the first group of validation devices.
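The validation score in the last step above can be as simple as a smoothed vote fraction. A toy sketch, in which the prior and its weight are assumed smoothing choices rather than the patented scoring rule:

```python
def validation_score(votes: list[bool], prior: float = 0.5, weight: int = 2) -> float:
    """Score a (natural language content, transcription) pair from yes/no votes
    cast by a group of validation devices; higher means more reviewers accepted it."""
    yes = sum(votes)
    return (yes + weight * prior) / (len(votes) + weight)

# Example: four of five reviewers accept the transcription.
print(validation_score([True, True, True, True, False]))   # ~0.714
```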
The invention relates to a system and method for integrating domain information into state transitions of a Finite State Transducer (“FST”) for natural language processing. A system may integrate semantic parsing and information retrieval from an information domain to generate an FST parser that represents the information domain. The FST parser may include a plurality of FST paths, at least one of which may be used to generate a meaning representation from a natural language input. As such, the system may perform domain-based semantic parsing of a natural language input, generating more robust meaning representations using domain information. The system may be applied to a wide range of natural language applications that use natural language input from a user such as, for example, natural language interfaces to computing systems, communication with robots in natural language, personalized digital assistants, question-answer query systems, and/or other natural language processing applications.
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
H04B 7/04 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
H04B 7/06 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use for comparison or discrimination for estimating an emotional state
30.
System and method for providing words or phrases to be uttered by members of a crowd and processing the utterances in crowd-sourced campaigns to facilitate speech analysis
Systems and methods of providing text related to utterances, and gathering voice data in response to the text, are provided herein. In various implementations, an identification token that identifies a first file for a voice data collection campaign, and a second file for a session script, may be received from a natural language processing training device. The first file and the second file may be used to configure the mobile application to display a sequence of screens, each of the sequence of screens containing text of at least one utterance specified in the voice data collection campaign. Voice data may be received from the natural language processing training device in response to user interaction with the text of the at least one utterance. The voice data and the text may be stored in a transcription library.
A method for entering keys in a small key pad is provided. The method comprising the steps of: providing at least a part of keyboard having a plurality of keys; and predetermining a first probability of a user striking a key among the plurality of keys. The method further uses a dictionary of selected words associated with the key pad and/or a user.
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
G06F 17/27 - Automatic analysis, e.g. parsing, orthograph correction
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
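The prediction idea above (per-key strike probabilities combined with a dictionary of selected words) can be illustrated in a few lines. The per-key probabilities and the toy dictionary in the sketch below are assumptions; only the ranking-by-probability structure is taken from the abstract.

```python
def rank_words(key_sequence: list[str], key_probabilities: dict[str, dict[str, float]],
               dictionary: list[str]) -> list[str]:
    """Rank dictionary words against an ambiguous key sequence on a small key pad.

    key_probabilities[key][letter] gives the chance the user meant `letter` when
    striking `key` (assumed values for illustration)."""
    def word_score(word: str) -> float:
        if len(word) != len(key_sequence):
            return 0.0
        score = 1.0
        for key, letter in zip(key_sequence, word):
            score *= key_probabilities.get(key, {}).get(letter, 0.0)
        return score
    return sorted((w for w in dictionary if word_score(w) > 0.0), key=word_score, reverse=True)

# Example: a two-key pad where key "a" covers a/b and key "c" covers c/d.
probs = {"a": {"a": 0.7, "b": 0.3}, "c": {"c": 0.6, "d": 0.4}}
print(rank_words(["a", "c"], probs, ["ac", "ad", "bc", "bd", "zz"]))  # ['ac', 'ad', 'bc', 'bd']
```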
32.
System and method for providing follow-up responses to prior natural language inputs of a user
In certain implementations, follow-up responses may be provided for prior natural language inputs of a user. As an example, a natural language input associated with a user may be received at a computer system. A determination of whether information sufficient for providing an adequate response to the natural language input is currently accessible to the computer system may be effectuated. A first response to the natural language input (that indicates that a follow-up response will be provided) may be provided based on a determination that information sufficient for providing an adequate response to the natural language input is not currently accessible. Information sufficient for providing an adequate response to the natural language input may be received. A second response to the natural language input may then be provided based on the received sufficient information.
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 13/10 - Prosody rules derived from text; Stress or intonation
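The caching behaviour described above (synthesize each intonational phrase once, then reuse it by identifier) is easy to sketch. Below, the unique identifier is assumed to be a hash of the normalised phrase text, and the TTS back end is a placeholder callable; neither detail comes from the patent.

```python
import hashlib

class PhraseAudioCache:
    """Cache synthesised audio per phrase, keyed by a hash of the text (assumed scheme)."""

    def __init__(self, tts_backend):
        self._tts = tts_backend                     # any callable mapping text -> audio bytes
        self._cache: dict[str, bytes] = {}

    def get_audio(self, phrase: str) -> bytes:
        key = hashlib.sha256(phrase.strip().lower().encode("utf-8")).hexdigest()
        if key not in self._cache:                  # only synthesise text not seen before
            self._cache[key] = self._tts(phrase)
        return self._cache[key]

# Example with a dummy back end that "synthesises" by encoding the text.
cache = PhraseAudioCache(lambda text: text.encode("utf-8"))
cache.get_audio("Hello, world.")    # synthesised and stored
cache.get_audio("Hello, world.")    # served from the cache, no re-synthesis
```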
34.
In-view and out-of-view request-related result regions for respective result categories
The invention relates to systems and methods of providing in-view and out-of-view request-related result regions for respective result categories. The system may facilitate result presentation by providing, in response to a user request, at least one region that is designated to initially be in-view and at least one region that is designated to initially be out-of-view where: (i) the initial in-view region comprises one or more results related to the user request and a first category; and (ii) the initial out-of-view region comprises one or more results related to the user request and a second category. A result related to a category may comprise a result related to a specific topic, a result of a specific type, a result from a specific source, or other result. A user request may comprise a query, a command, or other user request.
The present disclosure is directed towards a system and method for reducing tandeming effects in a communications system. The method may include receiving, at a speech decoder, an input bitstream associated with an incoming initial speech signal from a speech encoder. The method may further include determining whether or not coding is required and if coding is required, modifying an excitation signal associated with the bitstream. The method may also include providing the modified excitation signal to an adaptive encoder.
G10L 19/12 - Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
A system and method for addressing acoustic signal reverberation is provided. Embodiments may include receiving, at one or more microphones, a first audio signal and a reverberation audio signal. Embodiments may further include processing at least one of the first audio signal and the reverberation audio signal. Embodiments may also include limiting a model based reverberation equalizer using a temporal constraint for direct sound distortions, the model based reverberation equalizer configured to generate one or more outputs, based upon, at least in part, at least one of the first audio signal and the reverberation audio signal.
G10K 15/12 - Arrangements for producing a reverberation or echo sound using electronic time-delay networks
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
Embodiments disclosed herein may include determining a signal parameter of a first microphone and a second microphone associated with a computing device. Embodiments may include generating a reference parameter based upon at least one of the parameter of the first microphone and the parameter of the second microphone. Embodiments may include adjusting a tolerance of at least one of the first microphone and the second microphone, based upon the reference parameter. Embodiments may include receiving, at the first microphone, a first speech signal, the first speech signal having a first speech signal magnitude and receiving, at the second microphone, a second speech signal, the second speech signal having a second speech signal magnitude. Embodiments may include comparing at least one of the first speech signal magnitude and the second speech signal magnitude with a third speech signal magnitude and detecting an obstructed microphone based upon the comparison.
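One way to picture the final comparison step above: the speech level at each microphone is checked against a reference level, and a microphone that falls too far below it is flagged as obstructed. The 6 dB tolerance and the single broadband magnitude in the sketch are assumptions; the abstract only states that magnitudes are compared against a third signal.

```python
import numpy as np

def detect_obstructed_mic(mag_mic1: float, mag_mic2: float, mag_reference: float,
                          tolerance_db: float = 6.0) -> int | None:
    """Return the index (0 or 1) of a microphone whose speech level falls more than
    `tolerance_db` below the reference level, or None if neither looks obstructed."""
    def to_db(x: float) -> float:
        return 20.0 * np.log10(max(x, 1e-12))

    ref_db = to_db(mag_reference)
    for idx, mag in enumerate((mag_mic1, mag_mic2)):
        if ref_db - to_db(mag) > tolerance_db:
            return idx
    return None

# Example: the second microphone is roughly 14 dB quieter than the reference.
print(detect_obstructed_mic(0.9, 0.2, 1.0))   # 1
```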
A system and method for acoustic echo cancellation is provided. Embodiments may include receiving, at one or more microphones, an audio reference signal from an audio speaker. Embodiments may also include filtering the audio reference signal using one or more adaptive audio filters. Embodiments may further include analyzing a level of signal energy of the audio reference signal with regard to time, frequency and audio channel to identify at least one maximum error contribution point. Embodiments may also include updating the one or more adaptive audio filters based upon, at least in part, the analyzed audio reference signal.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
An arrangement is described for speech signal processing. An input microphone signal is received that includes a speech signal component and a noise component. The microphone signal is transformed into a frequency domain set of short-term spectra signals. Then speech formant components within the spectra signals are estimated based on detecting regions of high energy density in the spectra signals. One or more dynamically adjusted gain factors are applied to the spectra signals to enhance the speech formant components.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 19/06 - Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
40.
System for automatic speech recognition and audio entertainment
In one aspect, the present application is directed to a device for providing different levels of sound quality in an audio entertainment system. The device includes a speech enhancement system with a reference signal modification unit and a plurality of acoustic echo cancellation filters. Each acoustic echo cancellation filter is coupled to a playback channel. The device includes an audio playback system with loudspeakers. Each loudspeaker is coupled to a playback channel. At least one of the speech enhancement system and the audio playback system operates according to a full sound quality mode and a reduced sound quality mode. In the full sound quality mode, all of the playback channels contain non-zero output signals. In the reduced sound quality mode, a first subset of the playback channels contains non-zero output signals and a second subset of the playback channels contains zero output signals.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or of stress induced speech
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialog
41.
Methods and apparatus for dynamic low frequency noise suppression
Methods and apparatus for dynamically suppressing low frequency non-speech audio events, such as road bumps, without suppressing speech formants. In exemplary embodiments of the invention, maximum powers in first and second windows are computed and used to determine whether dampening should be applied, and if so, to what extent.
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
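A rough sketch of the dampening decision described above, under the assumption that the two windows correspond to a low band (where road bumps live) and the remaining speech band; the band split, the ratio threshold, and the attenuation limit are illustrative choices, not the patented method.

```python
import numpy as np

def low_frequency_dampening(frame: np.ndarray, fs: int, cutoff_hz: float = 120.0,
                            ratio_threshold: float = 4.0, max_atten: float = 0.2) -> np.ndarray:
    """Dampen a frame when its low-frequency power spikes relative to the speech band."""
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low = freqs < cutoff_hz

    p_low = float(np.max(np.abs(spec[low]) ** 2)) if np.any(low) else 0.0
    p_rest = float(np.max(np.abs(spec[~low]) ** 2)) if np.any(~low) else 1e-12

    ratio = p_low / max(p_rest, 1e-12)
    if ratio > ratio_threshold:                      # non-speech low-frequency event detected
        gain = max(max_atten, ratio_threshold / ratio)   # stronger event -> more dampening
        spec[low] *= gain
    return np.fft.irfft(spec, n=len(frame))
```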
42.
System and method for speech enhancement on compressed speech
The present disclosure is directed towards a method for speech intelligibility. The method may include receiving, at one or more computing devices, a first speech input from a first user and performing voice activity detection upon the first speech input. The method may also include analyzing a spectral tilt associated with the first speech input, wherein analyzing includes computing an impulse response of a linear predictive coding (“LPC”) synthesis filter in a linear pulse code modulation (“PCM”) domain and wherein the one or more computing devices includes an adaptive high pass filter configured to recalculate one or more linear prediction coefficients.
G10L 25/12 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being prediction coefficients
G10L 19/12 - Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
G10L 25/78 - Detection of presence or absence of voice signals
G10L 25/93 - Discriminating between voiced and unvoiced parts of speech signals
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being power information
A speech signal processing system is described for use with automatic speech recognition and hands free speech communication. A signal pre-processor module transforms an input microphone signal into corresponding speech component signals. A noise suppression module applies noise reduction to the speech component signals to generate noise reduced speech component signals. A speech reconstruction module produces corresponding synthesized speech component signals for distorted speech component signals. A signal combination block adaptively combines the noise reduced speech component signals and the synthesized speech component signals based on signal to noise conditions to generate enhanced speech component signals for automatic speech recognition and hands free speech communication.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or of stress induced speech
G10L 19/04 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
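The adaptive combination step in the abstract above can be read as a per-bin blend driven by signal-to-noise conditions: high-SNR bins keep the noise-reduced signal, heavily distorted bins fall back to the reconstructed one. A minimal sketch, assuming a linear ramp between two SNR thresholds as the mixing rule; the ramp and its endpoints are assumptions.

```python
import numpy as np

def combine_components(noise_reduced: np.ndarray, synthesized: np.ndarray,
                       snr_db: np.ndarray, low_db: float = 0.0, high_db: float = 15.0) -> np.ndarray:
    """Blend noise-reduced and synthesized speech component signals per frequency bin.

    snr_db holds the per-bin signal-to-noise estimate; bins above high_db keep the
    noise-reduced component, bins below low_db use only the synthesized component."""
    w = np.clip((snr_db - low_db) / (high_db - low_db), 0.0, 1.0)
    return w * noise_reduced + (1.0 - w) * synthesized
```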
44.
Reduced keyboard with prediction solutions when input is a partial sliding trajectory
A reduced virtual keyboard system for text input on electronic devices is disclosed. Text input is performed by creating a tracing trajectory. Dynamic prediction solutions are created during the tracing process, thus avoiding the need for a user to complete the entire word trajectory. The system also allows a mixture of tapping actions and sliding motions for the same word. The system may comprise a Long Words Dictionary database having first letters corresponding to predetermined keys of the keyboard. Alternatively, the system uses a Dictionary and a database management tool to find long words.
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 17/27 - Automatic analysis, e.g. parsing, orthograph correction
45.
System and method for cloud-based text-to-speech web services
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating speech. One variation of the method is from a server side, and another variation of the method is from a client side. The server side method, as implemented by a network-based automatic speech processing system, includes first receiving, from a network client independent of knowledge of internal operations of the system, a request to generate a text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The system extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client. The system provides access to the interactive demonstration to the network client.
A speech processing method and arrangement are described. A dynamic noise adaptation (DNA) model characterizes a speech input reflecting effects of background noise. A null noise DNA model characterizes the speech input based on reflecting a null noise mismatch condition. A DNA interaction model performs Bayesian model selection and re-weighting of the DNA model and the null noise DNA model to realize a modified DNA model characterizing the speech input for automatic speech recognition and compensating for noise to a varying degree depending on relative probabilities of the DNA model and the null noise DNA model.
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
H04B 7/04 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
H04B 7/06 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for detecting and correcting abnormal stress patterns in unit-selection speech synthesis. A system practicing the method detects incorrect stress patterns in selected acoustic units representing speech to be synthesized, and corrects the incorrect stress patterns in the selected acoustic units to yield corrected stress patterns. The system can further synthesize speech based on the corrected stress patterns. In one aspect, the system also classifies the incorrect stress patterns using a machine learning algorithm such as a classification and regression tree, adaptive boosting, support vector machine, and maximum entropy. In this way a text-to-speech unit selection speech synthesizer can produce more natural sounding speech with suitable stress patterns regardless of the stress of units in a unit selection database.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/10 - Prosody rules derived from text; Stress or intonation
G10L 15/18 - Speech classification or search using natural language modelling
G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups
49.
Wind noise detection for in-car communication systems with multiple acoustic zones
An in-car communication (ICC) system has multiple acoustic zones having varying acoustic environments. At least one input microphone within at least one acoustic zone develops a corresponding microphone signal from one or more system users. At least one loudspeaker within at least one acoustic zone provides acoustic audio to the system users. A wind noise module makes a determination of when wind noise is present in the microphone signal and modifies the microphone signal based on the determination.
A speech communication system includes a speech service compartment for holding one or more system users. The speech service compartment includes a plurality of acoustic zones having varying acoustic environments. At least one input microphone is located within the speech service compartment, for developing microphone input signals from the one or more system users. At least one loudspeaker is located within the service compartment. An in-car communication (ICC) system receives and processes the microphone input signals, forming loudspeaker output signals that are provided to one or more of the at least one output loudspeakers. The ICC system includes at least one of a speaker dedicated signal processing module and a listener specific signal processing module, that controls the processing of the microphone input signal and/or forming of the loudspeaker output signal based, at least in part, on at least one of an associated acoustic environment(s) and resulting psychoacoustic effect(s).
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
A multi-mode speech communication system is described that has different operating modes for different speech applications. A speech service compartment contains multiple system users, multiple input microphones that develop microphone input signals from the system users to the system, and multiple output loudspeakers that develop loudspeaker output signals from the system to the system users. A signal processing module is in communication with the speech applications and includes an input processing module and an output processing module. The input processing module processes the microphone input signals to produce a set of user input signals for each speech application that are limited to currently active system users for that speech application. The output processing module processes application output communications from the speech applications to produce loudspeaker output signals to the system users, wherein for each different speech application, the loudspeaker output signals are directed only to system users currently active in that speech application. The signal processing module dynamically controls the processing of the microphone input signals and the loudspeaker output signals to respond to changes in currently active system users for each application.
Techniques for providing speech output for speech-enabled applications. A synthesis system receives from a speech-enabled application a text input including a text transcription of a desired speech output. The synthesis system selects one or more audio recordings corresponding to one or more portions of the text input. In one aspect, the synthesis system selects from audio recordings provided by a developer of the speech-enabled application. In another aspect, the synthesis system selects an audio recording of a speaker speaking a plurality of words. The synthesis system forms a speech output including the one or more selected audio recordings and provides the speech output for the speech-enabled application.
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
53.
Method and system for text-to-speech synthesis with personalized voice
A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input (403) of speech in the form of an audio communication from an input speaker (401) and generating a voice dataset (404) for the input speaker (401). The method includes receiving a text input (411) at the same device as the audio input (403) and synthesizing (312) the text from the text input (411) to synthesized speech including using the voice dataset (404) to personalize the synthesized speech to sound like the input speaker (401). In addition, the method includes analyzing (316) the text for expression and adding the expression (315) to the synthesized speech. The audio communication may be part of a video communication (453) and the audio input (403) may have an associated visual input (455) of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker with expressions added from the visual input (455).
Network communications, Web-based services and customized services using the Web-based services may be provided over a peer-to-peer network from a first peer to a second peer (e.g., automobile head unit) wherein the first peer has a separate connection to a more general server-based network such as the Internet. A communications device application based on a peer communications framework component in communication with a peer network stack on the communications device may work as middleware, with a connection to both a more general server-based network such as the Internet and to an external device, such as a head unit of an automobile. Although the communications device has a separate connection out to the Internet via a general network stack co-existing on the same communications device, the peer network stack and the general network stack are not directly connected.
Visual information is used to alter or set an operating parameter of an audio signal processor, other than a beamformer. A digital camera captures visual information about a scene that includes a human speaker and/or a listener. The visual information is analyzed to ascertain information about acoustics of a room. A distance between the speaker and a microphone may be estimated, and this distance estimate may be used to adjust an overall gain of the system. Distances among, and locations of, the speaker, the listener, the microphone, a loudspeaker and/or a sound-reflecting surface may be estimated. These estimates may be used to estimate reverberations within the room and adjust aggressiveness of an anti-reverberation filter, based on an estimated ratio of direct to indirect (reverberated) sound energy expected to reach the microphone. In addition, orientation of the speaker or the listener, relative to the microphone or the loudspeaker, can also be estimated, and this estimate may be used to adjust frequency-dependent filter weights to compensate for uneven frequency propagation of acoustic signals from a mouth, or to a human ear, about a human head.
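One concrete reading of the distance-to-aggressiveness step above: estimate the direct-to-reverberant ratio from the camera-derived speaker-to-microphone distance via the classical critical-distance approximation, then map it onto a filter aggressiveness. The mapping and its endpoints below are assumptions; only the idea of steering the anti-reverberation filter from an estimated direct-to-indirect energy ratio comes from the abstract.

```python
import numpy as np

def anti_reverb_aggressiveness(speaker_mic_distance_m: float, room_volume_m3: float,
                               t60_s: float) -> float:
    """Map a camera-derived speaker-to-microphone distance to an anti-reverberation
    filter aggressiveness in [0, 1] via an estimated direct-to-reverberant ratio (DRR)."""
    # Critical distance d_c ~ 0.057 * sqrt(V / T60): beyond it, reverberant energy dominates.
    critical_distance = 0.057 * np.sqrt(room_volume_m3 / t60_s)
    drr_db = 20.0 * np.log10(critical_distance / max(speaker_mic_distance_m, 1e-3))
    # DRR of +10 dB or better -> filter nearly off; -10 dB or worse -> fully aggressive (assumed mapping).
    return float(np.clip((10.0 - drr_db) / 20.0, 0.0, 1.0))

# Example: a speaker 2 m from the microphone in a 60 m^3 room with T60 = 0.5 s.
print(round(anti_reverb_aggressiveness(2.0, 60.0, 0.5), 2))
```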
A handwriting recognition apparatus facilitates user entry of strokes one on top of another. The apparatus, which includes a processor and a display integrated with a touch sensitive screen, receives a series of strokes via the screen. Each stroke is defined by contact, trace, and lift occurrences. Each stroke appears on the display until occurrence of a prescribed event, and then disappears. The apparatus accumulates strokes into a buffer and interprets all accumulated strokes collectively against a character database and optionally a linguistic database, to identify multiple candidate strings that could be represented by the accumulated strokes. The apparatus displays candidate strings for user selection after all strokes have faded, or after receiving a user submitted delimiter, or after a given delay has elapsed following user entry of the latest stroke. Alternatively, candidate strings are displayed after each stroke without waiting for timeout or explicit delimiter.
G06F 3/041 - Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 3/0354 - Pointing devices displaced or positioned by the user; Accessories therefor with detection of 2D relative movements between the device, or an operating part thereof, and a plane or surface, e.g. 2D mice, trackballs, pens or pucks

57.
Systems and methods for an automated personalized dictionary generator for portable devices
A system and method for automated dictionary population is provided to facilitate the entry of textual material in dictionaries for enhancing word prediction. The automated dictionary population system is useful in association with a mobile device including at least one dictionary which includes entries. The device receives a communication which is parsed and textual data extracted. The text is compared to the entries of the dictionaries to identify new words. Statistical information for the parsed words, including word usage frequency, recency, or likelihood of use, is generated. Profanities may be processed by identifying profanities, modifying the profanities, and asking the user to provide feedback. Phrases are identified by phrase markers. Lastly, the new words are stored in a supplementary word list as single words or by linking the words of the identified phrases to preserve any phrase relationships. Likewise, the statistical information may be stored.
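A minimal sketch of this population flow, under invented placeholder lists for profanities and phrase markers, might look as follows: it parses a message, finds words missing from the existing dictionaries, masks profanities, links phrase words, and records frequency and recency statistics.

```python
# Sketch (not the patented implementation) of dictionary population:
# parse an incoming message, find words missing from the existing
# dictionaries, mask profanities, and store new words with usage statistics.
# The profanity list and phrase markers are placeholder assumptions.
import re
import time

PROFANITIES = {"darn"}                        # placeholder list
PHRASE_MARKERS = {"kind of", "a lot"}         # placeholder multi-word phrases

def populate(message: str, dictionaries: list[set[str]], supplementary: dict):
    words = re.findall(r"[a-z']+", message.lower())
    known = set().union(*dictionaries) if dictionaries else set()
    now = time.time()
    text = " ".join(words)
    # Link phrase words so the phrase relationship is preserved.
    phrases = [p for p in PHRASE_MARKERS if p in text]
    candidates = [w for w in words if w not in known] + [p.replace(" ", "_") for p in phrases]
    for w in candidates:
        if w in PROFANITIES:
            w = w[0] + "*" * (len(w) - 1)     # modify; a real system would ask for user feedback
        entry = supplementary.setdefault(w, {"count": 0, "last_used": 0.0})
        entry["count"] += 1                   # frequency statistic
        entry["last_used"] = now              # recency statistic
    return supplementary

if __name__ == "__main__":
    supp = {}
    populate("that quokka photo was kind of great", [{"that", "was", "great", "photo"}], supp)
    print(supp)
```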
A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
Software, firmware, and systems are described for identifying characters in a handwritten input received from a user on an input device, irrespective of an angle that the input is received at. In one implementation, the system establishes an anchor point and distances from the anchor point to reference support lines. A set of candidate characters is identified based on received handwritten input. The system estimates support lines for each of the candidate characters. The system ranks the candidate characters based on a total deviation measurement from the expectation for each candidate, where the expectation in part is based on the established distance from the established anchor point to reference support lines, and identifies a best-ranked candidate based at least in part on a smallest total deviation measurement.
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06K 9/62 - Methods or arrangements for recognition using electronic means
60.
System and method for synthetic voice generation and modification
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
H04B 7/04 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
H04B 7/06 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units, for each respective speech unit in each ordered list in the set of ordered lists, constructs a sublist of speech units from a next ordered list which are suitable for concatenation, performs a cost analysis of paths through the set of ordered lists of speech units based on the sublist of speech units for each respective speech unit, and synthesizes speech using a lowest cost path of speech units through the set of ordered lists based on the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units which do not have an assigned pitch can be assigned a pitch.
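The lowest-cost path selection can be pictured as a small dynamic-programming search in which each unit only considers a short sublist of suitable successors from the next ordered list. The sketch below is a toy version with a pitch-continuity join cost only; the cost function, sublist rule, and data structures are assumptions, not the patent's.

```python
# Hedged sketch of a lowest-cost path through ordered lists of speech units:
# each unit keeps only a short sublist of candidate successors, and dynamic
# programming tracks the cheapest cumulative path. Toy cost: pitch continuity.
def cheapest_unit_path(ordered_lists, join_cost, max_sublist=3):
    """ordered_lists: list of lists of units (e.g., dicts with a 'pitch' key)."""
    # best[j] maps unit index in the current list -> (cumulative cost, path of indices)
    best = {j: (0.0, [j]) for j in range(len(ordered_lists[0]))}
    for i in range(len(ordered_lists) - 1):
        nxt = {}
        for j, (cost, path) in best.items():
            unit = ordered_lists[i][j]
            # Sublist of successors deemed suitable for concatenation:
            candidates = sorted(range(len(ordered_lists[i + 1])),
                                key=lambda k: join_cost(unit, ordered_lists[i + 1][k]))[:max_sublist]
            for k in candidates:
                c = cost + join_cost(unit, ordered_lists[i + 1][k])
                if k not in nxt or c < nxt[k][0]:
                    nxt[k] = (c, path + [k])
        best = nxt
    return min(best.values(), key=lambda t: t[0])

if __name__ == "__main__":
    lists = [[{"pitch": p} for p in row] for row in ([110, 120, 140], [115, 150], [118, 130])]
    cost, path = cheapest_unit_path(lists, lambda a, b: abs(a["pitch"] - b["pitch"]))
    print("cost:", cost, "chosen unit indices:", path)
```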
The system and method described herein may dynamically generate a recognition grammar associated with a conversational voice user interface in an integrated voice navigation services environment. In particular, in response to receiving a natural language utterance that relates to a navigation context at the voice user interface, a conversational language processor may generate a dynamic recognition grammar that organizes grammar information based on one or more topological domains. For example, the one or more topological domains may be determined based on a current location associated with a navigation device, whereby a speech recognition engine may use the grammar information organized in the dynamic recognition grammar according to the one or more topological domains to generate one or more interpretations associated with the natural language utterance.
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
H04B 1/00 - Details of transmission systems, not covered by a single one of groups; Details of transmission systems not characterised by the medium used for transmission
H04B 1/10 - Means associated with receiver for limiting or suppressing noise or interference
A system and method for receiving character input from a user includes a programmed processor that receives inputs from the user and disambiguates the inputs to present character sequence choices corresponding to the input characters. In one embodiment, a first character input is received and a corresponding first recognized character is stored in a temporary storage buffer and displayed to the user for editing. After a predetermined number of subsequent input characters and/or predetermined amount of time without being edited, the system determines that the first recognized character is the intended character input by the user and removes the first recognized character from the buffer, thereby inhibiting future editing.
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
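A minimal sketch of the buffered-commit behaviour described in this entry, assuming a fixed commit-after count (a time-based rule could be added the same way); the class and parameter names are invented.

```python
# Sketch: a recognised character stays editable until a fixed number of further
# characters arrive, after which it is removed from the buffer and locked.
from collections import deque

class CommitBuffer:
    def __init__(self, commit_after: int = 3):
        self.commit_after = commit_after      # characters to wait before locking
        self.pending = deque()                # still-editable characters
        self.committed = []                   # characters no longer editable

    def add(self, recognised_char: str):
        self.pending.append(recognised_char)
        while len(self.pending) > self.commit_after:
            self.committed.append(self.pending.popleft())   # lock the oldest

    def edit(self, index_from_end: int, new_char: str) -> bool:
        """Edit a still-pending character; returns False if it is already locked."""
        if index_from_end >= len(self.pending):
            return False
        self.pending[-1 - index_from_end] = new_char
        return True

    def text(self) -> str:
        return "".join(self.committed) + "".join(self.pending)

buf = CommitBuffer(commit_after=2)
for ch in "hallo":
    buf.add(ch)
buf.edit(index_from_end=3, new_char="e")      # too old: 'a' is already committed
print(buf.text())                             # -> "hallo" (the early 'a' could no longer be edited)
```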
66.
Text browsing, editing and correction methods for automotive applications
An automotive text display arrangement is described which includes a driver text display positioned directly in front of an automobile driver and displaying a limited amount of text to the driver without impairing forward visual attention of the driver. The arrangement may include a boundary insertion mode wherein when the active text position is an active text boundary, new text is inserted between the text items separated by the active text boundary, and when the active text position is an active text item, new text replaces the active text item. In addition or alternatively, there may be a multifunctional text control knob offering multiple different user movements, each performing an associated text processing function.
Methods and apparatus for reducing impulsive interferences in a signal, without necessarily ascertaining a pitch frequency in the signal, detect onsets of the impulsive interferences by searching a spectrum of high-energy components for large temporal derivatives that are correlated along frequency and extend from a very low frequency up to, possibly, several kHz. The energies of the impulsive interferences are estimated, and these estimates are used to suppress the impulsive interferences. Optionally, techniques are employed to protect desired speech signals from being corrupted as a result of the suppression of the impulsive interferences.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
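The onset detection described in this entry can be approximated by looking for frames whose low-frequency spectral energy jumps sharply and coherently across many bins at once. The following toy detector, with assumed frame sizes and thresholds, is only illustrative.

```python
# Illustrative sketch (toy parameters, not the patent's detector): flag frames
# whose spectral energy jumps sharply and coherently across the low-frequency
# bins, the signature of an impulsive interference onset.
import numpy as np

def detect_impulse_onsets(x, fs, frame=256, hop=128, max_freq_hz=3000.0,
                          jump_db=10.0, min_fraction=0.6):
    win = np.hanning(frame)
    frames = [x[i:i + frame] * win for i in range(0, len(x) - frame, hop)]
    spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(np.array(frames), axis=1)) + 1e-12)
    n_low = int(max_freq_hz / (fs / frame))          # bins from DC up to max_freq_hz
    deriv = np.diff(spec_db[:, :n_low], axis=0)      # temporal derivative per bin
    # Onset: a large positive jump that is correlated along frequency.
    fraction = (deriv > jump_db).mean(axis=1)
    return np.where(fraction > min_fraction)[0] + 1  # frame indices of onsets

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = 0.05 * np.sin(2 * np.pi * 220 * t)            # quiet tone
    x[8000:8000 + 64] += np.random.randn(64)          # broadband click
    print("onset frames:", detect_impulse_onsets(x, fs))
```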
An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses prediction capability to provide an automatic acoustic driven template versus model decision maker with an output quality that is high, stable and depends gradually on the system footprint. In speaker selection for a statistical Text-to-Speech synthesis (TTS) system build, as another example context, an embodiment according to the invention enables a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation, based on a small amount of recorded speech material.
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups specially adapted for particular use
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
69.
Using emoticons for contextual text-to-speech expressivity
Techniques disclosed herein include systems and methods that improve audible emotional characteristics used when synthesizing speech from a text source. Systems and methods herein use emoticons identified from a source text to provide contextual text-to-speech expressivity. In general, techniques herein analyze text and identify emoticons included within the text. The source text is then tagged with corresponding mood indicators. For example, if the system identifies an emoticon at the end of a sentence, then the system can infer that this sentence has a specific tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer a tone or mood from the various emoticons and then change or modify the expressivity of the TTS output, such as by changing intonation, prosody, speed, pauses, and other expressivity characteristics.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
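A minimal, hypothetical mapping from emoticons to prosody settings in the spirit of this approach; the emoticon table and SSML-style attribute values are invented, not taken from any particular TTS product.

```python
# Sketch: map an emoticon found in a sentence to prosody settings and wrap the
# sentence in an SSML-style prosody tag. All mappings here are assumptions.
MOOD_PROSODY = {
    ":)":  {"rate": "medium", "pitch": "+10%"},    # cheerful
    ">:(": {"rate": "fast",   "pitch": "+5%"},     # angry
    ":(":  {"rate": "slow",   "pitch": "-10%"},    # sad
    ":D":  {"rate": "fast",   "pitch": "+15%"},    # laughing
}

def tag_sentence_with_mood(sentence: str) -> str:
    for emoticon, prosody in MOOD_PROSODY.items():
        if emoticon in sentence:
            text = sentence.replace(emoticon, "").strip()
            attrs = " ".join(f'{k}="{v}"' for k, v in prosody.items())
            return f"<prosody {attrs}>{text}</prosody>"
    return sentence                                # no emoticon: leave expressivity unchanged

print(tag_sentence_with_mood("See you at the meeting tomorrow :)"))
```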
70.
Systems, methods and articles for a server providing communications and services involving automobile head units
Network communications, Web-based services and customized services using the Web-based services may be provided to drivers and users via the automobile head unit in the vehicle and via their mobile device. The automobile head unit in the vehicle and the mobile device are communicatively linked via a short range wireless connection. Also, these devices may communicate over a network such as a cellular network to a service provider that provides entertainment and informational services to the mobile device and the head unit of the vehicle. The user's profile and preferences are able to follow the user to various locations and into vehicles because this information is stored at a server accessible by the user's mobile device, and in some embodiments, also the head unit. The mobile device may provide services to the head unit over the short range wireless connection if the head unit does not have wider network connectivity.
H04L 29/08 - Transmission control procedure, e.g. data link level control procedure
B60K 35/00 - Instruments specially adapted for vehicles; Arrangement of instruments in or on vehicles
B60R 16/023 - Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for; electric; for transmission of signals between vehicle parts or subsystems
71.
Systems, methods and articles for a communications device providing communications and services involving automobile head units
Network communications, Web-based services and customized services using the Web-based services may be provided to drivers and users via the automobile head unit in the vehicle and via their mobile device. The automobile head unit in the vehicle and the mobile device are communicatively linked via a short range wireless connection. Also, these devices may communicate over a network such as a cellular network to a service provider that provides entertainment and informational services to the mobile device and the head unit of the vehicle. The user's profile and preferences are able to follow the user to various locations and into vehicles because this information is stored at a server accessible by the user's mobile device, and in some embodiments, also the head unit. The mobile device may provide services to the head unit over the short range wireless connection if the head unit does not have wider network connectivity.
An embodiment of the invention is a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio. Provisions are provided to create, view, play, and edit the synthesized speech, including editing pitch and duration targets, speaking type, paralinguistic events, and prosody. Prosody can be provided by way of a sample recording. Users can interact with the software tool by way of a graphical user interface (GUI). The software tool can produce synthesized audio file output in many file formats.
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
73.
Interchangeable input modules associated with varying languages
Interchangeable input modules, such as keypads, having user input devices configured to mate with base devices are described herein. The user input devices may include pluralities of inputs, such as input keys, associated with languages. The interchangeable input modules may further include storage components configured to store configuration data, linguistic structures, and/or predictive logic. Additionally, the interchangeable input modules may have interfaces configured to electrically couple the interchangeable input modules to the base devices after the interchangeable input modules are mated with the base devices.
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
74.
Method and apparatus for generating synthetic speech with contrastive stress
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 13/10 - Prosody rules derived from text; Stress or intonation
75.
Method and apparatus for combining text to speech and recorded prompts
An arrangement provides for improved synthesis of speech arising from a message text. The arrangement stores prerecorded prompts and speech related characteristics for those prompts. A message is parsed to determine if any message portions have been recorded previously. If so, speech related characteristics for those portions are retrieved. The arrangement generates speech related characteristics for those portions not previously stored. The retrieved and generated characteristics are combined. The combination of characteristics is then used as the input to a speech synthesizer.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
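A rough sketch, with invented data structures, of the combine step: characteristics for message portions that match prerecorded prompts are retrieved from a store, characteristics for the remaining portions are generated, and the concatenation drives the synthesizer.

```python
# Sketch only: the prompt store, characteristic fields, and the stand-in
# generate_characteristics() function are assumptions for illustration.
PROMPT_STORE = {
    "your balance is": {"source": "recorded", "duration_ms": 820, "f0_hz": 180},
    "dollars":         {"source": "recorded", "duration_ms": 410, "f0_hz": 170},
}

def generate_characteristics(text: str) -> dict:
    # Placeholder for the TTS front end that produces characteristics for unseen text.
    return {"source": "generated", "duration_ms": 90 * len(text.split()), "f0_hz": 175}

def characteristics_for_message(message: str):
    parts, remaining = [], message.lower()
    for prompt, chars in PROMPT_STORE.items():      # parse: find previously recorded portions
        if prompt in remaining:
            before, remaining = remaining.split(prompt, 1)
            if before.strip():
                parts.append(generate_characteristics(before))
            parts.append(chars)
    if remaining.strip():
        parts.append(generate_characteristics(remaining))
    return parts                                     # combined input for the synthesizer

print(characteristics_for_message("Your balance is forty two dollars"))
```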
76.
Technology for entering data using patterns of relative directions
A method and apparatus for entering words into a computer system. Letters contained in a desired word are entered by giving approximate location and directional information relative to any specified keyboard layout. The inputs need not correspond to specific keys on the keyboard; a sequence of ambiguous key entries corresponding to an individual word can be used to retrieve that word from the dictionary. The system tracks directional information of movement relative to the specified keyboard layout, reduces it to predetermined primary directions, and translates this seemingly ambiguous information into accurate words from the dictionary. The system may also capture the user's intention (with regard to text entry) by observing the movements on the keyboard.
Waveform concatenation speech synthesis with high sound quality. Prosody with both high accuracy and high sound quality is achieved by performing a two-path search including a speech segment search and a prosody modification value search. An accurate accent is secured by evaluating the consistency of the prosody by using a statistical model of prosody variations (the slope of fundamental frequency) for both of the two paths of the speech segment selection and the modification value search. In the prosody modification value search, a prosody modification value sequence that minimizes a modified prosody cost is searched for. This allows a search for a modification value sequence that makes the likelihood of the absolute values or variations of the prosody under the statistical model as high as possible with minimal modification values.
Network communications, Web-based services and customized services using the Web-based services may be provided over a peer-to-peer network from a first peer to a second peer (e.g., automobile head unit) wherein the first peer has a separate connection to a more general server-based network such as the Internet. A communications device application based on a peer communications framework component in communication with a peer network stack on the communications device may work as middleware, with a connection to both a more general server-based network such as the Internet and to an external device, such as a head unit of an automobile. Although the communications device has a separate connection out to the Internet via a general network stack co-existing on the same communications device, the peer network stack and the general network stack are not directly connected.
G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
H04W 88/06 - Terminal devices adapted for operation in multiple networks, e.g. multi-mode terminals
H04W 88/04 - Terminal devices adapted for relaying to or from another terminal or user
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
The present invention relates to a signal processing system comprising a number of microphones and loudspeakers; a hands-free set configured to receive a telephone signal from a remote party and to transmit a microphone signal supplied by at least one of the microphones to the remote party; and an in-vehicle communication system configured to receive a microphone signal supplied by at least one of the microphones, receive the telephone signal, amplify the microphone signal to obtain at least one first output signal, and output the at least one first output signal and/or a second output signal corresponding to the telephone signal to at least one of the loudspeakers. The signal processing system is configured to detect speech activity in the telephone signal and to control the in-vehicle communication system to reduce amplification of the microphone signal by a damping factor if speech activity is detected in the telephone signal.
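A simplified frame-based sketch of this ducking rule, with a toy energy VAD and assumed gain values: while speech activity is detected in the telephone signal, the in-vehicle communication amplification is reduced by a damping factor.

```python
# Sketch only: frame sizes, gain values, and the energy VAD threshold are
# assumptions, not the patented configuration.
import numpy as np

def process_frames(mic_frames, phone_frames, icc_gain=2.0, damping=0.25,
                   vad_threshold=0.01):
    outputs = []
    for mic, phone in zip(mic_frames, phone_frames):
        far_end_active = np.mean(phone ** 2) > vad_threshold   # crude energy VAD
        gain = icc_gain * damping if far_end_active else icc_gain
        outputs.append(gain * mic + phone)          # reinforced cabin speech + telephone signal
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mic = [0.1 * rng.standard_normal(160) for _ in range(3)]
    phone = [np.zeros(160), 0.5 * rng.standard_normal(160), np.zeros(160)]
    for i, out in enumerate(process_frames(mic, phone)):
        print(f"frame {i}: output rms = {np.sqrt(np.mean(out**2)):.3f}")
```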
A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
Embodiments of the present invention exploit redundancy of succeeding FFT spectra and use this redundancy for computing interpolated temporal supporting points. An analysis filter bank converts overlapped sequences of an audio (ex. loudspeaker) signal from a time domain to a frequency domain to obtain a time series of short-time loudspeaker spectra. An interpolator temporally interpolates this time series. The interpolation is fed to an echo canceller, which computes an estimated echo spectrum. A microphone analysis filter bank converts overlapped sequences of an audio microphone signal from the time domain to the frequency domain to obtain a time series of short-time microphone spectra. The estimated echo spectrum is subtracted from the microphone spectrum. Further signal enhancement (filtration) may be applied. A synthesis filter bank converts the filtered microphone spectra to the time domain to generate an echo compensated audio microphone signal. Computational complexity of signal processing systems can, therefore, be reduced.
G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
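The interpolation of temporal supporting points described in this entry can be illustrated by computing short-time spectra at a coarse hop and inserting linearly interpolated spectra between succeeding frames instead of performing extra FFTs. The sketch below shows only that interpolation step; frame sizes are arbitrary and the echo canceller itself is omitted.

```python
# Toy sketch of interpolated temporal supporting points between succeeding
# short-time spectra; parameters and the single-sided linear interpolation
# are assumptions for illustration, not the patented filter-bank design.
import numpy as np

def short_time_spectra(x, frame=256, hop=256):
    win = np.hanning(frame)
    idx = range(0, len(x) - frame + 1, hop)
    return np.array([np.fft.rfft(win * x[i:i + frame]) for i in idx])

def interpolate_spectra(spectra):
    """Insert one interpolated spectrum between each pair of succeeding spectra."""
    mids = 0.5 * (spectra[:-1] + spectra[1:])
    out = np.empty((2 * len(spectra) - 1, spectra.shape[1]), dtype=complex)
    out[0::2] = spectra
    out[1::2] = mids
    return out

if __name__ == "__main__":
    fs, f = 16000, 300.0
    loudspeaker = np.sin(2 * np.pi * f * np.arange(4096) / fs)
    coarse = short_time_spectra(loudspeaker)              # hop = frame: few FFTs
    fine = interpolate_spectra(coarse)                     # doubled temporal resolution
    print("coarse frames:", coarse.shape[0], "-> interpolated frames:", fine.shape[0])
```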
83.
Systems and methods for character correction in communication devices
Systems and methods for character error correction are provided, useful for a user of mobile appliances to produce written text with reduced errors. The system includes an interface, a word prediction engine, a statistical engine, an editing distance calculator, and a selector. A string of characters, known as the inputted word, may be entered into the mobile device via the interface. The word prediction engine may then generate word candidates similar to the inputted word using fuzzy logic and user preferences generated from past user behavior. The statistical engine may then generate variable error costs determined by the probability of erroneously inputting any given character. The editing distance calculator may then determine the editing distance between the inputted word and each of the word candidates by grid comparison using the variable error costs. The selector may choose one or more preferred candidates from the word candidates using the editing distances.
G06F 17/27 - Automatic analysis, e.g. parsing, orthograph correction
G06K 9/72 - Methods or arrangements for recognition using electronic means using context analysis based on the provisionally recognised identity of a number of successive patterns, e.g. a word
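The variable-cost editing distance can be sketched as a weighted Levenshtein computation in which substituting easily confused characters is cheaper than substituting unrelated ones. The confusion table and costs below are tiny assumptions for illustration.

```python
# Sketch: grid (dynamic-programming) comparison with variable substitution
# costs; a candidate that differs only by a plausible slip ranks closer.
CONFUSION_COST = {("e", "r"): 0.3, ("r", "e"): 0.3, ("n", "m"): 0.3, ("m", "n"): 0.3}

def weighted_edit_distance(typed: str, candidate: str,
                           insert_cost=1.0, delete_cost=1.0, default_sub=1.0):
    n, m = len(typed), len(candidate)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * delete_cost
    for j in range(1, m + 1):
        d[0][j] = j * insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if typed[i - 1] == candidate[j - 1] else \
                  CONFUSION_COST.get((typed[i - 1], candidate[j - 1]), default_sub)
            d[i][j] = min(d[i - 1][j] + delete_cost,
                          d[i][j - 1] + insert_cost,
                          d[i - 1][j - 1] + sub)
    return d[n][m]

candidates = ["hello", "hallo", "hollow"]
typed = "hrllo"
ranked = sorted(candidates, key=lambda w: weighted_edit_distance(typed, w))
print([(w, weighted_edit_distance(typed, w)) for w in ranked])
```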
84.
Systems and methods for an automated personalized dictionary generator for portable devices
A system and method for automated dictionary population is provided to facilitate the entry of textual material in dictionaries for enhancing word prediction. The automated dictionary population system is useful in association with a mobile device including at least one dictionary which includes entries. The device receives a communication which is parsed and textual data extracted. The text is compared to the entries of the dictionaries to identify new words. Statistical information for the parsed words, including word usage frequency, recency, or likelihood of use, is generated. Profanities may be processed by identifying profanities, modifying the profanities, and asking the user to provide feedback. Phrases are identified by phrase markers. Lastly, the new words are stored in a supplementary word list as single words or by linking the words of the identified phrases to preserve any phrase relationships. Likewise, the statistical information may be stored.
Embodiments of the present invention exploit redundancy of succeeding FFT spectra and use this redundancy for computing interpolated temporal supporting points. An analysis filter bank converts overlapped sequences of an audio (ex. loudspeaker) signal from a time domain to a frequency domain to obtain a time series of short-time loudspeaker spectra. An interpolator temporally interpolates this time series. The interpolation is fed to an echo canceller, which computes an estimated echo spectrum. A microphone analysis filter bank converts overlapped sequences of an audio microphone signal from the time domain to the frequency domain to obtain a time series of short-time microphone spectra. The estimated echo spectrum is subtracted from the microphone spectrum. Further signal enhancement (filtration) may be applied. A synthesis filter bank converts the filtered microphone spectra to the time domain to generate an echo compensated audio microphone signal. Computational complexity of signal processing systems can, therefore, be reduced.
G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 19/02 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
86.
In-car communication system for multiple acoustic zones
An In-Car Communication (ICC) system supports the communication paths within a car by receiving the speech signals of a speaking passenger and playing them back for one or more listening passengers. Signal processing tasks are split into a microphone related part and a loudspeaker related part. A sound processing system suitable for use in a vehicle having multiple acoustic zones includes a plurality of microphone In-Car Communication (Mic-ICC) instances and a plurality of loudspeaker In-Car Communication (Ls-ICC) instances. The system further includes a dynamic audio routing matrix with a controller, coupled to the Mic-ICC instances, a mixer coupled to the plurality of Mic-ICC instances, and a distributor coupled to the Ls-ICC instances.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
H04R 5/02 - Spatial or constructional arrangements of loudspeakers
H04R 1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
H04R 3/02 - Circuits for transducers for preventing acoustic reaction
H04R 3/12 - Circuits for transducers for distributing signals to two or more loudspeakers
87.
System and method for information identification using tracked preferences of a user
A system and a method of retrieving information are described. In a system according to the invention, software modules may be used to provide the user with information that is most likely to be the information desired.
G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
G06F 17/30 - Information retrieval; Database structures therefor
G06F 17/27 - Automatic analysis, e.g. parsing, orthograph correction
H04M 1/2745 - Devices whereby a plurality of signals may be stored simultaneously with provision for storing more than one subscriber number at a time using static electronic memories, e.g. chips
A text message processing arrangement is described for use in a mobile environment. A mobile messaging application processes user text messages during a user messaging session. A user state model reflects situational parameters to characterize user cognitive load. A functionality control module adjusts functional performance of the mobile messaging application based on the user state model.
H04M 1/00 - Substation equipment, e.g. for use by subscribers
H04W 48/04 - Access restriction performed under specific conditions based on user or terminal location or mobility data, e.g. moving direction or speed
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/10 - Prosody rules derived from text; Stress or intonation
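A minimal caching sketch in the spirit of this system: each prosodically meaningful text section is keyed by a hash, and if the same section reappears the cached audio is reused instead of being re-synthesised. The synthesize() function stands in for the real TTS-server round trip.

```python
# Sketch only: the hashing scheme and stand-in synthesize() call are assumptions.
import hashlib

_cache: dict[str, bytes] = {}
_calls = 0

def synthesize(phrase: str) -> bytes:
    global _calls
    _calls += 1                      # pretend this is the expensive TTS-server round trip
    return f"<audio for: {phrase}>".encode()

def audio_for(phrase: str) -> bytes:
    key = hashlib.sha1(phrase.strip().lower().encode()).hexdigest()   # unique identifier
    if key not in _cache:
        _cache[key] = synthesize(phrase)
    return _cache[key]

for section in ["Welcome back.", "You have two new messages.", "Welcome back."]:
    audio_for(section)
print("sections requested: 3, synthesis calls made:", _calls)   # -> 2
```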
90.
Method for determining a noise reference signal for noise compensation and/or noise reduction
The invention provides a method for determining a noise reference signal for noise compensation and/or noise reduction. A first audio signal on a first signal path and a second audio signal on a second signal path are received. The first audio signal is filtered using a first adaptive filter to obtain a first filtered audio signal. The second audio signal is filtered using a second adaptive filter to obtain a second filtered audio signal. The first and the second filtered audio signal are combined to obtain the noise reference signal. The first and the second adaptive filter are adapted such as to minimize a wanted signal component in the noise reference signal.
G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
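A much-simplified sketch of forming such a noise reference: here the second filter is held fixed as an identity path and only the first is adapted with NLMS so that the wanted-speech component cancels in the reference. This is a constrained special case used purely for illustration, not the claimed two-adaptive-filter arrangement.

```python
# Sketch: NLMS adaptation drives the reference toward zero for the correlated
# (wanted) speech component, leaving mostly the uncorrelated noise.
import numpy as np

def noise_reference(x1, x2, taps=16, mu=0.5, eps=1e-6):
    w = np.zeros(taps)                       # first (adaptive) filter
    ref = np.zeros(len(x1))
    for n in range(taps, len(x1)):
        frame = x1[n - taps:n][::-1]         # most recent sample first
        ref[n] = x2[n] - w @ frame           # second path held as identity (fixed)
        w += mu * ref[n] * frame / (frame @ frame + eps)   # NLMS: minimise wanted signal in ref
    return ref

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    speech = rng.standard_normal(4000)       # stands in for the wanted signal
    x1 = speech
    x2 = 0.8 * np.roll(speech, 2) + 0.1 * rng.standard_normal(4000)   # same speech + local noise
    ref = noise_reference(x1, x2)
    print("power in x2: %.3f, residual power in noise reference: %.3f"
          % (np.var(x2), np.var(ref[2000:])))
```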
A speech output is generated from a text input written in a first language and containing inclusions in a second language. Words in the native language are pronounced with a native pronunciation and words in the foreign language are pronounced with a proficient foreign pronunciation. Language dependent phoneme symbols generated for words of the second language are replaced with language dependent phoneme symbols of the first language, where said replacing includes the steps of assigning to each language dependent phoneme symbol of the second language a language independent target phoneme symbol, mapping to each one language independent target phoneme symbol a language independent substitute phoneme symbol assignable to a language dependent substitute phoneme symbol of the first language, substituting the language dependent phoneme symbols of the second language by the language dependent substitute phoneme symbols of the first language.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G06F 17/28 - Processing or translating of natural language
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
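A toy illustration of the substitution chain for an English inclusion in a German carrier text: second-language phoneme to language-independent target, to language-independent substitute, to first-language phoneme. All symbol tables are invented stand-ins rather than a real phoneme inventory.

```python
# Sketch only: the mapping tables below are hypothetical.
L2_TO_INDEPENDENT = {"EN:th": "dental_fricative", "EN:w": "labiovelar_glide",
                     "EN:ae": "open_front_vowel"}
INDEPENDENT_SUBSTITUTE = {"dental_fricative": "alveolar_stop",
                          "labiovelar_glide": "labiodental_fricative",
                          "open_front_vowel": "open_mid_front_vowel"}
SUBSTITUTE_TO_L1 = {"alveolar_stop": "DE:d", "labiodental_fricative": "DE:v",
                    "open_mid_front_vowel": "DE:E"}

def map_inclusion_phonemes(l2_phonemes):
    """Replace second-language phoneme symbols with first-language substitutes."""
    result = []
    for p in l2_phonemes:
        target = L2_TO_INDEPENDENT.get(p)
        if target is None:
            result.append(p)                           # shared symbol: keep as is
            continue
        substitute = INDEPENDENT_SUBSTITUTE[target]
        result.append(SUBSTITUTE_TO_L1[substitute])
    return result

print(map_inclusion_phonemes(["EN:th", "EN:ae", "EN:w"]))   # -> ['DE:d', 'DE:E', 'DE:v']
```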
92.
System and method for dynamic noise adaptation for robust automatic speech recognition
A speech processing method and arrangement are described. A dynamic noise adaptation (DNA) model characterizes a speech input reflecting effects of background noise. A null noise DNA model characterizes the speech input reflecting a null noise mismatch condition. A DNA interaction model performs Bayesian model selection and re-weighting of the DNA model and the null noise DNA model to realize a modified DNA model characterizing the speech input for automatic speech recognition and compensating for noise to a varying degree depending on relative probabilities of the DNA model and the null noise DNA model.
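A schematic sketch of the re-weighting idea: two models score the same observation, their likelihoods are turned into posterior weights, and the combined estimate leans toward whichever model explains the frame better. The Gaussian observation models below are placeholders for the actual DNA models.

```python
# Sketch: Bayesian re-weighting of two competing models; the Gaussian scores
# and the scalar "estimate" are illustrative assumptions only.
import math

def gaussian_loglik(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def reweighted_estimate(x, dna_mean, dna_var, null_mean, null_var, prior_dna=0.5):
    ll_dna = gaussian_loglik(x, dna_mean, dna_var)
    ll_null = gaussian_loglik(x, null_mean, null_var)
    # Posterior model probabilities from likelihoods and priors.
    num = prior_dna * math.exp(ll_dna)
    den = num + (1 - prior_dna) * math.exp(ll_null)
    w = num / den
    # Compensate for noise to a varying degree depending on the relative probabilities.
    return w * dna_mean + (1 - w) * null_mean, w

estimate, weight = reweighted_estimate(x=4.2, dna_mean=5.0, dna_var=1.0,
                                       null_mean=0.0, null_var=1.0)
print(f"combined estimate: {estimate:.2f}  (weight on noise-adaptation model: {weight:.3f})")
```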
A handwriting recognition apparatus facilitates user entry of strokes one on top of another. The apparatus, which includes a processor and a display integrated with a touch sensitive screen, receives a series of strokes via the screen. Each stroke is defined by contact, trace, and lift occurrences. Each stroke appears on the display until occurrence of a prescribed event, and then disappears. The apparatus accumulates strokes into a buffer and interprets all accumulated strokes collectively against a character database and optionally a linguistic database, to identify multiple candidate strings that could be represented by the accumulated strokes. The apparatus displays candidate strings for user selection after all strokes have faded, or after receiving a user submitted delimiter, or after a given delay has elapsed following user entry of the latest stroke. Alternatively, candidate strings are displayed after each stroke without waiting for timeout or explicit delimiter.
G06F 3/041 - Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
94.
Efficient audio signal processing in the sub-band regime
A signal processing system enhances an audio signal. The audio signal is divided into audio sub-band signals. Some audio sub-band signals are excised. Other audio sub-band signals are processed to obtain enhanced audio sub-band signals. At least a portion of the excised audio sub-band signals are reconstructed. The reconstructed audio sub-band signals are synthesized with the enhanced audio sub-band signals to form an enhanced audio signal.
H04B 3/20 - Reducing echo effects or singing; Opening or closing transmitting path; Conditioning for transmission in one direction or the other
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
A method of selecting a service and inputting information to that service, in which an input device having keys is provided. When a key is pressed and released quickly, the user indicates a desire to enter a symbol on the key in order to enter symbols of an entry string. In addition, one or more of the keys may also be used to identify a service and also supply that service with the entry string. For example, by pressing and holding such a key, the entry string may be delimited and then sent to a service corresponding to the pressed key. In this manner, a single key press may be used to both delimit an entry string and also send the entry string to the service. The service may use the delimited entry string to retrieve information, which is then supplied to the input device.
A computer receives user entry of a sequence of keypresses, representing an intended series of letters collectively spelling out some or all of a desired textual object. Resolution of the intended series of letters and the desired textual object is ambiguous, however, because some or all of the keypresses individually represent multiple letters. The computer interprets the keypresses utilizing concurrent, competing strategies, including one-keypress-per-letter and multi-tap interpretations. The computer displays a combined output of proposed interpretations and completions from both strategies.
H03K 17/94 - Electronic switching or gating, i.e. not by contact-making and -breaking characterised by the way in which the control signals are generated
H03M 11/00 - Coding in connection with keyboards or like devices, i.e. coding of the position of operated keys
G06F 3/02 - Input arrangements using manually operated switches, e.g. using keyboards or dials
G09G 5/00 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
G06F 13/12 - Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
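The two competing strategies can be sketched by reading the same keypress sequence both as one key per letter (disambiguated against a word list, with longer completions allowed) and as multi-tap. The 12-key map and vocabulary below are minimal assumptions.

```python
# Sketch: concurrent one-keypress-per-letter and multi-tap interpretations of
# the same ambiguous keypress sequence; layout and vocabulary are assumptions.
KEY_LETTERS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl", "6": "mno",
               "7": "pqrs", "8": "tuv", "9": "wxyz"}
VOCABULARY = ["cat", "bat", "act", "cab", "car"]

def one_press_per_letter(keys: str):
    """Every word whose letters match the ambiguous key sequence, plus completions."""
    def key_of(ch): return next(k for k, ls in KEY_LETTERS.items() if ch in ls)
    return [w for w in VOCABULARY
            if len(w) >= len(keys) and all(key_of(w[i]) == k for i, k in enumerate(keys))]

def multi_tap(keys: str):
    """Consecutive presses of the same key cycle through that key's letters."""
    out, i = [], 0
    while i < len(keys):
        j = i
        while j < len(keys) and keys[j] == keys[i]:
            j += 1
        letters = KEY_LETTERS[keys[i]]
        out.append(letters[(j - i - 1) % len(letters)])
        i = j
    return "".join(out)

presses = "228"
print("one-press-per-letter proposals:", one_press_per_letter(presses))
print("multi-tap interpretation:", multi_tap(presses))
```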
An arrangement is described for measuring performance characteristics of a hands free telephone system. There is a measurement system which is coupleable over a telephone audio interface directly to the hands free telephone system for measuring the performance characteristics.
User input is received, specifying a continuous traced path across a keyboard presented on a touch sensitive display. An input sequence is resolved, including traced keys and auxiliary keys proximate to the traced keys by prescribed criteria. For each of one or more candidate entries of a prescribed vocabulary, a set-edit-distance metric is computed between said input sequence and the candidate entry. Various rules specify when penalties are imposed, or not, in computing the set-edit-distance metric. Candidate entries are ranked and displayed according to the computed metric.
G09G 5/00 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 17/27 - Automatic analysis, e.g. parsing, orthograph correction
99.
Speech enhancement techniques on the power spectrum
The method provides a spectral speech description to be used for synthesis of a speech utterance, where at least one spectral envelope input representation is received. In one solution the improvement is made by manipulating an extremum, i.e. a peak or a valley, in the rapidly varying component of the spectral envelope representation. The rapidly varying component of the spectral envelope representation is manipulated to sharpen and/or accentuate extrema, after which it is merged back with the slowly varying component of the spectral envelope input representation to create an enhanced spectral envelope final representation. In other solutions a complex spectrum envelope final representation is created with phase information derived from one of the group delay representation of a real spectral envelope input representation corresponding to a short-time speech signal and a transformed phase component of the discrete complex frequency domain input representation corresponding to the speech utterance.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
A custom-content audible representation of selected data content is automatically created for a user. The content is based on content preferences of the user (e.g., one or more web browsing histories). The content is aggregated, converted using text-to-speech technology, and adapted to fit in a desired length selected for the personalized audible representation. The length of the audible representation may be custom for the user, and may be determined based on the amount of time the user is typically traveling.