A vehicle system for classifying spoken utterances within a vehicle cabin as either system-directed or non-system-directed. The system may include at least one microphone configured to detect at least one acoustic utterance from at least one occupant of a vehicle, at least one sensor to detect user behavior data indicative of user behavior, and a processor programmed to: receive the acoustic utterance; classify the acoustic utterance as either a system-directed utterance or a non-system-directed utterance; determine whether the acoustic utterance was properly classified based on user behavior observed, via data received from the sensor, after the classification; and apply a mitigating adjustment to the classification of subsequent acoustic utterances based on an improper classification.
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
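The feedback loop described in the first abstract can be illustrated with a minimal sketch. It assumes a single directedness score per utterance compared against a threshold, and two hypothetical behavior signals (the occupant repeating a request that was ignored, or cancelling an action that was triggered); the class name, signals and step size are illustrative, not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class DirectednessClassifier:
    """Hypothetical threshold classifier for system-directed speech."""
    threshold: float = 0.5  # scores at or above this count as system-directed
    step: float = 0.05      # size of each mitigating adjustment

    def classify(self, score: float) -> bool:
        # True = system-directed, False = non-system-directed
        return score >= self.threshold

    def feedback(self, was_system_directed: bool, repeated_request: bool,
                 cancelled_action: bool) -> None:
        """Adjust the threshold when observed behavior suggests a misclassification."""
        if not was_system_directed and repeated_request:
            # Occupant repeated the request after being ignored: likely a false
            # rejection, so lower the threshold for subsequent utterances.
            self.threshold = max(0.0, self.threshold - self.step)
        elif was_system_directed and cancelled_action:
            # Occupant cancelled the triggered action: likely a false accept,
            # so raise the threshold.
            self.threshold = min(1.0, self.threshold + self.step)
```

For example, an utterance scoring 0.48 is first rejected; after the occupant repeats it, the threshold drops to 0.45 and the same score is accepted.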
2.
System and method for temporal and power based zone detection in speaker dependent microphone environments
A method, computer program product, and computer system for receiving, by a computing device, a speech signal from a speaker via a plurality of microphone zones. A temporal cue based confidence may be determined for at least a portion of the plurality of microphone zones. A power cue based confidence may be determined for at least a portion of the plurality of microphone zones. A microphone zone of the plurality of microphone zones from which to use an output signal of the speaker may be identified based upon, at least in part, a combination of the temporal cue based confidence and the power cue based confidence.
H04R 1/40 - Arrangements for obtaining desired frequency or directional characteristics, for obtaining desired directional characteristic only, by combining a number of identical transducers
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being power information
G10L 25/78 - Detection of presence or absence of voice signals
G10L 25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
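One way to combine the two cues in the zone-detection abstract is sketched below. The abstract does not say how each confidence is computed, so this assumes the temporal cue favors the zone whose microphone picks up the speech onset earliest, the power cue is each zone's share of total power (powers assumed positive), and the two are mixed with a hypothetical weight `alpha`.

```python
def select_zone(arrival_delays_ms, powers, alpha=0.5):
    """Pick the microphone zone with the highest combined confidence.

    arrival_delays_ms: per-zone onset delay relative to a common reference (smaller = earlier)
    powers: per-zone signal power (linear scale, assumed positive)
    alpha: weight of the temporal cue; (1 - alpha) weights the power cue
    """
    # Temporal cue: earlier arrival -> higher score, normalized to sum to 1.
    latest = max(arrival_delays_ms)
    t_scores = [latest - d + 1e-9 for d in arrival_delays_ms]
    t_total = sum(t_scores)
    temporal_conf = [s / t_total for s in t_scores]

    # Power cue: share of the total power captured by each zone.
    p_total = sum(powers)
    power_conf = [p / p_total for p in powers]

    # Convex combination of the two confidences, then take the best zone.
    combined = [alpha * t + (1 - alpha) * p
                for t, p in zip(temporal_conf, power_conf)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined
```

With delays `[0.0, 2.0, 3.0]` ms and powers `[1.0, 2.0, 1.0]`, the equal weighting selects zone 0 (earliest arrival), while `alpha=0.0` falls back to the loudest zone, 1.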
3.
System and method for combined non-linear and late echo suppression
A method, computer program product, and computer system for receiving, by a computing device, an input signal. A first power spectral density estimate may be generated for a linear reverberant component associated with the input signal. A second power spectral density estimate may be generated for a non-linear reverberant component associated with the input signal. A power spectral density estimate may be generated by combining the first power spectral density estimate for the linear reverberant component and the second power spectral density estimate for the non-linear reverberant component. One or more parameters for at least one of the linear reverberant component and the non-linear reverberant component may be updated. One or more undesired signal components in an output signal resulting from the input signal may be reduced via residual echo suppression based upon, at least in part, updating the one or more parameters.
A method for residual echo suppression is provided. Embodiments may include receiving an original reference signal and applying a distortion function to the original reference signal to generate a second signal. Embodiments may include generating a non-linear signal from the distortion function that does not include linear components of the original reference signal. Embodiments may also include calculating a residual echo power of a linear component and a non-linear component, wherein the linear component is based upon the original reference signal and the non-linear component is based upon the non-linear signal. Embodiments may further include applying a room model to each of the original reference signal and the non-linear signal and estimating a power associated with the original reference signal and the non-linear signal. Embodiments may include calculating a combined echo power estimate as a weighted sum of a weighted original reference signal power and a weighted non-linear signal power.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
G10L 21/0232 - Processing in the frequency domain
G10L 21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
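The residual echo suppression steps (distort the reference, keep only its non-linear part, apply a room model, then form a weighted sum of powers) can be sketched as follows. The hard clipper standing in for the distortion function, the least-squares removal of the linear part, the recursive power smoothing used as a "room model" and the weights are all assumptions for illustration.

```python
def nonlinear_component(x, clip=0.5):
    """Distort the reference and strip the part that is linear in x.

    The hard clipper is a hypothetical distortion function; the linear part is
    removed by least-squares projection of the distorted signal onto x.
    """
    d = [max(-clip, min(clip, s)) for s in x]
    energy = sum(s * s for s in x)
    gain = sum(a * b for a, b in zip(d, x)) / energy if energy else 0.0
    return [a - gain * b for a, b in zip(d, x)]


def smoothed_power(signal, decay=0.9):
    """Crude stand-in for a room model: recursive average of instantaneous power."""
    p, out = 0.0, []
    for s in signal:
        p = decay * p + (1 - decay) * s * s
        out.append(p)
    return out


def combined_echo_power(x, w_lin=0.8, w_nl=0.2):
    """Weighted sum of the linear (reference) and non-linear echo power estimates."""
    p_lin = smoothed_power(x)
    p_nl = smoothed_power(nonlinear_component(x))
    return [w_lin * a + w_nl * b for a, b in zip(p_lin, p_nl)]
```

Because the projection removes everything correlated with the reference, the non-linear signal is orthogonal to it, and a reference that never reaches the clipping level yields a zero non-linear component.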
A system and method for speech enhancement of a portable electronic device. Embodiments may include receiving an audio signal at a portable electronic device having a first microphone and a second microphone. Embodiments may also include receiving an input from a proximity detector associated with the portable electronic device and controlling a processing component associated with at least one of the first microphone and the second microphone based upon, at least in part, the input from the proximity detector.
The invention relates to a system and method for integrating domain information into state transitions of a Finite State Transducer (“FST”) for natural language processing. A system may integrate semantic parsing and information retrieval from an information domain to generate an FST parser that represents the information domain. The FST parser may include a plurality of FST paths, at least one of which may be used to generate a meaning representation from a natural language input. As such, the system may perform domain-based semantic parsing of a natural language input, generating more robust meaning representations using domain information. The system may be applied to a wide range of natural language applications that use natural language input from a user such as, for example, natural language interfaces to computing systems, communication with robots in natural language, personalized digital assistants, question-answer query systems, and/or other natural language processing applications.
A system and method of tagging utterances with Named Entity Recognition (“NER”) labels using unmanaged crowds is provided. The system may generate various annotation jobs in which a user, among a crowd, is asked to tag which parts of an utterance, if any, relate to various entities associated with a domain. For a given domain that is associated with a number of entities that exceeds a threshold N value, multiple batches of jobs (each batch having jobs that have a limited number of entities for tagging) may be used to tag a given utterance from that domain. This reduces the cognitive load imposed on a user, and prevents the user from having to tag more than N entities. As such, a domain with a large number of entities may be tagged efficiently by crowd participants without overloading each crowd participant with too many entities to tag.
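The batching rule in the tagging abstract reduces to splitting a domain's entity list into chunks of at most N and issuing one job per chunk for the same utterance. The job dictionary layout and the function names below are hypothetical.

```python
def batch_entities(entities, max_per_job):
    """Split a domain's entity list into annotation batches of at most N entities."""
    if max_per_job < 1:
        raise ValueError("max_per_job must be at least 1")
    return [entities[i:i + max_per_job]
            for i in range(0, len(entities), max_per_job)]


def make_jobs(utterance, entities, max_per_job):
    """One tagging job per batch; every job shows the same utterance."""
    return [{"utterance": utterance, "entities": batch}
            for batch in batch_entities(entities, max_per_job)]
```

A domain with five entities and N = 2 therefore yields three jobs, none of which asks a participant to tag more than two entities.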
A method, computer program product, and computer system for addressing acoustic signal reverberation is provided. Embodiments may include receiving, at one or more microphones, a first audio signal and a reverberation audio signal. Embodiments may further include processing at least one of the first audio signal and the reverberation audio signal. Embodiments may also include limiting a model based reverberation equalizer using a temporal constraint for direct sound distortions, the model based reverberation equalizer configured to generate one or more outputs, based upon, at least in part, at least one of the first audio signal and the reverberation audio signal.
G10K 15/08 - Arrangements for producing a reverberation or echo sound
H04B 3/20 - Line transmission systems, details: reducing echo effects or singing; opening or closing the transmitting path; conditioning for transmission in one direction or the other
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
A system and method are provided for disambiguating natural language processing requests based on smart matching, request confirmations that are used until ambiguities are resolved, and machine learning. Smart matching may match entities (e.g., contact names, place names, etc.) based on user information such as call logs, user preferences, etc. If multiple matches are found and disambiguation has not yet been learned by the system, the system may request that the user identify the intended entity. On the other hand, if disambiguation has been learned by the system, the system may execute the request without confirmations. The system may use a record of confirmations and/or other information to continuously learn a user's inputs in order to reduce ambiguities and no longer prompt for confirmations.
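The confirm-until-learned behavior can be sketched with a per-name count of confirmed choices. This assumes smart matching has already produced the candidate list, and it uses a hypothetical rule that a choice counts as "learned" once it has been confirmed `learn_after` times; the source does not specify the learning method.

```python
from collections import Counter, defaultdict


class Disambiguator:
    """Hypothetical learner: stop confirming once a choice has been confirmed enough times."""

    def __init__(self, learn_after=2):
        self.learn_after = learn_after
        self.confirmations = defaultdict(Counter)  # spoken name -> counts of chosen entities

    def resolve(self, spoken, candidates):
        """Return (entity, needs_confirmation)."""
        if len(candidates) == 1:
            return candidates[0], False
        counts = self.confirmations[spoken]
        if counts:
            best, n = counts.most_common(1)[0]
            if best in candidates and n >= self.learn_after:
                return best, False  # learned: execute without asking
        return None, True           # still ambiguous: ask the user

    def record_confirmation(self, spoken, chosen):
        self.confirmations[spoken][chosen] += 1
```

After the user has twice picked "John Doe" for "john", later requests resolve to that contact without a confirmation prompt.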
In certain implementations, follow-up responses may be provided for prior natural language inputs of a user. As an example, a natural language input associated with a user may be received at a computer system. A determination of whether information sufficient for providing an adequate response to the natural language input is currently accessible to the computer system may be effectuated. A first response to the natural language input (that indicates that a follow-up response will be provided) may be provided based on a determination that information sufficient for providing an adequate response to the natural language input is not currently accessible. Information sufficient for providing an adequate response to the natural language input may be received. A second response to the natural language input may then be provided based on the received sufficient information.
Methods and apparatus for estimating the power spectral density (PSD) of a residual interference having first and second components after adaptive interference cancellation (AIC). The first component can be estimated using a real-valued FIR filter operating on a time series of PSD estimates of a reference signal, and the second component can be estimated using an exponential decay over time corresponding to a reverberation time using the PSD of the reference signal.
G10L 21/0232 - Processing in the frequency domain
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being power information
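The two-part residual interference model can be sketched per frequency bin. The FIR part follows the abstract directly. For the second part, the sketch assumes a recursive tail that decays by 60 dB over the reverberation time (a per-frame power factor of 10^(-6 * hop / T60)) and is fed by reference PSD frames older than the FIR span through a hypothetical `tail_gain`.

```python
def residual_psd(ref_psd, fir, t60_s, hop_s, tail_gain):
    """Per-frame residual-interference PSD for one frequency bin.

    ref_psd:   time series of reference-signal PSD values for this bin
    fir:       real-valued FIR coefficients applied to the PSD series (component 1)
    t60_s:     reverberation time in seconds
    hop_s:     frame shift in seconds
    tail_gain: hypothetical gain feeding the reference PSD into the decaying tail
    """
    decay = 10.0 ** (-6.0 * hop_s / t60_s)  # power drops 60 dB over t60_s
    taps = len(fir)
    late = 0.0
    out = []
    for k in range(len(ref_psd)):
        # Component 1: FIR filter over the most recent reference PSD frames.
        early = sum(fir[m] * ref_psd[k - m] for m in range(taps) if k - m >= 0)
        # Component 2: exponentially decaying tail fed by frames older than the FIR span.
        fed = ref_psd[k - taps] if k - taps >= 0 else 0.0
        late = decay * late + tail_gain * fed
        out.append(early + late)
    return out
```

For an impulse in the reference PSD, the output first follows the FIR taps and then decays geometrically at the rate set by T60.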
A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
A multi-mode speech communication system is described that has different operating modes for different speech applications. A signal processing module is in communication with the speech applications and includes an input processing module and an output processing module. The input processing module processes microphone input signals to produce a set of user input signals for each speech application that are limited to currently active system users for that speech application. The output processing module processes application output communications from the speech applications to produce loudspeaker output signals to the system users, wherein for each different speech application, the loudspeaker output signals are directed only to system users currently active in that speech application.
An automotive text display arrangement is described which includes a driver text display positioned directly in front of an automobile driver and displaying a limited amount of text to the driver without impairing forward visual attention of the driver. The arrangement may include a boundary insertion mode wherein when the active text position is an active text boundary, new text is inserted between the text items separated by the active text boundary, and when the active text position is an active text item, new text replaces the active text item. In addition or alternatively, there may be a multifunctional text control knob offering multiple different user movements, each performing an associated text processing function.
G06F 3/0481 - Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
15.
System and method of recording utterances using unmanaged crowds for natural language processing
A system and method of recording utterances for building Named Entity Recognition (“NER”) models, which are used to build dialog systems in which a computer listens and responds to human voice dialog. Utterances to be uttered may be provided to users through their mobile devices, which may record the user uttering (e.g., verbalizing, speaking, etc.) the utterances and upload the recording to a computer for processing. The use of the user's mobile device, which is programmed with an utterance collection application (e.g., configured as a mobile app), facilitates the use of crowd-sourcing human intelligence tasking for widespread collection of utterances from a population of users. As such, obtaining large datasets for building NER models may be facilitated by the system and method disclosed herein.
Systems and methods for gathering text commands in response to a command context using a first crowdsourced job are discussed herein. A command context for a natural language processing system may be identified, where the command context is associated with a command context condition to provide commands to the natural language processing system. One or more command creators associated with one or more command creation devices may be selected. A first application on the one or more command creation devices may be configured to display command creation instructions for each of the one or more command creators to provide text commands that satisfy the command context, and to display a field for capturing a user-generated text entry to satisfy the command creation condition in accordance with the command creation instructions. Systems and methods for reviewing the text commands using second and crowdsourced jobs are also presented herein.
G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
The invention relates to a system and method of automatically distinguishing between computers and humans based on responses to enhanced Completely Automated Public Turing test to tell Computers and Humans Apart (“e-captcha”) challenges that do not merely challenge the user to recognize skewed or stylized text. A given e-captcha challenge may be specific to a particular knowledge domain. Accordingly, e-captchas may be used not only to distinguish between computers and humans, but also determine whether a respondent has demonstrated knowledge in the particular knowledge domain. For instance, participants in crowd-sourced tasks, in which unmanaged crowds are asked to perform tasks, may be screened using an e-captcha challenge. This not only validates that a participant is a human (and not a bot, for example, attempting to game the crowd-source task), but also screens the participant based on whether they can successfully respond to the e-captcha challenge.
G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
G06F 21/36 - User authentication by graphic or iconic representation
A system and method of tagging utterances with Named Entity Recognition (“NER”) labels using unmanaged crowds is provided. The system may generate various annotation jobs in which a user, among a crowd, is asked to tag which parts of an utterance, if any, relate to various entities associated with a domain. For a given domain that is associated with a number of entities that exceeds a threshold N value, multiple batches of jobs (each batch having jobs that have a limited number of entities for tagging) may be used to tag a given utterance from that domain. This reduces the cognitive load imposed on a user, and prevents the user from having to tag more than N entities. As such, a domain with a large number of entities may be tagged efficiently by crowd participants without overloading each crowd participant with too many entities to tag.
A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, coarticulation parameter, and speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to get a same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
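Finding the path through the video frame lattice that minimizes the summed target and joint costs is a standard Viterbi-style dynamic program, sketched below. Computing the costs themselves (phonetic distance, coarticulation, speech rate, frame similarity) is outside the sketch: target costs are given as a table and the joint cost as a caller-supplied function, both hypothetical interfaces.

```python
def best_path(target_costs, joint_cost):
    """Min-cost path through a lattice of candidate video frames (Viterbi-style DP).

    target_costs[t][i]: target cost of candidate i at time step t
    joint_cost(t, i, j): cost of moving from candidate i at t-1 to candidate j at t
    Returns (list of chosen candidate indices, total cost).
    """
    cost = list(target_costs[0])
    back = []
    for t in range(1, len(target_costs)):
        new_cost, pointers = [], []
        for j, tc in enumerate(target_costs[t]):
            # Cheapest predecessor for candidate j at step t.
            i_best = min(range(len(cost)), key=lambda i: cost[i] + joint_cost(t, i, j))
            new_cost.append(cost[i_best] + joint_cost(t, i_best, j) + tc)
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1], total
```

A joint cost that penalizes switching candidates pulls the search toward visually continuous runs of frames even when another candidate has a slightly lower target cost at a single step.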
Methods and apparatus for broadening the beamwidth of beamforming and postfiltering using a plurality of beamformers and signal and power spectral density mixing, and controlling a postfilter based on spatial activity detection such that de-reverberation or noise reduction is performed when a speech source is between the first and second beams.
G10L 21/0232 - Processing in the frequency domain
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being power information
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
21.
System and method of providing and validating enhanced CAPTCHAs
The invention relates to a system and method of automatically distinguishing between computers and humans based on responses to enhanced Completely Automated Public Turing test to tell Computers and Humans Apart (“e-captcha”) challenges that do not merely challenge the user to recognize skewed or stylized text. A given e-captcha challenge may be specific to a particular knowledge domain. Accordingly, e-captchas may be used not only to distinguish between computers and humans, but also determine whether a respondent has demonstrated knowledge in the particular knowledge domain. For instance, participants in crowd-sourced tasks, in which unmanaged crowds are asked to perform tasks, may be screened using an e-captcha challenge. This not only validates that a participant is a human (and not a bot, for example, attempting to game the crowd-source task), but also screens the participant based on whether they can successfully respond to the e-captcha challenge.
G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
A system and method configured for use in a text-to-speech (TTS) system is provided. Embodiments may include identifying, using one or more processors, a word or phrase as a named entity and identifying a language of origin associated with the named entity. Embodiments may further include transliterating the named entity to a script associated with the language of origin. If the TTS system is operating in the language of origin, embodiments may include passing the transliterated script to the TTS system. If the TTS system is not operating in the language of origin, embodiments may include generating a phoneme sequence in the language of origin using a grapheme to phoneme (G2P) converter.
G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
A system and method for out-of-vocabulary compound word handling is provided. Embodiments may include storing a plurality of compound word rules and compound word dictionaries in a database. Embodiments may also include evaluating membership criteria associated with a received compound word, wherein the membership criteria include at least one of dictionary-based or part-of-speech (POS) based criteria. Embodiments may further include applying one or more filtering rules to the received compound word.
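A dictionary-based membership check for compounds can be sketched as a memoized search for a split into known words, preferring the fewest parts. Only the dictionary criterion is modeled; POS-based criteria are not, and the minimum part length `min_len` is a hypothetical filtering rule against spurious short splits.

```python
from functools import lru_cache


def decompose(word, dictionary, min_len=3):
    """Split an out-of-vocabulary compound into dictionary words (fewest parts wins).

    Returns a list of parts, or None if no full decomposition exists.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        # Best decomposition of word[start:], as a tuple, or None if impossible.
        if start == len(word):
            return ()
        best = None
        for end in range(start + min_len, len(word) + 1):
            part = word[start:end]
            if part in dictionary:
                rest = split_from(end)
                if rest is not None and (best is None or 1 + len(rest) < len(best)):
                    best = (part,) + rest
        return best

    parts = split_from(0)
    return list(parts) if parts else None
```

For instance, with a small German word list, "Haustürschlüssel" decomposes into "haus", "tür" and "schlüssel", while a string containing an unknown fragment returns `None`.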
A system and method for concatenative speech synthesis is provided. Embodiments may include accessing, using one or more computing devices, a plurality of speech synthesis units from a speech database and determining a similarity between the plurality of speech synthesis units. Embodiments may further include retrieving two or more speech synthesis units having the similarity and pruning at least one of the two or more speech synthesis units based upon, at least in part, the similarity.
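Similarity-based pruning of synthesis units can be sketched as a greedy pass that drops any unit too similar to an already-kept unit with the same label. The abstract does not name a similarity measure; cosine similarity over per-unit feature vectors and the 0.98 threshold are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_units(units, threshold=0.98):
    """Keep a unit only if it is not too similar to an already-kept unit of the same label.

    units: list of (unit_id, label, feature_vector) tuples
    Returns the ids of the units that survive pruning, in input order.
    """
    kept = []
    for uid, label, feats in units:
        redundant = any(label == k_label and cosine(feats, k_feats) >= threshold
                        for _, k_label, k_feats in kept)
        if not redundant:
            kept.append((uid, label, feats))
    return [uid for uid, _, _ in kept]
```

Near-duplicate units of the same phone are removed, while units of other phones are never compared against each other.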
A system and method of recording utterances for building Named Entity Recognition (“NER”) models, which are used to build dialog systems in which a computer listens and responds to human voice dialog. Utterances to be uttered may be provided to users through their mobile devices, which may record the user uttering (e.g., verbalizing, speaking, etc.) the utterances and upload the recording to a computer for processing. The use of the user's mobile device, which is programmed with an utterance collection application (e.g., configured as a mobile app), facilitates the use of crowd-sourcing human intelligence tasking for widespread collection of utterances from a population of users. As such, obtaining large datasets for building NER models may be facilitated by the system and method disclosed herein.
G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
26.
Methods and apparatus for robust speaker activity detection
Method and apparatus to determine a speaker activity detection measure from energy-based characteristics of signals from a plurality of speaker-dedicated microphones, detect acoustic events using power spectra for the microphone signals, and determine a robust speaker activity detection measure from the speaker activity measure and the detected acoustic events.
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being power information
G10L 25/78 - Detection of presence or absence of voice signals
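A simplified reading of the speaker activity abstract is sketched below for a single frame: each speaker-dedicated microphone is marked active when its level exceeds every other microphone by a margin, and the decision is overridden when an acoustic event has been flagged. The event detector itself (from the power spectra) is not implemented; it is passed in as a boolean, and the 6 dB margin is a hypothetical choice.

```python
import math


def speaker_activity(frame_energies, event_detected, margin_db=6.0):
    """Hypothetical robust speaker activity decision for one frame.

    frame_energies: energy of the current frame at each speaker-dedicated microphone
    event_detected: True if an acoustic event (e.g. a door slam) was flagged
    Returns one boolean per speaker/microphone.
    """
    eps = 1e-12
    levels = [10.0 * math.log10(e + eps) for e in frame_energies]
    active = []
    for i, level in enumerate(levels):
        others = [lv for j, lv in enumerate(levels) if j != i]
        # Energy-based measure: how far this mic exceeds the loudest other mic (dB).
        measure = level - max(others) if others else float("inf")
        active.append(measure >= margin_db)
    if event_detected:
        # Broadband events reach all mics at once; do not attribute them to a speaker.
        return [False] * len(active)
    return active
```

A clear 10 dB lead at one microphone marks that speaker active, whereas a 3 dB difference is treated as inconclusive.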
Systems and methods of validating transcriptions of natural language content using crowdsourced validation jobs are provided herein. In various implementations, a transcription pair comprising natural language content and text corresponding to a transcription of the natural language content may be gathered. A first group of validation devices may be selected for reviewing the transcription pair. A first crowdsourced validation job may be created for the first group of validation devices. The first crowdsourced validation job may be provided to the first group of validation devices. A vote representing whether or not the text accurately represents the natural language content may be received from each of the first group of validation devices. A validation score may be assigned to the transcription pair based, at least in part, on the votes from each of the first group of validation devices.
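The abstract assigns a validation score from the group's votes but does not give the formula. The sketch below assumes the score is simply the share of "accurate" votes, and adds a hypothetical escalation rule that sends inconclusive pairs to another validator group.

```python
def validation_score(votes):
    """Share of validators who said the text matches the audio (votes: list of bools)."""
    if not votes:
        raise ValueError("no votes received")
    return sum(votes) / len(votes)


def needs_second_round(votes, accept=0.8, reject=0.2):
    """Hypothetical policy: escalate when the score is between the reject and accept bounds."""
    score = validation_score(votes)
    return reject < score < accept
```

Three "accurate" votes out of four give a score of 0.75, which under these bounds would be escalated; a unanimous group would not.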
The invention relates to a system and method for integrating domain information into state transitions of a Finite State Transducer (“FST”) for natural language processing. A system may integrate semantic parsing and information retrieval from an information domain to generate an FST parser that represents the information domain. The FST parser may include a plurality of FST paths, at least one of which may be used to generate a meaning representation from a natural language input. As such, the system may perform domain-based semantic parsing of a natural language input, generating more robust meaning representations using domain information. The system may be applied to a wide range of natural language applications that use natural language input from a user such as, for example, natural language interfaces to computing systems, communication with robots in natural language, personalized digital assistants, question-answer query systems, and/or other natural language processing applications.
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
G10L 13/027 - Concept to speech synthesisers; generation of natural phrases from machine-based concepts
G10L 13/047 - Architecture of speech synthesisers
G10L 13/06 - Elementary speech units used in speech synthesisers; concatenation rules
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
H04B 7/04 - Diversity systems; multi-antenna systems, i.e. transmission or reception using multiple antennas, using two or more spaced independent antennas
H04B 7/06 - Diversity systems; multi-antenna systems, i.e. transmission or reception using multiple antennas, using two or more spaced independent antennas at the transmitting station
G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
30.
System and method for providing words or phrases to be uttered by members of a crowd and processing the utterances in crowd-sourced campaigns to facilitate speech analysis
Systems and methods of providing text related to utterances, and gathering voice data in response to the text, are provided herein. In various implementations, an identification token that identifies a first file for a voice data collection campaign, and a second file for a session script, may be received from a natural language processing training device. The first file and the second file may be used to configure the mobile application to display a sequence of screens, each of the sequence of screens containing text of at least one utterance specified in the voice data collection campaign. Voice data may be received from the natural language processing training device in response to user interaction with the text of the at least one utterance. The voice data and the text may be stored in a transcription library.
G10L 15/26 - Speech to text systems
G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
31.
Probability-based approach to recognition of user-entered data
A method for entering keys on a small keypad is provided. The method comprises the steps of: providing at least part of a keyboard having a plurality of keys; and predetermining a first probability of a user striking a key among the plurality of keys. The method further uses a dictionary of selected words associated with the keypad and/or a user.
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
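Combining per-key strike probabilities with a word dictionary can be sketched as ranking same-length dictionary words by the product of the probabilities of the keys actually struck. The neighbor table is a hypothetical fragment of a QWERTY layout, and the assumption that a miss lands evenly on one of the intended key's neighbors (with `p_hit` for a correct strike) is illustrative only.

```python
# Hypothetical fragment of a QWERTY neighbor table: intended key -> adjacent keys.
NEIGHBORS = {
    "q": "wa", "w": "qeas", "e": "wrsd", "r": "etdf", "t": "ryfg",
    "a": "qwsz", "s": "awedz", "d": "serfx",
}


def strike_prob(intended, struck, p_hit=0.8):
    """P(user strikes `struck` | user aims at `intended`); misses spread evenly over neighbors."""
    if struck == intended:
        return p_hit
    neighbors = NEIGHBORS.get(intended, "")
    if struck in neighbors:
        return (1 - p_hit) / len(neighbors)
    return 0.0


def rank_words(struck_keys, dictionary):
    """Rank same-length dictionary words by the product of per-key strike probabilities."""
    scores = {}
    for word in dictionary:
        if len(word) != len(struck_keys):
            continue
        p = 1.0
        for struck, intended in zip(struck_keys, word):
            p *= strike_prob(intended, struck)
        if p > 0:
            scores[word] = p
    return sorted(scores, key=scores.get, reverse=True)
```

If the user types "ser", the word "set" (one slip from "t" to the adjacent "r") ranks above "der", and words needing a non-adjacent slip are excluded.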
32.
System and method for providing follow-up responses to prior natural language inputs of a user
In certain implementations, follow-up responses may be provided for prior natural language inputs of a user. As an example, a natural language input associated with a user may be received at a computer system. A determination of whether information sufficient for providing an adequate response to the natural language input is currently accessible to the computer system may be effectuated. A first response to the natural language input (that indicates that a follow-up response will be provided) may be provided based on a determination that information sufficient for providing an adequate response to the natural language input is not currently accessible. Information sufficient for providing an adequate response to the natural language input may be received. A second response to the natural language input may then be provided based on the received sufficient information.
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
G10L 13/00 - Speech synthesis; Text-to-speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 13/10 - Prosody rules derived from text; Stress or intonation
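The caching idea in the TTS-latency abstract above can be sketched as a dictionary keyed by a unique identifier for each phrase, so identical text is synthesized only once. Using a SHA-256 hash of whitespace-normalized text as the identifier, and the `fake_synthesize` stand-in for the TTS server round trip, are assumptions for illustration; splitting text into intonational phrases is assumed to happen upstream.

```python
import hashlib


class CachedTTS:
    """Caches synthesized audio, indexed by a unique identifier derived from the text."""

    def __init__(self, synthesize):
        # `synthesize` stands in for a request to a TTS server (hypothetical).
        self._synthesize = synthesize
        self._cache = {}
        self.server_calls = 0

    @staticmethod
    def phrase_id(text):
        # Identifier: SHA-256 of the phrase with runs of whitespace collapsed.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def audio_for(self, phrase):
        key = self.phrase_id(phrase)
        if key not in self._cache:
            # Cache miss: only now do we pay for synthesis.
            self.server_calls += 1
            self._cache[key] = self._synthesize(phrase)
        return self._cache[key]


def fake_synthesize(text):
    # Placeholder "audio" bytes; a real server would return e.g. encoded audio data.
    return ("AUDIO:" + text).encode("utf-8")
```

Requesting "Turn left.", "Then continue." and "Turn left." again triggers only two synthesis calls; the repeated phrase is served from the cache.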
34.
In-view and out-of-view request-related result regions for respective result categories
The invention relates to systems and methods of providing in-view and out-of-view request-related result regions for respective result categories. The system may facilitate result presentation by providing, in response to a user request, at least one region that is designated to initially be in-view and at least one region that is designated to initially be out-of-view, where: (i) the initial in-view region comprises one or more results related to the user request and a first category; and (ii) the initial out-of-view region comprises one or more results related to the user request and a second category. A result related to a category may comprise a result related to a specific topic, a result of a specific type, a result from a specific source, or other result. A user request may comprise a query, a command, or other user request.
The present disclosure is directed towards a system and method for reducing tandeming effects in a communications system. The method may include receiving, at a speech decoder, an input bitstream associated with an incoming initial speech signal from a speech encoder. The method may further include determining whether or not coding is required and if coding is required, modifying an excitation signal associated with the bitstream. The method may also include providing the modified excitation signal to an adaptive encoder.
G10L 19/12 - Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters, the excitation function being a code excitation, e.g. in code-excited linear prediction [CELP] vocoders
A system and method for addressing acoustic signal reverberation is provided. Embodiments may include receiving, at one or more microphones, a first audio signal and a reverberation audio signal. Embodiments may further include processing at least one of the first audio signal and the reverberation audio signal. Embodiments may also include limiting a model-based reverberation equalizer using a temporal constraint for direct sound distortions, the model-based reverberation equalizer configured to generate one or more outputs based upon, at least in part, at least one of the first audio signal and the reverberation audio signal.
G10K 15/12 - Arrangements for producing a reverberation or echo sound using electronic time-delay networks
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
Embodiments disclosed herein may include determining a signal parameter of a first microphone and a second microphone associated with a computing device. Embodiments may include generating a reference parameter based upon at least one of the parameter of the first microphone and the parameter of the second microphone. Embodiments may include adjusting a tolerance of at least one of the first microphone and the second microphone, based upon the reference parameter. Embodiments may include receiving, at the first microphone, a first speech signal, the first speech signal having a first speech signal magnitude and receiving, at the second microphone, a second speech signal, the second speech signal having a second speech signal magnitude. Embodiments may include comparing at least one of the first speech signal magnitude and the second speech signal magnitude with a third speech signal magnitude and detecting an obstructed microphone based upon the comparison.
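The magnitude comparison at the end of the abstract above can be sketched as follows: compute each microphone's speech level and flag any microphone whose level falls well below a reference level. Using RMS level in dB, the 6 dB threshold, and the names `rms_db` and `obstructed_mics` are assumptions for illustration; tolerance calibration between microphones is not modelled.

```python
import math


def rms_db(samples):
    """Root-mean-square level of a block of samples, in dB relative to full scale 1.0."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log10(0)


def obstructed_mics(blocks, reference_db, threshold_db=6.0):
    """Return indices of microphones whose speech level is more than
    `threshold_db` below the reference level (hypothetical decision rule)."""
    return [i for i, block in enumerate(blocks)
            if reference_db - rms_db(block) > threshold_db]
```

Here the reference could be a third speech-signal magnitude, such as a stored level or the level of an unobstructed microphone; which one the embodiments use is not specified in the abstract.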
A system and method for acoustic echo cancellation is provided. Embodiments may include receiving, at one or more microphones, an audio reference signal from an audio speaker. Embodiments may also include filtering the audio reference signal using one or more adaptive audio filters. Embodiments may further include analyzing a level of signal energy of the audio reference signal with regard to time, frequency and audio channel to identify at least one maximum error contribution point. Embodiments may also include updating the one or more adaptive audio filters based upon, at least in part, the analyzed audio reference signal.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
H04M 3/00 - Automatic or semi-automatic exchanges
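The echo-cancellation abstract above does not spell out its filter update rule. For orientation, the sketch below uses the standard normalized least-mean-squares (NLMS) adaptive filter, a common baseline for acoustic echo cancellation, to learn an echo path from the loudspeaker reference signal. It does not model the abstract's time/frequency/channel analysis of maximum error contribution points; the step size, filter length, and the synthetic echo path in `demo` are illustrative assumptions.

```python
import random


def nlms_echo_canceller(reference, mic, taps=8, mu=0.5, eps=1e-6):
    """Adapt an FIR estimate of the echo path with NLMS.

    reference: loudspeaker signal x[n]; mic: microphone signal d[n] containing the echo.
    Returns (residual e[n] = d[n] - y[n], final filter weights).
    """
    w = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    errors = []
    for x_n, d_n in zip(reference, mic):
        history = [x_n] + history[:-1]
        y_n = sum(wi * xi for wi, xi in zip(w, history))  # echo estimate
        e_n = d_n - y_n                                    # residual after cancellation
        norm = eps + sum(xi * xi for xi in history)        # input power normalization
        w = [wi + mu * e_n * xi / norm for wi, xi in zip(w, history)]
        errors.append(e_n)
    return errors, w


def demo(seed=0, n=4000):
    """Cancel a synthetic echo produced by a hypothetical 3-tap room response."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    h = [0.6, -0.3, 0.1]
    d = [sum(h[k] * x[i - k] for k in range(len(h)) if i >= k) for i in range(n)]
    return nlms_echo_canceller(x, d)
```

In this noise-free demo the learned weights converge to the synthetic echo path and the residual decays toward zero.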
An arrangement is described for speech signal processing. An input microphone signal is received that includes a speech signal component and a noise component. The microphone signal is transformed into a frequency domain set of short-term spectra signals. Then speech formant components within the spectra signals are estimated based on detecting regions of high energy density in the spectra signals. One or more dynamically adjusted gain factors are applied to the spectra signals to enhance the speech formant components.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 19/06 - Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
G10L 21/0232 - Processing in the frequency domain
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals using source-filter models or psychoacoustic analysis
40.
System for automatic speech recognition and audio entertainment
In one aspect, the present application is directed to a device for providing different levels of sound quality in an audio entertainment system. The device includes a speech enhancement system with a reference signal modification unit and a plurality of acoustic echo cancellation filters. Each acoustic echo cancellation filter is coupled to a playback channel. The device includes an audio playback system with loudspeakers. Each loudspeaker is coupled to a playback channel. At least one of the speech enhancement system and the audio playback system operates according to a full sound quality mode and a reduced sound quality mode. In the full sound quality mode, all of the playback channels contain non-zero output signals. In the reduced sound quality mode, a first subset of the playback channels contains non-zero output signals and a second subset of the playback channels contains zero output signals.
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or for stress-induced speech
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
41.
Methods and apparatus for dynamic low frequency noise suppression
Methods and apparatus for dynamically suppressing low frequency non-speech audio events, such as road bumps, without suppressing speech formants. In exemplary embodiments of the invention, maximum powers in first and second windows are computed and used to determine whether dampening should be applied, and if so, to what extent.
G10L 21/0232 - Processing in the frequency domain
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
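The abstract above only says that maximum powers in two windows decide whether, and how much, to dampen. The sketch below shows one possible reading, treating the windows as frequency bands of a single frame's power spectrum: a low band where road bumps concentrate and a higher band that carries speech formants. The band edges, the 10 dB margin, the 20 dB attenuation cap, and the function names are all hypothetical.

```python
import math


def band_max(power, bin_hz, band):
    """Maximum power among spectrum bins whose centre frequency lies in [lo, hi)."""
    lo, hi = band
    values = [p for i, p in enumerate(power) if lo <= i * bin_hz < hi]
    return max(values) if values else 0.0


def lf_dampening_db(power, bin_hz, low_band=(0.0, 150.0), speech_band=(300.0, 1000.0),
                    margin_db=10.0, max_atten_db=20.0):
    """Attenuation (<= 0 dB) to apply to the low band for this frame.

    Dampening is applied only when the low-band maximum exceeds the
    speech-band maximum by more than `margin_db`, and grows with the excess.
    """
    low_max = band_max(power, bin_hz, low_band)
    speech_max = band_max(power, bin_hz, speech_band)
    excess_db = 10.0 * math.log10((low_max + 1e-12) / (speech_max + 1e-12)) - margin_db
    if excess_db <= 0.0:
        return 0.0  # low-frequency energy is plausibly speech: leave it alone
    return -min(excess_db, max_atten_db)
```

Because the decision compares against the speech band, a frame whose low-frequency energy is in proportion to its formant energy is left untouched.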
42.
System and method for speech enhancement on compressed speech
The present disclosure is directed towards a method for improving speech intelligibility. The method may include receiving, at one or more computing devices, a first speech input from a first user and performing voice activity detection upon the first speech input. The method may also include analyzing a spectral tilt associated with the first speech input, wherein analyzing includes computing an impulse response of a linear predictive coding ("LPC") synthesis filter in a linear pulse code modulation ("PCM") domain, and wherein the one or more computing devices include an adaptive high-pass filter configured to recalculate one or more linear prediction coefficients.
G10L 21/0364 - Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
G10L 25/12 - Speech or voice analysis techniques not restricted to a single one of groups, characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
G10L 19/12 - Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters, the excitation function being a code excitation, e.g. in code-excited linear prediction [CELP] vocoders
G10L 25/78 - Detection of presence or absence of voice signals
G10L 25/93 - Discriminating between voiced and unvoiced parts of speech signals
G10L 25/21 - Speech or voice analysis techniques not restricted to a single one of groups, characterised by the type of extracted parameters, the extracted parameters being power information
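To make the spectral-tilt step in the abstract above concrete, the sketch below computes the impulse response of the all-pole LPC synthesis filter 1/A(z) and summarizes tilt as the first normalized autocorrelation coefficient of that response. That tilt measure, the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the truncation length are assumptions; the abstract does not state how tilt is derived from the impulse response.

```python
def lpc_impulse_response(a, length=256):
    """Impulse response h[n] of the synthesis filter 1/A(z),
    with A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p."""
    h = []
    for n in range(length):
        x = 1.0 if n == 0 else 0.0  # unit impulse
        # Difference equation: h[n] = x[n] - sum_k a[k] * h[n-1-k]
        y = x - sum(a[k] * h[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        h.append(y)
    return h


def spectral_tilt(a, length=256):
    """r(1)/r(0) of the impulse response: close to +1 for a strongly
    low-pass (downward-tilted) spectrum, negative for a high-pass one."""
    h = lpc_impulse_response(a, length)
    r0 = sum(v * v for v in h)
    r1 = sum(h[i] * h[i + 1] for i in range(len(h) - 1))
    return r1 / r0
```

For a single coefficient a = [-0.9] the response is 0.9^n and the tilt is about 0.9, which is the kind of strongly tilted spectrum an adaptive high-pass filter would then counteract.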
A speech signal processing system is described for use with automatic speech recognition and hands free speech communication. A signal pre-processor module transforms an input microphone signal into corresponding speech component signals. A noise suppression module applies noise reduction to the speech component signals to generate noise reduced speech component signals. A speech reconstruction module produces corresponding synthesized speech component signals for distorted speech component signals. A signal combination block adaptively combines the noise reduced speech component signals and the synthesized speech component signals based on signal to noise conditions to generate enhanced speech component signals for automatic speech recognition and hands free speech communication.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or for stress-induced speech
G10L 19/04 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals using source-filter models or psychoacoustic analysis, using predictive techniques
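The signal combination block in the abstract above mixes noise-reduced and synthesized components according to signal-to-noise conditions. A minimal sketch, assuming a per-subband logistic weight on SNR (the midpoint, slope, and function name are hypothetical), looks like this:

```python
import math


def combine_subbands(noise_reduced, synthesized, snr_db, midpoint_db=5.0, slope=0.5):
    """Per-subband mix: weight w favours the noise-reduced signal at high SNR
    and the synthesized reconstruction at low SNR (hypothetical logistic mapping)."""
    out = []
    for nr, syn, snr in zip(noise_reduced, synthesized, snr_db):
        w = 1.0 / (1.0 + math.exp(-slope * (snr - midpoint_db)))
        out.append(w * nr + (1.0 - w) * syn)
    return out
```

A smooth weight avoids audible switching between the two paths when the SNR estimate fluctuates around a threshold; the same code works unchanged for complex subband values.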
44.
Reduced keyboard with prediction solutions when input is a partial sliding trajectory
A reduced virtual keyboard system for text input on electronic devices is disclosed. Text input is performed by creating a tracing trajectory. Dynamic prediction solutions are created during the tracing process, thus avoiding the need for a user to complete the entire word trajectory. The system also allows a mixture of tapping actions and sliding motions for the same word. The system may comprise a Long Words Dictionary database having first letters corresponding to predetermined keys of the keyboard. Alternatively, the system uses a Dictionary and a database management tool to find long words.
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on sensed pressure registered by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating speech. One variation of the method is from a server side, and another variation of the method is from a client side. The server side method, as implemented by a network-based automatic speech processing system, includes first receiving, from a network client independent of knowledge of internal operations of the system, a request to generate a text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The system extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client. The system provides access to the interactive demonstration to the network client.
A speech processing method and arrangement are described. A dynamic noise adaptation (DNA) model characterizes a speech input reflecting effects of background noise. A null noise DNA model characterizes the speech input based on reflecting a null noise mismatch condition. A DNA interaction model performs Bayesian model selection and re-weighting of the DNA model and the null noise DNA model to realize a modified DNA model characterizing the speech input for automatic speech recognition and compensating for noise to a varying degree depending on relative probabilities of the DNA model and the null noise DNA model.
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or for stress-induced speech
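The Bayesian model selection and re-weighting in the DNA abstract above can be illustrated with Bayes' rule over two models followed by a posterior-weighted blend. Blending feature vectors directly is a simplification assumed here for illustration; the actual DNA interaction model is not specified at that level in the abstract. Priors must lie strictly between 0 and 1.

```python
import math


def model_posteriors(log_likelihoods, priors):
    """Posterior probability of each model via Bayes' rule, computed stably with log-sum-exp."""
    log_post = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    m = max(log_post)
    z = sum(math.exp(v - m) for v in log_post)
    return [math.exp(v - m) / z for v in log_post]


def reweighted_features(dna_features, null_features, ll_dna, ll_null, prior_dna=0.5):
    """Blend noise-compensated (DNA) and uncompensated (null-noise) features by
    their posteriors, so compensation is applied to a varying degree."""
    p_dna, p_null = model_posteriors([ll_dna, ll_null], [prior_dna, 1.0 - prior_dna])
    return [p_dna * a + p_null * b for a, b in zip(dna_features, null_features)]
```

If the DNA model explains a frame three times better than the null-noise model (equal priors), it receives weight 0.75, so clean frames are compensated less than noisy ones.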
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
G10L 13/027 - Concept-to-speech synthesisers; Generation of natural phrases from machine-based concepts
G10L 13/047 - Architecture of speech synthesisers
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
H04B 7/04 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas, using two or more spaced independent antennas
H04B 7/06 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas, using two or more spaced independent antennas at the transmitting station
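For the synthetic-voice abstract above (combined database, per-category selection policy), the merge-and-select steps can be sketched as below. The `VoiceUnit` record, the category labels, and the fallback `default_voice` are hypothetical, and the final waveform concatenation is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceUnit:
    voice: str      # which TTS voice database the unit came from
    category: str   # phonetic category, e.g. "vowel" or "fricative" (hypothetical labels)
    phone: str
    audio_id: int


def combine_databases(*databases):
    """Merge the unit lists of several TTS voices into one combined database."""
    combined = []
    for db in databases:
        combined.extend(db)
    return combined


def select_units(combined, policy, default_voice):
    """Keep, for each phonetic category, only units from the voice the policy names;
    categories absent from the policy fall back to `default_voice`."""
    return [u for u in combined if u.voice == policy.get(u.category, default_voice)]
```

With the policy `{"vowel": "B"}` and default voice "A", vowels come from voice B and all other categories from voice A, without parameterizing either voice.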
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for detecting and correcting abnormal stress patterns in unit-selection speech synthesis. A system practicing the method detects incorrect stress patterns in selected acoustic units representing speech to be synthesized, and corrects the incorrect stress patterns in the selected acoustic units to yield corrected stress patterns. The system can further synthesize speech based on the corrected stress patterns. In one aspect, the system also classifies the incorrect stress patterns using a machine learning algorithm such as a classification and regression tree, adaptive boosting, support vector machine, and maximum entropy. In this way a text-to-speech unit selection speech synthesizer can produce more natural sounding speech with suitable stress patterns regardless of the stress of units in a unit selection database.
G10L 13/00 - Speech synthesis; Text-to-speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
G10L 13/10 - Prosody rules derived from text; Stress or intonation
G10L 15/18 - Speech classification or search using natural language modelling
G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups
49.
Wind noise detection for in-car communication systems with multiple acoustic zones
An in-car communication (ICC) system has multiple acoustic zones having varying acoustic environments. At least one input microphone within at least one acoustic zone develops a corresponding microphone signal from one or more system users. At least one loudspeaker within at least one acoustic zone provides acoustic audio to the system users. A wind noise module makes a determination of when wind noise is present in the microphone signal and modifies the microphone signal based on the determination.
A speech communication system includes a speech service compartment for holding one or more system users. The speech service compartment includes a plurality of acoustic zones having varying acoustic environments. At least one input microphone is located within the speech service compartment, for developing microphone input signals from the one or more system users. At least one loudspeaker is located within the service compartment. An in-car communication (ICC) system receives and processes the microphone input signals, forming loudspeaker output signals that are provided to one or more of the at least one output loudspeakers. The ICC system includes at least one of a speaker dedicated signal processing module and a listener specific signal processing module, that controls the processing of the microphone input signal and/or forming of the loudspeaker output signal based, at least in part, on at least one of an associated acoustic environment(s) and resulting psychoacoustic effect(s).
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups, specially adapted for particular use
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 21/0364 - Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
51.
Speech communication system for combined voice recognition, hands-free telephony and in-car communication
A multi-mode speech communication system is described that has different operating modes for different speech applications. A speech service compartment contains multiple system users, multiple input microphones that develop microphone input signals from the system users to the system, and multiple output loudspeakers that develop loudspeaker output signals from the system to the system users. A signal processing module is in communication with the speech applications and includes an input processing module and an output processing module. The input processing module processes the microphone input signals to produce a set of user input signals for each speech application that are limited to currently active system users for that speech application. The output processing module processes application output communications from the speech applications to produce loudspeaker output signals to the system users, wherein for each different speech application, the loudspeaker output signals are directed only to system users currently active in that speech application. The signal processing module dynamically controls the processing of the microphone input signals and the loudspeaker output signals to respond to changes in currently active system users for each application.
G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups, specially adapted for particular use
H04M 3/56 - Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
Techniques for providing speech output for speech-enabled applications. A synthesis system receives from a speech-enabled application a text input including a text transcription of a desired speech output. The synthesis system selects one or more audio recordings corresponding to one or more portions of the text input. In one aspect, the synthesis system selects from audio recordings provided by a developer of the speech-enabled application. In another aspect, the synthesis system selects an audio recording of a speaker speaking a plurality of words. The synthesis system forms a speech output including the one or more selected audio recordings and provides the speech output for the speech-enabled application.
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
53.
Method and system for text-to-speech synthesis with personalized voice
A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input (403) of speech in the form of an audio communication from an input speaker (401) and generating a voice dataset (404) for the input speaker (401). The method includes receiving a text input (411) at the same device as the audio input (403) and synthesizing (312) the text from the text input (411) to synthesized speech including using the voice dataset (404) to personalize the synthesized speech to sound like the input speaker (401). In addition, the method includes analyzing (316) the text for expression and adding the expression (315) to the synthesized speech. The audio communication may be part of a video communication (453) and the audio input (403) may have an associated visual input (455) of an image of the input speaker. The synthesis from text may include providing a synthesized image personalized to look like the image of the input speaker with expressions added from the visual input (455).
Network communications, Web-based services and customized services using the Web-based services may be provided over a peer-to-peer network from a first peer to a second peer (e.g., automobile head unit) wherein the first peer has a separate connection to a more general server-based network such as the Internet. A communications device application based on a peer communications framework component in communication with a peer network stack on the communications device may work as middleware, with a connection to both a more general server-based network such as the Internet and to an external device, such as a head unit of an automobile. Although the communications device has a separate connection out to the Internet via a general network stack co-existing on the same communications device, the peer network stack and the general network stack are not directly connected.
H04W 88/04 - Terminal devices adapted for relaying to or from another terminal or user
H04W 88/06 - Terminal devices adapted for operation in multiple networks, e.g. multi-mode terminals
H04W 4/00 - Services specially adapted for wireless communication networks; Facilities therefor
H04W 4/02 - Services making use of location information
H04W 8/18 - Processing of user or subscriber data, e.g. subscribed services, user preferences or user profiles; Transfer of user or subscriber data
Visual information is used to alter or set an operating parameter of an audio signal processor, other than a beamformer. A digital camera captures visual information about a scene that includes a human speaker and/or a listener. The visual information is analyzed to ascertain information about acoustics of a room. A distance between the speaker and a microphone may be estimated, and this distance estimate may be used to adjust an overall gain of the system. Distances among, and locations of, the speaker, the listener, the microphone, a loudspeaker and/or a sound-reflecting surface may be estimated. These estimates may be used to estimate reverberations within the room and adjust aggressiveness of an anti-reverberation filter, based on an estimated ratio of direct to indirect (reverberated) sound energy expected to reach the microphone. In addition, orientation of the speaker or the listener, relative to the microphone or the loudspeaker, can also be estimated, and this estimate may be used to adjust frequency-dependent filter weights to compensate for uneven frequency propagation of acoustic signals from a mouth, or to a human ear, about a human head.
G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups, characterised by the analysis technique
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or for stress-induced speech
G10L 17/00 - Speaker identification or verification techniques
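The abstract above states that an estimated talker-to-microphone distance may be used to adjust the overall gain. One simple rule, assumed here rather than taken from the source, is to offset the free-field inverse-distance law (about 6 dB per doubling of distance) relative to a reference distance, with a clamp so a bad distance estimate cannot cause extreme gain. The reference distance, clamp, and function name are hypothetical, and reverberation is ignored.

```python
import math


def distance_compensation_gain_db(distance_m, reference_m=0.5, max_gain_db=12.0):
    """Gain (dB) that offsets the free-field 1/r level drop between a reference
    talker distance and the camera-estimated distance, clamped to +/- max_gain_db."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    gain = 20.0 * math.log10(distance_m / reference_m)
    return max(-max_gain_db, min(max_gain_db, gain))
```

A talker estimated at 1 m, twice the 0.5 m reference, gets roughly +6 dB.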
A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
A handwriting recognition apparatus facilitates user entry of strokes one on top of another. The apparatus, which includes a processor and a display integrated with a touch sensitive screen, receives a series of strokes via the screen. Each stroke is defined by contact, trace, and lift occurrences. Each stroke appears on the display until occurrence of a prescribed event, and then disappears. The apparatus accumulates strokes into a buffer and interprets all accumulated strokes collectively against a character database and optionally a linguistic database, to identify multiple candidate strings that could be represented by the accumulated strokes. The apparatus displays candidate strings for user selection after all strokes have faded, or after receiving a user submitted delimiter, or after a given delay has elapsed following user entry of the latest stroke. Alternatively, candidate strings are displayed after each stroke without waiting for timeout or explicit delimiter.
G06F 3/041 - Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard-generated codes as alphanumeric codes, operand codes or instruction codes
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on sensed pressure registered by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 3/0354 - Pointing devices displaced or positioned by the user; Accessories therefor, with detection of relative movements in two dimensions [2D] between the pointing device or an operating part thereof and a plane or surface, e.g. 2D mice, trackballs, pens or pucks
58.
Systems and methods for an automated personalized dictionary generator for portable devices
A system and method for automated dictionary population is provided to facilitate the entry of textual material in dictionaries for enhancing word prediction. The automated dictionary population system is useful in association with a mobile device including at least one dictionary which includes entries. The device receives a communication which is parsed and textual data extracted. The text is compared to the entries of the dictionaries to identify new words. Statistical information for the parsed words, including word usage frequency, recency, or likelihood of use, is generated. Profanities may be processed by identifying profanities, modifying the profanities, and asking the user to provide feedback. Phrases are identified by phrase markers. Lastly, the new words are stored in a supplementary word list as single words or by linking the words of the identified phrases to preserve any phrase relationships. Likewise, the statistical information may be stored.
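The parse, compare, and count steps of the dictionary-population abstract above can be sketched with a regular-expression tokenizer and a frequency counter. The tokenizer pattern, lowercasing, and function names are assumptions; profanity handling, phrase markers, and recency statistics are not modelled.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")  # hypothetical tokenizer: letters and apostrophes


def find_new_words(message, dictionary):
    """Parse a received message and count words not already in the dictionary."""
    words = [w.lower() for w in WORD_RE.findall(message)]
    return Counter(w for w in words if w not in dictionary)


def update_supplementary_list(supplementary, message, dictionary):
    """Accumulate usage frequency of new words in a supplementary word list."""
    for word, count in find_new_words(message, dictionary).items():
        supplementary[word] = supplementary.get(word, 0) + count
    return supplementary
```

Feeding "See you at the Foobar, foobar rocks!" against a dictionary containing "see", "you", "at" and "the" records "foobar" twice and "rocks" once.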
Software, firmware, and systems are described for identifying characters in a handwritten input received from a user on an input device, irrespective of an angle that the input is received at. In one implementation, the system establishes an anchor point and distances from the anchor point to reference support lines. A set of candidate characters is identified based on received handwritten input. The system estimates support lines for each of the candidate characters. The system ranks the candidate characters based on a total deviation measurement from the expectation for each candidate, where the expectation in part is based on the established distance from the established anchor point to reference support lines, and identifies a best-ranked candidate based at least in part on a smallest total deviation measurement.
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on sensed pressure, by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06K 9/62 - Methods or arrangements for recognition using electronic means
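The ranking step of the handwriting abstract above can be sketched as follows, assuming each candidate's estimated support lines are already expressed as distances from the established anchor point, so the angle of the input no longer matters. The support-line names and the distance values in the usage example are hypothetical.

```python
def rank_candidates(estimates, expected):
    """Rank candidates by total deviation from the expected support lines.

    estimates: candidate -> {support_line_name: estimated distance from anchor}
    expected:  support_line_name -> reference distance from anchor
    Returns [(candidate, total_deviation), ...], smallest deviation first.
    """
    scored = []
    for candidate, lines in estimates.items():
        # Total deviation: sum of absolute differences over all support lines.
        total = sum(abs(lines[name] - ref) for name, ref in expected.items())
        scored.append((candidate, total))
    scored.sort(key=lambda item: item[1])
    return scored


expected = {"baseline": 0, "midline": 10, "ascender": 20}
estimates = {
    "l": {"baseline": 1, "midline": 11, "ascender": 20},
    "1": {"baseline": 0, "midline": 15, "ascender": 26},
    "I": {"baseline": 2, "midline": 12, "ascender": 23},
}
ranked = rank_candidates(estimates, expected)
best = ranked[0][0]  # best-ranked candidate: smallest total deviation
```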
60.
System and method for synthetic voice generation and modification
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
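A minimal sketch of policy-based selection from a combined database, as in the synthetic-voice abstract above. It assumes each voice's database maps a phone to a single stored unit and that the policy maps a phonetic category to the voice whose units should be used; real unit-selection databases hold many units per phone, so this layout and all names are hypothetical.

```python
def select_units(combined_db, policy, phones, default_voice):
    """Choose a voice unit for each (phone, category) pair according to a policy.

    combined_db: voice name -> {phone: unit}, the merged TTS databases
    policy: phonetic category -> voice name to draw that category's units from
    default_voice: voice used for categories the policy does not mention
    """
    selected = []
    for phone, category in phones:
        voice = policy.get(category, default_voice)
        selected.append(combined_db[voice][phone])
    return selected


combined_db = {
    "voice_a": {"p": "A/p", "a": "A/a", "t": "A/t"},
    "voice_b": {"p": "B/p", "a": "B/a", "t": "B/t"},
}
policy = {"vowel": "voice_b"}  # vowels from voice B, everything else from voice A
units = select_units(
    combined_db, policy, [("p", "stop"), ("a", "vowel"), ("t", "stop")], "voice_a"
)
```

No voice is parameterized here: the selected units are used as stored.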
H04B 7/04 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas, using two or more spaced independent antennas
H04B 7/06 - Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas, using two or more spaced independent antennas at the transmitting station
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units; for each speech unit in each ordered list, constructs a sublist of speech units from the next ordered list that are suitable for concatenation; performs a cost analysis of paths through the set of ordered lists based on the sublist constructed for each speech unit; and synthesizes speech using the lowest-cost path of speech units through the set of ordered lists, as determined by the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units that do not have an assigned pitch can be assigned a pitch.
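A dynamic-programming sketch of the path search above. It assumes each ordered list holds `(pitch_hz, unit_id)` pairs sorted by pitch, that a unit in the next list is suitable for concatenation when its pitch lies within `max_jump` of the current unit, and that the join cost is the absolute pitch difference. The pitch ordering is what lets `bisect` extract each sublist without scanning the whole next list; the threshold and the cost function are assumptions, not the patented cost analysis.

```python
from bisect import bisect_left, bisect_right


def lowest_cost_path(lists, max_jump=20.0):
    """Find the cheapest path of speech units through pitch-ordered lists.

    lists[k] is a list of (pitch_hz, unit_id) pairs sorted by pitch.
    Returns (total_cost, [unit_id, ...]), or None if no path exists.
    """
    # best maps an index into the current list to (cost so far, path so far).
    best = {i: (0.0, [uid]) for i, (_, uid) in enumerate(lists[0])}
    for k in range(len(lists) - 1):
        next_list = lists[k + 1]
        pitches = [p for p, _ in next_list]
        nxt = {}
        for i, (cost, path) in best.items():
            pitch = lists[k][i][0]
            # Sublist of next-list units close enough in pitch to concatenate.
            lo = bisect_left(pitches, pitch - max_jump)
            hi = bisect_right(pitches, pitch + max_jump)
            for j in range(lo, hi):
                p, uid = next_list[j]
                total = cost + abs(p - pitch)  # assumed join cost
                if j not in nxt or total < nxt[j][0]:
                    nxt[j] = (total, path + [uid])
        if not nxt:
            return None
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


units = [
    [(100.0, "a1"), (140.0, "a2")],
    [(110.0, "b1"), (150.0, "b2")],
    [(105.0, "c1"), (160.0, "c2")],
]
result = lowest_cost_path(units)
```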
The system and method described herein may dynamically generate a recognition grammar associated with a conversational voice user interface in an integrated voice navigation services environment. In particular, in response to receiving a natural language utterance that relates to a navigation context at the voice user interface, a conversational language processor may generate a dynamic recognition grammar that organizes grammar information based on one or more topological domains. For example, the one or more topological domains may be determined based on a current location associated with a navigation device, whereby a speech recognition engine may use the grammar information organized in the dynamic recognition grammar according to the one or more topological domains to generate one or more interpretations associated with the natural language utterance.
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
H04B 1/00 - Details of transmission systems, not covered by a single one of the groups; Details of transmission systems not characterised by the medium used for transmission
H04B 1/10 - Means associated with receiver for limiting or suppressing noise or interference
A system and method for receiving character input from a user includes a programmed processor that receives inputs from the user and disambiguates the inputs to present character sequence choices corresponding to the input characters. In one embodiment, a first character input is received and a corresponding first recognized character is stored in a temporary storage buffer and displayed to the user for editing. After a predetermined number of subsequent input characters and/or predetermined amount of time without being edited, the system determines that the first recognized character is the intended character input by the user and removes the first recognized character from the buffer, thereby inhibiting future editing.
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on sensed pressure, by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
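The temporary-buffer behaviour in the character-input abstract above can be sketched as a small class. The names `CommitBuffer`, `max_following`, and `timeout_s` are hypothetical, and the choice to commit strictly in input order (only the oldest pending character is ever checked) is an assumption that keeps the committed text in order.

```python
class CommitBuffer:
    """Keep recognized characters editable until they are implicitly confirmed.

    The oldest pending character is committed (removed from the buffer and
    no longer editable) once max_following characters have been entered after
    it, or once timeout_s seconds have passed since it was entered or last edited.
    """

    def __init__(self, max_following=3, timeout_s=2.0):
        self.max_following = max_following
        self.timeout_s = timeout_s
        self.pending = []    # [char, last_touched_time] pairs, oldest first
        self.committed = []  # confirmed characters

    def add(self, char, now):
        self.pending.append([char, now])
        self._flush(now)

    def edit(self, index, new_char, now):
        # Editing a pending character restarts its timer.
        self.pending[index] = [new_char, now]

    def tick(self, now):
        self._flush(now)

    def _flush(self, now):
        # Commit in input order, so only the head of the buffer is examined.
        while self.pending and (
            len(self.pending) - 1 >= self.max_following
            or now - self.pending[0][1] >= self.timeout_s
        ):
            self.committed.append(self.pending.pop(0)[0])


buf = CommitBuffer(max_following=2, timeout_s=5.0)
for t, ch in enumerate("hel"):
    buf.add(ch, now=float(t))  # "h" is committed once two more characters follow it
buf.edit(0, "a", now=3.0)      # correct the pending "e" to "a"
buf.tick(now=7.0)              # "a" was touched at 3.0, so it stays editable
```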
66.
Text browsing, editing and correction methods for automotive applications
An automotive text display arrangement is described which includes a driver text display positioned directly in front of an automobile driver and displaying a limited amount of text to the driver without impairing forward visual attention of the driver. The arrangement may include a boundary insertion mode wherein when the active text position is an active text boundary, new text is inserted between the text items separated by the active text boundary, and when the active text position is an active text item, new text replaces the active text item. In addition or alternatively, there may be a multifunctional text control knob offering multiple different user movements, each performing an associated text processing function.
Methods and apparatus for reducing impulsive interferences in a signal, without necessarily ascertaining a pitch frequency in the signal, detect onsets of the impulsive interferences by searching a spectrum of high-energy components for large temporal derivatives that are correlated along frequency and extend from a very low frequency up, possibly to about several kHz. The energies of the impulsive interferences are estimated, and these estimates are used to suppress the impulsive interferences. Optionally, techniques are employed to protect desired speech signals from being corrupted as a result of the suppression of the impulsive interferences.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
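The onset-detection step of the impulsive-interference abstract above can be sketched on a frame-by-bin energy grid. The dB representation, the `jump_db` and `min_bins` thresholds, and the rule that the rising run must start at the lowest bin are assumptions chosen to mirror "large temporal derivatives that are correlated along frequency and extend from a very low frequency up"; energy estimation and suppression are not shown.

```python
def detect_impulse_onsets(energies_db, jump_db=12.0, min_bins=4):
    """Flag frames where energy jumps sharply in a band starting at the lowest bin.

    energies_db[t][k] is the energy (dB) of frame t in frequency bin k,
    with bins ordered from low to high frequency. A frame is an onset when
    the frame-to-frame rise is at least jump_db in every bin from bin 0 up
    to at least bin min_bins - 1.
    Returns a list of (frame_index, number_of_contiguous_rising_bins).
    """
    onsets = []
    for t in range(1, len(energies_db)):
        extent = 0
        for prev, cur in zip(energies_db[t - 1], energies_db[t]):
            if cur - prev < jump_db:  # temporal derivative too small: run ends
                break
            extent += 1
        if extent >= min_bins:
            onsets.append((t, extent))
    return onsets


frames = [
    [0, 0, 0, 0, 0, 0],
    [20, 20, 20, 20, 20, 0],   # broadband click rising from the lowest bin
    [20, 20, 20, 20, 20, 0],   # no further rise
    [20, 20, 40, 40, 40, 40],  # rise only above bin 1: not treated as an impulse
]
onsets = detect_impulse_onsets(frames)
```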
An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses prediction capability to provide an automatic acoustic driven template versus model decision maker with an output quality that is high, stable and depends gradually on the system footprint. In speaker selection for a statistical Text-to-Speech synthesis (TTS) system build, as another example context, an embodiment according to the invention enables a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation, based on a small amount of recorded speech material.
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
69.
Systems, methods and articles for a server providing communications and services involving automobile head units
Network communications, Web-based services and customized services using the Web-based services may be provided to drivers and users via the automobile head unit in the vehicle and via their mobile device. The automobile head unit in the vehicle and the mobile device are communicatively linked via a short range wireless connection. Also, these devices may communicate over a network such as a cellular network to a service provider that provides entertainment and informational services to the mobile device and the head unit of the vehicle. The user's profile and preferences are able to follow the user to various locations and into vehicles because this information is stored at a server accessible by the user's mobile device, and in some embodiments, also the head unit. If the head unit does not have wider network connectivity, the mobile device may provide services to it over the short range wireless connection.
H04L 29/08 - Transmission control procedure, e.g. data link level control procedure
B60K 35/00 - Instruments specially adapted for vehicles; Arrangement of instruments in or on vehicles
B60R 16/023 - Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for, electric, for transmission of signals between vehicle parts or subsystems
70.
Using emoticons for contextual text-to-speech expressivity
Techniques disclosed herein include systems and methods that improve audible emotional characteristics used when synthesizing speech from a text source. Systems and methods herein use emoticons identified from a source text to provide contextual text-to-speech expressivity. In general, techniques herein analyze text and identify emoticons included within the text. The source text is then tagged with corresponding mood indicators. For example, if the system identifies an emoticon at the end of a sentence, then the system can infer that this sentence has a specific tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer tone or mood from the various emoticons and then change or modify the expressivity of the TTS output, for example by changing intonation, prosody, speed, pauses, and other expressivity characteristics.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
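A minimal sketch of the emoticon-tagging idea above: each sentence gets a mood from an emoticon that ends it or directly follows it, and each mood maps to prosody settings for the TTS engine. The emoticon table, the mood names, and the `rate`/`pitch_shift` values are all hypothetical, and whitespace tokenization is a simplifying assumption.

```python
# Hypothetical emoticon-to-mood table and per-mood prosody settings.
EMOTICON_MOODS = {
    ":)": "happy", ":-)": "happy", ":D": "laughing",
    ":(": "sad", ":-(": "sad", ">:(": "angry",
}
MOOD_PROSODY = {
    "neutral": {"rate": 1.0, "pitch_shift": 0},
    "happy": {"rate": 1.1, "pitch_shift": 2},
    "laughing": {"rate": 1.2, "pitch_shift": 3},
    "sad": {"rate": 0.85, "pitch_shift": -2},
    "angry": {"rate": 1.15, "pitch_shift": 0},
}


def tag_moods(text):
    """Return (sentence, mood, prosody) triples, removing emoticons from the text."""
    sentences = []  # [sentence_text, mood] pairs
    current = []
    for token in text.split():
        if token in EMOTICON_MOODS:
            mood = EMOTICON_MOODS[token]
            if current:
                # Emoticon ends a sentence that had no closing punctuation.
                sentences.append([" ".join(current), mood])
                current = []
            elif sentences:
                # Emoticon follows a sentence already closed by punctuation.
                sentences[-1][1] = mood
        else:
            current.append(token)
            if token[-1] in ".!?":
                sentences.append([" ".join(current), "neutral"])
                current = []
    if current:
        sentences.append([" ".join(current), "neutral"])
    return [(s, m, MOOD_PROSODY[m]) for s, m in sentences]


tagged = tag_moods("Got the job! :D See you later :( Bye.")
```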
71.
Systems, methods and articles for a communications device providing communications and services involving automobile head units
Network communications, Web-based services and customized services using the Web-based services may be provided to drivers and users via the automobile head unit in the vehicle and via their mobile device. The automobile head unit in the vehicle and the mobile device are communicatively linked via a short range wireless connection. Also, these devices may communicate over a network such as a cellular network to a service provider that provides entertainment and informational services to the mobile device and the head unit of the vehicle. The user's profile and preferences are able to follow the user to various locations and into vehicles because this information is stored at a server accessible by the user's mobile device, and in some embodiments, also the head unit. If the head unit does not have wider network connectivity, the mobile device may provide services to it over the short range wireless connection.
An embodiment of the invention is a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio. The tool provides facilities to create, view, play, and edit the synthesized speech, including editing pitch and duration targets, speaking type, paralinguistic events, and prosody. Prosody can be provided by way of a sample recording. Users can interact with the software tool through a graphical user interface (GUI). The software tool can produce synthesized audio file output in many file formats.
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
73.
Interchangeable input modules associated with varying languages
Interchangeable input modules, such as keypads, having user input devices configured to mate with base devices are described herein. The user input devices may include pluralities of inputs, such as input keys, associated with languages. The interchangeable input modules may further include storage components configured to store configuration data, linguistic structures, and/or predictive logic. Additionally, the interchangeable input modules may have interfaces configured to electrically couple the interchangeable input modules to the base devices after the interchangeable input modules are mated with the base devices.
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
74.
Method and apparatus for generating synthetic speech with contrastive stress
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 13/10 - Prosody rules derived from text; Stress or intonation
75.
Method and apparatus for combining text to speech and recorded prompts
An arrangement provides for improved synthesis of speech arising from a message text. The arrangement stores prerecorded prompts and speech-related characteristics for those prompts. A message is parsed to determine whether any message portions have been recorded previously. If so, the speech-related characteristics for those portions are retrieved. The arrangement generates speech-related characteristics for the portions not previously stored. The retrieved and generated characteristics are combined, and the combination is then used as the input to a speech synthesizer.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
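A sketch of the parsing step in the prompt-combining abstract above, assuming previously recorded prompts are stored as exact word sequences with a dictionary of speech-related characteristics (the `pitch_hz` field is hypothetical) and that the message is matched greedily, longest prompt first. Segments marked `generated` would get their characteristics from the TTS front end before both kinds are combined and passed to the synthesizer.

```python
def plan_segments(message, prompts):
    """Split a message into stored-prompt and to-be-generated segments.

    prompts maps an exact recorded word sequence to its stored speech-related
    characteristics. Matching is greedy, longest prompt first.
    Returns a list of (kind, text, characteristics_or_None).
    """
    words = message.split()
    max_len = max((len(p.split()) for p in prompts), default=0)
    plan, pending, i = [], [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in prompts:
                if pending:  # flush preceding text that has no recording
                    plan.append(("generated", " ".join(pending), None))
                    pending = []
                plan.append(("stored", phrase, prompts[phrase]))
                i += n
                break
        else:  # no recorded prompt starts at word i
            pending.append(words[i])
            i += 1
    if pending:
        plan.append(("generated", " ".join(pending), None))
    return plan


prompts = {"your balance is": {"pitch_hz": [120, 115, 110]}}
plan = plan_segments("your balance is forty dollars", prompts)
```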
76.
Technology for entering data using patterns of relative directions
A method and apparatus for entering words into a computer system. Letters contained in a desired word are entered by giving approximate location and directional information relative to any specified keyboard layout. The inputs need not correspond to specific keys on the keyboard; a sequence of ambiguous key entries corresponding to an individual word can be used to retrieve that word from the dictionary. The system tracks the direction of movement relative to the specified keyboard layout, reduces it to predetermined primary directions, and translates this seemingly ambiguous information into accurate words from the dictionary. The system may also capture the user's intention (with regard to text entry) by observing the movements on the keyboard.
G06F 3/02 - Input arrangements using manually operated switches, e.g. using keyboards or dials
G09G 5/00 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
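A sketch of the direction-pattern idea in entry 76: each movement between touches is reduced to one of eight primary directions, and dictionary words whose key-to-key pattern matches are returned. The QWERTY key coordinates (rows staggered by half a key), the eight compass labels, and exact pattern matching are assumptions; a real system would also rank the ambiguous matches, for example by word frequency.

```python
import math

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Hypothetical key centres: x in key widths (rows staggered by half a key), y in rows, downwards.
KEY_POS = {c: (x + 0.5 * r, float(r)) for r, row in enumerate(ROWS) for x, c in enumerate(row)}
DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]


def direction(dx, dy):
    """Reduce a movement vector to one of eight primary directions."""
    if dx == 0 and dy == 0:
        return "."  # same position, e.g. a doubled letter
    angle = math.degrees(math.atan2(-dy, dx)) % 360  # negate dy so "up" is north
    return DIRECTIONS[int((angle + 22.5) // 45) % 8]


def pattern(points):
    """Direction pattern of a sequence of (x, y) positions."""
    return tuple(direction(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(points, points[1:]))


def candidates(touch_points, dictionary):
    """Dictionary words of the right length whose key-to-key pattern matches the input."""
    target = pattern(touch_points)
    return [
        w for w in dictionary
        if len(w) == len(touch_points) and pattern([KEY_POS[c] for c in w]) == target
    ]


# Approximate touches near t, o, p; "tip" has the same pattern, so both are returned.
matches = candidates([(4.2, 0.1), (7.6, -0.1), (9.1, 0.0)], ["top", "tip", "cat", "dog"])
```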
77.
Speech synthesis system, speech synthesis program product, and speech synthesis method
Waveform concatenation speech synthesis with high sound quality. Prosody with both high accuracy and high sound quality is achieved by performing a two-path search consisting of a speech segment search and a prosody modification value search. An accurate accent is secured by evaluating the consistency of the prosody, using a statistical model of prosody variations (the slope of the fundamental frequency), in both paths: the speech segment selection and the modification value search. In the prosody modification value search, a prosody modification value sequence that minimizes a modified prosody cost is searched for. This allows the search to find a modification value sequence that makes the likelihood of the absolute values or variations of the prosody under the statistical model as high as possible while keeping the modification values to a minimum.
Network communications, Web-based services and customized services using the Web-based services may be provided over a peer-to-peer network from a first peer to a second peer (e.g., automobile head unit) wherein the first peer has a separate connection to a more general server-based network such as the Internet. A communications device application based on a peer communications framework component in communication with a peer network stack on the communications device may work as middleware, with a connection to both a more general server-based network such as the Internet and to an external device, such as a head unit of an automobile. Although the communications device has a separate connection out to the Internet via a general network stack co-existing on the same communications device, the peer network stack and the general network stack are not directly connected.
G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
H04W 88/06 - Terminal devices adapted for operation in multiple networks, e.g. multi-mode terminals
H04W 88/04 - Terminal devices adapted for relaying to, or from, another terminal or user
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
The present invention relates to a signal processing system comprising a number of microphones and loudspeakers; a hands-free set configured to receive a telephone signal from a remote party and to transmit a microphone signal supplied by at least one of the microphones to the remote party; and an in-vehicle communication system configured to receive a microphone signal supplied by at least one of the microphones, receive the telephone signal, amplify the microphone signal to obtain at least one first output signal, and output the at least one first output signal and/or a second output signal corresponding to the telephone signal to at least one of the loudspeakers. The signal processing system is configured to detect speech activity in the telephone signal and, if speech activity is detected, to control the in-vehicle communication system to reduce the amplification of the microphone signal by a damping factor.
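A per-block sketch of the damping rule in the hands-free abstract above. It assumes a simple mean-energy threshold stands in for the speech activity detector on the telephone signal and that `gain`, `damping`, and `vad_threshold` are fixed scalars chosen for illustration; a production system would use a proper voice activity detector and smooth the gain changes.

```python
def icc_output(mic_frame, tel_frame, gain=2.0, damping=0.25, vad_threshold=1e-3):
    """Compute one block of ICC loudspeaker output from a microphone block.

    Speech activity in the telephone block is detected with a mean-energy
    threshold (a stand-in for a real voice activity detector). When it is
    detected, the ICC amplification is reduced by the damping factor.
    """
    energy = sum(s * s for s in tel_frame) / len(tel_frame)
    effective_gain = gain * damping if energy > vad_threshold else gain
    return [effective_gain * s for s in mic_frame]


quiet_line = icc_output([0.1, -0.1], [0.0, 0.0])          # full ICC gain
far_end_talking = icc_output([0.1, -0.1], [0.5, -0.5])    # damped ICC gain
```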
A conversational, natural language voice user interface may provide an integrated voice navigation services environment. The voice user interface may enable a user to make natural language requests relating to various navigation services, and further, may interact with the user in a cooperative, conversational dialogue to resolve the requests. Through dynamic awareness of context, available sources of information, domain knowledge, user behavior and preferences, and external systems and devices, among other things, the voice user interface may provide an integrated environment in which the user can speak conversationally, using natural language, to issue queries, commands, or other requests relating to the navigation services provided in the environment.
Embodiments of the present invention exploit redundancy of succeeding FFT spectra and use this redundancy for computing interpolated temporal supporting points. An analysis filter bank converts overlapped sequences of an audio (e.g., loudspeaker) signal from a time domain to a frequency domain to obtain a time series of short-time loudspeaker spectra. An interpolator temporally interpolates this time series. The interpolation is fed to an echo canceller, which computes an estimated echo spectrum. A microphone analysis filter bank converts overlapped sequences of an audio microphone signal from the time domain to the frequency domain to obtain a time series of short-time microphone spectra. The estimated echo spectrum is subtracted from the microphone spectrum. Further signal enhancement (filtering) may be applied. A synthesis filter bank converts the filtered microphone spectra to the time domain to generate an echo-compensated audio microphone signal. The computational complexity of signal processing systems can therefore be reduced.
G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
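A sketch of two steps from the echo-cancellation abstract above: inserting interpolated spectra between successive short-time loudspeaker spectra, and subtracting an estimated echo spectrum from a microphone spectrum bin by bin. Linear interpolation of complex bins is an assumption (the abstract does not name the interpolation method), and the adaptive filter that would produce the echo estimate is not shown.

```python
def interpolate_spectra(frames, factor=2):
    """Insert factor - 1 linearly interpolated spectra between consecutive frames.

    frames: non-empty list of equal-length lists of complex FFT bins.
    """
    out = []
    for a, b in zip(frames, frames[1:]):
        for step in range(factor):
            t = step / factor
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(list(frames[-1]))
    return out


def subtract_echo(mic_spectrum, echo_estimate):
    """Subtract an estimated echo spectrum from a microphone spectrum, bin by bin."""
    return [m - e for m, e in zip(mic_spectrum, echo_estimate)]


frames = [[0j, 2 + 0j], [2 + 0j, 4 + 0j]]
dense = interpolate_spectra(frames)  # one interpolated supporting point in between
```

With `factor=2`, the echo canceller sees spectra at twice the rate at which the analysis filter bank actually computes FFTs, which is where the saving in computation comes from.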
83.
Systems and methods for character correction in communication devices
Systems and methods for character error correction are provided, useful for a user of mobile appliances to produce written text with reduced errors. The system includes an interface, a word prediction engine, a statistical engine, an editing distance calculator, and a selector. A string of characters, known as the inputted word, may be entered into the mobile device via the interface. The word prediction engine may then generate word candidates similar to the inputted word using fuzzy logic and user preferences generated from past user behavior. The statistical engine may then generate variable error costs determined by the probability of erroneously inputting any given character. The editing distance calculator may then determine the editing distance between the inputted word and each of the word candidates by grid comparison using the variable error costs. The selector may choose one or more preferred candidates from the word candidates using the editing distances.
G06K 9/72 - Methods or arrangements for recognition using electronic means using context analysis based on the provisionally recognised identity of a number of successive patterns, e.g. a word
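The grid comparison with variable error costs described above can be sketched as a weighted Levenshtein distance. The error model here, where horizontally adjacent keys cost 0.3 to confuse and all other substitutions cost 1.0, is a hypothetical stand-in for the statistical engine's probability-derived costs; the candidate list would normally come from the word prediction engine.

```python
def weighted_edit_distance(typed, candidate, sub_cost, indel_cost=1.0):
    """Edit distance computed on a grid, with a variable substitution cost.

    sub_cost(a, b) is the cost of typing a when b was intended (0 when equal).
    """
    rows, cols = len(typed) + 1, len(candidate) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,  # extra character typed
                d[i][j - 1] + indel_cost,  # intended character omitted
                d[i - 1][j - 1] + sub_cost(typed[i - 1], candidate[j - 1]),
            )
    return d[-1][-1]


# Hypothetical error model: horizontally adjacent keys are easy to confuse.
ADJACENT = {frozenset(pair) for row in ("qwertyuiop", "asdfghjkl", "zxcvbnm")
            for pair in zip(row, row[1:])}


def keyboard_cost(a, b):
    if a == b:
        return 0.0
    return 0.3 if frozenset((a, b)) in ADJACENT else 1.0


def select(typed, candidates, k=1):
    """Return the k candidates with the smallest weighted edit distance."""
    return sorted(candidates, key=lambda w: weighted_edit_distance(typed, w, keyboard_cost))[:k]


best = select("hwllo", ["hallo", "hello", "hull"])
```

A plain Levenshtein distance would tie "hello" and "hallo" at 1; the variable costs break the tie in favour of the adjacent-key error.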
84.
Systems and methods for an automated personalized dictionary generator for portable devices
A system and method for automated dictionary population is provided to facilitate the entry of textual material in dictionaries for enhancing word prediction. The automated dictionary population system is useful in association with a mobile device including at least one dictionary which includes entries. The device receives a communication which is parsed and textual data extracted. The text is compared to the entries of the dictionaries to identify new words. Statistical information for the parsed words, including word usage frequency, recency, or likelihood of use, is generated. Profanities may be processed by identifying profanities, modifying the profanities, and asking the user to provide feedback. Phrases are identified by phrase markers. Lastly, the new words are stored in a supplementary word list as single words or by linking the words of the identified phrases to preserve any phrase relationships. Likewise, the statistical information may be stored.
Embodiments of the present invention exploit redundancy of succeeding FFT spectra and use this redundancy for computing interpolated temporal supporting points. An analysis filter bank converts overlapped sequences of an audio (e.g., loudspeaker) signal from a time domain to a frequency domain to obtain a time series of short-time loudspeaker spectra. An interpolator temporally interpolates this time series. The interpolation is fed to an echo canceller, which computes an estimated echo spectrum. A microphone analysis filter bank converts overlapped sequences of an audio microphone signal from the time domain to the frequency domain to obtain a time series of short-time microphone spectra. The estimated echo spectrum is subtracted from the microphone spectrum. Further signal enhancement (filtering) may be applied. A synthesis filter bank converts the filtered microphone spectra to the time domain to generate an echo-compensated audio microphone signal. The computational complexity of signal processing systems can therefore be reduced.
G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
86.
In-car communication system for multiple acoustic zones
An In-Car Communication (ICC) system supports the communication paths within a car by receiving the speech signals of a speaking passenger and playing them back for one or more listening passengers. Signal processing tasks are split into a microphone-related part and a loudspeaker-related part. A sound processing system suitable for use in a vehicle having multiple acoustic zones includes a plurality of microphone In-Car Communication (Mic-ICC) instances and a plurality of loudspeaker In-Car Communication (Ls-ICC) instances. The system further includes a dynamic audio routing matrix, which has a controller and is coupled to the Mic-ICC instances, a mixer coupled to the plurality of Mic-ICC instances, and a distributor coupled to the Ls-ICC instances.
G10L 15/20 - Techniques de reconnaissance de la parole spécialement adaptées de par leur robustesse contre les perturbations environnantes, p. ex. en milieu bruyant ou reconnaissance de la parole émise dans une situation de stress
H04M 9/08 - Systèmes téléphoniques à haut-parleur à double sens comportant des moyens pour conditionner le signal, p. ex. pour supprimer les échos dans l'une ou les deux directions du trafic
H04R 5/02 - Dispositions spatiales ou structurelles de haut-parleurs
H04R 1/40 - Dispositions pour obtenir la fréquence désirée ou les caractéristiques directionnelles pour obtenir la caractéristique directionnelle désirée uniquement en combinant plusieurs transducteurs identiques
H04R 3/02 - Circuits pour transducteurs pour empêcher la réaction acoustique
H04R 3/12 - Circuits pour transducteurs pour distribuer des signaux à plusieurs haut-parleurs
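A routing matrix of the kind described can be pictured as a gain table between Mic-ICC and Ls-ICC instances. The sketch below is hypothetical. It assumes one microphone zone and one loudspeaker zone per seat, and a controller rule that routes each active talker to every other zone but never back to the talker's own zone. The names `route`, `make_gains`, and `active_talkers` are illustrative and are not taken from the patent.

```python
# Hypothetical sketch: gains[ls][mic] is the gain from Mic-ICC instance `mic`
# to Ls-ICC instance `ls`; mixing and distribution are a matrix-vector product.

def route(mic_signals, gains):
    """Mix Mic-ICC outputs and distribute the mixes to each Ls-ICC instance."""
    n_samples = len(mic_signals[0])
    outputs = []
    for row in gains:  # one row per loudspeaker zone
        mixed = [0.0] * n_samples
        for g, sig in zip(row, mic_signals):
            if g:  # skip muted routes
                for i, s in enumerate(sig):
                    mixed[i] += g * s
        outputs.append(mixed)
    return outputs


def make_gains(n_zones, active_talkers, gain=1.0):
    """Controller rule (assumed): route active talkers to all other zones only."""
    return [[gain if (mic in active_talkers and mic != ls) else 0.0
             for mic in range(n_zones)]
            for ls in range(n_zones)]
```

Recomputing `make_gains` whenever the set of active talkers changes is one way to make the matrix "dynamic" in the sense used above.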
87.
System and method for information identification using tracked preferences of a user
A system and a method for retrieving information are described. In a system according to the invention, software modules may be used to provide the user with the information that is most likely to be the information the user desires.
G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
G06F 17/30 - Information retrieval; Database structures therefor
H04M 1/2745 - Devices whereby a plurality of signals may be stored simultaneously with provision for storing more than one subscriber number at a time, using static electronic memories, e.g. chips
A text message processing arrangement is described for use in a mobile environment. A mobile messaging application processes user text messages during a user messaging session. A user state model reflects situational parameters to characterize user cognitive load. A functionality control module adjusts functional performance of the mobile messaging application based on the user state model.
H04M 1/00 - Substation equipment, e.g. for use by subscribers
H04W 48/04 - Access restriction performed under specific conditions based on user or terminal location or mobility data, e.g. moving direction or speed
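The split between a user state model and a functionality control module can be sketched as follows. Everything concrete here is a hypothetical assumption for illustration: the situational parameters (`speed_kmh`, `in_heavy_traffic`, `navigating`), the weights in the load score, the thresholds, and the function names. The abstract does not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class UserState:
    """Hypothetical user state model built from situational parameters."""
    speed_kmh: float
    in_heavy_traffic: bool
    navigating: bool

    def cognitive_load(self) -> float:
        """Illustrative load score in [0, 1]; the weights are assumptions."""
        load = min(self.speed_kmh / 130.0, 1.0) * 0.6
        load += 0.3 if self.in_heavy_traffic else 0.0
        load += 0.1 if self.navigating else 0.0
        return min(load, 1.0)


def allowed_functions(state: UserState) -> set:
    """Functionality control: restrict the messaging app as the load grows."""
    load = state.cognitive_load()
    if load < 0.3:
        return {"read", "type", "dictate"}
    if load < 0.7:
        return {"read_aloud", "dictate"}
    return {"read_aloud"}
```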
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, indexed by a unique identifier. As the system continues converting text into speech, it reuses the cached audio whenever identical text reappears, so that text does not need to be re-synthesized by the TTS server.
G10L 13/00 - Speech synthesis; Text to speech systems
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/10 - Prosody rules derived from text; Stress or intonation
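The caching behaviour can be illustrated with a short sketch. It assumes a SHA-256 hash of the phrase text as the "unique identifier". It also splits phrases crudely at punctuation as a stand-in for real intonational-phrase detection. `synthesize` is a hypothetical callable standing in for the TTS server round trip.

```python
import hashlib
import re


def split_phrases(text):
    """Crude stand-in for intonational phrasing: split after punctuation."""
    return [p for p in re.split(r"(?<=[,.;:!?])\s+", text.strip()) if p]


class TTSCache:
    """Cache of synthesized audio keyed by a hash of the phrase text."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # hypothetical: phrase text -> audio bytes
        self._cache = {}

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def audio_for(self, phrase: str) -> bytes:
        k = self.key(phrase)
        if k not in self._cache:  # only unseen text reaches the TTS server
            self._cache[k] = self._synthesize(phrase)
        return self._cache[k]
```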
90.
Method for determining a noise reference signal for noise compensation and/or noise reduction
The invention provides a method for determining a noise reference signal for noise compensation and/or noise reduction. A first audio signal on a first signal path and a second audio signal on a second signal path are received. The first audio signal is filtered using a first adaptive filter to obtain a first filtered audio signal. The second audio signal is filtered using a second adaptive filter to obtain a second filtered audio signal. The first and second filtered audio signals are combined to obtain the noise reference signal. The first and second adaptive filters are adapted so as to minimize a wanted signal component in the noise reference signal.
G10K 11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
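The adaptation goal, filtering two paths and summing them so that the wanted signal cancels, can be shown with a deliberately simplified sketch. It makes three assumptions. The two filters are single-tap gains rather than FIR filters. Adaptation runs on segments dominated by the wanted signal. A unit-norm constraint on `(w1, w2)` rules out the trivial all-zero solution. The LMS-style update and the step size `mu` are illustrative choices, not the patent's algorithm.

```python
import math


def adapt_blocking_filters(x1, x2, mu=0.05):
    """Adapt gains w1, w2 so that w1*x1 + w2*x2 suppresses the wanted signal."""
    w1, w2 = 1.0, 0.0
    for a, b in zip(x1, x2):
        e = w1 * a + w2 * b  # current noise-reference sample
        w1 -= mu * e * a     # gradient step that reduces e**2
        w2 -= mu * e * b
        norm = math.hypot(w1, w2)
        w1, w2 = w1 / norm, w2 / norm  # unit-norm constraint
    return w1, w2


def noise_reference(x1, x2, w1, w2):
    """Combine the two filtered signals into the noise reference."""
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]
```

When the wanted signal reaches the second path scaled by 0.5, the filters converge towards `w1 + 0.5 * w2 = 0`. At that point the wanted component cancels in the sum, while noise that is not coherent between the two paths remains in the reference.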
A speech output is generated from a text input written in a first language and containing inclusions in a second language. Words in the native (first) language are pronounced with a native pronunciation, and words in the foreign (second) language are pronounced with a proficient foreign pronunciation. Language-dependent phoneme symbols generated for words of the second language are replaced with language-dependent phoneme symbols of the first language. This replacement includes the steps of: assigning a language-independent target phoneme symbol to each language-dependent phoneme symbol of the second language; mapping each language-independent target phoneme symbol to a language-independent substitute phoneme symbol that can be assigned to a language-dependent substitute phoneme symbol of the first language; and substituting the language-dependent phoneme symbols of the second language with the language-dependent substitute phoneme symbols of the first language.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G06F 17/28 - Processing or translating of natural language
G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
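The three-step substitution can be traced with toy lookup tables. The sketch is hypothetical throughout. It assumes German as the first language and English inclusions as the second. It uses SAMPA-like symbols as the language-dependent sets and IPA as the language-independent set, and the mapping tables are illustrative rather than complete or authoritative.

```python
# Step 1: language-dependent (English) -> language-independent target (IPA).
EN_TO_TARGET = {"T": "θ", "D": "ð", "w": "w", "r": "ɹ", "I": "ɪ", "N": "ŋ", "s": "s"}
# Step 2: target -> substitute the first language can realize
# (identity when German already has the sound).
TARGET_TO_SUBSTITUTE = {"θ": "s", "ð": "z", "w": "v", "ɹ": "ʁ"}
# Step 3: language-independent substitute -> German language-dependent symbol.
IPA_TO_DE = {"s": "s", "z": "z", "v": "v", "ʁ": "R", "ɪ": "I", "ŋ": "N"}


def nativize(en_phonemes):
    """Replace English phoneme symbols with German substitute symbols."""
    out = []
    for p in en_phonemes:
        target = EN_TO_TARGET[p]
        substitute = TARGET_TO_SUBSTITUTE.get(target, target)
        out.append(IPA_TO_DE[substitute])
    return out
```

For example, English "thing" (`T I N`) becomes `s I N`. The language-independent middle layer means only one substitute table is needed per target language, rather than one per language pair.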
92.
System and method for dynamic noise adaptation for robust automatic speech recognition
A speech processing method and arrangement are described. A dynamic noise adaptation (DNA) model characterizes a speech input, reflecting the effects of background noise. A null noise DNA model characterizes the speech input under a null noise mismatch condition. A DNA interaction model performs Bayesian model selection and re-weighting of the DNA model and the null noise DNA model to realize a modified DNA model. The modified model characterizes the speech input for automatic speech recognition and compensates for noise to a varying degree, depending on the relative probabilities of the DNA model and the null noise DNA model.
G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, or of stress-induced speech
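The idea of compensating "to a varying degree" can be illustrated with a two-model Bayesian posterior used as an interpolation weight. This is a simplified, hypothetical sketch. It takes the per-model log-likelihoods and the noise-compensated feature as given inputs, and it blends features linearly. The actual DNA interaction model is not described in enough detail here to reproduce.

```python
import math


def model_posteriors(loglik_dna, loglik_null, prior_dna=0.5):
    """Posterior probabilities of the DNA model and the null-noise model."""
    a = loglik_dna + math.log(prior_dna)
    b = loglik_null + math.log(1.0 - prior_dna)
    m = max(a, b)  # log-sum-exp shift for numerical stability
    p_dna = math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
    return p_dna, 1.0 - p_dna


def compensate(noisy_feature, noise_compensated_feature, p_dna):
    """Apply noise compensation only to the degree the DNA model is believed."""
    return p_dna * noise_compensated_feature + (1.0 - p_dna) * noisy_feature
```

When the null-noise model explains the input much better (clean speech), `p_dna` approaches 0 and the feature is left almost untouched. This avoids over-compensating clean audio.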
A handwriting recognition apparatus facilitates user entry of strokes one on top of another. The apparatus, which includes a processor and a display integrated with a touch sensitive screen, receives a series of strokes via the screen. Each stroke is defined by contact, trace, and lift occurrences. Each stroke appears on the display until occurrence of a prescribed event, and then disappears. The apparatus accumulates strokes into a buffer and interprets all accumulated strokes collectively against a character database and optionally a linguistic database, to identify multiple candidate strings that could be represented by the accumulated strokes. The apparatus displays candidate strings for user selection after all strokes have faded, or after receiving a user submitted delimiter, or after a given delay has elapsed following user entry of the latest stroke. Alternatively, candidate strings are displayed after each stroke without waiting for timeout or explicit delimiter.
G06F 3/041 - Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
G06K 9/00 - Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
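The accumulate-then-interpret behaviour can be sketched as a small buffer with the three display triggers named in the abstract: all strokes faded, an explicit delimiter, or an inactivity delay. The class name, the `delay_s` value, and the stroke representation (a list of points from contact to lift) are hypothetical. Recognition against the character and linguistic databases is left out.

```python
class StrokeBuffer:
    """Hypothetical buffer for overlapping strokes awaiting collective recognition."""

    def __init__(self, delay_s=0.8):
        self.delay_s = delay_s  # assumed inactivity delay before showing candidates
        self.strokes = []
        self.last_stroke_time = None

    def add_stroke(self, points, t):
        """points: trace between contact and lift; t: lift time in seconds."""
        self.strokes.append(points)
        self.last_stroke_time = t

    def ready(self, now, delimiter=False, all_faded=False):
        """True when candidate strings should be displayed."""
        if not self.strokes:
            return False
        return delimiter or all_faded or (now - self.last_stroke_time) >= self.delay_s

    def flush(self):
        """Hand all accumulated strokes to the recognizer and clear the buffer."""
        strokes, self.strokes = self.strokes, []
        return strokes
```

The alternative mode in the abstract (candidates after every stroke) corresponds to calling the recognizer on the current buffer contents after each `add_stroke`, without waiting for `ready`.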
94.
Efficient audio signal processing in the sub-band regime
A signal processing system enhances an audio signal. The audio signal is divided into audio sub-band signals. Some audio sub-band signals are excised. Other audio sub-band signals are processed to obtain enhanced audio sub-band signals. At least a portion of the excised audio sub-band signals are reconstructed. The reconstructed audio sub-band signals are synthesized with the enhanced audio sub-band signals to form an enhanced audio signal.
H04B 3/20 - Reducing echo effects or singing; Opening or closing transmitting path; Conditioning for transmission in one direction or the other
H04M 9/08 - Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
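The excise/process/reconstruct sequence can be illustrated on one frame of sub-band magnitudes. The sketch assumes three things. Every odd-indexed band is excised. The "enhancement" of the remaining bands is a simple gain (a placeholder for real noise reduction or echo suppression). Each excised band is reconstructed by averaging its processed neighbours. The abstract specifies none of these choices, and the synthesis filter bank is omitted.

```python
def enhance_with_excision(subbands, gain=0.5):
    """Process even-indexed bands, excise odd ones, then rebuild the excised bands."""
    n = len(subbands)
    enhanced = {k: subbands[k] * gain for k in range(0, n, 2)}  # processed bands
    out = []
    for k in range(n):
        if k in enhanced:
            out.append(enhanced[k])
        else:
            # Reconstruct from processed neighbours (k - 1 is always present).
            neighbours = [enhanced[j] for j in (k - 1, k + 1) if j in enhanced]
            out.append(sum(neighbours) / len(neighbours))
    return out
```

Only half of the bands go through the (typically expensive) enhancement stage. This is where the efficiency gain named in the title would come from.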
A method of selecting a service and inputting information to that service, in which an input device having keys is provided. When a key is pressed and released quickly, the user indicates a desire to enter a symbol on the key in order to enter symbols of an entry string. In addition, one or more of the keys may also be used to identify a service and also supply that service with the entry string. For example, by pressing and holding such a key, the entry string may be delimited and then sent to a service corresponding to the pressed key. In this manner, a single key press may be used to both delimit an entry string and also send the entry string to the service. The service may use the delimited entry string to retrieve information, which is then supplied to the input device.
H04M 1/274 - Devices whereby a plurality of signals may be stored simultaneously with provision for storing more than one subscriber number at a time
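The press-versus-hold distinction can be sketched as a classifier over `(key, duration)` events. It assumes a hypothetical 0.5 s hold threshold and a hypothetical `services` mapping from keys to service names. For simplicity, each quick press enters the key's own character rather than one of its letters.

```python
def process_keys(events, services, hold_threshold=0.5):
    """events: (key, press_duration_s) pairs.

    A quick press adds a symbol to the entry string; pressing and holding a
    service key delimits the entry string and sends it to that service.
    """
    entry = []
    requests = []
    for key, duration in events:
        if duration < hold_threshold:
            entry.append(key)  # quick press: enter the key's symbol
        elif key in services:
            requests.append((services[key], "".join(entry)))  # delimit and send
            entry = []
    return requests, "".join(entry)
```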
A computer receives user entry of a sequence of key presses representing an intended series of letters that collectively spell out some or all of a desired textual object. Resolution of the intended series of letters and of the desired textual object is ambiguous, however, because some or all of the key presses individually represent multiple letters. The computer interprets the key presses using concurrent, competing strategies, including one-key-press-per-letter and multi-tap interpretations. The computer displays a combined output of proposed interpretations and completions from both strategies.
H03K 17/94 - Electronic switching or gating, i.e. not by contact-making and -breaking, characterised by the way in which the control signals are generated
H03M 11/00 - Coding in connection with keyboards or like devices, i.e. coding of the position of operated keys
G06F 3/02 - Input arrangements using manually operated switches, e.g. using keyboards or dials
G09G 5/00 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
G06F 13/12 - Program control for peripheral devices using hardware independent of the central processor, e.g. channel or peripheral processor
G06F 13/38 - Information transfer, e.g. on bus
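The two competing interpretations can be computed side by side for a standard phone keypad (keys 2 to 9 only). This is a minimal sketch. The vocabulary is a hypothetical word list. The multi-tap reading groups consecutive identical keys, ignoring the pause a real multi-tap system uses to separate two letters on the same key. The output order (dictionary matches first, then the multi-tap string) is an assumption.

```python
from itertools import groupby

KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {ch: k for k, letters in KEYPAD.items() for ch in letters}


def one_press_per_letter(keys, vocabulary):
    """Words (or completions) whose first len(keys) letters fit the key sequence."""
    return [w for w in vocabulary
            if len(w) >= len(keys)
            and all(LETTER_TO_KEY.get(c) == k for c, k in zip(w, keys))]


def multi_tap(keys):
    """A run of n presses of the same key selects the n-th letter on that key."""
    return "".join(KEYPAD[k][(len(list(run)) - 1) % len(KEYPAD[k])]
                   for k, run in groupby(keys))


def combined(keys, vocabulary):
    """Merge proposals from both strategies into one candidate list."""
    proposals = one_press_per_letter(keys, vocabulary)
    mt = multi_tap(keys)
    if mt not in proposals:
        proposals.append(mt)
    return proposals
```

For the key sequence `4663`, the one-press-per-letter strategy proposes words such as "good", "home", and "gone". The multi-tap reading of the same presses is "gnd".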
97.
Measurement and tuning of hands free telephone systems
An arrangement is described for measuring the performance characteristics of a hands-free telephone system. A measurement system can be coupled directly to the hands-free telephone system over a telephone audio interface in order to measure these performance characteristics.
User input is received, specifying a continuous traced path across a keyboard presented on a touch sensitive display. An input sequence is resolved, including traced keys and auxiliary keys proximate to the traced keys by prescribed criteria. For each of one or more candidate entries of a prescribed vocabulary, a set-edit-distance metric is computed between said input sequence and the candidate entry. Various rules specify when penalties are imposed, or not, in computing the set-edit-distance metric. Candidate entries are ranked and displayed according to the computed metric.
G09G 5/00 - Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
G06F 3/023 - Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser, using a touch-screen or digitiser, e.g. input of commands through traced gestures
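A set-edit-distance can be sketched as a Levenshtein distance in which input position i matches a candidate letter if that letter belongs to the set of keys resolved at that position (the traced key plus nearby auxiliary keys). The unit insertion, deletion, and substitution costs below are a simplifying assumption. The abstract's rules about when penalties are or are not imposed are not reproduced.

```python
def set_edit_distance(input_sets, candidate):
    """Levenshtein distance where input position i matches any letter in input_sets[i]."""
    m = len(candidate)
    prev = list(range(m + 1))
    for i in range(1, len(input_sets) + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if candidate[j - 1] in input_sets[i - 1] else 1
            cur[j] = min(prev[j] + 1,        # input position not used by the word
                         cur[j - 1] + 1,     # word letter missing from the path
                         prev[j - 1] + sub)  # match or substitution
        prev = cur
    return prev[m]


def rank_candidates(input_sets, vocabulary):
    """Order vocabulary entries by increasing set-edit-distance (stable for ties)."""
    return sorted(vocabulary, key=lambda w: set_edit_distance(input_sets, w))
```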
The method provides a spectral speech description to be used for synthesis of a speech utterance, where at least one spectral envelope input representation is received. In one solution, the improvement is achieved by manipulating an extremum, i.e. a peak or a valley, in the rapidly varying component of the spectral envelope representation. The rapidly varying component is manipulated to sharpen and/or accentuate extrema, after which it is merged back with the slowly varying component of the spectral envelope input representation to create an enhanced spectral envelope final representation. In other solutions, a complex spectral envelope final representation is created with phase information derived from either the group delay representation of a real spectral envelope input representation corresponding to a short-time speech signal, or a transformed phase component of the discrete complex frequency domain input representation corresponding to the speech utterance.
G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
G10L 21/0232 - Processing in the frequency domain
G10L 21/003 - Changing voice quality, e.g. pitch or formants
G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
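The first solution (split, sharpen the rapidly varying part, merge back) can be illustrated on a log-magnitude envelope sampled on a frequency grid. The decomposition here is an assumption: a moving average serves as the slowly varying component, and the residual is the rapidly varying component. The `half_width` and `alpha` values are hypothetical. The patent may use a different decomposition, for example a cepstral one.

```python
def moving_average(x, half_width):
    """Centered moving average with a window truncated at the edges."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def sharpen_envelope(log_env, half_width=2, alpha=1.5):
    """Accentuate peaks and valleys by scaling the rapidly varying component."""
    slow = moving_average(log_env, half_width)       # slowly varying component
    rapid = [e - s for e, s in zip(log_env, slow)]   # rapidly varying component
    return [s + alpha * r for s, r in zip(slow, rapid)]  # merge back
```

With `alpha > 1`, formant peaks rise and the valleys between them deepen. With `alpha = 1`, the envelope is returned unchanged.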
100.
Automatically generating audible representations of data content based on user preferences
A custom-content audible representation of selected data content is automatically created for a user. The content is based on the user's content preferences (e.g., one or more web browsing histories). The content is aggregated, converted using text-to-speech technology, and adapted to fit the desired length selected for the personalized audible representation. The length of the audible representation may be customized for the user, for example based on the amount of time the user typically spends traveling.
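Fitting aggregated content to a target length can be sketched as a greedy selection by preference score. It assumes a hypothetical speaking rate of 150 words per minute to estimate spoken duration before text-to-speech conversion. `target_seconds` could be set to the user's typical travel time. The scoring and the greedy policy are illustrative choices, not taken from the source.

```python
def fit_to_length(items, target_seconds, words_per_minute=150):
    """items: (preference_score, text) pairs.

    Keep the highest-scoring items whose estimated spoken duration still fits.
    """
    chosen, used = [], 0.0
    for score, text in sorted(items, key=lambda it: it[0], reverse=True):
        duration = len(text.split()) * 60.0 / words_per_minute  # estimated seconds
        if used + duration <= target_seconds:
            chosen.append(text)
            used += duration
    return chosen, used
```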