The present invention extends to methods, systems, and computer program products for interpreting queries according to preferences. Multi-domain natural language understanding systems can support a variety of different types of clients. Queries can be received and interpreted across one or more domains. Preferred query interpretations can be identified and query responses provided based on any of: domain preferences, preferences indicated by an identifier, or (e.g., weighted) scores exceeding a threshold.
A method of assisting a user. The method includes obtaining a plurality of rules having condition components and action components, the action components specifying conversation schemas; detecting, by a sensor, a fact related to an environment of the user; identifying a rule, of the plurality of rules, having a condition component that is satisfied by the detected fact; initiating a conversation with the user according to a conversation schema of the action component of the identified rule; and performing an action in response to a positive statement by the user.
A method and an apparatus for processing an intelligent voice query. A voice query input is received from a user. Automatic speech recognition and natural language understanding generate structured query data. The structured query data is modified based on an input adaptation rule to obtain modified structured query data appropriate for a content providing server, which provides a query result output corresponding to the modified structured query data. Input adaptation rules may comprise rule sets based on behavior patterns of the user and/or business recommendations. The query result output can be used for natural language generation, which may have similar adaptation rules for output.
A method for automatic speech recognition (ASR) of audio data streams involves obtaining a service level indication to determine the appropriate ASR model from a set of models. The set includes at least a first model with higher accuracy and greater computing resource requirements, and a second model with lower accuracy and reduced resource demands. The method includes selecting an ASR model based on the service level indication, receiving the audio data stream, and executing ASR using the chosen model. This approach allows for dynamic adaptation of ASR processing based on available resources and desired accuracy, optimizing performance and resource allocation.
A method and system for controlling a GUI on a user's network-connected device, the control being provided by a telephone call between the user and a speech recognition and speech synthesis system. An example of a restaurant ordering system is provided. The user calls a phone number and is guided through a verbal ordering process that includes one or more of: adding an item, deleting an item, changing quantities, changing sizes, and changing details of an item. The user's choices are added to a display so that a current status of the order is visible to the user. The GUI is updated as changes are made to the order. The GUI can also request additional information, upsell items, and show menus. The GUI aids the user in confirming that the order is correct. The system provides the final order to a restaurant for fulfillment.
Methods and systems for correction of a likely erroneous word in a speech transcription are disclosed. By evaluating token confidence scores of individual words or phrases, the automatic speech recognition system can replace a low-confidence score word with a substitute word or phrase. Among various approaches, neural network models can be used to generate individual confidence scores. Such word substitution can enable the speech recognition system to automatically detect and correct likely errors in transcription. Furthermore, the system can indicate the token confidence scores on a graphic user interface for labeling and dictionary enhancement.
Systems and methods are provided for natural language processing using neural network models and natural language virtual assistants. The system and method include receiving a natural language phrase including a word sequence, computing corresponding error probabilities that the words are errors, and, for a word with a corresponding error probability above a threshold, computing a replacement phrase with a low error probability; the response provided by the virtual assistant depends on the replacement phrase.
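As a rough illustration of the threshold-and-replace flow described above, a minimal Python sketch follows; the threshold value, function names, and scoring callables are invented for illustration and are not taken from the patent.

    from typing import Callable, List

    ERROR_THRESHOLD = 0.5  # assumed value; the abstract does not fix one

    def correct_phrase(words: List[str],
                       error_probs: List[float],
                       candidates: List[List[str]],
                       phrase_error_prob: Callable[[List[str]], float]) -> List[str]:
        # Keep the phrase unchanged if no word is a likely error.
        if not any(p > ERROR_THRESHOLD for p in error_probs):
            return words
        # Otherwise pick the candidate replacement phrase with the lowest
        # error probability; the assistant's response depends on this phrase.
        return min(candidates, key=phrase_error_prob)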
G06N 7/00 - Computing arrangements based on specific mathematical models
8.
BUILDING A NATURAL LANGUAGE UNDERSTANDING APPLICATION USING A RECEIVED ELECTRONIC RECORD CONTAINING PROGRAMMING CODE INCLUDING AN INTERPRET-BLOCK AND AN INTERPRET-STATEMENT
A method of building a natural language understanding application is provided. The method includes receiving at least one electronic record containing programming code and creating executable code from the programming code. The executable code, when executed by a processor, causes the processor to create a parse and an interpretation of a sequence of input tokens. The programming code includes an interpret-block, the interpret-block includes an interpret-statement, and the interpret-statement includes a pattern expression and an action statement.
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
H04M 3/493 - Interactive information services, e.g. directory enquiries
9.
PERFORMING SPEECH RECOGNITION USING A SET OF WORDS WITH DESCRIPTIONS IN TERMS OF COMPONENTS SMALLER THAN THE WORDS
A system and method is presented for performing dual mode speech recognition, employing a local recognition module on a mobile device and a remote recognition engine on a server device. The system accepts a spoken query from a user, and both the local recognition module and the remote recognition engine perform speech recognition operations on the query, returning a transcription and confidence score, subject to a latency cutoff time. If both sources successfully transcribe the query, then the system accepts the result having the higher confidence score. If only one source succeeds, then that result is accepted. In either case, if the remote recognition engine does succeed in transcribing the query, then a client vocabulary is updated if the remote system result includes information not present in the client vocabulary.
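A hedged Python sketch of this arbitration logic follows; recognize_local and recognize_remote are placeholder callables assumed to return a (transcription, confidence) pair or None on failure, and the latency cutoff value is an assumption.

    import concurrent.futures

    LATENCY_CUTOFF_S = 2.0  # assumed cutoff

    def dual_mode_recognize(audio, recognize_local, recognize_remote):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(recognize_local, audio),
                   pool.submit(recognize_remote, audio)]
        done, not_done = concurrent.futures.wait(futures,
                                                 timeout=LATENCY_CUTOFF_S)
        for f in not_done:
            f.cancel()  # best effort; results past the cutoff are dropped
        pool.shutdown(wait=False)
        results = [f.result() for f in done
                   if f.exception() is None and f.result() is not None]
        if not results:
            return None
        # If both sources succeed, the higher-confidence result wins.
        return max(results, key=lambda r: r[1])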
G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile telephony applications or networks
G10L 15/04 - Segmentation; Word boundary detection
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/26 - Speech-to-text systems
G10L 15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
G10L 17/06 - Decision making techniques; Pattern matching strategies
10.
DERIVING ACOUSTIC FEATURES AND LINGUISTIC FEATURES FROM RECEIVED SPEECH AUDIO
A computer-implemented method is provided. The method includes receiving speech audio of dictation associated with a user ID, deriving acoustic features from the speech audio, storing the derived acoustic features in a user profile associated with the user ID, receiving a request for acoustic features through an application programming interface (API), the request including the user ID, and sending the derived acoustic features through the API.
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 15/26 - Speech-to-text systems
G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination
G10L 25/90 - Pitch determination of speech signals
A method and system for acoustic model conditioning on non-phoneme information features for optimized automatic speech recognition is provided. The method includes using an encoder model to encode sound embedding from a known key phrase of speech and conditioning an acoustic model with the sound embedding to optimize its performance in inferring the probabilities of phonemes in the speech. The sound embedding can comprise non-phoneme information related to the key phrase and the following utterance. Further, the encoder model and the acoustic model can be neural networks that are jointly trained with audio data.
G10L 15/16 - Speech classification or search using artificial neural networks
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the analysis technique, using neural networks
An automated answering system and method are disclosed for use in providing automated customer service. The automated answering system uses generative artificial intelligence to aid in forming a knowledgebase of information regarding a merchant's business that is used in answering the customer queries. The automated answering system of the present technology also uses generative artificial intelligence to aid in formulating a response to queries using the formed knowledgebase.
G06Q 30/016 - Providing customer assistance, e.g. assisting a customer in a commercial venue or via an after-sales support service
G06N 5/046 - Forward inferencing; Production systems
14.
METHOD AND SYSTEM FOR CONVERSATION TRANSCRIPTION WITH METADATA
Methods and systems for enabling an efficient review of meeting content via a metadata-enriched, speaker-attributed and multiuser-editable transcript are disclosed. By incorporating speaker diarization and other metadata, the system can provide a structured and effective way to review and/or edit the transcript by one or more editors. One type of metadata can be image or video data to represent the meeting content. Furthermore, the present subject matter utilizes a multimodal diarization model to identify and label different speakers. The system can synchronize various sources of data, e.g., audio channel data, voice feature vectors, acoustic beamforming, image identification, and extrinsic data, to implement speaker diarization.
G06F 40/166 - Editing, e.g. insertion or deletion
G06F 40/284 - Lexical analysis, e.g. tokenisation or co-occurrence
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition units
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
A computer system ingests a catalog of a plurality of items. The catalog is specific to a particular domain and includes names for individual items of the plurality of items. One or more attributes are respectively associated with the individual items of the plurality of items. A specialist grammar specific to the particular domain of the catalog is obtained and used to interpret natural language input related to the catalog based on the names for the individual items of the plurality of items and their associated one or more attributes.
Various approaches relate to user-defined filtering, in media playing devices, of undesirable content represented in stored and real-time content from content providers. For example, video, image, and/or audio data can be analyzed to identify and classify content included in the data using various classification models and object and text recognition approaches. Thereafter, the identification and classification can be used to control presentation of and/or access to the content and/or portions of the content. For example, based on the classification, portions of the content can be modified (e.g., replaced, removed, degraded, etc.) using one or more techniques (e.g., media replacement, media removal, media degradation, etc.) and then presented.
G06V 20/40 - Scenes; Scene-specific elements in video content
H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to scene graphs of the encoded video stream
H04N 21/466 - Learning process for intelligent management, e.g. learning user preferences for recommending movies
An audio recognition system provides for delivery of promotional content to its user. A user interface device, locally or with the assistance of a network-connected server, performs recognition of audio in response to queries. Recognition can be through a method such as processing features extracted from the audio. Audio can comprise recorded music, singing or humming, instrumental music, vocal music, spoken voice, or other recognizable types of audio. Campaign managers provide promotional content for delivery in response to audio recognized in queries.
A machine learning system for a digital assistant is described, together with a method of training such a system. The machine learning system is based on an encoder-decoder sequence-to-sequence neural network architecture trained to map input sequence data to output sequence data, where the input sequence data relates to an initial query and the output sequence data represents a canonical data representation of the query. The method of training involves generating a training dataset for the machine learning system by clustering vector representations of the query data samples to generate canonical-query/original-query pairs.
Systems for automatic speech recognition and/or natural language understanding automatically learn new words by finding subsequences of phonemes that, if they were a new word, would enable a successful tokenization of a phoneme sequence. Systems can learn alternate pronunciations of words by finding phoneme sequences with a small edit distance to existing pronunciations. Systems can learn the part of speech of words by finding part-of-speech variations that would enable parses by syntactic grammars. Systems can learn what types of entities a word describes by finding sentences that could be parsed by a semantic grammar but for the words not being on an entity list.
A system and method for masking an identity of a speaker of natural language speech, such as speech clips to be labeled by humans in a system generating voice transcriptions for training an automatic speech recognition model. The natural language speech is morphed prior to being presented to the human for labeling. In one embodiment, morphing comprises pitch shifting the speech randomly either up or down, then frequency shifting the speech, then pitch shifting the speech in a direction opposite the first pitch shift. Labeling the morphed speech comprises at least one or more of transcribing the morphed speech, identifying a gender of the speaker, identifying an accent of the speaker, and identifying a noise type of the morphed speech.
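One plausible reading of the three-step morphing, sketched in Python with librosa; the shift magnitudes are invented, and the frequency shift is approximated by rolling STFT bins, which may differ from the patented method.

    import random
    import numpy as np
    import librosa

    def morph_for_labeling(y: np.ndarray, sr: int) -> np.ndarray:
        steps = random.choice([-3, 3])  # pitch shift randomly up or down
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        spec = librosa.stft(y)
        spec = np.roll(spec, shift=4, axis=0)  # crude frequency shift
        spec[:4, :] = 0  # clear bins that wrapped around
        y = librosa.istft(spec)
        # Pitch shift in the direction opposite the first shift.
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=-steps)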
G06F 40/58 - Use of machine translation, e.g. for multilingual searches, for providing client devices with server-side translation, or for real-time translation
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/18 - Speech classification or search using natural language modelling
G10L 19/125 - Pitch excitation, e.g. code-excited linear prediction with pitch-synchronous innovation [PSI-CELP]
A server supports multiple virtual assistants. It receives requests that include wake phrase audio and an identification of the source of the request, such as a virtual assistant device. Based on the identification, the server searches a database for a wake phrase detector appropriate for the identified source. The server then applies the wake phrase detector to the received wake phrase audio. If the wake phrase audio triggers the wake phrase detector, the server provides an appropriate response to the source.
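The per-source dispatch can be pictured with a short Python sketch; DETECTOR_DB and the detector's triggers() method are illustrative placeholders, not the patent's interfaces.

    DETECTOR_DB = {}  # source_id -> wake phrase detector, populated elsewhere

    def handle_request(source_id, wake_phrase_audio):
        detector = DETECTOR_DB.get(source_id)
        if detector is None:
            return "error: no wake phrase detector for this source"
        if detector.triggers(wake_phrase_audio):
            return "ok: wake phrase detected, processing request"
        return "ignored: wake phrase not detected"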
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
The technology disclosed relates to natural language understanding-based search engines, ranking sponsored search results and simulated ranking of sponsored search results. Tools and methods describe how to simulate the ranking of sponsored search results. The tools further identify instances of user queries within the scope of trigger patterns, optionally providing examples both of user queries for which a sponsored search result is likely to be displayed and examples for which the sponsored search result will not rank highly enough to be displayed, at least on the first page of search results.
[Object] Technology is provided to enable a mobile terminal to function as a digital assistant even when the mobile terminal is in a state where it cannot communicate with a server apparatus. [Solution] When a user terminal 200 receives a query A from a user, user terminal 200 sends query A to a server 100. Server 100 interprets the meaning of query A using a grammar A. Server 100 obtains a response to query A based on the meaning of query A and sends the response to user terminal 200. Server 100 further sends grammar A to user terminal 200. That is, server 100 sends to user terminal 200 a grammar used to interpret the query received from user terminal 200.
A user specifies a natural language command to a device. Software on the device generates contextual metadata about the user interface of the device, such as data about all visible elements of the user interface, and sends the contextual metadata along with the natural language command to a natural language understanding engine. The natural language understanding engine parses the natural language query using a stored grammar (e.g., a grammar provided by a maker of the device) and as a result of the parsing identifies information about the command (e.g., the user interface elements referenced by the command) and provides that information to the device. The device uses that provided information to respond to the command.
Methods and systems for enabling an efficient review of meeting content via a metadata-enriched, speaker-attributed transcript are disclosed. By incorporating speaker diarization and other metadata, the system can provide a structured and effective way to review and/or edit the transcript. One type of metadata can be image or video data to represent the meeting content. Furthermore, the present subject matter utilizes a multimodal diarization model to identify and label different speakers. The system can synchronize various sources of data, e.g., audio channel data, voice feature vectors, acoustic beamforming, image identification, and extrinsic data, to implement speaker diarization.
G06F 40/166 - Editing, e.g. insertion or deletion
G06F 40/284 - Lexical analysis, e.g. tokenisation or co-occurrence
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition units
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
A data processing system includes a queue manager receiving data processing requests and determining a queue depth representing the number of pending requests. A load supervisor assigns a service level to each request based on the queue depth when the request is at the head of the queue. The system offers two service levels, with the second level requiring fewer computing resources than the first. This dynamic management system optimizes resource allocation by adjusting service levels based on the workload, ensuring efficient processing of data requests.
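A minimal Python sketch of the queue-depth rule, assuming an invented depth threshold; service level 2 stands for the lower-resource path.

    from collections import deque

    HIGH_LOAD_DEPTH = 10  # assumed threshold

    class QueueManager:
        def __init__(self):
            self._queue = deque()

        def enqueue(self, request):
            self._queue.append(request)

        def next_request(self):
            request = self._queue.popleft()
            depth = len(self._queue)  # depth when the request reaches the head
            # Level 2 requires fewer computing resources than level 1.
            service_level = 2 if depth >= HIGH_LOAD_DEPTH else 1
            return request, service_level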
As audio (1) is input to an extension of a browser, the extension transmits the audio (1) to a language processing server. A speech recognition unit obtains a text (1) corresponding to the audio (1), and transmits the text (1) to a natural language understanding unit. In the natural language understanding unit, an information processing unit identifies a URL (1) corresponding to the text (1), and transmits the URL (1) to the browser. The extension passes the URL (1) to a browsing function. The browsing function uses the URL (1) to access a web server. The web server transmits a web page (1) corresponding to the URL (1) to the browser. The browsing function shows a screen corresponding to the web page (1) on a display.
G06F 16/955 - Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
A method for processing an audio signal involves receiving sound waves at a microphone, converting them into a first audio signal, and extracting a second audio signal from an electromagnetic signal received at a receiver. The first audio signal is correlated with the second audio signal to calculate a correlation value. If the correlation value exceeds a threshold, the first audio signal is processed using the second audio signal to reduce unwanted sound contributions, resulting in a processed audio signal. Further processing is then performed on the processed audio signal to determine a characteristic of the desired sound.
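A hedged Python sketch of the correlation gate; it assumes equal-length, time-aligned signals, and the threshold and least-squares subtraction are illustrative stand-ins for the unspecified processing step.

    import numpy as np

    CORRELATION_THRESHOLD = 0.3  # assumed

    def reduce_unwanted(mic: np.ndarray, broadcast: np.ndarray) -> np.ndarray:
        # Correlate the microphone audio with the audio extracted from the
        # electromagnetic signal.
        corr = np.corrcoef(mic, broadcast)[0, 1]
        if corr <= CORRELATION_THRESHOLD:
            return mic  # signals unrelated; leave the audio untouched
        # Subtract a scaled copy to reduce the unwanted contribution.
        scale = np.dot(mic, broadcast) / np.dot(broadcast, broadcast)
        return mic - scale * broadcast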
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 21/0316 - Enhancement of speech intelligibility, e.g. noise reduction or echo cancellation, by changing the amplitude
G10L 25/06 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination
A method includes recognizing words comprised by a first utterance; interpreting the recognized words according to a grammar comprised by a domain; from the interpreting of the recognized words, determining a timeout period for the first utterance based on the domain of the first utterance; detecting end of voice activity in the first utterance; executing an instruction following an amount of time after detecting end of voice activity of the first utterance in response to the amount of time exceeding the timeout period, the executed instruction based at least in part on interpreting the recognized words.
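A minimal Python sketch of the domain-dependent timeout; the domain names and timeout values are invented for illustration.

    import time

    DOMAIN_TIMEOUTS_S = {"navigation": 0.8, "dictation": 1.5}  # assumed
    DEFAULT_TIMEOUT_S = 1.0

    def should_execute(domain: str, end_of_voice_activity: float) -> bool:
        timeout = DOMAIN_TIMEOUTS_S.get(domain, DEFAULT_TIMEOUT_S)
        # Execute once the silence after end of voice activity exceeds the
        # timeout chosen for the utterance's domain.
        return time.monotonic() - end_of_voice_activity > timeout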
A voice interface recognizes spoken utterances from multiple users. It responds to the utterances in ways such as modifying the attributes of instances of items. The voice interface computes a voice vector for each utterance and associates it with the item instance that is modified. For following utterances with a closely matching voice vector, the voice interface modifies the same instance. For following utterances with a voice vector that is not a close match to one stored for any item instance, the voice interface modifies a different item instance.
The technology disclosed relates to natural language understanding-based search engines, ranking sponsored search results and simulated ranking of sponsored search results. Tools and methods describe how to simulate the ranking of sponsored search results. The tools further identify instances of user queries within the scope of trigger patterns, optionally providing examples both of user queries for which a sponsored search result is likely to be displayed and examples for which the sponsored search result will not rank highly enough to be displayed, at least on the first page of search results.
Natural language grammars interpret expressions at the conversational human-machine interfaces of devices. Under conditions favoring engagement, as specified in a unit of conversational code, the device initiates a discussion using one or more of TTS, images, video, audio, and animation depending on the device capabilities of screen and audio output. Conversational code units specify conditions based on conversation state, mood, and privacy. Grammars provide intents that cause calls to system functions. Units can provide scripts for guiding the conversation. The device, or supporting server system, can provide feedback to creators of the conversational code units for analysis and machine learning.
A system and method of real-time feedback confirmation to solicit a virtual assistant response from an evolving semantic state of at least a portion of an utterance. A user accesses a virtual assistant on an electronic device having the system and/or method configured to capture a command, a question, and/or a fulfillment request from audio, such as the speech emitted by the speaking user. The speech may be intercepted by a speech engine configured to transcribe the speech into text that is matched against a fragment pattern's regular expression to generate a fragment, and/or the speech may be processed with a machine learning model to identify fragments. The fragments are identified by a domain handler configured to update a data structure of the current semantic state of the utterance in real time on an interface of an electronic device.
Techniques are disclosed for automatically generating sentences that a user can say to invoke a set of defined actions performed by a virtual assistant. A sentence is received and keywords are extracted from the sentence. Based on the keywords, additional sentences are generated. A classifier model is applied to the generated sentences to determine a sentence that satisfies a threshold. If a sentence satisfies the threshold, an intent associated with the classifier model can be invoked. If the sentences fail to satisfy the classifier model, the virtual assistant can attempt to interpret the received sentence according to the most likely intent by invoking a sentence generation model fine-tuned for a particular domain, generating additional sentences with a high probability of having the same intent, and fulfilling the specific action defined by the intent.
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
A neural TTS system is trained to generate key acoustic frames at variable rates while omitting other frames. The frame skipping depends on the acoustic features to be generated for the input text. The TTS system can interpolate frames between the key frames at a target rate for a vocoder to synthesize audio samples.
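The interpolation step can be pictured with a short Python sketch; the frame layout (a (K, D) matrix of key frames at strictly increasing indices) is an assumption.

    import numpy as np

    def interpolate_frames(key_frames: np.ndarray, key_indices: np.ndarray,
                           total_frames: int) -> np.ndarray:
        # Fill in the omitted frames by linear interpolation so the vocoder
        # receives acoustic frames at its fixed target rate.
        out = np.empty((total_frames, key_frames.shape[1]))
        positions = np.arange(total_frames)
        for d in range(key_frames.shape[1]):
            out[:, d] = np.interp(positions, key_indices, key_frames[:, d])
        return out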
A system detects a period of non-voice activity and compares its duration to a cutoff period. The system adapts the cutoff period based on parsing previously recognized speech of a user, stored on the user's device or on the system that detects the voice activity, to determine, according to a model such as a machine-learned model, the probability that the speech recognized so far is a prefix to a longer complete utterance. The cutoff period is longer when a parse of previously recognized speech, based on the user profile, has a high probability of being a prefix of a longer utterance.
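A minimal Python sketch of the adaptive cutoff, where prefix_probability is a placeholder for the model (e.g., machine-learned) and the cutoff and threshold values are assumptions.

    BASE_CUTOFF_S = 0.7        # assumed default cutoff
    EXTENDED_CUTOFF_S = 1.8    # assumed longer cutoff for likely prefixes
    PREFIX_PROB_THRESHOLD = 0.6

    def cutoff_for(partial_transcript: str, prefix_probability) -> float:
        # Probability that the speech recognized so far is a prefix of a
        # longer complete utterance.
        p = prefix_probability(partial_transcript)
        return EXTENDED_CUTOFF_S if p > PREFIX_PROB_THRESHOLD else BASE_CUTOFF_S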
A voice morphing model can transform diverse voices to one or a small number of target voices. An acoustic model can be trained for high accuracy on the target voices. Speech recognition on diverse voices can be performed by morphing the audio to a target voice and then performing recognition on the morphed audio. The morphing model and an acoustic model for speech recognition can be trained separately or jointly.
A source of requests for speech recognition can pass audio and a voiceprint with requests. Speech recognition can run with improved accuracy by biasing an acoustic model for the voice in the audio using the voiceprint. The audio can be used to calculate a new voiceprint, which can be used to update the voiceprint included with the audio. The updated voiceprint can be sent back to the source and then used with future speech recognition requests.
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
40.
MESSAGE PROCESSING METHOD, INFORMATION PROCESSING APPARATUS, AND PROGRAM
[Object] To provide a technique for more accurate interpretation of a message inputted by a user.
[Solving Means] An information processing server 300 obtains a first message from a user in a thread 001, has a context of the first message stored in a context database 500 in association with the thread 001, obtains a second message from the user in the thread 001, and provides the second message to a conversation server 400 together with the context of the first message.
H04L 51/04 - Real-time or near real-time messaging, e.g. instant messaging [IM]
H04L 51/02 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, using automatic reactions or user delegation, e.g. automatic replies or messages generated by a conversational agent
H04L 51/216 - Handling conversation history, e.g. grouping of messages into sessions or threads
Aspects include methods, systems, and computer-program products providing virtual assistant domain functionality. A natural language query including one or more words is received. A collection of natural language modules is accessed. The natural language modules in the collection are configured to process sets of natural language queries. A natural language module, from the collection of natural language modules, is identified to interpret the natural language query. An interpretation of the natural language query is computed using the identified natural language module. A response to the natural language query is returned using the computed interpretation.
G06F 40/40 - Processing or translation of natural language
G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile telephony applications or networks
G06Q 30/0283 - Price estimation or determination
G06Q 20/10 - Payment architectures specially adapted for electronic funds transfer systems; Payment architectures specially adapted for home banking systems
Actions are authorized by computing a confidence score that exceeds a threshold. The confidence score is based on a match between metadata about requests and fields in corresponding database records. The confidence score weights matches by the dependability of the metadata for authentication. The confidence score is further based on the closeness of a sample of speech audio to a stored voiceprint. Additional identification may be required for authorization. The confidence score requirement may be relaxed based on identification in a buffer of recent action requests.
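A hedged Python sketch of the weighted scoring; the field weights (each field's dependability for authentication) and the threshold are invented for illustration.

    FIELD_WEIGHTS = {"device_id": 0.9, "phone_number": 0.6, "zip_code": 0.2}
    AUTH_THRESHOLD = 1.2  # assumed; may be relaxed per the abstract

    def authorize(request_metadata: dict, record: dict,
                  voiceprint_similarity: float) -> bool:
        # Start from the closeness of the speech sample to the stored voiceprint.
        score = voiceprint_similarity
        for field, weight in FIELD_WEIGHTS.items():
            value = request_metadata.get(field)
            if value is not None and value == record.get(field):
                score += weight  # match weighted by field dependability
        return score > AUTH_THRESHOLD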
Support for natural language expressions is provided by the use of semantic grammars that describe the structure of expressions in that grammar and that construct the meaning of a corresponding natural language expression. A semantic grammar extension mechanism is provided, which allows one semantic grammar to be used in the place of another semantic grammar. This enriches the expressivity of semantic grammars in a simple, natural, and decoupled manner.
A system and method invoke virtual assistant action, which may comprise an argument. From audio, a probability of an intent is inferred. A probability of a domain and a plurality of variable values may also be inferred. Invoking the action is in response to the intent probability exceeding a threshold. Invoking the action may also be in response to the domain probability exceeding a threshold, a variable value probability exceeding a threshold, detecting an end of utterance, and a specific amount of time having elapsed. The intent probability may increase when the audio includes speech of words with the same meaning in multiple natural languages. Invoking the action may also be conditional on the variable value exceeding its threshold within a certain period of time of the intent probability exceeding its threshold.
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 15/16 - Speech classification or search using artificial neural networks
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
G10L 15/197 - Probabilistic grammars, e.g. word n-grams
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
Custom acoustic models can be configured by developers by providing audio files with custom recordings. The custom acoustic model is trained by tuning a baseline model using the audio files. Audio files may contain custom noise to apply to clean speech for training. The custom acoustic model is provided as an alternative to a standard acoustic model. A speech recognition system can select an acoustic model for use upon receiving metadata about the device conditions or type. Speech recognition is performed on speech audio using one or more acoustic models. The result can be provided to developers through the user interface, and an error rate can be computed and also provided.
G10L 15/18 - Speech classification or search using natural language modelling
46.
Building a natural language understanding application using a received electronic record containing programming code including an interpret-block, an interpret-statement, a pattern expression and an action statement
A method of building a natural language understanding application is provided. The method includes receiving at least one electronic record containing programming code and creating executable code from the programming code. The executable code, when executed by a processor, causes the processor to create a parse and an interpretation of a sequence of input tokens. The programming code includes an interpret-block, the interpret-block includes an interpret-statement, and the interpret-statement includes a pattern expression and an action statement.
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/18 - Speech classification or search using natural language modelling
H04M 3/493 - Interactive information services, e.g. directory enquiries
A neural speech-to-meaning system is trained on speech audio expressing specific intents. The system receives speech audio and produces indications of when the speech in the audio matches the intent. Intents may include variables that can have a large range of values, such as the names of places. The neural speech-to-meaning system simultaneously recognizes enumerated values of variables and general intents. Recognized variable values can serve as arguments to API requests made in response to recognized intents. Accordingly, neural speech-to-meaning supports voice virtual assistants that serve users based on API hits.
Methods and systems for pre-wakeword speech processing are disclosed. Speech audio, comprising command speech spoken before a wakeword, may be stored in a buffer in oldest to newest order. Upon detection of the wakeword, reverse acoustic models and language models, such as reverse automatic speech recognition (R-ASR) can be applied to the buffered audio, in newest to oldest order, starting from before the wakeword. The speech is converted into a sequence of words. Natural language grammar models, such as natural language understanding (NLU), can be applied to match the sequence of words to a complete command, the complete command being associated with invoking a computer operation.
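The buffer flow can be pictured with a short Python sketch; the buffer capacity is invented, and reverse_asr and nlu are placeholder callables standing in for the reverse acoustic/language models and the grammar matching.

    from collections import deque

    BUFFER_FRAMES = 500  # assumed capacity of pre-wakeword audio
    audio_buffer = deque(maxlen=BUFFER_FRAMES)  # oldest-to-newest order

    def on_audio_frame(frame, wakeword_detected, reverse_asr, nlu):
        audio_buffer.append(frame)
        if not wakeword_detected:
            return None
        # Recognize words newest-to-oldest, starting from before the wakeword,
        # then restore word order for matching against complete commands.
        words_reversed = reverse_asr(reversed(audio_buffer))
        return nlu(list(reversed(words_reversed)))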
The application provides an apparatus, platform, method, and medium for intention importance inference. The apparatus includes an interface configured to receive user-related information; and a processor coupled to the interface and configured to: extract data related to different aspects of a user from the user-related information; generate a plurality of intention probes based on the data related to different aspects of the user, each intention probe comprising an intention and associated data items; infer an importance of each intention probe by calculating a score for each associated data item of the intention probe based on the data related to different aspects of the user; and provide information associated with the intention probe with the highest importance.
Support for natural language expressions is provided by the use of semantic grammars that describe the structure of expressions in that grammar and that construct the meaning of a corresponding natural language expression. A semantic grammar extension mechanism is provided, which allows one semantic grammar to be used in the place of another semantic grammar. This enriches the expressivity of semantic grammars in a simple, natural, and decoupled manner.
Various approaches relate to user-defined filtering, in media playing devices, of undesirable content represented in stored and real-time content from content providers. For example, video, image, and/or audio data can be analyzed to identify and classify content included in the data using various classification models and object and text recognition approaches. Thereafter, the identification and classification can be used to control presentation of and/or access to the content and/or portions of the content. For example, based on the classification, portions of the content can be modified (e.g., replaced, removed, degraded, etc.) using one or more techniques (e.g., media replacement, media removal, media degradation, etc.) and then presented.
G06V 20/40 - Scenes; Scene-specific elements in video content
H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to scene graphs of the encoded video stream
H04N 21/466 - Learning process for intelligent management, e.g. learning user preferences for recommending movies
52.
Method and system for acoustic model conditioning on non-phoneme information features
A method and system for acoustic model conditioning on non-phoneme information features for optimized automatic speech recognition is provided. The method includes using an encoder model to encode sound embedding from a known key phrase of speech and conditioning an acoustic model with the sound embedding to optimize its performance in inferring the probabilities of phonemes in the speech. The sound embedding can comprise non-phoneme information related to the key phrase and the following utterance. Further, the encoder model and the acoustic model can be neural networks that are jointly trained with audio data.
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the analysis technique, using neural networks
53.
SYSTEMS AND METHODS FOR GENERATING AND USING SHARED NATURAL LANGUAGE LIBRARIES
Systems and methods for searching databases by sound data input are provided herein. A service provider may need to make its databases searchable through search technology but may lack the resources to implement such technology. The search technology may allow for search queries using sound data input. The technology described herein addresses the service provider's need by providing a search technology that furnishes search results in a fast, accurate manner. In further embodiments, systems and methods to monetize those search results are also described herein.
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/183 - Speech classification or search using natural language modelling according to context, e.g. language models
G06F 16/174 - Redundancy elimination performed by the file system
54.
SYSTEM AND METHOD FOR VOICE UNIDENTIFIABLE MORPHING
A system and a method are disclosed for a machine learned audio morpher that is trained such that the voice characteristics of a user spoken phrase are replaced with those of a target speaker, which removes and/or reduces the user identifiable information for the spoken phrase. Training can be performed by a user and a target speaker speaking the same or similar phrases and training the audio morpher to minimize the differences between the target speaker phrase and a morphed user phrase.
A computer system ingests a catalog of a plurality of items. The catalog is specific to a particular domain and includes names for individual items of the plurality of items. One or more attributes are respectively associated with the individual items of the plurality of items. A specialist grammar specific to the particular domain of the catalog is obtained, and programming language code to interpret natural language input related to the catalog is generated using the specialist grammar, the names for the individual items of the plurality of items, and their associated one or more attributes.
A voice-controlled device includes a microphone to receive a set of sound waves that includes speech uttered by a user and other sound, and to output a first audio signal that includes a contribution from the speech uttered by the user and a contribution from the other sound. The device also includes a receiver to receive an electromagnetic signal and to output a second audio signal obtained from the electromagnetic signal. An audio pre-processor of the device processes the first audio signal using the second audio signal to reduce the contribution from the other sound in a processed audio signal. The voice-controlled device then provides the processed audio signal to a speech recognition module to determine a voice command issued by the user.
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 21/0316 - Enhancement of speech intelligibility, e.g. noise reduction or echo cancellation, by changing the amplitude
G10L 25/06 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination
Methods and systems for correction of a likely erroneous word in a speech transcription are disclosed. By evaluating token confidence scores of individual words or phrases, the automatic speech recognition system can replace a low-confidence score word with a substitute word or phrase. Among various approaches, neural network models can be used to generate individual confidence scores. Such word substitution can enable the speech recognition system to automatically detect and correct likely errors in transcription. Furthermore, the system can indicate the token confidence scores on a graphic user interface for labeling and dictionary enhancement.
A video conferencing system, such as one implemented with a cloud server, receives audio streams from a plurality of endpoints. The system uses automatic speech recognition to transcribe speech in the audio streams. The system multiplexes the transcriptions into individual caption streams and sends them to the endpoints, but the caption stream to each endpoint omits the transcription of audio from the endpoint. Some systems allow muting of audio through an indication to the system. The system then omits sending the muted audio to other endpoints and also omits sending a transcription of the muted audio to other endpoints.
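A minimal Python sketch of the fan-out rule; the endpoint objects and their send_caption() method are illustrative placeholders.

    def fan_out_captions(captions, endpoints, muted_endpoint_ids):
        # captions: iterable of (source_endpoint_id, caption_text) pairs.
        for source_id, text in captions:
            if source_id in muted_endpoint_ids:
                continue  # muted: omit both the audio and its transcription
            for endpoint in endpoints:
                if endpoint.endpoint_id != source_id:
                    endpoint.send_caption(text)  # each endpoint skips its own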
A method and an apparatus for processing an intelligent voice query. A voice query input is received from a user. Automatic speech recognition and natural language understanding generate structured query data. The structured query data is modified based on an input adaptation rule to obtain modified structured query data appropriate for a content providing server, which provides a query result output corresponding to the modified structured query data. Input adaptation rules may comprise rule sets based on behavior patterns of the user and/or business recommendations. The query result output can be used for natural language generation, which may have similar adaptation rules for output.
A method of assisting a user. The method includes obtaining a plurality of rules having condition components and action components, the action components specifying conversation schemas; detecting, by a sensor, a fact related to an environment of the user; identifying a rule, of the plurality of rules, having a condition component that is satisfied by the detected fact; initiating a conversation with the user according to a conversation schema of the action component of the identified rule; and performing an action in response to a positive statement by the user.
A system for performing automated speech recognition (ASR) on audio data includes a queue manager to receive a request to perform ASR on audio data, add the request to a queue of incoming requests, and determine a queue depth representing a number of requests in the queue at a given time. The system also includes a load supervisor to receive the request and the queue depth from the queue manager and assign a service level for the request based on the queue depth. In addition, the system includes a speech-to-text converter to receive the assigned service level for the request from the load supervisor, select an ASR model for the request based on the received service level, receive the audio data associated with the request, and perform ASR on the audio data using the selected ASR model.
A method and system for controlling a GUI on a user's network-connected device, the control being provided by a telephone call between the user and a speech recognition and speech synthesis system. An example of a restaurant ordering system is provided. The user calls a phone number and is guided through a verbal ordering process that includes one or more of: adding an item, deleting an item, changing quantities, changing sizes, and changing details of an item. The user's choices are added to a display so that a current status of the order is visible to the user. The GUI is updated as changes are made to the order. The GUI can also request additional information, upsell items, and show menus. The GUI aids the user in confirming that the order is correct. The system provides the final order to a restaurant for fulfillment.
Methods and systems for intuitive spatial audio rendering with improved intelligibility are disclosed. By establishing a virtual association between an audio source and a location in the listener's virtual audio space, a spatial audio rendering system can generate spatial audio signals that create a natural and immersive audio field for a listener. The system can receive the virtual location of the source as a parameter and map the source audio signal to a source-specific multi-channel audio signal. In addition, the spatial audio rendering system can be interactive and dynamically modify the rendering of the spatial audio in response to a user's active control or tracked movement.
A method and system for implementing a speech-enabled interface of a host device via an electronic mobile device in a network are provided. The method includes establishing a communication session between the host device and the mobile device via a session service provider. According to some embodiments, a barcode can be adopted to enable the pairing of the host device and the mobile device. Furthermore, the present method and system employ the voice interface in conjunction with speech recognition systems and natural language processing to interpret voice input for the host device, which can be used to perform one or more actions related to the host device.
A system and a method are disclosed that enable sidebar conversations between two or more attendees that are participating in a primary or main meeting. The sidebar conversation occurs in conjunction or concurrently with the primary meeting. A first attendee provides commands to indicate a desire to initiate a sidebar conversation and information about a targeted attendee. The commands are analyzed to determine if a trigger phrase is included. The commands are analyzed to determine if there is an identification of a second (targeted) attendee, who is currently participating in the main meeting. If the second attendee is available, then the sidebar conversation is initiated. Additional attendees can be added to the sidebar conversation. Additional independent and simultaneous sidebar conversations can be initiated (by attendees currently participating in the active sidebar conversation), thereby allowing one attendee to conduct multiple simultaneous sidebar conversations while being able to switch between them.
H04L 65/403 - Arrangements for multi-party communication, e.g. for conferences
H04L 65/1069 - Session establishment or termination
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination, for processing of video signals
A user specifies a natural language command to a device. Software on the device generates contextual metadata about the user interface of the device, such as data about all visible elements of the user interface, and sends the contextual metadata along with the natural language command to a natural language understanding engine. The natural language understanding engine parses the natural language query using a stored grammar (e.g., a grammar provided by a maker of the device) and as a result of the parsing identifies information about the command (e.g., the user interface elements referenced by the command) and provides that information to the device. The device uses that provided information to respond to the command.
As audio (1) is input to an extension of a browser, the extension transmits the audio (1) to a language processing server. A speech recognition unit obtains a text (1) corresponding to the audio (1), and transmits the text (1) to a natural language understanding unit. In the natural language understanding unit, an information processing unit identifies a URL (1) corresponding to the text (1), and transmits the URL (1) to the browser. The extension passes the URL (1) to a browsing function. The browsing function uses the URL (1) to access a web server. The web server transmits a web page (1) corresponding to the URL (1) to the browser. The browsing function shows a screen corresponding to the web page (1) on a display.
G06F 16/955 - Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
G06F 40/40 - Processing or translation of natural language
G10L 15/26 - Speech-to-text systems
68.
API FOR SERVICE PROVIDER FULFILLMENT OF DATA PRIVACY REQUESTS
A system and method are disclosed for fulfilling GDPR and other privacy requests in a client device system as well as a downstream service provider with which the client device system partners. In examples, the downstream service provider may be a voice assistant service provider providing voice recognition and language understanding capabilities to an upstream client device system.
09 - Scientific and electrical apparatus and instruments
Goods and services
Recorded computer software for spotting wake words; Recorded computer software for recognizing speech, interpreting natural language, and providing virtual assistant functions; Downloadable computer software development kits (SDKs) for developing speech recognition, natural language understanding, and virtual assistant software; Recorded computer software for controlling speech recognition, natural language understanding, and virtual assistant cloud processing; Recorded computer software for performing text-to-speech voice audio synthesis; Downloadable electronic data files featuring neural network parameter sets for synthesizing text-to-speech voices; Downloadable electronic data files featuring neural network parameter sets for spotting wake words in audio; Recorded computer software for operating a virtual assistant device for hotels and restaurants; Recorded computer software for providing a virtual assistant using artificial intelligence technology for hotels and restaurants to make customer bookings and reservations, and answer other customer queries; Preinstalled software for operating a virtual assistant device for hotels and restaurants sold as a component of virtual assistant devices for hotels and restaurants; Recorded computer software for understanding speech for use with voice ordering kiosks, drive through ordering systems, and retail ordering systems; Recorded computer software for understanding speech for use with voice reservation kiosks; Recorded computer software for understanding speech for use with smart home devices; Recorded computer software for understanding speech for use with voice enabled robots
70.
Wake suppression for audio playing and listening devices
A system and method are disclosed for ignoring a wakeword received at a speech-enabled listening device when it is determined that the wakeword is reproduced audio from an audio-playing device. The determination can be made by detecting audio distortions, by an ignore flag sent locally between an audio-playing device and a speech-enabled device, by an ignore flag sent from a server, by comparison of received audio to played wakeword audio within an audio-playing device or a speech-enabled device, and by other means.
A system and method are disclosed capable of parsing a spoken utterance into a natural language request and a speech audio segment, where the natural language request directs the system to use the speech audio segment as a new wakeword. In response to this wakeword assignment directive, the system and method are further capable of immediately building a new wakeword spotter to activate the device upon matching the new wakeword in the input audio. Different approaches to promptly building a new wakeword spotter are described. Variations of wakeword assignment directives can make the new wakeword public or private. They can also add the new wakeword to earlier wakewords, or replace earlier wakewords.
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
A processing system detects a period of non-voice activity and compares its duration to a cutoff period. The system adapts the cutoff period based on parsing previously-recognized speech to determine, according to a model, such as a machine-learned model, the probability that the speech recognized so far is a prefix to a longer complete utterance. The cutoff period is longer when a parse of previously recognized speech has a high probability of being a prefix of a longer utterance.
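A minimal sketch of this cutoff adaptation follows, assuming a prefix-probability model is available as a callable; the two cutoff values and the 0.5 threshold are illustrative assumptions.

    # Sketch: lengthen the end-of-utterance cutoff when the recognized text
    # is probably a prefix of a longer utterance. Values are illustrative.

    BASE_CUTOFF_S = 0.7      # silence tolerated after a likely-complete utterance
    EXTENDED_CUTOFF_S = 1.5  # silence tolerated after a likely prefix

    def choose_cutoff(recognized_text, prefix_probability, threshold=0.5):
        p_prefix = prefix_probability(recognized_text)
        return EXTENDED_CUTOFF_S if p_prefix > threshold else BASE_CUTOFF_S

    def should_end_utterance(silence_s, recognized_text, prefix_probability):
        return silence_s >= choose_cutoff(recognized_text, prefix_probability)

    # Usage with a stand-in model:
    model = lambda text: 0.9 if text.endswith("to") else 0.1
    print(choose_cutoff("set a timer to", model))               # 1.5
    print(choose_cutoff("set a timer to five minutes", model))  # 0.7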
09 - Scientific and electric apparatus and instruments
42 - Scientific, technological and industrial services, research and design
Goods and services
Recorded computer software for spotting wake words; Recorded computer software for recognizing speech, interpreting natural language, and providing virtual assistant functions; Downloadable computer software development kits (SDKs) for developing speech recognition, natural language understanding, and virtual assistant software; Recorded computer software for performing text-to-speech voice audio synthesis; Downloadable electronic data files featuring neural network parameter sets for synthesizing text-to-speech voices; Downloadable electronic data files featuring neural network parameter sets for spotting wake words in audio; Recorded computer software for operating a virtual assistant device for hotels and restaurants; Recorded computer software for providing a virtual assistant using artificial intelligence technology for hotels and restaurants to make customer bookings and reservations, and answer other customer queries; Preinstalled software for operating a virtual assistant device for hotels and restaurants sold as a component of virtual assistant devices for hotels and restaurants; Recorded computer software for understanding speech for use with smart home devices; Recorded computer software for understanding speech for use with voice enabled robots; Recorded computer software for training of custom wake word spotters for virtual assistants; Recorded computer software for synthesis of text-to-speech voice audio; Platform as a service (PaaS) featuring computer software platforms for configuring virtual assistants through a web interface; Platform as a service (PaaS) featuring computer software platforms for configuring domain-specific content for virtual assistants; Providing online non-downloadable computer software for training of custom wake word spotters for virtual assistants; Providing online non-downloadable computer software for synthesis of text-to-speech voice audio; Platform as a service (PaaS) featuring computer software platforms for configuring custom text-to-speech voices
A system and method invoke a virtual assistant action, which may comprise an argument. From audio, a probability of an intent is inferred. A probability of a domain and a plurality of variable values may also be inferred. Invoking the action is in response to the intent probability exceeding a threshold. Invoking the action may also be in response to the domain probability exceeding a threshold, a variable value probability exceeding a threshold, detecting an end of utterance, and a specific amount of time having elapsed. The intent probability may increase when the audio includes speech of words with the same meaning in multiple natural languages. Invoking the action may also be conditional on the variable value probability exceeding its threshold within a certain period of time of the intent probability exceeding its threshold.
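The gating logic can be pictured as below; the threshold values, the timing window, and the function shape are assumptions for illustration, not the claimed method itself.

    # Illustrative sketch of threshold-gated invocation of an assistant action.
    import time

    INTENT_THRESHOLD = 0.9
    DOMAIN_THRESHOLD = 0.8
    VALUE_THRESHOLD = 0.8
    VALUE_WAIT_S = 2.0   # value must clear its threshold this soon after intent

    def maybe_invoke(action, p_intent, p_domain, p_value, t_intent_crossed):
        """Invoke only when each probability clears its threshold and the
        variable value did so soon enough after the intent."""
        now = time.monotonic()
        if p_intent < INTENT_THRESHOLD or p_domain < DOMAIN_THRESHOLD:
            return False
        if p_value < VALUE_THRESHOLD or now - t_intent_crossed > VALUE_WAIT_S:
            return False
        action()
        return True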
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/16 - Speech classification or search using artificial neural networks
G10L 15/18 - Speech classification or search using natural language modelling
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
G10L 15/197 - Probabilistic grammars, e.g. word n-grams
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
75.
SYSTEM AND METHOD FOR COMPUTING REGION CENTERS BY POINT CLUSTERING
A system and a method are disclosed that calculate the center of a geographic region. A set of topological/geographical points is received. A set of clusters is determined. A weight for each cluster is computed. The highest-weighted cluster is selected. The geographic region center is calculated using the selected cluster. The geographical points can include a key for each point and can be filtered by an indicated key before the center of the geographic region is calculated.
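A toy version of the computation might look as follows; clustering by coarse grid cells and weighting clusters by point count are illustrative choices, since the abstract does not fix the clustering or weighting method.

    # Toy sketch: region center from the heaviest cluster of points.
    from collections import defaultdict

    def region_center(points, cell_size=0.01, key_filter=None):
        """points: iterable of (lat, lon, key) tuples."""
        clusters = defaultdict(list)
        for lat, lon, key in points:
            if key_filter is not None and key != key_filter:
                continue
            cell = (round(lat / cell_size), round(lon / cell_size))
            clusters[cell].append((lat, lon))
        if not clusters:
            return None
        heaviest = max(clusters.values(), key=len)   # weight = point count
        n = len(heaviest)
        return (sum(p[0] for p in heaviest) / n,
                sum(p[1] for p in heaviest) / n)

    pts = [(37.770, -122.410, "cafe"), (37.771, -122.412, "cafe"),
           (40.700, -74.000, "cafe")]
    print(region_center(pts, key_filter="cafe"))   # near (37.7705, -122.411)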
A system and method are disclosed for achieving interoperability and access to a personal extension knowledge/preference database (PEKD) through interconnected voice verification systems. Devices from various different companies and systems can link to a voice verification system (VVS). Users can also enroll with the VVS so that the VVS can provide authentication of users by personal wake phrases. Thereafter, users can access their PEKD from un-owned devices by speaking their wake phrase.
Methods and systems are disclosed for automatically generating sample phrases or sentences that a user can say to invoke a set of defined actions performed by a virtual assistant. By leveraging finetuned general-purpose natural language models, the system can generate plausible and accurate utterance sentences based on extracted keywords or an input utterance sentence. Furthermore, domain-specific datasets can be used to train the pre-trained, general-purpose natural language models via unsupervised learning. These generated sentences can improve the efficiency of configuring a virtual assistant. The system can further optimize the effectiveness of a virtual assistant in understanding the user, which can enhance the user experience of communicating with it.
G06F 40/35 - Discourse or dialogue representation
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
78.
RECOMMENDATION ENGINE FOR UPSELLING IN RESTAURANT ORDERS
A computer-implemented method is provided to support a food ordering system for food items from a menu of a restaurant using natural language. Expressions made for ordering are used to recommend a food item that a user has a high probability of wanting to include in an order. The recommendation engine is trained using machine learning. Expressions are collected and parsed to identify words that might indicate food items offered by the restaurant. The words are provided to a restaurant owner, who identifies the food items on the menu with which the words are associated.
Machine-learned models take in vectors representing desired behaviors and generate voice vectors that provide the parameters for text-to-speech (TTS) synthesis. Models may be trained on behavior vectors that include user profile attributes, situational attributes, or semantic attributes. Situational attributes may include the age of people present, music that is playing, location, noise, and mood. Semantic attributes may include the presence of proper nouns, the number of modifiers, emotional charge, and domain of discourse. TTS voice parameters may apply per utterance and per word so as to enable contrastive emphasis.
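Schematically, such a mapping could be a small regression network from behavior attributes to voice parameters; the dimensions and architecture below are invented for illustration (a sketch assuming PyTorch, not the trained models described).

    # Hypothetical sketch of a behavior-vector-to-voice-vector model.
    import torch
    import torch.nn as nn

    class VoiceVectorModel(nn.Module):
        def __init__(self, behavior_dim=16, voice_dim=8):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(behavior_dim, 64), nn.ReLU(),
                nn.Linear(64, voice_dim),
            )

        def forward(self, behavior_vec):
            # behavior_vec encodes profile, situational, semantic attributes
            return self.net(behavior_vec)   # parameters handed to a TTS engine

    params = VoiceVectorModel()(torch.zeros(1, 16))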
A server supports multiple virtual assistants. It receives requests that include wake phrase audio and an identification of the source of the request, such as a virtual assistant device. Based on the identification, the server searches a database for a wake phrase detector appropriate for the identified source. The server then applies the wake phrase detector to the received wake phrase audio. If the wake phrase audio triggers the wake phrase detector, the server provides an appropriate response to the source.
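In outline, the dispatch is a keyed lookup followed by detection; the detector store and response shapes below are assumptions.

    # Minimal sketch of per-source wake phrase detector dispatch.

    def handle_request(source_id, wake_audio, detector_db, respond):
        """Apply the detector registered for the identified source."""
        detector = detector_db.get(source_id)   # e.g., a database lookup
        if detector is not None and detector(wake_audio):
            respond(source_id, "wake-acknowledged")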
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
A driver interface for use within an automobile provides responses to voice commands issued, for example, by the driver of the automobile. The interface includes a camera and a microphone for capturing image data, such as gestures, and audio data from the driver. The image and audio data are processed to extract image and linguistic features, which are then processed together to interpret and infer the meaning of the voice command.
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
G10L 15/24 - Speech recognition using non-acoustical features
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G06K 9/62 - Methods or arrangements for recognition using electronic means
G10L 15/16 - Speech classification or search using artificial neural networks
G06V 10/40 - Extraction of image or video features
G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
G06V 20/40 - Scenes; Scene-specific elements in video content
Developers can configure custom acoustic models by providing audio files with custom recordings. The custom acoustic model is trained by tuning a baseline model using the audio files. Audio files may contain custom noise to apply to clean speech for training. The custom acoustic model is provided as an alternative to a standard acoustic model. Device developers can select an acoustic model by a user interface. Speech recognition is performed on speech audio using one or more acoustic models. The result can be provided to developers through the user interface, and an error rate can be computed and also provided.
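One common way to apply custom noise to clean speech, sketched here under the assumption of fixed-SNR mixing (the abstract does not specify the mixing scheme):

    # Sketch: overlay developer-supplied noise on clean speech at a target SNR.
    import numpy as np

    def mix_noise(clean, noise, snr_db=10.0):
        noise = np.resize(noise, clean.shape)        # loop/trim to match length
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * noise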
A method of controlling an engagement state of an agent during a human-machine dialog is provided. The method can include: receiving a spoken request that is a conditional locking request, wherein the conditional locking request uses a natural language expression to explicitly specify a locking condition, which is a predicate; storing the predicate in a format that can be evaluated when needed by the agent; entering a conditionally locked state in response to the conditional locking request; in the conditionally locked state, receiving a multiplicity of requests without need for a wakeup indicator; and, for a request from the multiplicity of requests, evaluating the predicate upon receiving the request and processing the request if the predicate is true.
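The stored predicate can be pictured as a callable evaluated per request; the sketch below is an illustrative reading of the claim, with an invented environment dictionary.

    # Illustrative sketch of a conditionally locked engagement state.

    class Agent:
        def __init__(self):
            self.lock_predicate = None    # None => normal wake-word behavior

        def conditional_lock(self, predicate):
            # e.g., "listen while the oven is on":
            #   agent.conditional_lock(lambda env: env["oven_on"])
            self.lock_predicate = predicate

        def on_request(self, request, environment, process):
            if self.lock_predicate is not None and self.lock_predicate(environment):
                process(request)          # no wakeup indicator needed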
Methods and systems for enabling an efficient review of meeting content via a metadata-enriched, speaker-attributed and multiuser-editable transcript are disclosed. By incorporating speaker diarization and other metadata, the system can provide a structured and effective way to review and/or edit the transcript by one or more editors. One type of metadata can be image or video data to represent the meeting content. Furthermore, the present subject matter utilizes a multimodal diarization model to identify and label different speakers. The system can synchronize various sources of data, e.g., audio channel data, voice feature vectors, acoustic beamforming, image identification, and extrinsic data, to implement speaker diarization.
G06F 40/166 - Editing, e.g. insertion or deletion
G06F 40/284 - Lexical analysis, e.g. tokenisation or co-occurrence
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Methods and systems for enabling an efficient review of meeting content via a metadata-enriched, speaker-attributed transcript are disclosed. By incorporating speaker diarization and other metadata, the system can provide a structured and effective way to review and/or edit the transcript. One type of metadata can be image or video data to represent the meeting content. Furthermore, the present subject matter utilizes a multimodal diarization model to identify and label different speakers. The system can synchronize various sources of data, e.g., audio channel data, voice feature vectors, acoustic beamforming, image identification, and extrinsic data, to implement speaker diarization.
G06F 40/166 - Editing, e.g. insertion or deletion
G06F 40/284 - Lexical analysis, e.g. tokenisation or co-occurrence
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
A method is described that includes processing text and speech from an input utterance using local overrides of default dictionary pronunciations. Applying this method, a word-level grammar used to process the tokens specifies at least one local word phonetic variant that applies within a specific production rule and, within a local context of the specific production rule, the local word phonetic variant overrides one or more default dictionary phonetic versions of the word. This method can be applied to parsing utterances where the pronunciation of some words depends on their syntactic or semantic context.
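For instance, a rule for "live broadcast" might pin the "l ay v" pronunciation of "live" within that production only; the rule encoding below is invented for illustration.

    # Sketch: a production-local phonetic variant shadowing the dictionary.
    DICTIONARY = {"live": ["l ih v", "l ay v"]}   # default pronunciations

    RULE_LIVE_BROADCAST = {
        "production": "broadcast -> 'live' 'broadcast'",
        "local_phonetic_variants": {"live": ["l ay v"]},   # local override
    }

    def pronunciations(word, rule=None):
        if rule and word in rule.get("local_phonetic_variants", {}):
            return rule["local_phonetic_variants"][word]
        return DICTIONARY.get(word, [])

    print(pronunciations("live"))                       # both defaults
    print(pronunciations("live", RULE_LIVE_BROADCAST))  # ['l ay v'] only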
A system and method for masking an identity of a speaker of natural language speech, such as speech clips to be labeled by humans in a system generating voice transcriptions for training an automatic speech recognition model. The natural language speech is morphed prior to being presented to the human for labeling. In one embodiment, morphing comprises pitch shifting the speech randomly either up or down, then frequency shifting the speech, then pitch shifting the speech in a direction opposite the first pitch shift. Labeling the morphed speech comprises at least one or more of transcribing the morphed speech, identifying a gender of the speaker, identifying an accent of the speaker, and identifying a noise type of the morphed speech.
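The three-step morph can be sketched as follows, assuming librosa and SciPy are available; the shift amounts are illustrative, and the frequency shifter uses a standard analytic-signal modulation rather than anything specified in the disclosure.

    # Sketch: pitch shift (random direction), frequency shift, opposite pitch shift.
    import random
    import numpy as np
    import librosa
    from scipy.signal import hilbert

    def frequency_shift(y, sr, shift_hz):
        """Shift all spectral content by a fixed number of Hz."""
        analytic = hilbert(y)
        t = np.arange(len(y)) / sr
        return np.real(analytic * np.exp(2j * np.pi * shift_hz * t))

    def morph_voice(y, sr, steps=2.0, shift_hz=60.0):
        direction = random.choice([1.0, -1.0])        # randomly up or down
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=direction * steps)
        y = frequency_shift(y, sr, shift_hz)
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=-direction * steps)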
G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G10L 19/125 - Pitch excitation, e.g. pitch synchronous innovation code-excited linear prediction [PSI-CELP]
A system is disclosed that includes a stand-alone device, or a client device in communication with a server, to provide recommendations. The stand-alone device includes an input component, a storage component, a processor, and an output component. The server-connected client device includes an input component that receives the user's request, a communication component that communicates the request to the server and receives the recommendation from the server, and an output component that provides the recommendation to the user.
A client device receives a user request (e.g., in natural language form) to execute a command of an application. The client device delegates interpretation of the request to a response-processing server. Using domain knowledge previously provided by a developer of the application, the response-processing server determines the various possible responses that client devices could make in response to the request based on circumstances such as the capabilities of the client devices and the state of the application data. The response-processing server accordingly generates a response package that describes a number of different conditional responses that client devices could have to the request and provides the response package to the client device. The client device selects the appropriate response from the response package based on the circumstances as determined by the client device, executes the command (if possible), and provides the user with some representation of the response.
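A client's selection step might reduce to scanning conditional entries, as in this sketch; the package schema is an assumption for illustration.

    # Minimal sketch of client-side selection from a response package.

    def select_response(package, circumstances):
        """Pick the first conditional response whose conditions all hold."""
        for entry in package["responses"]:
            if all(circumstances.get(k) == v
                   for k, v in entry["conditions"].items()):
                return entry["response"]
        return package.get("default")

    package = {
        "responses": [
            {"conditions": {"has_screen": True}, "response": "show_results"},
            {"conditions": {"has_screen": False}, "response": "speak_results"},
        ],
        "default": "speak_results",
    }
    print(select_response(package, {"has_screen": True}))   # show_results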
A command-processing server receives a natural language command from a user. The command-processing server has a set of domain command interpreters corresponding to different domains in which commands can be expressed, such as the domain of entertainment, or the domain of travel. Some or all of the domain command interpreters recognize user commands having a verbal prefix, an optional pre-filter, an object, and an optional post-filter; the pre- and post-filters may be compounded expressions involving multiple atomic filters. Different developers may independently specify the domain command interpreters and the sub-structure interpreters on which they are based.
A natural language understanding server includes grammars specified in a modified extended Backus-Naur form (MEBNF) that includes an agglutination metasymbol not supported by conventional EBNF grammar parsers, as well as an agglutination preprocessor. The agglutination preprocessor applies one or more sets of agglutination rewrite rules to the MEBNF grammars, transforming them to EBNF grammars that can be processed by conventional EBNF grammar parsers. Permitting grammars to be specified in MEBNF form greatly simplifies the authoring and maintenance of grammars supporting inflected forms of words in the languages described by the grammars.
A machine learning system for a digital assistant is described, together with a method of training such a system. The machine learning system is based on an encoder-decoder sequence-to-sequence neural network architecture trained to map input sequence data to output sequence data, where the input sequence data relates to an initial query and the output sequence data represents a canonical data representation of the query. The method of training involves generating a training dataset for the machine learning system by clustering vector representations of the query data samples to produce canonical-query/original-query pairs.
A discriminator trained on labeled samples of speech can compute probabilities of voice properties. A speech synthesis generative neural network that takes in text and continuous scale values of voice properties is trained to synthesize speech audio that the discriminator will infer as matching the values of the input voice properties. Voice parameters can include speaker voice parameters, accents, and attitudes, among others. Training can be done by transfer learning from an existing neural speech synthesis model or such a model can be trained with a loss function that considers speech and parameter values. A graphical user interface can allow voice designers for products to synthesize speech with a desired voice or generate a speech synthesis engine with frozen voice parameters. A vector of parameters can be used for comparison to previously registered voices in databases such as ones for trademark registration.
G10L 13/047 - Architecture of speech synthesisers
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme conversion, prosody generation or stress or intonation determination
G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
G10L 15/26 - Speech-to-text systems
G06N 3/084 - Backpropagation, e.g. using gradient descent
G06N 3/04 - Architecture, e.g. interconnection topology
The present invention extends to methods, systems, and computer program products for interpreting queries according to preferences. Multi-domain natural language understanding systems can support a variety of different types of clients. Queries can be received and interpreted across one or more domains. Preferred query interpretations can be identified and query responses provided based on any of: domain preferences, preferences indicated by an identifier, or (e.g., weighted) scores exceeding a threshold.
Aspects include methods, systems, and computer-program products providing virtual assistant domain functionality. A natural language query including one or more words is received. A collection of natural language modules is accessed. The natural language modules in the collection are configured to process sets of natural language queries. A natural language module, from the collection of natural language modules, is identified to interpret the natural language query. An interpretation of the natural language query is computed using the identified natural language module. A response to the natural language query is returned using the computed interpretation.
G06F 40/40 - Processing or translation of natural language
G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
G06Q 30/0283 - Price estimation or determination
G06Q 20/10 - Payment architectures specially adapted for electronic funds transfer systems; Payment architectures specially adapted for home banking systems
G06F 40/211 - Syntactic parsing, e.g. based on context-free grammars or unification grammars
96.
Method and system for acoustic model conditioning on non-phoneme information features
A method and system are provided for conditioning an acoustic model on non-phoneme information features to optimize automatic speech recognition. The method includes using an encoder model to encode a sound embedding from a known key phrase of speech, and conditioning an acoustic model with the sound embedding to optimize its performance in inferring the probabilities of phonemes in the speech. The sound embedding can comprise non-phoneme information related to the key phrase and the following utterance. Further, the encoder model and the acoustic model can be neural networks that are jointly trained with audio data.
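One plausible reading of the conditioning step is concatenating the embedding to every acoustic frame, as sketched here; the dimensions, the GRU architecture, and fusion by concatenation are assumptions, not details from the disclosure.

    # Schematic sketch of an acoustic model conditioned on a sound embedding.
    import torch
    import torch.nn as nn

    class ConditionedAcousticModel(nn.Module):
        def __init__(self, feat_dim=80, embed_dim=32, n_phonemes=40):
            super().__init__()
            self.rnn = nn.GRU(feat_dim + embed_dim, 256, batch_first=True)
            self.out = nn.Linear(256, n_phonemes)

        def forward(self, frames, sound_embedding):
            # frames: (batch, time, feat_dim); embedding: (batch, embed_dim)
            cond = sound_embedding.unsqueeze(1).expand(-1, frames.size(1), -1)
            h, _ = self.rnn(torch.cat([frames, cond], dim=-1))
            return self.out(h).log_softmax(dim=-1)   # per-frame phoneme probs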
A speaker device includes an electroacoustic transducer configured to convert an audio signal into a set of sound waves and a transmitter configured to transmit an electromagnetic signal that carries the audio signal for receipt at distances limited to an audibility range of the set of sound waves. The audibility range of the set of sound waves corresponds to a distance at which the set of sound waves is estimated to be below a predetermined sound level.
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L 21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
G10L 25/06 - Speech or voice analysis techniques not restricted to a single one of the groups, characterised by the type of extracted parameters, the extracted parameters being correlation coefficients
G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of the groups, specially adapted for particular use, for comparison or discrimination
Systems for automatic speech recognition and/or natural language understanding automatically learn new words by finding subsequences of phonemes that, if they were a new word, would enable a successful tokenization of a phoneme sequence. Systems can learn alternate pronunciations of words by finding phoneme sequences with a small edit distance to existing pronunciations. Systems can learn the part of speech of words by finding part-of-speech variations that would enable parses by syntactic grammars. Systems can learn what types of entities a word describes by finding sentences that could be parsed by a semantic grammar but for the words not being on an entity list.
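The alternate-pronunciation test, in particular, reduces to a small edit-distance check over phoneme sequences, as this sketch shows; the distance threshold of 1 is an illustrative choice.

    # Sketch: accept a candidate pronunciation close to a known one.

    def edit_distance(a, b):
        """Levenshtein distance over phoneme sequences."""
        dp = list(range(len(b) + 1))
        for i, pa in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, pb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (pa != pb))
        return dp[-1]

    def is_plausible_variant(candidate, known_pronunciations, max_dist=1):
        return any(edit_distance(candidate, p) <= max_dist
                   for p in known_pronunciations)

    print(is_plausible_variant("t ah m aa t ow".split(),
                               ["t ah m ey t ow".split()]))   # True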
A query-processing server provides natural language services to applications. More specifically, the query-processing server receives and stores domain knowledge information from application developers, the domain knowledge information comprising a linguistic description of the natural language user queries that application developers wish their applications to support. A first portion of the domain knowledge information is applied to transform a natural language query received from an application to an ordered sequence of question elements. A second portion of the domain knowledge information is applied to group the ordered sequence of question elements into a plurality of distinct structured questions posed by the natural language query. The distinct structured questions may then be provided to the application, which may then execute them and obtain the corresponding data referenced by the questions.
A domain-independent framework parses and interprets compound natural language queries in the context of a conversation between a human and an agent. Generic grammar rules and corresponding semantics support the understanding of compound queries in the conversation context. The sub-queries themselves are from one or more domains, and they are parsed and interpreted by a pre-existing grammar, covering one or more pre-existing domains. The pre-existing grammar, extended by the generic rules, recognizes all compound queries based on any queries recognized by the pre-existing grammar. Use of the disclosed framework requires little or no change in the domain-specific NLU handling code. The framework defines a generic approach to propagating context data between sub-queries of a compound query. The framework can be further extended to propagate intra-query context data in, out and across query components. Complex query results, and other data such as accounting data, can also be propagated simultaneously with dialog context data in a consolidated intra-query context data structure.