Techniques for processing a dataset comprising data stored in fields to identify field labels. The field labels describe data stored in the dataset fields. The techniques determine whether any field labels in a field label glossary match a field. If none of the field labels in the field label glossary match the field, the techniques generate a new field label using the name of the field. The generated field label may be assigned to the field.
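Below is a minimal sketch of the lookup-or-generate flow described above, in Python; the normalization step and glossary shape are illustrative assumptions, not the patented matching logic.

import re

def assign_field_label(field_name, glossary):
    """Return a glossary label matching the field, or generate one from its name."""
    normalized = re.sub(r"[_\s]+", " ", field_name).strip().lower()
    for label in glossary:
        if label.lower() == normalized:
            return label                    # a glossary label matches this field
    # No glossary label matched: generate a new label from the field name.
    return normalized.title()

print(assign_field_label("customer account id", ["Customer Account ID"]))  # matched
print(assign_field_label("cust_acct_id", ["Customer Account ID"]))         # generated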
A method for using a development environment to automatically generate code from a multi-tiered metadata model includes: receiving a specification to process a dataset, and, in response, accessing dataset characteristics and identifying controls received from a development environment to be applied to a field of the dataset in accordance with a metadata model by: accessing a first instance of a data structure that corresponds to the dataset; based on a reference in the first instance, accessing a second instance of a data structure associated with the field; based on a reference in the second instance, accessing a third instance of a data structure associated with metadata describing the field; and based on a reference in the third instance, accessing a fourth instance of a data structure storing a control defined based on the metadata. Based on the dataset characteristics, code is generated to apply the identified control to the field.
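As a rough illustration of the four-tier reference chain, here is a sketch with hypothetical dictionaries standing in for the data-structure instances.

datasets = {"ds1": {"field_ref": "f1"}}            # first tier: dataset
fields   = {"f1": {"metadata_ref": "m1"}}          # second tier: field
metadata = {"m1": {"control_ref": "c1"}}           # third tier: field metadata
controls = {"c1": {"check": "not_null"}}           # fourth tier: control

def control_for(dataset_id):
    ds = datasets[dataset_id]                 # first instance
    fld = fields[ds["field_ref"]]             # follow reference to the field
    md = metadata[fld["metadata_ref"]]        # follow reference to the metadata
    return controls[md["control_ref"]]        # follow reference to the control

print(control_for("ds1"))  # {'check': 'not_null'}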
An approach to allocation of referenced objects to memory resources addresses a situation in which there is a large number of memory resources, for example, 2^16 elements in the set of memory resources, and yet the number of objects referenced in a program specification exceeds even this number. The approach is applicable to compilation of a program specification for execution on a physical or virtual processor.
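A sketch of the general idea, under the assumption that slots are reused once an object's last use has passed; the policy shown is illustrative, not the patented allocator.

def allocate(object_last_use, pool_size):
    """object_last_use: list of (object_id, last_step_used), in first-use order."""
    free, assignment, releases = list(range(pool_size)), {}, []
    for step, (obj, last_use) in enumerate(object_last_use):
        while releases and releases[0][0] < step:      # reclaim expired slots
            free.append(releases.pop(0)[1])
        slot = free.pop(0)                             # grab a free resource
        assignment[obj] = slot
        releases.append((last_use, slot))
        releases.sort()
    return assignment

print(allocate([("a", 1), ("b", 3), ("c", 3)], pool_size=2))  # slot for "a" reused by "c"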
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
8.
Partition-based Escrow in a Distributed Computing System
A method for fault-tolerant processing of a number of data elements using a distributed computing cluster. The distributed computing cluster includes a number of data processors associated with a corresponding number of data stores. The method includes storing the data elements in the distributed computing cluster, wherein the data elements are distributed across the data stores according to a number of partitions of data elements, processing data elements of a first set of partitions stored at a first data store using a first data processor to generate first result data for the data elements of the first set of partitions, sending the first result data from the distributed computing cluster to a consumer of the first result data outside the distributed computing cluster, and storing the first result data in a first buffer located in the distributed computing cluster and associated with the first data processor until the consumer has persistently stored the first result data outside the distributed computing cluster.
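The escrow behavior can be pictured with the following sketch (class and method names are hypothetical): results remain buffered until the consumer acknowledges durable storage, so they can be replayed after a failure.

class EscrowBuffer:
    def __init__(self):
        self.pending = {}                      # result_id -> result data

    def send(self, result_id, data, consumer):
        self.pending[result_id] = data         # keep a copy until acknowledged
        consumer.deliver(result_id, data)

    def acknowledge(self, result_id):
        self.pending.pop(result_id, None)      # safe to drop once persisted

    def replay(self, consumer):
        for result_id, data in self.pending.items():
            consumer.deliver(result_id, data)  # re-send after a consumer failure

class PrintConsumer:
    def deliver(self, rid, data):
        print("got", rid, data)

buf = EscrowBuffer()
buf.send(1, {"count": 42}, PrintConsumer())
buf.acknowledge(1)                             # consumer has persisted result 1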
A data processing system with a dataset multiplexer that enables applications to be written to specify access to datasets as operations on logical datasets. During execution of an application by the data processing system, the physical dataset used for performing data access operations may be selected based on current context. Current context may be specified based on values of system parameters and/or user specified values. The physical dataset accessed may be identified by selecting a record from multiple records in a dataset catalog associated with the logical dataset. Each record includes information to access a physical dataset associated with the selected record and context information to indicate the context in which the specific physical dataset is to be selected.
G06F 16/908 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually; using metadata automatically derived from the content
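A minimal sketch of the context-based resolution from a catalog described in the multiplexer abstract above; the catalog layout and context keys are assumptions.

catalog = {
    "customers": [   # logical dataset -> candidate physical datasets
        {"path": "/prod/customers.dat", "context": {"env": "prod"}},
        {"path": "/test/customers.dat", "context": {"env": "test"}},
    ]
}

def resolve(logical_name, current_context):
    for record in catalog[logical_name]:
        if all(current_context.get(k) == v for k, v in record["context"].items()):
            return record["path"]
    raise LookupError(f"no physical dataset for {logical_name} in {current_context}")

print(resolve("customers", {"env": "test", "user": "alice"}))  # /test/customers.dat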
Some embodiments relate to generating a list of data fields referenceable at a point in a graph (there are different lists for each point). This list may be used as part of programming a dataflow graph to select data (e.g., at an input node of a component to select data processed in that component). One aspect relates to display of the list of data fields, because some of the data field names may be overloaded. Accordingly, the data fields may be presented hierarchically if necessary, showing the source for each overloaded data field name. Otherwise, the user may select whether the list of referenceable fields is grouped by source.
A computer-implemented method for defining a test for a computer program includes receiving operational data generated during execution of a computer program in a first computing environment, the operational data indicative of (i) a data source accessed by the computer program during execution of the computer program and (ii) a destination to which baseline data records are output by the computer program during execution of the computer program. Based on the received operational data, a data storage object is generated that includes (i) input data records from the data source and the baseline data records from the destination, and (ii) test definition data for the first computing environment. Responsive to migration of the computer program to a second computing environment, the input and baseline data records from the data storage object are stored in the second computing environment. A test configuration is defined for the migrated computer program in the second computing environment according to the test definition data in the data storage object and a mapping between the first computing environment and the second computing environment, the test configuration for the migrated computer program identifying a location of the input data records and a location of the baseline data records in the second computing environment. Execution of the migrated computer program in the second computing environment is tested using the input data records and baseline data records in the second computing environment and according to the defined test configuration for the migrated computer program.
The present disclosure relates to a computer-implemented method for conversion of a first data lineage to a second data lineage, the method comprising: obtaining a first data lineage specifying relationships among physical components of a plurality of physical components; receiving an identification of a portion of the first data lineage; generating a second data lineage from the identified portion of the first data lineage, the second data lineage specifying relationships among second components of a plurality of second components, wherein the second components of the plurality of second components are associated with at least some of the physical components of the identified portion of the first data lineage. A corresponding computer-readable medium, a corresponding data processing system, and a corresponding computer program are also described.
The present disclosure relates to a computer-implemented method, the method comprising: obtaining a data lineage whose structure specifies relationships among data sets of a plurality of data sets; analyzing the structure of the data lineage; based on a result of the analyzing, identifying a subset of the plurality of data sets for which a parameter is to be evaluated, wherein the subset includes one or more of the data sets, and wherein the parameter is for indicating a potential error within a data set; and outputting an indication of the identified subset of one or more data sets. A computer-readable medium, computer program, a corresponding data processing apparatus, and a data structure are described as well.
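One way to picture the structural analysis is the sketch below, which flags datasets with many upstream inputs; the fan-in heuristic is an illustrative assumption, not the disclosed analysis.

lineage = {  # dataset -> upstream datasets it is derived from
    "report": ["orders", "customers", "returns"],
    "orders": ["raw_orders"],
    "customers": [],
    "returns": [],
    "raw_orders": [],
}

def datasets_to_check(lineage, min_inputs=2):
    return [ds for ds, inputs in lineage.items() if len(inputs) >= min_inputs]

print(datasets_to_check(lineage))  # ['report'] -- many inputs, more chances for error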
Techniques for discovering primary, unique, and/or foreign keys for relational datasets are described. The techniques include profiling the relational datasets to obtain respective data profiles; identifying one or more primary key candidates for a first relational dataset using a first data profile of the first relational dataset and a first trained machine learning model; identifying one or more foreign key proposals for a second relational dataset using the one or more primary key candidates by performing a subset analysis of the second relational dataset with respect to the first relational dataset; identifying one or more foreign key candidates for the second relational dataset using the first data profile, a second data profile of the second relational dataset, and a second trained machine learning model different from the first trained machine learning model; and outputting the primary key candidate(s) and the foreign key candidate(s).
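The subset analysis at the core of the foreign-key step can be sketched as follows (toy tables, no machine-learning models): column B is a foreign-key candidate for primary key A if B's values are a subset of A's.

orders    = {"customer_id": [1, 2, 2, 3]}
customers = {"id": [1, 2, 3, 4]}

def foreign_key_proposals(child, parent, parent_pk):
    pk_values = set(parent[parent_pk])
    return [col for col, values in child.items() if set(values) <= pk_values]

print(foreign_key_proposals(orders, customers, "id"))  # ['customer_id']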
Techniques for using finite state machines (FSMs) to implement workflows in a data processing system comprising at least one data store storing data objects and a workflow management system (WMS). The WMS is configured to perform: determining a current value of an attribute of a first data object by accessing the current value in the at least one data store; identifying, using the current value and metadata specifying relationships among at least some of the data objects, an actor authorized to perform a workflow task for the first data object; generating a GUI through which the actor can provide input specifying that the workflow task is to be performed; and in response to receiving, from the actor and through the GUI, input specifying that the workflow task is to be performed: performing the workflow task; and updating the current workflow state of the first FSM to a second workflow state.
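A compact sketch of the FSM transition plus authorization check; states, tasks, and the authorization table are hypothetical.

transitions = {("draft", "approve"): "approved", ("approved", "publish"): "published"}

def perform_task(state, task, actor, authorized_actors):
    if actor not in authorized_actors.get(task, ()):
        raise PermissionError(f"{actor} may not {task}")
    return transitions[(state, task)]          # advance to the next workflow state

state = perform_task("draft", "approve", "reviewer", {"approve": {"reviewer"}})
print(state)  # approved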
A method for developing a reusable data processing program including a set of data transformation steps by displaying a set of records and iteratively enabling a user to select one or more data transformation steps, iteratively applying the data transformation steps to the records, and iteratively displaying the transformed records.
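The iterative loop can be pictured as below; the two transformation steps are stand-ins for user selections.

records = [{"name": " ada "}, {"name": "grace"}]
steps = [
    lambda r: {**r, "name": r["name"].strip()},   # step 1: trim whitespace
    lambda r: {**r, "name": r["name"].title()},   # step 2: normalize casing
]

for step in steps:                    # each iteration applies one chosen step...
    records = [step(r) for r in records]
    print(records)                    # ...and redisplays the transformed records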
G06F 16/25 - Integrating or interfacing systems involving database management systems
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
24.
Migration of datasets among federated database systems
In an aspect, a method for migrating data records to a federated database system includes obtaining data records from a data source in a first federated database system; generating a data snapshot file based on the obtained data records and data indicative of a characteristic associated with the obtained data records; generating a hash of the data snapshot file to prevent modification of the data snapshot file; storing the data snapshot file and the generated hash in a data storage; migrating the obtained data records from the data snapshot file to a data target in a second federated database system, the migrating including: retrieving the data records from the data snapshot file stored in the data storage; providing the retrieved data records to the data target according to a mapping between a characteristic of the data source and a characteristic of the data target.
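A sketch of the snapshot-plus-hash step using SHA-256 via Python's hashlib; the JSON serialization is an assumption for illustration.

import hashlib, json

def make_snapshot(records, characteristics):
    snapshot = json.dumps({"records": records, "meta": characteristics},
                          sort_keys=True).encode()
    return snapshot, hashlib.sha256(snapshot).hexdigest()

def verify_snapshot(snapshot, expected_hash):
    return hashlib.sha256(snapshot).hexdigest() == expected_hash

snap, digest = make_snapshot([{"id": 1}], {"source": "dbA"})
print(verify_snapshot(snap, digest))  # True -- safe to migrate to the target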
A method implemented by a data processing system for enabling a system to pipeline or otherwise process data in conformance with specified criteria by providing a graphical user interface for selecting data to be processed, determining metadata of selected data, and, based on the metadata, automatically processing the selected data in conformance with the specified criteria.
G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
G06F 9/448 - Execution paradigms, e.g. implementations of programming paradigms
G06F 9/451 - Execution arrangements for user interfaces
G06F 16/28 - Databases characterised by their database models, e.g. relational or object models
G06F 16/908 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually; using metadata automatically derived from the content
Techniques for obtaining information about data entity instances managed by a data processing system using at least one data store. The techniques include obtaining a query comprising a first portion comprising information for identifying instances of a first data entity stored in at least one data store; and a second portion indicating at least one attribute of the first data entity; generating, from the query, a plurality of executable queries including a first set of one or more executable queries and a second set of one or more executable queries, the generating comprising: generating, using the first portion, the first set of executable queries for identifying instances of the first data entity, and generating, using the second portion, the second set of executable queries for obtaining attribute values for instances of the first data entity; and executing the plurality of executable queries to obtain results for the query.
A method implemented by a data processing system for enabling a user to browse a data catalog and select fields of datasets from multiple data sources to be integrated into a data profile so that, when a request is received for the data profile, data from those fields can be made available efficiently and immediately.
A method implemented by a data processing system for: enabling a user to preview attributes of fields of an expanded view of a base dataset and to specify one or more of the fields to use in downstream data processing and generating a dataset that includes the one or more of the fields from the preview specified to be used in the downstream data processing, with the generated dataset having increased efficiency with respect to speed and data memory, relative to an efficiency of generating a dataset including all the fields of the expanded view when only the specified one or more of the fields are used in the downstream data processing.
Described are techniques for causing a data processing system to perform real-time decisioning by generating a data record (e.g., a dynamic data record) based on a request for the real-time decisioning, wherein the data record includes batch data and real-time data retrieved from one or more operational systems responsive to receipt of the request, with real-time being with regard to when the request is received by the data processing system.
At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining an automatically generated initial dataflow graph, the initial dataflow graph comprising a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing flows of data among nodes in the first plurality of nodes; and generating an updated dataflow graph by iteratively applying dataflow graph optimization rules to update the initial dataflow graph, the updated dataflow graph comprising a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing flows of data among nodes in the second plurality of nodes.
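The iterate-until-fixed-point structure can be sketched as below; the single rewrite rule shown (dropping a redundant adjacent sort) is illustrative, not the patent's rule set.

def drop_repeated_ops(nodes):
    out = []
    for op in nodes:
        if out and out[-1] == op == "sort":    # a second sort of sorted data is a no-op
            continue
        out.append(op)
    return out

graph = ["read", "sort", "sort", "filter"]
while True:
    updated = drop_repeated_ops(graph)
    if updated == graph:                       # fixed point: no rule applies
        break
    graph = updated
print(graph)  # ['read', 'sort', 'filter']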
Described are techniques for back-calculating one or more values of a new, real-time aggregate before sufficient data to calculate the new, real-time aggregate has been collected, wherein the back-calculating is based on data collected for one or more aggregates that have been executing prior to start of execution of the new, real-time aggregate.
A method for performing real-time segmentation by updating a wide record based on receipt of real-time data, wherein an item of real-time data represents a transaction, detecting that the updated wide record satisfies criteria for performing real-time segmentation, and performing real-time segmentation on the updated, wide record, wherein real-time is relative to when a transaction represented in the updated wide record occurs.
A method for enabling a user to generate a complex aggregation on their own by providing the user with a graphical user interface that displays data items in a data catalog and that provides controls for the user to select data items to be used in generating the complex aggregation, and to select a type of aggregation, and based on the user's selections, automatically generating computer instructions to generate a value of the complex aggregation is described.
A data processing system that receives user input specifying datasets on which operations are performed with user interfaces that enable manipulation of hierarchical groups of datasets. A user interface may enable individual datasets or a previously defined group of datasets to be aggregated into another grouping. The groupings may be scoped, including by persona of users, such that, when a user is prompted to specify one or more datasets as a target of an operation by the data processing system, the available choices are limited to datasets that have a scope encompassing that user. The interfaces may prompt a user to select a grouping within the hierarchy that contains datasets on which the operation can be performed. Upon selection of a grouping with multiple datasets as a target of an operation that is performed on datasets singly, the operation may be performed on each dataset in the selected group.
Techniques for managing access privileges in a data processing system include obtaining a plurality of rules for granting and/or denying privileges to a first actor to perform at least one action on a first instance of a first data entity of data entities; identifying, from among attributes of the first data entity, a first attribute whose values are used by one or more of the plurality of rules; obtaining, from a user or from at least one data store, a first value of the first attribute; identifying, using the first value and from among the plurality of rules, a first rule that depends on the first value; generating a graphical user interface (GUI) including a visual rendering of at least some of the plurality of rules, the visual rendering emphasizing the first rule identified using the first value of the first attribute; and displaying the generated GUI to the user.
H04L 41/22 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
48.
Generating rules for data processing values of data fields from semantic labels of the data fields
Methods and systems are configured to determine a semantic meaning for data and generate data processing rules based on the semantic meaning of the data. The semantic meaning includes syntactical or contextual meaning for the data that is determined, for example, by profiling, by the data processing system, values stored in a field included in data records of one or more datasets; applying, by the data processing system, one or more classifiers to the profiled values; identifying, based on applying the one or more classifiers, one or more attributes indicative of a logical or syntactical characteristic for the values of the field, with each of the one or more attributes having a respective confidence level that is based on an output of each of the one or more classifiers. The attributes are associated with the fields and are used for generating data processing rules and processing the data.
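A toy sketch of the profile-then-classify flow; the two pattern-based classifiers below stand in for the system's actual classifiers.

import re

def profile(values):
    return {"values": values, "max_len": max(map(len, values))}  # tiny stand-in profile

def date_classifier(p):
    hits = sum(bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)) for v in p["values"])
    return "date", hits / len(p["values"])

def zip_classifier(p):
    hits = sum(bool(re.fullmatch(r"\d{5}", v)) for v in p["values"])
    return "zip_code", hits / len(p["values"])

p = profile(["2021-04-01", "2021-05-13", "not a date"])
print([clf(p) for clf in (date_classifier, zip_classifier)])
# [('date', 0.666...), ('zip_code', 0.0)] -> each attribute with its confidence level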
A method implemented by a data processing system including: accessing a container image that includes a first application and a second application; determining, by the data processing system, a number of parallel executions of a given module of the first application; for the given module, generating a plurality of instances of the container image in accordance with the number of parallel executions determined; for each instance, configuring that instance to execute the given module of the first application; causing each of the plurality of configured instances to execute on one or more host systems; and for at least one of the plurality of configured instances, causing, by the second application of that configured instance, communication between the data processing system and the one or more host systems executing that configured instance.
A method includes accessing a schema that specifies relationships among datasets, computations on the datasets, or transformations of the datasets, selecting a dataset from among the datasets, and identifying, from the schema, other datasets that are related to the selected dataset. Attributes of the datasets are identified, and logical data representing the identified attributes and relationships among the attributes is generated. The logical data is provided to a development environment, which provides access to portions of the logical data representing the identified attributes. A specification that specifies at least one of the identified attributes in performing an operation is received from the development environment. Based on the specification and the relationships among the identified attributes represented by the logical data, a computer program is generated to perform the operation by accessing, from storage, at least one dataset having the at least one of the attributes specified in the specification.
A method for generating an executable application to transform and load data into a structured dataset includes receiving a metadata file that specifies values for parameters for structuring data feeds, received from a networked data source, into a structured database. The metadata file specifies logical rules for transforming the data feeds. The values of the parameters and the logical rules for transforming the plurality of the data feeds are validated to ensure logical consistency for each data feed. Data rules are generated that specify standards for transforming each data feed in accordance with the validated values of the parameters and logical rules. The executable application is generated that is configured to receive source data comprising a data feed from one or more data sources and transform the source data into structured data that satisfies the one or more standards for the structured data record in compliance with the data rules.
Techniques for discovering semantic meaning of data in fields included in one or more data sets, the method including: identifying a first field having a previously-assigned label that indicates a semantic meaning of the first field; identifying a set of one or more candidate labels for potential assignment to the first field instead of the previously-assigned label; evaluating, using a previously-determined label score and a first candidate label score, whether to assign a first candidate label to the first field, the evaluating comprising: when the first candidate label score is at least a first threshold amount greater than a previously-determined label score, presenting the first candidate label to a user by generating an interface through which the user can provide input indicating whether to assign the first candidate label to the first field instead of the previously-determined label.
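The evaluation step reduces to a threshold comparison, sketched below with an assumed margin.

def should_propose(current_score, candidate_score, margin=0.15):
    return candidate_score >= current_score + margin

print(should_propose(0.60, 0.80))  # True  -> ask the user about the new label
print(should_propose(0.60, 0.65))  # False -> keep the previously assigned label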
Some embodiments provide techniques of enforcing valid data assignments in a data processing system in which data can be dynamically updated by user devices and/or computerized processes. The techniques identify, using a validation rule associated with a data entity, one or more valid values for assignment to an attribute of an instance of the data entity. The techniques identify the valid value(s) by generating a query for the one or more valid values using one or more condition(s) on the attribute in the validation rule, and executing the generated query to obtain the one or more valid values for the first attribute. The attribute may then be assigned one or more of the identified valid value(s).
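A sketch of deriving the valid-values query from a rule's conditions; the SQL text is illustrative rather than the system's query language.

def valid_values_query(table, attribute, conditions):
    where = " AND ".join(conditions)
    return f"SELECT DISTINCT {attribute} FROM {table} WHERE {where}"

q = valid_values_query("departments", "dept_code",
                       ["active = 1", "region = 'EU'"])
print(q)  # executed, its result set is the only set of assignable values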
Among other things, we describe a method of receiving a portion of metadata from a data source, the portion of metadata describing nodes and edges; generating instances of a data structure representing the portion of metadata, at least one instance of the data structure including an identification value that identifies a corresponding node, one or more property values representing respective properties of the corresponding node, and one or more pointers to respective identification values, each pointer representing an edge associated with a node identified by the corresponding respective identification value; storing the instances of the data structure in random access memory; receiving a query that includes an identification of at least one particular element of data; and using at least one instance of the data structure to cause a display of a computer system to display a representation of lineage of the particular element of data.
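The node structure described (identification value, properties, pointers to other nodes by id) might look like the following sketch, with a recursive lineage walk; all names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)
    upstream: list = field(default_factory=list)   # ids of nodes feeding this one

nodes = {
    "raw":    Node("raw", {"type": "file"}),
    "clean":  Node("clean", {"type": "table"}, upstream=["raw"]),
    "report": Node("report", {"type": "view"}, upstream=["clean"]),
}

def lineage_of(node_id):
    node = nodes[node_id]
    for parent in node.upstream:
        yield from lineage_of(parent)
    yield node_id

print(list(lineage_of("report")))  # ['raw', 'clean', 'report']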
A method for using a metadata model to perform operations on data items, with the metadata model including parent nodes and child nodes connected by edges, with the parent nodes specifying logical metadata and the child nodes specifying physical metadata representing the data items, and with the edges specifying relationships between the nodes. The method includes: identifying a given data item and physical metadata of that given data item, accessing the metadata model, identifying, in the metadata model, a child node representing the physical metadata of the given data item, traversing one or more edges in the metadata model to identify parent nodes of the child node, determining, from logical metadata associated with the identified parent nodes, one or more operations to be performed on the given data item, applying the one or more operations to the given data item to transform the data item, and storing the transformed data item.
Techniques for generating a dataflow graph include generating a first dataflow graph with a plurality of first nodes representing first computer operations in processing data, with at least one of the first computer operations being a declarative operation that specifies one or more characteristics of one or more results of processing of data, and transforming the first dataflow graph into a second dataflow graph for processing data in accordance with the first computer operations, the second dataflow graph including a plurality of second nodes representing second computer operations, with at least one of the second nodes representing one or more imperative operations that implement the logic specified by the declarative operation, where the one or more imperative operations are unrepresented by the first nodes in the first dataflow graph.
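The declarative-to-imperative lowering can be pictured as a node-by-node expansion; the unique-by example mapping is an assumption, purely for illustration.

def lower(node):
    if node == ("unique_by", "customer_id"):
        # the declarative node only states the desired result; lowering supplies
        # imperative operations that implement it
        return [("sort", "customer_id"), ("dedup_sorted", "customer_id")]
    return [node]

first_graph = [("read", "orders"), ("unique_by", "customer_id")]
second_graph = [op for node in first_graph for op in lower(node)]
print(second_graph)
# [('read', 'orders'), ('sort', 'customer_id'), ('dedup_sorted', 'customer_id')]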
A data processing system for discovering a semantic meaning of a field included in one or more data sets is configured to identify a field included in one or more data sets, with the field having an identifier. For that field, the system profiles data values of the field to generate a data profile, accesses a plurality of label proposal tests, and generates a set of label proposals by applying the plurality of label proposal tests to the data profile. The system determines a similarity among the label proposals and selects a classification. The system identifies one of the label proposals as identifying the semantic meaning. The system stores the identifier of the field with the identified one of the label proposals that identifies the semantic meaning.
G06F 16/908 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually; using metadata automatically derived from the content
Described herein are techniques, performed by a data processing system, for enabling efficient development of software application programs in a dynamic environment with multiple datasets by generating entries in a dataset catalog to provide a software application program with access to output data dynamically generated by dataflow graphs, the entries associated with respective software application programs developed as dataflow graphs. The techniques include identifying a subgraph, wherein, when the subgraph is executed, the subgraph generates output data by applying one or more data processing operations to data obtained from one or more data sources; creating, in the dataset catalog, a new entry associated with the identified subgraph, the new entry associated with information indicating nodes, links, and configuration parameters of the identified subgraph; and configuring the dataset catalog to enable access to the new entry, in the dataset catalog, associated with the identified subgraph.
A method is described for processing keyed data items that are each associated with a value of a key, the keyed data items being from a plurality of distinct data streams, the processing including collecting the keyed data items, determining, based on contents of at least one of the keyed data items, satisfaction of one or more specified conditions for execution of one or more actions and causing execution of at least one of the one or more actions responsive to the determining.
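A sketch of collecting keyed items across distinct streams and firing an action when a condition over their contents is met; stream names and the mismatch condition are hypothetical.

from collections import defaultdict

collected = defaultdict(dict)   # key -> {stream_name: item}

def on_item(key, stream, item):
    collected[key][stream] = item
    entry = collected[key]
    # condition: both streams seen for this key and their amounts disagree
    if ({"ledger", "gateway"} <= entry.keys()
            and entry["ledger"]["amount"] != entry["gateway"]["amount"]):
        print(f"action: flag mismatch for {key}")

on_item("txn-7", "ledger", {"amount": 100})
on_item("txn-7", "gateway", {"amount": 95})    # triggers the action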
Characterizing data includes: reading data from an interface to a data storage system, and storing two or more sets of summary data summarizing data stored in different respective data sources in the data storage system; and processing the stored sets of summary data to generate system information characterizing data from multiple data sources in the data storage system. The processing includes: analyzing the stored sets of summary data to select two or more data sources that store data satisfying predetermined criteria, and generating the system information including information identifying a potential relationship between fields of records included in different data sources based at least in part on comparison between values from a stored set of summary data summarizing a first of the selected data sources and values from a stored set of summary data summarizing a second of the selected data sources.
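The cross-source comparison can be sketched with distinct-value sets standing in for the stored summary data; the overlap threshold is an assumption.

summary_a = {"customer_id": {1, 2, 3, 4}}
summary_b = {"cust_ref": {2, 3, 4}, "amount": {10, 95, 100}}

def related_fields(sa, sb, min_overlap=0.8):
    pairs = []
    for fa, va in sa.items():
        for fb, vb in sb.items():
            overlap = len(va & vb) / min(len(va), len(vb))
            if overlap >= min_overlap:          # enough shared values to suggest a link
                pairs.append((fa, fb, overlap))
    return pairs

print(related_fields(summary_a, summary_b))  # [('customer_id', 'cust_ref', 1.0)]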
A method for updating a computer program includes receiving a computer program hosted on and configured to be executed by a first computing system. The method includes analyzing the computer program to obtain characterization of a lineage, an architecture, and an operation of the computer program. The lineage includes relationships among elements of the computer program, the architecture includes a characteristic of the data source, the data target, and one or more processors configured to process the data contained in data records, and the operation includes processes that are executed to process the data from the data records. The method includes receiving a characterization of an update to be made to the computer program, in which when the computer program is modified according to the update, at least some of the modified computer program is configured to be hosted on and executed by a second computing system; and modifying the computer program to implement the update to generate the modified computer program.
G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
H04L 67/06 - Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
H04L 67/00 - Network arrangements or protocols for supporting network services or applications
72.
SYSTEMS AND METHODS FOR PERFORMING DATA PROCESSING OPERATIONS USING VARIABLE LEVEL PARALLELISM
Techniques for determining processing layouts for nodes of a dataflow graph. The techniques include: obtaining information specifying a dataflow graph, the dataflow graph comprising a plurality of nodes and a plurality of edges connecting the plurality of nodes, the plurality of edges representing flows of data among nodes in the plurality of nodes, the plurality of nodes comprising: a first set of one or more nodes; and a second set of one or more nodes disjoint from the first set of nodes; obtaining a first set of one or more processing layouts for the first set of nodes; and determining a processing layout for each node in the second set of nodes based on the first set of processing layouts and one or more layout determination rules, the one or more layout determination rules including at least one rule for selecting among processing layouts having different degrees of parallelism, and information indicating that data generated by at least one node in the first and/or second set of nodes is not used by any nodes in the dataflow graph downstream from the at least one node.
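A sketch of layout propagation under one possible rule, preferring the wider degree of parallelism among a node's inputs; the rule and graph are illustrative assumptions.

fixed = {"read": 1, "big_join": 8}            # first set: layouts already known
edges = {"filter": ["read"], "rollup": ["filter", "big_join"]}
memo = dict(fixed)

def layout_of(node):
    if node not in memo:
        # rule: among inputs with different degrees of parallelism, take the wider one
        memo[node] = max(layout_of(p) for p in edges[node])
    return memo[node]

print(layout_of("rollup"))  # 8 -- inherits the higher degree of parallelism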
An electronic system for increasing the speed of preparing data with a specified data quality for storage by automatically identifying for a user, with minimal user input, common contexts among (i) fields in disparate datasets, and (ii) names the user has specified as potentially describing the fields, and by using those common contexts to govern the disparate datasets prior to storage to ensure the specified data quality.
A method includes automatically determining a component of a security label for each first record in a first table of a database having multiple tables, including: identifying a second record related to the first record according to a foreign key relationship; identifying a component of the security label for the second record; and assigning a value for the component of the security label for the first record based on the identified component of the security label for the second record. The method includes storing the determined security label in the record.
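A sketch of the foreign-key propagation step with toy tables; table and label names are hypothetical.

accounts = {"a1": {"security_label": "confidential"}}
payments = [{"id": "p1", "account_id": "a1"}]

for record in payments:                        # each first record in the first table
    parent = accounts[record["account_id"]]    # identify the related second record
    record["security_label"] = parent["security_label"]  # assign its label component

print(payments[0])  # inherits 'confidential' from the related account record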
Some embodiments relate to a method for use in connection with governance of a plurality of data assets managed by a data processing system, the method comprising: using at least one computer hardware processor to perform: accessing a data governance policy comprising a first data standard (e.g., by obtaining information about the first standard stored in a database system); generating a first data asset collection at least in part by automatically selecting, from among the plurality of data assets managed by the data processing system and using at least one data asset criterion, one or more data assets that meet the at least one data asset criterion; associating the first data asset collection with the first data standard; and verifying whether at least one of the one or more data assets in the first data asset collection complies with the first data standard.
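The selection-then-verification flow could be sketched as below; the assets, the criterion, and the standard are all invented stand-ins for whatever a real governance policy would define.

assets = [
    {"name": "customers.csv", "domain": "finance", "has_owner": True},
    {"name": "clicks.log",    "domain": "finance", "has_owner": False},
    {"name": "staff.parquet", "domain": "hr",      "has_owner": True},
]

criterion = lambda a: a["domain"] == "finance"   # data asset criterion
standard  = lambda a: a["has_owner"]             # first data standard: every asset has an owner

collection = [a for a in assets if criterion(a)]                 # generate the collection
violations = [a["name"] for a in collection if not standard(a)]  # verify compliance

print("in collection:", [a["name"] for a in collection])
print("non-compliant:", violations)   # ['clicks.log']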
A method for performing a distributed computation on a computing system, using computational resources dynamically allocated by a computational resource manager, includes storing information specifying quantities of computational resources associated with respective program portions of a program, where the program portions perform successive transformations of data and each portion uses computational resources granted by the resource manager to perform its computation in the computing system. A first quantity of computational resources associated with a first program portion is requested from the resource manager, and a second quantity, less than the requested first quantity, is received. Computation associated with the first program portion is performed using the second quantity of computational resources. While that computation is in progress, an additional quantity of computational resources is received from the resource manager, and additional computation associated with the first program portion is performed using the additional quantity while the computation using the second quantity continues.
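A toy, purely sequential sketch of the grant-widening behavior follows; the quantities, the run_portion helper, and the "wave" model of parallel width are all invented, and a real system would run the waves concurrently rather than annotate them.

def run_portion(work_items, slots):
    """Process items in 'waves' as wide as the currently granted slot count."""
    done = []
    while work_items:
        wave, work_items = work_items[:slots["granted"]], work_items[slots["granted"]:]
        done.extend(f"{w} (width {slots['granted']})" for w in wave)
    return done

slots = {"granted": 3}                   # manager grants 3 of the 8 requested
items = [f"partition-{i}" for i in range(10)]

out = []
out.extend(run_portion(items[:6], slots))   # early waves run at the partial width
slots["granted"] += 2                       # additional grant arrives mid-computation
out.extend(run_portion(items[6:], slots))   # later work uses the widened grant
print(out)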
A data processing system configured to perform: obtaining a first data lineage representing relationships among physical data elements, the first data lineage being generated at least in part by performing at least one of: (a) analyzing source code of at least one computer program configured to access the physical data elements; and (b) analyzing information obtained during runtime of the at least one computer program; obtaining, based on user input, a second data lineage representing relationships among business data elements; obtaining an association between at least some of the physical data elements of the first data lineage and at least some of the business data elements of the second data lineage; and generating, based on the association between the physical data elements and the business data elements, an indication of agreement or discrepancy between the first data lineage and the second data lineage.
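The reconciliation step might look like the sketch below, where the physical lineage, the business lineage, and the mapping between their elements are invented examples: physical edges are translated through the association and then compared edge-by-edge with the business lineage.

physical_edges = {("src.customers", "dw.cust_dim")}        # from code/runtime analysis
business_edges = {("Customer Intake", "Customer Master"),  # from user input
                  ("Orders Feed", "Order Book")}

to_business = {"src.customers": "Customer Intake",         # association between elements
               "dw.cust_dim": "Customer Master"}

mapped = {(to_business[a], to_business[b]) for a, b in physical_edges
          if a in to_business and b in to_business}

print("agreement:", mapped & business_edges)     # edges present in both lineages
print("discrepancy:", business_edges - mapped)   # business edges with no physical support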
In a first aspect, a method includes, at a node of a Hadoop cluster, the node storing a first portion of data in HDFS data storage, executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster, receiving a computer-executable program by the data processing engine, executing at least part of the program by the first instance of the data processing engine, receiving, by the data processing engine, a second portion of data from the external data source, storing the second portion of data other than in HDFS storage, and performing, by the data processing engine, a data processing operation identified by the program using at least the first portion of data and the second portion of data.
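Much simplified, and without any real Hadoop or HDFS APIs, the engine's combining step could be sketched as an enrichment join between node-local data and externally fetched data held only in memory; every function name below is hypothetical.

def read_hdfs_portion():
    # stands in for the first portion of data stored in HDFS on this node
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

def fetch_external_portion():
    # stands in for data received from a source external to the cluster,
    # kept in memory rather than written to HDFS storage
    return {1: "gold", 2: "silver"}

def run_program(hdfs_rows, external_lookup):
    # the data processing operation identified by the program: an enrichment join
    return [dict(row, tier=external_lookup[row["id"]]) for row in hdfs_rows]

print(run_program(read_hdfs_portion(), fetch_external_portion()))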
A method performed by a computer system includes: accessing a specification that specifies a plurality of modules to be implemented by a computer program for processing one or more values of one or more fields in a structured data item; transforming the specification into the computer program that implements the plurality of modules, wherein the transforming includes: for each of one or more first modules of the plurality of modules: identifying one or more second modules of the plurality of modules that each receive input that is at least partly based on an output of the first module; and formatting an output data format of the first module such that the first module outputs only one or more values of one or more fields of the structured data item.
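One way to picture the output-format narrowing is the sketch below, with invented module and wiring structures: each module's output fields are trimmed to those its downstream consumers actually read.

modules = {
    "parse":  {"outputs": ["id", "name", "debug_blob"], "inputs": []},
    "score":  {"outputs": ["id", "score"],              "inputs": ["id", "name"]},
    "report": {"outputs": ["line"],                     "inputs": ["id", "score"]},
}
wiring = {"parse": ["score"], "score": ["report"], "report": []}  # module -> downstream modules

for name, mod in modules.items():
    consumers = wiring[name]                 # second modules fed by this first module
    needed = {f for c in consumers for f in modules[c]["inputs"]}
    if consumers:                            # format output to only the needed fields
        mod["outputs"] = [f for f in mod["outputs"] if f in needed]

print(modules["parse"]["outputs"])   # ['id', 'name'] -- 'debug_blob' is dropped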
Systems and methods are described for executing, by a data processing system, a workflow to process results data indicating an output of a data quality test on data records by generating, responsive to receiving the results data and metadata describing the results data, a data quality issue associated with a state and one or more processing steps of the workflow to resolve a data quality error associated with the data quality test. Operations include generating a workflow for processing results data based on a state specified by a data quality issue. Generating the workflow includes: assigning, based on the results data and the state of the data quality issue, an entity responsible for resolving the data quality error; determining, based on the metadata, one or more actions for satisfying the data quality condition specified in the data quality test; and updating the state associated with the data quality issue.
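A hypothetical sketch of turning test results into a stateful issue might look as follows; the result fields, metadata keys, and state names are invented.

results = {"test": "null_check", "failed_rows": 42}
metadata = {"dataset": "orders", "owner_team": "finance-data"}

issue = {"state": "new", "error": results["test"], "assignee": None, "actions": []}

# assign a responsible entity based on the results and the issue's current state
if issue["state"] == "new" and results["failed_rows"] > 0:
    issue["assignee"] = metadata["owner_team"]
    issue["state"] = "assigned"

# determine actions that would satisfy the tested condition, using the metadata
issue["actions"] = [f"backfill nulls in {metadata['dataset']}",
                    f"re-run {results['test']}"]
issue["state"] = "in_progress"   # update the state associated with the issue

print(issue)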
Techniques for managing access privileges in a data processing system include obtaining a plurality of rules for granting and/or denying privileges to a first actor to perform at least one action on a first instance of a first data entity of data entities; identifying, from among attributes of the first data entity, a first attribute whose values are used by one or more of the plurality of rules; obtaining, from a user or from at least one data store, a first value of the first attribute; identifying, using the first value and from among the plurality of rules, a first rule that depends on the first value; generating a graphical user interface (GUI) including a visual rendering of at least some of the plurality of rules, the visual rendering emphasizing the first rule identified using the first value of the first attribute; and displaying the generated GUI to the user.
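The rule-identification step could be sketched as below, with invented rules and attributes; the printed markers stand in for the GUI's emphasized rendering.

rules = [
    {"name": "deny-restricted", "attribute": "region", "value": "EU", "effect": "deny"},
    {"name": "allow-analysts",  "attribute": "role", "value": "analyst", "effect": "grant"},
]

attribute, value = "region", "EU"   # first attribute and its obtained first value

for rule in rules:                  # identify the rule that depends on this value
    rule["emphasized"] = (rule["attribute"] == attribute and rule["value"] == value)

for rule in rules:                  # a text stand-in for the visual rendering
    marker = ">>" if rule["emphasized"] else "  "
    print(f"{marker} {rule['name']}: {rule['effect']} when {rule['attribute']}={rule['value']}")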
A method is described for processing keyed data items that are each associated with a value of a key, the keyed data items being from a plurality of distinct data streams, the processing including collecting the keyed data items, determining, based on contents of at least one of the keyed data items, satisfaction of one or more specified conditions for execution of one or more actions and causing execution of at least one of the one or more actions responsive to the determining.
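A minimal sketch, with invented streams, condition, and action, of collecting keyed items across streams and firing an action once a per-key condition is satisfied:

from collections import defaultdict

stream_a = [{"key": "u1", "event": "login"}, {"key": "u2", "event": "login"}]
stream_b = [{"key": "u1", "event": "purchase"}]

collected = defaultdict(list)
for item in stream_a + stream_b:   # collect keyed items from distinct streams
    collected[item["key"]].append(item["event"])

def condition(events):             # specified condition on the items' contents
    return "login" in events and "purchase" in events

for key, events in collected.items():
    if condition(events):          # cause execution of the action
        print(f"action: send receipt to {key}")   # fires only for u1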
Techniques for storing data entities by a data processing system are described herein. The data processing system may store a plurality of data entity instances generated using a plurality of data entities. The plurality of data entity instances may include a first data entity instance generated using a first data entity and a second data entity instance generated using a second data entity. The first data entity instance may include a first attribute that is configured to inherit its value from a second attribute of the second data entity instance. The data processing system may provide the inherited value of the second attribute of the second data entity instance as the value of the first attribute of the first data entity instance.
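One way the inheritance might be modeled, using an invented Instance class, is to record for each inheriting attribute which other instance and attribute supply its value, and to resolve through that link on read:

class Instance:
    def __init__(self, attrs, inherits=None):
        self.attrs = attrs
        self.inherits = inherits or {}   # local attr -> (source instance, source attr)

    def get(self, name):
        if name in self.attrs:
            return self.attrs[name]
        src, src_attr = self.inherits[name]
        return src.get(src_attr)         # provide the inherited value

dept = Instance({"cost_center": "CC-42"})            # second data entity instance
employee = Instance({"name": "Ada"},                 # first data entity instance
                    inherits={"cost_center": (dept, "cost_center")})

print(employee.get("cost_center"))   # 'CC-42', inherited from dept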
A data processing system that receives user input specifying datasets on which operations are performed with user interfaces that enable manipulation of hierarchical groups of datasets. A user interface may enable individual datasets or a previously defined group of datasets to be aggregated into another grouping. The groupings may be scoped, including by persona of users, such that, when a user is prompted to specify one or more datasets as a target of an operation by the data processing system, the available choices are limited to datasets that have a scope encompassing that user. The interfaces may prompt a user to select a grouping within the hierarchy that contains datasets on which the operation can be performed. Upon selection of a grouping with multiple datasets as a target of an operation that is performed on datasets singly, the operation may be performed on each dataset in the selected group.
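The scoping and group-wide application could be sketched as follows, with invented groups, personas, and a placeholder operation:

groups = {
    "finance-core":  {"scope": {"analyst", "admin"}, "datasets": ["gl", "ap"]},
    "hr-restricted": {"scope": {"admin"}, "datasets": ["payroll"]},
}

def choices_for(persona):
    """Groups a user may pick: only those whose scope encompasses the persona."""
    return [g for g, info in groups.items() if persona in info["scope"]]

def apply_to_group(group, operation):
    """An operation defined on single datasets, applied to each group member."""
    return [operation(ds) for ds in groups[group]["datasets"]]

print(choices_for("analyst"))                          # ['finance-core']
print(apply_to_group("finance-core", lambda ds: f"profiled {ds}"))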
A data processing system with a dataset multiplexer that enables applications to be written to specify access to datasets as operations on logical datasets. During execution of an application by the data processing system, operations that access a dataset are implemented by accessing an entry in a dataset catalog for the logical dataset. That entry includes information to access the physical data source storing the logical dataset, including conversion of data from the format of the physical data source to the format of the logical dataset. An entry in the catalog may be created based on registration of a data source with the dataset multiplexer and may be updated automatically based on changes in storage of the dataset. This maintenance of the catalog may be partially or totally automated such that the system automatically adjusts to any changes in storage of the dataset without need for modification of any application.
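A sketch of resolving a logical dataset through such a catalog entry, with an invented catalog shape and an in-memory stand-in for physical storage, might read:

import csv, io

catalog = {
    "customers": {
        "location": "customers.csv",   # physical data source for the logical dataset
        "convert": lambda raw: list(csv.DictReader(io.StringIO(raw))),
    },
}

physical_store = {"customers.csv": "id,name\n1,Ada\n2,Grace\n"}

def read_logical(name):
    entry = catalog[name]              # the application sees only the logical name
    raw = physical_store[entry["location"]]
    return entry["convert"](raw)       # physical format -> logical format

print(read_logical("customers"))       # [{'id': '1', 'name': 'Ada'}, ...]

If the dataset moves or changes format, only the catalog entry is updated; applications calling read_logical("customers") need no modification, which is the point of routing access through the multiplexer.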
Techniques for obtaining information about data entity instances managed by a data processing system using at least one data store. The techniques include obtaining a query comprising a first portion comprising information for identifying instances of a first data entity stored in at least one data store; and a second portion indicating at least one attribute of the first data entity; generating, from the query, a plurality of executable queries including a first set of one or more executable queries and a second set of one or more executable queries, the generating comprising: generating, using the first portion, the first set of executable queries for identifying instances of the first data entity, and generating, using the second portion, the second set of executable queries for obtaining attribute values for instances of the first data entity; and executing the plurality of executable queries to obtain results for the query.
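The split into identifying queries and attribute-fetching queries could be sketched as below; the query shape and the two stand-in stores are invented for illustration.

store_ids = {"Customer": {"status='active'": ["C1", "C3"]}}   # stand-in data store
store_attrs = {("C1", "risk"): "low", ("C3", "risk"): "high"}

query = {"entity": "Customer", "filter": "status='active'", "attributes": ["risk"]}

# first set: executable query that identifies matching entity instances
ids = store_ids[query["entity"]][query["filter"]]

# second set: executable queries that fetch attribute values per instance
results = [{"id": i, **{a: store_attrs[(i, a)] for a in query["attributes"]}} for i in ids]

print(results)   # [{'id': 'C1', 'risk': 'low'}, {'id': 'C3', 'risk': 'high'}]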