A disclosed method can include (i) forming a cavity in a substrate core for a semiconductor device from a floor of the substrate core to a ceiling of the substrate core, (ii) bonding an electronic component to an element through a bonding layer to form an electronic component aggregation, (iii) disposing the electronic component aggregation within the cavity, and (iv) filling the cavity. Various other apparatuses, systems, and methods are also disclosed.
H01L 21/48 - Manufacture or treatment of parts, e.g. containers, prior to assembly of the devices, using processes not provided for in a single one of the groups
H01L 23/13 - Mountings, e.g. non-detachable insulating substrates, characterised by the shape
Systems, methods, and devices for integrated circuit power management. A mode of a power management state is entered, from the power management state, in response to an entry condition of the mode. A device that is otherwise powered off in the power management state is powered on in the mode of the power management state. In some implementations, the device includes a communications path between a second device and a third device. In some implementations, the device is in a power domain that is powered off in the power management state. In some implementations, the power domain is powered off in the mode. In some implementations, the device is powered on in the mode via a power rail that is specific to the mode. In some implementations, the entry condition of the mode includes an amount of data stored for display in a display buffer falling below a threshold amount.
Aspects of the present disclosure provide a method of fabricating a semiconductor structure that includes a plurality of unmerged source-and-drain (S/D) contacts. For example, the method can include forming over a substrate a plurality of first channels that are stacked over each other and extend along a top surface of the substrate, forming a first merged S/D contact on the first channels, forming a first gate structure for each of the first channels, removing a portion of the first merged S/D contact such that a first remaining individual S/D contact is formed on each of the first channels at one end thereof after the first gate structure is formed, and forming a first interconnect that connects the first remaining individual S/D contacts.
Aspects of the present disclosure provide a method of fabricating a semiconductor structure that includes a plurality of un-merged source-and-drain (S/D) contacts. For example, the method can include forming over a substrate a plurality of first channels that are stacked over each other and extend along a top surface of the substrate, forming on each of the first channels at one end thereof a first S/D contact, forming a first replacement dielectric material that covers the first S/D contacts, forming a first gate structure for each of the first channels, and replacing the first replacement dielectric material with a first interconnect that connects the first S/D contacts after the first gate structure is formed.
H10D 84/03 - Manufacture or treatment characterised by using material-based technologies using Group IV technology, e.g. silicon technology or silicon-carbide [SiC] technology
H10D 84/40 - Integrated devices formed in or on semiconductor substrates that comprise only semiconducting layers, e.g. on Si wafers or on GaAs-on-Si wafers, characterised by the integration of at least one component covered by certain groups with at least one component covered by other groups, e.g. integration of IGFETs with BJTs
A power manager of an apparatus exposes an application programming interface (API) usable to specify priority and quality-of-service (QoS) parameters (e.g., latency, throughput) for a workload. An application, for instance, specifies the priority and QoS parameters for a workload to be processed using a hardware compute unit. The priority and QoS parameters are employed by the power manager as a basis to configure the power setting of the hardware compute unit. In particular, resource prioritization is extended to both real-time and best-effort workloads to satisfy specified QoS parameters for inference workloads.
In accordance with the described techniques, a scalable input/output virtualization (SIOV) device includes multiple hardware queues, backend hardware resources, and a command processor running scheduling firmware. The scheduling firmware selects a shared work queue of multiple shared work queues managed by the scheduling firmware from which to dispatch tasks based on one or more dispatch policies. In addition, the scheduling firmware selects a hardware queue of the multiple hardware queues in which to enqueue the tasks based on one or more queue policies. Further, the scheduling firmware dispatches the tasks from the shared work queue to the hardware queue, and the tasks are read from the hardware queue by the backend hardware resources for execution.
A system comprises a machine check architecture and a processor. The machine check architecture is configured to log hardware errors. The processor is configured to obtain a log of one or more of the hardware errors from the machine check architecture and/or to generate a copy of the log. The processor is further configured to either (1) deliver the log to an in-band agent and the copy of the log to an out-of-band agent or (2) deliver the copy of the log to the in-band agent and the log to the out-of-band agent. Various other devices, systems, and methods are also disclosed.
Embodiments herein describe a circuit including a user domain configured to execute user functions and a hardened domain configured to communicate with the user domain. The hardened domain includes peripheral component interconnect express (PCIe) function decoding logic having a plurality of register bits and a Trusted Execution Environment (TEE) Device Interface Security Protocol (TDISP) core communicating with the PCIe function decoding logic. The TDISP core supports a plurality of PCIe functions. Each register bit of the plurality of register bits is assigned to a respective PCIe function of the plurality of PCIe functions.
Shared last level cache usage management for multiple clients is described. In one or more implementations, a system includes a shared last level cache coupled to multiple clients and a dynamic random access memory. The system further includes a linear dropout regulator that supplies power to the shared last level cache. A data fabric included in the system is configured to control a level of the power supplied from the linear dropout regulator to be either a first level or a second level based on usage of the shared last level cache.
A power manager of an apparatus exposes an application programming interface (API) usable for applications to specify priority and quality-of-service (QoS) parameters (e.g., bandwidth requirements) for a workload. An application, for instance, specifies the priority and QoS parameters for a workload to be processed using a hardware compute unit. The power manager employs the priority and QoS parameters to configure the bandwidth allocation to access a memory system. In particular, the bandwidth allocation and prioritization are dynamically extended to real-time and best-effort workloads to satisfy specified QoS parameters for inference workloads and improve user experiences.
A system includes a first chiplet and a second chiplet connected via a plurality of interconnects. The system includes a pattern generator configured to generate a test pattern on behalf of the first chiplet. The system includes a pattern checker configured to check the test pattern on behalf of the second chiplet. The system includes a first repair multiplexer and a second repair multiplexer corresponding to the first chiplet and the second chiplet, respectively. The first repair multiplexer and the second repair multiplexer are configured to selectively enable a repair path responsive to a short fault between two interconnects of the plurality of interconnects based on the checked test pattern.
A processor includes an accelerated access circuit, a data structure, and a hardware scheduler. The data structure is managed by software and bound to the accelerated access circuit. The hardware scheduler is configured to schedule, on the accelerated access circuit, a work item requesting access to the data structure. The accelerated access circuit is configured to receive a request from the work item to access the data structure. Responsive to the request, the accelerated access circuit is further configured to serialize access by the work item to the data structure, thereby preventing other work items from accessing the data structure.
The packet processing chip of a networking device includes a packet processing pipeline circuit and a programmable policer circuit. A single programmable policer circuit may use policing policy identifiers and an aggregated token bucket to police multiple network flows. A policing policy identifier may govern which policing policy is used for a network packet. Each network flow may have a flow table entry that includes a policing policy identifier and a programmable policer circuit identifier. After reading the flow table entry for processing a network packet, the packet processing pipeline circuit may send data including the policing policy identifier to the programmable policer circuit, which returns a policing decision in accordance with the policing policy and the state of the aggregated token bucket. Different network flows may use the same aggregated token bucket and different policing policies to thereby implement strict priority of some network flows over others.
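As a rough illustration of the aggregated-token-bucket policing described above, here is a minimal Python sketch. The policy semantics (a "HIGH" policy that may borrow tokens versus a "LOW" policy that is dropped when the bucket is empty) and all names are illustrative assumptions, not the patent's encoding.

```python
import time

class AggregatedTokenBucket:
    """One shared token bucket policing several network flows, where each
    flow's policing policy identifier selects how the bucket is applied."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def police(self, packet_len, policy_id):
        """Return True to forward the packet, False to drop it."""
        self._refill()
        if self.tokens >= packet_len:
            self.tokens -= packet_len
            return True
        # Assumed policy semantics: HIGH-priority flows may drive the shared
        # bucket negative, so they always pass; LOW-priority flows are dropped
        # whenever tokens run out.
        if policy_id == "HIGH":
            self.tokens -= packet_len
            return True
        return False

bucket = AggregatedTokenBucket(rate_bytes_per_s=1_000_000, burst_bytes=1500)
print(bucket.police(1500, "LOW"))    # True: bucket starts full
print(bucket.police(1500, "LOW"))    # likely False: tokens exhausted
print(bucket.police(1500, "HIGH"))   # True: high priority may borrow
```

Because all flows drain the same bucket, borrowing by high-priority flows starves low-priority ones first, which is how strict priority can fall out of a single aggregated bucket.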
A computer-implemented method for physical core-specific wear-based task scheduling can include obtaining a wear metric for each physical core of the plurality of physical cores of the at least one integrated circuit, wherein the wear metric is indicative of a physical condition of each physical core. The computer-implemented method can then schedule a plurality of tasks across at least one physical core of the plurality of physical cores based at least in part on the wear metric of each physical core of the plurality of physical cores. Various other methods, systems, and computer-readable media are also disclosed.
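A minimal sketch of how wear-based placement could look, assuming a scalar wear metric per core (higher means more degraded) and a tie-break on current load; the abstract does not specify a particular scheduling policy.

```python
def schedule_by_wear(tasks, wear):
    """Assign each task to the least-worn core, breaking ties by the number
    of tasks already placed there. 'wear' maps core id -> wear metric, where
    a higher value indicates a more degraded physical condition."""
    load = {core: 0 for core in wear}
    assignment = {}
    for task in tasks:
        core = min(wear, key=lambda c: (wear[c], load[c]))
        assignment[task] = core
        load[core] += 1
    return assignment

# Core 2 is the least worn, so it attracts work first:
print(schedule_by_wear(["t0", "t1"], {0: 0.8, 1: 0.5, 2: 0.1}))
```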
A device that defines and uses a bounding volume for testing for ray intersections with a displaced micro-mesh. The bounding volume is indirectly based on a twisted prism composed of two triangles and three bilinear patches that bounds the displaced micro-mesh. Instead of detecting intersection with the bilinear patches directly, tetrahedra that circumscribe the bilinear patches can be used. The two triangular bases and the four faces of each of the three tetrahedra make fourteen triangles. The device tests for potential intersection with the displaced micro-mesh by testing for an intersection with any of the fourteen triangles. Various other methods and systems are also disclosed.
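To make the fourteen-triangle test concrete, here is a small Python sketch. Möller-Trumbore is one standard ray-triangle test, chosen here for illustration; the abstract does not mandate a particular triangle test.

```python
def ray_hits_triangle(orig, dirn, v0, v1, v2, eps=1e-9):
    """Standard Moller-Trumbore ray-triangle intersection test."""
    sub = lambda a, b: [a[i] - b[i] for i in range(3)]
    dot = lambda a, b: sum(a[i] * b[i] for i in range(3))
    cross = lambda a, b: [a[1]*b[2] - a[2]*b[1],
                          a[2]*b[0] - a[0]*b[2],
                          a[0]*b[1] - a[1]*b[0]]
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(dirn, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return False                      # ray parallel to triangle plane
    t = sub(orig, v0)
    u = dot(t, p) / det
    if u < 0 or u > 1:
        return False
    q = cross(t, e1)
    v = dot(dirn, q) / det
    return 0 <= v and u + v <= 1 and dot(e2, q) / det > eps

def may_hit_micromesh(orig, dirn, fourteen_triangles):
    """Conservative test: the ray can only hit the displaced micro-mesh if it
    hits one of the 14 triangles (2 prism bases plus 4 faces for each of the
    3 circumscribing tetrahedra)."""
    return any(ray_hits_triangle(orig, dirn, *tri) for tri in fourteen_triangles)

base = ([0, 0, 0], [1, 0, 0], [0, 1, 0])
print(ray_hits_triangle([0.2, 0.2, -1], [0, 0, 1], *base))   # True
```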
Temporary system adjustment for component overclocking is described. In accordance with the described techniques, a processor and/or memory are operated according to first settings. During operation of the processor and/or the memory according to the first settings, a signal triggers a temporary adjustment of operation of the processor and/or the memory according to second settings. Responsive to the signal, operation of the processor and/or the memory is switched to the second settings without rebooting. After a duration, operation of the processor and/or the memory is switched back to the first settings. In one or more implementations, at least one of the first settings or the second settings overclock the processor and/or the memory.
Using artificial intelligence (AI)-based techniques to guide instruction scheduling in a compiler can improve the efficiency and code generation quality of the compiler. AI-guided scheduling of a basic block of a computer program can include obtaining first and second representations of the basic block; selecting K instruction scheduling procedures from a set of N instruction scheduling procedures based on analysis of the first representation of the basic block by a model, where 1≤K≤N.
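A sketch of the select-then-run flow implied by the abstract, assuming a learned model that scores all N procedures from the block representation; `model`, `schedulers`, and `cost_fn` are hypothetical stand-ins for the learned ranker, the N scheduling procedures, and a schedule cost estimate.

```python
def ai_guided_schedule(block, features, model, schedulers, k, cost_fn):
    """Run only the top-K of N scheduling procedures, ranked by the model's
    scores for this block, and keep the cheapest resulting schedule."""
    scores = model(features)                       # one score per procedure
    top_k = sorted(range(len(schedulers)), key=lambda i: -scores[i])[:k]
    candidates = [schedulers[i](block) for i in top_k]
    return min(candidates, key=cost_fn)

# Toy usage: two "schedulers" that sort or reverse a block of instruction ids.
best = ai_guided_schedule(
    block=[3, 1, 2],
    features=[0.0],
    model=lambda f: [0.9, 0.4],                    # ranker prefers scheduler 0
    schedulers=[lambda b: sorted(b), lambda b: list(reversed(b))],
    k=1,
    cost_fn=len)
print(best)   # [1, 2, 3] -- only the top-1 scheduler was actually run
```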
The disclosed device includes various circuit blocks and a clock tree for sending a clock signal to the circuit blocks. The clock tree includes various clock drivers. The device also includes a control circuit that power gates, in response to one of the circuit blocks being power gated, a portion of the clock tree that includes one of the clock drivers. Various other methods, systems, and computer-readable media are also disclosed.
A computer-implemented method for physical core-specific wear-based task scheduling can include obtaining a wear metric for each physical core of the plurality of physical cores of the at least one integrated circuit, wherein the wear metric is indicative of a physical condition of each physical core. The computer-implemented method can then schedule a plurality of tasks across at least one physical core of the plurality of physical cores based at least in part on the wear metric of each physical core of the plurality of physical cores. Various other methods, systems, and computer-readable media are also disclosed.
G05B 19/418 - Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
The disclosed device includes various circuit blocks and a clock tree for sending a clock signal to the circuit blocks. The clock tree includes various clock drivers. The device also includes a control circuit that power gates, in response to one of the circuit blocks being power gated, a portion of the clock tree that includes one of the clock drivers. Various other methods, systems, and computer-readable media are also disclosed.
A device that defines and uses a bounding volume for testing for ray intersections with a displaced micro-mesh. The bounding volume is indirectly based on a twisted prism composed of two triangles and three bilinear patches that bounds the displaced micro-mesh. Instead of detecting intersection with the bilinear patches directly, tetrahedra that circumscribe the bilinear patches can be used. The two triangular bases and the four faces of each of the three tetrahedra make fourteen triangles. The device tests for potential intersection with the displaced micro-mesh by testing for an intersection with any of the fourteen triangles. Various other methods and systems are also disclosed.
Workgroup processors associated with a shader program interface are provided with local launchers capable of launching shader threads partially or completely independently from the shader program interface. The local launchers maintain local queues separately from the shader program interface. The local launchers allocate resources for shader thread execution at an associated workgroup processor either directly or through a request to the shader program interface. In some implementations, the shader program interface leases resources to the local launcher in response to a request for resources and terminates the lease when the local launcher notifies the shader program interface that execution of the shader thread is complete.
Adaptive system probe action to minimize input/output dirty data transfers is described. In one or more implementations, a system includes a processor, a memory configured to store data, and a cache configured to store a portion of the data stored in the memory for execution by the processor. The system also includes a cache coherence controller including a cache line history. The cache coherence controller is configured to detect a direct memory access request from an input/output device. The direct memory access request is associated with an input/output operation involving the data. The cache coherence controller is further configured to identify a cache line associated with the direct memory access request, and, in response to the cache line history including a dirty data transfer record corresponding to the cache line, selectively send a probe to the cache based on a state of the cache line.
G06F 3/06 - Digital input from, or digital output to, record carriers
G06F 12/0808 - Multiuser, multiprocessor or multiprocessing cache systems with cache invalidating means
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
24.
SYSTEMS AND METHODS FOR INDICATING RECENTLY INVALIDATED CACHE LINES
A computing device includes detection circuitry configured to detect invalidation of a line of a cache array. The computing device additionally includes setting circuitry configured to set, in response to the detected invalidation, a spare state encoding in an entry of a partial line-based probe filter that indicates recent invalidation of the line of the cache array. The computing device also includes processing circuitry configured to process a transaction that hits on the entry of the partial line-based probe filter by avoiding a multicast probe of the cache array. Various other methods, systems, and computer-readable media are also disclosed.
G06F 12/0891 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
25.
Adaptive Quantum Instruction Scheduler for Quantum Parallel Processing Units
A quantum computing device includes a plurality of quantum parallel processing units (Q-PPUs) configured to execute a set of quantum instructions of a quantum application program. The quantum computing device includes an adaptive quantum instruction scheduler to dynamically distribute the set of quantum instructions to the plurality of Q-PPUs based, at least in part, upon a measured probability of a desired result of executing the set of quantum instructions of the quantum application program and a decoherence time of a qubit.
G06N 10/80 - Quantum programming, e.g. interfaces, languages or software-development kits for creating or handling programs capable of running on quantum computers; Platforms for simulating or accessing quantum computers, e.g. cloud-based quantum computing
26.
Adaptive System Probe Action to Minimize Input/Output Dirty Data Transfers
Adaptive system probe action to minimize input/output dirty data transfers is described. In one or more implementations, a system includes a processor, a memory configured to store data, and a cache configured to store a portion of the data stored in the memory for execution by the processor. The system also includes a cache coherence controller including a cache line history. The cache coherence controller is configured to detect a direct memory access request from an input/output device. The direct memory access request is associated with an input/output operation involving the data. The cache coherence controller is further configured to identify a cache line associated with the direct memory access request, and, in response to the cache line history including a dirty data transfer record corresponding to the cache line, selectively send a probe to the cache based on a state of the cache line.
A method for reducing cache fills can include training a filter, by at least one processor and in response to at least one of eviction or rewrite of one or more entries of a cache, the filter indicating one or more cache loads from which the one or more entries were previously filled. The method can also include preventing, by the at least one processor and based on the trained filter, one or more subsequent fills to the cache from the one or more cache loads. Various other methods and systems are also disclosed.
G06F 12/0891 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
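One way the trained filter in the preceding abstract could be realized is a per-load-site counter that accumulates evidence when lines die without reuse and is consulted before subsequent fills; a minimal sketch, with the threshold and decay as illustrative assumptions.

```python
class FillFilter:
    """Counter-based sketch of a trained fill filter: each cache load site
    (e.g., a fill PC) accumulates evidence when its lines are evicted or
    rewritten without reuse, and fills from distrusted sites are prevented."""

    def __init__(self, threshold=2):
        self.dead_fills = {}          # load site -> dead-fill counter
        self.threshold = threshold

    def train(self, load_site, was_reused):
        """Called on eviction or rewrite of an entry filled from load_site."""
        c = self.dead_fills.get(load_site, 0)
        self.dead_fills[load_site] = max(0, c - 1) if was_reused else c + 1

    def allow_fill(self, load_site):
        """Called before a subsequent fill from load_site."""
        return self.dead_fills.get(load_site, 0) < self.threshold

f = FillFilter()
f.train(load_site=0x400, was_reused=False)
f.train(load_site=0x400, was_reused=False)
print(f.allow_fill(0x400))   # False: this site keeps filling dead lines
print(f.allow_fill(0x500))   # True: no negative history
```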
An example device can include at least one network controller configured to receive a data request and to retrieve data based on the data request, and a cache agent configured to receive a data access parameter based on the data request, and reconfigure a cache for at least one memory cache based on the data access parameter. The data request can be received from a computer device and the data can be retrieved from at least one memory device. An example data access parameter can include a latency of at least one network-attached memory device to retrieve data from the at least one memory device based on the data request. An example device can further comprise a flit profiler configured to determine the data access parameter. Various other methods, systems, and computer-readable media are also disclosed.
An exemplary apparatus for interfacing dies that use incompatible protocols includes a first die that uses a first protocol, a second die that uses a second protocol, and a die management unit communicatively coupled to both the first die and the second die in an integrated circuit. In some examples, the die management unit is configured to translate at least one message between the first protocol and the second protocol to support communication between the first die and the second die. Various other apparatuses, systems, and methods are also disclosed.
G06F 11/22 - Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
Workgroup processors associated with a shader program interface are provided with local launchers capable of launching shader threads partially or completely independently from the shader program interface. The local launchers maintain local queues separately from the shader program interface. The local launchers allocate resources for shader thread execution at an associated workgroup processor either directly or through a request to the shader program interface. In some implementations, the shader program interface leases resources to the local launcher in response to a request for resources and terminates the lease when the local launcher notifies the shader program interface that execution of the shader thread is complete.
A power manager of an apparatus exposes an application programming interface (API) usable to specify priority and quality-of-service (QoS) parameters (e.g., latency, throughput) for a workload. An application, for instance, specifies the priority and QoS parameters for a workload to be processed using a hardware compute unit. The priority and QoS parameters are employed by the power manager as a basis to configure the power setting of the hardware compute unit. In particular, resource prioritization is extended to both real-time and best-effort workloads to satisfy specified QoS parameters for inference workloads.
Shared last level cache usage management for multiple clients is described. In one or more implementations, a system includes a shared last level cache coupled to multiple clients and a dynamic random access memory. The system further includes a linear dropout regulator that supplies power to the shared last level cache. A data fabric included in the system is configured to control a level of the power supplied from the linear dropout regulator to be either a first level or a second level based on usage of the shared last level cache.
An exemplary method for performing lane-specific error detection in high-speed data links involves receiving, at a receiver, an ordered set of data from a transmitter communicatively coupled to the receiver via a data link. The exemplary method also involves identifying, in the ordered set of data, an error-detection reference. The exemplary method further involves performing, in connection with one data lane of the data link, an error-detection operation on the ordered set of data based at least in part on the error-detection reference. Various other devices, systems, and methods are also disclosed.
An exemplary apparatus for interfacing dies that use incompatible protocols includes a first die that uses a first protocol, a second die that uses a second protocol, and a die management unit communicatively coupled to both the first die and the second die in an integrated circuit. In some examples, the die management unit is configured to translate at least one message between the first protocol and the second protocol to support communication between the first die and the second die. Various other apparatuses, systems, and methods are also disclosed.
Processing systems and associated methods are provided for managing tasks and handling errors based on criticality levels assigned to individual processing circuitry blocks. The system includes a plurality of processing circuitry blocks, each assigned a level of criticality by a critical domain manager, such that errors arising in relation to these blocks are handled in accordance with their assigned level of criticality. The system features timers assigned to tasks initiated by the processing circuitry blocks, with timeout durations determined by the criticality levels respectively assigned to the processing circuitry blocks and tasks initiated by them. Upon timeout expiration without task completion, various error handling procedures are selected and executed based on the assigned criticality levels.
G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
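A minimal sketch of the criticality-scaled timeouts from the abstract above, using Python threads; the timeout budgets and the escalation policy per level are assumed for illustration.

```python
import threading

# Illustrative timeout budgets per criticality level (values assumed).
TIMEOUTS_S = {"high": 0.010, "medium": 0.100, "low": 1.000}

def handle_timeout(task_id, criticality):
    """Error handling escalates with the assigned criticality (assumed policy)."""
    actions = {"high": "reset block and fail over",
               "medium": "retry task",
               "low": "log and continue"}
    print(f"task {task_id}: {actions[criticality]}")

def start_task(task_id, criticality):
    """Arm a timer when the task is initiated; it fires only if the task
    has not completed within its criticality-derived timeout."""
    done = threading.Event()
    def on_expire():
        if not done.is_set():
            handle_timeout(task_id, criticality)
    timer = threading.Timer(TIMEOUTS_S[criticality], on_expire)
    timer.start()
    return done        # caller sets this event on task completion

done = start_task("dma-7", "medium")
done.set()             # task completed in time: no error handling runs
```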
A system includes a first chiplet and a second chiplet connected via a plurality of interconnects. The system includes a pattern generator configured to generate a test pattern on behalf of the first chiplet. The system includes a pattern checker configured to check the test pattern on behalf of the second chiplet. The system includes a first repair multiplexer and a second repair multiplexer corresponding to the first chiplet and the second chiplet, respectively. The first repair multiplexer and the second repair multiplexer are configured to selectively enable a repair path responsive to a short fault between two interconnects of the plurality of interconnects based on the checked test pattern.
An exemplary method for performing lane-specific error detection in high-speed data links involves receiving, at a receiver, an ordered set of data from a transmitter communicatively coupled to the receiver via a data link. The exemplary method also involves identifying, in the ordered set of data, an error-detection reference. The exemplary method further involves performing, in connection with one data lane of the data link, an error-detection operation on the ordered set of data based at least in part on the error-detection reference. Various other devices, systems, and methods are also disclosed.
The disclosed device includes various components and a control circuit for managing performance states of the components. The control circuit can receive an event trigger corresponding to one of the components, monitor an activity metric for the component, and update a performance state of the component based on the event trigger and the activity metric. Various other methods, systems, and computer-readable media are also disclosed.
A system includes a memory device and a memory controller operatively connected to the memory device via a physical layer interface (PHY). The PHY includes an active first-in-first-out (FIFO) buffer configured to receive commands from the memory controller. The PHY also includes one or more on-demand FIFO buffers configured to be selectively enabled by the active first-in-first-out buffer to handle a data payload. The system ensures efficient power usage by gating clocks and clock distribution to the one or more on-demand FIFO buffers.
A technique for rendering is provided. The technique includes mapping a randomization portion of an item of identifying information to a random block of an address space; mapping a linear portion of the item of identifying information to an element within the block; and accessing the element.
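A sketch of the two-part mapping described above, assuming the randomization portion is the identifier's high bits and the linear portion its low byte, with a cryptographic hash standing in for the random block selection; the abstract only requires that the two portions be mapped separately.

```python
import hashlib

NUM_BLOCKS = 1024     # blocks in the address space (illustrative)
BLOCK_SIZE = 256      # elements per block (illustrative)

def element_address(identifier):
    """Split the item of identifying information into a randomization portion
    (high bits, hashed to pick a block) and a linear portion (low bits, used
    directly as the element index within that block)."""
    rand_part = identifier >> 8          # randomization portion
    linear_part = identifier & 0xFF      # linear portion
    digest = hashlib.blake2b(rand_part.to_bytes(8, "little")).digest()
    block = int.from_bytes(digest[:4], "little") % NUM_BLOCKS
    return block * BLOCK_SIZE + linear_part

# Consecutive identifiers share a block and land on consecutive elements:
print(element_address(0x1234_00), element_address(0x1234_01))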
Devices, methods, and systems for communicating debugging information. Debugging information is stored from debugging hardware of an integrated circuit into a memory of the integrated circuit. The debugging information is retrieved from the memory and encapsulated in a packet. The packet is transmitted over an interface to a device that is external to the integrated circuit. In some implementations, the debugging information is stored in MMIO space of the memory that is not mapped to registers of the integrated circuit. In some implementations, the debugging information is stored in a MMIO space of the memory, wherein a base address of the MMIO space is indicated in a base address register (BAR) of the integrated circuit. In some implementations, the debugging information is encapsulated in a USB4 packet and transmitted over a USB4 interface to the device that is external to the integrated circuit.
Systems and methods described herein use multiple reduced-precision intersection testers in parallel to determine candidate nodes to traverse in a wide BVH. Primitives are quantized to generate primitive packets that are stored compactly in, with, or near a leaf node. At the leaves of the BVH, these intersection testers test a ray simultaneously against a plurality of triangles in the primitive packet to find candidate triangles that require full-precision intersection. Triangles or primitives that generate an inconclusive result during low-precision testing are retested using full-precision testers to definitively determine ray-triangle hits or misses. Testing the quantized triangles simultaneously using low-precision testers culls instances in which the ray misses a box or a triangle, so those cases need not be tested at higher precision.
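A rough Python model of the two-pass idea: a conservative low-precision test culls definite misses, and only inconclusive triangles proceed to full precision. The coarse-grid quantization here is illustrative, not the patented packet encoding.

```python
import math

def quantized_aabb(tri, scale=16.0):
    """Snap a triangle's bounding box outward to a coarse grid, mimicking a
    reduced-precision representation stored in the primitive packet."""
    lo = [math.floor(min(v[i] for v in tri) * scale) / scale for i in range(3)]
    hi = [math.ceil(max(v[i] for v in tri) * scale) / scale for i in range(3)]
    return lo, hi

def ray_hits_box(orig, inv_dir, lo, hi):
    """Standard slab test; conservative because the box was snapped outward."""
    tmin, tmax = 0.0, math.inf
    for i in range(3):
        t1 = (lo[i] - orig[i]) * inv_dir[i]
        t2 = (hi[i] - orig[i]) * inv_dir[i]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def survivors_for_full_precision(orig, inv_dir, packet):
    """Low-precision pass over every triangle in the packet: definite misses
    are culled, and only inconclusive hits are retested at full precision."""
    return [tri for tri in packet
            if ray_hits_box(orig, inv_dir, *quantized_aabb(tri))]

tri = ([0, 0, 5], [1, 0, 5], [0, 1, 5])
print(survivors_for_full_precision([0.1, 0.1, 0.0], [1e9, 1e9, 1.0], [tri]))
```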
An apparatus and method for efficiently processing cache accesses of an integrated circuit. In various implementations, a computing system includes a cache with a tag array, a cache controller, and a data array. The cache controller includes a cache set status array. The cache set status array stores data using any of a variety of flip-flop circuits, which reduces access times and power consumption compared to random access memory (RAM) cells. In each pipeline stage prior to updating the cache set status array, the cache controller conditionally updates cache set status values based on comparisons between a selected set of the memory access request and a set of a previous memory access request that has not yet updated the status array. Updates of cache status values based on the tag comparison occur in the second pipeline stage, which allows the clock cycle to be reduced.
G06F 12/126 - Replacement control using replacement algorithms with special data handling, e.g. priority of data or instructions, handling errors or pinning
G06F 12/0864 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
45.
GENERATION OF OVERCLOCKING PROFILE RECOMMENDATIONS FOR COMPUTING SYSTEMS
An apparatus and method for efficiently increasing computing system performance by using operating settings that exceed the manufacturer's default range of settings. In various implementations, a computing system includes a server that communicates with multiple client devices through a network. The server receives an overclocking request from a client device for an overclocking recommendation profile. The server searches a database that stores multiple overclocking recommendation profiles based on test data generated by multiple system configurations. The server either receives an overclocking recommendation profile from the database or generates one using one of several hardware-based and software-based systems. The server then sends the overclocking recommendation profile to the client device.
A method includes forming a plurality of thermal sensing elements at predetermined locations on a semiconductor chip proximate to a target location, measuring a temperature of the semiconductor chip at each predetermined location using a corresponding one of the plurality of thermal sensing elements, and determining a temperature at the target location using the temperatures measured at each of the predetermined locations.
H01L 21/66 - Testing or measuring during manufacture or treatment
H01L 27/06 - Devices consisting of a plurality of semiconductor or other solid-state components formed in or on a common substrate including integrated passive circuit elements with at least one potential-jump barrier or surface barrier, the substrate being a semiconductor body, including a plurality of individual components in a non-repetitive configuration
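The abstract above does not name an interpolation method; inverse-distance weighting is one plausible way to combine readings from sensors placed around a target location, sketched here in Python.

```python
def temperature_at(target, sensors):
    """Inverse-distance-weighted estimate of the temperature at a target
    location from readings taken at predetermined sensor locations.
    'sensors' is a list of ((x, y), temperature) pairs."""
    num = den = 0.0
    for (x, y), temp in sensors:
        d2 = (x - target[0]) ** 2 + (y - target[1]) ** 2
        if d2 == 0:
            return temp           # a sensor sits exactly at the target
        w = 1.0 / d2
        num += w * temp
        den += w
    return num / den

# Three sensors around a hotspot at (1.0, 1.0):
print(temperature_at((1.0, 1.0),
                     [((0, 0), 70.0), ((2, 0), 74.0), ((1, 2), 78.0)]))  # 75.0
```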
47.
SYSTEMS AND METHODS FOR POWER CONTROL IN 3D STACKED DIE
A method for controlling power in 3D stacked die can include configuring a first die of a set of 3D stacked die to receive power from a power source, wherein the first die includes one or more field effect transistors configured to control the power. The method can also include configuring one or more power domains included in a second die of the set of 3D stacked die to receive the power that is controlled by the one or more field effect transistors included in the first die. Various other methods and systems are also disclosed.
H01L 23/00 - Details of semiconductor or other solid state devices
H01L 25/065 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices, all the devices being of a type provided for in a single subclass, e.g. assemblies of rectifier diodes, the devices not having separate containers
48.
APPARATUS, SYSTEM, AND METHOD FOR UNIQUELY IDENTIFYING INDIVIDUAL DIES ACROSS DIE STACKS
An exemplary apparatus for uniquely identifying individual dies across die stacks includes a die stack and a plurality of signals arranged across the die stack. The plurality of signals are manipulated to form a unique identifier for each die included in the die stack. Various other apparatuses, systems, and methods are also disclosed.
H01L 23/544 - Marks applied to semiconductor devices, e.g. registration marks, test patterns
H01L 21/68 - Apparatus specially adapted for handling semiconductor or electric solid state devices during manufacture or treatment thereof; Apparatus specially adapted for handling wafers during manufacture or treatment of semiconductor or electric solid state devices or components for positioning, orientation or alignment
49.
APPARATUS, SYSTEM, AND METHOD FOR BALANCING TIMING CLOSURE
An integrated circuit die includes a set of electronic circuits disposed on a semiconductor material. The integrated circuit die also includes one or more through-silicon vias that vertically span the semiconductor material to transmit data signals. Additionally, the integrated circuit die includes a programmable delay element integrated with the set of electronic circuits on the semiconductor material and configured to delay data signals. Various other apparatuses, systems, and methods are also disclosed.
H01L 25/065 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices, all the devices being of a type provided for in a single subclass, e.g. assemblies of rectifier diodes, the devices not having separate containers
H01L 25/00 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices
H03K 5/00 - Manipulation of pulses not covered by one of the other main groups of this subclass
A bonded die assembly includes first conductive pads of a first substrate each bonded to respective second conductive pads of a second substrate, the first and second conductive pads arrayed at an inter-pad spacing, a plurality of active components located in the second substrate and arrayed at an inter-component spacing, and a metallization structure disposed between the first substrate and the second substrate, where the metallization structure is configured to decrease the inter-component spacing relative to the inter-pad spacing. The die assembly is characterized by an improved utilization of available device active area.
H01L 23/00 - Details of semiconductor or other solid state devices
H01L 23/48 - Arrangements for conducting electric current to or from the solid state body in operation, e.g. leads or terminal arrangements
H01L 23/522 - Arrangements for conducting electric current within the device in operation from one component to another including external interconnections consisting of a multilayer structure of conductive and insulating layers inseparably formed on the semiconductor body
51.
VOLTAGE MARGIN OPTIMIZATION BASED ON WORKLOAD SENSITIVITY
An apparatus and method for efficiently managing, based on workload types, current transients that cause voltage transients on a power rail of an integrated circuit. In various implementations, a computing system includes a memory subsystem, multiple clients for processing tasks, and a communication fabric that transfers data between the memory subsystem and the multiple clients. If circuitry of the communication fabric (or “fabric”) determines the type of workload being processed by the multiple clients is memory latency sensitive and not peak memory bandwidth sensitive, then the circuitry assigns a high-performance clock frequency to the communication fabric and reduces an issue rate of memory requests to the memory subsystem. The reduced issue rate reduces the current transients, which reduces the amount of voltage droop margin to use for the workload. Accordingly, the communication fabric consumes less power. In response, power credits are redistributed from the communication fabric to the multiple clients.
Systems, apparatuses, and methods for adaptive graph repartitioning of a computational graph representing at least a portion of a neural network are described. Parallelization of a neural network model includes executing some partitions of a computational graph on accelerators while other partitions can be executed on a CPU. When executing these partitions using specific computing devices, operational parameters of different computing devices and overall availability of system resources can keep changing over time. Adaptive graph repartitioning includes repartitioning a computational graph that includes nodes representing various operations for a neural network model. Dynamic repartitioning of the graph partitions can be performed to better distribute load between devices for different conditions such as exploiting different device strengths, minimizing the idle time of the devices, and generating different partition configurations.
Methods and systems for efficient data movement between various components in an image processing system are described. In a first mode, a dedicated data bus is configured for direct transmission of image data between the ISP core and the video core and/or display core. Using this bus, the video core and/or display core can request data from the ISP core. The ISP core transmits data to the video core and/or display core in response to the request. If the video core is unable to process incoming data from the ISP core at the rate at which data are generated by the ISP, it can apply backpressure to the ISP core to pause transmission of data, e.g., by not consuming data transmitted from the ISP core. In a second mode, the data transmission is performed using a cache memory or buffer as an intermediary.
Systems, methods, and apparatus for partial tensor correction are disclosed. During quantization, weight tensors can be corrected for quantization errors in order to increase accuracy that is otherwise degraded as a result of quantization. To correct errors, the weight tensor is partially corrected using a data-free, non-iterative, per-input-channel technique to achieve accuracy improvement while using lower precision. Further, sensitive channels prone to accuracy degradation due to quantization are identified. Based on this identification, parts of the weight tensor are retained for CPU computation and the remaining parts are offloaded for accelerator computation. The proposed partial tensor retention scheme achieves efficient heterogeneous DNN computations with improved performance and accuracy on heterogeneous systems. Furthermore, combining the partial tensor correction and partial tensor retention techniques achieves improved performance and accuracy in a heterogeneous computing environment while using low-precision computations.
A method of operating an electronic device is disclosed. The method includes generating an initial time value, generating an initial time stamp based on the initial time value, and transmitting the initial time stamp. The initial time stamp has a number of bits. For a set period of time, the time stamp is updated and the updated time stamp is transmitted from the electronic device. After the set period of time, the time stamp is adjusted by a residual time amount and the adjusted time stamp is transmitted from the electronic device. Other implementations are disclosed.
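A generator-style sketch of the stamp-then-correct behavior described above, with all parameters (increment, period, residual, stamp width) as illustrative values.

```python
def timestamp_stream(initial_time, increment, period_ticks, residual, n_bits=32):
    """Emit a time stamp of n_bits each tick for a set period, then apply a
    one-time residual correction and emit the adjusted stamp."""
    mask = (1 << n_bits) - 1
    t = initial_time
    for tick in range(period_ticks):
        yield t & mask
        t += increment                    # regular per-tick update
    t += residual                         # residual adjustment after the period
    yield t & mask

for stamp in timestamp_stream(initial_time=1000, increment=10,
                              period_ticks=3, residual=-2):
    print(stamp)                          # 1000, 1010, 1020, then 1028
```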
The disclosed computer-implemented method includes configuring a cache with a cache addressing scheme that increases a capacity of each entry of the cache, compressing a data segment for storing in the cache, and storing metadata of the compression in the cache with the compressed data segment. Various other methods, systems, and computer-readable media are also disclosed.
G06F 12/0864 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
G06F 12/0853 - Cache with multiport tag or data arrays
57.
INCREASED SHIFT FREQUENCY FOR MULTI-CHIP-MODULE SCAN
The disclosed device includes an input, an output and multiple flops serially connected between the input and the output. The device also includes a bypass circuit coupled to the input and the output and a control circuit. The control circuit can enable, in response to a first trigger, the flops to shift data to the output. In response to a second trigger, the control circuit can disable the flops and enable the bypass circuit to shift data from the input to the output. The flops can be part of a chain of flops. Various other methods, systems, and computer-readable media are also disclosed.
Embodiments herein describe a hardware accelerator that includes multiple power or clock domains. For example, the hardware accelerator can include an array of data processing engines (DPEs) where different subsets of the DPEs (e.g., different columns, rows, or blocks) are disposed in different power or clock domains within the hardware accelerator. When one or more subsets of the DPEs are idle (e.g., the hardware accelerator has not assigned any tasks to those DPEs), the accelerator can deactivate the corresponding power or clock domain (or domains), which deactivates the DPEs in those domains while the DPEs in the other power or clock domains remain operational. As such, idle DPEs can be deactivated to conserve energy while DPEs with work can remain operational.
Embodiments herein describe integrating an accelerator into a same SoC (or same chip or IC) as a CPU. The SoC also includes a controller (e.g., a microcontroller) that orchestrates data processing engines (DPEs) in the accelerator. The controller (or orchestrator) receives a task from the CPU and then configures the DPEs to perform the task. For example, the controller may divide the task into a sequence of operations that are performed by one or more of the DPEs. The controller can then report back to the CPU when the task is complete.
Embodiments herein describe integrating an AI accelerator into a same SoC (or same chip or IC) as a CPU. Thus, instead of relying on off-chip communication techniques, on-chip communication techniques such as an interconnect (e.g., a NoC) can be used to facilitate communication. This can result in faster communication between the AI accelerator and the CPU. Moreover, a tighter integration between the CPU and AI accelerator can make it easier for the CPU to offload AI tasks to the AI accelerator. In one embodiment, the AI accelerator includes address translation circuitry for translating virtual addresses used in the AI accelerator to physical addresses used to store the data.
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 12/1036 - Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB] for multiple virtual address spaces, e.g. segmentation
61.
CONTROLLER FOR AN ARRAY OF DATA PROCESSING ENGINES
Embodiments herein describe integrating an accelerator into a same SoC (or same chip or IC) as a CPU. The SoC also includes a controller (e.g., a microcontroller) that orchestrates data processing engines (DPEs) in the accelerator. The controller (or orchestrator) receives a task from the CPU and then configures the DPEs to perform the task. For example, the controller may divide the task into a sequence of operations that are performed by one or more of the DPEs. The controller can then report back to the CPU when the task is complete.
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 13/42 - Bus transfer protocol, e.g. handshake; Synchronisation
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
G06F 12/1036 - Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB] for multiple virtual address spaces, e.g. segmentation
G06F 12/0831 - Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
62.
FLEXIBLE ALLOCATION OF PROCESSORS FOR SAFETY-CRITICAL AND NON-CRITICAL APPLICATIONS
Devices and methods for allocating components of a safety-critical system are provided. The processing device comprises a plurality of resources including memory, a host processor, and a plurality of processors connected to the resources via a shared pathway of a network and configured to execute an application based on instructions from the host processor. Each of the plurality of processors is assigned to one of a plurality of criticality domain levels, and isolated pathways are created, via the shared pathway, between the plurality of processors and the plurality of resources based on which of the processors are assigned to one or more of the plurality of criticality domain levels to access one or more of the plurality of resources. The application is executed using the network. The isolated pathways are, for example, created by disabling one or more switches. Alternatively, the isolated pathways are created via programmable logic.
G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
G06F 11/14 - Error detection or correction of the data by redundancy in operation, e.g. by using different operation sequences leading to the same result
G06F 15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star or snowflake
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
63.
SYSTEMS AND METHODS FOR DIRECT DATA TRANSMISSION IN IMAGE PROCESSING SYSTEMS
Methods and systems for efficient data movement between various components in an image processing system are described. In a first mode, a dedicated data bus is configured for direct transmission of image data between the ISP core and the video core and/or display core. Using this bus, the video core and/or display core can request data from the ISP core. The ISP core transmits data to the video core and/or display core in response to the request. If the video core is unable to process incoming data from the ISP core at the rate at which data are generated by the ISP, it can apply backpressure to the ISP core to pause transmission of data, e.g., by not consuming data transmitted from the ISP core. In a second mode, the data transmission is performed using a cache memory or buffer as an intermediary.
Systems and methods for efficient sharing of memory space in cloud-based applications are described. Data that can be shared between multiple instances of an application is identified and a dedicated memory space is allocated to such data. Whether the data can be shared or not is determined based on the data's content, to avoid corruption and irregular allocations. In conditions where data needs to be shared, a processing circuitry can determine if the data is already in use by another application instance. If so, a shared memory comprising the data is identified and a reference counter for the shared memory is updated. If no other application instances currently use the data, a selected shared memory is assigned to the data and the data is copied from its dedicated memory space to the selected shared memory. In either condition, the original memory space is freed up, thereby ensuring efficient memory usage.
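A content-keyed, reference-counted pool is one way to realize the sharing described above; a minimal sketch, with Python's hash() standing in for a real content fingerprint.

```python
class SharedMemoryPool:
    """Content-addressed sharing sketch: identical data used by several
    application instances is stored once and reference-counted."""

    def __init__(self):
        self.segments = {}        # content key -> [data, refcount]

    def attach(self, data):
        key = hash(data)          # stands in for a content fingerprint
        if key in self.segments:
            self.segments[key][1] += 1      # data already shared: bump count
        else:
            self.segments[key] = [data, 1]  # first user: copy into shared pool
        return key                # caller then frees its private copy

    def detach(self, key):
        self.segments[key][1] -= 1
        if self.segments[key][1] == 0:
            del self.segments[key]          # last user gone: release segment

pool = SharedMemoryPool()
k1 = pool.attach(b"texture-atlas")    # first instance: data copied to pool
k2 = pool.attach(b"texture-atlas")    # second instance: refcount bumped
pool.detach(k1)
pool.detach(k2)                       # segment released
```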
Systems, methods, and apparatus for partial tensor correction are disclosed. During quantization, weight tensors can be corrected for quantization errors in order to increase accuracy that is otherwise degraded as a result of quantization. To correct errors, the weight tensor is partially corrected using a data-free, non-iterative, per-input-channel technique to achieve accuracy improvement while using lower precision. Further, sensitive channels prone to accuracy degradation due to quantization are identified. Based on this identification, parts of the weight tensor are retained for CPU computation and the remaining parts are offloaded for accelerator computation. The proposed partial tensor retention scheme achieves efficient heterogeneous DNN computations with improved performance and accuracy on heterogeneous systems. Furthermore, combining the partial tensor correction and partial tensor retention techniques achieves improved performance and accuracy in a heterogeneous computing environment while using low-precision computations.
The disclosed device includes multiple mesh lanes for sending data packets across the device. The device also includes a control circuit that can detect a low bandwidth workload and reroute data packets to avoid one or more mesh lanes. The control circuit can then disable the avoided mesh lanes. Various other methods, systems, and computer-readable media are also disclosed.
An exemplary apparatus for uniquely identifying individual dies across die stacks includes a die stack and a plurality of signals arranged across the die stack. The plurality of signals are manipulated to form a unique identifier for each die included in the die stack. Various other apparatuses, systems, and methods are also disclosed.
G11C 5/02 - Disposition of storage elements, e.g. in the form of a matrix array
G11C 5/06 - Arrangements for interconnecting storage elements electrically, e.g. by wiring
H01L 25/18 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices, the devices being of types provided for in two or more different main groups of the same subclass
A processing system employs a hardware signal monitor (HSM) to manage signaling for processing units. The HSM monitors designated memory addresses assigned to signals generated by one or more processing units. Moreover, a single HSM is used to receive and process multiple signals, such as different signals received from the one or more processing units. Thus, the HSM improves scalability by managing multiple signals and, correspondingly, is able to monitor a greater number of active tasks completed by the one or more processing units.
A technique is provided. The technique includes opening an aperture for processing partial results; receiving partial results in the aperture; and processing the partial results to generate final results.
A technique for improving performance of a hash operation on a processor is provided, in which an input value is hashed into a second value corresponding to a number of bins. The number of bins is an integer that corresponds to a product of first and second integers, the first integer corresponding to a prime number and the second integer corresponding to a power of two. A first modulo hashing operation is performed in which the input value is hashed into the first integer. A second hashing operation is performed using less than all bits of the input value. An output value is formed by concatenating a result of the first hashing operation with a result of the second hashing operation.
H04L 9/32 - Arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system
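The bin construction in the abstract above is concrete enough to sketch directly: hash into a prime p with a modulo, hash into 2^k with a bit mask over only k bits of the input, and concatenate the results, giving p * 2^k bins.

```python
def hybrid_hash(x, prime, k):
    """Hash x into prime * 2**k bins by concatenating a mod-prime hash
    (high part) with the k low-order bits of x (low part)."""
    high = x % prime              # first hashing operation: modulo a prime
    low = x & ((1 << k) - 1)      # second: uses less than all bits of x
    return (high << k) | low      # concatenation: a bin in [0, prime * 2**k)

# 3 * 2**2 = 12 bins, all of which are reachable:
print(sorted({hybrid_hash(x, prime=3, k=2) for x in range(48)}))
```

This avoids a full-width modulo by the composite bin count: only the cheap prime modulo and a mask are needed.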
Devices, methods, and systems for load fusion. In an explicit approach, a first load operation and a second load operation, in a stream of operations, are replaced with a single load operation. One or more operations, configured to move and shift a value stored in a destination register of the single load operation, are inserted after the single load operation in the stream of operations. In an implicit approach, information of a first load operation is inserted into a tracking table. Information of a second load operation is inserted into the tracking table responsive to an address of the second load operation being within a range. A load operation is executed from an address indicated by the first load operation and from an address indicated by the second load operation based on the tracking table.
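A sketch of the implicit approach's tracking table, assuming a fixed fusion window around the first load's address; the entry format and window size are illustrative.

```python
class LoadFusionTable:
    """Implicit load-fusion sketch: record a first load; if a second load's
    address falls within a fusion window of it, execute one wider load that
    covers both addresses."""

    def __init__(self, window=8):
        self.window = window
        self.entries = []         # list of (pc, addr, size)

    def observe(self, pc, addr, size):
        for epc, eaddr, esize in self.entries:
            if eaddr <= addr < eaddr + self.window:
                span = max(eaddr + esize, addr + size) - eaddr
                return ("fused_load", eaddr, span)   # one load covers both
        self.entries.append((pc, addr, size))
        return ("load", addr, size)

tbl = LoadFusionTable()
print(tbl.observe(pc=0x10, addr=0x1000, size=4))   # ('load', 4096, 4)
print(tbl.observe(pc=0x14, addr=0x1004, size=4))   # ('fused_load', 4096, 8)
```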
A technique for controlling processing precision is provided. The technique includes identifying a first set of execution instances to operate at a normal precision and a second set of execution instances to operate at a reduced precision; and operating the first set of execution instances at the normal precision and the second set of execution instances at the reduced precision.
A data processing system includes a data processor and a memory. The data processor is for issuing memory commands including a first memory command that accesses data of a first size. The memory is operative to transfer data of the first size by separating a first portion of data from a second portion of data by a data gap. The data processor is operable to selectively prioritize and issue a second memory command after issuing the first memory command at a time that fills the data gap.
A thermal management system for an integrated circuit can include plural micro vapor chambers each configured to operate within its local environment. An exemplary system includes a semiconductor die, a first micro vapor chamber coupled with a first region of the semiconductor die, and a second micro vapor chamber coupled with a second region of the semiconductor die.
A technique includes determining a base decision rate; monitoring for key events; and based on the base decision rate and the monitoring, determining a time at which to generate an action to be performed by an application entity of an application. The base decision rate includes a baseline rate at which the application is directed to determine new actions to be performed by the application entity. In some examples, the base decision rate is determined using a trained AI model, by applying information about the state of the application to the model and obtaining the base decision rate in response. In some examples, key events are unexpected events that occur in the application. In some examples, since the base decision rate represents a rate at which to generate actions, given the current state of the application, the key events, which represent unexpected events, override or modify the base decision rate.
A63F 13/67 - Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
A63F 13/56 - Computing the motion of game characters with respect to other game characters, game objects or elements of the game scene, e.g. for simulating the behaviour of a group of virtual soldiers or for path finding
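A minimal sketch of how key events might override a base decision rate, per the description above (the scheduling rule and names are assumptions, not the source's method):

    def next_action_time(now: float, base_rate_hz: float,
                         key_event: bool) -> float:
        if key_event:
            return now              # an unexpected key event overrides the rate
        return now + 1.0 / base_rate_hz  # otherwise follow the base rate

    assert next_action_time(10.0, 4.0, key_event=False) == 10.25
    assert next_action_time(10.0, 4.0, key_event=True) == 10.0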
A system-on-chip structure includes a substrate, first and second die disposed over a surface of the substrate and separated by an inter-die gap, a protection layer disposed over a sidewall of each of the first and second die, and a gap fill layer disposed over the protection layers and substantially filling the inter-die gap.
H01L 25/10 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices all the devices being of a type provided for in a single subclass of subclasses , , , , or , e.g. assemblies of rectifier diodes the devices having separate containers
H01L 21/56 - Encapsulations, e.g. encapsulating layers, coatings
H01L 23/29 - Encapsulation, e.g. encapsulating layers, coatings characterised by the material
H01L 23/31 - Encapsulation, e.g. encapsulating layers, coatings characterised by the arrangement
H01L 25/00 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices
78.
MULTI-MODE POWER STAGE ARCHITECTURE FOR PULSE WIDTH MODULATION CONTROLLER
A driving power stage can receive a PWM signal from the PWM controller and output a PWM signal to a secondary power stage. The secondary power stage can be turned off by the driving power stage during a light-load mode. For a single-phase application, the driving power stage can turn off a switch, causing the power stage to work as a regular power stage. Various other methods and systems are also disclosed.
H02M 1/088 - Circuits specially adapted for the generation of control voltages for semiconductor devices incorporated in static converters for the simultaneous control of series or parallel connected semiconductor devices
H02M 3/157 - Conversion of DC power input into DC power output without intermediate conversion into AC by static converters using discharge tubes with control electrode or semiconductor devices with control electrode using devices of a triode or transistor type requiring continuous application of a control signal using semiconductor devices only with automatic control of output voltage or current, e.g. switching regulators with digital control
H02M 3/158 - Conversion of DC power input into DC power output without intermediate conversion into AC by static converters using discharge tubes with control electrode or semiconductor devices with control electrode using devices of a triode or transistor type requiring continuous application of a control signal using semiconductor devices only with automatic control of output voltage or current, e.g. switching regulators including plural semiconductor devices as final control devices for a single load
79.
DEVICE, SYSTEM, AND METHOD FOR CONSOLIDATING ELIGIBLE VECTOR INSTRUCTIONS
A disclosed method for consolidating eligible vector instructions can include detecting a plurality of vector instructions within a queue of an integrated circuit. The method can also include consolidating the plurality of vector instructions into a single vector instruction based at least in part on the plurality of vector instructions satisfying one or more criteria. The method can further include forwarding the single vector instruction through a pipeline of the integrated circuit. Various other devices, systems, and methods are also disclosed.
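As a hedged sketch of the consolidation step, assuming one possible criterion (same opcode over contiguous element ranges; the actual criteria are not specified here):

    from dataclasses import dataclass

    @dataclass
    class VecInst:
        opcode: str
        start: int   # first vector element touched
        count: int   # number of elements

    def consolidate(queue: list[VecInst]) -> list[VecInst]:
        out: list[VecInst] = []
        for inst in queue:
            prev = out[-1] if out else None
            if (prev and prev.opcode == inst.opcode
                    and prev.start + prev.count == inst.start):
                prev.count += inst.count   # merge into a single instruction
            else:
                out.append(VecInst(inst.opcode, inst.start, inst.count))
        return out

    q = [VecInst("vadd", 0, 4), VecInst("vadd", 4, 4), VecInst("vmul", 0, 8)]
    assert consolidate(q) == [VecInst("vadd", 0, 8), VecInst("vmul", 0, 8)]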
Methods of fabricating a semiconductor device include securing a first die to a second die. The first die includes a first clock signal path from a clock source to a first load and passing through a tap point electrically connected to a clock output. The second die includes a second clock signal path from a clock input to a second load. The methods also include connecting the clock input of the second die to the clock output of the first die. A first divergence between the tap point and the first load is substantially the same as a second divergence from the tap point through the clock input and the clock output to the second load. Various other methods, devices, and systems are also disclosed.
H03K 17/56 - Electronic switching or gating, i.e. not by contact-making and -breaking characterised by the use of specified components by the use, as active elements, of semiconductor devices
81.
METHOD FOR LENS BREATHING COMPENSATION IN CAMERA SYSTEMS
An apparatus and method for efficiently performing auto focusing while reducing lens breathing are contemplated. In various implementations, an image sensor of a camera system captures optical images of a scene and converts the optical signals into electrical signals. One or more of an auto focus circuit and an image signal processing circuit converts the analog electrical signals into digital signals. The auto focus circuit instructs a lens controller to move the lens to multiple lens positions until a final lens position is found. To reduce lens breathing during the auto focusing steps, a compensation circuit accesses multiple compensation ratios calculated during an offline calibration process and generates updated radial distances of the current lens position using the multiple compensation ratios. The compensation circuit also provides warp vectors based on the updated radial distances to a warping circuit that reduces distortion.
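A minimal sketch of the compensation step, assuming a simple multiplicative model (the names and the linear form are illustrative, not the source's exact math): each radial distance is scaled by its calibration ratio, and the warp vector is the resulting radial displacement.

    def compensate(radial_distances, compensation_ratios):
        # Updated radial distance for each calibration point.
        return [r * c for r, c in zip(radial_distances, compensation_ratios)]

    def warp_vectors(original, updated):
        # Radial displacement each image point should be warped by.
        return [u - o for o, u in zip(original, updated)]

    r = [10.0, 20.0, 30.0]        # radii at the current lens position
    ratios = [1.02, 1.01, 1.005]  # from the offline calibration
    deltas = warp_vectors(r, compensate(r, ratios))
    # deltas is approximately [0.2, 0.2, 0.15]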
The disclosed voltage regulator circuit includes an NMOS transistor as the main power device that is coupled to a regulated voltage output. A sensing circuit senses the regulated voltage output, and a reference voltage circuit supplies a bias to a flipped-source follower, which amplifies the sensed voltage output. A voltage inversion circuit such as a current mirror provides an inverting gain stage for the sensed voltage output, for driving the NMOS transistor. Various other methods, systems, and computer-readable media are also disclosed.
G05F 1/575 - Regulating voltage or current wherein the variable actually regulated by the final control device is DC using semiconductor devices in series with the load as final control devices characterised by the feedback circuit
G05F 1/565 - Regulating voltage or current wherein the variable actually regulated by the final control device is DC using semiconductor devices in series with the load as final control devices sensing a condition of the system or its load in addition to means responsive to deviations in the output of the system, e.g. current, voltage, power factor
G05F 3/24 - Regulating voltage or current wherein the variable is DC using uncontrolled devices with non-linear characteristics being semiconductor devices using diode-transistor combinations wherein the transistors are of the field-effect type only
Embodiments herein describe a hardware accelerator that includes multiple power or clock domains. For example, the hardware accelerator can include data processing engines (DPEs) which include circuitry for performing acceleration tasks (e.g., artificial intelligence (AI) tasks, data encryption tasks, data compression tasks, and the like). The DPEs are interconnected to permit them to share data when performing the acceleration tasks. In addition to the DPEs, the hardware accelerator can include other circuitry such as an interconnect, a controller, address translation circuitry, etc. The DPEs may be in a first power or clock domain while the other circuitry is in a second power or clock domain. That way, when the DPEs are idle (e.g., the hardware accelerator currently has no tasks assigned to it), the first power or clock domain can be powered down while the second power or clock domain can remain powered.
Systems, apparatuses, and methods for adaptive graph repartitioning of a computational graph representing at least a portion of a neural network are described. Parallelization of a neural network model includes executing some partitions of a computational graph on accelerators while other partitions are executed on a CPU. When executing these partitions on specific computing devices, the operational parameters of the devices and the overall availability of system resources can change over time. Adaptive graph repartitioning includes repartitioning a computational graph whose nodes represent the various operations of a neural network model. Dynamic repartitioning of the graph partitions can be performed to better distribute load between devices as conditions change, for example by exploiting different device strengths, minimizing device idle time, and generating different partition configurations.
A data processing system includes a data processor and a memory. The data processor is for issuing memory commands including a first memory command that accesses data of a first size. The memory is operative to transfer data of the first size by separating a first portion of data from a second portion of data by a data gap. The data processor is operable to selectively prioritize and issue a second memory command after issuing the first memory command at a time that fills the data gap.
Embodiments herein describe integrating an AI accelerator into the same SoC (or same chip or IC) as a CPU. Thus, instead of relying on off-chip communication techniques, on-chip communication techniques such as an interconnect (e.g., a NoC) can be used to facilitate communication. This can result in faster communication between the AI accelerator and the CPU. Moreover, a tighter integration between the CPU and AI accelerator can make it easier for the CPU to offload AI tasks to the AI accelerator. In one embodiment, the AI accelerator includes address translation circuitry for translating virtual addresses used in the AI accelerator to physical addresses used to store the data.
G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
G06F 12/1027 - Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
G06F 12/1081 - Address translation for peripheral access to main memory, e.g. direct memory access [DMA]
G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access, cycle steal
G06F 13/42 - Bus transfer protocol, e.g. handshake; Synchronisation
G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
In accordance with the described techniques, a device includes a memory system and a processor communicatively coupled to the memory system. The processor receives a load instruction from the memory system instructing the processor to load data associated with an address. In response, the processor performs a lookup for the address in a bloom filter that tracks zero value cache lines that have previously been accessed. Based on the lookup indicating that a hash of the address is present in the bloom filter, the processor generates zero value data. Furthermore, the processor processes one or more dependent instructions using the zero value data.
G06F 12/0811 - Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
G06F 9/32 - Address formation of the next instruction, e.g. by incrementing the instruction counter
G06F 12/0864 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
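A minimal Bloom-filter sketch of the zero-value tracking described above (the hash choices, sizes, and names are illustrative assumptions):

    class ZeroValueFilter:
        def __init__(self, bits: int = 1024):
            self.bits = bits
            self.array = bytearray(bits)

        def _hashes(self, addr: int):
            # Two cheap, roughly independent hashes of the line address.
            yield addr % self.bits
            yield (addr * 0x9E3779B1) % self.bits

        def insert(self, addr: int) -> None:
            for h in self._hashes(addr):
                self.array[h] = 1

        def maybe_zero(self, addr: int) -> bool:
            # May report false positives, never false negatives.
            return all(self.array[h] for h in self._hashes(addr))

    f = ZeroValueFilter()
    f.insert(0x1000)                 # a zero-value cache line was observed
    if f.maybe_zero(0x1000):
        data = b"\x00" * 64          # generate zero value data directly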
88.
PARALLEL INTEGRATED COLLECTIVE COMMUNICATION AND MATRIX MULTIPLICATION OPERATIONS
Techniques are described for efficiently performing integrated matrix multiplication operations to compute an output matrix based on two input matrices. Data corresponding to a first input tile of a first input matrix is retrieved and stored in a local buffer. For each output tile of the output matrix, a non-blocking remote call is initiated to retrieve data corresponding to a next tile of the first input matrix into the local buffer and, concurrently with the processing of this remote call, the output tile is iteratively computed using the data of the first input matrix stored in the local buffer and a unidimensional sequence of input tiles from the second input matrix. The output matrix is generated based on the iteratively computed output tiles.
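A hedged sketch of the double-buffered pattern, with the non-blocking remote call modeled as a background thread and the second input matrix applied whole rather than as a tile sequence (the whole structure is an illustrative simplification, and it assumes the row count divides evenly by the tile size):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def fetch_tile(a: np.ndarray, i: int, tile: int) -> np.ndarray:
        return a[i * tile:(i + 1) * tile, :].copy()   # stands in for the remote retrieval

    def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int) -> np.ndarray:
        n = a.shape[0] // tile
        out = np.zeros((a.shape[0], b.shape[1]), dtype=a.dtype)
        with ThreadPoolExecutor(max_workers=1) as pool:
            buf = fetch_tile(a, 0, tile)              # local buffer
            for i in range(n):
                nxt = pool.submit(fetch_tile, a, i + 1, tile) if i + 1 < n else None
                # Compute this output tile while the next fetch is in flight.
                out[i * tile:(i + 1) * tile, :] = buf @ b
                if nxt is not None:
                    buf = nxt.result()
        return out

    a = np.arange(16.0).reshape(4, 4); b = np.eye(4)
    assert np.allclose(tiled_matmul(a, b, 2), a @ b)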
A processing system includes an error management circuit that detects errors at a plurality of processing circuits. When an error is detected at one or more processing circuits but not at another processing circuit, an audio control circuit modifies audio data corresponding to the processing circuits where errors are detected while maintaining audio availability of the processing circuits where no error is detected. In some implementations, modifying the audio data mutes the processing circuits where errors are detected.
A technique for performing a path tracing operation is provided. A cache is interrogated using a probe operation that returns a Boolean result for each of a plurality of scene data elements associated with the path tracing operation. The Boolean result indicates presence or absence of a scene data element in the cache. The path tracing operation executes at least a first instruction based at least in part on the probe operation returning a Boolean result indicating absence of one of the scene data elements in the cache. The path tracing operation executes at least a second instruction based at least in part on the probe operation returning a Boolean result indicating presence of said one scene data element in the cache, wherein the first instruction is different from the second instruction.
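A rough sketch of the probe-then-branch pattern, with the cache modeled as a set and invented element names (not the source's data structures):

    cache = {"bvh_node_7", "triangle_42"}

    def probe(elements):
        # Returns one Boolean per scene data element; nothing is fetched.
        return {e: (e in cache) for e in elements}

    def step(element: str, present: bool) -> str:
        if present:
            return f"traverse {element} immediately"   # the second instruction path
        return f"defer {element}; schedule a fetch"    # the first instruction path

    for element, present in probe(["bvh_node_7", "texture_3"]).items():
        print(step(element, present))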
An apparatus and method for efficiently scheduling kernels for execution in a computing system. In various implementations, a computing system includes a cache and a processing circuit with multiple compute circuits and a scheduler. The scheduler groups kernels into scheduling groups where each scheduling group includes particular kernels of the multiple kernels that access a same data set different from a data set of another scheduling group. Each of these scheduling groups is referred to as a “cohort.” The scheduler accesses completion time estimates of kernels of the cohorts. Using the completion time estimates, the number of kernels currently executing, and the number of remaining kernels that have not yet begun execution of each currently scheduled cohort, the scheduler determines whether to immediately schedule a next cohort or delay scheduling the next cohort. By doing so, the scheduler balances throughput and cache contention.
An exemplary apparatus for distributing die-specific signals across die stacks includes a die stack and a plurality of signals arranged in a sequence across the die stack. The plurality of signals shift positions in the sequence between a first die and a second die included in the die stack. Various other apparatuses, systems, and methods are also disclosed.
Matrix-fused min-add (MFMA) instructions are described. The MFMA instructions cause a processing device to execute at least one of a min-plus function or a plus-min function. The MFMA instructions cause the processing device to execute min-plus and plus-min functions in response to a single instruction and without performing a multiplication operation as required by conventional systems. In accordance with the described techniques, an MFMA instruction causes multiple logic units (e.g., threads or wavefronts) of a processing device to execute a min-plus function, a plus-min function, or combinations thereof, as part of completing a computational task. To optimize system efficiency, the MFMA instruction causes the processing device to execute the min-plus function, the plus-min function, or combinations thereof using data stored in local registers of the processing device.
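For reference, scalar Python semantics for the two functions (the real instructions operate on matrix tiles across many lanes; note neither form needs a multiply):

    def min_plus(row, col):
        # Tropical "dot product": minimum over pairwise sums.
        return min(a + b for a, b in zip(row, col))

    def plus_min(row, col):
        # Dual form: sum over pairwise minimums.
        return sum(min(a, b) for a, b in zip(row, col))

    assert min_plus([1, 4], [2, 0]) == 3   # min(1+2, 4+0)
    assert plus_min([1, 4], [2, 0]) == 1   # min(1,2) + min(4,0)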
A memory device includes core circuitry including memory cells, and write data path circuitry coupled to the core circuitry. The write data path circuitry determines a second parity bit from a second signal and a poison bit. The second signal and the poison bit are determined by processing a first data signal. Further, the write data path circuitry detects a first error within the second signal based on a comparison between a first parity bit and the second parity bit, and outputs a first error signal comprising the first error.
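A hedged sketch of the parity comparison described above, assuming even parity computed over the second signal and folded with the poison bit (the exact parity scheme is not specified in the source):

    def parity(bits: int) -> int:
        return bin(bits).count("1") & 1

    def second_parity(second_signal: int, poison: int) -> int:
        # Parity recomputed from the processed data and the poison bit.
        return parity(second_signal) ^ (poison & 1)

    first_parity = 1                        # parity bit carried with the data
    recomputed = second_parity(0b1011_0001, poison=0)
    error = first_parity != recomputed      # a mismatch flags a bit error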
An apparatus and method for efficiently scheduling wavefronts for execution on an integrated circuit. In various implementations, a computing system includes a parallel data processing circuit with multiple, replicated compute circuits. Each compute circuit executes one or more wavefronts. Each compute circuit includes a cache configured to store temporary data that cannot fit in the vector general-purpose register file of the compute circuit. Each wavefront requests a corresponding amount of storage space in the cache for storing the temporary data. When the available data storage space in the cache is less than a data size requested by a wavefront waiting to be dispatched, a control circuit of the compute circuit reduces a dispatch rate of wavefronts. The control circuit also reduces an issue rate of instructions of one or more dispatched wavefronts to assigned execution circuits of the compute circuit.
Techniques are described for efficiently performing integrated matrix multiplication operations to compute an output matrix based on two input matrices. Data corresponding to a first input tile of a first input matrix is retrieved and stored in a local buffer. For each output tile of the output matrix, a non-blocking remote call is initiated to retrieve data corresponding to a next tile of the first input matrix into the local buffer and, concurrently with the processing of this remote call, the output tile is iteratively computed using the data of the first input matrix stored in the local buffer and a unidimensional sequence of input tiles from the second input matrix. The output matrix is generated based on the iteratively computed output tiles.
An exemplary apparatus for distributing die-specific signals across die stacks includes a die stack and a plurality of signals arranged in a sequence across the die stack. The plurality of signals shift positions in the sequence between a first die and a second die included in the die stack. Various other apparatuses, systems, and methods are also disclosed.
H01L 23/00 - Details of semiconductor or other solid state devices
H01L 25/065 - Assemblies consisting of a plurality of individual semiconductor or other solid-state devices all the devices being of a type provided for in a single subclass of subclasses , , , , or , e.g. assemblies of rectifier diodes the devices not having separate containers the devices being of a type provided for in group
Methods, apparatuses, and computer-readable media for incorporating motion awareness into the decision-making process of automatic exposure (AE) to prevent noticeable image quality deterioration resulting from motion blur. In some instances, by harnessing the capabilities of integrated camera Image Signal Processors (ISP), Inference Processing Units (IPU), and/or Artificial Intelligence (AI) acceleration, the described methods, apparatuses, and computer-readable media may achieve optimal computational efficiency and enhanced image quality.
To load compacted data between a system memory and a local data share (LDS), a processing system includes an accelerator unit (AU) connected to a memory unit. The memory unit is configured to identify that compacted data in the system memory is to be written to elements of an LDS based on two or more compaction masks. The memory unit is configured to then determine the sources within the system memory from which to load compacted data into the elements of the LDS by determining prefix sums based on the two or more compaction masks. The memory unit is configured to then load the compacted data from the identified sources into the corresponding elements of the LDS.
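A minimal sketch of how prefix sums over a compaction mask locate each compacted element's source (an illustration of the idea, not the AU's actual addressing logic):

    from itertools import accumulate

    def compacted_sources(mask):
        # Exclusive prefix sum gives each kept element's position in the
        # compacted stream.
        prefix = [0] + list(accumulate(mask))[:-1]
        return {prefix[i]: i for i, keep in enumerate(mask) if keep}

    mask = [1, 0, 1, 1, 0, 1]
    # Compacted element j loads from memory index compacted_sources(mask)[j].
    assert compacted_sources(mask) == {0: 0, 1: 2, 2: 3, 3: 5}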
A technique for performing ray tracing operations is provided. The technique includes arriving at a bounding box of a bounding volume hierarchy (“BVH”) having an orientation defined based on a platonic solid; testing a ray for intersection with the bounding box; and continuing traversal of the BVH based on results of the testing.