Backends and Allocators

The TensorOperations package is designed to provide powerful tools for performing tensor computations efficiently. In advanced use cases, it can be desirable to squeeze the last drops of performance out of the library by experimenting with either different micro-optimized implementations of the same operation, or by altering the memory management system. Here, we detail how to access these functionalities. Note that none of the backend and allocator types documented below are exported: this avoids polluting the namespace, and these types will typically only be configured manually in expert use cases.

Backends

Backend Selection

TensorOperations supports multiple backends for tensor contractions, allowing users to choose different implementations based on their specific needs. While special care is taken to ensure good defaults, we also provide the flexibility to select a backend manually. This can be achieved in a variety of ways:

  1. Global setting: The default backend can be set globally on a per-type basis, as well as a per-function basis. This is achieved by hooking into the implementation of the default backend selection procedure. In particular, this procedure ends up calling TensorOperations.select_backend, which can be overloaded to return a different backend (see the sketch below the code examples).

  2. Local setting: Alternatively, the backend can be set locally for a specific call to either @tensor, ncon or the function-based interface. Both @tensor and ncon accept a keyword argument backend, which will locally override the default backend selection mechanism. The result is that the specified backend will be inserted as a final argument to all calls of the primitive tensor operations. This is also how this can be achieved in the function-based interface.

using TensorOperations
mybackend = TensorOperations.StridedNative()
B = rand(4, 4); C = rand(4, 4)  # example input tensors

# inserting a backend into the @tensor macro
@tensor backend = mybackend A[i,j] := B[i,k] * C[k,j]

# inserting a backend into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; backend=mybackend)

# inserting a backend into the function-based interface
pA = pB = ((1, 2), ())  # trivial index permutations, as Index2Tuples
tensoradd(A, pA, false, B, pB, false, 1.0, 1.0, mybackend)
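
As an illustration of the global mechanism of point 1, the following sketch routes all tensor operations on a custom array type to a fixed backend. Here, MyCustomArray is a hypothetical type introduced purely for illustration, and we assume, as the select_backend docstring below suggests, that the function can be extended both with and without the tensorfun argument; the exact positional arguments it receives depend on the primitive operation.

using TensorOperations

# hypothetical array type, used purely for illustration
struct MyCustomArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end

# per-type setting: route all operations on MyCustomArray tensors to BaseCopy
TensorOperations.select_backend(tensors::MyCustomArray...) = TensorOperations.BaseCopy()

# per-function setting: use BaseView only for tensorcontract! calls
function TensorOperations.select_backend(::typeof(TensorOperations.tensorcontract!),
                                         tensors::MyCustomArray...)
    return TensorOperations.BaseView()
end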

Available Backends

All backends that are accepted in the three primitive tensor operations tensoradd!, tensortrace! and tensorcontract! are subtypes of the abstract type AbstractBackend.

TensorOperations.AbstractBackend (Type)
abstract type AbstractBackend

Abstract supertype for all backends that can be used for tensor operations. In particular, these control different implementations of executing the basic operations.


TensorOperations.jl provides some options for backends out of the box. Firstly, there is the DefaultBackend, which is selected if no backend is specified:

TensorOperations.DefaultBackend (Type)
DefaultBackend()

Default backend for tensor operations if no explicit backend is specified. This will select an actual implementation backend using the select_backend(tensorfun, tensors...) mechanism.


The different tensor operations have a general catch-all method in combination with DefaultBackend, which will then call select_backend to determine the actual backend to be used; this can depend on the specific tensor types involved and on the operation (tensoradd!, tensortrace! or tensorcontract!) to be performed.

TensorOperations.select_backend (Function)
select_backend([tensorfun::Function], tensors...) -> AbstractBackend

Select the default backend for the given tensors or tensortypes. If tensorfun is provided, it is possible to more finely control the backend selection based on the function as well.


Within TensorOperations.jl, the following specific backends are available:

TensorOperations.BaseCopy (Type)
BaseCopy()

Backend for tensor operations that should work for all AbstractArray types and only uses functions from the Base module, as well as LinearAlgebra.mul!.

TensorOperations.BaseView (Type)
BaseView()

Backend for tensor operations that should work for all AbstractArray types and only uses functions from the Base module, as well as LinearAlgebra.mul!, and furthermore tries to avoid any intermediate allocations by using views.

TensorOperations.StridedNative (Type)
StridedNative()

Backend for tensor operations that is based on StridedView objects with native Julia implementations of tensor operations.

TensorOperations.StridedBLAS (Type)
StridedBLAS()

Backend for tensor operations that is based on using StridedView objects and rephrasing the tensor operations as BLAS operations.


Here, arrays that are strided are typically handled most efficiently by the Strided.jl-based backends. By default, the StridedBLAS backend is used for element types that support BLAS operations, as it seems that the performance gains from using BLAS outweigh the overhead of sometimes having to allocate intermediate permuted arrays.

On the other hand, the BaseCopy and BaseView backends are used for arrays that are not strided. These are designed to be as general as possible, and as a result are not as performant as specific implementations. Nevertheless, they can be useful for debugging purposes or for working with custom tensor types that have limited support for methods outside of Base.

Finally, we also provide a cuTENSORBackend for use with the cuTENSOR.jl library, which is an NVIDIA GPU-accelerated tensor contraction library. This backend is only available through a package extension for cuTENSOR.


Custom Backends

Users can also define their own backends, to facilitate experimentation with new implementations. This can be done by defining a new type that is a subtype of AbstractBackend, and dispatching on this type in the implementation of the primitive tensor operations. In particular, the only methods that need to be implemented are tensoradd!, tensortrace! and tensorcontract!, as sketched below.
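
The following minimal sketch wraps an existing backend: it logs every call and then delegates the actual work to StridedNative(). The argument list mimics the function-based interface shown above, with the backend and allocator as the final arguments; the exact signatures should be checked against the docstrings of the primitive operations.

using TensorOperations
using TensorOperations: AbstractBackend, StridedNative

# hypothetical backend that logs every call and delegates the actual work
struct LoggingBackend <: AbstractBackend end

function TensorOperations.tensoradd!(C, A, pA, conjA, α, β,
                                     ::LoggingBackend, allocator)
    @info "tensoradd! on $(typeof(C))"
    return TensorOperations.tensoradd!(C, A, pA, conjA, α, β,
                                       StridedNative(), allocator)
end

# tensortrace! and tensorcontract! would be forwarded analogously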

As a real-world example, TensorOperationsTBLIS is a wrapper that provides a backend for tensor contractions using the TBLIS library.

Allocators

Evaluating complex tensor networks is typically done most efficiently by pairwise operations. As a result, this procedure often requires the allocation of many temporary arrays, which can degrade performance. To mitigate this, TensorOperations exposes an allocator system, which allows users to more finely control the allocation of both output tensors and temporary tensors.

In particular, the allocator system is used in multiple ways: As mentioned before, it can be used to allocate and free the intermediate tensors that are required to evaluate a tensor network in a pairwise fashion. Additionally, it can also be used to allocate and free temporary objects that arise when reshaping and permuting input tensors, for example when making them compatible with BLAS instructions.

Allocator Selection

The allocator system can only be accessed locally, by passing an allocator to the @tensor macro, the ncon function, or the function-based interface.

using TensorOperations
myallocator = TensorOperations.ManualAllocator()
B = rand(4, 4); C = rand(4, 4)  # example input tensors

# inserting an allocator into the @tensor macro
@tensor allocator = myallocator A[i,j] := B[i,k] * C[k,j]

# inserting an allocator into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; allocator=myallocator)

# inserting an allocator into the function-based interface
pA = pB = ((1, 2), ())  # trivial index permutations, as Index2Tuples
tensoradd(A, pA, false, B, pB, false, 1.0, 1.0, TensorOperations.DefaultBackend(), myallocator)

It is important to note that the backend system takes priority over the allocator system: the backend is selected first, and the allocator is then inserted as the final argument, after the backend.

Available Allocators

TensorOperations also provides some options for allocators out of the box.

TensorOperations.DefaultAllocator (Type)
DefaultAllocator()

Default allocator for tensor operations if no explicit allocator is specified. This will just use the standard constructor for the tensor type, and thus probably uses Julia's default memory manager.

TensorOperations.ManualAllocator (Type)
ManualAllocator()

Allocator that bypasses Julia's memory management for temporary tensors by leveraging Libc.malloc and Libc.free directly. This can be useful for reducing the pressure on the garbage collector. This allocator will fall back to DefaultAllocator for output tensors that escape the @tensor block, which will thus still be managed using Julia's GC. The other tensors will be backed by PtrArray instances, from PtrArrays.jl, thus requiring compatibility with that interface.


By default, the DefaultAllocator is used, which uses Julia's built-in memory management system. Alternatively, it can be useful to use the ManualAllocator, as its manual memory management reduces the pressure on the garbage collector. In multi-threaded applications in particular, this can sometimes lead to a significant performance improvement.

Finally, users can also opt to use the Bumper.jl system, which pre-allocates a slab of memory that can be re-used afterwards. This is available through a package extension for Bumper. Here, the allocator object is just the provided buffer, which is then used to store the intermediate tensors.

using TensorOperations, Bumper
B = rand(4, 4); C = rand(4, 4)  # example input tensors
buf = Bumper.default_buffer()
@no_escape buf begin
    @tensor allocator = buf A[i,j] := B[i,k] * C[k,j]
end

For convenience, the construction above is also provided in a specialized macro form which is fully equivalent:

TensorOperations.@butensor (Macro)
@butensor tensor_expr

Use Bumper.jl to handle allocation of temporary tensors. This macro will use the default buffer and automatically reset it after the tensor expression has been evaluated. This macro is equivalent to @no_escape @tensor tensor_expr with all temporary allocations handled by Bumper.jl.

Note

This macro requires Bumper.jl to be installed and loaded. This can be achieved by running using Bumper or import Bumper before using the macro.

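With this, the Bumper-based example above reduces to the following (B and C as before):

using TensorOperations, Bumper

B = rand(4, 4); C = rand(4, 4)
# temporaries live in Bumper's default buffer, which is reset automatically afterwards
@butensor A[i,j] := B[i,k] * C[k,j]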

When the cuTENSORBackend() is used and no allocator is specified, the allocator CUDAAllocator() is automatically selected, which will create new temporaries as CuArray objects. However, CUDAAllocator has three type parameters which can be used to customize the behavior of the allocator with respect to temporaries, as well as input and output tensors.

TensorOperations.CUDAAllocator (Type)
CUDAAllocator{Mout,Min,Mtemp}()

Allocator that uses the CUDA memory manager and will thus allocate CuArray instances. The parameters Mout, Min, Mtemp can be any of the CUDA.jl memory types, i.e. CUDA.DeviceMemory, CUDA.UnifiedMemory or CUDA.HostMemory.

  • Mout is used to determine how to deal with output tensors; with Mout=CUDA.HostMemory or Mout=CUDA.UnifiedMemory the CUDA runtime will ensure that the data is also available in host memory, and it can thus be converted back to normal arrays using unsafe_wrap(Array, outputtensor). If Mout=CUDA.DeviceMemory the data will remain on the GPU, until an explicit Array(outputtensor) is called.
  • Min is used to determine how to deal with input tensors; with Min=CUDA.HostMemory the CUDA runtime will itself take care of transferring the data to the GPU, otherwise it is copied explicitly.
  • Mtemp is used to allocate space for temporary tensors; it defaults to CUDA.default_memory, which is CUDA.DeviceMemory. Only if many or huge temporary tensors are expected could it be useful to choose CUDA.UnifiedMemory.
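
As an illustration, the following sketch keeps the output in unified memory, so that it remains host-accessible, while temporaries stay in device memory. It assumes that CUDA.jl and cuTENSOR.jl are installed and loaded, so that the corresponding package extension is available and the cuTENSORBackend can be constructed.

using TensorOperations, CUDA, cuTENSOR

B = CUDA.rand(4, 4); C = CUDA.rand(4, 4)
myallocator = TensorOperations.CUDAAllocator{CUDA.UnifiedMemory,CUDA.DeviceMemory,CUDA.DeviceMemory}()
@tensor backend = TensorOperations.cuTENSORBackend() allocator = myallocator A[i,j] := B[i,k] * C[k,j]

# Mout = CUDA.UnifiedMemory makes the result accessible from the host
A_host = unsafe_wrap(Array, A)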

Custom Allocators

Users can also define their own allocators, to facilitate experimentation with new implementations. Here, no restriction is placed on the type of the allocator, and any object can be passed as an allocator. The only methods that need to be implemented are tensoralloc and tensorfree!, as sketched below.
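
For instance, the following hypothetical allocator counts how many tensors are allocated, while delegating the actual work to the DefaultAllocator; the precise signatures of tensoralloc and tensorfree! should be checked against their docstrings before relying on this sketch.

using TensorOperations
using TensorOperations: DefaultAllocator

# hypothetical allocator that counts allocations and delegates the real work
mutable struct CountingAllocator
    count::Int
end
CountingAllocator() = CountingAllocator(0)

function TensorOperations.tensoralloc(ttype, structure, istemp, alloc::CountingAllocator)
    alloc.count += 1
    return TensorOperations.tensoralloc(ttype, structure, istemp, DefaultAllocator())
end

# DefaultAllocator-backed tensors are managed by the GC, so freeing is delegated as well
TensorOperations.tensorfree!(t, ::CountingAllocator) = TensorOperations.tensorfree!(t, DefaultAllocator())

Since any object can act as an allocator, CountingAllocator does not need to subtype anything, and it can be passed directly as allocator = CountingAllocator() to the @tensor macro.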