Backends and Allocators
The TensorOperations package is designed to provide powerful tools for performing tensor computations efficiently. In advanced use cases, it can be desirable to squeeze the last drops of performance out of the library, either by experimenting with different micro-optimized implementations of the same operation, or by altering the memory management system. Here, we detail how to access these functionalities. Note that none of the backend and allocator types documented below are exported, both to avoid polluting the namespace and because they will typically only be configured manually in expert use cases.
Backends
Backend Selection
TensorOperations supports multiple backends for tensor contractions, allowing users to choose different implementations based on their specific needs. While special care is taken to ensure good defaults, we also provide the flexibility to select a backend manually. This can be achieved in a variety of ways:
- Global setting: The default backend can be set globally, both on a per-type and on a per-function basis. This is achieved by hooking into the implementation of the default backend selection procedure. In particular, this procedure ends up calling TensorOperations.select_backend, which can be overloaded to return a different backend.
- Local setting: Alternatively, the backend can be set locally for a specific call to @tensor, ncon or the function-based interface. Both @tensor and ncon accept a keyword argument backend, which will locally override the default backend selection mechanism. The result is that the specified backend will be inserted as a final argument to all calls of the primitive tensor operations; this is also how a backend is supplied in the function-based interface.
using TensorOperations
B = rand(2, 2); C = rand(2, 2) # example input tensors
mybackend = TensorOperations.StridedNative()
# inserting a backend into the @tensor macro
@tensor backend = mybackend A[i,j] := B[i,k] * C[k,j]
# inserting a backend into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; backend=mybackend)
# inserting a backend into the function-based interface, where pA, conjA, pB,
# conjB, α and β stand for the usual permutation, conjugation and scalar arguments
tensoradd(A, pA, conjA, B, pB, conjB, α, β, mybackend)
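As an illustration of the global setting, a minimal sketch of overloading the selection mechanism could look as follows. This is only a sketch: the method signature mirrors the select_backend docstring below, and the assumption that all tensors are passed as positional arguments after the function should be checked against the docstrings of your TensorOperations version.
import TensorOperations as TO
# hypothetical rule: whenever all tensors in a contraction are Float32 matrices,
# use the BLAS-based Strided backend
function TO.select_backend(::typeof(TO.tensorcontract!), tensors::Vararg{Matrix{Float32}})
    return TO.StridedBLAS()
end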
Available Backends
All backends that are accepted by the three primitive tensor operations tensoradd!, tensortrace! and tensorcontract! are subtypes of the abstract type AbstractBackend.
TensorOperations.AbstractBackend — Type
abstract type AbstractBackend
Abstract supertype for all backends that can be used for tensor operations. In particular, these control different implementations of executing the basic operations.
TensorOperations.jl provides some options for backends out of the box. Firstly, there is the DefaultBackend, which is selected if no backend is specified:
TensorOperations.DefaultBackend — Type
DefaultBackend()
Default backend for tensor operations if no explicit backend is specified. This will select an actual implementation backend using the select_backend(tensorfun, tensors...) mechanism.
The different tensor operations have a general catch-all method in combination with DefaultBackend, which will then call select_backend to determine the actual backend to be used. This choice can depend on the specific tensor types involved and on the operation (tensoradd!, tensortrace! or tensorcontract!) to be performed.
TensorOperations.select_backend — Function
select_backend([tensorfun::Function], tensors...) -> AbstractBackend
Select the default backend for the given tensors or tensor types. If tensorfun is provided, the backend selection can additionally be controlled on a per-function basis.
Within TensorOperations.jl, the following specific backends are available:
TensorOperations.BaseCopy — Type
BaseCopy()
Backend for tensor operations that should work for all AbstractArray types and only uses functions from the Base module, as well as LinearAlgebra.mul!.
TensorOperations.BaseView — Type
BaseView()
Backend for tensor operations that should work for all AbstractArray types and only uses functions from the Base module, as well as LinearAlgebra.mul!; furthermore, it tries to avoid any intermediate allocations by using views.
TensorOperations.StridedNative — Type
StridedNative()
Backend for tensor operations that is based on StridedView objects with native Julia implementations of tensor operations.
TensorOperations.StridedBLAS — Type
StridedBLAS()
Backend for tensor operations that is based on using StridedView objects and rephrasing the tensor operations as BLAS operations.
TensorOperations.cuTENSORBackend — Type
cuTENSORBackend()
Backend for tensor operations that is based on the NVIDIA cuTENSOR library.
Here, arrays that are strided are typically handled most efficiently by the Strided.jl-based backends. By default, the StridedBLAS backend is used for element types that support BLAS operations, as the performance gains from using BLAS typically outweigh the overhead of sometimes having to allocate intermediate permuted arrays.
On the other hand, the BaseCopy and BaseView backends are used for arrays that are not strided. These are designed to be as general as possible, and as a result are not as performant as the more specialized implementations. Nevertheless, they can be useful for debugging purposes or for working with custom tensor types that have limited support for methods outside of Base.
We also provide a cuTENSORBackend for use with the cuTENSOR.jl library, which is an NVIDIA GPU-accelerated tensor contraction library. This backend is only available through a package extension for cuTENSOR.
Finally, there is also the following self-explanatory backend:
TensorOperations.NoBackend — Type
NoBackend()
Backend that will be returned if no suitable backend can be found for the given tensors.
Custom Backends
Users can also define their own backends to facilitate experimentation with new implementations. This is done by defining a new type that is a subtype of AbstractBackend, and dispatching on this type in the implementation of the primitive tensor operations. In particular, the only methods that need to be implemented are tensoradd!, tensortrace! and tensorcontract!, as sketched below.
For example, TensorOperationsTBLIS is a wrapper that provides a backend for tensor contractions using the TBLIS library.
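A minimal custom backend could look as follows. This is only a sketch: the exact argument lists of the primitive mutating operations depend on the TensorOperations version, so consult their docstrings for the precise signatures.
import TensorOperations as TO
# hypothetical backend that logs every tensoradd! call and then falls back to
# the native Strided implementation
struct LoggingBackend <: TO.AbstractBackend end
function TO.tensoradd!(C, A, pA, conjA, α, β, ::LoggingBackend, allocator)
    @info "tensoradd! with output of size $(size(C))"
    return TO.tensoradd!(C, A, pA, conjA, α, β, TO.StridedNative(), allocator)
end
# analogous methods would be defined for tensortrace! and tensorcontract!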
Allocators
Evaluating complex tensor networks is typically done most efficiently by pairwise operations. As a result, this procedure often requires the allocation of many temporary arrays, which can affect performance for certain operations. To mitigate this, TensorOperations exposes an allocator system, which allows users to more finely control the allocation of both output tensors and temporary tensors.
The allocator system is used in multiple ways: it allocates and frees the intermediate tensors that are required to evaluate a tensor network in a pairwise fashion, and it also allocates and frees the temporary objects that arise when reshaping and permuting input tensors, for example to make them compatible with BLAS instructions.
Allocator Selection
The allocator system can only be accessed locally, by passing an allocator to the @tensor macro, the ncon function, or the function-based interface.
using TensorOperations
B = rand(2, 2); C = rand(2, 2) # example input tensors
myallocator = TensorOperations.ManualAllocator()
# inserting an allocator into the @tensor macro
@tensor allocator = myallocator A[i,j] := B[i,k] * C[k,j]
# inserting an allocator into the ncon function
D = ncon([A, B, C], [[1, 2], [2, 3], [3, 1]]; allocator=myallocator)
# inserting an allocator into the function-based interface; note that it comes
# after the backend argument
tensoradd(A, pA, conjA, B, pB, conjB, α, β, TensorOperations.DefaultBackend(), myallocator)
Note that the backend system takes precedence over the allocator system: the backend is selected first, and in the function-based interface the backend argument comes before the allocator argument.
Available Allocators
TensorOperations also provides some options for allocators out of the box.
TensorOperations.DefaultAllocator — Type
DefaultAllocator()
Default allocator for tensor operations if no explicit allocator is specified. This will just use the standard constructor for the tensor type, and thus probably uses Julia's default memory manager.
TensorOperations.ManualAllocator — Type
ManualAllocator()
Allocator that bypasses Julia's memory management for temporary tensors by leveraging Libc.malloc and Libc.free directly. This can be useful for reducing the pressure on the garbage collector. This backend will allocate using DefaultAllocator for output tensors that escape the @tensor block, which will thus still be managed using Julia's GC. The other tensors will be backed by PtrArray instances, from PtrArrays.jl, thus requiring compatibility with that interface.
By default, the DefaultAllocator is used, which relies on Julia's built-in memory management system. Optionally, it can be useful to use the ManualAllocator, as its manual memory management reduces the pressure on the garbage collector. In particular, in multi-threaded applications this can sometimes lead to a significant performance improvement.
Finally, users can also opt to use the Bumper.jl system, which pre-allocates a slab of memory that can be re-used afterwards. This is available through a package extension for Bumper. Here, the allocator object is simply the provided buffer, which is then used to store the intermediate tensors.
using TensorOperations, Bumper
B = rand(2, 2); C = rand(2, 2) # example input tensors
buf = Bumper.default_buffer()
@no_escape buf begin
    @tensor allocator = buf A[i,j] := B[i,k] * C[k,j]
end
For convenience, the construction above is also provided in a specialized macro form which is fully equivalent:
TensorOperations.@butensor — Macro
@butensor tensor_expr
Use Bumper.jl to handle allocation of temporary tensors. This macro will use the default buffer and automatically reset it after the tensor expression has been evaluated. This macro is equivalent to @no_escape @tensor tensor_expr, with all temporary allocations handled by Bumper.jl.
This macro requires Bumper.jl to be installed and loaded. This can be achieved by running using Bumper or import Bumper before using the macro.
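Concretely, per the docstring above, the Bumper-based block from the previous example can then be written more compactly as:
using TensorOperations, Bumper
B = rand(2, 2); C = rand(2, 2) # example input tensors
@butensor A[i,j] := B[i,k] * C[k,j]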
When the cuTENSORBackend() is used and no allocator is specified, the CUDAAllocator() is automatically selected, which will create new temporaries as CuArray objects. However, CUDAAllocator has three type parameters which can be used to customize its behavior with respect to temporaries, as well as input and output tensors.
TensorOperations.CUDAAllocator — Type
CUDAAllocator{Mout,Min,Mtemp}()
Allocator that uses the CUDA memory manager and will thus allocate CuArray instances. The parameters Mout, Min and Mtemp can be any of the CUDA.jl memory types, i.e. CUDA.DeviceMemory, CUDA.UnifiedMemory or CUDA.HostMemory.
- Mout is used to determine how to deal with output tensors; with Mout=CUDA.HostMemory or Mout=CUDA.UnifiedMemory the CUDA runtime will ensure that the data is also available in host memory, and it can thus be converted back to a normal array using unsafe_wrap(Array, outputtensor). If Mout=CUDA.DeviceMemory, the data will remain on the GPU until an explicit Array(outputtensor) is called.
- Min is used to determine how to deal with input tensors; with Min=CUDA.HostMemory the CUDA runtime will itself take care of transferring the data to the GPU, otherwise it is copied explicitly.
- Mtemp is used to allocate space for temporary tensors; it defaults to CUDA.default_memory, which is CUDA.DeviceMemory. Only if many or huge temporary tensors are expected could it be useful to choose CUDA.UnifiedMemory.
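As a sketch of how this fits together (assuming CUDA.jl and cuTENSOR.jl are installed and loaded, which enables the relevant package extension, and that @tensor accepts both the backend and allocator keyword arguments as shown earlier), a customized allocator could be constructed and passed as follows:
using TensorOperations, CUDA, cuTENSOR
B = rand(Float32, 4, 4); C = rand(Float32, 4, 4) # host input tensors
# outputs in unified memory, inputs transferred by the CUDA runtime,
# temporaries in device memory
myalloc = TensorOperations.CUDAAllocator{CUDA.UnifiedMemory,CUDA.HostMemory,CUDA.DeviceMemory}()
@tensor backend = TensorOperations.cuTENSORBackend() allocator = myalloc A[i,j] := B[i,k] * C[k,j]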
Custom Allocators
Users can also define their own allocators to facilitate experimentation with new implementations. Here, no restriction is placed on the type of the allocator, and any object can be passed as an allocator. The methods that need to be implemented are tensoralloc and tensorfree!, as sketched below.
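A minimal sketch of such a custom allocator is shown below. This assumes the argument orders tensoralloc(ttype, structure, istemp, allocator) and tensorfree!(tensor, allocator); consult the docstrings of these functions for the exact signatures in your TensorOperations version.
import TensorOperations as TO
# hypothetical allocator that counts allocations and otherwise defers to the
# default allocator
mutable struct CountingAllocator
    nallocs::Int
end
CountingAllocator() = CountingAllocator(0)
function TO.tensoralloc(ttype, structure, istemp, alloc::CountingAllocator)
    alloc.nallocs += 1
    return TO.tensoralloc(ttype, structure, istemp, TO.DefaultAllocator())
end
TO.tensorfree!(t, alloc::CountingAllocator) = TO.tensorfree!(t, TO.DefaultAllocator())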