author Sai Praveen Bangaru <31557731+saipraveenb25@users.noreply.github.com> 2023-09-21 01:15:29 -0400
committer GitHub <noreply@github.com> 2023-09-20 22:15:29 -0700
commit c04f5b4970875d60cdf6ac9ea48dda7add4383f3 (patch)
tree 7cb68cd9a9638d4e8313fddf193818e5f428b00f /docs
parent 29c318bfe5c66350a67467e3b6ef08120f00fb7e (diff)
Update user-guide with new slangpy features (#3222)
Diffstat (limited to 'docs')
-rw-r--r-- docs/user-guide/a1-02-slangpy.md 482
1 files changed, 353 insertions, 129 deletions
diff --git a/docs/user-guide/a1-02-slangpy.md b/docs/user-guide/a1-02-slangpy.md
index ea78a52db..883ac249c 100644
--- a/docs/user-guide/a1-02-slangpy.md
+++ b/docs/user-guide/a1-02-slangpy.md
@@ -14,191 +14,190 @@ In addition, using a per-thread programming model also results in more optimized
In this tutorial, we will use a simple example to walk through the steps to use Slang in your PyTorch project.
-### Writing a simple kernel function as a Slang module
+### Installation
+`slangpy` is available via PyPI, so you can install it simply through
+```sh
+pip install slangpy
+```
-Assume we want to write a kernel function that computes `x*x` for each element in the input tensor in Slang. To do so,
-we start by creating a `square.slang` file:
+Note that `slangpy` requires `torch` with CUDA support. See the [pytorch](https://pytorch.org/) installation page to find the right version for your platform.
-```csharp
-// square.slang
-float square(float x)
-{
- return x * x;
-}
+
+You can check that you have the right installation by running:
+```sh
+python -c "import torch; print(f'cuda: {torch.cuda.is_available()}')"
```
-This function is self-explanatory. To use it in PyTorch, we need to write a GPU kernel function (that maps to a
-`__global__` CUDA function) that defines how to compute each element of the input tensor. So we continue to write
-the following Slang function:
+### Writing Slang kernels for `slangpy` >= **v1.1.5**
-```csharp
-[CudaKernel]
-void square_fwd_kernel(TensorView<float> input, TensorView<float> output)
+From **v2023.4.0**, Slang supports auto-binding features that make it easier than ever to invoke Slang kernels from python, and interoperate seamlessly with `pytorch` tensors.
+
+Here's a barebones example of a simple squaring kernel written in Slang (`square.slang`):
+
+``` csharp
+[AutoPyBindCUDA]
+[CUDAKernel]
+void square(TensorView<float> input, TensorView<float> output)
{
- uint3 globalIdx = cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx();
+ // Get the 'global' index of this thread.
+ uint3 launchIdx = cudaThreadIdx() + cudaBlockIdx() * cudaBlockDim();
- if (globalIdx.x > input.size(0) || globalIdx.y > input.size(1))
+ // If the thread index is beyond the input size, exit early.
+ if (launchIdx.x >= input.size(0))
return;
- float result = square(input[globalIdx.xy]);
- output[globalIdx.xy] = result;
-}
-```
-
-This code follows the standard pattern of a typical CUDA kernel function. It takes as input
-two tensors, `input` and `output`.
-It first obtains the global dispatch index of the current thread and performs range check to make sure we don't read or write out
-of the bounds of input and output tensors, and then calls `square()` to compute the per-element result, and
-store it at the corresponding location in `output` tensor.
-With a kernel function defined, we then need to expose a CPU(host) function that defines how this kernel is dispatched:
-```csharp
-[TorchEntryPoint]
-TorchTensor<float> square_fwd(TorchTensor<float> input)
-{
- var result = TorchTensor<float>.zerosLike(input);
- let blockCount = uint3(1);
- let groupSize = uint3(result.size(0), result.size(1), 1);
- __dispatch_kernel(square_fwd_kernel, blockCount, groupSize)(input, result);
- return result;
+ output[launchIdx.x] = input[launchIdx.x] * input[launchIdx.x];
}
+
```
-Here, we mark the function with the `[TorchEntryPoint]` attribute, so it will be exported to Python. In the function body, we call `TorchTensor<float>.zerosLike` to allocate a 2D-tensor that has the same size as the input.
-`zerosLike` returns a `TorchTensor<float>` object that represents a CPU handle of a PyTorch tensor.
-Then we launch `square_fwd_kernel` with the `__dispatch_kernel` syntax. Note that we can directly pass
-`TorchTensor<float>` arguments to a `TensorView<float>` parameter and the compiler will automatically convert
-the type and obtain a view into the tensor that can be accessed by the GPU kernel function.
-### Calling Slang module from Python
+`square` performs **element-wise** squaring of `input` and writes the results to `output`.
-Next, let's see how we can call the `square_fwd` function we defined in the Slang module.
-To do so, we use a python package called `slangpy`. You can obtain it with
-```bash
-pip install slangpy
-```
+`slangpy` works by compiling kernels to CUDA and it identifies the functions to compile by checking for the `[CUDAKernel]` attribute.
+The second attribute `[AutoPyBindCUDA]` allows us to call `square` directly from python without having to write any host code. If you would like to write the host code yourself for finer control, see the other version of this example [here](#manually-binding-kernels).
-With that, you can use the following code to call `square_fwd` from Python:
+You can now simply invoke this kernel from python:
-```python
+``` Python
import torch
import slangpy
-m = slangpy.loadModule("square.slang")
+m = slangpy.loadModule('square.slang')
-x = torch.randn(2,2)
-print(f"X = {x}")
-y = m.square_fwd(x)
-print(f"Y = {y.cpu()}")
-```
+input = torch.randn((1024,), dtype=torch.float).cuda()
-Result output:
-```
-X = tensor([[ 0.1407, 0.6594],
- [-0.8978, -1.7230]])
-Y = tensor([[0.0198, 0.4349],
- [0.8060, 2.9688]])
+output = torch.zeros_like(input)
+
+# Number of threads launched = blockSize * gridSize (32 * 32 = 1024, one per element)
+m.square(input=input, output=output).launchRaw(blockSize=(32, 1, 1), gridSize=(32, 1, 1))
+
+print(output)
```
-And that's it! `slangpy.loadModule` uses JIT compilation to compile your Slang source into CUDA binary.
-It may take a little longer the first time you execute the script, but the compiled binaries will be cached and as
-long as the kernel code is not changed, future runs will not rebuild the CUDA kernel.
+The python script `slangpy.loadModule("square.slang")` returns a scope that contains a handle to the `square` kernel.
-Because the PyTorch JIT system requires `ninja`, you need to make sure `ninja` is installed on your system
-and is discoverable from the current environment, you also need to have a C++ compiler available on the system.
-On Windows, this means that Visual Studio need to be installed.
+The kernel can be invoked by
+1. calling `square` and binding `torch` tensors as arguments for the kernel, and then
+2. launching it using `launchRaw()` by specifying CUDA launch arguments to `blockSize` & `gridSize`. (Refer to the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications) for restrictions around `blockSize`)
-### Exposing an automatically differentiated kernel to PyTorch
+Note that for semantic clarity reasons, calling a kernel requires the use of keyword arguments with names that are lifted from the `.slang` implementation.
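The launch arguments in the example above are picked by hand. As a minimal sketch, the grid size for a 1D launch can instead be derived from the element count with a ceiling division. The helper name `grid_size_1d` is hypothetical and is not part of `slangpy`; it only computes the tuple that would be passed as `gridSize`.

```python
def grid_size_1d(num_elements: int, block_size: int) -> tuple:
    """Return the smallest 1D grid (as an (x, y, z) tuple) whose blocks
    contain at least num_elements threads in total."""
    if num_elements < 0 or block_size <= 0:
        raise ValueError("num_elements must be >= 0 and block_size must be > 0")
    # Ceiling division using integers only, so the result stays an int.
    blocks = (num_elements + block_size - 1) // block_size
    # A CUDA grid dimension cannot be zero, so launch at least one block.
    return (max(blocks, 1), 1, 1)


print(grid_size_1d(1024, 32))  # (32, 1, 1)
print(grid_size_1d(1025, 32))  # (33, 1, 1)
```

The result can be passed directly, for example `launchRaw(blockSize=(32, 1, 1), gridSize=grid_size_1d(1024, 32))`. When the element count is not a multiple of the block size, the last block contains surplus threads, which is why the kernel needs its early-exit range check.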
-The above example demonstrates how to write a simple kernel function in Slang and call it from Python.
-Another major benefit of using Slang is that the Slang compiler support generating backward derivative
-propagation functions automatically.
+### Invoking derivatives of kernels using slangpy
-In the following section, we walk through how to use Slang to generate a backward propagation function
-for `square`, and expose it to PyTorch as an autograd function.
+The `[AutoPyBindCUDA]` attribute can also be used on differentiable functions defined in Slang, and will automatically bind the derivatives. To do this, simply add the `[Differentiable]` attribute.
-First we need to tell Slang compiler that we need the `square` function to be considered a differentiable function, so Slang compiler can generate a backward derivative propagation function for it:
-```csharp
-[Differentiable]
-float square(float x)
-{
- return x * x;
-}
-```
-This is done by simply adding a `[Differentiable]` attribute to our `square`function.
+One key point is that the basic `TensorView<T>` objects are not differentiable. They can be used as buffers for data that does not require derivatives, or even as buffers for the manual accumulation of derivatives.
-With that, we can now define `square_bwd_kernel` that performs backward propagation as:
+Instead, use the `DiffTensorView` type when you need differentiable tensors. Currently, `DiffTensorView` only supports the `float` dtype, and requires the use of `.load(offset)` and `.store(offset, val)` instead of `[]`, although
+`offset` can be a scalar `uint` or a vector (`uint2`, `uint3`, etc.) for multi-dimensional indexing.
-```csharp
-[CudaKernel]
-void square_bwd_kernel(TensorView<float> input, TensorView<float> grad_out, TensorView<float> grad_propagated)
+Here's a barebones example of a differentiable `square` kernel that computes `x * x`:
+
+``` C
+[AutoPyBindCUDA]
+[CUDAKernel]
+[Differentiable]
+void square(DiffTensorView input, DiffTensorView output)
{
- uint3 globalIdx = cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx();
+ uint3 launchIdx = cudaThreadIdx() + cudaBlockIdx() * cudaBlockDim();
- if (globalIdx.x > input.size(0) || globalIdx.y > input.size(1))
+ if (launchIdx.x >= input.size(0))
return;
+
+ float val = input.load(launchIdx.x);
- DifferentialPair<float> dpInput = diffPair(input[globalIdx.xy]);
- var gradInElem = grad_out[globalIdx.xy];
- bwd_diff(square)(dpInput, gradInElem);
- grad_propagated[globalIdx.xy] = dpInput.d;
+ float result = val * val;
+
+ output.store(launchIdx.x, result);
}
```
-Note that the function follows the same structure of `square_fwd_kernel`, with the only difference being that
-instead of calling into `square` to compute the forward value for each tensor element, we are calling `bwd_diff(square)`
-that represents the automatically generated backward propagation function of `square`.
-`bwd_diff(square)` will have the following signature:
-```csharp
-void bwd_diff_square(inout DifferentialPair<float> dpInput, float dOut);
-```
+Now, `slangpy.loadModule("square.slang")` returns a scope with three callable handles `square`, `square.fwd` for the forward-mode derivative & `square.bwd` for the reverse-mode derivative.
-Where the first parameter, `dpInput` represents a pair of original and derivative value for `input`, and the second parameter,
-`dOut`, represents the initial derivative with regard to some latent variable that we wish to back-prop through. The resulting
-derivative will be stored in `dpInput.d`. For example:
+You can invoke `square()` normally to get the same effect as the previous example, or invoke `square.fwd()` / `square.bwd()` by binding pairs of tensors to compute the derivatives.
-```csharp
-// construct a pair where the primal value is 3, and derivative value is 0.
-var dp = diffPair(3.0);
-bwd_diff(square)(dp, 1.0);
-// dp.d is now 6.0
-```
+``` Python
+import torch
+import slangpy
-Similar to `square_fwd`, we can define the host side function `square_bwd` as:
+m = slangpy.loadModule('square.slang')
-```csharp
-[TorchEntryPoint]
-TorchTensor<float> square_bwd(TorchTensor<float> input, TorchTensor<float> grad_out)
-{
- var grad_propagated = TorchTensor<float>.zerosLike(input);
- let blockCount = uint3(1);
- let groupSize = uint3(input.size(0), input.size(1), 1);
- __dispatch_kernel(square_bwd_kernel, blockCount, groupSize)(input, grad_out, grad_propagated);
- return grad_propagated;
-}
+input = torch.tensor((0, 1, 3, 4, 5), dtype=torch.float).cuda()
+output = torch.zeros_like(input).cuda()
+
+# Invoke normally
+m.square(input=input, output=output).launchRaw(blockSize=(6, 1, 1), gridSize=(1, 1, 1))
+
+print(output)
+
+# Invoke reverse-mode autodiff by first allocating tensors to hold the gradients
+input = torch.tensor((0, 1, 3, 4, 5), dtype=torch.float).cuda()
+input_grad = torch.zeros_like(input).cuda()
+
+output = torch.zeros_like(input)
+# Pass in all 1s as the output derivative for our example
+output_grad = torch.ones_like(output)
+
+m.square.bwd(
+ input=(input, input_grad), output=(output, output_grad)
+).launchRaw(
+ blockSize=(6, 1, 1), gridSize=(1, 1, 1))
+
+# Derivatives get propagated to input_grad
+print(input_grad)
+
+# Note that the derivatives in output_grad are 'consumed'.
+# i.e. all zeros after the call.
+print(output_grad)
```
+`slangpy` also binds the forward-mode version of your kernel (which propagates derivatives of the inputs to the output). It can be invoked the same way using `m.square.fwd()`.
+
You can refer to [this documentation](07-autodiff.md) for a detailed reference of Slang's automatic differentiation feature.
-With this, the python script `slangpy.loadModule("square.slang")` will now return
-a scope that defines two functions, `square_fwd` and `square_bwd`. We can then use these
-two functions to define a PyTorch autograd function class:
+### Wrapping your kernels as pytorch functions
+
+`pytorch` offers an easy way to define a custom operation using `torch.autograd.Function`, and defining the `.forward()` and `.backward()` members.
+
+This can be very helpful to wrap your Slang kernels. Here's an example of the `square` kernel as a differentiable pytorch function.
```python
m = slangpy.loadModule("square.slang")
-class MySquareFuncInSlang(torch.autograd.Function):
+class MySquareFunc(torch.autograd.Function):
@staticmethod
def forward(ctx, input):
- ctx.save_for_backward(input)
- return m.square_fwd(input)
+ output = torch.zeros_like(input)
+
+ kernel_with_args = m.square(input=input, output=output)
+ kernel_with_args.launchRaw(
+ blockSize=(32, 32, 1),
+ gridSize=((input.shape[0] + 31) // 32, (input.shape[1] + 31) // 32, 1))
+
+ ctx.save_for_backward(input, output)
+
+ return output
@staticmethod
def backward(ctx, grad_output):
- [input] = ctx.saved_tensors
- return m.square_bwd(input, grad_output)
+ (input, output) = ctx.saved_tensors
+
+ input_grad = torch.zeros_like(input)
+
+ # Note: When using DiffTensorView, grad_output gets 'consumed' during the reverse-mode.
+ # If grad_output may be reused, consider calling grad_output = grad_output.clone()
+ #
+ kernel_with_args = m.square.bwd(input=(input, input_grad), output=(output, grad_output))
+ kernel_with_args.launchRaw(
+ blockSize=(32, 32, 1),
+ gridSize=((input.shape[0] + 31) // 32, (input.shape[1] + 31) // 32, 1))
+
+ return input_grad
```
-Now we can use the autograd function `MySquareFuncInSlang` in our python script:
+Now we can use the autograd function `MySquareFunc` in our python script:
```python
x = torch.tensor([[3.0, 4.0],[0.0, 1.0]], requires_grad=True, device='cuda')
@@ -446,6 +445,169 @@ back propagated derivative values.
Again, to understand all the details of the automatic differentiation system, please refer to the
[Automatic Differentiation](07-autodiff.md) chapter for a detailed explanation.
+## Manually binding kernels
+`[AutoPyBindCUDA]` works for most use cases, but in certain situations, it may be necessary to write the *host* function by hand. The host function can also be written in Slang, and `slangpy` handles its compilation to C++.
+
+Here's the same `square` example from before, but with a hand-written host function:
+
+```csharp
+// square.slang
+float square(float x)
+{
+ return x * x;
+}
+```
+
+This function is self-explanatory. To use it in PyTorch, we need to write a GPU kernel function (that maps to a
+`__global__` CUDA function) that defines how to compute each element of the input tensor. So we continue to write
+the following Slang function:
+
+```csharp
+[CudaKernel]
+void square_fwd_kernel(TensorView<float> input, TensorView<float> output)
+{
+ uint3 globalIdx = cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx();
+
+ if (globalIdx.x >= input.size(0) || globalIdx.y >= input.size(1))
+ return;
+ float result = square(input[globalIdx.xy]);
+ output[globalIdx.xy] = result;
+}
+```
+
+This code follows the standard pattern of a typical CUDA kernel function. It takes as input
+two tensors, `input` and `output`.
+It first obtains the global dispatch index of the current thread and performs a range check to make sure we don't read or write out
+of the bounds of the input and output tensors. It then calls `square()` to compute the per-element result and
+stores it at the corresponding location in the `output` tensor.
+
+With a kernel function defined, we then need to expose a CPU (host) function that defines how this kernel is dispatched:
+```csharp
+[TorchEntryPoint]
+TorchTensor<float> square_fwd(TorchTensor<float> input)
+{
+ var result = TorchTensor<float>.zerosLike(input);
+ let blockCount = uint3(1);
+ let groupSize = uint3(result.size(0), result.size(1), 1);
+ __dispatch_kernel(square_fwd_kernel, blockCount, groupSize)(input, result);
+ return result;
+}
+```
+Here, we mark the function with the `[TorchEntryPoint]` attribute, so it will be exported to Python. In the function body, we call `TorchTensor<float>.zerosLike` to allocate a 2D-tensor that has the same size as the input.
+`zerosLike` returns a `TorchTensor<float>` object that represents a CPU handle of a PyTorch tensor.
+Then we launch `square_fwd_kernel` with the `__dispatch_kernel` syntax. Note that we can directly pass
+`TorchTensor<float>` arguments to a `TensorView<float>` parameter and the compiler will automatically convert
+the type and obtain a view into the tensor that can be accessed by the GPU kernel function.
+
+### Calling Slang module from Python
+
+Next, let's see how we can call the `square_fwd` function we defined in the Slang module.
+To do so, we use a python package called `slangpy`. You can obtain it with
+
+```bash
+pip install slangpy
+```
+
+With that, you can use the following code to call `square_fwd` from Python:
+
+```python
+import torch
+import slangpy
+
+m = slangpy.loadModule("square.slang")
+
+x = torch.randn(2,2)
+print(f"X = {x}")
+y = m.square_fwd(x)
+print(f"Y = {y.cpu()}")
+```
+
+Result output:
+```
+X = tensor([[ 0.1407, 0.6594],
+ [-0.8978, -1.7230]])
+Y = tensor([[0.0198, 0.4349],
+ [0.8060, 2.9688]])
+```
+
+And that's it! `slangpy.loadModule` uses JIT compilation to compile your Slang source into a CUDA binary.
+It may take a little longer the first time you execute the script, but the compiled binaries will be cached and as
+long as the kernel code is not changed, future runs will not rebuild the CUDA kernel.
+
+Because the PyTorch JIT system requires `ninja`, you need to make sure `ninja` is installed on your system
+and is discoverable from the current environment. You also need to have a C++ compiler available on the system.
+On Windows, this means that Visual Studio needs to be installed.
+
+### Manual binding for kernel derivatives
+
+The above example demonstrates how to write a simple kernel function in Slang and call it from Python.
+Another major benefit of using Slang is that the Slang compiler supports generating backward derivative
+propagation functions automatically.
+
+In the following section, we walk through how to use Slang to generate a backward propagation function
+for `square`, and expose it to PyTorch as an autograd function.
+
+First we need to tell the Slang compiler that the `square` function should be treated as differentiable, so that it can generate a backward derivative propagation function for it:
+```csharp
+[Differentiable]
+float square(float x)
+{
+ return x * x;
+}
+```
+This is done by simply adding a `[Differentiable]` attribute to our `square` function.
+
+With that, we can now define `square_bwd_kernel` that performs backward propagation as:
+
+```csharp
+[CudaKernel]
+void square_bwd_kernel(TensorView<float> input, TensorView<float> grad_out, TensorView<float> grad_propagated)
+{
+ uint3 globalIdx = cudaBlockIdx() * cudaBlockDim() + cudaThreadIdx();
+
+ if (globalIdx.x >= input.size(0) || globalIdx.y >= input.size(1))
+ return;
+
+ DifferentialPair<float> dpInput = diffPair(input[globalIdx.xy]);
+ var gradInElem = grad_out[globalIdx.xy];
+ bwd_diff(square)(dpInput, gradInElem);
+ grad_propagated[globalIdx.xy] = dpInput.d;
+}
+```
+
+Note that the function follows the same structure as `square_fwd_kernel`, with the only difference being that
+instead of calling into `square` to compute the forward value for each tensor element, we are calling `bwd_diff(square)`,
+which represents the automatically generated backward propagation function of `square`.
+`bwd_diff(square)` will have the following signature:
+```csharp
+void bwd_diff_square(inout DifferentialPair<float> dpInput, float dOut);
+```
+
+Here, the first parameter, `dpInput`, represents a pair of the original and derivative values for `input`, and the second parameter,
+`dOut`, represents the initial derivative with regard to some latent variable that we wish to back-prop through. The resulting
+derivative will be stored in `dpInput.d`. For example:
+
+```csharp
+// construct a pair where the primal value is 3, and derivative value is 0.
+var dp = diffPair(3.0);
+bwd_diff(square)(dp, 1.0);
+// dp.d is now 6.0
+```
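To make the pair semantics concrete, here is a minimal pure-Python sketch of what the generated backward function for `square` computes. The `DiffPair` class and the hand-written `bwd_diff_square` body are illustrative stand-ins, not Slang's generated code; they assume only the chain rule for `x * x`.

```python
from dataclasses import dataclass


@dataclass
class DiffPair:
    """Stand-in for DifferentialPair<float>: a primal value and its derivative."""
    p: float        # primal (original) value
    d: float = 0.0  # derivative value, written by the backward pass


def bwd_diff_square(dp_input: DiffPair, d_out: float) -> None:
    # d(x * x)/dx = 2x, scaled by the incoming derivative (chain rule).
    dp_input.d = 2.0 * dp_input.p * d_out


dp = DiffPair(3.0)
bwd_diff_square(dp, 1.0)
print(dp.d)  # 6.0
```

The result matches the Slang snippet above: with a primal value of 3 and an incoming derivative of 1, the propagated derivative is 6.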
+
+Similar to `square_fwd`, we can define the host side function `square_bwd` as:
+
+```csharp
+[TorchEntryPoint]
+TorchTensor<float> square_bwd(TorchTensor<float> input, TorchTensor<float> grad_out)
+{
+ var grad_propagated = TorchTensor<float>.zerosLike(input);
+ let blockCount = uint3(1);
+ let groupSize = uint3(input.size(0), input.size(1), 1);
+ __dispatch_kernel(square_bwd_kernel, blockCount, groupSize)(input, grad_out, grad_propagated);
+ return grad_propagated;
+}
+```
+
## Builtin Library Support for PyTorch Interop
As shown in previous tutorial, Slang has defined the `TorchTensor<T>` and `TensorView<T>` type for interop with PyTorch
@@ -526,6 +688,17 @@ Atomically swaps `val` into the element at `index`. Available for `float` and 32
#### `void TensorView<T>.InterlockedCompareExchange(vector<uint, N> index, T compare, T val)`
Atomically swaps `val` into the element at `index` if the element equals `compare`. Available for `float` and 32/64 bit integer types only.
+### `DiffTensorView` methods
+
+#### `float DiffTensorView.load(vector<uint, N> index)`
+Loads the 32-bit floating point data at the specified multi-dimensional `index`. This method is **differentiable**, and in reverse-mode will perform an atomic-add.
+
+#### `void DiffTensorView.store(vector<uint, N> index, float val)`
+Stores the 32-bit floating point value `val` at the specified multi-dimensional `index`. This method is **differentiable**, and in reverse-mode will perform an *atomic exchange* to retrieve the derivative and replace with 0.
+
+#### `uint DiffTensorView.size(int dim)`
+Returns the tensor's size (in number of elements) at `dim`.
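The reverse-mode behaviour described above can be modelled on the CPU. The following is a rough sketch, assuming a sequential loop in place of GPU threads and plain Python lists in place of tensors; `DiffTensorModel`, `load_bwd` and `store_bwd` are hypothetical names used only for this illustration and are not part of Slang or `slangpy`.

```python
class DiffTensorModel:
    """CPU model of a differentiable tensor: primal values plus a gradient buffer."""

    def __init__(self, values, grads):
        self.values = list(values)
        self.grads = list(grads)

    def load(self, i):
        return self.values[i]

    def load_bwd(self, i, d_result):
        # Reverse of load: accumulate into the gradient (an atomic add on the GPU).
        self.grads[i] += d_result

    def store_bwd(self, i):
        # Reverse of store: fetch the derivative and replace it with 0
        # (an atomic exchange on the GPU).
        d = self.grads[i]
        self.grads[i] = 0.0
        return d


def square_bwd(inp, out):
    # Sequential stand-in for launching one thread per element.
    for i in range(len(inp.values)):
        d_result = out.store_bwd(i)           # reverse of output.store(i, result)
        x = inp.load(i)
        inp.load_bwd(i, 2.0 * x * d_result)   # reverse of input.load(i)


inp = DiffTensorModel([0.0, 1.0, 3.0, 4.0, 5.0], [0.0] * 5)
out = DiffTensorModel([0.0] * 5, [1.0] * 5)
square_bwd(inp, out)
print(inp.grads)  # [0.0, 2.0, 6.0, 8.0, 10.0]
print(out.grads)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

This mirrors why the output gradient tensor reads as all zeros after a `.bwd` launch: each derivative is exchanged out of the buffer when it is consumed.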
+
### CUDA Support Functions
#### `cudaThreadIdx()`
@@ -551,8 +724,59 @@ Marks a function for export to Python. Functions marked with `[TorchEntryPoint]`
#### `[CudaDeviceExport]` attribute
Marks a function as a CUDA device function, and ensures that the compiler includes it in the generated CUDA source.
+#### `[AutoPyBindCUDA]` attribute
+Marks a CUDA kernel for automatic binding generation so that it may be invoked from python without having to hand-code the torch entry point. The marked function **must** also be marked with `[CudaKernel]`. If the marked function is also marked with `[Differentiable]`, this will also generate bindings for the derivative methods.
+
+Restriction: methods marked with `[AutoPyBindCUDA]` will not operate
+
## Type Marshalling Between Slang and Python
+
+### Python-CUDA type marshalling for functions using `[AutoPyBindCUDA]`
+
+When using auto-binding, aggregate types like structs are converted to Python `namedtuples` and are made available when using `slangpy.loadModule`.
+
+```csharp
+// mesh.slang
+struct Mesh
+{
+ TensorView<float> vertices;
+ TensorView<int> indices;
+};
+
+[AutoPyBindCUDA]
+[CUDAKernel]
+void processMesh(Mesh mesh)
+{
+ /* ... */
+}
+```
+
+Here, since `Mesh` is used by `processMesh`, the loaded module will provide `Mesh` as a python `namedtuple` with named fields.
+While using the `namedtuple` is the best way to pass structured arguments, they can also be passed as a python `dict` or `tuple`.
+
+```python
+m = slangpy.loadModule('mesh.slang')
+
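+# Placeholder data for illustration only; substitute your own mesh tensors.
+vertices = torch.zeros((3, 3), dtype=torch.float).cuda()
+indices = torch.zeros((1, 3), dtype=torch.int).cuda()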
+
+# use namedtuple to provide structured input.
+mesh = m.Mesh(vertices=vertices, indices=indices)
+m.processMesh(mesh=mesh).launchRaw(blockSize=(32, 32, 1), gridSize=(1, 1, 1))
+
+# use dict to provide input.
+mesh = {'vertices': vertices, 'indices':indices}
+m.processMesh(mesh=mesh).launchRaw(blockSize=(32, 32, 1), gridSize=(1, 1, 1))
+
+# use tuple to provide input (warning: user responsible for right order)
+mesh = (vertices, indices)
+m.processMesh(mesh=mesh).launchRaw(blockSize=(32, 32, 1), gridSize=(1, 1, 1))
+```
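The three argument forms carry the same information; only the way fields are identified differs. As a hedged illustration, the sketch below normalises each form to an ordered tuple of fields. The function `to_field_tuple` is hypothetical and does not describe how `slangpy` marshals arguments internally; plain lists stand in for tensors.

```python
from collections import namedtuple

# Stand-in for the namedtuple that the loaded module would provide.
Mesh = namedtuple("Mesh", ["vertices", "indices"])


def to_field_tuple(arg, field_names):
    """Normalise a namedtuple, dict, or plain tuple to a tuple in field order."""
    if isinstance(arg, dict):
        missing = [name for name in field_names if name not in arg]
        if missing:
            raise KeyError(f"missing fields: {missing}")
        return tuple(arg[name] for name in field_names)
    if hasattr(arg, "_fields"):  # namedtuple: field names can be checked
        if tuple(arg._fields) != tuple(field_names):
            raise TypeError("namedtuple fields do not match the struct")
        return tuple(arg)
    if isinstance(arg, tuple):   # plain tuple: the caller must get the order right
        if len(arg) != len(field_names):
            raise TypeError("wrong number of fields")
        return arg
    raise TypeError(f"unsupported argument type: {type(arg).__name__}")


fields = ["vertices", "indices"]
vertices, indices = [0.0, 1.0, 2.0], [0, 1, 2]

from_namedtuple = to_field_tuple(Mesh(vertices=vertices, indices=indices), fields)
from_dict = to_field_tuple({"indices": indices, "vertices": vertices}, fields)
from_tuple = to_field_tuple((vertices, indices), fields)
print(from_namedtuple == from_dict == from_tuple)  # True
```

The `dict` and `namedtuple` forms are robust to reordering because fields are matched by name, whereas the plain `tuple` form silently depends on position, which is the reason for the ordering warning in the example above.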
+
+
+### Python-CUDA type marshalling for functions using `[TorchEntryPoint]`
+
The return types and parameters types of an exported `[TorchEntryPoint]` function can be a basic type (e.g. `float`, `int` etc.), a vector type (e.g. `float3`), a `TorchTensor<T>` type, an array type, or a struct type.
When you use struct or array types in the function signature, it will be exposed as a Python tuple.