GPU Programs and Memory

While GFunction provides a simple interface for running computations on the GPU, GProgram and GBufferRegion offer more control over GPU memory management and program execution. They are the building blocks for more complex GPU workloads.

GProgram

A GProgram represents a single GPU compute shader. It defines:

Layout - What buffers and uniforms the program needs
Dispatch - How many workgroups to launch
Body - The actual computation performed by each GPU thread

import io.computenode.cyfra.core.GProgram
import io.computenode.cyfra.core.layout.Layout
import io.computenode.cyfra.dsl.{*, given}

case class DoubleLayout(
  input: GBuffer[Float32], 
  output: GBuffer[Float32]
) derives Layout

val doubleProgram: GProgram[Int, DoubleLayout] = GProgram.static[Int, DoubleLayout](
  layout = size => DoubleLayout(
    input = GBuffer[Float32](size), 
    output = GBuffer[Float32](size)
  ),
  dispatchSize = size => size,
): layout =>
  val idx = GIO.invocationId
  GIO.when(idx < 256):
    val value = layout.input.read(idx)
    layout.output.write(idx, value * 2.0f)

Several elements of that snippet may be unfamiliar. The rest of this article covers them:

What are the layout and the dispatch?
What is GIO, and what do members such as invocationId, when, write, and read do?
How do you execute a GProgram, and how does it run?

Layout

The layout is a case class that contains all the buffers (GBuffer) and uniforms (GUniform) your program needs. It must derive Layout for Cyfra to understand its structure.

GBuffer[T] - A GPU buffer that stores an array of values of type T
GUniform[T] - A uniform value (constant for all invocations) of type T, where T must extend GStruct

A Layout is simply a product of GBuffers and GUniforms; it does not assign roles to its fields. Any of them can serve as an input, an output, or an intermediate step in the computation.

Dispatch and Execution Model

GPU programs execute in parallel across many threads. The execution is organized in a hierarchy:

Invocations (threads) - Individual parallel executions of your program body
Workgroups - Groups of invocations that can share memory and synchronize
Dispatch - The total number of workgroups to launch

When you dispatch a program, the GPU runs:

Total invocations = workgroups.x × workgroups.y × workgroups.z × workgroupSize.x × workgroupSize.y × workgroupSize.z
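The formula is easy to sanity-check on the CPU. The following is a plain-Scala sketch that does not depend on Cyfra; the helper name totalInvocations is hypothetical and only mirrors the product above.

```scala
// Hypothetical helper (not part of Cyfra): multiplies out the formula above.
// Long is used because the product can exceed the Int range for large dispatches.
def totalInvocations(
  workgroups: (Int, Int, Int),
  workgroupSize: (Int, Int, Int)
): Long =
  val (gx, gy, gz) = workgroups
  val (sx, sy, sz) = workgroupSize
  gx.toLong * gy * gz * sx * sy * sz

// 2 workgroups of 128 threads each along x
val small: Long = totalInvocations((2, 1, 1), (128, 1, 1)) // 256
// A 2D dispatch: 8 x 8 workgroups of 16 x 16 threads
val grid: Long = totalInvocations((8, 8, 1), (16, 16, 1)) // 16384
```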

GProgram.static

GProgram.static is the simplest way to create a program. You specify how many elements to process, and Cyfra calculates the workgroup count automatically:

val program = GProgram.static[Int, MyLayout](
  layout = size => MyLayout(...),
  dispatchSize = size => size
): layout =>
  // body

For example, with dispatchSize = 256 and default workgroupSize = (128, 1, 1):

Cyfra computes workgroups = ((256 + 127) / 128, 1, 1) = (2, 1, 1)
Total invocations = 2 × 128 = 256 threads running in parallel

Each thread gets a unique GIO.invocationId from 0 to 255.
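The rounding matters when the element count is not a multiple of the workgroup size. Below is a plain-Scala sketch of the same ceiling division. The helper names are hypothetical, and the default size of 128 is taken from the example above.

```scala
// Hypothetical helpers mirroring the arithmetic above; not Cyfra API.
def workgroupCount(elements: Int, workgroupSize: Int = 128): Int =
  (elements + workgroupSize - 1) / workgroupSize // ceiling division

def launchedInvocations(elements: Int, workgroupSize: Int = 128): Int =
  workgroupCount(elements, workgroupSize) * workgroupSize

val exact: Int = workgroupCount(256)         // 2 workgroups, 256 invocations
val padded: Int = workgroupCount(300)        // 3 workgroups
val launched: Int = launchedInvocations(300) // 384 invocations for 300 elements
val idle: Int = launched - 300               // 84 invocations with no element to process
```

With 300 elements, 84 invocations have nothing to process. This is presumably why the program bodies in this article wrap their work in a bounds check such as GIO.when(idx < 256).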

GProgram.apply

For full control over dispatch, use the explicit API:

import io.computenode.cyfra.core.GProgram.{StaticDispatch, DynamicDispatch}

val program = GProgram[Int, MyLayout](
  layout = size => MyLayout(...),
  dispatch = (layout, size) => StaticDispatch(((size + 255) / 256, 1, 1)),
  workgroupSize = (256, 1, 1),
): layout =>
  // body

Dispatch options:

StaticDispatch((x, y, z)) - Fixed number of workgroups, computed on the CPU before the dispatch is issued
DynamicDispatch(buffer, offset) - Workgroup count read from a GPU buffer at runtime (for indirect dispatch)

GIO - Mutable operations

GIO represents GPU operations. Because it is a monad, you compose multiple operations with for/yield or flatMap/map:

layout =>
  val idx = GIO.invocationId
  val value = layout.input.read(idx)
  for
    _ <- layout.outputA.write(idx, value) // Buffer writes return GIOs
    _ <- layout.outputB.write(idx, value * 2.0f)
  yield ()
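To see why chaining matters, here is a toy model in plain Scala. Op is a hypothetical stand-in that records writes in a log. It is not Cyfra's GIO, and whether Cyfra treats unchained operations in exactly this way is an assumption.

```scala
// Toy stand-in for a GIO-like type: a value that *describes* writes
// instead of performing them. Hypothetical; not Cyfra's implementation.
final case class Op[A](log: Vector[String], value: A):
  def flatMap[B](f: A => Op[B]): Op[B] =
    val next = f(value)
    Op(log ++ next.log, next.value) // keep both descriptions, in order
  def map[B](f: A => B): Op[B] = Op(log, f(value))

def write(buffer: String, idx: Int, v: Int): Op[Unit] =
  Op(Vector(s"$buffer[$idx] = $v"), ())

// Chained: both writes are part of the resulting description.
val chained: Op[Unit] =
  for
    _ <- write("outputA", 0, 1)
    _ <- write("outputB", 0, 2)
  yield ()

// Not chained: the first description is built and then discarded,
// so only the last expression survives as the block's result.
val dropped: Op[Unit] =
  write("outputA", 0, 1)
  write("outputB", 0, 2)
```

In this model chained.log holds two entries and dropped.log holds only the write to outputB. A value that merely describes an effect has to be threaded through flatMap to take part in the result.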

GBufferRegion

Now that we understand GProgram, we need a way to execute it. GBufferRegion manages GPU memory allocation and provides a structured way to:

Allocate GPU buffers
Run programs that use those buffers
Read results back to CPU

doubleProgram and DoubleLayout are the definitions from the GProgram section above.

import io.computenode.cyfra.core.GBufferRegion
import io.computenode.cyfra.runtime.VkCyfraRuntime

@main
def run = VkCyfraRuntime.using:
  val size = 256
  val inputData = (0 until size).map(_.toFloat).toArray
  val results = Array.ofDim[Float](size)

  val region = GBufferRegion
    .allocate[DoubleLayout]
    .map: layout =>
      doubleProgram.execute(size, layout)

  region.runUnsafe(
    // Init values on start of the pipeline
    init = DoubleLayout(
      input = GBuffer(inputData),
      output = GBuffer[Float32](size),
    ),
    // Read results when done
    onDone = layout => layout.output.readArray(results),
  )

  println(s"Results: ${results.take(5).mkString(", ")}...")
  // Results: 0.0, 2.0, 4.0, 6.0, 8.0...
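When validating a GPU program, it helps to compare its output against a CPU reference. The sketch below is plain Scala with hypothetical helper names (doubleOnCpu, approxEqual). It reproduces the values expected from the example above without touching the GPU.

```scala
// CPU reference for the doubling program; hypothetical helper, not Cyfra API.
def doubleOnCpu(input: Array[Float]): Array[Float] = input.map(_ * 2.0f)

// Compares two arrays element-wise within an absolute tolerance.
def approxEqual(a: Array[Float], b: Array[Float], eps: Float = 1e-6f): Boolean =
  a.length == b.length && a.indices.forall(i => math.abs(a(i) - b(i)) <= eps)

val referenceInput: Array[Float] = (0 until 256).map(_.toFloat).toArray
val expected: Array[Float] = doubleOnCpu(referenceInput)
val firstFive: String = expected.take(5).mkString(", ") // "0.0, 2.0, 4.0, 6.0, 8.0" on the JVM
```

After runUnsafe returns, a check such as approxEqual(results, expected) would confirm that the GPU output matches.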

How GBufferRegion Works

allocate[Layout] - Declares what buffers need to be allocated
map - Chains operations (like executing programs) on the allocated buffers
runUnsafe - Actually allocates GPU memory, runs the computation, and cleans up

The init parameter provides initial data for buffers:

GBuffer(array) - Initialize a buffer with data from a Scala array
GBuffer[T](size) - Allocate an empty buffer of the given size
GUniform(value) - Initialize a uniform with a struct value

The onDone callback lets you read results:

buffer.readArray(targetArray) - Copy GPU buffer contents to a Scala array

Complete Example: Parameterized Program

Here's a complete example with a uniform parameter:

import io.computenode.cyfra.core.{GBufferRegion, GProgram}
import io.computenode.cyfra.core.layout.Layout
import io.computenode.cyfra.dsl.{*, given}
import io.computenode.cyfra.runtime.VkCyfraRuntime

// Define a struct for uniform parameters
case class MulParams(factor: Float32) extends GStruct[MulParams]

// Define the program layout
case class MulLayout(
  input: GBuffer[Float32], 
  output: GBuffer[Float32], 
  params: GUniform[MulParams]
) derives Layout

// Create the program
val mulProgram: GProgram[Int, MulLayout] = GProgram.static[Int, MulLayout](
  layout = size => MulLayout(
    input = GBuffer[Float32](size), 
    output = GBuffer[Float32](size), 
    params = GUniform[MulParams]()
  ),
  dispatchSize = size => size,
): layout =>
  val idx = GIO.invocationId
  GIO.when(idx < 256):
    val value = layout.input.read(idx)
    val factor = layout.params.read.factor
    layout.output.write(idx, value * factor)

@main
def runMultiply(): Unit = VkCyfraRuntime.using:
  val size = 256
  val inputData = (0 until size).map(_.toFloat).toArray
  val results = Array.ofDim[Float](size)

  val region = GBufferRegion
    .allocate[MulLayout]
    .map: layout =>
      mulProgram.execute(size, layout)

  region.runUnsafe(
    init = MulLayout(
      input = GBuffer(inputData),
      output = GBuffer[Float32](size),
      params = GUniform(MulParams(3.0f)), // Multiply by 3
    ),
    onDone = layout => layout.output.readArray(results),
  )

This example also shows how a GUniform passes a single value to the program, in contrast to a GBuffer, which represents an array of values.

Running Multiple Programs

You can run multiple programs within a single GBufferRegion.map block. The key insight is that .execute returns the Layout, so you can use the output buffers from one program as input to the next.

import io.computenode.cyfra.core.{GBufferRegion, GProgram}
import io.computenode.cyfra.core.layout.Layout
import io.computenode.cyfra.dsl.{*, given}
import io.computenode.cyfra.runtime.VkCyfraRuntime

// Program 1: Double the input
case class DoubleLayout(input: GBuffer[Float32], output: GBuffer[Float32]) derives Layout

val doubleProgram = GProgram.static[Int, DoubleLayout](
  layout = size => DoubleLayout(GBuffer[Float32](size), GBuffer[Float32](size)),
  dispatchSize = size => size
): layout =>
  val idx = GIO.invocationId
  GIO.when(idx < 256):
    val value = layout.input.read(idx)
    layout.output.write(idx, value * 2.0f)

// Program 2: Add a constant
case class AddParams(addend: Float32) extends GStruct[AddParams]
case class AddLayout(input: GBuffer[Float32], output: GBuffer[Float32], params: GUniform[AddParams]) derives Layout

val addProgram = GProgram.static[Int, AddLayout](
  layout = size => AddLayout(GBuffer[Float32](size), GBuffer[Float32](size), GUniform[AddParams]()),
  dispatchSize = size => size
): layout =>
  val idx = GIO.invocationId
  GIO.when(idx < 256):
    val value = layout.input.read(idx)
    val addend = layout.params.read.addend
    layout.output.write(idx, value + addend)

// Combined layout with intermediate buffer
case class PipelineLayout(
  input: GBuffer[Float32],
  intermediate: GBuffer[Float32],  // Output of program 1, input of program 2
  output: GBuffer[Float32],
  addParams: GUniform[AddParams]
) derives Layout

@main
def runPipeline(): Unit = VkCyfraRuntime.using:
  val size = 256
  val inputData = (0 until size).map(_.toFloat).toArray
  val results = Array.ofDim[Float](size)

  val region = GBufferRegion
    .allocate[PipelineLayout]
    .map: layout =>
      // Program 1: input -> intermediate (doubles values)
      val afterDouble = doubleProgram.execute(size, DoubleLayout(layout.input, layout.intermediate))
      
      // Program 2: use intermediate from afterDouble as input, write to output
      addProgram.execute(size, AddLayout(afterDouble.output, layout.output, layout.addParams))
      
      // Return the original layout
      layout

  region.runUnsafe(
    init = PipelineLayout(
      input = GBuffer(inputData),
      intermediate = GBuffer[Float32](size),
      output = GBuffer[Float32](size),
      addParams = GUniform(AddParams(10.0f)),
    ),
    onDone = layout => layout.output.readArray(results),
  )

This example introduces an intermediate buffer that is neither an input nor an output of the GPU pipeline. It is used only to transfer data between programs.
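The data flow of the pipeline can be modeled on the CPU to predict the results. This plain-Scala sketch is not Cyfra code; the array names are illustrative, with pipelineIntermediate playing the role of the intermediate buffer.

```scala
// CPU model of the two stages with the same inputs as the example above.
val addend: Float = 10.0f
val pipelineInput: Array[Float] = (0 until 256).map(_.toFloat).toArray
val pipelineIntermediate: Array[Float] = pipelineInput.map(_ * 2.0f)    // stage 1: double
val pipelineOutput: Array[Float] = pipelineIntermediate.map(_ + addend) // stage 2: add constant
val firstThree: String = pipelineOutput.take(3).mkString(", ")          // "10.0, 12.0, 14.0" on the JVM
```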

Using java.nio's ByteBuffers

While Scala arrays are convenient for small datasets, ByteBuffer provides a more efficient alternative for large data or when integrating with native libraries.

Initializing GBuffer from ByteBuffer

Use GBuffer[T](byteBuffer) to create a buffer from existing data:

import java.nio.{ByteBuffer, ByteOrder}

val data = (0 until 1024).toArray
val byteBuffer = ByteBuffer.allocateDirect(data.length * 4).order(ByteOrder.nativeOrder())
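// The IntBuffer view has its own position, so byteBuffer itself stays at
// position 0 with limit = capacity. Calling flip() here would set its limit to 0.
byteBuffer.asIntBuffer().put(data)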

region.runUnsafe(
  init = MyLayout(
    input = GBuffer[Int32](byteBuffer),  // Initialize from ByteBuffer
    output = GBuffer[Int32](1024),
  ),
  onDone = layout => ...
)
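One java.nio detail is relevant when filling a buffer through a typed view such as asIntBuffer(). The view tracks its own position and limit, so writing through it leaves the parent ByteBuffer's position untouched. The sketch below uses only the JDK; packInts is a hypothetical helper name.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper: packs Ints into a direct buffer in native byte order.
def packInts(values: Array[Int]): ByteBuffer =
  val bytes = ByteBuffer.allocateDirect(values.length * 4).order(ByteOrder.nativeOrder())
  bytes.asIntBuffer().put(values) // advances the view's position, not the parent's
  bytes                           // still position = 0, limit = capacity

val packed: ByteBuffer = packInts(Array(10, 20, 30, 40))
val remainingAfterViewPut: Int = packed.remaining()               // 16 bytes readable
// flip() sets limit = position (0 here), which hides all of the data.
// A duplicate is flipped so that `packed` itself is left intact.
val remainingAfterFlip: Int = packed.duplicate().flip().remaining() // 0
```

Flipping the parent after a view write therefore yields an empty buffer. flip() is appropriate only when the data was written through the parent's own relative put methods.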

Reading Results to ByteBuffer

Use buffer.read(byteBuffer) to copy GPU data directly into a ByteBuffer:

import java.nio.{ByteBuffer, ByteOrder}

val resultBuffer = ByteBuffer.allocateDirect(1024 * 4).order(ByteOrder.nativeOrder())

region.runUnsafe(
  init = MyLayout(...),
  onDone = layout => layout.output.read(resultBuffer),  // Read into ByteBuffer
)

// Access results
val intView = resultBuffer.asIntBuffer()
val firstValue = intView.get(0)
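To get the results back into a Scala array, a bulk copy through a typed view avoids element-by-element access. This is a JDK-only sketch; toIntArray is a hypothetical helper, and it assumes the whole capacity of the buffer holds valid Int data in the buffer's own byte order.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper: copies every Int in the buffer into a new array
// without disturbing the caller's position or limit.
def toIntArray(buffer: ByteBuffer): Array[Int] =
  // duplicate() shares the bytes but not the byte order setting, so restore it.
  val whole = buffer.duplicate().order(buffer.order())
  whole.clear() // position = 0, limit = capacity; the contents are not erased
  val ints = whole.asIntBuffer()
  val out = new Array[Int](ints.remaining())
  ints.get(out) // bulk copy
  out
```

Usage would be along the lines of val results = toIntArray(resultBuffer) once runUnsafe has returned.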

Writing to GPU from ByteBuffer

For dynamic updates, use buffer.write(byteBuffer):

val updateBuffer = ByteBuffer.allocateDirect(1024 * 4).order(ByteOrder.nativeOrder())
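// newData is an Array[Float] defined elsewhere. The FloatBuffer view has its own
// position, so updateBuffer stays at position 0 and needs no flip().
updateBuffer.asFloatBuffer().put(newData)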

// Inside onDone or map block:
layout.someBuffer.write(updateBuffer)

When to Use ByteBuffers

Large datasets - Avoids array copy overhead
Native interop - Works with LWJGL, JNI, or memory-mapped files
Streaming data - Reuse buffers across multiple runs
Custom layouts - Direct control over memory layout and byte ordering
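As an illustration of the streaming case, one direct buffer can be allocated once and refilled for each run. The sketch below uses only the JDK. refill is a hypothetical helper, and whether Cyfra honors the buffer's limit when uploading is an assumption that should be checked against the library.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper: refills one preallocated direct buffer instead of
// allocating a new one per run.
def refill(staging: ByteBuffer, chunk: Array[Float]): ByteBuffer =
  require(chunk.length * 4 <= staging.capacity(), "chunk does not fit in staging buffer")
  staging.clear()                    // position = 0, limit = capacity; contents are not erased
  staging.asFloatBuffer().put(chunk) // write through a view; parent position stays 0
  staging.limit(chunk.length * 4)    // expose only the bytes that were just written
  staging

val staging: ByteBuffer = ByteBuffer.allocateDirect(1024 * 4).order(ByteOrder.nativeOrder())
val firstRun: Int = refill(staging, Array(1.0f, 2.0f)).remaining()        // 8 bytes
val secondRun: Int = refill(staging, Array(3.0f, 4.0f, 5.0f)).remaining() // 12 bytes
```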

For more complex compositions with better type safety and reusability, see GPU Pipelines.
