Optimization

The shader described in the first part computes the blur for each pixel by applying the formula:

$$I'(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I(x+i, y+j) \cdot G(i, j)$$

where $G$ is the 2D Gaussian kernel, $k$ is the radius, and $I(x, y)$ is the image to process. The definition implies that for each output pixel, $(2k + 1) \cdot (2k + 1) = \mathcal{O}(k^2)$ pixels are read from the input image.
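To make this cost concrete, the direct formula can be sketched as a plain CPU reference in TypeScript. This is an illustrative sketch, not the article's WebGPU shader: the function names are invented here, grayscale images are modeled as `number[][]`, and clamping at the borders is just one possible edge policy.

```typescript
// Direct 2D Gaussian blur: (2k + 1)^2 input reads per output pixel.
// Illustrative CPU sketch, not the article's WebGPU implementation.
function gaussian2d(i: number, j: number, sigma: number): number {
  return Math.exp(-(i * i + j * j) / (2 * sigma * sigma)) /
    (2 * Math.PI * sigma * sigma);
}

function blur2dDirect(
  image: number[][], // image[y][x], grayscale
  k: number,         // kernel radius
  sigma: number
): number[][] {
  const h = image.length, w = image[0].length;
  const out = image.map(row => row.slice());
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let sum = 0, weightSum = 0;
      for (let i = -k; i <= k; i++) {
        for (let j = -k; j <= k; j++) {
          // Clamp coordinates at the borders (one possible edge policy).
          const sx = Math.min(w - 1, Math.max(0, x + i));
          const sy = Math.min(h - 1, Math.max(0, y + j));
          const g = gaussian2d(i, j, sigma);
          sum += image[sy][sx] * g;
          weightSum += g; // discrete samples don't sum exactly to 1
        }
      }
      out[y][x] = sum / weightSum;
    }
  }
  return out;
}
```

The doubly nested inner loop is exactly the $\mathcal{O}(k^2)$ term that the rest of this article removes.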

Plotting the execution times as a function of the kernel size, a quadratic trend clearly emerges:

Quadratic trend

Benchmark performed with NVIDIA GeForce RTX 4080 Laptop GPU

In this article, we will see how to leverage a particular property of the Gaussian kernel to compute $I'$ more efficiently.

#Separable kernels

A kernel $K(x, y)$ is called separable if it can be expressed as the product of two independent functions:

$$K(x, y) = K_1(x) \cdot K_2(y)$$

The Gaussian kernel $G(x, y)$ has this property; indeed:

$$\begin{align*}
G(x, y) &= \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \\
&= \left[\frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right)\right] \cdot \left[\frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{y^2}{2\sigma^2}\right)\right] \\
&= G_1(x) \cdot G_2(y)
\end{align*}$$

The independent factors $G_1(x)$ and $G_2(y)$ are both 1D Gaussian kernels.
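The factorization can be checked numerically. The following sketch (function names are illustrative) evaluates both sides of the identity and confirms they agree:

```typescript
// Numeric check that the 2D Gaussian factors into two 1D Gaussians:
// G(x, y) = G1(x) * G2(y). Illustrative sketch, names are not from the codebase.
function gauss2d(x: number, y: number, sigma: number): number {
  return Math.exp(-(x * x + y * y) / (2 * sigma * sigma)) /
    (2 * Math.PI * sigma * sigma);
}

function gauss1d(t: number, sigma: number): number {
  return Math.exp(-(t * t) / (2 * sigma * sigma)) /
    Math.sqrt(2 * Math.PI * sigma * sigma);
}
```

For any $(x, y)$, `gauss2d(x, y, s)` and `gauss1d(x, s) * gauss1d(y, s)` agree up to floating-point error, since the two $\frac{1}{\sqrt{2\pi\sigma^2}}$ factors multiply back into $\frac{1}{2\pi\sigma^2}$.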

#Discrete convolution

We define the discrete convolution of $I(x, y)$ with respect to $K(x, y)$ as the operator:

$$(I \circ K)(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I(x+i, y+j) \cdot K(i, j)$$

Given a one-dimensional kernel $K(t)$, we define the partial convolution with respect to the x-axis (or horizontal convolution) of $I$ with respect to $K$:

$$(I \circ_x K)(x, y) = \sum_{i=-k}^{k} I(x+i, y) \cdot K(i)$$

Similarly, we define the partial convolution with respect to the y-axis (or vertical convolution) of $I$ with respect to $K$:

$$(I \circ_y K)(x, y) = \sum_{j=-k}^{k} I(x, y+j) \cdot K(j)$$
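The two partial convolutions translate directly into code. Below is a CPU sketch with illustrative names; the border clamping is an assumption made here so the snippet is self-contained, not a policy taken from the article's shader:

```typescript
// Partial 1D convolutions along x and y, as defined above.
// image[y][x]; a kernel of length 2k + 1 is indexed from -k..k via offset k.
// Illustrative sketch; border handling (clamping) is an assumption.
function convolveX(image: number[][], kernel: number[]): number[][] {
  const k = (kernel.length - 1) / 2;
  const h = image.length, w = image[0].length;
  const out = image.map(row => row.slice());
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let sum = 0;
      for (let i = -k; i <= k; i++) {
        const sx = Math.min(w - 1, Math.max(0, x + i)); // clamp at borders
        sum += image[y][sx] * kernel[i + k];
      }
      out[y][x] = sum;
    }
  }
  return out;
}

function convolveY(image: number[][], kernel: number[]): number[][] {
  const k = (kernel.length - 1) / 2;
  const h = image.length, w = image[0].length;
  const out = image.map(row => row.slice());
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let sum = 0;
      for (let j = -k; j <= k; j++) {
        const sy = Math.min(h - 1, Math.max(0, y + j)); // clamp at borders
        sum += image[sy][x] * kernel[j + k];
      }
      out[y][x] = sum;
    }
  }
  return out;
}
```

Each function reads only $2k + 1$ pixels per output pixel, one row or one column at a time.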

#Convolution of separable kernels

Computing $I'$ is equivalent to the discrete convolution of $I$ with the Gaussian kernel $G(x, y)$:

$$(I \circ G)(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I(x+i, y+j) \cdot G(i, j) = I'(x, y)$$

Since the Gaussian kernel is separable, we can rewrite:

$$I'(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I(x+i, y+j) \cdot G_1(i) \cdot G_2(j)$$

The factor $G_1(i)$ is constant with respect to the inner sum, so we can move it outside:

$$I'(x, y) = \sum_{i=-k}^{k} G_1(i) \cdot \sum_{j=-k}^{k} I(x+i, y+j) \cdot G_2(j)$$

The inner sum represents the pixel $(x + i, y)$ of the vertical convolution of $I$ with respect to $G_2$:

$$(I \circ_y G_2)(x + i, y) = \sum_{j=-k}^{k} I(x+i, y+j) \cdot G_2(j)$$

Substituting:

$$I'(x, y) = \sum_{i=-k}^{k} G_1(i) \cdot (I \circ_y G_2)(x + i, y)$$

The remaining sum is the horizontal convolution of $(I \circ_y G_2)$ with respect to the kernel $G_1$:

$$[(I \circ_y G_2) \circ_x G_1](x, y) = \sum_{i=-k}^{k} G_1(i) \cdot (I \circ_y G_2)(x + i, y) = I'(x, y)$$

The full convolution is therefore equivalent to:

$$I \circ G = (I \circ_y G_2) \circ_x G_1$$

Since each partial convolution has complexity $\mathcal{O}(k)$ per pixel, the combined computation has complexity $\mathcal{O}(k)$ per pixel, a significant advantage over the direct computation, which has complexity $\mathcal{O}(k^2)$ per pixel.
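The equivalence can be verified numerically with a self-contained CPU sketch. All names here are illustrative; zero padding is used outside the image because, with that convention, the two-pass result matches the direct 2D sum exactly (every out-of-bounds sample contributes zero in both formulations):

```typescript
// Checks that two 1D passes reproduce the direct 2D Gaussian convolution.
// Illustrative sketch with zero padding outside the image, not the article's
// WebGPU code.
function g1(t: number, sigma: number): number {
  return Math.exp(-(t * t) / (2 * sigma * sigma)) /
    Math.sqrt(2 * Math.PI * sigma * sigma);
}

function at(img: number[][], x: number, y: number): number {
  // Zero padding outside the image bounds.
  return y >= 0 && y < img.length && x >= 0 && x < img[0].length ? img[y][x] : 0;
}

function direct2d(img: number[][], k: number, sigma: number): number[][] {
  return img.map((row, y) => row.map((_, x) => {
    let sum = 0;
    for (let i = -k; i <= k; i++)
      for (let j = -k; j <= k; j++)
        sum += at(img, x + i, y + j) * g1(i, sigma) * g1(j, sigma); // (2k+1)^2 taps
    return sum;
  }));
}

function twoPass(img: number[][], k: number, sigma: number): number[][] {
  // Vertical pass, then horizontal pass: 2 * (2k + 1) taps per pixel.
  const tmp = img.map((row, y) => row.map((_, x) => {
    let sum = 0;
    for (let j = -k; j <= k; j++) sum += at(img, x, y + j) * g1(j, sigma);
    return sum;
  }));
  return tmp.map((row, y) => row.map((_, x) => {
    let sum = 0;
    for (let i = -k; i <= k; i++) sum += at(tmp, x + i, y) * g1(i, sigma);
    return sum;
  }));
}
```

`twoPass` plays the role of the intermediate working texture used by the GPU implementation below: the first map is the first render pass, the second map the second.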

#Implementation

The optimized version of the Gaussian filter is implemented in the Gauss2dBlurOptimized class, which maintains the same interface as Gauss2dBlur. The main difference lies in the use of two distinct passes to apply the partial convolutions, leveraging an intermediate working texture.

#Working texture

To handle the two partial convolutions, the class uses an intermediate working texture, represented by the private property workingTexture. The texture is dynamically allocated and managed based on the input texture size, as shown in the following code:

src/filter/gauss_blur_2d_optimized.ts
```ts
export class Gauss2dBlurOptimized {
  /* ...Same as Gauss2dBlur... */
  private workingTexture: GPUTexture | undefined;

  private ensureTextures(inputTexture: GPUTexture) {
    if (this.needsRecreation(inputTexture)) {
      this.destroyTextures();
      console.log("Creating textures...");
      this.workingTexture = this.device.createTexture({
        size: {
          width: inputTexture.width,
          height: inputTexture.height
        },
        format: inputTexture.format,
        usage: GPUTextureUsage.RENDER_ATTACHMENT |
          GPUTextureUsage.TEXTURE_BINDING
      });
    }
  }

  private needsRecreation(inputTexture: GPUTexture) {
    return !this.workingTexture ||
      this.workingTexture.width < inputTexture.width ||
      this.workingTexture.height < inputTexture.height;
  }

  destroyTextures() {
    console.log("Destroying textures...");
    this.workingTexture?.destroy();
  }

  destroy() {
    this.destroyTextures();
    this.uniformBuffer?.destroy();
  }
}
```

The ensureTextures() method checks whether the existing working texture is compatible with the input texture. If not, it deallocates the previous one and creates a new one. The input texture format cannot change without recreating the shader, so it is not considered in needsRecreation().

The working texture is destroyed with the destroy() method along with the other allocated resources, or via destroyTextures() which can be used to free memory without invalidating the class.
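The recreation check is pure logic, so it can be sketched (and tested) independently of WebGPU. The `Size2D` interface below is an illustrative stand-in for the width/height of a `GPUTexture`, not a type from the article's codebase:

```typescript
// Pure sketch of the needsRecreation() logic: a cached texture is reusable
// as long as it is at least as large as the input.
// `Size2D` is an illustrative stand-in for GPUTexture dimensions.
interface Size2D { width: number; height: number; }

function needsRecreation(cached: Size2D | undefined, input: Size2D): boolean {
  return !cached ||
    cached.width < input.width ||
    cached.height < input.height;
}
```

Note that a cached texture larger than the input is kept, which avoids reallocations when alternating between a large and a small input.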

#Blur operation

Unlike the basic version, which directly computes the 2D convolution in a single pass, this implementation separates the two convolutions along the axes, using the working texture for the intermediate pass. The following code illustrates the process:

src/filter/gauss_blur_2d_optimized.ts
```ts
export class Gauss2dBlurOptimized {
  static async create(
    device: GPUDevice,
    inputFormat: GPUTextureFormat,
    timer?: GPUTimer
  ) {
    /* ...Same as Gauss2dBlur... */
    // Create uniform buffer for kernel size and blur direction
    const uniformBuffer = device.createBuffer({
      size: 8,
      usage: GPUBufferUsage.UNIFORM |
        GPUBufferUsage.COPY_DST
    });
    /* ...Same as Gauss2dBlur... */
  }

  async blur(
    inputTexture: GPUTexture,
    kernelRadius: number,
    outputTexture?: GPUTexture
  ): Promise<GPUTexture> {
    /* ...Same as Gauss2dBlur... */
    // Ensure that the working texture is usable
    this.ensureTextures(inputTexture);
    // Perform two 1D blur passes
    this.gaussBlur1dPass(inputTexture, this.workingTexture!, kernelRadius, 0);
    this.gaussBlur1dPass(this.workingTexture!, outputTexture, kernelRadius, 1);
    return outputTexture;
  }

  private gaussBlur1dPass(
    inputTex: GPUTexture,
    outTex: GPUTexture,
    kernelRadius: number,
    blurDirection: 0 | 1
  ) {
    const commandEncoder = this.device.createCommandEncoder();
    // Update the uniform buffer
    this.device.queue.writeBuffer(
      this.uniformBuffer!,
      0,
      new Int32Array([kernelRadius, blurDirection])
    );
    /* ...Same as Gauss2dBlur... */
  }
}
```

In the first pass, the horizontal convolution is computed (with blurDirection = 0), and the result is saved in the working texture. In the second pass, the vertical convolution is performed (with blurDirection = 1), producing the final image in the output texture. The uniform buffer now holds the blur direction (horizontal or vertical) in addition to the kernel radius.

#Shader

In the fragment shader, depending on the convolution direction, either the blurW() or blurH() function is called to compute the partial convolution. These functions generate the Gaussian blur kernel and apply the convolution in the appropriate direction:

src/filter/gauss_blur_2d_optimized.ts
```ts
function gauss2dBlurShader(format: GPUTextureFormat): string {
  const formatInfo = TEXTURE_FORMAT_INFO[format]!;
  // language=WGSL
  return `
    /*... no changes ...*/
    struct Params {
      kernelRadius: i32,
      // 0 = W, 1 = H
      direction: i32
    }
    /*... no changes ...*/
    fn blurW(coords: vec2i, stdDev: f32) -> ${formatInfo.texelType} {
      let norm = 1.0 / sqrt(2.0 * PI * stdDev * stdDev);
      // Gaussian blur kernel generation
      var blur = ${castToFloat(formatInfo.texelType)}(0);
      // Since we are discretizing the Gaussian kernel,
      // the sum of the samples won't add up perfectly to 1
      var weightSum = 0.0f;
      for (var i = -params.kernelRadius; i <= params.kernelRadius; i++) {
        let weight = exp(-(f32(i * i) / (2.0 * stdDev * stdDev)));
        let I = textureLoad(inputTexture, coords + vec2i(i, 0), 0).${channelMask(formatInfo.channelsCount)};
        let gij = norm * weight;
        blur += ${castToFloat(formatInfo.texelType)}(I) * gij;
        weightSum += gij;
      }
      // Normalize the result by dividing by the sum of the weights
      blur /= weightSum;
      ${formatInfo.channelsCount === 4 ? "blur.a = 1.0f;" : ""}
      return ${formatInfo.texelType}(blur);
    }
    fn blurH(coords: vec2i, stdDev: f32) -> ${formatInfo.texelType} {
      /*...similar to blurW...*/
    }
    @fragment
    fn fs(@builtin(position) coord: vec4f) -> @location(0) ${formatInfo.texelType} {
      // 99% of Gaussian values fall within 3 * stdDev:
      // P(mu - 3s <= X <= mu + 3s) = 0.9973
      let stdDev = f32(params.kernelRadius) / 3.0;
      let pixelCoords = vec2i(coord.xy - 0.5);
      var blur: ${formatInfo.texelType};
      if (params.direction == 0) {
        blur = blurW(pixelCoords, stdDev);
      } else {
        blur = blurH(pixelCoords, stdDev);
      }
      return blur;
    }
  `;
}
```
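The shader's weighting scheme (the `stdDev = kernelRadius / 3` rule and the running-sum normalization) can be sketched on the CPU. The function name is illustrative; note that the constant `norm` factor the shader multiplies in cancels out in the final division, so it is omitted here:

```typescript
// CPU sketch of the shader's weight generation: stdDev = kernelRadius / 3
// puts ~99.7% of the Gaussian mass inside the radius, and dividing by the
// running weight sum compensates for the discretization error.
// Illustrative function, not part of the article's codebase.
function gaussianWeights(kernelRadius: number): number[] {
  const stdDev = kernelRadius / 3.0;
  const weights: number[] = [];
  let weightSum = 0;
  for (let i = -kernelRadius; i <= kernelRadius; i++) {
    const w = Math.exp(-(i * i) / (2 * stdDev * stdDev));
    weights.push(w);
    weightSum += w;
  }
  // Normalize so the discrete samples sum to 1, as the shader does.
  return weights.map(w => w / weightSum);
}
```

For a radius of 3 this yields 7 symmetric weights summing to 1, matching what each 1D pass of the shader applies along its axis.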

#Benchmarking

WebGPU offers a feature called timestamp queries to measure execution times on the GPU with precision. To use them, it is necessary to enable the timestamp-query feature when requesting the device:

```ts
const device = await adapter.requestDevice({
  requiredFeatures: [
    'timestamp-query'
  ]
});
```

#Query allocation

To allocate a set of timestamp queries, the createQuerySet() method is used:

```ts
const querySet = device.createQuerySet({
  type: 'timestamp',
  count: 2
});
```

To store the query results, a dedicated buffer must be allocated:

```ts
const resolveBuffer = device.createBuffer({
  size: querySet.count * 8,
  usage: GPUBufferUsage.QUERY_RESOLVE |
    GPUBufferUsage.COPY_SRC
});
```

The buffer must support QUERY_RESOLVE usage, and each timestamp occupies 8 bytes (a 64-bit integer, read as a BigInt in JavaScript). To read the timestamps on the CPU, it will be necessary to transfer resolveBuffer into a mappable buffer:

```ts
const stagingBuffer = device.createBuffer({
  size: resolveBuffer.size,
  usage: GPUBufferUsage.COPY_DST |
    GPUBufferUsage.MAP_READ
});
```

#Query configuration

Each render pass can record up to two timestamps (start and end) by configuring the timestampWrites parameter at creation time:

```ts
const encoder = device.createCommandEncoder();
const renderPass = encoder.beginRenderPass({
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1
  }
});
```

In this example, the start timestamp is saved in querySet[0] while the end timestamp is saved in querySet[1].

#Saving timestamps

After completing the render pass, but before calling finish() on the command encoder, it is necessary to use the resolveQuerySet() method to write the results to the destination buffer:

```ts
// ...draw stuff...
renderPass.end();
encoder.resolveQuerySet(
  querySet,        // queries
  0,               // first query
  querySet.count,  // query count
  resolveBuffer,   // destination
  0                // destination offset
);
```

#Reading results

To access the data on the CPU, the resolveBuffer must be copied to the staging buffer:

```ts
encoder.copyBufferToBuffer(
  resolveBuffer,
  0, // source offset
  stagingBuffer,
  0, // destination offset
  stagingBuffer.size
);
device.queue.submit([
  encoder.finish()
]);
```

At this point, stagingBuffer is ready to be mapped on the CPU:

```ts
try {
  await stagingBuffer.mapAsync(GPUMapMode.READ);
  const times = new BigInt64Array(
    stagingBuffer.getMappedRange()
  );
  const elapsedMs = Number(times[1] - times[0]) * 1e-6;
} finally {
  stagingBuffer.unmap();
}
```

Timestamps are expressed in nanoseconds and queries can be reused, but it is essential to release the resources when they are no longer needed:

```ts
resolveBuffer.destroy();
stagingBuffer.destroy();
querySet.destroy();
```
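Since timestamps arrive as 64-bit nanosecond values, the conversion performed after mapping can be isolated into a small helper. This is an illustrative sketch mirroring the read code above, not a function from the article's codebase:

```typescript
// Converts two resolved GPU timestamps (nanoseconds, as 64-bit integers)
// into an elapsed time in milliseconds. Illustrative helper.
function elapsedMs(times: BigInt64Array): number {
  // The subtraction is exact in BigInt; 1 ms = 1e6 ns.
  return Number(times[1] - times[0]) * 1e-6;
}
```

Doing the subtraction in BigInt before converting to Number avoids precision loss: raw GPU timestamps can exceed the 53-bit safe-integer range of a double, while their difference is typically small.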

#The GPUTimer utility

The GPUTimer utility class simplifies the use of timestamp queries, and can be used as follows:

1. Create a timer object from the device:

   ```ts
   const timer = new GPUTimer(device);
   ```

2. Add the timer to the render pass:

   ```ts
   const commandEncoder = device.createCommandEncoder();
   const renderPass = commandEncoder.beginRenderPass({
     ...passDescriptor,
     // Adds timestamp measurements
     ...timer.renderPass
   });
   // ...draw stuff...
   ```

3. Resolve the query set before submitting:

   ```ts
   renderPass.end();
   timer.resolveQuerySet(commandEncoder);
   device.queue.submit([commandEncoder.finish()]);
   ```

4. Read the duration after the submit:

   ```ts
   const duration = await timer.read();
   console.log(`Operation took ${timer.fmtTime(duration)}`);
   ```

5. Release the timer when no longer needed:

   ```ts
   timer.destroy();
   ```

#Final results

With the new implementation, the Gaussian filter is applied in two separate passes, thus optimizing the process:

Linear trend

Benchmark performed with NVIDIA GeForce RTX 4080 Laptop GPU

As can be observed, the new version offers a clear performance advantage, with execution time growing linearly with the kernel size.

#Conclusions and next steps

Gaussian convolution, while being computationally intensive in its direct form, can be optimized through kernel separability, reducing the computational complexity from $\mathcal{O}(k^2)$ to $\mathcal{O}(k)$.

The effectiveness of the optimization was confirmed through precise measurements of execution times, using the timestamp queries provided by WebGPU.

In the third and final part, we will verify the correctness and robustness of the implementation by building a dedicated unit test suite.

Stay tuned! 🚀