Timestamp queries
The code described in this section can be run from the
timestamp_queries.rs example with
cargo run --features derive --example timestamp_queries
. Complete source code is also provided at the
end of this page.
Timestamp queries are a bit impractical to use at the moment. Their API is still being worked on.
Timestamp queries aim to answer the question: "How long did this compute operation run on the GPU?" They are essentially a basic, occasionally unreliable, benchmarking tool for measuring shader runtimes. This is achieved by:
- Instructing the GPU to insert timestamps at specific locations of the execution pipeline.
- Instructing the GPU to resolve the queries.
- Reading back the timestamp values from GPU memory to CPU land.
- Converting the timestamps to milliseconds and subtracting them to obtain the runtimes.
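The arithmetic behind step 4 can be sketched as follows. Raw timestamps are counts of GPU ticks; wgpu's `Queue::get_timestamp_period()` reports the tick duration in nanoseconds, so a runtime in milliseconds is `(end - start) * period / 1e6`. Note that `GpuTimestamps::wait_for_results_ms` (shown below) performs this conversion for you; this standalone sketch, with hypothetical tick values, only illustrates the math:

```rust
// Sketch of step 4 with made-up numbers: convert two raw timestamp ticks
// into a runtime in milliseconds. On real hardware the tick period comes
// from `wgpu::Queue::get_timestamp_period()` (nanoseconds per tick).
fn ticks_to_ms(start_ticks: u64, end_ticks: u64, period_ns: f32) -> f64 {
    (end_ticks - start_ticks) as f64 * period_ns as f64 / 1.0e6
}

fn main() {
    // Hypothetical values: a 1 ns tick period, pass spanning 2_500_000 ticks.
    let ms = ticks_to_ms(1_000_000, 3_500_000, 1.0);
    println!("{ms}"); // 2.5
}
```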
Platform support for timestamp queries can be very limited, so it is advised not to abort the execution of your application if they are not supported.
Be sure to enable the wgpu::Features::TIMESTAMP_QUERY
feature when initializing your gpu device
(in DeviceDescriptor::required_features
).
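If you initialize the device manually rather than through `GpuInstance`, the feature can be requested along these lines. This is only a sketch: the exact `DeviceDescriptor` fields and `request_device` signature vary slightly across wgpu versions, so adjust it to the version you use.

```rust
// Sketch: request the TIMESTAMP_QUERY feature only when the adapter supports
// it, so the application keeps running (without timings) otherwise.
// Field names follow recent wgpu versions; adjust for the version you use.
async fn request_device_with_timestamps(
    adapter: &wgpu::Adapter,
) -> Result<(wgpu::Device, wgpu::Queue), wgpu::RequestDeviceError> {
    let supported = adapter
        .features()
        .contains(wgpu::Features::TIMESTAMP_QUERY);
    let required_features = if supported {
        wgpu::Features::TIMESTAMP_QUERY
    } else {
        // Fall back to no extra features instead of aborting.
        wgpu::Features::empty()
    };
    adapter
        .request_device(
            &wgpu::DeviceDescriptor {
                label: Some("device with timestamp queries"),
                required_features,
                ..Default::default()
            },
            None,
        )
        .await
}
```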
The wgcore
crate exposes the GpuTimestamps
structure that helps with all four steps mentioned above:
// Initialize a query set that can contain up to 2 timestamps
// (here we need only 2 timestamps since we are measuring a single compute pass).
let mut timestamps = GpuTimestamps::new(gpu.device(), 2);
// Initialize the invocation queue as usual.
let mut queue = KernelInvocationQueue::new(gpu.device());
// ## Step 1 ##
// Start a compute pass with a custom name. The `true` indicates that timestamps
// will have to be pushed for this pass.
queue.compute_pass("timestamp_queries_test", true);
// TODO: queue one or multiple kernels here.
let mut encoder = gpu.device().create_command_encoder(&Default::default());
// ## STEP 1 (bis) ##
// When encoding the queue, pass the `GpuTimestamps` to actually create our timestamps.
// Note that if `None` is given instead, timestamps will be ignored.
queue.encode(&mut encoder, Some(&mut timestamps));
// ## STEP 2 ##
// Resolve the timestamp queries.
timestamps.resolve(&mut encoder);
// Submit the calculations.
gpu.queue().submit(Some(encoder.finish()));
// ## STEP 3 & 4 ##
// Read back the timestamp values, in milliseconds.
// The compute pass’s runtime is given by `timestamps_read[1] - timestamps_read[0]`.
let timestamps_read = timestamps.wait_for_results_ms(gpu.device(), gpu.queue());
Each compute pass requires 2 timestamps for measuring its runtime: one for when the pass starts, and one for when it ends. The actual runtime of the compute pass is the difference between the two.
Note that there is currently no way to know which timestamp is related to which compute pass unless you know exactly in
which order all the calls to queue.compute_pass
happened: their timestamps will be in the same order.
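Concretely, since the pass declared by the `i`-th call to `queue.compute_pass` (counting from zero) owns timestamps `2 * i` (start) and `2 * i + 1` (end), the per-pass runtimes can be recovered from the flat readback array. A minimal sketch with hypothetical millisecond values:

```rust
// Sketch: given timestamps (in ms) read back for several compute passes,
// recover each pass's runtime. Pass `i` (in `compute_pass` call order) owns
// timestamps `2 * i` (start) and `2 * i + 1` (end).
fn pass_runtimes_ms(timestamps_ms: &[f64]) -> Vec<f64> {
    timestamps_ms
        .chunks_exact(2)
        .map(|pair| pair[1] - pair[0])
        .collect()
}

fn main() {
    // Hypothetical readback for two passes: [start0, end0, start1, end1].
    let read = [0.0, 1.5, 1.5, 4.0];
    println!("{:?}", pass_runtimes_ms(&read)); // [1.5, 2.5]
}
```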
The examples in this page only work in a native program. On WASM, reading the timestamps from the gpu can be a bit more
convoluted due to the async nature of web platforms. See how it is done in the wgsparkl
crate as an example.
Complete example
- main.rs
- kernel.wgsl
#[cfg(not(feature = "derive"))]
std::compile_error!(
r#"
###############################################################
## The `derive` feature must be enabled to run this example. ##
###############################################################
"#
);
use nalgebra::{DVector, Vector4};
use std::ops::Div;
use wgcore::composer::ComposerExt;
use wgcore::gpu::GpuInstance;
use wgcore::hot_reloading::HotReloadState;
use wgcore::kernel::{KernelInvocationBuilder, KernelInvocationQueue};
use wgcore::tensor::{GpuScalar, GpuVector};
use wgcore::timestamps::GpuTimestamps;
use wgcore::Shader;
use wgpu::{BufferUsages, ComputePipeline};
#[derive(Shader)]
#[shader(src = "timestamp_queries.wgsl", composable = false)]
struct ShaderTimestampQueries {
main: ComputePipeline,
}
#[async_std::main]
async fn main() -> anyhow::Result<()> {
// Initialize the gpu device and its queue.
//
// Note that `GpuInstance` is just a simple helper struct for initializing the gpu resources.
// You are free to initialize them independently if more control is needed, or reuse the ones
// that were already created/owned by e.g., a game engine.
let gpu = GpuInstance::new().await?;
// Load and compile our kernel. The `from_device` function was generated by the `Shader` derive.
// Note that its dependency to `Composable` is automatically resolved by the `Shader` derive
// too.
let mut kernel = ShaderTimestampQueries::from_device(gpu.device())?;
// Create the buffers.
const LEN: u32 = 2_000_000;
let buffer = GpuVector::init(
gpu.device(),
vec![0u32; LEN as usize],
BufferUsages::STORAGE | BufferUsages::COPY_SRC,
);
// Init hot-reloading.
// We are setting up hot-reloading so that we can change some elements in the shader
// (like the iteration count) and see how that affects performance live.
let mut hot_reload = HotReloadState::new()?;
ShaderTimestampQueries::watch_sources(&mut hot_reload)?;
// Init timestamp queries.
// To measure the time of one kernel, we need two timestamps (one for when it starts and one for
// when it ends).
let mut timestamps = GpuTimestamps::new(gpu.device(), 2);
// Queue the operation.
println!("#############################");
println!("Edit the file `timestamp_queries.wgsl` (for example by multiplying or dividing NUM_ITERS by 10).\nThe updated runtime will be printed below whenever a change is detected.");
println!("#############################");
for loop_id in 0.. {
// Detect & apply changes.
hot_reload.update_changes();
match kernel.reload_if_changed(gpu.device(), &hot_reload) {
Ok(changed) => {
if changed {
// Clear the timestamps to reuse in the next loop.
timestamps.clear();
// We detected a change (or this is the first loop).
// Read the result.
let mut queue = KernelInvocationQueue::new(gpu.device());
// Declare a compute pass with timestamps enabled.
queue.compute_pass("timestamp_queries_test", true);
KernelInvocationBuilder::new(&mut queue, &kernel.main)
.bind0([buffer.buffer()])
.queue(LEN.div_ceil(64));
// Encode & submit the operation to the gpu.
let mut encoder = gpu.device().create_command_encoder(&Default::default());
// Run our kernel.
queue.encode(&mut encoder, Some(&mut timestamps));
// Resolve the timestamp queries.
timestamps.resolve(&mut encoder);
gpu.queue().submit(Some(encoder.finish()));
// Read and print the kernel’s runtime.
let timestamps_read = timestamps.wait_for_results_ms(gpu.device(), gpu.queue());
println!(
"Current run time: {}ms",
timestamps_read[1] - timestamps_read[0]
);
}
}
Err(e) => {
// Hot-reloading failed, likely due to a syntax error in the shader.
println!("Hot reloading error: {:?}", e);
}
}
}
Ok(())
}
@group(0) @binding(0)
var<storage, read_write> a: array<u32>;
@compute @workgroup_size(64, 1, 1)
fn main(@builtin(global_invocation_id) invocation_id: vec3<u32>) {
let i = invocation_id.x;
if i < arrayLength(&a) {
const NUM_ITERS: u32 = 10000u;
for (var k = 0u; k < NUM_ITERS; k++) {
a[i] = collatz_iterations(a[i] * 7919);
}
}
}
// This is taken from the wgpu "hello_compute" example:
// https://github.com/gfx-rs/wgpu/blob/6f5014f0a3441bcbc3eb4223aee454b95904b087/examples/src/hello_compute/shader.wgsl
// (Apache 2 / MIT license)
//
// The Collatz Conjecture states that for any integer n:
// If n is even, n = n/2
// If n is odd, n = 3n+1
// And repeat this process for each new n, you will always eventually reach 1.
// Though the conjecture has not been proven, no counterexample has ever been found.
// This function returns how many times this recurrence needs to be applied to reach 1.
fn collatz_iterations(n_base: u32) -> u32 {
var n: u32 = n_base;
var i: u32 = 0u;
loop {
if (n <= 1u) {
break;
}
if (n % 2u == 0u) {
n = n / 2u;
}
else {
// Overflow? (i.e. 3*n + 1 > 0xffffffffu?)
if (n >= 1431655765u) { // 0x55555555u
return 4294967295u; // 0xffffffffu
}
n = 3u * n + 1u;
}
i = i + 1u;
}
return i;
}