Matrix multiplication

4/2/2023

Let's talk about tiled matrix multiplication today. This is an algorithm performed on GPUs due to the parallel nature of matrix multiplication. We will especially look at a method called "tiling," which is used to reduce global memory accesses by taking advantage of the shared memory on the GPU. Tiling can be seen as a way to boost the execution efficiency of the kernel. We will then examine the CUDA kernel code that does exactly what we see in the visualization, which shows what each thread within a block is doing to compute the output. Keep in mind that this post is not meant to teach you CUDA coding; rather, it is meant to help you gain some visual intuition about what each thread is doing in a basic tiled matrix multiplication algorithm.

I strongly believe that writing the code (launching the kernel, index calculations, etc.) will come easily if you understand and can see what you are trying to code.

The efficiency of matrix multiplication is the backbone of everything, where "everything" includes rendering graphics and machine learning. Ever heard of tensors? Yeah... everything is matrix multiplication, I swear.

The main idea of using GPUs for computation is simple: get more work done in less time. Imagine you have an assignment with 4 math problems to solve, each problem taking 1 hour. You could spend 4 hours and do all 4 problems by yourself. But what if you have 3 other friends with the same assignment? Then you tell your friends to each solve 1 problem, and you all share the solutions, because sharing is caring. This means your assignment would be finished in 1 hour. To finish off this analogy, each one of your friends is a worker, a unit of execution: a thread.

When you have a lot of workers (threads) to manage, you might want to organize them in some way. Below is the organization of threads in CUDA terms.

Thread: a single unit of execution. Each thread has its own memory called registers.
Block: a group of threads. All threads in a block have access to a common memory called shared memory.
Grid: a group of blocks. All threads in a grid have access to global memory and constant memory.

Given a 4x4 input matrix A and a 4x4 input matrix B, I want to calculate a 4x4 output matrix C. Since C consists of 16 elements, where each element is computed through a dot product of a row of A and a column of B, let's launch 16 threads, where each thread calculates 1 output element. For the sake of this example, let's say the threads are organized into 2x2 blocks, and there are 4 blocks in a grid. Let's see what each thread within each block is doing.
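To make the one-thread-per-output-element idea concrete, here is a minimal sketch of a naive (untiled) CUDA kernel for the 4x4 example. The kernel name `matMul` and the host-side launch shapes are illustrative choices, not code from this post; it assumes square row-major matrices of width `N`.

```cuda
#include <cuda_runtime.h>

// Naive matrix multiplication: one thread computes one element of C.
// A, B, C are row-major N x N matrices.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // which row of C this thread owns
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // which column of C this thread owns
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];   // dot product: row of A with column of B
        C[row * N + col] = sum;
    }
}

// Launch configuration matching the example: 2x2 threads per block,
// a 2x2 grid of blocks, so 16 threads in total for the 4x4 output.
// dim3 threads(2, 2);
// dim3 blocks(2, 2);
// matMul<<<blocks, threads>>>(d_A, d_B, d_C, 4);
```

Each of the 16 threads reads a full row of A and a full column of B from global memory, which is exactly the redundancy that tiling is meant to reduce.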
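And here is a sketch of the tiled version, where each block first stages a tile of A and a tile of B into shared memory before doing the partial dot products. This is a generic illustration of the technique, not the exact kernel from this post; `TILE` is set to 2 to match the 2x2 blocks above, and the sketch assumes N is a multiple of TILE.

```cuda
#include <cuda_runtime.h>

#define TILE 2  // tile width; matches the 2x2 thread blocks in the example

// Tiled matrix multiplication: each block cooperatively loads one tile of A
// and one tile of B into shared memory, so each global-memory element is
// read once per block instead of once per thread.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Slide across the tiles of A (left to right) and B (top to bottom).
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is in shared memory

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten next iteration
    }
    C[row * N + col] = sum;
}
```

The two `__syncthreads()` calls are what make the cooperation safe: no thread starts multiplying before the tile is fully loaded, and no thread overwrites the tile while a neighbor is still reading it.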