
Estimating Pi using the Monte Carlo Method with Nvidia GPU

Last updated on January 9, 2021

What does it feel like to command 1280 parallel cores? Having earlier used the GPU for machine learning with Keras, I wanted to have a look at programming the GPU directly in C/C++. This is my first attempt at GPU programming with the Nvidia CUDA framework. I'm using an Nvidia 1060 GPU, but I guess this would work on others as well.

Here is my code:

#include <stdio.h>
#include <cuda.h>
#include <curand_kernel.h>

// Each thread draws `iter` random points in the unit square and counts how
// many fall inside the quarter circle. The per-thread counts are then reduced
// within each block so that palluu[blockIdx.x] ends up holding the block total.
__global__ void mc_pi(unsigned long long *countti, long iter, unsigned long long *palluu){
        int ii;
        curandState_t rng;
        int tid = threadIdx.x + blockIdx.x * blockDim.x;

        countti[tid] = 0;
        curand_init(clock64(), tid, 0, &rng);
        for (ii = 0; ii < iter; ii++) {
                float x = curand_uniform(&rng);
                float y = curand_uniform(&rng);
                if ((x * x + y * y) <= 1) {
                        countti[tid] += 1;
                }
        }

        // Tree reduction within the block. Because blockDim.x is 256, tid and
        // threadIdx.x agree in their low 8 bits, so the bitwise masks (which
        // replace modulo tests) never reach outside the block.
        __syncthreads();
        if (!(tid & 1))   { countti[tid] += countti[tid + 1]; }
        __syncthreads();
        if (!(tid & 3))   { countti[tid] += countti[tid + 2]; }
        __syncthreads();
        if (!(tid & 7))   { countti[tid] += countti[tid + 4]; }
        __syncthreads();
        if (!(tid & 15))  { countti[tid] += countti[tid + 8]; }
        __syncthreads();
        if (!(tid & 31))  { countti[tid] += countti[tid + 16]; }
        __syncthreads();
        if (!(tid & 63))  { countti[tid] += countti[tid + 32]; }
        __syncthreads();
        if (!(tid & 127)) { countti[tid] += countti[tid + 64]; }
        __syncthreads();
        if (threadIdx.x == 0) {
                countti[tid] += countti[tid + 128];
                palluu[blockIdx.x] = countti[tid];
        }
}

// Reduce the 3906 per-block sums in palluu down to palluu[0]. The kernel is
// launched as a single block of 651 threads; each thread handles six elements
// (6 * 651 = 3906). This version still uses modulo in the tree reduction.
__global__ void reduce_blocks(unsigned long long *palluu){
        int tid = threadIdx.x;
        int i = 0;
        int oso = 0;

        for (int offset = 1; offset < 3906; offset *= 2) {
                for (i = 0; i < 6; i++) {
                        oso = tid + (i * blockDim.x);
                        // The oso + offset bound keeps the last few partial sums
                        // from reading past the end of the 3906-element array.
                        if (oso % (2 * offset) == 0 && oso + offset < 3906) {
                                palluu[oso] = palluu[oso] + palluu[oso + offset];
                        }
                }
                __syncthreads();
        }
}



int main() {
        int N = 999936;                 // total number of GPU threads (3906 * 256)
        int blocks = (N + 255) / 256;
        int threadsperblock = 256;
        int iterations = 1000000;       // samples per thread
        unsigned long long *countti;    // per-thread hit counts (device)
        unsigned long long *resu;       // per-block sums (device)
        unsigned long long hostresu2;   // final total on the host

        printf("blocks:%d, threadsperblock:%d\n", blocks, threadsperblock);
        cudaMalloc(&countti, N * sizeof(unsigned long long));
        cudaMalloc(&resu, blocks * sizeof(unsigned long long));

        mc_pi<<<blocks, threadsperblock>>>(countti, iterations, resu);
        reduce_blocks<<<1, 651>>>(resu);        // 651 threads * 6 elements = 3906 block sums

        cudaMemcpy(&hostresu2, resu, sizeof(unsigned long long), cudaMemcpyDeviceToHost);

        // Cast before multiplying so the sample count is computed in 64 bits.
        printf("Estimating Pi using the Monte Carlo Method in Nvidia 1060 parallel computation\n: %.15f, based on %llu iterations\n",
               (double)hostresu2 / threadsperblock / blocks / iterations * 4,
               (unsigned long long)threadsperblock * blocks * iterations);

        cudaFree(countti);
        cudaFree(resu);
        return 0;
}

I'm using two kernel functions: mc_pi does the actual Monte Carlo sampling for pi and also reduces the results at the CUDA block level; the second kernel, reduce_blocks, then reduces the per-block sums over all 3906 blocks. The block size and the number of blocks were quickly picked to get close to one million threads in total (3906 × 256 = 999,936).
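One thing the program above does not do is check whether the kernel launches actually succeeded. When experimenting, a minimal check like the following after the two launches (my addition, not part of the listing above) can save some head-scratching:

cudaError_t err = cudaGetLastError();                       // errors from the launches themselves
if (err == cudaSuccess) { err = cudaDeviceSynchronize(); }  // errors raised during execution
if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); }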

Compilation, assuming you have the Nvidia CUDA toolkit installed:

nvcc -o pi4cu pi4.cu 
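nvcc compiles for a default architecture; if you want, you can also target the 1060's compute capability (6.1) explicitly:

nvcc -arch=sm_61 -o pi4cu pi4.cu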

Performance:

time ./pi4cu 
blocks:3906, threadsperblock:256
Estimating Pi using the Monte Carlo Method in Nvidia 1060 parallel computation
: 3.141592344990079, based on 3503587328 iterations

real    0m27.299s
user    0m18.580s
sys     0m8.704s

So it takes 27 seconds for 3.5 billion iterations, maxing the GPU at 100%. Compared against my single-threaded Python version of the pi Monte Carlo function below, the performance improvement of the GPU version is 87x. I'm almost sure this is still far from optimal. I read, for example, that the GPU isn't fast at modulo operations in the reducers, so in mc_pi I already changed the modulos to bitwise operators, but I haven't yet done the same for the block-level reducer.
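For completeness, applying the same change to the block-level reducer would mean turning the modulo test in reduce_blocks into a mask test, roughly like this (a sketch I haven't benchmarked, relying on 2 * offset always being a power of two in that loop):

// oso % (2 * offset) == 0 is equivalent to (oso & (2 * offset - 1)) == 0
if ((oso & (2 * offset - 1)) == 0 && oso + offset < 3906) {
        palluu[oso] = palluu[oso] + palluu[oso + offset];
}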

For reductions in general there are several approaches to choose from. You can read more about them here: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
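As one example from those slides, the per-block part of a reduction is often done in shared memory with sequential addressing, which avoids the modulo and keeps full warps active for longer. The kernel below is only a sketch for comparison (the name and the assumption of 256 threads per block are mine), not the code used in this post:

// Sketch: shared-memory reduction with sequential addressing, launched with
// 256 threads per block; out[blockIdx.x] receives the sum of the block's slice of in[].
__global__ void reduce_shared(const unsigned long long *in, unsigned long long *out, int n){
        __shared__ unsigned long long sdata[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = (i < n) ? in[i] : 0;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
                if (tid < s) { sdata[tid] += sdata[tid + s]; }
                __syncthreads();
        }
        if (tid == 0) { out[blockIdx.x] = sdata[0]; }
}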

import random

sis = 0             # points that land inside the quarter circle
lukum = 40000000    # number of samples

for i in range(lukum):
        a = random.random()
        b = random.random()
        if (a**2 + b**2) <= 1:
                sis += 1

print("4*sis/lukum:", 4*sis/lukum)

Accuracy?

My estimate is 3.141592344990079. Let's ask Google:

pi - 3.141592344990079

The difference is about 3.09e-07, so the estimate agrees with pi to roughly six decimal places.

What next with CUDA? I would like to try some financial Monte Carlo simulations, or perhaps deploying graph algorithms on the GPU.

Published in Artificial Intelligence, GPU computing