First snow covering the magnolia in our garden on January 6th 2021.

This is one of my very first shots taken with a 4×5 large format camera – a 1941 Graflex Speed Graphic with a 1943 Kodak Aero-Ektar 178mm F/2.5 aerial lens. Both the camera and the lens are American and date from the Second World War era.

I shot the magnolia tree in our backyard both wide open and slightly stopped down to F/5.6, and chose the stopped-down shot (above) for its larger depth of field.

Since I don’t have a proper film scanner for large format, this was “scanned” with a Nikon Z6 and a 55mm Micro-Nikkor lens, using a Philips bright light for illumination.

The shot was taken on Kodak TMAX-100 and developed at home with Adox Rodinal + Adofix, using an SP-445 developing tank.

This was not supposed to be the “final” scan from the negative – please forgive the white blobs from the film holder (aka my fingers) in the top left corner.

This is a small post about another new (for me) lens in the camera bag – the East German Exakta-mount (exa) Carl Zeiss Jena Biometar 80mm F/2.8. I regard the Biometar as a sibling of the legendary Biotar 75mm F/1.5.

My copy of the lens is the latest (third) version of the Biometar, with six aperture blades, marked Bm 80mm 1:2.8 “aus Jena” on the front ring.

The lens feels quite light, weighing only 295 g (compared to, for example, the Zeiss Planar 85mm F/1.4 ZF.2 at 570 g).

Two quick snaps of the local model, lens wide open at F/2.8.

Cloud-free mornings in Denmark in late November can be scarce, but every now and then one comes along.

This time the morning of November 23rd, 2020, was almost free of clouds. I packed the Nikon Z6, the vintage Canon FD 500mm F/4.5L, my old and heavy Manfrotto tripod and, last but not least, the Canon FD 2x-A teleconverter.

Here is one of the many shots taken under challenging contrast conditions and some slightly distracting clouds. The shot is taken over Øresund from Taarbæk, some 31 km from the Turning Torso, the tallest building in the Nordics, which stands on the Swedish side of the strait.

Canon FD 500mm F/4.5L with the 2x-A teleconverter, wide open, ISO 400, exposure time 1/1250 s.

These were taken on 22.7.2020, right after midnight, when it was dark enough to see the comet even with the naked eye, as well as through binoculars. I used the Nikon Z6.

First, a wide-field photo taken with the Nikkor 50mm F/1.8G: ISO 2000, F/2.2, 2-second exposure.

The second shot is with a vintage 1980s Canon FD 500mm F/4.5L manual focus lens. For this shot I used ISO 8000, a 3-second exposure and the aperture wide open (F/4.5). Two tripods were used: one to support the lens and a second to support the camera.

The next night I got a new chance, even though it seemed to me that the comet wasn’t as bright any more. Here is the picture with the same 500mm tele lens, taken on 23.7. at 00:37.

What does it feel like to command 1280 parallel cores? Having earlier used the GPU for machine learning with Keras, I wanted to have a look at programming the GPU directly in C/C++. This is thus my first trial of GPU programming with the Nvidia CUDA framework. I’m using an Nvidia 1060 GPU, but I guess this would work on other cards as well.

Here is my code:

#include <stdio.h>
#include <cuda.h>
#include <curand_kernel.h>

// Kernel 1: each thread runs 'iter' Monte Carlo trials and counts the hits
// inside the unit quarter circle; the block then reduces its 256 per-thread
// counts into a single per-block sum stored in palluu[blockIdx.x].
__global__ void mc_pi(unsigned long long *countti, long iter, unsigned long long *palluu) {
    int ii;
    curandState_t rng;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // global thread id
    countti[tid] = 0;
    curand_init(clock64(), tid, 0, &rng);              // per-thread RNG state

    for (ii = 0; ii < iter; ii++) {
        float x = curand_uniform(&rng);
        float y = curand_uniform(&rng);
        if ((x * x + y * y) <= 1) {
            countti[tid] += 1;
        }
    }
    __syncthreads();

    // In-block tree reduction of the 256 per-thread counts,
    // using bitwise masks instead of modulo tests.
    if (!(tid & 1))   { countti[tid] = countti[tid] + countti[tid + 1]; }
    __syncthreads();
    if (!(tid & 3))   { countti[tid] = countti[tid] + countti[tid + 2]; }
    __syncthreads();
    if (!(tid & 7))   { countti[tid] = countti[tid] + countti[tid + 4]; }
    __syncthreads();
    if (!(tid & 15))  { countti[tid] = countti[tid] + countti[tid + 8]; }
    __syncthreads();
    if (!(tid & 31))  { countti[tid] = countti[tid] + countti[tid + 16]; }
    __syncthreads();
    if (!(tid & 63))  { countti[tid] = countti[tid] + countti[tid + 32]; }
    __syncthreads();
    if (!(tid & 127)) { countti[tid] = countti[tid] + countti[tid + 64]; }
    __syncthreads();
    if (threadIdx.x == 0) {
        countti[tid] = countti[tid] + countti[tid + 128];
        palluu[blockIdx.x] = countti[tid];             // per-block total
    }
    __syncthreads();
}
// Kernel 2: reduce the 3906 per-block sums from mc_pi into palluu[0].
// Launched as a single block of 651 threads; each thread handles 6 elements
// per stage (651 * 6 = 3906), with strides 1, 2, 4, ..., 2048.
__global__ void reduce_blocks(unsigned long long *palluu) {
    int tid = threadIdx.x;
    int i = 0;
    int oso = 0;
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if ((oso & 1) == 0 && oso < 3906) palluu[oso] += palluu[oso + 1];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 4 == 0 && oso < 3906) palluu[oso] += palluu[oso + 2];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 8 == 0 && oso < 3906) palluu[oso] += palluu[oso + 4];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 16 == 0 && oso < 3906) palluu[oso] += palluu[oso + 8];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 32 == 0 && oso < 3906) palluu[oso] += palluu[oso + 16];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 64 == 0 && oso < 3906) palluu[oso] += palluu[oso + 32];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 128 == 0 && oso < 3906) palluu[oso] += palluu[oso + 64];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 256 == 0 && oso < 3906) palluu[oso] += palluu[oso + 128];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 512 == 0 && oso < 3906) palluu[oso] += palluu[oso + 256];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 1024 == 0 && oso < 3906) palluu[oso] += palluu[oso + 512];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 2048 == 0 && oso < 3906) palluu[oso] += palluu[oso + 1024];
    }
    __syncthreads();
    for (i = 0; i < 6; i++) {
        oso = tid + (i * blockDim.x);
        if (oso % 4096 == 0 && oso < 3906) palluu[oso] += palluu[oso + 2048];
    }
    __syncthreads();
}
int main() {
    int N = 999936;                       // total number of threads
    int blocks = (N + 255) / 256;         // 3906 blocks
    int threadsperblock = 256;
    int iterations = 1000000;             // Monte Carlo trials per thread
    unsigned long long *countti;          // per-thread hit counts (device)
    unsigned long long *resu;             // per-block sums (device)
    unsigned long long hostresu2;         // final hit count (host)

    printf("blocks:%d, threadsperblock:%d\n", blocks, threadsperblock);
    cudaMalloc(&countti, N * sizeof(unsigned long long));
    cudaMalloc(&resu, blocks * sizeof(unsigned long long));

    mc_pi<<<blocks, threadsperblock>>>(countti, iterations, resu);
    reduce_blocks<<<1, 651>>>(resu);

    // Copy the total hit count (resu[0]) back and estimate pi = 4 * hits / trials.
    cudaMemcpy(&hostresu2, resu, sizeof(unsigned long long), cudaMemcpyDeviceToHost);
    printf("Estimating Pi using the Monte Carlo Method in Nvidia 1060 parallel computation\n: %.15f, based on %llu iterations\n",
           (double)hostresu2 / threadsperblock / blocks / iterations * 4,
           threadsperblock * blocks * iterations);
    return 0;
}

I’m using two kernel functions: mc_pi does the actual Pi Monte Carlo computation and also reduces the results at the CUDA block level. The second kernel, reduce_blocks, handles the reduction over the 3906 per-block sums. The block size and number of blocks were quickly picked to get close to one million threads in total.
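Spelled out, the launch arithmetic in main works out like this:

blocks           = (999936 + 255) / 256 = 3906
threads in total = 3906 * 256           = 999936   (just under one million)
reduce_blocks    = 1 block of 651 threads, each folding 6 of the 3906 per-block sums (651 * 6 = 3906)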

Compilation, when you have the Nvidia CUDA framework installed:

nvcc -o pi4cu pi4.cu

Performance:

time ./pi4cu
blocks:3906, threadsperblock:256
Estimating Pi using the Monte Carlo Method in Nvidia 1060 parallel computation
: 3.141592344990079, based on 3503587328 iterations
real 0m27.299s
user 0m18.580s
sys 0m8.704s

So it takes 27 seconds for 3.5 billion iterations while maxing the GPU at 100%. Compared against my single-threaded Python version of the Pi Monte Carlo estimate below, the GPU version is about 87x faster. I’m almost sure this is still far from optimal. I read, for example, that the GPU isn’t fast at the modulo operations used in the reducers, so in mc_pi I already changed the modulos to bitwise operators, but I haven’t yet done the same for the block-level reducer (a sketch of that change follows the Python listing below).

import random
import math

sis = 0
lukum = 40000000
for i in range(lukum):
    a = random.random()
    b = random.random()
    if (a**2 + b**2) <= 1:
        sis += 1
print("4*sis/lukum:", 4*sis/lukum)

Accuracy?

My estimate is 3.141592344990079. Let's ask Google:
pi-3.141592344990079
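For reference, π ≈ 3.141592653589793, so the difference above comes out to roughly 3.1 × 10⁻⁷ – the estimate agrees with π to six decimal places.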

What next with CUDA – I would like to try some financial Monte Carlo simulations or perhaps deploying graph algorithms on the GPU.