OpenCL :: Exploring the first dimension (part 3 : correct benchmark) – 3 May 2023


Read Part 1

Read Part 2

Running the test a couple of times, we get the same execution time across many iterations, which means either something is wrong or we're getting cached responses.

And that's the timer function. Let's check if it works; it does:

2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(1000)
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1)    Milliseconds (15)
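(The timing harness itself isn't reprinted here; a minimal sketch of such a timer in MQL5, where ker, offsets, works, result_buffer and group_id[] are assumed names for handles and arrays set up earlier in the script, could look like this:)

    // minimal timing sketch; ker, offsets, works, result_buffer, group_id[] are assumed names
    uint msStarted=GetTickCount();          // tick count (ms) before enqueueing the kernel
    CLExecute(ker,1,offsets,works);         // let OpenCL pick the work group split
    CLBufferRead(result_buffer,group_id);   // reading the buffer back forces completion
    uint msEnded=GetTickCount();            // tick count (ms) after the results are back
    Print("Milliseconds ("+IntegerToString(msEnded-msStarted)+")");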

Okay, now what are we looking for?

Whatever is done in parallel will be close in time, right? So we're looking for that extra delay that is not tiny compared to the overall execution.

So let's start with 2048 items and we'll increase iterations so that we can notice the times; let's try 1 million again.

Actually, let's send 1 item and 1 million iterations.

2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(1000000)
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1)    Milliseconds (16)

Awesome. 1 item, 100 million iterations:

2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(100000000)
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1)    Milliseconds (16)

Okay, changed the kernel again to this:

    string kernel="__kernel void memtests(__global int* group_id,"
                                         "int iterations){"
                                         "double sum=(double)0.0;"
                                         "double inc=(double)-2.6;"
                                         "double step=(double)0.01;"
                                         "double radi=(double)0.0;"
                                         "double rstep=(double)0.00001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum=((double)tanh(inc))*((double)sin(radi));"
                                         "inc+=step;"
                                         "radi+=rstep;"
                                         "if(inc>2.6&&step>0.0){step=(double)-0.01;}"
                                         "if(inc<-2.6&&step<0.0){step=(double)0.01;}"
                                         "if(radi>6.28&&rstep>0.0){step=(double)-0.00001;}"
                                         "if(radi<0.0&&rstep<0.0){step=(double)0.00001;}"                                         
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";

Dropped to 10000 iterations. Generally, don't overdo it, because you may be getting cached responses, and the moment you change the calculation, for instance, you'll hear your GPU complain.

So, 10000 iterations:

2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(10000)
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1)    Milliseconds (31)

Let's increase to 100K.

I'll get 31 ms again, I'm sure, but this time I'll test it after a restart too.

So, 1 item, 100K iterations, no restart of the PC:

2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(100000)
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1)    Milliseconds (47)

Okay, and 1 item, 100K iterations, with a restart of the PC:

2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(100000)
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Milliseconds (47)

Same time, yay. Okay, we're good with this kernel, FINALLY, and look who decided to show up:

2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Kernel private mem (40)

The private memory size stopped being zero for the first time. Okay, so that's 40 what? 40 bytes per item?

Let's measure the kernel: we count the non-globals, non-locals and non-arguments, so:

Yeah, 5 doubles; 5 times 8 is 40 bytes. Okay, so that works too. Awesome.
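For reference, these are the five private doubles declared inside the kernel above, and the arithmetic behind the 40:

    sum, inc, step, radi, rstep   ->  5 private doubles
    5 x 8 bytes per double        =  40 bytes of private memory per work item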

Private memory is super fast, you don't worry about it. How we measure its limit is a later question.

Alright, so can it take 1 million iterations now?

2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(1000000)
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1)    Milliseconds (203)

Okay, but is this number something that will allow us to spot the "fold"?

Let's see. Let's start increasing the items; I won't paste the logs, I'll just list them (a sketch of the measuring loop follows the table):

Items   Groups   ms
1       1        203
100     1        203
256     1        188
512     2        297
1024    4        578
2048    8        1125
4096    1        2235
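(A rough sketch of the loop behind these rows, assuming ctx, ker and iterations were created earlier; the names are illustrative, not the attached source:)

    // hedged sketch of the measuring loop; ctx, ker, iterations are assumed to exist
    int item_counts[]={1,100,256,512,1024,2048,4096};
    for(int c=0;c<ArraySize(item_counts);c++)
      {
       int items=item_counts[c];
       int group_id[];
       ArrayResize(group_id,items);
       int buf=CLBufferCreate(ctx,items*sizeof(int),CL_MEM_READ_WRITE);
       CLSetKernelArgMem(ker,0,buf);            // __global int* group_id
       CLSetKernelArg(ker,1,iterations);        // int iterations
       uint offsets[]={0};
       uint works[]={(uint)items};
       uint started=GetTickCount();
       CLExecute(ker,1,offsets,works);          // groups decided by OpenCL
       CLBufferRead(buf,group_id);              // wait for the results
       Print(items," items -> ",GetTickCount()-started," ms");
       CLBufferFree(buf);
      }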

Hmm, what happened there? We went from 8 groups to 1 group and the screen flashed.

So 1024 is the max items per compute unit then, and 256 is the max group size; does that indicate 4 x 256 = 1024 concurrent processes, i.e. 4 groups that can be working in the same memory? (for 1D, assuming we don't do any splitting ourselves)

We must remember that we're letting it decide on the splits in this benchmark, and it chose 1 group, or it didn't do anything in parallel here, and the file we output confirms this:

GROUP.ID[4095]=0

The last entry was 0. So it calculated nothing and probably glitched.

Sooo, looking at the figures above, I don't think any concurrency is there; if 4 groups executed in parallel, or 2 groups, or 8 groups, we would see roughly the same time as with 1 group, right? So what's running in parallel is maybe the 256 items.

Hmm, so now let's introduce another parameter into our test and call it benchmark 3: the local dimensions parameter in the execution call.

bool  CLExecute(
   int          kernel,                    // handle to the OpenCL kernel
   uint         work_dim,                  // dimension of the task space
   const uint&  global_work_offset[],      // initial offset in the task space
   const uint&  global_work_size[],        // total number of tasks
   const uint&  local_work_size[]          // number of tasks per local group
   );

What if I set this to 32? We will get 8 sub-groups.

These 8 sub-groups will execute at the same time, but if we deploy 10 sub-groups (i.e. 2 groups) we'll get a higher execution time?

I believe I don't need to test this, but let's verify.

It is my understanding that I have a device which can calculate 256 items at the same time, and this is why the integer CL_MAX_WORK_GROUP_SIZE returns this number too; the group size can be 256 because 256 items that run at the same time can share the local memory.

Could it be that simple?

Let's test how the groups split if I set the parameter local_work_size to 32 (matching the warp/wavefront) for various # of items.

    uint work_items_per_group[]={32};                     // local_work_size : 32 items per group
    long msStarted=GetTickCount();                        // tick count before the call
    CLExecute(ker,1,offsets,works,work_items_per_group);  // execute with an explicit local size

I don't think I need to change anything else.

Let's see the number of groups it creates.
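(As an aside, a simple hedged way to count the groups from the readback, assuming the buffer was already copied into group_id[] as above, is just to look for the largest group id:)

    // hedged sketch: count the groups from the group_id[] readback (ids start at 0)
    int max_group=0;
    for(int i=0;i<ArraySize(group_id);i++)
       if(group_id[i]>max_group) max_group=group_id[i];
    Print("Groups ("+IntegerToString(max_group+1)+")");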

(Side note: the 4096 run did indeed crash, as I got a context error on the next run, requiring a restart.)

Items   Groups
1       none (error, okay of course)
32      1   (203ms)
64      2   (203ms)
128     4   (203ms)
256     8   (203ms)
512     16  (297ms)
1024    32  (578ms)
2048    64  (1125ms)
4096    1   (2234ms)

Okay, so clearly the

long  CLGetInfoInteger(ctx,CL_DEVICE_MAX_WORK_GROUP_SIZE);

which returns 1024 on my device, refers to the maximum parallel items that can run at the same time.

That means if you set the local size to 1024, it won't run 1024 items in parallel; it'll split the work into packs of 1024 and will likely fail and slow down. But if you set it to 1, it'll treat every work item as its own group, right?

So for this test, which doesn't need groups, if I set the local size to 1, do I get the fastest execution if the items are dead on 1024?

Nope, I get a glitch again. What if the total number of items is 256?

Nope, glitch... ermm, multiples of 2 perhaps?

Nnnope, glitch... so what on earth is going on again?

What if I have 8 items and 1 local, so 1 item per group, 8 groups essentially?

No glitch, so there is a limit on the number of groups, but it has nothing to do with this:

CL_DEVICE_MAX_WORK_GROUP_SIZE

And the way to find the group limit is by looking for the drop-off from the incrementing # of groups to 1, followed by the screen glitching???

But it also tolerated 2048 work items, so I think we're seeing the non-communication between OpenCL and the hardware.

Is it 8 per dimension though, so 8^3 = 512? Or the cubic root of 1024 (the max items)? Hmmm,

that's ~10, so 10 groups max? (per dimension, I assume)

How would we know? 320 items with 32 local would probably work.

It did, okay, but 64 groups worked too (2048 items with 32 local)... Anyway, I'm demolished; I'll pick it up another time.

I'm attaching the benchmarks…

Also, there is something to take away from all these tests, and it's that the value

CLGetInfoInteger(ker,CL_KERNEL_WORK_GROUP_SIZE);

is telling you how many kernel instances can run in parallel on this device (or per compute unit, but that can be tested with the CPU).
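(For completeness, a small hedged sketch of querying the limits discussed above in MQL5, where ctx and ker are the usual context and kernel handles:)

    // hedged sketch: query the limits discussed above (ctx, ker are assumed handles)
    long device_max=CLGetInfoInteger(ctx,CL_DEVICE_MAX_WORK_GROUP_SIZE); // 1024 on this device
    long kernel_max=CLGetInfoInteger(ker,CL_KERNEL_WORK_GROUP_SIZE);     // kernel instances in parallel
    long units     =CLGetInfoInteger(ctx,CL_DEVICE_MAX_COMPUTE_UNITS);   // compute units on the device
    Print("Device max work group size : ",device_max);
    Print("Kernel work group size     : ",kernel_max);
    Print("Compute units              : ",units);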

 
