OpenCL :: Exploring the first Dimension (part2 the benchmark) – Other – 2 May 2023

bizadmin

May 2, 2023

Read part 1

Okay, so what we’ll do here is the same as before, but with a double type this time.

We change our kernel code to this:

    string kernel="__kernel void bench(__global double* _tangent,"
                                      "int iterations){"
                                      "double sum=(double)0.0;"
                                      "double of=(double)_tangent[get_global_id(0)];"
                                      "for(int i=0;i<iterations;i++){"
                                      "sum+=((double)tanh(of-sum))/((double)iterations);"
                                      "}"
                                      "_tangent[get_global_id(0)]=sum;}";

We create a double array with its size matching the number of our kernels, and we’re ever vigilant to instantiate it carefully, so we drop back down to 5 kernels and deploy this:

      
        double tangents[];
        ArrayResize(tangents,kernels_to_deploy,0);
        double range=5.2;
        //random start values in [-2.6,+2.6]
        for(int i=0;i<ArraySize(tangents);i++){
           double r=(((double)MathRand())/((double)32767.0)*range)-2.6;
           tangents[i]=r;
           }         
        //one double (8 bytes) per kernel
        int tangents_id=CLBufferCreate(ctx,kernels_to_deploy*8,CL_MEM_READ_WRITE);
        
          bool args_set=true;
          for(int i=0;i<kernels_to_deploy;i++){
             ResetLastError();
             //argument 0 : the shared buffer , argument 1 : the iteration count
             if(!CLSetKernelArgMem(KERNELS[i].handle,0,tangents_id)){
               Print("Can not assign buffer to kernel("+IntegerToString(i)+") error #"+IntegerToString(GetLastError()));
               args_set=false;
               }else{
               CLSetKernelArg(KERNELS[i].handle,1,iterations);
               }
             }
         if(args_set){
         Print("All arguments for all kernels set!");
         }else{
         Print("Can not setup kernel args!");
         }

We create the buffer and then we must attach it to all kernels, right? Let’s see if we can do that!
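One step the listing above does not show is copying the host array into the device buffer; as far as I can tell, a buffer created this way has undefined contents until something is written to it. A minimal sketch of that upload, assuming the same `tangents` array and `tangents_id` handle as above (the helper name `UploadTangents` is hypothetical and not from the original source):

```mql5
//hypothetical helper, not part of the original source
//copies the host array into the OpenCL buffer before any kernel runs
bool UploadTangents(int buffer_handle,double &values[]){
   ResetLastError();
   //CLBufferWrite returns the number of elements written , 0 on error
   uint written=CLBufferWrite(buffer_handle,values);
   if(written!=(uint)ArraySize(values)){
     Print("Buffer write failed , wrote "+IntegerToString(written)+" of "+IntegerToString(ArraySize(values))+" error #"+IntegerToString(GetLastError()));
     return(false);
     }
   return(true);
   }
```

It could be called right after CLBufferCreate, for example `if(!UploadTangents(tangents_id,tangents)){args_set=false;}`.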

But don’t forget we must unload the buffer too, so add this after the unload loop

(we move the tangents id out of the loop; in a normal distribution this would be wrapped and managed inside a structure, but we’re testing, so there is no need for this to be able to land the lunar mission!)

    CLBufferFree(tangents_id);
    CLProgramFree(prg);

We add it there, run it, and with one run we find out 2 things!

2023.05.02 20:49:49.762 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:49:49.776 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 5 kernels = 94ms

Good, now let’s do 50 kernels; we want to gauge any deployment delays

2023.05.02 20:50:51.875 blog_kernel_times_benchmark (USDJPY,H1) Deployed all kernels!
2023.05.02 20:50:51.875 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:50:51.891 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 50 kernels = 93ms

Lovely, and 5000 kernels?

Okay, a slight delay there, but it seems okay

2023.05.02 20:52:03.356 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:52:03.373 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 5000 kernels = 110ms

Let’s get to the point now, finally!

We must quickly light up the kernels, set their offsets and work loads, and set up a timer interval of 1ms, jesus, and we must also not enter this portion of the timer function again, otherwise we’ll run into trouble. So: a bool indication, kernelsRunning=false; 😊

If that is true then we move into a timer loop where we are just collecting completion notices and storing them.

Let’s also add a completed indication in our kernel_info object, and I’ll remove the handle from the setup since I didn’t use it. That would be a nice function to have, mql5: CLExecuteKernelList, like the CommandQueue in the original OpenCL api.

When all is complete we will tally the times, but we’ll deal with that later, so first we drop down to 5 kernels again.

This is what our class looks like now:

class kernel_info{
      public:
 bool completed;
  int offset;
  int handle;
ulong start_microSeconds;
ulong end_microSeconds;
      kernel_info(void){reset();}
     ~kernel_info(void){reset();}
 void reset(){
      completed=false;
      offset=-1;
      handle=INVALID_HANDLE;
      start_microSeconds=0;
      end_microSeconds=0;
      }
 void setup(ulong _start,int _offset){
      start_microSeconds=_start;
      offset=_offset;
      }
 void stop(ulong _end){
      end_microSeconds=_end;
      }
};

We gate the existing timer commands with

if(!kernelsRunning)
{
}

And we’re very careful here: we must anticipate unloading the kernels when the test ends or when the test does not start at all, so:

We add an exitNow variable at the top, set to true if the test fails or the test ends.

The context and the other handles become variables of the global scope …

We remove some stuff from the old section (don’t worry, I kept it as it was in the source file), so our timer looks like this now:

  bool exitNow=false;
  if(!kernelsRunning)
  {
  EventKillTimer();
  ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
  
  int kernels_to_deploy=5;  
  tangents_id=INVALID_HANDLE;
  if(ctx!=INVALID_HANDLE){
    string kernel="__kernel void bench(__global double* _tangent,"
                                      "int iterations){"
                                      "double sum=(double)0.0;"
                                      "double of=(double)_tangent[get_global_id(0)];"
                                      "for(int i=0;i<iterations;i++){"
                                      "sum+=((double)tanh(of-sum))/((double)iterations);"
                                      "}"
                                      "_tangent[get_global_id(0)]=sum;}";
    string errors="";
    prg=CLProgramCreate(ctx,kernel,errors);
    if(prg!=INVALID_HANDLE){
    ResetLastError();
    
      int iterations=1000;
      ArrayResize(KERNELS,kernels_to_deploy,0);
      bool deployed=true;
      for(int i=0;i<kernels_to_deploy;i++){
         KERNELS[i].handle=CLKernelCreate(prg,"bench");
         if(KERNELS[i].handle==INVALID_HANDLE){deployed=false;}
         }
      
      if(deployed){
      Print("Deployed all kernels!");    
      
        double tangents[];
        ArrayResize(tangents,kernels_to_deploy,0);
        double range=5.2;
        for(int i=0;i<ArraySize(tangents);i++){
           double r=(((double)MathRand())/((double)32767.0)*range)-2.6;
           tangents[i]=r;
           }         
        tangents_id=CLBufferCreate(ctx,kernels_to_deploy*8,CL_MEM_READ_WRITE);
        
          bool args_set=true;
          for(int i=0;i<kernels_to_deploy;i++){
             ResetLastError();
             if(!CLSetKernelArgMem(KERNELS[i].handle,0,tangents_id)){
               Print("Can not assign buffer to kernel("+IntegerToString(i)+") error #"+IntegerToString(GetLastError()));
               args_set=false;
               }else{
               CLSetKernelArg(KERNELS[i].handle,1,iterations);
               }
             }
         if(args_set){
         Print("All arguments for all kernels set!");
         
         
         
         }else{
         Print("Can not setup kernel args!");
         exitNow=true;
         }
      }else{
      Print("Can not deploy all kernels!");
      exitNow=true;
      }
    
    }else{Alert(errors);exitNow=true;}
    }
  else{
    Print("Can not create ctx");
    exitNow=true;
    }
  }
  
  
  
  
    if(exitNow){
    if(tangents_id!=INVALID_HANDLE){CLBufferFree(tangents_id);}
    for(int i=0;i<ArraySize(KERNELS);i++){
       if(KERNELS[i].handle!=INVALID_HANDLE){CLKernelFree(KERNELS[i].handle);}
       }
    if(prg!=INVALID_HANDLE){CLProgramFree(prg);}
    //free the context last
    if(ctx!=INVALID_HANDLE){CLContextFree(ctx);}
    Print("DONE");
    ExpertRemove();
    }

Okay, now … let’s think …

Let’s handle the completion first, that is the easy part

Upon all completing, for now, we exit and stop the timer.

Note we have not “launched” anything yet; we might get a myriad of errors when we do!

  else if(!Busy&&kernelsRunning){
  Busy=true;
  
  
    bool still_running=false;
    for(int i=0;i<ArraySize(KERNELS);i++){
    if(!KERNELS[i].completed){
      if(CLExecutionStatus(KERNELS[i].handle)==CL_COMPLETE){
      KERNELS[i].completed=true;
      }else{still_running=true;}
      }
    }
  
    if(!still_running){
      EventKillTimer();
      exitNow=true;
      }
  
  if(!exitNow){Busy=false;}
  }

It looks simple enough and correct, I think:

we go into the list of kernels
if something has finished we set it to completed
if not we light up the still running flag
exit the loop
if nothing is still running kill the timer
light up exit now
don’t shut off the busy indication

Oops, forgot to measure the end time! Add this under completed

KERNELS[i].stop(GetMicrosecondCount());

and this is the execution call:

         
           uint offsets[]={0};
           uint works[]={1};
           for(int i=0;i<ArraySize(KERNELS);i++){
              offsets[0]=i;
              CLExecute(KERNELS[i].handle,1,offsets,works);
              KERNELS[i].setup(GetMicrosecondCount(),i);
              }
           kernelsRunning=true;
           EventSetMillisecondTimer(1);

Let’s see what happens. I can’t see anything now; it exited though. So .. that was with 1000 iterations on 5 kernels

Now the task is to grow the execution time of each kernel above the timer interval … pffft ..

To do that we need to output our findings to a file!

      int f=FileOpen("OCLkernel_bench.txt",FILE_WRITE|FILE_TXT);
      if(f!=INVALID_HANDLE){
        for(int i=0;i<ArraySize(KERNELS);i++){
           ulong micros=KERNELS[i].end_microSeconds-KERNELS[i].start_microSeconds;
           if(KERNELS[i].completed){
             FileWriteString(f,"Ok["+IntegerToString(i)+"] completed in ("+IntegerToString(micros)+")microSeconds\n");
             }
           else
             {
             FileWriteString(f,"Ok["+IntegerToString(i)+"] not completed\n");
             }
           }
        FileClose(f);
        }

We add this to the exit block and we wait and see.

and voilà

Ok[0] completed in (87334)microSeconds
Ok[1] completed in (87320)microSeconds
Ok[2] completed in (87300)microSeconds
Ok[3] completed in (87279)microSeconds
Ok[4] completed in (87261)microSeconds

Now, what do these mean? Nothing; they must be below our execution threshold, I think. Let’s see

One microsecond is … a 1000000th of a second, or, one second is 1000000 microseconds, so what we see here is 87 milliseconds, and we’re accessing the interval at 1 ms, okay. I don’t trust it because there may be a delay for the loop too.
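The unit conversion is easy to get wrong by a factor of a thousand, so here is a small sketch of it as a standalone script. The helper name `MicrosToMillis` is hypothetical; the sample value is the first result from the file output above.

```mql5
//hypothetical helper: converts a microsecond interval to milliseconds
double MicrosToMillis(ulong micros){
   return(((double)micros)/1000.0);
   }

void OnStart(){
   ulong sample=87334; //first kernel from the results above
   //1 ms = 1000 microseconds , so this prints 87.334 ms
   Print(DoubleToString(MicrosToMillis(sample),3)+" ms");
   }
```

So each reading is roughly 87 timer intervals long, which is why the measured value may say more about the setup and polling loop than about the kernel itself.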

So … let’s make the calcs heavier (more iterations); I’m sending one million iterations. Now, these will end at the same time, roughly

I’m also shutting mt5 down and restarting it for each run; I don’t know if there’s any caching going on but I want to avoid it.

-I guess I need to keep the test going for as long as the status of the kernels is running or in line to be executed or smth-

It appears to be stuck or smth … I expected an 80 second run, it’s been 5 minutes now … 15 minutes, okay, something broke. Letsssss add some conditions there … @#%!#!%$@^

  
    bool still_running=false;
    int running_total=0;
    int completed_total=0;
    int queued_total=0;
    int submitted_total=0;
    int unknown_total=0;
    for(int i=0;i<ArraySize(KERNELS);i++){
    if(!KERNELS[i].completed){
      ENUM_OPENCL_EXECUTION_STATUS status=(ENUM_OPENCL_EXECUTION_STATUS)CLExecutionStatus(KERNELS[i].handle);
      if(status==CL_COMPLETE){
      completed_total++;
      KERNELS[i].completed=true;
      KERNELS[i].stop(GetMicrosecondCount());
      }else if(status==CL_RUNNING){running_total++;still_running=true;}
      else if(status==CL_QUEUED){queued_total++;}
      else if(status==CL_SUBMITTED){submitted_total++;}
      else if(status==CL_UNKNOWN){unknown_total++;}
      }else{
      completed_total++;
      }
    }
    //show the tallies on the chart
    string message="Running("+IntegerToString(running_total)+")\n";
           message+="Completed("+IntegerToString(completed_total)+")\n";
           message+="Queued("+IntegerToString(queued_total)+")\n";
           message+="Submitted("+IntegerToString(submitted_total)+")\n";
           message+="Unknown("+IntegerToString(unknown_total)+")\n";
    Comment(message);

changing the async waiting loop to this … let’s see why the f*** it fails…

Okay, I was a bit naive earlier; I assumed that it would complete anyway, so let’s not let it exit if it’s queued or submitted or unknown, and let’s drop to 1000 iterations again.

The loop now changes to this:

  
    bool still_running=false;
    int running_total=0;
    int completed_total=0;
    int queued_total=0;
    int submitted_total=0;
    int unknown_total=0;
    for(int i=0;i<ArraySize(KERNELS);i++){
    if(!KERNELS[i].completed){
      ENUM_OPENCL_EXECUTION_STATUS status=(ENUM_OPENCL_EXECUTION_STATUS)CLExecutionStatus(KERNELS[i].handle);
      if(status==CL_COMPLETE){
      completed_total++;
      KERNELS[i].completed=true;
      KERNELS[i].stop(GetMicrosecondCount());
      }else if(status==CL_RUNNING){running_total++;still_running=true;}
      else if(status==CL_QUEUED){queued_total++;still_running=true;}
      else if(status==CL_SUBMITTED){submitted_total++;still_running=true;}
      else if(status==CL_UNKNOWN){unknown_total++;still_running=true;}
      }else{
      completed_total++;
      }
    }
    //show the tallies on the chart
    string message="Running("+IntegerToString(running_total)+")\n";
           message+="Completed("+IntegerToString(completed_total)+")\n";
           message+="Queued("+IntegerToString(queued_total)+")\n";
           message+="Submitted("+IntegerToString(submitted_total)+")\n";
           message+="Unknown("+IntegerToString(unknown_total)+")\n";
    Comment(message);
  
    if(!still_running){
      EventKillTimer();
      exitNow=true;
      }

Dropped to 1000 iterations; I think I saw it go through the kernels one by one. Let’s add x10 iterations and see.

Same. 100k iterations … same. Okay, is there a problem with the decimal precision or something, and we can’t hit 1 million?

There we go, yes, there is one unknown left and 4 completed kernels with one million iterations, but why?

It gets stuck there, but fortunately it doesn’t seem to cause any issues on the device!

But why does it hang there? The times for 1000, 10000 and 100000 were almost instant though, so let’s make a little change: if we’re not hitting the time above the interval we need, let’s not compound the operations in the kernel, just calculate a ton of s*** and then pass it.

I don’t think it matters (if it does and you know it, let me know)

So we remove the += but now the problem is it will be serving the tanh value from its cache …. so …. let’s turn this into an addition … 😛

"sum=tanh(of)+((double)iterations)/((double)100.0);"

this is the calc now; let’s run again for 1 million iterations … aaand yes, there was a precision issue or something.
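For reference, the whole kernel with that line swapped in would presumably read as follows. This is a sketch that assumes only the line inside the loop changed and that everything else stays as in the earlier listing.

```mql5
//assumed full kernel after the change : only the line inside the loop differs
string kernel="__kernel void bench(__global double* _tangent,"
                                  "int iterations){"
                                  "double sum=(double)0.0;"
                                  "double of=(double)_tangent[get_global_id(0)];"
                                  "for(int i=0;i<iterations;i++){"
                                  "sum=tanh(of)+((double)iterations)/((double)100.0);"
                                  "}"
                                  "_tangent[get_global_id(0)]=sum;}";
```

Note that sum no longer feeds back into the next pass, so every pass of the loop computes the same value.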

Now, let’s examine the times.

Ok[0] completed in (370644)microSeconds
Ok[1] completed in (479982)microSeconds
Ok[2] completed in (604963)microSeconds
Ok[3] completed in (729959)microSeconds
Ok[4] completed in (839271)microSeconds

There’s definitely a queue action going on here. Let’s increase the kernels to 50.

Stuck again. 19 completed, 31 unknown. Okay, we must see what the error is.

Added an error check on execution, no bueno. The issue is not there, so it must be coming up on unknown, if we assume it goes from submitted->queued->running->unknown or completed.
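The error check on execution is not shown in the post. A minimal sketch of what it might look like, assuming the same one-work-item launch as in the execution call earlier (`LaunchChecked` is a hypothetical name, not from the original source):

```mql5
//hypothetical helper: launches one work item at `offset` and reports failure
bool LaunchChecked(int kernel_handle,uint offset){
   uint offsets[1];
   uint works[1]={1};
   offsets[0]=offset;
   ResetLastError();
   //CLExecute returns false when the launch itself is rejected
   if(!CLExecute(kernel_handle,1,offsets,works)){
     Print("Can not execute kernel at offset "+IntegerToString(offset)+" error #"+IntegerToString(GetLastError()));
     return(false);
     }
   return(true);
   }
```

In the launch loop this would replace the bare CLExecute call, for example `if(!LaunchChecked(KERNELS[i].handle,i)){exitNow=true;}`.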

Let’s see what the docs say about it:

Air, interesting … 😊 okay. Let’s error gate the status too.
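Error gating the status could look something like the sketch below. The helper name `CheckedStatus` is hypothetical, and the message format simply mirrors the first of the log lines that follow; the second log line is not reproduced here because the post does not show where it comes from.

```mql5
//hypothetical helper: reads the status of one kernel and logs the
//terminal error code when the status comes back as unknown
int CheckedStatus(int kernel_handle,int index){
   ResetLastError();
   int status=CLExecutionStatus(kernel_handle);
   if(status==CL_UNKNOWN){
     Print("Unknown status for kernel("+IntegerToString(index)+") #"+IntegerToString(GetLastError()));
     }
   return(status);
   }
```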

2023.05.02 22:14:21.458 blog_kernel_times_benchmark (USDJPY,H1) Unknown status for kernel(44) #5101
2023.05.02 22:14:21.458 blog_kernel_times_benchmark (USDJPY,H1) ----: unknown OpenCL error 65536

unknown error … kay, #5101 is …. internal error, okay, very enlightening …

This is telling us this won’t work, clearly, so we won’t be able to benchmark this way, and in a way the “api” is right, as it’s asking us why on earth we are trying to do what it’s supposed to do itself.

I’ll take a break here but I’ll publish the blogs; them being incomplete (for now) may give someone an idea or two.

I’m also attaching the 2 sources I used so far.

—- okay —–

I’m doing this wrong. I think the time test can take place with one kernel and multiple items. I overdid it a bit there, so

What adjustments :
