OpenCL :: Exploring the first Dimension (part 2, the benchmark) – 2 May 2023

Read part 1

Okay, so what we'll do here is the same as before, but using a double type this time.

We change our kernel code to this:

    string kernel="__kernel void bench(__global double* _tangent,"
                                      "int iterations){"
                                      "double sum=(double)0.0;"
                                      "double of=(double)_tangent[get_global_id(0)];"
                                      "for(int i=0;i<iterations;i++){"
                                      "sum+=((double)tanh(of-sum))/((double)iterations);"
                                      "}"
                                      "_tangent[get_global_id(0)]=sum;}";

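To make it clear what each work item computes, here is a CPU-side sketch of the same recurrence in plain C++ (not MQL5 or OpenCL). The name `bench_reference` is made up for illustration and is not part of the original sources.

```cpp
#include <cassert>
#include <cmath>

// CPU-side reference of what one work item of the "bench" kernel computes.
// bench_reference is a hypothetical name used only for this sketch.
double bench_reference(double of, int iterations) {
    assert(iterations > 0);  // the kernel divides by iterations
    double sum = 0.0;
    for (int i = 0; i < iterations; i++) {
        // each step nudges sum towards `of` by tanh(of - sum) / iterations
        sum += std::tanh(of - sum) / static_cast<double>(iterations);
    }
    return sum;
}
```

Since every step adds less than 1/iterations in magnitude, the result always stays strictly between -1 and 1, and an input of 0 stays at 0. Each step depends on the previous `sum`, so the loop cannot be skipped or reordered, which is what makes it usable as a time filler.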
We create a double array with its size matching the number of our kernels, and since we are ever vigilant about instantiating it carefully, we drop back down to 5 kernels and deploy this:

      
        double tangents[];
        ArrayResize(tangents,kernels_to_deploy,0);
        double range=5.2;
        for(int i=0;i<ArraySize(tangents);i++){
           // random start value in [-2.6,2.6]
           double r=(((double)MathRand())/((double)32767.0)*range)-2.6;
           tangents[i]=r;
           }         
        // one double (8 bytes) per kernel; note that tangents[] is not copied into the buffer in this snippet
        int tangents_id=CLBufferCreate(ctx,kernels_to_deploy*8,CL_MEM_READ_WRITE);
        
          bool args_set=true;
          for(int i=0;i<kernels_to_deploy;i++){
             ResetLastError();
             if(!CLSetKernelArgMem(KERNELS[i].handle,0,tangents_id)){
               Print("Can not assign buffer to kernel("+IntegerToString(i)+") error #"+IntegerToString(GetLastError()));
               args_set=false;
               }else{
               CLSetKernelArg(KERNELS[i].handle,1,iterations);
               }
             }
         if(args_set){
         Print("All arguments for all kernels set!");
         }else{
         Print("Can not setup kernel args!");
         }

We create the buffer and then we must attach it to all kernels, right? Let's see if we can do that!

But don't forget we must unload the buffer too, so add this after the unload loop:

(We move the tangents id out of the loop. In a normal release this would be wrapped and managed inside a structure, but we're testing, so there is no need for this to be able to land the lunar mission!)

    CLBufferFree(tangents_id);
    CLProgramFree(prg);

We add it there, run it, and with one run we find out 2 things!

2023.05.02 20:49:49.762 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:49:49.776 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 5 kernels = 94ms

Good, now let's do 50 kernels; we want to gauge any deployment delays.

2023.05.02 20:50:51.875 blog_kernel_times_benchmark (USDJPY,H1) Deployed all kernels!
2023.05.02 20:50:51.875 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:50:51.891 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 50 kernels = 93ms

Lovely, and 5000 kernels?

Okay, slight delay there, but it seems okay.

2023.05.02 20:52:03.356 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:52:03.373 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 5000 kernels = 110ms

Let's get to the point now, finally!

We must now light up the kernels, set their offsets and work loads, and set up an interval of 1 ms (jesus). We also must not enter this portion of the timer function again, otherwise we'll run into trouble. So: a bool indication, kernelsRunning=false; 😊

If that is true, we move into a timer loop where we are just collecting completion notices and storing them.

Let's also add a completed indication in our kernel_info object, and I'll remove the handle from the setup since I didn't use it. That would be a nice function, MQL5: CLExecuteKernelList, like the CommandQueue in the original OpenCL API.

When all is complete we will tally the times, but we'll deal with that later, so first we drop down to 5 kernels again.

This is what our class looks like now:

class kernel_info{
      public:
 bool completed;
  int offset;
  int handle;
ulong start_microSeconds;
ulong end_microSeconds;
      kernel_info(void){reset();}
     ~kernel_info(void){reset();}
 void reset(){
      completed=false;
      offset=-1;
      handle=INVALID_HANDLE;
      start_microSeconds=0;
      end_microSeconds=0;
      }
 void setup(ulong _start,int _offset){
      start_microSeconds=_start;
      offset=_offset;
      }
 void stop(ulong _end){
      end_microSeconds=_end;
      }
};

We gate the existing timer commands with

if(!kernelsRunning)
{
}

And we're very careful here: we must plan for unloading the kernels when the test ends or when the test does not start at all, so:

We add an exitNow variable at the top, set to true if the test fails or the test ends.

All the contexts become variables of the global scope…

We remove some stuff from the old section (don't worry, I saved it as it was in the source file), so our timer looks like this now:

  bool exitNow=false;
  if(!kernelsRunning)
  {
  EventKillTimer();
  ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
  
  int kernels_to_deploy=5;  
  tangents_id=INVALID_HANDLE;
  if(ctx!=INVALID_HANDLE){
    string kernel="__kernel void bench(__global double* _tangent,"
                                      "int iterations){"
                                      "double sum=(double)0.0;"
                                      "double of=(double)_tangent[get_global_id(0)];"
                                      "for(int i=0;i<iterations;i++){"
                                      "sum+=((double)tanh(of-sum))/((double)iterations);"
                                      "}"
                                      "_tangent[get_global_id(0)]=sum;}";
    string errors="";
    prg=CLProgramCreate(ctx,kernel,errors);
    if(prg!=INVALID_HANDLE){
    ResetLastError();
    
      int iterations=1000;
      ArrayResize(KERNELS,kernels_to_deploy,0);
      bool deployed=true;
      for(int i=0;i<kernels_to_deploy;i++){
         KERNELS[i].handle=CLKernelCreate(prg,"bench");
         if(KERNELS[i].handle==INVALID_HANDLE){deployed=false;}
         }
      
      if(deployed){
      Print("Deployed all kernels!");    
      
        double tangents[];
        ArrayResize(tangents,kernels_to_deploy,0);
        double range=5.2;
        for(int i=0;i<ArraySize(tangents);i++){
           double r=(((double)MathRand())/((double)32767.0)*range)-2.6;
           tangents[i]=r;
           }         
        tangents_id=CLBufferCreate(ctx,kernels_to_deploy*8,CL_MEM_READ_WRITE);
        
          bool args_set=true;
          for(int i=0;i<kernels_to_deploy;i++){
             ResetLastError();
             if(!CLSetKernelArgMem(KERNELS[i].handle,0,tangents_id)){
               Print("Can not assign buffer to kernel("+IntegerToString(i)+") error #"+IntegerToString(GetLastError()));
               args_set=false;
               }else{
               CLSetKernelArg(KERNELS[i].handle,1,iterations);
               }
             }
         if(args_set){
         Print("All arguments for all kernels set!");
         
         
         
         }else{
         Print("Can not setup kernel args!");
         exitNow=true;
         }
      }else{
      Print("Can not deploy all kernels!");
      exitNow=true;
      }
    
    }else{Alert(errors);exitNow=true;}
    }
  else{
    Print("Can not create ctx");
    exitNow=true;
    }
  }
  
  
  
  
    if(exitNow){
    if(tangents_id!=INVALID_HANDLE){CLBufferFree(tangents_id);}
    for(int i=0;i<ArraySize(KERNELS);i++){
       if(KERNELS[i].handle!=INVALID_HANDLE){CLKernelFree(KERNELS[i].handle);}
       }
    if(prg!=INVALID_HANDLE){CLProgramFree(prg);}
    if(ctx!=INVALID_HANDLE){CLContextFree(ctx);}
    Print("DONE");
    ExpertRemove();
    }

Okay, now… let's think…

Let's handle the completion first; that's the easy part.

Upon all completing, for now, we exit and stop the timer.

Note we have not “launched” anything yet; we might get a myriad of errors when we do!

  else if(!Busy&&kernelsRunning){
  Busy=true;
  
  
    bool still_running=false;
    for(int i=0;i<ArraySize(KERNELS);i++){
    if(!KERNELS[i].completed){
      if(CLExecutionStatus(KERNELS[i].handle)==CL_COMPLETE){
      KERNELS[i].completed=true;
      }else{still_running=true;}
      }
    }
  
    if(!still_running){
      EventKillTimer();
      exitNow=true;
      }
  
  if(!exitNow){Busy=false;}
  }

It looks simple enough and correct, I think:

  • we go into the list of kernels
  • if something has finished we set it to completed
  • if not we light up the still running flag
  • exit the loop
  • if nothing is still running kill the timer
  • light up exit now
  • don't shut off the busy indication

Oops, forgot to measure the end time! Add this under completed:

KERNELS[i].stop(GetMicrosecondCount());

And this is the execution call:

         
           uint offsets[]={0};
           uint works[]={1};
           for(int i=0;i<ArraySize(KERNELS);i++){
              offsets[0]=i;
              CLExecute(KERNELS[i].handle,1,offsets,works);
              KERNELS[i].setup(GetMicrosecondCount(),i);
              }
           kernelsRunning=true;
           EventSetMillisecondTimer(1);

Let's see what happens. I cannot see anything now; it exited though. So… that was with 1000 iterations on 5 kernels.

Now the task is to grow the execution time of each kernel above the timer interval… pffft…

To do that we need to output our findings to a file!

      int f=FileOpen("OCLkernel_bench.txt",FILE_WRITE|FILE_TXT);
      if(f!=INVALID_HANDLE){
        for(int i=0;i<ArraySize(KERNELS);i++){
           ulong micros=KERNELS[i].end_microSeconds-KERNELS[i].start_microSeconds;
           if(KERNELS[i].completed){
             FileWriteString(f,"K["+IntegerToString(i)+"] completed in ("+IntegerToString(micros)+")microSeconds\n");
             }
           else
             {
             FileWriteString(f,"K["+IntegerToString(i)+"] not completed\n");
             }
           }
        FileClose(f);
        }

We add this to the exit block and we wait and see.

And voilà:

K[0] completed in (87334)microSeconds
K[1] completed in (87320)microSeconds
K[2] completed in (87300)microSeconds
K[3] completed in (87279)microSeconds
K[4] completed in (87261)microSeconds

Now, what do these mean? Nothing; they must be below our execution threshold, I think. Let's see.

One microsecond is one millionth of a second, i.e. one second is 1,000,000 microseconds, so what we see here is 87 milliseconds, and we are polling at a 1 ms interval. Okay. I don't trust it, because there may be a delay for the loop too.
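The unit conversion is easy to get wrong by a factor of 1000, so here it is as a small plain C++ sketch (not MQL5). The helper names `micros_to_millis` and `timer_ticks_in` are made up for this example.

```cpp
#include <cstdint>

// Convert a microsecond difference (end - start) to milliseconds.
double micros_to_millis(std::uint64_t micros) {
    return static_cast<double>(micros) / 1000.0;
}

// How many whole timer intervals of interval_ms fit in the measured span.
std::uint64_t timer_ticks_in(std::uint64_t micros, std::uint64_t interval_ms) {
    return micros / (interval_ms * 1000ULL);
}
```

For the first reading above, `micros_to_millis(87334)` is about 87.3 ms and `timer_ticks_in(87334, 1)` is 87, so in principle the 1 ms timer had dozens of chances to fire during that span. Whether it actually fired that often is exactly what the measurement cannot tell us.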

So… let's make the calcs heavier (more iterations); I'm sending 1 million iterations. Now, these will end at roughly the same time.

I'm also shutting MT5 down and restarting it for each run; I don't know if there's any caching going on, but I want to avoid it.

-I think I have to keep the test going for as long as the status of the kernels is running or in line to be executed or something-

It seems to be stuck or something… I expected an 80 second run, it's been 5 minutes now… 15 minutes. Okay, something broke. Let's add some conditions there… @#%!#!%$@^

  
    bool still_running=false;
    int running_total=0;
    int completed_total=0;
    int queued_total=0;
    int submitted_total=0;
    int unknown_total=0;
    for(int i=0;i<ArraySize(KERNELS);i++){
    if(!KERNELS[i].completed){
      ENUM_OPENCL_EXECUTION_STATUS status=CLExecutionStatus(KERNELS[i].handle);
      if(status==CL_COMPLETE){
      completed_total++;
      KERNELS[i].completed=true;
      KERNELS[i].stop(GetMicrosecondCount());
      }else if(status==CL_RUNNING){running_total++;still_running=true;}
      else if(status==CL_QUEUED){queued_total++;}
      else if(status==CL_SUBMITTED){submitted_total++;}
      else if(status==CL_UNKNOWN){unknown_total++;}
      }else{
      completed_total++;
      }
    }
    string message="Running("+IntegerToString(running_total)+")\n";
           message+="Completed("+IntegerToString(completed_total)+")\n";
           message+="Queued("+IntegerToString(queued_total)+")\n";
           message+="Submitted("+IntegerToString(submitted_total)+")\n";
           message+="Unknown("+IntegerToString(unknown_total)+")\n";
    Comment(message);

Changing the async waiting loop to this… let's see why the f*** it fails…

Okay, I was a bit naive earlier; I assumed that it will complete anyway. So let's not let it exit if it's queued or submitted or unknown, and let's drop to 1000 iterations again.

The loop now changes to this:

  
    bool still_running=false;
    int running_total=0;
    int completed_total=0;
    int queued_total=0;
    int submitted_total=0;
    int unknown_total=0;
    for(int i=0;i<ArraySize(KERNELS);i++){
    if(!KERNELS[i].completed){
      ENUM_OPENCL_EXECUTION_STATUS status=CLExecutionStatus(KERNELS[i].handle);
      if(status==CL_COMPLETE){
      completed_total++;
      KERNELS[i].completed=true;
      KERNELS[i].stop(GetMicrosecondCount());
      }else if(status==CL_RUNNING){running_total++;still_running=true;}
      else if(status==CL_QUEUED){queued_total++;still_running=true;}
      else if(status==CL_SUBMITTED){submitted_total++;still_running=true;}
      else if(status==CL_UNKNOWN){unknown_total++;still_running=true;}
      }else{
      completed_total++;
      }
    }
    string message="Running("+IntegerToString(running_total)+")\n";
           message+="Completed("+IntegerToString(completed_total)+")\n";
           message+="Queued("+IntegerToString(queued_total)+")\n";
           message+="Submitted("+IntegerToString(submitted_total)+")\n";
           message+="Unknown("+IntegerToString(unknown_total)+")\n";
    Comment(message);
  
    if(!still_running){
      EventKillTimer();
      exitNow=true;
      }
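The decision logic of that polling pass can be looked at in isolation. Below is a plain C++ sketch (not MQL5) with simulated statuses; the `Status` enum, the `Tally` struct and `poll_once` are stand-ins invented for this example, mirroring the five status values the loop checks.

```cpp
#include <vector>

// Stand-in for the five execution status values the MQL5 loop tests against.
enum class Status { Complete, Running, Submitted, Queued, Unknown };

struct Tally {
    int running = 0, completed = 0, queued = 0, submitted = 0, unknown = 0;
    bool still_running = false;
};

// One polling pass: anything that is not Complete keeps the test alive.
Tally poll_once(const std::vector<Status>& statuses) {
    Tally t;
    for (Status s : statuses) {
        switch (s) {
            case Status::Complete:  t.completed++; break;
            case Status::Running:   t.running++;   t.still_running = true; break;
            case Status::Queued:    t.queued++;    t.still_running = true; break;
            case Status::Submitted: t.submitted++; t.still_running = true; break;
            case Status::Unknown:   t.unknown++;   t.still_running = true; break;
        }
    }
    return t;
}
```

In the earlier version only a running kernel set the flag, so a pass in which every kernel was still queued would have ended the test on the spot. The flip side of this version is that a kernel which stays in the unknown state keeps the loop alive forever.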

Dropped to 1000 iterations; I think I saw it go through the kernels one by one. Let's add x10 iterations and see.

Same. 100k iterations… same. Okay, is there a problem with the decimal precision or something, and we can't hit 1 million?

There we go, yes: there is one unknown left and 4 completed kernels with 1 million iterations. But why?

It gets stuck there, but fortunately it doesn't seem to cause any issues on the device!

But why does it hang there? The times for 1000, 10000 and 100000 iterations were almost instant, so let's do a little change: if we're not hitting the time above the interval we need, let's not compound the operations in the kernel; just calculate a ton of s*** and then pass it.

I don't think it matters (if it does and you know it, let me know).

So we remove the +=, but now the problem is it will be serving the tanh value from its cache… so… let's turn this into an addition… 😛

"sum=tanh(of)+((double)iterations)/((double)100.0);"

That is the calc now. Let's run again for 1 million iterations… aaand yes, there was a precision issue or something.

Now, let's see the times.

K[0] completed in (370644)microSeconds
K[1] completed in (479982)microSeconds
K[2] completed in (604963)microSeconds
K[3] completed in (729959)microSeconds
K[4] completed in (839271)microSeconds

There's definitely a queue action going on here. Let's increase the kernels to 50.
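As a quick check on the queueing impression, the gaps between consecutive timings can be computed directly. This is a plain C++ sketch (not MQL5); `consecutive_gaps` is a name made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Differences between consecutive measurements, in microseconds.
std::vector<std::int64_t> consecutive_gaps(const std::vector<std::int64_t>& micros) {
    std::vector<std::int64_t> gaps;
    for (std::size_t i = 1; i < micros.size(); i++) {
        gaps.push_back(micros[i] - micros[i - 1]);
    }
    return gaps;
}
```

Fed the five values 370644, 479982, 604963, 729959 and 839271, it returns 109338, 124981, 124996 and 109312. Each kernel is seen finishing roughly 110 to 125 ms after the previous one, which is what one would expect if they ran one after another instead of side by side. This is only a reading of the numbers, not proof of how the driver schedules them.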

Stuck again: 19 completed, 31 unknown. Okay, we must see what the error is.

Added an error check on execution, no bueno. The issue is not there, so it must be coming up on unknown, if we assume it goes from submitted->queued->running->unknown or completed.

Let's see what the docs say about it:

Air, interesting… 😊 Okay, let's error gate the status too.

2023.05.02 22:14:21.458 blog_kernel_times_benchmark (USDJPY,H1) Unknown status for kernel(44) #5101
2023.05.02 22:14:21.458 blog_kernel_times_benchmark (USDJPY,H1) ----: unknown OpenCL error 65536

Unknown error… 'kay, #5101 is… internal error. Okay, very enlightening…

This is telling us this will not work, obviously, so we won't be able to benchmark this way, and in a way the “API” is correct, as it's asking us why on earth we are trying to do what it's supposed to do itself.

I'll take a break here, but I'll publish the blogs; them being incomplete (for now) may give someone an idea or two.

I'm also attaching the 2 sources I used so far.

—- okay —–

I'm doing this wrong. I think the time test can take place with one kernel and multiple items. I overdid it a bit there, so

What changes:
