Okay, so what we'll do here is the same as before, but with a double type this time.
We change our kernel code to this:
string kernel="__kernel void bench(__global double* _tangent,"
              "int iterations){"
              "double sum=(double)0.0;"
              "double of=(double)_tangent[get_global_id(0)];"
              "for(int i=0;i<iterations;i++){"
              "sum+=((double)tanh(of-sum))/((double)iterations);"
              "}"
              "_tangent[get_global_id(0)]=sum;}";
We create a double array whose size matches the number of our kernels, and we take care to initialize it properly, so we drop back down to 5 kernels and deploy this:
double tangents[];
ArrayResize(tangents,kernels_to_deploy,0);
double range=5.2;
for(int i=0;i<ArraySize(tangents);i++){
   // random value in [-2.6, 2.6]
   double r=(((double)MathRand())/((double)32767.0)*range)-2.6;
   tangents[i]=r;
}
// one double (8 bytes) per kernel
// note: tangents[] is filled above but never copied into the buffer in this listing
int tangents_id=CLBufferCreate(ctx,kernels_to_deploy*8,CL_MEM_READ_WRITE);
bool args_set=true;
for(int i=0;i<kernels_to_deploy;i++){
   ResetLastError();
   if(!CLSetKernelArgMem(KERNELS[i].handle,0,tangents_id)){
      Print("Cannot assign buffer to kernel("+IntegerToString(i)+") error #"+IntegerToString(GetLastError()));
      args_set=false;
   }else{
      CLSetKernelArg(KERNELS[i].handle,1,iterations);
   }
}
if(args_set){
   Print("All arguments for all kernels set!");
}else{
   Print("Cannot setup kernel args!");
}
We create the buffer and then we must attach it to all kernels, right? Let's see if we can do that!
But don't forget we must unload the buffer too, so add this after the unload loop
(we move the tangents id out of the loop; in a normal release this would be wrapped and managed inside a structure, but we're testing, so there is no need for this to be able to land the lunar mission!)
CLBufferFree(tangents_id);
CLProgramFree(prg);
We add it there, run it, and with one run we find out 2 things!
2023.05.02 20:49:49.762 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:49:49.776 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 5 kernels = 94ms
Nice, now let's do 50 kernels; we want to gauge any deployment delays
2023.05.02 20:50:51.875 blog_kernel_times_benchmark (USDJPY,H1) Deployed all kernels!
2023.05.02 20:50:51.875 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:50:51.891 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 50 kernels = 93ms
Lovely, and 5000 kernels?
Okay, a slight delay there, but it seems okay
2023.05.02 20:52:03.356 blog_kernel_times_benchmark (USDJPY,H1) All arguments for all kernels set!
2023.05.02 20:52:03.373 blog_kernel_times_benchmark (USDJPY,H1) Time to load and unload 5000 kernels = 110ms
Let's get to the point now, finally!
We must quickly light up the kernels, set their offsets and work loads, and set up an interval of 1ms, jesus, and we must also not enter this portion of the timer function again, otherwise we'll run into trouble. So: a boolean flag, bool kernelsRunning=false; 😊
If that is true, we move into a timer loop where we are just collecting completion notices and storing them.
Let's also add a completed flag to our kernel_info object, and I'll remove the handle from the setup since I didn't use it. A CLExecuteKernelList function would be nice to have, MQL5, like the CommandQueue in the original OpenCL API.
When all is complete we will tally the times, but we'll deal with that later, so first we drop down to 5 kernels again.
This is what our class looks like now:
class kernel_info{
   public:
   bool completed;
   int offset;
   int handle;
   ulong start_microSeconds;
   ulong end_microSeconds;
   kernel_info(void){reset();}
   ~kernel_info(void){reset();}
   // back to the "not started" state
   void reset(){
      completed=false;
      offset=-1;
      handle=INVALID_HANDLE;
      start_microSeconds=0;
      end_microSeconds=0;
   }
   // record the launch time and the work offset
   void setup(ulong _start,int _offset){
      start_microSeconds=_start;
      offset=_offset;
   }
   // record the completion time
   void stop(ulong _end){
      end_microSeconds=_end;
   }
};
We gate the existing timer commands with
if(!kernelsRunning)
{
}
And we are very careful here: we must anticipate unloading the kernels when the test ends or when the test does not start at all, so:
We add an exitNow variable at the top, set to true if the test fails or the test ends.
All the context handles become variables of the global scope…
We remove some stuff from the old section (don't worry, I saved it as it was in the source file), so our timer looks like this now:
bool exitNow=false;
if(!kernelsRunning)
{
   EventKillTimer();
   ctx=CLContextCreate(CL_USE_GPU_DOUBLE_ONLY);
   int kernels_to_deploy=5;
   tangents_id=INVALID_HANDLE;
   if(ctx!=INVALID_HANDLE){
      string kernel="__kernel void bench(__global double* _tangent,"
                    "int iterations){"
                    "double sum=(double)0.0;"
                    "double of=(double)_tangent[get_global_id(0)];"
                    "for(int i=0;i<iterations;i++){"
                    "sum+=((double)tanh(of-sum))/((double)iterations);"
                    "}"
                    "_tangent[get_global_id(0)]=sum;}";
      string errors="";
      prg=CLProgramCreate(ctx,kernel,errors);
      if(prg!=INVALID_HANDLE){
         ResetLastError();
         int iterations=1000;
         ArrayResize(KERNELS,kernels_to_deploy,0);
         bool deployed=true;
         for(int i=0;i<kernels_to_deploy;i++){
            KERNELS[i].handle=CLKernelCreate(prg,"bench");
            if(KERNELS[i].handle==INVALID_HANDLE){deployed=false;}
         }
         if(deployed){
            Print("Deployed all kernels!");
            double tangents[];
            ArrayResize(tangents,kernels_to_deploy,0);
            double range=5.2;
            for(int i=0;i<ArraySize(tangents);i++){
               double r=(((double)MathRand())/((double)32767.0)*range)-2.6;
               tangents[i]=r;
            }
            tangents_id=CLBufferCreate(ctx,kernels_to_deploy*8,CL_MEM_READ_WRITE);
            bool args_set=true;
            for(int i=0;i<kernels_to_deploy;i++){
               ResetLastError();
               if(!CLSetKernelArgMem(KERNELS[i].handle,0,tangents_id)){
                  Print("Cannot assign buffer to kernel("+IntegerToString(i)+") error #"+IntegerToString(GetLastError()));
                  args_set=false;
               }else{
                  CLSetKernelArg(KERNELS[i].handle,1,iterations);
               }
            }
            if(args_set){
               Print("All arguments for all kernels set!");
            }else{
               Print("Cannot setup kernel args!");
               exitNow=true;
            }
         }else{
            Print("Cannot deploy all kernels!");
            exitNow=true;
         }
      }else{Alert(errors);exitNow=true;}
   }
   else{
      Print("Cannot create ctx");
      exitNow=true;
   }
}
if(exitNow){
   // release everything in reverse order of creation
   if(tangents_id!=INVALID_HANDLE){CLBufferFree(tangents_id);}
   for(int i=0;i<ArraySize(KERNELS);i++){
      if(KERNELS[i].handle!=INVALID_HANDLE){CLKernelFree(KERNELS[i].handle);}
   }
   if(prg!=INVALID_HANDLE){CLProgramFree(prg);}
   // free the context handle (the earlier version passed prg here by mistake)
   if(ctx!=INVALID_HANDLE){CLContextFree(ctx);}
   Print("DONE");
   ExpertRemove();
}
Okay, now… let's think…

Let's handle the completion first; that's the easy part
When all kernels have completed, for now, we exit and stop the timer.
Note we have not "launched" anything yet; we might get a myriad of errors when we do!
else if(!Busy&&kernelsRunning){
   Busy=true;
   bool still_running=false;
   for(int i=0;i<ArraySize(KERNELS);i++){
      if(!KERNELS[i].completed){
         if(CLExecutionStatus(KERNELS[i].handle)==CL_COMPLETE){
            KERNELS[i].completed=true;
         }else{still_running=true;}
      }
   }
   if(!still_running){
      EventKillTimer();
      exitNow=true;
   }
   if(!exitNow){Busy=false;}
}
It looks simple enough and correct, I think:
- we go through the list of kernels
- if something has finished we set it to completed
- if not we light up the still running flag
- exit the loop
- if nothing is still running kill the timer
- light up exit now
- don't shut off the busy flag
Oops, forgot to measure the end time! Add this under completed
KERNELS[i].stop(GetMicrosecondCount());
and this is the execution call:
uint offsets[]={0};
uint works[]={1};
for(int i=0;i<ArraySize(KERNELS);i++){
   // each kernel gets one work item, offset to its own slot in the buffer
   offsets[0]=i;
   CLExecute(KERNELS[i].handle,1,offsets,works);
   KERNELS[i].setup(GetMicrosecondCount(),i);
}
kernelsRunning=true;
EventSetMillisecondTimer(1);
Let's see what happens. I can't see anything now; it exited though. So… that was with 1000 iterations on 5 kernels
Now the task is to grow the execution time of each kernel above the timer interval… pffft…
To do that we need to output our findings to a file!
int f=FileOpen("OCLkernel_bench.txt",FILE_WRITE|FILE_TXT);
if(f!=INVALID_HANDLE){
   for(int i=0;i<ArraySize(KERNELS);i++){
      // elapsed time between launch and detected completion
      ulong micros=KERNELS[i].end_microSeconds-KERNELS[i].start_microSeconds;
      if(KERNELS[i].completed){
         FileWriteString(f,"K["+IntegerToString(i)+"] completed in ("+IntegerToString(micros)+")microSeconds\n");
      }
      else
      {
         FileWriteString(f,"K["+IntegerToString(i)+"] not completed\n");
      }
   }
   FileClose(f);
}
We add this to the exit block and we wait and see.
and voilà
K[0] completed in (87334)microSeconds
K[1] completed in (87320)microSeconds
K[2] completed in (87300)microSeconds
K[3] completed in (87279)microSeconds
K[4] completed in (87261)microSeconds
Now, what do these mean? Nothing; they must be below our execution threshold, I think. Let's see
One microsecond is… one millionth of a second, or, one second is 1000000 microseconds, so what we see here is 87 milliseconds, and we are polling at an interval of 1 ms, okay. I don't trust it, because there may be a delay from the loop too.
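The unit conversion can be written down as a small sketch in plain C. Both helper names are hypothetical and only illustrate the arithmetic on the figures from the output file.

```c
#include <assert.h>

/* Hypothetical helper: microseconds to milliseconds. */
double micros_to_millis(unsigned long long micros)
{
    return (double)micros / 1000.0;
}

/* Hypothetical helper: how many whole timer ticks of interval_ms
   fit into the measured span. */
unsigned long long ticks_elapsed(unsigned long long micros, unsigned interval_ms)
{
    assert(interval_ms > 0);
    return micros / (1000ULL * interval_ms);
}
```

With the first reading, `micros_to_millis(87334)` is about 87.3 ms and `ticks_elapsed(87334, 1)` is 87 ticks of the 1 ms timer. Keep in mind that the start stamp is taken right after the execute call returns and the end stamp only when a timer tick first sees the completed status, so the figure includes polling latency on top of the kernel's own run time.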
So… let's make the calcs heavier (more iterations); I'm sending one million iterations. Now, these will end at roughly the same time
I'm also shutting MT5 down and restarting it for each run; I don't know if there's any caching going on, but I want to avoid it.
-I guess I have to keep the test going for as long as the status of the kernels is running or in line to be executed or smth-
It seems to be stuck or smth… I expected an 80 second run; it's been 5 minutes now… 15 minutes, okay, something broke. Letsssss add some conditions there… @#%!#!%$@^
bool still_running=false;
int running_total=0;
int completed_total=0;
int queued_total=0;
int submitted_total=0;
int unknown_total=0;
for(int i=0;i<ArraySize(KERNELS);i++){
   if(!KERNELS[i].completed){
      ENUM_OPENCL_EXECUTION_STATUS status=(ENUM_OPENCL_EXECUTION_STATUS)CLExecutionStatus(KERNELS[i].handle);
      if(status==CL_COMPLETE){
         completed_total++;
         KERNELS[i].completed=true;
         KERNELS[i].stop(GetMicrosecondCount());
      }else if(status==CL_RUNNING){running_total++;still_running=true;}
      else if(status==CL_QUEUED){queued_total++;}
      else if(status==CL_SUBMITTED){submitted_total++;}
      else if(status==CL_UNKNOWN){unknown_total++;}
   }else{
      completed_total++;
   }
}
// show the tally on the chart
string message="Running("+IntegerToString(running_total)+")\n";
message+="Completed("+IntegerToString(completed_total)+")\n";
message+="Queued("+IntegerToString(queued_total)+")\n";
message+="Submitted("+IntegerToString(submitted_total)+")\n";
message+="Unknown("+IntegerToString(unknown_total)+")\n";
Comment(message);
Changing the async waiting loop to this… let's see why the f*** it fails…
Okay, I was a bit naive earlier: I assumed that it would complete anyway. So let's not let it exit if it's queued or submitted or unknown, and let's drop to 1000 iterations again.
The loop now changes to this:
bool still_running=false;
int running_total=0;
int completed_total=0;
int queued_total=0;
int submitted_total=0;
int unknown_total=0;
for(int i=0;i<ArraySize(KERNELS);i++){
   if(!KERNELS[i].completed){
      ENUM_OPENCL_EXECUTION_STATUS status=(ENUM_OPENCL_EXECUTION_STATUS)CLExecutionStatus(KERNELS[i].handle);
      if(status==CL_COMPLETE){
         completed_total++;
         KERNELS[i].completed=true;
         KERNELS[i].stop(GetMicrosecondCount());
      }else if(status==CL_RUNNING){running_total++;still_running=true;}
      // anything that is not complete now keeps the loop alive
      else if(status==CL_QUEUED){queued_total++;still_running=true;}
      else if(status==CL_SUBMITTED){submitted_total++;still_running=true;}
      else if(status==CL_UNKNOWN){unknown_total++;still_running=true;}
   }else{
      completed_total++;
   }
}
string message="Running("+IntegerToString(running_total)+")\n";
message+="Completed("+IntegerToString(completed_total)+")\n";
message+="Queued("+IntegerToString(queued_total)+")\n";
message+="Submitted("+IntegerToString(submitted_total)+")\n";
message+="Unknown("+IntegerToString(unknown_total)+")\n";
Comment(message);
if(!still_running){
   EventKillTimer();
   exitNow=true;
}
Dropped to 1000 iterations; I think I saw it go through the kernels one by one. Let's add x10 iterations and see.
Same. 100k iterations… same. Okay, is there a problem with the decimal precision or something, so that we can't hit 1 million?
There we go, yes: there is one unknown left and 4 completed kernels with one million iterations, but why?

It gets stuck there, but fortunately it does not seem to cause any issues on the device!
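One way to keep a stuck status from holding the timer loop open forever is to pair the "anything not complete" check with a deadline. The sketch below is plain C and only models the decision. The state names are placeholders and not the real MQL5 constants, and `elapsed_ms` and `deadline_ms` are hypothetical values the caller would have to track.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder states for this sketch; they stand in for whatever the
   status call reports and are not the real MQL5 constants. */
typedef enum { ST_COMPLETE, ST_RUNNING, ST_SUBMITTED, ST_QUEUED, ST_UNKNOWN } kernel_state;

/* Returns true while polling should continue: at least one kernel is
   not complete and the deadline has not passed. */
bool keep_polling(const kernel_state *states, size_t count,
                  unsigned long elapsed_ms, unsigned long deadline_ms)
{
    assert(states != NULL || count == 0);
    if (elapsed_ms >= deadline_ms) {
        return false;             /* give up and report what finished so far */
    }
    for (size_t i = 0; i < count; i++) {
        if (states[i] != ST_COMPLETE) {
            return true;          /* something is still pending */
        }
    }
    return false;                 /* everything completed */
}
```

With four completed kernels and one stuck on unknown, this keeps polling until the deadline and then lets the exit block run, so the results for the kernels that did finish still reach the output file.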
But why does it hang there? The runs with 1000, 10000 and 100000 iterations were almost instant, so let's make a little change: if we're not hitting the time above the interval we need, let's not compound the operations in the kernel, just calculate a ton of s*** and then pass it.
I don't think it matters (if it does and you know it, let me know)
So we remove the += but now the problem is that it will be serving the tanh value from its cache… so… let's turn this into an addition… 😛
"sum=tanh(of)+((double)iterations)/((double)100.0);"
This is the calc now; let's run again for 1 million iterations… aaand yes, there was a precision issue or something.
Now, let's check the times.
K[0] completed in (370644)microSeconds
K[1] completed in (479982)microSeconds
K[2] completed in (604963)microSeconds
K[3] completed in (729959)microSeconds
K[4] completed in (839271)microSeconds
There's definitely a queue action going on here. Let's increase the kernels to 50.
Stuck again. 19 completed, 31 unknown. Okay, we must see what the error is.
Added an error check on execution, no bueno. The issue is not there, so it must be coming up on unknown, if we assume it goes from submitted->queued->running->unknown or completed.
Let's see what the docs say about it:
Ah, interesting… 😊 okay. Let's error gate the status too.
2023.05.02 22:14:21.458 blog_kernel_times_benchmark (USDJPY,H1) Unknown status for kernel(44) #5101
2023.05.02 22:14:21.458 blog_kernel_times_benchmark (USDJPY,H1) ----: unknown OpenCL error 65536
Unknown error… 'kay. #5101 is… internal error, okay, very enlightening…
This is telling us this will not work, clearly, so we won't be able to benchmark this way, and in a way the "api" is correct, as it's asking us why on earth we are trying to do what it's supposed to do itself.
I'll take a break here, but I'll publish the blogs; them being incomplete (for now) may give someone an idea or two.
I'm also attaching the 2 sources I used so far.
—- okay —–
I'm doing this wrong. I think the time test can happen with one kernel and multiple items. I overdid it a bit there, so
What adjustments :