Threads, performance, response times

  • #21
    The code as I used it does appear to perform differently in PBWin. I'll look into it later.

    On my PC, the results are:
    cached - 62.312 secs
    non-cached - 48.109 secs
    And the results on mine, with PBCC, are:
    cached - 12 secs
    non-cached - 17 secs

    Such a test should not access any input/output devices.
    The whole point of the test is that it is accessing I/O which is slow and asynchronous. That's how the speed gain is achieved. Threads allow the possibility of using the otherwise idle time spent waiting for data to appear from I/O.

    Buffers in Windows and hard drives are not relevant, as the file is too large to fit in the buffers and is accessed randomly to make sure that cache hits rarely, if ever, occur. That's why, in a previous post in this thread, a machine with lots of RAM took the same time for both tests until the test file was made larger than the available RAM.
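    The claim above, that a thread reclaims the otherwise idle time spent waiting on I/O, can be sketched in a few lines of Python (a hypothetical illustration, not the PB test itself: `time.sleep` stands in for both the blocking read and the number crunching):

```python
import threading, time

IO_TIME, CPU_TIME = 0.30, 0.20   # pretend latencies; sleep stands in for real work

def fake_read():                 # simulated slow, blocking read
    time.sleep(IO_TIME)

def fake_crunch():               # simulated number crunching
    time.sleep(CPU_TIME)

# serial: read, then crunch
t0 = time.perf_counter()
fake_read(); fake_crunch()
serial = time.perf_counter() - t0

# overlapped: a thread reads while the main thread crunches
t0 = time.perf_counter()
reader = threading.Thread(target=fake_read)
reader.start()
fake_crunch()
reader.join()
overlapped = time.perf_counter() - t0

print(f"serial={serial:.2f}s overlapped={overlapped:.2f}s")
```

    With the two overlapped, elapsed time approaches max(Read, CPU) rather than Read + CPU; the difference is exactly the idle wait the extra thread recovers.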



    • #22
      After a little consideration of Paul's code I've ended up with this for the ratio of cached and non-cached.

      max(Read, CPU)/(Read + CPU) where Read = total input time and CPU = total number-crunching time ... [1]

      This observation comes from two timing diagrams (images not reproduced here), with a Read-intensive application giving 10/11 and a CPU-intensive application giving 10/11 again.

      In the first case we cover all of the CPU time against the Read time and in the second case we cover only a small part of the CPU time with Read time.

      If we let Read = k*CPU then [1] becomes

      max(k*CPU, CPU)/(k*CPU + CPU) ... [2]

      For k<1, [2] => CPU/(k*CPU + CPU) = 1/(k + 1), i.e. a CPU-intensive app
      For k=1, [2] => CPU/(CPU + CPU) = 1/2
      For k>1, [2] => k*CPU/(k*CPU + CPU) = k/(k + 1), i.e. a Read-intensive app

      Consider a set of k values and the resulting ratio [2]:

      0.5 0.67
      0.6 0.63
      0.7 0.59
      0.8 0.56
      0.9 0.53
      1.0 0.5
      1.1 0.53
      1.2 0.55
      1.3 0.57
      1.4 0.58
      1.5 0.6

      What we have is an inverted bell shape and the optimum is at k=1. This, of course, is where all of the CPU time is covered by the total Read time.
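      Since CPU cancels out of [2], the ratio depends only on k, and the inverted bell is easy to check numerically (a quick sketch; the function name is mine):

```python
# ratio of cached to non-cached time: max(k*CPU, CPU)/(k*CPU + CPU) = max(k, 1)/(k + 1)
def ratio(k: float) -> float:
    return max(k, 1.0) / (k + 1.0)

for k in (0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 1.5):
    print(f"k={k:.1f}  ratio={ratio(k):.2f}")
```

      The minimum of 0.5 sits at k=1, and the left side (k<1, CPU-heavy) falls away faster than the right, matching the skew noted below.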

      With Paul's code and varying values of waste& I get

      10 0.71
      15 0.65
      20 0.60
      30 0.61
      40 0.61

      80 0.80

      There's our inverted bell shape again and the optimum seems to be around the waste&=20 area. That, of course, is the value used in Paul's code. Methinks Paul had done some tinkering.

      The bell shape is skewed, and notice that we do not bottom out at 0.5. The slope is steeper when we reduce the number crunching than when we increase it.

      In practice we may have no control over the value of waste&, but if we do then the code could be optimised.

      All very well, but will it behave the same on someone else's machine?

      I may be wrong, but if we erred toward the gentler-sloping side we may find that a variety of machines turn out similar results.

      Paul is getting 30% to 35% ( 0.7 to 0.65 with my formula). Now, this could be either side of the optimum so both a tweak up and down of waste& will be needed to determine which side.

      Of course, Paul's machine may not bottom out at 0.6 as mine does.

      It may well be that other machines behave in very different ways rendering the method a purely local one to be tinkered with.

      On a final note, I had a look at an MD5 routine. I set up two buffers and filled the first with the MD5 routine itself. I then created a thread to fill the second buffer whilst the MD5 routine crunched the first buffer. The MD5 then crunched the second buffer whilst the thread filled the first buffer, and so on. It didn't give me an improvement at all - the results were within a few milliseconds of each other on a 1400 ms test case. I had hoped to see that 1400 ms drop to below 1000 ms. The MD5 routine is in a PBWin app.
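      For reference, the double-buffering scheme described above looks roughly like this in Python (a hypothetical sketch, with `BytesIO` standing in for the input file; in CPython the overlap can only pay off because `hashlib.update` releases the GIL on large buffers):

```python
import hashlib, io, os, threading

CHUNK = 1 << 16
data = os.urandom(CHUNK * 8)     # stand-in for the file contents
src = io.BytesIO(data)

def read_chunk(out):             # "filler" thread: read the next chunk into a buffer
    out.append(src.read(CHUNK))

md5 = hashlib.md5()
cur = []
read_chunk(cur)                  # prime the first buffer
while cur[0]:
    nxt = []
    t = threading.Thread(target=read_chunk, args=(nxt,))
    t.start()                    # fill the second buffer...
    md5.update(cur[0])           # ...while crunching the first
    t.join()
    cur = nxt                    # swap buffers and repeat
print(md5.hexdigest())
```

      Each pass starts the filler on one buffer and hashes the other, then swaps, exactly as described.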

      May I be pedantic? The code listed so far closes neither the file nor the thread, and f& is not defined in the thread. It may not be relevant, but I crashed a few times during the threading part.
      Last edited by David Roberts; 19 Oct 2007, 07:49 PM.


      • #23
        Apparently I did spur on quite the debate, which can be good because of the sharing of knowledge.

        Cliff, seems to me you are often looking for ways to "beat the system."
        It may seem that way, Michael, but more to the point it is my incessant urge to learn from the past, where I just used the tools at hand instead of learning how they work and why they work, versus years down the line finding out the tool does NOT work in a particular case because it was not DESIGNED for that case.

        You can get real results a whole lot sooner if you just use the tools Windows provides as they were designed.
        I agree with you there, as long as the "Tool" is useful (which many are).

        "Go With The Flow," if you will.
        Excellent point; no problem with that. I just want to learn how and why when I have time (stop learning and you soon become outdated).

        Many curse the WinAPI as too 'complex,'
        I found this at first, beyond the simple API functions that are straightforward.

        'complexity' is the byproduct of 'choices', and 'choices' are what deliver 'power' to the programmer and make the Win/32 environment simply terrific for developing applications.
        I agree. The deeper I dig into the API, the more I find that what I thought was overwhelming can be picked up pretty quickly if you take it piece by piece and understand what the parameters mean rather than just copying.
        (Although I will be the 1st to say I got a LONNNNNNNNG way to learn yet.) which is why I love the pioneers ahead of me that not only see the problem, but how to simplify it to the user, until if/when they are willing to learn the complexities. (Take PB for example...instead of the complexities of SDK, they went with DDT to simplify things like "DIALOG NEW" so that learners do not "just give up" because SDK could be too 'complex' for a beginner)

        Tools are great as long as they were designed for the task at hand, Understanding the tool is even better, Making a better tool is even better yet.

        It all boils down to the idea of what you want to do, and how you want to do it.

        Example: I need a new muffler
        • I could take my truck to the shop and let them do it, and I pay for it (let's say 1 hour for them to do it with the correct tools; the money is worth the time I saved).
        • I could change it myself (let's say it takes me all day; I may not have the tools needed but make do with what I've got, and maybe I was better off saving the time and letting the shop do it).
        • I could learn how the muffler works, then get the tools and change it (now I've used time to learn, bought the tools, and done it myself; let's say it took me a week. What have I lost? Depends on whether time or money is more valuable to you than the knowledge you gained).
        • I could learn how a muffler works and build a better one (maybe wondering why somebody else never thought of this; let's say this took me 5 years. What have I lost? Nothing. I already gave in to the idea that it was a learning process, so it became a hobby, and the truck long since died. I gained a hobby, learned new things, and future developers may learn even more because I took the time to learn the past).

        I can see how you would think I am trying to circumvent a tool just because I am trying to understand it.

        Case in point, thanx to our lil debate over "Multiple 'copies' of Dll's" and my ways to attempt to 'circumvent' the ways it worked, I learned the why and how it worked (THANK YOU by the way), and once I understood it, saved another from following the same path with less knowledge of what they were up against.

        Anyways, enough of my long post...I am pretty well satisfied with the answer...Diminishing returns, vs the purpose of why I would want many or few threads...the short answer is "It depends"
        Engineer's Motto: If it aint broke take it apart and fix it

        "If at 1st you don't succeed... call it version 1.0"

        "Half of Programming is coding"....."The other 90% is DEBUGGING"

        "Document my code????" .... "WHYYY??? do you think they call it CODE? "


        • #24
          FWIW, Chris I'm getting a 40% reduction with your code as well but only when I use a 3GB file. With a 1GB file there is very little in it. At the moment I have over 1.5GB of system cache available. That could be why I'm bottoming out at 60% instead of 50%.

          Added: Nope. I just did a Gosub test1, rebooted, and did a Gosub test2. Now I get a 24% reduction. The question is whether I have traversed up the slope now and need to tweak waste& again. I cannot face all those reboots. With the MD5 test I used READFILE without buffering. That would defeat Paul's code, but I used local buffers via VIRTUALALLOC as learnt in a recent thread.
          Last edited by David Roberts; 19 Oct 2007, 10:02 PM.


          • #25
            Just for fun

            Originally posted by Chris Boss:

            Here is your code so it runs on PB Win 8.04:

            I ran the program twice and got the following times:

            My PC is a 2.5 ghz CPU, 256 meg Ram, Windows XP Home.
            I ran the code you posted several times, Chris, and got similar results (maybe a tenth of a second difference in each run, 8.2 vs 8.3). Running a Dell XPS 710 with XP Pro and 4 GB of memory. (Dunno how many other processes are running, though; Task Mgr shows 30-40.)

            I don't want any yes-men around me.
            I want everybody to tell me the truth
            even if it costs them their jobs.
            Samuel Goldwyn
            It's a pretty day. I hope you enjoy it.




            • #26
              I think that we are missing something here. File buffering occurs with writes as well as reads. With 4GB of RAM, Gösta, that 1,000,000-record file is in the system cache after writing and therefore available in RAM whichever method is employed first, whether it be caching or non-caching. The timings then should be pretty much the same, which is what you are getting. After writing the file, have a look at your drive light and you'll see nothing happening for both tests.

              However, I cannot figure out why Chris is not seeing an improvement, since he only has 256MB of RAM and so has hardly any system cache to speak of when dealing with a 1,000,000-record file. The records are only coming in at 1000 bytes each, so the caching approach should be working, unless with so little RAM the system works differently.


              • #27
                Of course, if we didn't have a filecache then both methods would be getting data from the drive and the timings would be about the same, the same result as Gösta but for a different reason.

                With a Put/Get combo there is only one way to do this: create the file, restart, run one method, restart, and then run the other method. Since I'm restarting I don't need to use a 3GB file now, so I used the 1,000,000-record file. I got 26.375s with no caching and 16.141s with caching, giving a 38.8% reduction.

                The 24% I got earlier did not see me Restart after the file creation and since my system cache is about half the size of the 3GB file there would have been stuff left over so getting random records would probably have found some in the system cache.
                Last edited by David Roberts; 21 Oct 2007, 04:14 AM.


                • #28
                  Below is a different version of the same thing. For brevity it doesn't include the creation of the 1GB test file but assumes it's already on disk. It doesn't matter what's in the file.

                  The original version I posted assumed programs and threads worked nicely together and freed unused timeslices in an orderly way allowing me to release timeslices frequently and have them taken up by other threads immediately. This worked well in PBCC but did not work well in PBWin.
                  It appears a PBWin EXE does not free up released timeslices but instead holds on to them even when there is no useful work to do.

                  Since I know the method is good, the solution is to stop assuming nice cooperation between threads and hog the CPU whenever it suits, until the OS decides enough is enough; i.e. if there is work that can be done then don't ever give up a timeslice. Only give it up if there is no useful work to do.

                  The following code does that.
                  It needs to cache ahead a few more records to do this efficiently but now works in PBCC and PBWin EXEs.

                   'PBCC4/PBWin8 program
                   'use threads in a non-cooperative environment to speed up disk access
                   #DIM ALL
                   %PrefetchSize = 10
                   %RunLimit = 10000              'records to process per test (assumed value; not shown in the original post)
                   %FileRecords = 1000000         'records in the 1GB test file (assumed value; not shown in the original post)
                   $TestFile = "test.dat"         'pre-existing 1GB test file (assumed name; not shown in the original post)
                   DECLARE FUNCTION QueryPerformanceFrequency LIB "KERNEL32.DLL" ALIAS "QueryPerformanceFrequency" (lpFrequency AS QUAD) AS LONG
                   DECLARE FUNCTION QueryPerformanceCounter   LIB "KERNEL32.DLL" ALIAS "QueryPerformanceCounter" (lpPerformanceCount AS QUAD) AS LONG
                   'this is the record type for the test file.
                   TYPE DataType
                       item AS STRING*1000
                   END TYPE
                   GLOBAL CacheIndex() AS LONG
                   GLOBAL CacheData() AS DataType
                   GLOBAL PrefetchPointer AS LONG
                   GLOBAL ProcessPointer AS LONG
                   GLOBAL FileNumber AS LONG
                   GLOBAL QuitFlag AS LONG
                   'this thread causes the data to be asynchronously prefetched saving as much as 50% of the normal time.
                   FUNCTION MyThread(BYVAL junk AS DWORD) AS DWORD
                       DO
                           WHILE ((PrefetchPointer - ProcessPointer) < %PrefetchSize) AND (PrefetchPointer <> %RunLimit)
                               'I should prefetch more data
                               INCR PrefetchPointer
                               GET #FileNumber, CacheIndex(PrefetchPointer), CacheData(PrefetchPointer)
                           WEND
                           SLEEP 0      'if the data is already cached far enough ahead then give up the remainder of the timeslice
                       LOOP UNTIL QuitFlag   'loop until main code sets this flag then quit.
                   END FUNCTION
                   FUNCTION PBMAIN () AS LONG
                       LOCAL freq AS QUAD
                       LOCAL count0 AS QUAD
                       LOCAL count1 AS QUAD
                       LOCAL r AS LONG
                       LOCAL waste AS LONG
                       LOCAL waste2 AS LONG
                       LOCAL sum AS LONG
                       LOCAL hThread AS DWORD
                       LOCAL MyRecord AS DataType
                       LOCAL a AS STRING
                       DIM CacheIndex(%RunLimit)     'an array of record numbers
                       DIM CacheData(%RunLimit)      'the prefetched records
                       'Get timer frequency.
                       QueryPerformanceFrequency freq
                       OPEN $TestFile FOR RANDOM AS FileNumber LEN=SIZEOF(DataType)
                       RANDOMIZE TIMER
                       'first, the non-threaded way
                       QueryPerformanceCounter count0
                       'work out the records to be processed
                       FOR r = 1 TO %RunLimit
                           CacheIndex(r) = RND(1, %FileRecords)   'pick random records so the system cache rarely helps
                       NEXT
                       'now process them
                       FOR r = 1 TO %RunLimit     'do required number of items
                           GET #FileNumber, CacheIndex(r), MyRecord
                           'give the CPU something to do with the data that takes a significant time
                           FOR waste = 1 TO 20
                               FOR waste2 = 1 TO 1000
                                   INCR sum
                               NEXT
                           NEXT
                       NEXT
                       QueryPerformanceCounter count1
                       a = "Non-Threaded time=" + FORMAT$((count1 - count0) / freq, "###,###.000")
                       'Second, the threaded way
                       QueryPerformanceCounter count0
                       'work out the records to be processed
                       FOR r = 1 TO %RunLimit
                           CacheIndex(r) = RND(1, %FileRecords)
                       NEXT
                       PrefetchPointer = 0
                       ProcessPointer = 0
                       'initiate the prefetching thread
                       THREAD CREATE MyThread(0) TO hThread
                       DO
                           WHILE ProcessPointer < PrefetchPointer
                               'process the data
                               FOR waste = 1 TO 20
                                   FOR waste2 = 1 TO 1000
                                       INCR sum
                                   NEXT
                               NEXT
                               INCR ProcessPointer
                           WEND
                           'if data has not been prefetched yet then give up the remainder of the timeslice rather than waste it
                           SLEEP 0
                       LOOP UNTIL ProcessPointer = %RunLimit
                       QueryPerformanceCounter count1
                       ? a + $CRLF + "    Threaded time=" + FORMAT$((count1 - count0) / freq, "###,###.000")
                       QuitFlag = 1   'force all threads to quit
                       THREAD CLOSE hThread TO r
                       CLOSE FileNumber
                   END FUNCTION
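                   The same producer/consumer pattern can also be expressed with a blocking bounded queue, which hands the timeslice back to the OS automatically instead of spinning on SLEEP 0. A Python sketch under assumed toy parameters (the record reader is simulated):

```python
import threading, queue, time

PREFETCH, RUNLIMIT = 10, 50     # assumed toy values, analogous to %PrefetchSize/%RunLimit

def read_record(i):             # simulated slow random-access read
    time.sleep(0.002)
    return b"x" * 1000

def prefetcher(q):              # producer: runs ahead of the consumer
    for i in range(RUNLIMIT):
        q.put(read_record(i))   # blocks once PREFETCH records are waiting

q = queue.Queue(maxsize=PREFETCH)
threading.Thread(target=prefetcher, args=(q,), daemon=True).start()

processed = 0
for _ in range(RUNLIMIT):
    rec = q.get()               # blocks only if the prefetcher has fallen behind
    processed += len(rec)       # stand-in for the number crunching
print(processed)
```

                   `q.put` blocking on a full queue and `q.get` blocking on an empty one mirror the two SLEEP 0 branches in the PB code above.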


                  • #29
                    For Paul's last code, my results are:

                    Non-Threaded time=8.299
                    Threaded time=8.414


                    • #30
                      Methinks Paul had done some tinkering
                      I deny it!
                      I aimed to have the access time and the CPU time comparable, both a few milliseconds, but I didn't specifically tweak it for maximum performance.

                      The codes listed so far are not closing the file nor the thread and f& is not defined in the thread.
                      Files are closed when the program ends, but the file should have been closed explicitly as that's good practice.
                      I noticed the file number wasn't defined in the thread! I've fixed it now but haven't looked into why it worked.

                      I used READFILE without buffering. That would defeat Paul's code
                      Not any more. Originally I only intended to show that prefetching the data into the cache would speed things up. I always knew it wasn't efficient as the data is then fetched a second time when it's used. It'd be much better to just fetch it once in advance as I do in the version I just posted.
                      Since it doesn't rely on the filesystem cache anymore, it might work in your program.

                      I do realise that, as posted, the last version is a little wasteful of memory as it fetches everything and stores it in an array. I did this to keep it simple so as not to hide the main principle that I'm trying to demonstrate which is that by using threads significant gains in speed can be achieved.
                      In practice, a real program would probably use a circular buffer of size equal to the number of items to be prefetched, so the same small buffer is reused throughout.
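                      That circular-buffer refinement might be sketched like so (hypothetical Python; two semaphores track full and free slots so the same small buffer is reused for the whole run):

```python
import threading

SLOTS, RUNLIMIT = 10, 100            # assumed toy values
ring = [None] * SLOTS                # the reusable circular buffer
filled = threading.Semaphore(0)      # records ready to process
empty = threading.Semaphore(SLOTS)   # free slots available

def prefetcher():
    for i in range(RUNLIMIT):
        empty.acquire()              # wait for a free slot
        ring[i % SLOTS] = i          # stand-in for "read record i into the slot"
        filled.release()

threading.Thread(target=prefetcher, daemon=True).start()

total = 0
for i in range(RUNLIMIT):
    filled.acquire()                 # wait for the prefetcher
    total += ring[i % SLOTS]         # stand-in for processing the record
    empty.release()                  # hand the slot back for reuse
print(total)
```

                      Memory use is fixed at SLOTS records regardless of how many records are processed, which is the point of the suggestion.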

                      Last edited by Paul Dixon; 21 Oct 2007, 11:17 AM.


                      • #31
                        >Non-Threaded time=8.299
                        >Threaded time=8.414

                        Call me naive, but I would score that a wash, meaning: the use of multiple threads of execution in this application should not be viewed as a performance enhancer; any use of multiple TOEs in this application should be done for other reasons.

                        Might this be a good application for asynchronous (overlapped) I-O?

                        Starter demo here: Asynchronous (Overlapped) I-O Demo
                        Michael Mattias
                        Tal Systems (retired)
                        Port Washington WI USA
                        [email protected]


                        • #32
                          You're naive. The original code was developed in PBCC and works well in PBCC, gaining anything from 25% to 40%. That's a significant gain.

                          You chose to quote the results from someone who ran it in PBWin for which the code was not written.

                          The latest version now works in both PBWin and PBCC and gains up to 50%. Twice as fast is not "a wash".



                          • #33
                            With PBWin8 I get 20.837s cached and 12.114s non-cached, 42% faster.
                            With PBCC4 I get 23.182s cached and 11.485s non-cached, 50% faster.

                            With 4GB RAM you may have the whole file cached for both runs so you need to make sure that it isn't cached.
                            Does the test file exist on your hard disk before you run the test? If not, you need to create it first.

                            Does the hard drive light flash continuously during the test? It should do.
                            If it doesn't then try making the testfile 5GB instead of 1GB as the 1GB was chosen to make sure it was larger than my 512MB RAM to guarantee it isn't still in cache when the test begins.

                            Also, the test only works well on hard disks. Solid state disks and USB memory drives show a much smaller improvement. On my PC with a USB memory I get a consistent 10% improvement.



                            • #34
                              Threads can be great in some apps

                              If you want to see a huge increase in speed, use them in an application that is waiting on outside responses.

                              For example, I get a huge increase in speed when using multiple threads to "ping" remote devices on a network. In a single-threaded application it takes about 1.5 seconds to time out on a ping of a dead IP address, so pinging a full IP range can take some time. But doing up to 50 at the same time, I can scan a 256-IP segment in about 7 seconds. Do the math on that (assuming all dead IPs, that would be 384 seconds versus a threaded 7.6 seconds). And then do the math on the 800,000+ IP addresses I have one of my scanners running each day (our full network).

                              So threads can be a huge help in speed... in the right situations. But increasing them too much will also hurt performance. For example, if I am just pinging IP addresses I can get away with about 50 threads running at the same time, but if I am also collecting information from the PCs via WMI or the registry etc., I start losing speed when I get above 15 or so threads. This is because things start running into bottlenecks (NIC, HD, CPU and other shared resources).
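                              The ping arithmetic above is easy to reproduce with a thread pool (a simulated sketch: `time.sleep` stands in for the ping timeout, scaled down from 1.5 s to keep the run short):

```python
import time
from concurrent.futures import ThreadPoolExecutor

TIMEOUT = 0.05                # scaled-down stand-in for the ~1.5 s ping timeout

def ping(ip):                 # simulated ping of a dead address: blocks, then fails
    time.sleep(TIMEOUT)
    return (ip, False)

ips = [f"10.0.0.{i}" for i in range(256)]

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(ping, ips))
elapsed = time.perf_counter() - t0
print(f"{len(results)} addresses in {elapsed:.2f}s")
```

                              With 50 workers the 256 addresses complete in roughly ceil(256/50) = 6 timeout periods instead of 256, the same ratio as the quoted 384 s versus ~7.6 s.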

                              In the case of threads, mileage will definitely vary by application. :coffee3:
                              Last edited by William Burns; 21 Oct 2007, 03:31 PM.
                              "I haven't lost my mind... its backed up on tape... I think??" :D


                              • #35
                                Paul, I can't wait to try out your latest code as I'm sure it will greatly benefit Message Digest/Secure Hash algorithms as they seem ideal candidates for pre-fetching data prior to hashing.

                                In the meantime you guys may like to have a look at this to avoid restarting your machines or writing massive files, Gösta.


                                • #36
                                  Just for the record....

                                  You don't need to use ReadFile and WriteFile to use the FILE_FLAG_NO_BUFFERING (or any other "non standard" 'open' flags)...
                                    hSys = CreateFile(...... special flags ....)
                                    hFile = FREEFILE
                                    OPEN HANDLE hSys FOR <PB syntax mode> AS hFile
                                    ' -------------------------------------
                                    ' GET/PUT here
                                    ' -------------------------------------
                                    CLOSE hFile
                                    CloseHandle hSys
                                   This does not eliminate the buffer-size and alignment requirements for the "var" in the PUT or GET when using the FILE_FLAG_NO_BUFFERING option, but some might find PUT/GET a bit easier - or at least a bit more familiar - to work with.
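                                   For comparison, Python has the same two-layer idea: open at the OS level with explicit flags, then wrap the raw descriptor for high-level access (a minimal sketch; one difference from PB's CLOSE is flagged in the comment):

```python
import os, tempfile

fd, path = tempfile.mkstemp()   # raw OS-level descriptor, analogous to CreateFile's handle
f = os.fdopen(fd, "w+")         # wrap it for high-level access, analogous to OPEN HANDLE
f.write("hello")
f.seek(0)
data = f.read()
f.close()                       # note: unlike PB's CLOSE, this also closes the descriptor
os.remove(path)
print(data)
```

                                   Any flags the wrapper doesn't expose can be applied at the `os.open`/CreateFile layer, which is exactly the trick described above.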



                                  • #37
                                     That is an interesting aspect of RTFM. It is a long time since I read OPEN in detail. I read it on first getting PBWin, but I probably glossed over any reference to the API as it meant little to me at the time. It has occurred to me of late that some aspects of the manual may be worth reading again, as a much higher percentage may be absorbed now compared with the initial read. I tend to use the manual when using an aspect of PB that I have not ventured into much, or at all.

                                     On the issue of "buffer-size and alignment requirements": I adhere exactly to the requirements in a working app, but the idea used in the link above totally ignores any such requirements. Perhaps I should look at it again to see whether the results differ with adherence. Whilst on this subject, I have not been able to find any comments about possible mishaps if such requirements are not adhered to.


                                    • #38
                                       While OPEN HANDLE can be used to effect an OPEN with 'non-default' options otherwise not (yet?) controllable through PB syntax, its more common use would be passing an open file handle to a function which resides in another code module.

                                       Since PB internal handles (e.g. "#12" or whatever is returned by FREEFILE) are only valid in the code module in which that handle was obtained, you can't pass one to such a function and expect it to be valid. Instead you do something like...

                                      ' MAIN.BAS
                                      #COMPILE EXE 
                                      DECLARE FUNCTION SupportFunction LIB "MyDLL.DLL" (BYVAL h AS LONG) AS LONG
                                         hFile = FREEFILE
                                         OPEN "byfile" FOR OUTPUT AS hFile
                                         hSys = FILEATTR(hFile, 2)    ' get system handle
                                         CALL SupportFunction(hSys)
                                         PRINT #hFile, "End of Report"
                                         CLOSE hFile
                                      ' =================================================
                                      #COMPILE DLL
                                      FUNCTION SupportFunction (BYVAL hSys AS LONG) EXPORT AS LONG
                                          hFile = FREEFILE
                                          OPEN   HANDLE hSys FOR OUTPUT AS hFile 
                                          PRINT #hFile, "Hello World"    ' <<< PB syntax not "Writefile"
                                          CLOSE hFile     ' does not close underlying system handle
                                          FUNCTION = %TRUE
                                      END FUNCTION


                                      • #39
                                         Paul Purvis asked me a little while back why I used a function called GetFileSize instead of LOF, and I told him that the handle returned by CREATEFILE didn't work with LOF.

                                        Now I realise that I could have simply written

                                        Open Handle hSys As #1
                                        Filelength = Lof(#1)
                                        Close #1

                                        Live and learn.


                                        • #40
                                          Real Men Don't Need No Stinkin' Handle...Write once, use many....The Power of MACROs.....<insert platitude of choice here>...

                                           MACRO EQ32(a,b) = (BITS???(a)=BITS???(b))
                                           MACRO FUNCTION ANY_FILESIZE (szFile)
                                              MACROTEMP w32, hSearch, fsize
                                              DIM w32 AS WIN32_FIND_DATA, hSearch AS LONG, fsize AS LONG
                                              hSearch = FindFirstFile (szFile, w32)
                                              IF EQ32(hSearch, %INVALID_HANDLE_VALUE) THEN
                                                 fsize = -1&   ' "not found"
                                              ELSE
                                                 FindClose hSearch
                                                 fsize = w32.nFileSizeLow
                                              END IF
                                           END MACRO = fsize
                                             fsize = ANY_FILESIZE (szFile)
                                          (Fails if file open and size is being changed since directory is not updated until file is CLOSEd)
                                          (Also, not written for filesize > 2 Gb)
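                                           The same "size without opening the file for reading" idea in Python goes through the file metadata via `os.stat` (a trivial sketch; a similar caveat may apply, in that the reported size of a file still open for writing can be stale on some filesystems):

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 1234)          # create a file of known size
os.close(fd)

size = os.stat(path).st_size       # size from metadata; no read access to the contents needed
print(size)
os.remove(path)
```

                                           Unlike the 32-bit `nFileSizeLow` field above, `st_size` is a full 64-bit value, so the >2GB limitation does not arise.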