fastest number cruncher?

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • fastest number cruncher?

    I want to use PBCC to develop a simulation model that does a lot of calculations. Each run uses maybe 5 million calculations and then I change the conditions and do the next run and so on. The program reads in about 1.7 megs of data for each run.
    I am finally moving away from DOS PB3.2 to PBCC. It has taken me a long time to develop the program and the changeover is something I have put off for a decade. I want to crunch the numbers and read in the data as fast as possible so I can keep the processing time down (as well as enjoying the challenge of speed). My impression of most Windows programs is that they are slow to crunch numbers because there is so much else happening in the background that there is little CPU time left for me. Is it possible that a DOS system is still faster?

    Before I get started, can anyone make some low-level suggestions on how to structure the set-up for maximum speed? I have no problem using a dedicated PC, but I think the program structure will be more important than the PC clock speed.
    Thanks

  • #2
    Peter,
    My impression of most Windows programs is they are slow to crunch numbers as there is so much else happening in the background there is little CPU time left for me
    That's not right.
    There is negligible interference from other Windows processes unless you run other intensive tasks at the same time.
    My PC has been on for hours now running all sorts of non-intensive stuff and it shows 98.7% idle time. That's all available for any intensive task I might run.

    With Windows, depending on the nature of your calculation, you may be able to use threads to run different parts of your task simultaneously on different CPU cores in your PC. This can't be done with DOS programs but it can easily quadruple throughput with a quad-core CPU. To take advantage of this you ideally need calculations which are independent of each other, but it might be possible to benefit anyway depending on the nature of the calculations.
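
    As a rough illustration only (my own minimal sketch, not Peter's program: the ProcessChunk worker, the SIN loop and the four-thread count are just placeholders, and the THREAD FUNCTION syntax assumes PB/CC 5 or later), splitting independent chunks of work across threads looks something like this:
    Code:
    #COMPILE EXE
    #DIM ALL

    ' Hypothetical worker: crunches one independent chunk of the job.
    ' PB thread procedures take a single LONG parameter passed BYVAL.
    THREAD FUNCTION ProcessChunk(BYVAL nChunk AS LONG) AS LONG
        LOCAL i AS LONG
        LOCAL x AS DOUBLE
        FOR i = 1 TO 1000000               ' stand-in for the real calculations
            x = x + SIN(i * nChunk)
        NEXT
        FUNCTION = nChunk
    END FUNCTION

    FUNCTION PBMAIN() AS LONG
        LOCAL n, lResult AS LONG
        LOCAL hT AS DWORD
        DIM hThread(1 TO 4) AS DWORD

        ' Start one worker per core (assuming a quad-core CPU).
        FOR n = 1 TO 4
            THREAD CREATE ProcessChunk(n) TO hT
            hThread(n) = hT
        NEXT

        ' Wait for each worker to finish, then release its handle.
        FOR n = 1 TO 4
            DO
                SLEEP 10
                THREAD STATUS hThread(n) TO lResult
            LOOP WHILE lResult = &H103     ' &H103 = STILL_ACTIVE
            THREAD CLOSE hThread(n) TO lResult
        NEXT

        PRINT "All chunks finished"
    END FUNCTION
    The important part is that each chunk works on its own data; if the chunks have to share results mid-calculation you need synchronisation, which quickly eats into the gains.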


    1.7MB of data is almost trivial for Windows; if it's the same 1.7MB you work on each run then it'll fit in the CPU cache of a modern CPU. If it's many different 1.7MB blocks of data then you might benefit from something like this:
    http://www.powerbasic.com/support/pb...d.php?p=365901
    In post #10 there is code that allows the next block of data to be fetched while the previous one is written to disk and the current one is being processed, in order to minimise any waiting on data/hard disks. If you have lots of data to read AND write then get a PC with 2 fast hard disks and read the data from one while writing results to the other.

    Other suggestions are dependent on the details of your task.

    Paul.



    • #3
      Independent calculations?

      If the data can be calculated in parallel, i.e. independent calculations that don't depend on previous results, you can look at the CUDA Toolkit from NVIDIA. It works with most newer 3D graphics cards. Offload the processing to the graphics card.

      The high-end cards (Tesla) can do 500 GFLOPS in double precision and 1 TFLOPS in single precision.

      The calculations must fit a specific problem domain. Matrix maths and Fast Fourier Transforms are problems that really perform well. Basically, if the calculations are independent and there are a lot of them, it will be worth it. If the calculations are dependent there may only be minor gains. In addition, there is overhead in sending the data to/from the GPU.

      My replacement laptop, when I replace it, will have a CUDA-capable video card.



      • #4
        I have never had performance problems with applications developed in PowerBASIC; in fact, they have always run very fast.

        Regarding data-parallel problems, the GPU is more suitable than the CPU; that is not a limitation of PowerBASIC or Windows, but of the hardware design (at the moment).

        Brian's CUDA suggestion is good; as an alternative you might want to pick up OpenCL, which is not limited to NVIDIA hardware but also runs on AMD Radeon GPUs and AMD CPUs. Intel support for OpenCL is "strange" at the moment, currently in pre-release stage.

        I posted a starter example on using this technology from PowerBASIC some time ago:
        http://www.jose.it-berater.org/smffo...p?topic=3327.0

        Data-parallel programming is something you need to get your head used to, but even the cheap NVIDIA cards can easily outperform current 4-core CPUs on data-parallel problems.

        If you are interested in some reading, my older work included porting an algorithm from CPU to GPU, and the difference was a drop in calculation time from 15 minutes to 32 seconds.

        You can read the paper and see some charts here (PDF download) (it's my ThinBASIC forum post, as I had no other place to put it, but of course OpenCL can be used from PB as well, as my previous link to José's forum shows).

        But I would first recommend trying to develop it in PB/Win or PB/CC; the performance will most likely be more than enough if the algorithm is designed well and not bound by hard-drive reads. GPU programming is not something one can learn over a weekend, while PB is quite straightforward and gives nice performance even if you stay quite high level.


        Petr
        Last edited by Petr Schreiber jr; 27 Aug 2011, 01:01 PM.
        [email protected]



        • #5
          Don't forget about calculations that may lend themselves to a lookup table. Often retrieving the pre-calculated info is faster than creating it on the fly every time it's needed, and memory is cheap.
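
          For example (a made-up sketch, nothing to do with Peter's actual model; the formula and the 366-entry table are just placeholders): if a loop keeps recomputing the same expensive expression of a small integer input, compute it once up front and index into an array afterwards.
          Code:
          #COMPILE EXE
          #DIM ALL

          FUNCTION PBMAIN() AS LONG
              LOCAL i, dayNum AS LONG
              LOCAL total AS DOUBLE
              DIM lut(0 TO 365) AS DOUBLE                  ' lookup table, one slot per day of year

              ' Pay the cost of the expensive expression once per possible input...
              FOR i = 0 TO 365
                  lut(i) = EXP(-i / 365#) * SIN(i * 0.0172) ' stand-in for a costly formula
              NEXT

              ' ...then the main crunching loop is just an array read.
              FOR i = 1 TO 5000000
                  dayNum = i MOD 366
                  total = total + lut(dayNum)
              NEXT

              PRINT "Total ="; total
          END FUNCTION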

          First make it run, then make it run fast.
          Last edited by Frank Fenti; 27 Aug 2011, 08:29 PM.
          BASIC shampoo - DO:LATHER:RINSE:LOOP UNTIL CLEAN <>0



          • #6
            Thank you all for your suggestions. (I feel a bit like some guy who has been lost in the jungle wondering if the war is over.) I am going to enjoy trying them. Yes, I can run parallel processes. As a first pass I think I can just use a multi-core PC and run multiple copies of the program, each doing some of the work. I never would have thought of a graphics card, but I guess it is just a number-crunching accessory.
            Q#1: I used to use a RAM disk to speed up data read/write. If the cache will do this anyway is it correct to assume there is no value in doing this (approx 1.5 meg read size)?
            Q#2: Is there any value in using an XP OS with a minimal install to minimise the potential for competing processes? It seems unnecessary.
            Q3: I had a read of Petr's paper and it seems that he used OpenCL. However if I just stick to PBCC and add a graphics card do I need to specifically instruct the graphics card to work or will it do this by default?
            Q4: Assuming I could (theoretically) split the calculations into any number of separate processes, or leave them as one larger identical process, BUT I am confined to just one processor core: will it be faster to parallel process, or does Windows XP "multi-task" merely by moving sequentially from one task to another (albeit quickly)? If multi-tasking is just sequential rather than parallel then there seems to be no benefit.
            Q5: XP is now replaced by Win7. Is this a better OS for multi-tasking?

            Thanks again for your comments.



            • #7
              Hi Peter,

              Q3: I had a read of Petr's paper and it seems that he used OpenCL. However if I just stick to PBCC and add a graphics card do I need to specifically instruct the graphics card to work or will it do this by default?
              OpenCL is a technology usable from PB/CC. On Windows, it is "just" a set of functions in a DLL installed by the graphics driver (NVIDIA) or SDK (ATI/AMD/Intel) which allow you to set up the computation for the GPU. The GPU program itself is written in a language based on C99.

              So typically you load the data from the hard drive using PB/CC, organise it into variables/arrays in PB/CC, call the OpenCL runtime functions from PB/CC to initialise the GPU, create a queue, compile the GPU program and run it on the GPU cores, and then use PB/CC to pick up the crunched data.

              Last note: it is a lot of code just to set up the calculation on the GPU, and it is hell to debug. If it runs, it is extremely fast; if you run into a driver problem or some card-specific issue, you can spend half a month just debugging.

              So ... with OpenCL, performance comes at the price of gray hair.
              Check the complete example I linked on José's forum; it shows the simplest possible task of summing two arrays into a third. In pure PB it is 3 lines of code; with OpenCL, you decorate the whole thing with 100 lines of code.


              Petr
              [email protected]



              • #8
                I would start by doing this in PBCC and just writing the code. I think you will find it runs faster than the current DOS version.

                Use the built-in PB profiler. This will tell you where the application spends most of its time and where the performance bottleneck is.

                Improve performance in this area, repeat.
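
                If I remember the directives correctly (this is only a sketch; check the #TOOLS and PROFILE entries in the compiler help for your version), enabling the profiler looks something like this:
                Code:
                #COMPILE EXE
                #DIM ALL
                #TOOLS ON                     ' include profiling support in the EXE

                FUNCTION DoWork() AS LONG
                    LOCAL i AS LONG
                    LOCAL x AS DOUBLE
                    FOR i = 1 TO 1000000      ' stand-in for the real calculations
                        x = x + SQR(i)
                    NEXT
                END FUNCTION

                FUNCTION PBMAIN() AS LONG
                    CALL DoWork
                    PROFILE "profile.txt"     ' write call counts and timings per procedure
                END FUNCTION
                The report in profile.txt then shows which procedures are worth optimising first.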

                NVIDIA does simplify things with the CUDA Toolkit, but Petr is correct: there is still setup required to work with the graphics card. By the way, there is a debugger if you have two graphics cards.

                What type of calculations are you doing? I'd be interested in helping if you can post the current code or explain the project. I am sure others will add their ideas too.



                • #9
                  Peter,
                  Q#1: I used to use a RAM disk to speed up data read/write. If the cache will do this anyway is it correct to assume there is no value in doing this (approx 1.5 meg read size)?
                  The Windows file system will cache that for you anyway so you don't need a RAM disk.

                  Q#2: Is there any value in using an XP OS with a minimal install to minimise the potential for competing processes? It seems unnecessary.
                  No value, it's not necessary.


                  Q3: I had a read of Petr's paper and it seems that he used OpenCL. However if I just stick to PBCC and add a graphics card do I need to specifically instruct the graphics card to work or will it do this by default?
                  I'd forget the suggestion of using a graphics card. It's a very specialised area which will give you benefits in very restricted circumstances and requires you to rewrite your code. First get your program working in Windows then, if it doesn't perform well enough, look at other alternatives.

                  Q4: Assuming I could (theoretically) split the calculations into any number of separate processes, or leave them as one larger identical process, BUT I am confined to just one processor core: will it be faster to parallel process, or does Windows XP "multi-task" merely by moving sequentially from one task to another (albeit quickly)? If multi-tasking is just sequential rather than parallel then there seems to be no benefit.
                  You must split your process up into individual threads to take advantage of multiple cores. It's not that difficult.
                  Windows will then schedule the threads on the available CPU cores.
                  If you have only 1 thread then it will only ever use 1 core.

                  Search this place and you'll find plenty of examples of using multiple threads to speed things up such as:
                  http://www.powerbasic.com/support/pb...ad.php?t=44282
                  http://www.powerbasic.com/support/pb...=41843&page=10



                  Q5: XP is now replaced by Win7. Is this a better OS for multi-tasking?
                  No. Just go with whichever OS you're most comfortable with.

                  Paul.



                  • #10
                    Originally posted by peter edmiston View Post
                    ...structure the set-up for maximum speed....
                    Been thinking about this for a while. Since you didn't specify one way or the other: Avoid screen print/write routines like the plague.

                    That, in itself, will speed things up a minimum of 10x.

                    If you absolutely have to have screen updates, try updating every 10,000 (or so) calculations.
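
                    Something along these lines (a generic sketch; the SQR loop is just a placeholder for the real calculations):
                    Code:
                    #COMPILE EXE
                    #DIM ALL

                    FUNCTION PBMAIN() AS LONG
                        LOCAL i AS LONG
                        LOCAL x AS DOUBLE

                        FOR i = 1 TO 5000000
                            x = x + SQR(i)             ' stand-in for one calculation

                            ' Only touch the console once every 10,000 iterations.
                            IF (i MOD 10000) = 0 THEN
                                LOCATE 1, 1            ' overwrite the same line instead of scrolling
                                PRINT "Done"; i; "of 5000000";
                            END IF
                        NEXT

                        PRINT
                        PRINT "Finished, x ="; x
                    END FUNCTION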

                    But it all depends on what you are trying to accomplish.
                    There are no atheists in a fox hole or the morning of a math test.
                    If my flag offends you, I'll help you pack.



                    • #11
                      If you still want to use a ramdisk you might try,

                      http://memory.dataram.com/products-a...ftware/ramdisk

                      They have a free version that supports up to 4GB using NTFS and 2GB with FAT32. I have not tried it yet but in the forums users seem happy. Fairly good PDF docs.

                      One advantage of a memory drive is less wear and tear on the hard drive plus much faster I/O. They have another feature for 32-bit OSes, where memory installed above 4GB can be used as the memory drive.



                      • #12
                        Thanks again for the detailed replies. I gather there is no such thing as a free lunch so the graphics card is on the back burner. I just wish no one had told me what kind of speed is possible, if only I was clever...
                        At this stage the most practical thing is a 4 or 8 core PC with 4/8 threads running. That is probably enough to get started.
                        I agree that writing to the screen is deadly slow, and I will stick to integers (in the belief this is faster?).

                        I just tried an experiment with a simple counting loop. The experiment was run on a dual-core PC. I then copied the same program 5 times and ran the copies independently.
                        Running 1 copy: 9 seconds
                        Running 2 copies: 9 seconds each
                        Running 5 copies: ~24 seconds average time

                        Which seems to be almost exactly the same as

                        Core 1 running 2 programs, each taking 18 seconds
                        Core 2 running 3 programs, each taking 27 seconds
                        Average of (2 x 18 + 3 x 27)/5 = 23.4 seconds

                        So it seems there is no obvious advantage in splitting the program into more than the maximum number of cores available.

                        I can't help thinking there is a lot of unused CPU despite the 100% usage as shown on task manager.

                        Brian: The code is just simple maths used for a hydrological simulation model. There are daily weather inputs and then the program calculates plant growth and soil water and runoff for 120 years of data. Then it changes a variable and repeats the process. If the program is faster I can then use smaller steps and spend more time on optimisation. There are some thousands of lines of code but only a small part of this does the iterative processing.
                        However, if it takes 30 lines of GPU code for every one line of PBCC code, then it is going to be slow to take advantage of this opportunity.

                        It sounds like a job for Uncle Bob: Power GPU

                        Thanks again.



                        • #13
                          Peter,
                          a 4 or 8 core PC with 4/8 threads running
                          Be aware that Intel Hyper-Threading allows you to run twice the number of threads on a CPU but it is not twice as fast. Instead, each core shares its resources between the 2 threads, which can make better use of the CPU by getting 20%-30% more work done, but not 100% more.


                          writing to the screen is deadly slow
                          Not if you do it sensibly. Update the screen only when needed and only at a rate that's useful for the user and you'll not notice the extra time it takes but you may well benefit from the feedback it provides.


                          stick to integers (in the belief this is faster?)
                          Depends on the job you're doing.


                          So it seems there is no obvious advantage in splitting the program into more than the maximum number of cores available
                          Which seems to be almost exactly the same as
                          They aren't exactly the same! You complete faster with more threads in that case because your CPU is 100% utilised for 24s instead of 100% utilised for 18s and then only 50% utilised for the next 9s.
                          As usual, it depends on what you're doing but if each thread was to take 1s then you might not notice but if each thread was to take 12 hours then you'd be waiting hours longer than necessary for the final result.



                          I can't help thinking there is a lot of unused CPU despite the 100% usage as shown on task manager.
                          A lot of that unused CPU is utilised by Intel in the hyperthreading mentioned above.
                          The rest is up to you to use and you need to program with that in mind. The "100%" tells you that you had complete access to the CPU resources, it doesn't tell you how well you made use of them.


                          Paul.



                          • #14
                            FWIW, 1.5 Meg of data x 5M calculations = Not even breathing hard for modern computers.

                            DISCLAIMER: "how many runs" not shown. Also assumes one calculation = one arithmetic operation.
                            Michael Mattias
                            Tal Systems Inc. (retired)
                            Racine WI USA
                            [email protected]
                            http://www.talsystems.com



                            • #15
                              Originally posted by Michael Mattias View Post
                              FWIW, 1.5 Meg of data x 5M calculations = Not even breathing hard for modern computers.

                              DISCLAIMER: "how many runs" not shown. Also assumes one calculation = one arithmetic operation.
                              Agreed. PB, in its own advertising, gives an example of simple floating-point math where the current compilers are over 2000 times faster than their own DOS compilers on the same computer. It may be an extreme example but it shows the benefit of going to 32-bit.
                              OP, you mention accessing a 1.7MB file, which in today's terms is small. Yes, Windows will most likely keep it in the cache, but it's even faster, now that you have 2GB of flat memory addressing, to just load the whole file into your program in one go and use it from there.
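
                              In PB/CC that can be as simple as the sketch below (the file name is hypothetical, and it assumes the whole file fits comfortably in memory, which 1.7MB certainly does):
                              Code:
                              #COMPILE EXE
                              #DIM ALL

                              FUNCTION PBMAIN() AS LONG
                                  LOCAL sData AS STRING
                                  LOCAL ff AS LONG

                                  ff = FREEFILE
                                  OPEN "rundata.dat" FOR BINARY AS #ff   ' hypothetical input file
                                  GET$ #ff, LOF(#ff), sData              ' one read: whole file into a string
                                  CLOSE #ff

                                  ' From here, parse sData entirely in memory
                                  ' (PARSE$, VAL, CVD, pointers, etc.) with no further disk access.
                                  PRINT "Read"; LEN(sData); "bytes in one go"
                              END FUNCTION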



                              • #16
                                My point is more that we often see references here to "Huge" or "Gigantic" or "Really Big" tasks.... which may well have been accurate descriptions running under MS-DOS on a 6 MHz processor... but under Windows on a modern computer it's nothing and just not worth the effort of developing some special 'optimization scheme.'


                                MCM
                                Michael Mattias
                                Tal Systems Inc. (retired)
                                Racine WI USA
                                [email protected]
                                http://www.talsystems.com



                                • #17
                                  Originally posted by Michael Mattias View Post
                                  My point is more that we often see references here to "Huge" or "Gigantic" or "Really Big" tasks.... which may well have been accurate descriptions running under MS-DOS on a 6 MHz processor... but under Windows on a modern computer it's nothing and just not worth the effort of developing some special 'optimization scheme.'


                                  MCM
                                  Agreed.
                                  I have programmes that 15 years ago took 6 to 8 hours; today they take less than 2 minutes. They work on data files of 400MB+. The speed came from three simple things: hardware, 32-bit processing, and the only real optimisation, understanding how to use the amount of memory that can be addressed.



                                  • #18
                                    I just tried a small experiment, running exactly the same program on PB 3.5 (DOS) and PBCC ver 3.0. It takes 27 seconds with the DOS compiler and 9 seconds with PBCC.
                                    Please excuse my ignorance, but can I assume PBCC 3.0 is a 16-bit compiler and the latest version is 32-bit? Can I also assume that if I use a 32-bit version it will be faster than the 16-bit one?
                                    Thanks



                                    • #19
                                      Both PB/CC 3 and PB/CC 6 generate 32-bit executables, so that isn't the direct issue. However, overall, PB/CC 6 will offer considerably better performance and an improved feature set. Highly recommended.

                                      Bob Zale



                                      • #20
                                        FWIW, many years ago (~10 or so) I used a very early version of PBCC, on the Win 98 OS I believe, to access and manipulate 6 years of hourly meteorological data: items such as temperature, relative humidity, barometric pressure, wind speed, wind direction, etc. I forget how many items there were, but I correlated each item with my work history for the same period.
                                        This whole process, even with my badly written code, did not take an overly long time to produce results, although I can't remember just how long it took. I know I was expecting, when I started, that it may take a few hours to produce anything usable, but it didn't take hours.
                                        I had started this in PBDOS, but gave up and switched to PBCC because of the amount of data. Just what the speed improvement was I can't say but it was substantial even at that time.

                                        Interestingly, although not related to the issue, I had expected to find that my problems with my work attendance were related to barometric pressure, that turned out not to be the indicator. There was a 92.xx% correlation between an attendance issue and 8 hour periods of unchanged relative humidity. And a 91.xxx% correlation between my attendance issues and 8 hour periods of unchanging temperature. During that period, I had missed an average of 16 or 17 days a year, and was late for work at least twice that often.

                                        Also interesting, that was the only program I wrote between 1999 and 2007.
                                        Last edited by Rodney Hicks; 1 Sep 2011, 03:50 PM.
                                        Rod
                                        I want not 'not', not Knot, not Knott, not Nott, not knot, not naught, not nought, but aught.

