Moving large amounts of data with GET$ and PUT$. Any pitfalls to consider?

  • Moving large amounts of data with GET$ and PUT$. Any pitfalls to consider?

    I haven't written the code yet. I'm still designing it in my head. Before I devote too much time to one design, I'm looking for comments.

    I'll be breaking apart very large files into smaller, more manageable chunks. Later, I'll need to concatenate the chunks back together. The source files will be tens of gigabytes, which can choke some applications. I wanted to break them into something like 500 MB chunks.

    I know there are applications out there that can do this (7zip and Winzip easily make "volumes" -- their word for chunks). But I'm nevertheless going to write my own for a bunch of reasons that aren't relevant to the discussion. I want to do some processing on the chunks and have more programmatic control over the process.

    I realize the Windows command line can concatenate text files, but I'm working with binary files, not text. Part of me is reluctant to use ordinary concatenation because if one errant $CR or $LF gets stuck in there, it ruins a 30 GB file.

    So, here's where I'm heading...

    Code:
    LOCAL chunkstring AS STRING
    LOCAL n AS LONG
    
    OPEN "Recovered File.bin" FOR BINARY AS #1   ' this file will become a rebuilt concatenation of chunks
    
    FOR n = 1 TO 50
        OPEN "chunk" + DEC$(n) + ".bin" FOR BINARY AS #2   ' open the next chunk
        GET$ #2, LOF(#2), chunkstring       ' copy the entire file to a local string variable
        CLOSE #2
        PUT$ #1, chunkstring      ' accumulate the chunks
    NEXT n
    
    CLOSE #1
    My only concern is the GET$ and PUT$ commands and the strings they use. Like I said, these chunks could be 500 MB in size. Will PowerBASIC choke on a string that size? I understand that string sizes in PB are essentially limitless. But certainly, they weren't exactly designed to hold 500 MB. If used inside a tight loop, do I risk over-consuming disk or memory resources? Is there another way to do this that I should consider?
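
    One fallback I've sketched in case 500 MB strings do turn out to be a problem (untested, and the 4 MB buffer size is an arbitrary guess): stream each chunk through a small fixed-size working buffer so no single string ever has to hold a whole chunk.

    Code:
    LOCAL buffer AS STRING
    LOCAL bytesLeft AS QUAD
    LOCAL pieceSize, n AS LONG
    
    pieceSize = 4 * 1024 * 1024                           ' 4 MB working buffer (arbitrary)
    
    OPEN "Recovered File.bin" FOR BINARY AS #1             ' rebuilt concatenation of chunks
    FOR n = 1 TO 50
        OPEN "chunk" + DEC$(n) + ".bin" FOR BINARY AS #2
        bytesLeft = LOF(#2)
        DO WHILE bytesLeft > 0
            GET$ #2, MIN(pieceSize, bytesLeft), buffer     ' read at most 4 MB at a time
            PUT$ #1, buffer                                ' append it to the rebuilt file
            bytesLeft = bytesLeft - LEN(buffer)
        LOOP
        CLOSE #2
    NEXT n
    CLOSE #1
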
    Last edited by Christopher Becker; 22 Jan 2021, 04:01 AM. Reason: typo
    Christopher P. Becker
    signal engineer in the defense industry
    Abu Dhabi, United Arab Emirates

  • #2
    I've used GET and PUT with 500+ MB files quite often and never had any problem. I have often had to manipulate the data in various ways, including parsing, converting LF to CRLF, searching for a byte sequence, etc. PB has always handled it flawlessly.

    If the file is much over 600 MB, you can run into problems if you are modifying the string, because you may need it twice in memory during the process. You can increase the size you can handle to some extent by using #OPTION LARGEMEM32. (The limit depends on how much memory your application is using for other purposes.)
    #OPTION LARGEMEM32 - This allows your application to use more than the original limit of 2 Gigabytes of memory. Depending upon the version of Windows in use, and the installed memory, the exact increase may vary from computer to computer. In most cases, you will likely be limited to a total of approximately 3 Gigabytes.
    Note that GET$ takes a count parameter which is a LONG, but SEEK takes a QUAD parameter, which suggests you should be able to append A LOT of 500 MB chunks to a file.
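
    Just to illustrate (untested sketch, file names made up): #OPTION LARGEMEM32 goes at the top of the source, and because LOF/SEEK work with QUAD positions you can keep appending chunks after the target file has grown well past the 32-bit range:

    Code:
    #COMPILE EXE
    #DIM ALL
    #OPTION LARGEMEM32              ' lift the 2 GB per-process limit; actual headroom varies
    
    FUNCTION AppendChunk(strChunkFile AS STRING) AS LONG
       LOCAL strData AS STRING
       LOCAL qEnd AS QUAD
       OPEN strChunkFile FOR BINARY AS #2
       GET$ #2, LOF(#2), strData            ' whole chunk (up to ~500 MB) in one string
       CLOSE #2
       OPEN "Recovered File.bin" FOR BINARY AS #1
       qEnd = LOF(#1)                       ' QUAD, so the target can already be tens of GB
       SEEK #1, qEnd + 1                    ' default BASE is 1; position just past the last byte
       PUT$ #1, strData
       CLOSE #1
    END FUNCTION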

    You can actually work with chunks directly to and from the big file. Something along the lines of:
    '
    Code:
    FUNCTION DoIt() AS LONG
       LOCAL lngChunkSize, lngChunkNo  AS LONG
       LOCAL qPosn AS QUAD
       LOCAL strData AS STRING
       lngChunkSize = 50000000
       '...
       OPEN "MyBigFile" FOR BINARY AS #1 BASE = 0
    
       lngChunkNo = 4
       SEEK #1, lngChunkSize * (lngChunkNo - 1)
       GET$ #1, lngChunkSize, strData
    
       'Do something that doesn't change the size of strData
     
       SEEK #1, lngChunkSize * (lngChunkNo -1)
       PUT$ #1, strData
       '...
    END FUNCTION
    '
    You can also do a series of SEEK, GET$ on the main file and PUT$ to separate files to do the chunking if you really want to have chunks on disk.
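
    For example, the splitting side might look something like this (rough, untested sketch; the names and the 500 MB size are just placeholders):

    Code:
    FUNCTION SplitIt(strBigFile AS STRING) AS LONG
       LOCAL lngChunkSize, lngChunkNo, lngChunks, lngThisSize AS LONG
       LOCAL qPosn, qFileLen AS QUAD
       LOCAL strData AS STRING
    
       lngChunkSize = 500000000                                   ' ~500 MB per chunk
       OPEN strBigFile FOR BINARY AS #1 BASE = 0
       qFileLen = LOF(#1)
       lngChunks = (qFileLen + lngChunkSize - 1) \ lngChunkSize   ' round up
    
       FOR lngChunkNo = 1 TO lngChunks
          qPosn = CQUD(lngChunkSize) * (lngChunkNo - 1)           ' QUAD maths so it won't overflow
          lngThisSize = MIN(lngChunkSize, qFileLen - qPosn)       ' last chunk may be shorter
          SEEK #1, qPosn
          GET$ #1, lngThisSize, strData
          OPEN "chunk" + DEC$(lngChunkNo) + ".bin" FOR BINARY AS #2
          PUT$ #2, strData
          CLOSE #2
       NEXT lngChunkNo
    
       CLOSE #1
    END FUNCTION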


    • #3
      Had a decent-sized file sitting in my test directory from another application, so I knocked out a test. It handled over 800 MB with no problem.

      Testfile read in = 823 330 KB
      Test2 written = 4 939 980 KB

      (It did take around a minute to write the 5 GB on my slow laptop.)
      '
      Code:
      #COMPILE EXE
      #DIM ALL
      #DEBUG ERROR ON
      #DEBUG DISPLAY ON
      %UNICODE =1
      #INCLUDE ONCE "WIN32API.INC"
      
      FUNCTION PBMAIN() AS LONG
          LOCAL strT AS STRING
          LOCAL dTimer AS DOUBLE
          dTimer= TIMER
       OPEN "testfile" FOR BINARY AS #1
       GET$ #1, LOF(1), strT
       CLOSE #1
       OPEN "test2" FOR BINARY AS #1
       PUT$ #1, strT
       PUT$ #1, strT
       PUT$ #1, strT
       PUT$ #1, strT
       PUT$ #1, strT
       PUT$ #1, strT
       CLOSE #1
         dTimer = TIMER - dtimer
        ? "done in "  & STR$(dtimer) & " seconds)
      END FUNCTION
      '


      • #4
        Stuart, thanks for the posts. I figured no one was crazy enough to stuff hundreds of megabytes into a string like me. You've given me some hope that my design will work.
        but SEEK takes a QUAD parameter which suggests ...
        That's a pretty optimistic assumption! I like it! But what about when I need to SEEK the 9,223,372,036,854,775,808th byte in a file? What then?

        In all seriousness, I'm tempted to believe your warning about modifying such strings. I wouldn't try
        Code:
        s = LCASE$(STRREVERSE$(SHRINK$(s, " ")))
        with a 600 MB string. I can see how that would be pushing your luck (plus, Bob Zale would haunt me in my dreams for compiling that).
        Christopher P. Becker
        signal engineer in the defense industry
        Abu Dhabi, United Arab Emirates


        • #5
          > But what about when I need to SEEK the 9,223,372,036,854,775,808th byte in a file?

          Be careful! You're pushing the file system limit

          Some resources say that the maximum file size on NTFS is 16 TB, other sources say 256 TB, and Microsoft says 8 PB in certain circumstances. https://docs.microsoft.com/en-us/win.../ntfs-overview (I haven't worked out which number is best to use, but I don't think it will be an issue for me whichever one I choose.)

          But seriously, if you are transferring files using USB flash drives, the common default formatting of FAT32 can only handle 4 GB files. You need to format flash drives as ExFAT or NTFS for larger files. If you are sharing with other OSs (*nix, Android, Mac), ExFAT is the most commonly understood format without jumping through hoops. (I learnt that from bitter experience with some 5 GB files and a FAT32 flash drive.)


          • #6
            I'll be breaking apart very large files into smaller, more manageable chunks. Later, I'll need to concatenate the chunks back together. The source files will be tens of gigabytes, which can choke some applications. I wanted to break them into something like 500 MB chunks.

            Why?

            Why bother breaking them apart only to put them back together again? Seems you can work on a chunk... then work on another chunk... and another.

            I figured no one was crazy enough to stuff hundreds of megabytes into a string like me
            I think it's crazy - if there is no good reason to move the data from the disk file to a string. If you are worried about being able to work in RAM... you can always memory-map chunks of the file and work on them there... and there is (as you probably already guessed) a demo right here showing how to do that:

            Memory-mapped version of LINE INPUT 5-31-04

            Ignore the title of that thread... Post #3 shows how you can map part of any really large file. While memory mapping part of the file can be a little tricky, I have simplified it for you in Post #3.

            Sure, dynamic strings can be handy for some things, but do you really need them? What exactly are you doing with your data chunks? Can't you parse or whatever just using pointer variables and/or relative offsets from the start of the chunk or file?
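
            To be clear, what follows is not the demo from that thread, just a bare-bones illustration of the underlying Win32 pattern (an untested sketch: the file name, offset and window size are made up, and it assumes the standard declares in WIN32API.INC). The one tricky bit is that the view offset handed to MapViewOfFile must be a multiple of the system allocation granularity, normally 64 KB.

            Code:
            #COMPILE EXE
            #DIM ALL
            #INCLUDE ONCE "WIN32API.INC"
            
            FUNCTION PBMAIN() AS LONG
               LOCAL hFile, hMap AS DWORD
               LOCAL pView AS BYTE PTR
               LOCAL qOffset AS QUAD
               LOCAL lngViewSize AS LONG
            
               hFile = CreateFile("MyBigFile", %GENERIC_READ, %FILE_SHARE_READ, BYVAL %NULL, _
                                  %OPEN_EXISTING, %FILE_ATTRIBUTE_NORMAL, %NULL)
               IF hFile = %INVALID_HANDLE_VALUE THEN EXIT FUNCTION
            
               hMap = CreateFileMapping(hFile, BYVAL %NULL, %PAGE_READONLY, 0, 0, BYVAL %NULL)
               IF hMap = %NULL THEN CloseHandle hFile : EXIT FUNCTION
            
               qOffset     = 64 * 1024 * 1024      ' 64 MB into the file; a multiple of 64 KB
               lngViewSize = 4 * 1024 * 1024       ' map a 4 MB window, not the whole file
            
               pView = MapViewOfFile(hMap, %FILE_MAP_READ, _
                                     HI(DWORD, qOffset), LO(DWORD, qOffset), lngViewSize)
               IF pView THEN
                  ' @pView[0] .. @pView[lngViewSize - 1] is the file data; no GET$, no copy
                  ? "First byte of the mapped window:" & STR$(@pView[0])
                  UnmapViewOfFile BYVAL pView      ' drop BYVAL if your declare takes BYVAL DWORD
               END IF
            
               CloseHandle hMap
               CloseHandle hFile
            END FUNCTION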

            I'm glad you are still thinking about the design... BEFORE you commit to a particular approach. Don't see that too much.

            Those of us who programmed in the MS-DOS days with the 640K limitation often had to handle "huge" files, up to a couple of MEGAbytes... which by today's standards really are not huge at all. BUT... we developed a comfort for doing things "another way"... because there was no option to handle 5 - or in your case 500 - megabytes of data all at once.

            MCM
            Michael Mattias
            Tal Systems (retired)
            Port Washington WI USA
            [email protected]
            http://www.talsystems.com
