How to write an incremental backup application?

  • How to write an incremental backup application?

    I think I could write an incremental backup program for a file of fixed-length records, if I knew or could somehow find out the record length, but how to make an incremental backup of a file of unknown structure is - er - a challenge! Any ideas out there?

  • #2
    What about storing the last length backed up and copying from that point to the end? That would not require knowledge of the record length.

    Well, of course, that wouldn't help if an earlier part is changed. Hmm.

    Aren't most incremental archives based on the date of the file, with the entire file backed up rather than individual records within it?
    Last edited by Fred Buffington; 3 Apr 2008, 03:57 PM.
    Client Writeup for the CPA

    buffs.proboards2.com

    Links Page



    • #3
      ???

      "Incremental backup" usually refers to selecting a file for backup only if its archive bit is set, then resetting that bit after copying the file.

      Contrast with "differential backup", in which you select a file for backup only if its archive bit is set, but you don't reset that archive bit after copying.

      With incremental backup of a group of files, to restore you need the last full backup plus ALL backup sets since the last full backup; with differential backup, you need the last full backup plus ONLY the most recent differential backup.
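
      To make that concrete, here is a minimal sketch of the archive-bit test in PowerBASIC, using GETATTR/SETATTR; the file names and function names are made up for illustration, and 32 (&H20) is the Windows archive attribute.

      Code:
      #COMPILE EXE
      #DIM ALL
      
      ' Select a file for backup only if its archive bit (attribute value 32) is set.
      FUNCTION NeedsBackup(sFile AS STRING) AS LONG
          FUNCTION = (GETATTR(sFile) AND 32) <> 0
      END FUNCTION
      
      ' Clear the archive bit after a successful copy (incremental backup only;
      ' a differential backup would leave the bit alone).
      SUB ClearArchiveBit(sFile AS STRING)
          SETATTR sFile, GETATTR(sFile) AND (NOT 32)
      END SUB
      
      FUNCTION PBMAIN() AS LONG
          IF NeedsBackup("c:\data\test.dat") THEN                ' illustrative paths
              FILECOPY "c:\data\test.dat", "d:\backup\test.dat"
              ClearArchiveBit "c:\data\test.dat"
          END IF
      END FUNCTION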

      Are you trying to backup pieces of a single file? That's tricky enough when you KNOW the structure; it's silly to even think you could selectively backup pieces of it if you don't know the structure.

      Then again, if you KNOW the file is ONLY appended to, all ya gotta do is keep track of the filesize at the last backup and copy only the rest of it at backup time.
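
      A minimal sketch of that append-only case, assuming the size at the previous backup has been persisted somewhere (the names and paths here are illustrative):

      Code:
      #COMPILE EXE
      #DIM ALL
      
      ' Return everything past lLastSize, the size recorded at the previous backup.
      ' Assumes the file is only ever appended to.
      FUNCTION GetNewTail(sFile AS STRING, BYVAL lLastSize AS LONG) AS STRING
          LOCAL hF AS LONG, sTail AS STRING
          hF = FREEFILE
          OPEN sFile FOR BINARY AS #hF
          IF LOF(hF) > lLastSize THEN
              SEEK #hF, lLastSize + 1                 ' binary file positions are 1-based
              GET$ #hF, LOF(hF) - lLastSize, sTail
          END IF
          CLOSE #hF
          FUNCTION = sTail
      END FUNCTION
      
      FUNCTION PBMAIN() AS LONG
          LOCAL sNew AS STRING
          sNew = GetNewTail("c:\data\log.dat", 1000000)   ' illustrative name and size
          ' sNew would be written to the backup set along with the new file size
      END FUNCTION
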
      Michael Mattias
      Tal Systems (retired)
      Port Washington WI USA
      [email protected]
      http://www.talsystems.com



      • #4
        Hi Chris,

        I have had a lot of experience in the area of backup. I remember when our first computer system took over one hour to back up less than 700 KB worth of data onto a 700 KB floppy diskette. I was able to cut the time to 12 minutes with file compression.

        I complained to the people who wrote the software about something they were doing: at the end of the month, to clear out the data in a file, rather than erasing the file and creating a new one, they would just change the header in the file. They said I did not understand what goes on inside the computer (defragmentation), and I told them I understood time, and it was one hour a day, 220 days a year: 1 hour * 220 days / 40 hours = 5.5 work weeks, based on a 40-hour week. So out of each year, our computers were not being used for 5.5 weeks of working time.

        Now, on a different system with many, many more files, our backup window (during which the server is down) is under 1 minute; I offload the data to a dedicated backup computer that does the actual backup.

        I am a programmer and a user, so any saved time is mine and any lost data is mine; I pay the cost.

        When backing up data, I have always thought it was the job of whoever wrote the program to provide an efficient way to save changed and added data (the transactions that have taken place), and also the programmer's job to make sure that when data is corrupted the user is warned or some action is taken.

        How to back up depends on many factors:
        how much a file changes in one period
        how big the files to back up are
        how many files there are to back up
        how the indexes to database files are created, whether they can be recreated, and whether the indexes are in separate files from the database file.

        The biggest question you have to ask yourself is: can you be 100% sure you can recover from your backup?

        One of the things I did was write a program to create a database that duplicated the main database as far as structure goes; then, every day, a program ran to transfer new items from the main database to this second database and to update any records with changes. By doing this I was able to purge the main database once a year and back up only the main database, rather than an ever-growing database full of stale records. Because the second database had the same format and held all the historical data, we were able to access that data at any time, and the main database was faster to run programs on because it was always small, with only active records. I purged the main database at the end of February or March each year. Now that we have changed software, our single database holds all records back to 1983, because the programmer who wrote it does not care about time or efficiency when it comes to stale data, and one never knows when you will need which data.

        So, with the above said, file compression and creative backup routines and strategies come into play.

        I have no idea what or who needs the backup, but if you are intent on backing up file changes only, I would look into keeping a duplicate of the most recently backed-up files to compare against, and then comparing the live files against the files that were backed up. Basically, after any backup, just copy all files with changes to a separate place for the next round of comparisons. All of this can go very wrong if it is not monitored in some way. The dates on the computers will also have to be accurate when comparing these files, I would guess.

        If the data files have shrunk from what was previously backed up, back up the whole file. If the files are small and have changed, back up the whole file. If a file is larger than what you would consider small and is equal to or larger than the backed-up copy, compare it by blocks and back up only the changed blocks.
        To speed up block comparing, you might also want to use some kind of CRC checking and store that in a file along with the block of data backed up, possibly in another separate file for checking.
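
        A minimal sketch of that per-block check idea, using just a simple additive checksum for illustration (a real CRC32 or MD5 would be far stronger); the file name and block size below are arbitrary:

        Code:
        #COMPILE EXE
        #DIM ALL
        
        ' Weak additive checksum over one block - illustration only.
        FUNCTION BlockChecksum(sBlock AS STRING) AS DWORD
            LOCAL i AS LONG, dwSum AS DWORD
            FOR i = 1 TO LEN(sBlock)
                dwSum = dwSum + ASC(sBlock, i)      ' ASC with a position reads byte i
            NEXT
            FUNCTION = dwSum
        END FUNCTION
        
        FUNCTION PBMAIN() AS LONG
            LOCAL hF AS LONG, lBlock AS LONG, sBlock AS STRING, dwSum AS DWORD
            hF = FREEFILE
            OPEN "c:\data\big.dat" FOR BINARY AS #hF    ' illustrative file name
            DO WHILE SEEK(hF) <= LOF(hF)
                INCR lBlock
                GET$ #hF, 65536, sBlock                 ' 64 KB blocks, arbitrary size
                dwSum = BlockChecksum(sBlock)
                ' compare dwSum against the value stored for block lBlock at the
                ' last backup; back up the block only if it differs
            LOOP
            CLOSE #hF
        END FUNCTION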

        There was some fast code written in a thread on doing file comparisons that would be beneficial in comparing blocks of files.

        A word of advice: most users are not going to verify that successful backups are actually taking place. That validation needs to be recorded somewhere, preferably in a log file for later review.

        Backing up all comes down to four things: time, accuracy, recoverability, and ease of use. Backup programs, and the time spent on such endeavors, never get to be the prince and have little value - until loss of data occurs.

        paul
        p purvis



        • #5
          When doing incremental backups, considering today's storage options, I would only feel comfortable if full copies of the backups were retained on some media other than just the backup media.
          The full backup copies could be cycled, with older backups deleted to keep that media from filling up. That would be the first place to restore data from, and the media where the data is permanently backed up should be the last resort for data-loss retrieval.
          p purvis



          • #6
            > To speed up block comparing, you might also want to use some kind of CRC checking and store that in a file along with the block of data backed up, possibly in another separate file for checking.

            > There was some fast code written in a thread on doing file comparisons that would be beneficial in comparing blocks of files.

            I wouldn't use a CRC; I'd go with a hashing algorithm instead.

            With the fast code that Paul refers to, I get about 900 ms to compare two 10 MB blocks with one on an internal SATA drive and the other on an external USB 2.0 drive. With both on the SATA drive I get 540 ms. Comparing a 10 MB block, on the SATA drive, against an MD5 hash comes in at 400 ms.



            • #7
              Originally posted by Michael Mattias
              it's silly to even think you could selectively backup pieces of it if you don't know the structure.
              MCM,

              Let's use the term "differential" instead of "incremental", as you suggested.

              Starting off with "why do a differential backup at all?": the reason is to reduce the volume of stored data required to recreate the file during the "secured period", which is the span of time containing the points at which the data can be recreated. So the application is relevant to backing up large files only; "large" here is relative to the storage medium. ISTM that what we are looking for is the largest possible chunks of matching data between the stored data (base plus deltas) and the live data, irrespective of the position occupied in the file, plus a way of repositioning them when the data are retrieved.

              OK, there are two slippery terms in that assertion:

              What are the "largest chunks of matching data"? How do we determine what these chunks are, and if the method is iterative, when do we stop trying?

              What is "a way of repositioning", and how is it accomplished? We also need it to handle "base plus deltas", of course.

              Yours, daft as a brush....

              CJH



              • #8
                Given you don't know the file structure (your own words), comparing blocks of data at given offsets is silly, because you don't know that the application writing to that file isn't moving data around in the file.

                E.g., the file may be indexed or keyed; inserting a record may place the new record somewhere in the middle of the physical file... or changes may occur "in place", meaning "some interior portion (a logical record)" changes.

                I am thinking you simply "must" know something about the file structure to even consider trying what I think you are considering trying.

                I have several large files I have to back up. You know what I use? A tape drive (about $400), and (are you ready for this?) Microsoft Backup.

                You can back up a multi-GB file fairly quickly (ten minutes?) and with tape you don't have to worry about running out of media, i.e., no worries about "disk full, please insert a new disk."

                For that matter, you can get "size large" removable hard drives at a pretty reasonable cost and back up to those.

                My view is that a couple hundred bucks of hardware is going to be a much better investment than the programming time + effort (= money) needed to create some kind of "find changes and back up only those changes" application.


                MCM

                PS:
                Not included: programming time and effort to put Humpty-Dumpty back together again if a 'restore' is required!
                Last edited by Michael Mattias; 4 Apr 2008, 07:45 AM.
                Michael Mattias
                Tal Systems (retired)
                Port Washington WI USA
                [email protected]
                http://www.talsystems.com



                • #9
                  I have several large files I have to back up. You know what I use? A tape drive (about $400), and (are you ready for this?) Microsoft Backup
                  T A P E ? ? ?

                  Say it ain't so, please!

                  All these years and people still think tape is reliable? Wow. Hey, I got this bridge you just have to see!!!

                  I hope you do FULL RESTORES from those flimsy, error-prone, heat-prone, thin rolls of plastic your valuable data is "stored" on at least once a month... you might be surprised to find out what's NOT on them.

                  (Google disk-to-disk backup solutions. You can get TERABYTE external drives for under $500. Heck, a decent 500 GB external drive will run you less than $100.00, and think of all the good nights of sleep you can get again!)
                  Software makes Hardware Happen



                  • #10
                    Chris, I should first mention that I saw a 500 GB HD a couple of weeks ago for $79. Okay, that said, there is a program I have used for probably a decade now called DLSuperC32, based on the IBM program SuperC, that compares two text files with records of almost any length. It's advertised as Industrial Strength and I have found it to be pretty much that. It's $19.95 shareware.

                    Its output is a text file with the line numbers of all the changes (and what kind of changes they are), so you could conceivably write a small utility to back up just those lines after any given DLSuperC run.



                    • #11
                      Originally posted by Michael Mattias
                      ... silly, because you don't know that the application writing to that file isn't moving data around in the file.
                      1. Comparing blocks of data at given offsets is silly. There is no point in backing up just one file of the target system, unless it is a single file such as an SQLite database. You wouldn't run the backup while the application was running, so at the time of the backup you would be backing up all the application's files, though not necessarily using the same technique, and nothing would be changing.

                      2. Going back to the data comparison. Clearly, there is an overhead in the chunk descriptor; it would make no sense at all to start with a 1-byte chunk size. Maybe the best way is to start with the whole file, compute a checksum and see if it is 100% unchanged. If not, then do binary chops on file A and, using a plausible chunk size, look up file B to find matches with said chunk at any location. Having found one (process all of them in this way), attempt to grow that chunk in both directions until it becomes a non-matcher. If it bumps into another unchanged chunk, coalesce the chunks. This type of approach might go faster if streamed, and would benefit from intelligent optimisation at Assembler level - see something by Charles Pegge about parsing the Bible in 0.7 secs, forget where I saw it, ah yes.
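
                      To make the "grow that chunk in both directions" step concrete, here is a minimal sketch (GrowMatch is a made-up helper; it assumes both files have already been loaded into strings, as in the code further down this thread):

                      Code:
                      #COMPILE EXE
                      #DIM ALL
                      
                      ' Given a known match of lLen bytes at lPosLast in sLast and lPosCurr in sCurr,
                      ' extend it byte by byte in both directions until the data no longer agree.
                      ' lPosLast and lPosCurr are passed BYREF so the caller sees the new start positions.
                      FUNCTION GrowMatch(sLast AS STRING, sCurr AS STRING, _
                                         lPosLast AS LONG, lPosCurr AS LONG, BYVAL lLen AS LONG) AS LONG
                          DO WHILE lPosLast > 1 AND lPosCurr > 1                                ' grow to the left
                              IF MID$(sLast, lPosLast - 1, 1) <> MID$(sCurr, lPosCurr - 1, 1) THEN EXIT DO
                              DECR lPosLast : DECR lPosCurr : INCR lLen
                          LOOP
                          DO WHILE lPosLast + lLen <= LEN(sLast) AND lPosCurr + lLen <= LEN(sCurr)   ' grow to the right
                              IF MID$(sLast, lPosLast + lLen, 1) <> MID$(sCurr, lPosCurr + lLen, 1) THEN EXIT DO
                              INCR lLen
                          LOOP
                          FUNCTION = lLen          ' coalescing with a neighbouring chunk is a separate step
                      END FUNCTION
                      
                      FUNCTION PBMAIN() AS LONG
                          LOCAL lPosLast AS LONG, lPosCurr AS LONG, lLen AS LONG
                          lPosLast = 4 : lPosCurr = 6 : lLen = 3              ' "ABC" matches in both strings
                          lLen = GrowMatch("xxxABCDEFxx", "yyyyyABCDEFyy", lPosLast, lPosCurr, lLen)
                          MSGBOX "match grew to" + STR$(lLen) + " bytes"      ' 6 bytes: "ABCDEF"
                      END FUNCTION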



                      • #12
                        Originally posted by John Gleason
                        ...that compares two text files with records of almost any length.
                        John, thanks for the thought, but I'm looking at "binary" files mostly.



                        • #13
                          So let's say you start looking, and byte #2 of the file is different from what it was.

                          What do you do now? Backup everything after byte #1? Gee, that saved a lot.

                          Many files of "unknown" structure have a file header in the first "n" bytes... which changes based on the number of records stored, the last-update timestamp, etc... meaning you might find a change in nearly every "generation" of that file, while actual "data changes" may be interspersed within the file or consist of new data added to the end of the file... or both. (Not that a changed header might not be very important!)

                          Worse, what if the software has done a "reorganization" of the file to, say, remove unused records? Now great swaths of the file are 'moved' but unchanged, and you can't know that unless you somehow perform synchronization comparisons at various offsets.

                          Absent some idea of the file structure this whole project is, IMNSHO, a stone loser.

                          Just check the archive bit; if it's set, back up the file; if it's not, don't.
                          Michael Mattias
                          Tal Systems (retired)
                          Port Washington WI USA
                          [email protected]
                          http://www.talsystems.com



                          • #14
                            Originally posted by Michael Mattias
                            So let's say...
                            If your post is in response to my latest (#11), please re-read.



                            • #15
                              How big is this file?
                              Michael Mattias
                              Tal Systems (retired)
                              Port Washington WI USA
                              [email protected]
                              http://www.talsystems.com



                              • #16
                                Originally posted by Michael Mattias
                                How big is this file?
                                I'm not trying to solve a single real-life problem, although it was one that got me thinking about it. So it's not "a file", it's a "class of files" where file storage and transmission considerations rule out, for example, taking compressed copies of entire zipped files. It could be, but doesn't have to be, on a standalone or networked desktop PC. In this instance, my primary focus is on how to design the application, rather than on how to avoid using it!



                                • #17
                                  So, the files have binary data, but are they organized in a record-like structure? That is, when a file is modified, are the changes consistent with changes one might find in a "standard record format" file but with your own unique format? Are you limited to, for example, adding a record, deleting a record, and changing a record (length and content)?

                                  Or, can the changes simply be any bytes, any count, any location?



                                  • #18
                                    >Or, can the changes simply be any bytes, any count, any location?

                                    That has to be a "YES" given the stated conditions... that the file structure is "unknown."

                                    "IF" you know something about the target file(s), it may be practical to develop a backup scheme based on pieces of the file... but in the truly "unknown" circumstance I stand by my use of the adjective "silly."

                                    Then again, Miguel de Cervantes gave us a colloquialism still with us some four hundred years after the fact merely by writing about "tilting at windmills."

                                    MCM
                                    Michael Mattias
                                    Tal Systems (retired)
                                    Port Washington WI USA
                                    [email protected]
                                    http://www.talsystems.com



                                    • #19
                                      Originally posted by John Gleason
                                      Or, can the changes simply be any bytes, any count, any location?
                                      Yes. Think of each file as a string or an array of bytes.



                                      • #20
                                        This reports a list of file differences between two files without making assumptions about file structure. I've only tested it gently, because I don't want it to go wrong.

                                        Code:
                                        #PBFORMS CREATED V1.51
                                        #COMPILE EXE
                                        #DIM ALL
                                        
                                        #PBFORMS BEGIN INCLUDES 
                                        #IF NOT %DEF(%WINAPI)
                                            #INCLUDE "WIN32API.INC"
                                        #ENDIF
                                        #PBFORMS END INCLUDES
                                        
                                        #PBFORMS BEGIN CONSTANTS 
                                        %IDD_DIALOG1     =  101
                                        %IDC_LABEL1      = 1001
                                        %IDC_LABEL2      = 1002
                                        %IDC_LISTBOX1    = 1003
                                        %IDC_CURRFILE_TB = 1005
                                        %IDC_lastfile_tb = 1004
                                        %IDC_BUTTON1     = 1007
                                        %IDC_LABEL4      = 1008
                                        %IDC_TEXTBOX1    = 1009
                                        #PBFORMS END CONSTANTS
                                        
                                        #PBFORMS DECLARATIONS
                                        
                                        %minchunksize = 128
                                        %chunkincrement = 100
                                        GLOBAL gsLast, gsCurr AS STRING
                                        GLOBAL ghDlg AS DWORD
                                        
                                        '-----------------------------------------------------------------------------------
                                        ' strings are different on entry. Binary chop: test whether the chunk of gsLast
                                        ' starting at the midpoint occurs anywhere in gsCurr (INSTR); if it does, report
                                        ' it and refine the search in the lower half, otherwise search the upper half.
                                        FUNCTION compare ( llow AS LONG, lhigh AS LONG ) AS LONG
                                            LOCAL lmid, n AS LONG
                                        
                                            IF lhigh < llow + %minchunksize THEN
                                                FUNCTION = -1
                                                EXIT FUNCTION ' not found
                                            END IF
                                            lmid = (llow + lhigh) / 2
                                            n = INSTR(gsCurr, MID$(gsLast, lmid, lmid))
                                            IF n THEN ' match in upper half of subtree?
                                                LISTBOX ADD ghDlg, %IDC_LISTBOX1, _
                                                      "found" + STR$(lmid) + " bytes at" + STR$(n) + " in current file matching last known file at" + STR$(lmid)
                                                compare( llow, lmid-1)  ' yes, so search lower part of tree
                                            ELSE
                                                compare( lmid + 1, lhigh) ' no, refine search in upper part of tree
                                            END IF
                                            EXIT FUNCTION
                                        END FUNCTION
                                        
                                        '-----------------------------------------------------------------------------------
                                        CALLBACK FUNCTION ShowDIALOG1Proc()
                                            STATIC sLastFile, sCurrFile AS STRING
                                        
                                            SELECT CASE AS LONG CBMSG
                                                CASE %WM_INITDIALOG
                                                    ghDlg = CBHNDL
                                        
                                                CASE %WM_COMMAND
                                                    SELECT CASE AS LONG CBCTL
                                                        CASE %IDC_BUTTON1
                                                            IF CBCTLMSG = %BN_CLICKED OR CBCTLMSG = 1 THEN
                                                                LISTBOX RESET CBHNDL, %IDC_LISTBOX1
                                                                CONTROL GET TEXT CBHNDL, %IDC_LASTFILE_TB TO sLastFile
                                                                CONTROL GET TEXT CBHNDL, %IDC_CURRFILE_TB TO sCurrFile
                                                                OPEN sLastfile FOR BINARY AS #1
                                                                gsLast = STRING$(LOF(1),0)
                                                                GET #1, 1, gsLast
                                                                CLOSE #1
                                                                OPEN sCurrfile FOR BINARY AS #1
                                                                gsCurr = STRING$(LOF(1),0)
                                                                GET #1, 1, gsCurr
                                                                CLOSE #1
                                                                IF gsLast = gsCurr THEN
                                                                    LISTBOX ADD CBHNDL, %IDC_LISTBOX1, "files are identical"
                                                                ELSE
                                                                    compare(1, LEN(gsCurr))
                                                                END IF
                                                            END IF
                                        
                                                    END SELECT
                                            END SELECT
                                        END FUNCTION
                                        
                                        '----------------------------------------------------------------------
                                        FUNCTION ShowDIALOG1(BYVAL hParent AS DWORD) AS LONG
                                            LOCAL lRslt AS LONG
                                        
                                        #PBFORMS BEGIN DIALOG %IDD_DIALOG1->->
                                            LOCAL hDlg  AS DWORD
                                        
                                            DIALOG NEW hParent, "file differences by Sancho Panza's ***", 33, 81, 471, 121, %WS_POPUP OR %WS_BORDER OR %WS_DLGFRAME OR _
                                                %WS_SYSMENU OR %WS_CLIPSIBLINGS OR %WS_VISIBLE OR %DS_MODALFRAME OR %DS_3DLOOK OR %DS_NOFAILCREATE OR %DS_SETFONT, _
                                                %WS_EX_CONTROLPARENT OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR %WS_EX_RIGHTSCROLLBAR, TO hDlg
                                            CONTROL ADD LABEL,   hDlg, %IDC_LABEL1, "name of last known file", 5, 5, 85, 15
                                            CONTROL ADD LABEL,   hDlg, %IDC_LABEL2, "name of current file", 5, 40, 85, 15
                                            CONTROL ADD LISTBOX, hDlg, %IDC_LISTBOX1, , 120, 15, 345, 100, %WS_CHILD OR %WS_VISIBLE OR %WS_TABSTOP OR %WS_VSCROLL, _
                                                %WS_EX_CLIENTEDGE OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR %WS_EX_RIGHTSCROLLBAR
                                            CONTROL ADD TEXTBOX, hDlg, %IDC_lastfile_tb, "last known file", 5, 20, 85, 15
                                            CONTROL ADD TEXTBOX, hDlg, %IDC_CURRFILE_TB, "current file", 5, 55, 85, 15
                                            CONTROL ADD BUTTON,  hDlg, %IDC_BUTTON1, "start", 70, 105, 45, 15
                                            CONTROL ADD LABEL,   hDlg, %IDC_LABEL4, "minimum chunk size", 5, 75, 85, 15
                                            CONTROL ADD TEXTBOX, hDlg, %IDC_TEXTBOX1, "", 5, 90, 25, 15, %WS_CHILD OR %WS_VISIBLE OR %WS_TABSTOP OR %ES_LEFT OR _
                                                %ES_AUTOHSCROLL OR %ES_NUMBER, %WS_EX_CLIENTEDGE OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR %WS_EX_RIGHTSCROLLBAR
                                        #PBFORMS END DIALOG
                                        
                                            DIALOG SHOW MODAL hDlg, CALL ShowDIALOG1Proc TO lRslt
                                        
                                        #PBFORMS BEGIN CLEANUP %IDD_DIALOG1
                                        #PBFORMS END CLEANUP
                                        
                                            FUNCTION = lRslt
                                        END FUNCTION
                                        
                                        '==================================================================================================================================
                                        FUNCTION PBMAIN()
                                            ShowDIALOG1 %HWND_DESKTOP
                                        END FUNCTION

