how to write an incremental backup application?
I think I could write an incremental backup program for a file of fixed-length records, if I knew or could somehow find out the record length. But how to make an incremental backup of a file of unknown structure is - er - a challenge! Any ideas out there?
-
What about storing the last length backed up, then backing up from that point to the end? That would not require knowledge of the record length.
Well, of course, that wouldn't help if an earlier part is changed. hmm.
Aren't most incremental archives based on the date of the file, with the entire file backed up rather than records within it?
Last edited by Fred Buffington; 3 Apr 2008, 04:57 PM.
-
???
"Incremental backup" usually refers to selecting a file for backup only if its archive bit is set, then resetting that bit after copying the file.
Contrast with "differential backup", in which you select a file for backup only if its archive bit is set, but you don't reset that archive bit after copying.
With incremental backup of a group of files, to restore you need the last full backup plus ALL backup sets since the last full backup; with differential backup, you need the last full backup plus ONLY the most recent differential backup.
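If it helps to see the shape of that in code, here's a PB sketch of the archive-bit test. The names (BackupIfChanged, %ARCHIVE_BIT) are made up for the example, 32 being the standard FILE_ATTRIBUTE_ARCHIVE value; treat it as a sketch, not tested production code.
Code:
%ARCHIVE_BIT = 32   ' FILE_ATTRIBUTE_ARCHIVE

' Copy a file only if its archive bit is set. Incremental: clear the bit
' after copying. Differential: leave it set so the next run copies it again.
SUB BackupIfChanged (sFile AS STRING, sDest AS STRING, BYVAL bIncremental AS LONG)
    LOCAL attr AS LONG
    attr = GETATTR(sFile)
    IF (attr AND %ARCHIVE_BIT) THEN
        FILECOPY sFile, sDest
        IF bIncremental THEN SETATTR sFile, attr AND NOT %ARCHIVE_BIT
    END IF
END SUB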
Are you trying to back up pieces of a single file? That's tricky enough when you KNOW the structure; it's silly to even think you could selectively back up pieces of it if you don't know the structure.
Then again, if you KNOW the file is ONLY appended to, all ya gotta do is keep track of the file size at the last backup and copy only the rest of it at backup time; a sketch of that follows below.
Michael Mattias
Tal Systems (retired)
Port Washington WI USA
[email protected]
http://www.talsystems.com
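Here is a rough sketch of that append-only case; the name BackupTail and the idea of the caller persisting the returned size between runs are just for illustration, and it assumes the file only ever grows.
Code:
' Copy only the bytes appended since the last run. lLastSize is the file
' size recorded at the previous backup; the new size is returned so the
' caller can store it for next time.
FUNCTION BackupTail (sFile AS STRING, sDelta AS STRING, BYVAL lLastSize AS QUAD) AS QUAD
    LOCAL sBuf AS STRING
    OPEN sFile FOR BINARY AS #1
    IF LOF(1) > lLastSize THEN
        sBuf = STRING$(LOF(1) - lLastSize, 0)
        GET #1, lLastSize + 1, sBuf        ' read just the appended bytes
        OPEN sDelta FOR BINARY AS #2
        PUT #2, LOF(2) + 1, sBuf           ' append them to the delta file
        CLOSE #2
    END IF
    FUNCTION = LOF(1)
    CLOSE #1
END FUNCTION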
-
Hi Chris,
I have had a lot of experience in the area of backup. I remember when our first computer system took over one hour to back up less than 700 KB worth of data onto a 700 KB floppy diskette. I was able to reduce that time to 12 minutes with file compression.
I complained to the people who wrote the software about something they were doing: at the end of the month, to clear out the data in the file, rather than erasing the file and creating a new one, they would just change the header in the file. They said I did not understand what goes on inside the computer (defragmentation), and I told them I understood time, and it was one hour a day, 220 days a year. Let's see: 1 hour x 220 days = 220 hours, and 220 hours / 40 hours per week = 5.5 work weeks, so out of a year our computers were not being used for 5.5 weeks of working time.
Now, on a different system with many, many more files, our backup window (where the server is down) is under 1 minute; I offload the data to a dedicated backup computer that does the actual backup.
I am a programmer and a user, so any saved time is mine and any lost data is mine; I pay the cost.
When backing up data, I have always thought it was the job of whoever wrote the program to create an efficient way to save changed and added data (the transactions that have taken place), and also the programmer's job to make sure the user is warned, or some action is taken, when data is corrupted.
In backing up, it depends on many factors:
how much a file changes in one period
how big the files to back up are
how many files there are to back up
how the indexes to the database files are created and whether they can be recreated; are the indexes in files separate from the database file?
The biggest question you have to ask yourself is: can you be 100% sure you can recover from your backup?
One of the things I did was write a program to create a database that duplicated the main database as far as structure goes; then every day a program ran to transfer new items in the main database to this second database and to update any records with changes. By doing this I was able to purge the main database once a year and back up only the main database, rather than an ever-growing database full of stale records. Because the second database had the same format and all the historical data, we were able to access that data all the time, and it made the main database faster to run programs on because it was always small, with only active records. I purged the main database at the end of February or March each year. Now that we have changed software, our single database has all records back to 1983, because the programmer who made it does not care about time or efficiency when it comes to stale data, and one never knows when you will need to get to what data.
So, with the above said, compression of files and creative backup routines and strategies come into play.
I have no idea of what or who is needing the backup, but if you are intent on backing up file changes only, I would look into keeping a duplicate of the most recently backed-up files to compare against, then comparing the live files against the files that were backed up. Basically, after any backup, just copy all files with changes to a separate place for the next time you do comparisons. All this can go very wrong if it is not monitored in some way, and I would guess the dates on the computers will need to be accurate for comparing these files, too.
If a data file has shrunk from what was previously backed up, back up the whole file. If a file is small and has changed, back up the whole file. If the file is larger than what you would consider small and is equal to or larger than the backed-up file, compare it block by block and back up the changed blocks.
To speed up block comparing, you might also want to use some kind of CRC checking and store that along with the block of data backed up, possibly in a separate file used just for checking.
There was some fast code written in a thread on doing file comparisons that would be helpful in comparing blocks of a file.
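To show the shape of that per-block checking, here is a toy sketch; the block size, the rotate-and-XOR checksum, and all the names are placeholders (a real version would use CRC-32 or a proper hash, as discussed below), and the LONG offsets limit it to files under 2 GB.
Code:
%BLOCKSIZE = 65536   ' arbitrary block size for this sketch

' Toy per-block checksum (rotate-and-XOR); a stand-in, NOT a real CRC.
FUNCTION BlockSum (sBlock AS STRING) AS DWORD
    LOCAL i AS LONG, sum AS DWORD
    FOR i = 1 TO LEN(sBlock)
        ROTATE LEFT sum, 5
        sum = sum XOR ASC(sBlock, i)
    NEXT
    FUNCTION = sum
END FUNCTION

' Return the numbers of the blocks whose checksum differs from the one
' stored at the last backup; those are the blocks to back up.
FUNCTION ChangedBlockList (sFile AS STRING, dwOldSum() AS DWORD) AS STRING
    LOCAL sBuf, sOut AS STRING
    LOCAL blk, nBlocks AS LONG
    OPEN sFile FOR BINARY AS #1
    nBlocks = (LOF(1) + %BLOCKSIZE - 1) \ %BLOCKSIZE
    FOR blk = 0 TO nBlocks - 1
        sBuf = STRING$(MIN&(%BLOCKSIZE, LOF(1) - blk * %BLOCKSIZE), 0)
        GET #1, blk * %BLOCKSIZE + 1, sBuf
        IF blk > UBOUND(dwOldSum) THEN                 ' file grew: block is new
            sOut = sOut + STR$(blk)
        ELSEIF BlockSum(sBuf) <> dwOldSum(blk) THEN    ' contents changed
            sOut = sOut + STR$(blk)
        END IF
    NEXT
    CLOSE #1
    FUNCTION = sOut
END FUNCTION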
A word of advice: most users are not going to validate that successful backups are actually taking place. That validation needs to be recorded somewhere, preferably in a log file for later review.
Backing up all comes down to four things: time, accuracy, recoverability, and ease of use. Backup programs, and time spent on such endeavors, never get to be the prince and have little value, until loss of data occurs.
paulp purvis
-
When doing incremental backups, considering today's storage options, I would only feel comfortable if full backup copies were retained on some media other than just the backup media.
The full backup copies could be incremental in nature, with older backups deleted to keep the media from filling up. This would be the first place to restore data from, and the media where the data is permanently backed up should be the last resort for data-loss retrieval.
p purvis
-
> To speed up block comparing, you might also want to use some kind of CRC checking and store that along with the block of data backed up, possibly in a separate file used just for checking.
> There was some fast code written in a thread on doing file comparisons that would be helpful in comparing blocks of a file.
As for a CRC, I wouldn't use one, but would go with a hashing algorithm instead.
With the fast code that Paul refers to, I get about 900 ms to compare two 10 MB blocks with one on an internal SATA drive and the other on an external USB 2.0 drive. With both on the SATA drive I get 540 ms. Comparing a 10 MB block, on the SATA drive, against an MD5 hash comes in at 400 ms.
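If anyone wants to try the hash route, here is a self-contained sketch of MD5 over a block using the Windows CryptoAPI. I've spelled out the declares rather than relying on what a given WIN32API.INC ships, and MD5Hash is just my name for the wrapper; treat it as a starting point. Store MD5Hash(block) at backup time, then compare the stored 16 bytes against a fresh hash of the live block on the next run.
Code:
' CryptoAPI declares (ADVAPI32.DLL) needed for MD5 hashing
DECLARE FUNCTION CryptAcquireContext LIB "ADVAPI32.DLL" ALIAS "CryptAcquireContextA" _
    (hProv AS DWORD, BYVAL pszContainer AS DWORD, BYVAL pszProvider AS DWORD, _
     BYVAL dwProvType AS DWORD, BYVAL dwFlags AS DWORD) AS LONG
DECLARE FUNCTION CryptCreateHash LIB "ADVAPI32.DLL" ALIAS "CryptCreateHash" _
    (BYVAL hProv AS DWORD, BYVAL Algid AS DWORD, BYVAL hKey AS DWORD, _
     BYVAL dwFlags AS DWORD, hHash AS DWORD) AS LONG
DECLARE FUNCTION CryptHashData LIB "ADVAPI32.DLL" ALIAS "CryptHashData" _
    (BYVAL hHash AS DWORD, BYVAL pbData AS DWORD, BYVAL dwDataLen AS DWORD, _
     BYVAL dwFlags AS DWORD) AS LONG
DECLARE FUNCTION CryptGetHashParam LIB "ADVAPI32.DLL" ALIAS "CryptGetHashParam" _
    (BYVAL hHash AS DWORD, BYVAL dwParam AS DWORD, BYVAL pbData AS DWORD, _
     pdwDataLen AS DWORD, BYVAL dwFlags AS DWORD) AS LONG
DECLARE FUNCTION CryptDestroyHash LIB "ADVAPI32.DLL" ALIAS "CryptDestroyHash" _
    (BYVAL hHash AS DWORD) AS LONG
DECLARE FUNCTION CryptReleaseContext LIB "ADVAPI32.DLL" ALIAS "CryptReleaseContext" _
    (BYVAL hProv AS DWORD, BYVAL dwFlags AS DWORD) AS LONG

%PROV_RSA_FULL       = 1
%CRYPT_VERIFYCONTEXT = &HF0000000???
%CALG_MD5            = &H8003
%HP_HASHVAL          = 2

' Return the 16-byte MD5 digest of sData, or an empty string on failure
FUNCTION MD5Hash (sData AS STRING) AS STRING
    LOCAL hProv, hHash, cb AS DWORD
    LOCAL sDigest AS STRING
    IF CryptAcquireContext(hProv, 0, 0, %PROV_RSA_FULL, %CRYPT_VERIFYCONTEXT) THEN
        IF CryptCreateHash(hProv, %CALG_MD5, 0, 0, hHash) THEN
            IF CryptHashData(hHash, STRPTR(sData), LEN(sData), 0) THEN
                sDigest = STRING$(16, 0) : cb = 16
                CALL CryptGetHashParam(hHash, %HP_HASHVAL, STRPTR(sDigest), cb, 0)
            END IF
            CALL CryptDestroyHash(hHash)
        END IF
        CALL CryptReleaseContext(hProv, 0)
    END IF
    FUNCTION = sDigest
END FUNCTION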
-
Originally posted by Michael Mattias: it's silly to even think you could selectively back up pieces of it if you don't know the structure.
Let's use the term "differential" instead of "incremental", as you suggested.
Starting off with "why do a differential backup at all?": the reason is to reduce the volume of stored data required to recreate the file during the "secured period", which is the span of time containing the points at which the data can be recreated. So the application will be relevant to backing up large files only; "large" here is relative to the storage medium. ISTM that what we are looking for is the largest possible chunks of matching data between the stored data (base plus deltas) and the live data, irrespective of the position occupied in the file, and a way of repositioning them when the data are retrieved.
OK, there are two slippery terms in that assertion:
What are the "largest chunks of matching data"? How do we determine what these chunks are, and if the method is iterative, when do we stop trying?
What is "a way of repositioning", and how is it accomplished? We also need it to do "base plus deltas", of course.
Yours, daft as a brush....
CJH
Comment
-
Given you don't know the file structure (your own words), comparing blocks of data at given offsets is silly, because you don't know that the application writing to that file isn't moving data around in the file.
E.g., the file may be indexed or keyed; inserting a record may place the new data somewhere in the middle of the physical file... or changes may occur "in place", meaning "some interior portion (a logical record)" changes.
I am thinking you simply "must" know something about the file structure to even consider trying what I think you are considering trying.
I have several large files I have to back up. You know what I use? A tape drive (about $400), and (are you ready for this?) Microsoft Backup.
You can back up a multi-GB file fairly quickly (ten minutes?) and with tape you don't have to worry about running out of media, i.e., no worries about "disk full, please insert a new disk."
For that matter, you can get size-large removable hard drives at a pretty reasonable cost and back up to those things.
My view is a couple hundred bucks of hardware solution is going to be a much better investment than the programming time + effort (= money) to create some kind of "find changes and back up only those changes" application.
MCM
PS:
Not included: programming time and effort to put Humpty-Dumpty back together again if a 'restore' is required!
Last edited by Michael Mattias; 4 Apr 2008, 08:45 AM.
Michael Mattias
Tal Systems (retired)
Port Washington WI USA
[email protected]
http://www.talsystems.com
-
Originally posted by Michael Mattias: I have several large files I have to back up. You know what I use? A tape drive (about $400), and (are you ready for this?) Microsoft Backup.
Say it ain't so, please!
All these years and people still think tape is reliable? Wow. Hey, I got this bridge you just have to see!!!
I hope you do FULL RESTORES from those flimsy, error-prone, heat-prone, thin rolls of plastic your valuable data is "stored" on at least once a month... you might be surprised to find out what's NOT on them.
(Google disk-to-disk backup solutions. You can get TERABYTE external drives for under $500. Heck, a decent 500 GB external drive will run you less than $100.00, and think of all the good night's sleep you can get again!)
Software makes Hardware Happen
-
Chris, I should first mention that I saw a 500 GB HD a couple weeks ago for $79. Okay, that said, there is a program I have used for probably a decade now called DLSuperC32, based on the IBM program SuperC, that compares two text files with records of almost any length. It's advertised as Industrial Strength and I have found it to be pretty much that. It's $19.95 shareware.
Its output is a text file with the line numbers of all the changes (and what kind of changes those are), so you could conceivably write a small utility to back up just those lines after any given DLSuperC run.
-
Originally posted by Michael Mattias: ... silly, because you don't know that the application writing to that file isn't moving data around in the file.
Going back to the data comparison: clearly, there is an overhead in the chunk descriptor, so it would make no sense at all to start with a 1-byte chunk size. Maybe the best way is to start with the whole file, compute a checksum, and see if it is 100% unchanged. If not, then do binary chops on file A and, using a plausible chunk size, search file B for matches to each chunk at any location. Having found one (process all of them in this way), attempt to grow that chunk in both directions until it becomes a non-matcher; if it bumps into another unchanged chunk, coalesce the chunks. This type of approach might go faster if streamed, and would benefit from intelligent optimisation at assembler level - see something by Charles Pegge about parsing the Bible in 0.7 secs, forget where I saw it, ah yes.
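A sketch of that grow-in-both-directions step, written against the gsLast/gsCurr globals used in the code posted further down the thread; the sub name is mine, and since PB passes parameters BYREF by default, the caller gets the widened position and length back. Coalescing is then just merging descriptors whose grown ranges touch or overlap.
Code:
' Given a match of lLen bytes at position lA in gsLast and lB in gsCurr,
' widen it one byte at a time until the surrounding bytes stop matching.
SUB GrowMatch (lA AS LONG, lB AS LONG, lLen AS LONG)
    ' extend to the left
    DO WHILE lA > 1 AND lB > 1
        IF MID$(gsLast, lA - 1, 1) <> MID$(gsCurr, lB - 1, 1) THEN EXIT DO
        DECR lA : DECR lB : INCR lLen
    LOOP
    ' extend to the right
    DO WHILE lA + lLen <= LEN(gsLast) AND lB + lLen <= LEN(gsCurr)
        IF MID$(gsLast, lA + lLen, 1) <> MID$(gsCurr, lB + lLen, 1) THEN EXIT DO
        INCR lLen
    LOOP
END SUB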
-
So let's say you start looking, and byte #2 of the file is different from what it was.
What do you do now? Backup everything after byte #1? Gee, that saved a lot.
Many files ("unknown" structure) have a file header in the first "n" bytes... which changes based on the number of records stored, the last update timestamp, etc... meaning you might find a change in nearly every "generation" of that file, while actual "data changes" may be interspersed within the file or consist of new data added to the end of the file.. or both. (Not that a changed header might not be very important!)
Worse, what if the software has done a "reorganization" of the file, to say, remove unused records. Now great swaths of the file are 'moved' but unchanged, but you can't know that unless you somehow perform synchronization comparisons at various offsets.
Absent some idea of the file structure this whole project is, IMNSHO, a stone loser.
Just check the archive bit; if it's set, back up the file; if it's not, don't.
Michael Mattias
Tal Systems (retired)
Port Washington WI USA
[email protected]
http://www.talsystems.com
-
How big is this file?
Michael Mattias
Tal Systems (retired)
Port Washington WI USA
[email protected]
http://www.talsystems.com
-
Originally posted by Michael Mattias: How big is this file?
-
So, the files have binary data, but are they organized in a record-like structure? That is, when a file is modified, are the changes consistent with changes one might find in a "standard record format" file but with your own unique format? Are you limited to, for example, adding a record, deleting a record, and changing a record (length and content)?
Or, can the changes simply be any bytes, any count, any location?
-
> Or, can the changes simply be any bytes, any count, any location?
That has to be a "YES" given the stated conditions... that the file structure is "unknown."
"IF" you know something about the target file(s), it may be practical to develop a backup scheme based on pieces of the file... but in the truly "unknown" circumstance I stand by my use of the adjective "silly."
Then again, Miguel de Cervantes gave us a colloquialism still with us some four hundred years after the fact merely writing about "tilting at windmills."
MCM
Michael Mattias
Tal Systems (retired)
Port Washington WI USA
[email protected]
http://www.talsystems.com
-
This reports a list of file differences between two files without making assumptions about file structure. I've only tested it gently, because I don't want it to go wrong.
Code:
#PBFORMS CREATED V1.51
#COMPILE EXE
#DIM ALL
#PBFORMS BEGIN INCLUDES
#IF NOT %DEF(%WINAPI)
    #INCLUDE "WIN32API.INC"
#ENDIF
#PBFORMS END INCLUDES
#PBFORMS BEGIN CONSTANTS
%IDD_DIALOG1     = 101
%IDC_LABEL1      = 1001
%IDC_LABEL2      = 1002
%IDC_LISTBOX1    = 1003
%IDC_CURRFILE_TB = 1005
%IDC_lastfile_tb = 1004
%IDC_BUTTON1     = 1007
%IDC_LABEL4      = 1008
%IDC_TEXTBOX1    = 1009
#PBFORMS END CONSTANTS
#PBFORMS DECLARATIONS

%minchunksize   = 128
%chunkincrement = 100

GLOBAL gsLast, gsCurr AS STRING
GLOBAL ghDlg AS DWORD

'-----------------------------------------------------------------------------------
' Recursive binary chop; the two strings are known to be different on entry.
FUNCTION compare (llow AS LONG, lhigh AS LONG) AS LONG
    LOCAL lmid, n AS LONG
    IF lhigh < llow + %minchunksize THEN
        FUNCTION = -1                       ' not found
        EXIT FUNCTION
    END IF
    lmid = (llow + lhigh) / 2
    n = INSTR(gsCurr, MID$(gsLast, lmid, lmid))
    IF n THEN                               ' match in upper half of subtree?
        LISTBOX ADD ghDlg, %IDC_LISTBOX1, _
            "found" + STR$(lmid) + " bytes at" + STR$(n) + _
            " in current file matching last known file at" + STR$(lmid)
        compare(llow, lmid - 1)             ' yes, so search lower part of tree
    ELSE
        compare(lmid + 1, lhigh)            ' no, refine search in upper part of tree
    END IF
END FUNCTION

'-----------------------------------------------------------------------------------
CALLBACK FUNCTION ShowDIALOG1Proc()
    STATIC sLastFile, sCurrFile AS STRING

    SELECT CASE AS LONG CBMSG
        CASE %WM_INITDIALOG
            ghDlg = CBHNDL

        CASE %WM_COMMAND
            SELECT CASE AS LONG CBCTL
                CASE %IDC_BUTTON1
                    IF CBCTLMSG = %BN_CLICKED OR CBCTLMSG = 1 THEN
                        LISTBOX RESET CBHNDL, %IDC_LISTBOX1
                        CONTROL GET TEXT CBHNDL, %IDC_lastfile_tb TO sLastFile
                        CONTROL GET TEXT CBHNDL, %IDC_CURRFILE_TB TO sCurrFile

                        OPEN sLastFile FOR BINARY AS #1
                        gsLast = STRING$(LOF(1), 0)
                        GET #1, 1, gsLast
                        CLOSE #1

                        OPEN sCurrFile FOR BINARY AS #1
                        gsCurr = STRING$(LOF(1), 0)
                        GET #1, 1, gsCurr
                        CLOSE #1

                        IF gsLast = gsCurr THEN
                            LISTBOX ADD CBHNDL, %IDC_LISTBOX1, "files are identical"
                        ELSE
                            compare(1, LEN(gsLast))   ' chop over the last-known file
                        END IF
                    END IF
            END SELECT
    END SELECT
END FUNCTION

'----------------------------------------------------------------------
FUNCTION ShowDIALOG1(BYVAL hParent AS DWORD) AS LONG
    LOCAL lRslt AS LONG

#PBFORMS BEGIN DIALOG %IDD_DIALOG1->->
    LOCAL hDlg AS DWORD

    DIALOG NEW hParent, "file differences by Sancho Panza's ***", 33, 81, 471, 121, %WS_POPUP OR %WS_BORDER OR _
        %WS_DLGFRAME OR %WS_SYSMENU OR %WS_CLIPSIBLINGS OR %WS_VISIBLE OR %DS_MODALFRAME OR %DS_3DLOOK OR _
        %DS_NOFAILCREATE OR %DS_SETFONT, %WS_EX_CONTROLPARENT OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR _
        %WS_EX_RIGHTSCROLLBAR, TO hDlg
    CONTROL ADD LABEL, hDlg, %IDC_LABEL1, "name of last known file", 5, 5, 85, 15
    CONTROL ADD LABEL, hDlg, %IDC_LABEL2, "name of current file", 5, 40, 85, 15
    CONTROL ADD LISTBOX, hDlg, %IDC_LISTBOX1, , 120, 15, 345, 100, %WS_CHILD OR %WS_VISIBLE OR %WS_TABSTOP OR _
        %WS_VSCROLL, %WS_EX_CLIENTEDGE OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR %WS_EX_RIGHTSCROLLBAR
    CONTROL ADD TEXTBOX, hDlg, %IDC_lastfile_tb, "last known file", 5, 20, 85, 15
    CONTROL ADD TEXTBOX, hDlg, %IDC_CURRFILE_TB, "current file", 5, 55, 85, 15
    CONTROL ADD BUTTON, hDlg, %IDC_BUTTON1, "start", 70, 105, 45, 15
    CONTROL ADD LABEL, hDlg, %IDC_LABEL4, "minimum chunk size", 5, 75, 85, 15
    CONTROL ADD TEXTBOX, hDlg, %IDC_TEXTBOX1, "", 5, 90, 25, 15, %WS_CHILD OR %WS_VISIBLE OR %WS_TABSTOP OR _
        %ES_LEFT OR %ES_AUTOHSCROLL OR %ES_NUMBER, %WS_EX_CLIENTEDGE OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR _
        %WS_EX_RIGHTSCROLLBAR
#PBFORMS END DIALOG

    DIALOG SHOW MODAL hDlg, CALL ShowDIALOG1Proc TO lRslt

#PBFORMS BEGIN CLEANUP %IDD_DIALOG1
#PBFORMS END CLEANUP

    FUNCTION = lRslt
END FUNCTION

'==================================================================================================================================
FUNCTION PBMAIN()
    ShowDIALOG1 %HWND_DESKTOP
END FUNCTION