i had been working on this html to text routine.
it would be nice have something that would spit out text from html but i just do not see anything.
my biggest purpose here is to make html code safe to read even if a browser reads the text and still try to keep as much of the message as possible.
i am not planing on reading the html with anything but a text program but still i do not want the text after conversion to do any damage to a computer in case a browser reads the text.
it is pretty well chopped up for now but it is a first jab at conversion.
i think tables and such other formating would have to come later.
also this program does not yet format to a letter page widths.
something that can come later.
rather than having a resulting file named .txt i named it .purvis.tht so that any files can be easily deleted
if any of you have html files saved i would appreciate the testing of this program against those files to see if any html code has not been caught.
i have seen wierd things working with this program or at least coding i do not understand, also i have seen outdated html code i though would have been removed by now
there is part of the code in this listing that converts > to > and < to < characters. i have removed pieced of code in a testing program called html2tst.
so that all characters of < or > in a resulting file would possibly be html code and more easily spotted.
it would be nice have something that would spit out text from html but i just do not see anything.
my biggest purpose here is to make html code safe to read even if a browser reads the text and still try to keep as much of the message as possible.
i am not planing on reading the html with anything but a text program but still i do not want the text after conversion to do any damage to a computer in case a browser reads the text.
it is pretty well chopped up for now but it is a first jab at conversion.
i think tables and such other formating would have to come later.
also this program does not yet format to a letter page widths.
something that can come later.
rather than having a resulting file named .txt i named it .purvis.tht so that any files can be easily deleted
if any of you have html files saved i would appreciate the testing of this program against those files to see if any html code has not been caught.
i have seen wierd things working with this program or at least coding i do not understand, also i have seen outdated html code i though would have been removed by now
there is part of the code in this listing that converts > to > and < to < characters. i have removed pieced of code in a testing program called html2tst.
so that all characters of < or > in a resulting file would possibly be html code and more easily spotted.
Code:
'html2tht.bas 'compiled with pbcc 4.04 'convert from html to text format ' #DIM ALL #COMPILE EXE #INCLUDE "WIN32API.INC" GLOBAL sahtmlarrayay1() AS STRING GLOBAL sahtmlarrayay2() AS STRING FUNCTION Exist(DirFileName AS STRING) AS LONG FUNCTION = GETATTR(DirFileName) FUNCTION = (ERR = 0) END FUNCTION FUNCTION PBMAIN() AS LONG LOCAL f, i, j, k, l AS LONG LOCAL iitems AS LONG LOCAL s, ss, SrcName, DestName AS STRING LOCAL smatch1 AS STRING LOCAL smatch2 AS STRING SrcName = COMMAND$ IF NOT Exist(SrcName) THEN STDOUT "File not found!" EXIT FUNCTION END IF DestName = SrcName+".purvis.tht" i = FREEFILE OPEN SrcName FOR BINARY AS i GET$ i, LOF(i), s CLOSE i 'verify file is a html file IF INSTR(s,"<")=0 THEN GOTO abortwritethefile IF INSTR(s,">")=0 THEN GOTO abortwritethefile GOSUB makesmall 'some files are hard to decipher in html format or not 'IF INSTR(s,"<head")=0 THEN GOTO abortwritethefile 'IF INSTR(s,"<body")=0 THEN GOTO abortwritethefile replace1: REPLACE "< " WITH "<" IN s IF INSTR(s,"< ") THEN GOTO replace1 replace2: REPLACE " >" WITH ">" IN s IF INSTR(s," > ") THEN GOTO replace2 'some files are hard to decipher in html format or not 'IF INSTR(s,"<head")=0 THEN GOTO abortwritethefile 'IF INSTR(s,"<body")=0 THEN GOTO abortwritethefile 'file appears to be a html file 'remove all text to the word <body> at the beginning of file i=INSTR(s,"<body") IF i THEN IF i>1 THEN FOR j=1 TO i-1 MID$(s,j,1)=CHR$(0) NEXT j END IF END IF 'remove all text from the end of the file to and including </body> IF INSTR(s,"</body>") THEN FOR l=LEN(s)-6 TO 1 STEP -1 i=INSTR(l,s,"</body>") IF i THEN K=LEN(s) MID$(s,i,2)=$CRLF FOR j=i+2 TO k MID$(s,j,1)=CHR$(0) NEXT j EXIT FOR END IF NEXT l END IF 'now we only have the body of the html file in the s variable 'do some house cleaning on the file REPLACE CHR$( 0) WITH "" IN s REPLACE CHR$( 9) WITH " " IN s REPLACE CHR$( 10) WITH " " IN s REPLACE CHR$( 13) WITH " " IN s s = s & $CRLF 'remove character and word formating, it is not needed smatch1="<em>" :GOSUB getrideof smatch1="</em>" :GOSUB getrideof smatch1="<strong>" :GOSUB getrideof smatch1="</strong>":GOSUB getrideof smatch1="<dfn>" :GOSUB getrideof smatch1="</dfn>" :GOSUB getrideof smatch1="<code>" :GOSUB getrideof smatch1="</code>" :GOSUB getrideof smatch1="<samp>" :GOSUB getrideof smatch1="</samp>" :GOSUB getrideof smatch1="<kbd>" :GOSUB getrideof smatch1="</kbd>" :GOSUB getrideof smatch1="<var>" :GOSUB getrideof smatch1="</var>" :GOSUB getrideof smatch1="<cite>" :GOSUB getrideof smatch1="</cite>" :GOSUB getrideof smatch1="<tt>" :GOSUB getrideof smatch1="</tt>" :GOSUB getrideof smatch1="<i>" :GOSUB getrideof smatch1="</i>" :GOSUB getrideof smatch1="<b>" :GOSUB getrideof smatch1="</b>" :GOSUB getrideof smatch1="<big>" :GOSUB getrideof smatch1="</big>" :GOSUB getrideof smatch1="<small>" :GOSUB getrideof smatch1="</small>" :GOSUB getrideof smatch1="<h1>" :GOSUB getrideof smatch1="</h1>" :GOSUB getrideof smatch1="<h2>" :GOSUB getrideof smatch1="</h2>" :GOSUB getrideof smatch1="<h3>" :GOSUB getrideof smatch1="</h3>" :GOSUB getrideof smatch1="<h4>" :GOSUB getrideof smatch1="</h4>" :GOSUB getrideof smatch1="<h5>" :GOSUB getrideof smatch1="</h5>" :GOSUB getrideof smatch1="<h6>" :GOSUB getrideof smatch1="</h6>" :GOSUB getrideof smatch1="<basefont" :smatch2=">":GOSUB cleanto1 smatch1="</basefont" :smatch2=">":GOSUB cleanto1 smatch1="<h1 " :smatch2=">":GOSUB cleanto1 smatch1="<h2 " :smatch2=">":GOSUB cleanto1 smatch1="<h3 " :smatch2=">":GOSUB cleanto1 smatch1="<h4 " :smatch2=">":GOSUB cleanto1 smatch1="<h5 " :smatch2=">":GOSUB cleanto1 smatch1="<h6 " :smatch2=">":GOSUB cleanto1 smatch1="<!--" :smatch2=">":GOSUB cleanto1 smatch1="<!doctype" :smatch2=">":GOSUB cleanto1 smatch1="<a href" :smatch2=">":GOSUB cleanto1 smatch1="<a " :smatch2=">":GOSUB cleanto1 smatch1="</a " :smatch2=">":GOSUB cleanto1 smatch1="</a>" :GOSUB getrideof smatch1="<abbr" :smatch2=">":GOSUB cleanto1 smatch1="</abbr>" :GOSUB getrideof smatch1="<acronym" :smatch2=">":GOSUB cleanto1 smatch1="</acronym>" :GOSUB getrideof smatch1="<address" :smatch2=">":GOSUB cleanto1 smatch1="</address>" :GOSUB getrideof smatch1="<base" :smatch2=">":GOSUB cleanto1 smatch1="<button" :smatch2="/button>":GOSUB cleanto1 smatch1="<body" :smatch2=">":GOSUB cleanto1 smatch1="</body" :smatch2=">":GOSUB cleanto1 smatch1="<caption" :smatch2=">":GOSUB cleanto1 smatch1="/caption>" :smatch2=">":GOSUB cleanto1 smatch1="<center>" :GOSUB getrideof smatch1="</center>" :GOSUB getrideof smatch1="<col " :smatch2="/col>":GOSUB cleanto1 smatch1="<colgroup" :smatch2="/colgroup>":GOSUB cleanto1 smatch1="<div" :smatch2=">":GOSUB cleanto1 smatch1="</div" :smatch2=">":GOSUB cleanto1 smatch1="<fieldset" :smatch2="/fieldset>":GOSUB cleanto1 smatch1="<font" :smatch2=">":GOSUB cleanto1 smatch1="</font" :smatch2=">":GOSUB cleanto1 smatch1="<form" :smatch2=">":GOSUB cleanto1 smatch1="</form" :smatch2=">":GOSUB cleanto1 smatch1="<frame" :smatch2="/>":GOSUB cleanto1 smatch1="<frameset" :smatch2="/frameset>":GOSUB cleanto1 smatch1="<head" :smatch2="/head>":GOSUB cleanto1 smatch1="<html" :smatch2=">":GOSUB cleanto1 smatch1="</html" :smatch2=">":GOSUB cleanto1 smatch1="<iframe" :smatch2="/iframe>":GOSUB cleanto1 smatch1="<img" :smatch2=">":GOSUB cleanto1 smatch1="<link" :smatch2="/>":GOSUB cleanto1 smatch1="<input" :smatch2=">":GOSUB cleanto1 smatch1="<map" :smatch2="/map>":GOSUB cleanto1 smatch1="<meta" :smatch2="/>":GOSUB cleanto1 smatch1="<meta" :smatch2=">":GOSUB cleanto1 smatch1="<object" :smatch2="/object>":GOSUB cleanto1 smatch1="<select" :smatch2="/select>":GOSUB cleanto1 smatch1="<script" :smatch2="/script>":GOSUB cleanto1 smatch1="<script" :smatch2=">":GOSUB cleanto1 smatch1="</script" :smatch2=">":GOSUB cleanto1 smatch1="<span" :smatch2=">":GOSUB cleanto1 smatch1="</span>" :GOSUB getrideof smatch1="<style" :smatch2="/style>":GOSUB cleanto1 smatch1="<textarea" :smatch2=">":GOSUB cleanto1 smatch1="</textarea" :smatch2=">":GOSUB cleanto1 'make any quote marks in the file REPLACE "<q>" WITH CHR$(39) IN s REPLACE "</q>" WITH CHR$(39) IN s REPLACE "<blockquote>" WITH CHR$(39) IN s REPLACE "</blockquote>" WITH CHR$(39) IN s 'smatch1="<table" :smatch2="/table>":GOSUB cleanto1 'removes all table information smatch1="<table" :smatch2=">":GOSUB cleanto1 smatch1="</table>" :GOSUB getrideof REPLACE "<thead>" WITH "" IN s REPLACE "</thead>" WITH "" IN s REPLACE "<tfoot>" WITH "" IN s REPLACE "</tfoot>" WITH "" IN s REPLACE "<th>" WITH "" IN s REPLACE "</th>" WITH " "+CHR$(124) IN s smatch1="<td" :smatch2=">":GOSUB cleanto1 'REPLACE "<td>" WITH "" IN s REPLACE "</td>" WITH " "+CHR$(124) IN s smatch1="<tr" :smatch2=">":GOSUB cleanto1 'REPLACE "<tr>" WITH "" IN s REPLACE "</tr>" WITH $CRLF IN s smatch1="<tbody" :smatch2=">":GOSUB cleanto1 'REPLACE "<tbody>" WITH "" IN s REPLACE "</tbody>" WITH $CRLF IN s smatch1="<tbody" :smatch2=">":GOSUB cleanto1 REPLACE "<title>" WITH "title of webpage: " IN s 'title should alread by wiped out REPLACE "</title>" WITH $CRLF IN s REPLACE "<ul>" WITH "" IN s REPLACE "</ul>" WITH "" IN s REPLACE "<p>" WITH $CRLF IN s REPLACE "</p>" WITH $CRLF IN s REPLACE "<br>" WITH $CRLF IN s REPLACE "</br>" WITH $CRLF IN s '</br> bad format but included anyways REPLACE "<br />" WITH $CRLF IN s REPLACE "<del>" WITH " *strikethroughstart* " IN s REPLACE "</del>" WITH " *strikethroughstop* " IN s REPLACE "<sub>" WITH " *subscriptstart* " IN s REPLACE "</sub>" WITH " *subscriptstop* " IN s REPLACE "<ins>" WITH " *underlinestart* " IN s REPLACE "</ins>" WITH " *underlinestop* " IN s REPLACE "<u>" WITH " *underlinestart* " IN s REPLACE "</u>" WITH " *underlinestop* " IN s REPLACE "<hr /" WITH $CRLF+"-----------------------------------------"+$CRLF IN S REPLACE " " WITH " " IN s REPLACE "&" WITH "&" IN s REPLACE "'" WITH "'" IN s 'special removeal smatch1="<!" :smatch2=">":GOSUB cleanto1 smatch1="<o:" :smatch2=">":GOSUB cleanto1 smatch1="</o:" :smatch2=">":GOSUB cleanto1 smatch1="<v:" :smatch2=">":GOSUB cleanto1 smatch1="</v:" :smatch2=">":GOSUB cleanto1 smatch1="<xml" :smatch2=">":GOSUB cleanto1 smatch1="</xml" :smatch2=">":GOSUB cleanto1 smatch1="</xml" :smatch2=">":GOSUB cleanto1 smatch1="<b style" :smatch2=">":GOSUB cleanto1 smatch1="<b " :smatch2=">":GOSUB cleanto1 smatch1="<p class" :smatch2=">":GOSUB cleanto1 smatch1="<p " :smatch2=">":GOSUB cleanto1 smatch1="<st1" :smatch2=">":GOSUB cleanto1 smatch1="</st1" :smatch2=">":GOSUB cleanto1 REPLACE CHR$(0) WITH "" IN s REPLACE ">" WITH ">"+$CRLF IN s REPLACE "< " WITH $CRLF+"<" IN s 'remove all html code tages 'smatch1="<" :smatch2=">":GOSUB cleanto1 replacetriplespaces: REPLACE " " WITH " " IN s IF INSTR(s," ") THEN GOTO replacetriplespaces replaceit1: REPLACE CHR$(124)+$CRLF WITH $CRLF IN s IF INSTR(s,CHR$(124)+$CRLF) THEN GOTO replaceit1 replaceit2: REPLACE $CRLF+CHR$(124) WITH $CRLF IN s IF INSTR(s,$CRLF+CHR$(124)) THEN GOTO replaceit2 s=TRIM$(s) DIM sahtmlarray(1) AS LOCAL STRING iitems = TALLY(s, $CRLF) REDIM sahtmlarray(1 TO iitems + 1) PARSE s, sahtmlarray(), $CRLF FOR i = 1 TO iitems ss = TRIM$(sahtmlarray(i)) IF ss="" THEN sahtmlarray(i)="" NEXT i GOSUB cleanarrayup FOR i = 1 TO iitems ss = TRIM$(sahtmlarray(i)) sahtmlarray(i)=ss NEXT i ss=JOIN$(sahtmlarray(),$CRLF) replaceit4: REPLACE " "+$CRLF WITH $CRLF IN ss IF INSTR(ss," "+$CRLF) THEN GOTO replaceit4 replaceit5: REPLACE $CRLF+$CRLF+$CRLF WITH $CRLF+$CRLF IN ss IF INSTR(ss,$CRLF+$CRLF+$CRLF) THEN GOTO replaceit5 replaceit6: IF LEFT$(ss,2)=$CRLF THEN ss=RIGHT$(ss,LEN(ss)-2):GOTO replaceit6 IF RIGHT$(ss,2)=$CRLF THEN ss=LEFT$(ss,LEN(ss)-2):GOTO replaceit6 '<dl></dl> '<dt></dt> '<dd></dd> '<ol></ol> '<li></li> REPLACE "<dl>" WITH $CRLF IN ss REPLACE "</dl>" WITH $CRLF+$CRLF IN ss REPLACE "<dt>" WITH SPACE$(8) IN ss REPLACE "</dt>" WITH $CRLF+$CRLF IN ss REPLACE "<dd>" WITH SPACE$(16) IN ss REPLACE "</dd>" WITH $CRLF IN ss 'REPLACE "<td>" WITH " BeginTableCelltd " IN ss 'REPLACE "<td " WITH " BeginTableCelltd " IN ss 'REPLACE "</td>" WITH " EndTableCelltd " IN ss ' ' 'REPLACE "<tr>" WITH " BeginTableRowtr " IN ss 'REPLACE "<tr " WITH " BeginTableRowtr <tr" IN ss 'REPLACE "</tr>" WITH " EndTableRowtr " IN ss 'REPLACE "<table>" WITH " BeginTableTABLE " IN ss 'REPLACE "<table " WITH " BeginTableTABLE " IN ss 'REPLACE "</table>" WITH " EndTableTABLE " IN ss replaceit7: REPLACE $CRLF+$CRLF+$CRLF WITH $CRLF+$CRLF IN ss IF INSTR(ss,$CRLF+$CRLF+$CRLF) THEN GOTO replaceit7 REPLACE "<" WITH "<" IN ss REPLACE ">" WITH ">" IN ss REPLACE "<" WITH "<" IN ss REPLACE ">" WITH ">" IN ss 'write the scrubed html file to disk '------------------------------------------------------------------------- f = FREEFILE OPEN DestName FOR OUTPUT AS f LEN=16384 PRINT #f,ss CLOSE f STDOUT "All done. Check the file " & DestName ' WAITKEY$ EXIT FUNCTION 'abort working with the file, write the whole file back to disk '------------------------------------------------------------------------- abortwritethefile: f = FREEFILE OPEN DestName FOR OUTPUT AS f LEN=16384 PRINT #f,s CLOSE f STDOUT srcname+" file does not appear to be in html format" STDOUT "All done. Check the file " & DestName STDOUT "press any key to continue" SLEEP 3000 'WAITKEY$ EXIT FUNCTION 'start of gosub routines '------------------------------------------------------------------------- getrideof: i=INSTR(UCASE$(s),UCASE$(smatch1)) IF i=0 THEN RETURN j=LEN(smatch1) FOR k=i TO i+j-1 MID$(s,k,1)=CHR$(0) NEXT k GOTO getrideof RETURN cleanto1: i=INSTR(UCASE$(s),UCASE$(smatch1)) IF i=0 THEN RETURN j=INSTR(i,UCASE$(s),UCASE$(smatch2)) IF j=0 THEN RETURN k=LEN(smatch2) FOR l=i TO j+k-1 MID$(s,l,1)=CHR$(0) NEXT l GOTO cleanto1 RETURN cleanto2: FOR i=1 TO iitems j=INSTR(UCASE$(sahtmlarray(i)),UCASE$(smatch1)) IF j THEN k=INSTR(UCASE$(sahtmlarray(i)),UCASE$(smatch2)) IF k THEN FOR l=j TO k MID$(sahtmlarray(i),l,1)=CHR$(0) NEXT l END IF END IF NEXT i RETURN cleantoendofmatch: I=INSTR(UCASE$(s),UCASE$(smatch1)) IF I=0 THEN RETURN J=INSTR(i,UCASE$(s),UCASE$(smatch2)) IF J=0 THEN RETURN k=LEN(smatch2) FOR l=i TO j+k-1 MID$(s,l,1)=CHR$(0) NEXT l GOTO cleantoendofmatch RETURN makesmall: k=0 FOR i=1 TO LEN(s) IF MID$(s,i,1)="<" THEN k=1:ITERATE FOR ELSE IF MID$(s,i,1)=">" THEN k=0:ITERATE FOR END IF IF k THEN MID$(s,i,1)=LCASE$(MID$(s,i,1)) NEXT i RETURN cleanarrayup: cleanarrayupagain: FOR i = 1 TO iitems ss = TRIM$(sahtmlarray(i)) IF LEN(ss)=0 THEN ITERATE FOR IF LEN(ss)=1 THEN IF ss=CHR$(124) THEN sahtmlarray(i)="":GOTO cleanarrayupagain END IF IF LEN(ss)>1 THEN IF LEFT$(ss,1)=CHR$(124) THEN sahtmlarray(i)=RIGHT$(ss,(LEN(ss)-1)):GOTO cleanarrayupagain IF RIGHT$(ss,1)=CHR$(124) THEN sahtmlarray(i)=LEFT$(ss,(LEN(ss)-1)):GOTO cleanarrayupagain END IF NEXT i RETURN END FUNCTION 'PbMain
Comment