Announcement

Collapse
No announcement yet.

html to text

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • html to text

    i had been working on this html to text routine.
    it would be nice have something that would spit out text from html but i just do not see anything.

    my biggest purpose here is to make html code safe to read even if a browser reads the text and still try to keep as much of the message as possible.

    i am not planing on reading the html with anything but a text program but still i do not want the text after conversion to do any damage to a computer in case a browser reads the text.

    it is pretty well chopped up for now but it is a first jab at conversion.

    i think tables and such other formating would have to come later.
    also this program does not yet format to a letter page widths.
    something that can come later.

    rather than having a resulting file named .txt i named it .purvis.tht so that any files can be easily deleted

    if any of you have html files saved i would appreciate the testing of this program against those files to see if any html code has not been caught.
    i have seen wierd things working with this program or at least coding i do not understand, also i have seen outdated html code i though would have been removed by now

    there is part of the code in this listing that converts &gt to > and &lt to < characters. i have removed pieced of code in a testing program called html2tst.
    so that all characters of < or > in a resulting file would possibly be html code and more easily spotted.



    Code:
    'html2tht.bas
    'compiled with pbcc 4.04
    'convert from html to text format
    '
    #DIM ALL
    #COMPILE EXE
    #INCLUDE "WIN32API.INC"
    
    GLOBAL sahtmlarrayay1() AS STRING
    GLOBAL sahtmlarrayay2() AS STRING
    
    FUNCTION Exist(DirFileName AS STRING) AS LONG
       FUNCTION = GETATTR(DirFileName)
       FUNCTION = (ERR = 0)
    END FUNCTION
    
    FUNCTION PBMAIN() AS LONG
      LOCAL f, i, j, k, l AS LONG
      LOCAL iitems AS LONG
      LOCAL s, ss, SrcName, DestName AS STRING
      LOCAL smatch1 AS STRING
      LOCAL smatch2 AS STRING
    
      SrcName = COMMAND$
      IF NOT Exist(SrcName) THEN
        STDOUT "File not found!"
        EXIT FUNCTION
      END IF
    
    
      DestName = SrcName+".purvis.tht"
    
      i = FREEFILE
      OPEN  SrcName FOR BINARY AS i
      GET$  i, LOF(i), s
      CLOSE i
    
    'verify file is a html file
    IF INSTR(s,"<")=0 THEN GOTO abortwritethefile
    IF INSTR(s,">")=0 THEN GOTO abortwritethefile
    GOSUB makesmall
    'some files are hard to decipher in html format or not
    'IF INSTR(s,"<head")=0 THEN GOTO abortwritethefile
    'IF INSTR(s,"<body")=0 THEN GOTO abortwritethefile
    replace1:
    REPLACE "< "       WITH "<"  IN s
    IF INSTR(s,"< ") THEN GOTO replace1
    replace2:
    REPLACE " >"       WITH ">"  IN s
    IF INSTR(s," > ") THEN GOTO replace2
    'some files are hard to decipher in html format or not
    'IF INSTR(s,"<head")=0 THEN GOTO abortwritethefile
    'IF INSTR(s,"<body")=0 THEN GOTO abortwritethefile
    'file appears to be a html file
    
    'remove all text to the word <body> at the beginning of file
    i=INSTR(s,"<body")
      IF i THEN
          IF i>1 THEN
              FOR j=1 TO i-1
               MID$(s,j,1)=CHR$(0)
              NEXT j
          END IF
      END IF
    
    'remove all text from the end of the file to and including </body>
    IF INSTR(s,"</body>") THEN
           FOR l=LEN(s)-6 TO 1 STEP -1
              i=INSTR(l,s,"</body>")
              IF i THEN
                 K=LEN(s)
                 MID$(s,i,2)=$CRLF
                 FOR j=i+2 TO k
                   MID$(s,j,1)=CHR$(0)
                 NEXT j
                 EXIT FOR
              END IF
           NEXT l
    END IF
    'now we only have the body of the html file in the s variable
    'do some house cleaning on the file
    REPLACE CHR$(  0) WITH ""          IN s
    REPLACE CHR$(  9) WITH " "         IN s
    REPLACE CHR$( 10) WITH " "         IN s
    REPLACE CHR$( 13) WITH " "         IN s
    s = s & $CRLF
    
    'remove character and word formating, it is not needed
    smatch1="<em>"     :GOSUB getrideof
    smatch1="</em>"    :GOSUB getrideof
    smatch1="<strong>" :GOSUB getrideof
    smatch1="</strong>":GOSUB getrideof
    smatch1="<dfn>"    :GOSUB getrideof
    smatch1="</dfn>"   :GOSUB getrideof
    smatch1="<code>"   :GOSUB getrideof
    smatch1="</code>"  :GOSUB getrideof
    smatch1="<samp>"   :GOSUB getrideof
    smatch1="</samp>"  :GOSUB getrideof
    smatch1="<kbd>"    :GOSUB getrideof
    smatch1="</kbd>"   :GOSUB getrideof
    smatch1="<var>"    :GOSUB getrideof
    smatch1="</var>"   :GOSUB getrideof
    smatch1="<cite>"   :GOSUB getrideof
    smatch1="</cite>"  :GOSUB getrideof
    smatch1="<tt>"     :GOSUB getrideof
    smatch1="</tt>"    :GOSUB getrideof
    smatch1="<i>"      :GOSUB getrideof
    smatch1="</i>"     :GOSUB getrideof
    smatch1="<b>"      :GOSUB getrideof
    smatch1="</b>"     :GOSUB getrideof
    smatch1="<big>"    :GOSUB getrideof
    smatch1="</big>"   :GOSUB getrideof
    smatch1="<small>"  :GOSUB getrideof
    smatch1="</small>" :GOSUB getrideof
    smatch1="<h1>"     :GOSUB getrideof
    smatch1="</h1>"    :GOSUB getrideof
    smatch1="<h2>"     :GOSUB getrideof
    smatch1="</h2>"    :GOSUB getrideof
    smatch1="<h3>"     :GOSUB getrideof
    smatch1="</h3>"    :GOSUB getrideof
    smatch1="<h4>"     :GOSUB getrideof
    smatch1="</h4>"    :GOSUB getrideof
    smatch1="<h5>"     :GOSUB getrideof
    smatch1="</h5>"    :GOSUB getrideof
    smatch1="<h6>"     :GOSUB getrideof
    smatch1="</h6>"    :GOSUB getrideof
    smatch1="<basefont"   :smatch2=">":GOSUB cleanto1
    smatch1="</basefont"  :smatch2=">":GOSUB cleanto1
    smatch1="<h1 "  :smatch2=">":GOSUB cleanto1
    smatch1="<h2 "  :smatch2=">":GOSUB cleanto1
    smatch1="<h3 "  :smatch2=">":GOSUB cleanto1
    smatch1="<h4 "  :smatch2=">":GOSUB cleanto1
    smatch1="<h5 "  :smatch2=">":GOSUB cleanto1
    smatch1="<h6 "  :smatch2=">":GOSUB cleanto1
    
    
    
    smatch1="<!--"       :smatch2=">":GOSUB cleanto1
    smatch1="<!doctype"       :smatch2=">":GOSUB cleanto1
    smatch1="<a href"    :smatch2=">":GOSUB cleanto1
    smatch1="<a "        :smatch2=">":GOSUB cleanto1
    smatch1="</a "        :smatch2=">":GOSUB cleanto1
    smatch1="</a>"       :GOSUB getrideof
    smatch1="<abbr"      :smatch2=">":GOSUB cleanto1
    smatch1="</abbr>"    :GOSUB getrideof
    smatch1="<acronym"   :smatch2=">":GOSUB cleanto1
    smatch1="</acronym>" :GOSUB getrideof
    smatch1="<address"   :smatch2=">":GOSUB cleanto1
    smatch1="</address>" :GOSUB getrideof
    smatch1="<base"      :smatch2=">":GOSUB cleanto1
    smatch1="<button"    :smatch2="/button>":GOSUB cleanto1
    smatch1="<body"      :smatch2=">":GOSUB cleanto1
    smatch1="</body"     :smatch2=">":GOSUB cleanto1
    smatch1="<caption"   :smatch2=">":GOSUB cleanto1
    smatch1="/caption>"  :smatch2=">":GOSUB cleanto1
    smatch1="<center>" :GOSUB getrideof
    smatch1="</center>" :GOSUB getrideof
    
    smatch1="<col "      :smatch2="/col>":GOSUB cleanto1
    smatch1="<colgroup"  :smatch2="/colgroup>":GOSUB cleanto1
    smatch1="<div"       :smatch2=">":GOSUB cleanto1
    smatch1="</div"      :smatch2=">":GOSUB cleanto1
    smatch1="<fieldset"  :smatch2="/fieldset>":GOSUB cleanto1
    smatch1="<font"      :smatch2=">":GOSUB cleanto1
    smatch1="</font"     :smatch2=">":GOSUB cleanto1
    smatch1="<form"      :smatch2=">":GOSUB cleanto1
    smatch1="</form"      :smatch2=">":GOSUB cleanto1
    
    smatch1="<frame"     :smatch2="/>":GOSUB cleanto1
    smatch1="<frameset"  :smatch2="/frameset>":GOSUB cleanto1
    smatch1="<head"      :smatch2="/head>":GOSUB cleanto1
    smatch1="<html"      :smatch2=">":GOSUB cleanto1
    
    smatch1="</html"     :smatch2=">":GOSUB cleanto1
    smatch1="<iframe"    :smatch2="/iframe>":GOSUB cleanto1
    smatch1="<img"       :smatch2=">":GOSUB cleanto1
    smatch1="<link"      :smatch2="/>":GOSUB cleanto1
    smatch1="<input"      :smatch2=">":GOSUB cleanto1
    
    smatch1="<map"       :smatch2="/map>":GOSUB cleanto1
    smatch1="<meta"      :smatch2="/>":GOSUB cleanto1
    smatch1="<meta"      :smatch2=">":GOSUB cleanto1
    
    smatch1="<object"    :smatch2="/object>":GOSUB cleanto1
    smatch1="<select"    :smatch2="/select>":GOSUB cleanto1
    
    smatch1="<script"    :smatch2="/script>":GOSUB cleanto1
    smatch1="<script"    :smatch2=">":GOSUB cleanto1
    smatch1="</script"    :smatch2=">":GOSUB cleanto1
    
    smatch1="<span"      :smatch2=">":GOSUB cleanto1
    smatch1="</span>"    :GOSUB getrideof
    smatch1="<style"     :smatch2="/style>":GOSUB cleanto1
    smatch1="<textarea"  :smatch2=">":GOSUB cleanto1
    smatch1="</textarea" :smatch2=">":GOSUB cleanto1
    
    'make any quote marks in the file
    REPLACE "<q>"   WITH CHR$(39)    IN s
    REPLACE "</q>"  WITH CHR$(39)    IN s
    REPLACE "<blockquote>"  WITH CHR$(39)    IN s
    REPLACE "</blockquote>" WITH CHR$(39)   IN s
    
    
    'smatch1="<table"     :smatch2="/table>":GOSUB cleanto1  'removes all table information
    smatch1="<table"     :smatch2=">":GOSUB cleanto1
    smatch1="</table>"   :GOSUB getrideof
    REPLACE "<thead>"   WITH "" IN s
    REPLACE "</thead>"  WITH "" IN s
    
    REPLACE "<tfoot>"   WITH "" IN s
    REPLACE "</tfoot>"  WITH "" IN s
    REPLACE "<th>"      WITH "" IN s
    REPLACE "</th>"     WITH " "+CHR$(124) IN s
    smatch1="<td"       :smatch2=">":GOSUB cleanto1
    'REPLACE "<td>"     WITH "" IN s
    REPLACE "</td>"     WITH " "+CHR$(124) IN s
    smatch1="<tr"       :smatch2=">":GOSUB cleanto1
    'REPLACE "<tr>"     WITH "" IN s
    REPLACE "</tr>"     WITH $CRLF IN s
    smatch1="<tbody"    :smatch2=">":GOSUB cleanto1
    'REPLACE "<tbody>"  WITH "" IN s
    REPLACE "</tbody>"  WITH $CRLF IN s
    smatch1="<tbody"    :smatch2=">":GOSUB cleanto1
    REPLACE "<title>"   WITH "title of webpage: " IN s    'title should alread by wiped out
    REPLACE "</title>"  WITH $CRLF IN s
    REPLACE "<ul>"      WITH "" IN s
    REPLACE "</ul>"     WITH "" IN s
    
    REPLACE "<p>"       WITH $CRLF IN s
    REPLACE "</p>"      WITH $CRLF IN s
    REPLACE "<br>"      WITH $CRLF IN s
    REPLACE "</br>"     WITH $CRLF IN s   '</br> bad format but included anyways
    REPLACE "<br />"    WITH $CRLF IN s
    
    
    
    REPLACE "<del>"     WITH " *strikethroughstart* " IN s
    REPLACE "</del>"    WITH " *strikethroughstop* " IN s
    REPLACE "<sub>"     WITH " *subscriptstart* " IN s
    REPLACE "</sub>"    WITH " *subscriptstop* " IN s
    
    REPLACE "<ins>"     WITH " *underlinestart* " IN s
    REPLACE "</ins>"    WITH " *underlinestop* " IN s
    REPLACE "<u>"       WITH " *underlinestart* " IN s
    REPLACE "</u>"      WITH " *underlinestop* " IN s
    
    REPLACE "<hr /"     WITH $CRLF+"-----------------------------------------"+$CRLF IN S
    REPLACE "&nbsp;"     WITH " " IN s
    REPLACE "&amp;"  WITH "&" IN s
    REPLACE "'"  WITH "'" IN s
    
    
    'special removeal
    smatch1="<!"       :smatch2=">":GOSUB cleanto1
    smatch1="<o:"      :smatch2=">":GOSUB cleanto1
    smatch1="</o:"     :smatch2=">":GOSUB cleanto1
    smatch1="<v:"      :smatch2=">":GOSUB cleanto1
    smatch1="</v:"     :smatch2=">":GOSUB cleanto1
    smatch1="<xml"     :smatch2=">":GOSUB cleanto1
    smatch1="</xml"    :smatch2=">":GOSUB cleanto1
    smatch1="</xml"    :smatch2=">":GOSUB cleanto1
    smatch1="<b style" :smatch2=">":GOSUB cleanto1
    smatch1="<b "      :smatch2=">":GOSUB cleanto1
    
    smatch1="<p class"  :smatch2=">":GOSUB cleanto1
    smatch1="<p "       :smatch2=">":GOSUB cleanto1
    
    smatch1="<st1"      :smatch2=">":GOSUB cleanto1
    smatch1="</st1"     :smatch2=">":GOSUB cleanto1
    
    
    
    
    REPLACE CHR$(0)     WITH ""         IN s
    REPLACE ">"         WITH ">"+$CRLF  IN s
    REPLACE "< "        WITH $CRLF+"<"  IN s
    
    'remove all html code tages
    'smatch1="<"        :smatch2=">":GOSUB cleanto1
    
    
    
    
    replacetriplespaces:
    REPLACE "   "         WITH "  "  IN s
    IF INSTR(s,"   ") THEN GOTO replacetriplespaces
    replaceit1:
    REPLACE CHR$(124)+$CRLF     WITH $CRLF IN s
    IF INSTR(s,CHR$(124)+$CRLF) THEN GOTO replaceit1
    replaceit2:
    REPLACE $CRLF+CHR$(124)     WITH $CRLF IN s
    IF INSTR(s,$CRLF+CHR$(124)) THEN GOTO replaceit2
    
    
     s=TRIM$(s)
      DIM sahtmlarray(1) AS LOCAL STRING
      iitems = TALLY(s, $CRLF)
      REDIM sahtmlarray(1 TO iitems + 1)
      PARSE s, sahtmlarray(), $CRLF
      FOR i = 1 TO iitems
      ss = TRIM$(sahtmlarray(i))
      IF ss="" THEN sahtmlarray(i)=""
      NEXT i
    
    GOSUB cleanarrayup
    FOR i = 1 TO iitems
      ss = TRIM$(sahtmlarray(i))
      sahtmlarray(i)=ss
    NEXT i
    
    ss=JOIN$(sahtmlarray(),$CRLF)
    
    replaceit4:
    REPLACE " "+$CRLF     WITH $CRLF IN ss
    IF INSTR(ss," "+$CRLF) THEN GOTO replaceit4
    replaceit5:
    REPLACE $CRLF+$CRLF+$CRLF     WITH $CRLF+$CRLF IN ss
    IF INSTR(ss,$CRLF+$CRLF+$CRLF) THEN GOTO replaceit5
    replaceit6:
    IF LEFT$(ss,2)=$CRLF  THEN ss=RIGHT$(ss,LEN(ss)-2):GOTO replaceit6
    IF RIGHT$(ss,2)=$CRLF THEN ss=LEFT$(ss,LEN(ss)-2):GOTO replaceit6
    
    
    '<dl></dl>
    '<dt></dt>
    '<dd></dd>
    '<ol></ol>
    '<li></li>
    REPLACE "<dl>"     WITH $CRLF IN ss
    REPLACE "</dl>"    WITH $CRLF+$CRLF IN ss
    REPLACE "<dt>"    WITH SPACE$(8) IN  ss
    REPLACE "</dt>"    WITH $CRLF+$CRLF IN ss
    REPLACE "<dd>"    WITH SPACE$(16) IN ss
    REPLACE "</dd>"    WITH $CRLF IN ss
    
    'REPLACE "<td>"     WITH " BeginTableCelltd " IN ss
    'REPLACE "<td "     WITH " BeginTableCelltd " IN ss
    'REPLACE "</td>"    WITH " EndTableCelltd "   IN ss
    '
    '
    'REPLACE "<tr>"     WITH " BeginTableRowtr "  IN ss
    'REPLACE "<tr "     WITH " BeginTableRowtr <tr"  IN ss
    'REPLACE "</tr>"    WITH " EndTableRowtr "    IN ss
    'REPLACE "<table>"  WITH " BeginTableTABLE "  IN ss
    'REPLACE "<table "  WITH " BeginTableTABLE "  IN ss
    'REPLACE "</table>" WITH " EndTableTABLE "    IN ss
    
    
    replaceit7:
    REPLACE $CRLF+$CRLF+$CRLF     WITH $CRLF+$CRLF IN ss
    IF INSTR(ss,$CRLF+$CRLF+$CRLF) THEN GOTO replaceit7
    REPLACE "&lt;"     WITH "<" IN ss
    REPLACE "&gt;"  WITH ">" IN ss
    REPLACE "&LT;"     WITH "<" IN ss
    REPLACE "&GT;"  WITH ">" IN ss
    
    
    
    
    
    
    
    'write the scrubed html file to disk
    '-------------------------------------------------------------------------
    
     f = FREEFILE
     OPEN DestName FOR OUTPUT AS f LEN=16384
      PRINT #f,ss
      CLOSE f
      STDOUT "All done. Check the file " & DestName
     ' WAITKEY$
      EXIT FUNCTION
    
    'abort working with the file, write the whole file back to disk
    '-------------------------------------------------------------------------
    abortwritethefile:
     f = FREEFILE
     OPEN DestName FOR OUTPUT AS f LEN=16384
      PRINT #f,s
      CLOSE f
      STDOUT srcname+" file does not appear to be in html format"
      STDOUT "All done. Check the file " & DestName
      STDOUT "press any key to continue"
      SLEEP 3000
      'WAITKEY$
      EXIT FUNCTION
    
    
    
    'start of gosub routines
    '-------------------------------------------------------------------------
    getrideof:
    i=INSTR(UCASE$(s),UCASE$(smatch1))
    IF i=0 THEN RETURN
      j=LEN(smatch1)
      FOR k=i TO i+j-1
        MID$(s,k,1)=CHR$(0)
      NEXT k
    GOTO getrideof
    RETURN
    
    cleanto1:
    i=INSTR(UCASE$(s),UCASE$(smatch1))
    IF i=0 THEN RETURN
    j=INSTR(i,UCASE$(s),UCASE$(smatch2))
    IF j=0 THEN RETURN
    k=LEN(smatch2)
       FOR l=i TO j+k-1
         MID$(s,l,1)=CHR$(0)
       NEXT l
    GOTO cleanto1
    RETURN
    
    cleanto2:
    FOR i=1 TO iitems
       j=INSTR(UCASE$(sahtmlarray(i)),UCASE$(smatch1))
       IF j THEN
           k=INSTR(UCASE$(sahtmlarray(i)),UCASE$(smatch2))
           IF k THEN
              FOR l=j TO k
                  MID$(sahtmlarray(i),l,1)=CHR$(0)
              NEXT l
           END IF
        END IF
    NEXT i
    RETURN
    
    
    cleantoendofmatch:
     I=INSTR(UCASE$(s),UCASE$(smatch1))
     IF I=0 THEN RETURN
     J=INSTR(i,UCASE$(s),UCASE$(smatch2))
     IF J=0 THEN RETURN
     k=LEN(smatch2)
     FOR l=i TO j+k-1
        MID$(s,l,1)=CHR$(0)
     NEXT l
    GOTO cleantoendofmatch
    RETURN
    
    
    makesmall:
    k=0
    FOR i=1 TO LEN(s)
        IF MID$(s,i,1)="<" THEN
            k=1:ITERATE FOR
          ELSE
          IF MID$(s,i,1)=">" THEN k=0:ITERATE FOR
        END IF
        IF k THEN MID$(s,i,1)=LCASE$(MID$(s,i,1))
    NEXT i
    RETURN
    
    cleanarrayup:
    cleanarrayupagain:
    FOR i = 1 TO iitems
      ss = TRIM$(sahtmlarray(i))
      IF LEN(ss)=0 THEN ITERATE FOR
      IF LEN(ss)=1 THEN
          IF ss=CHR$(124) THEN sahtmlarray(i)="":GOTO cleanarrayupagain
      END IF
      IF LEN(ss)>1 THEN
        IF LEFT$(ss,1)=CHR$(124) THEN sahtmlarray(i)=RIGHT$(ss,(LEN(ss)-1)):GOTO cleanarrayupagain
        IF RIGHT$(ss,1)=CHR$(124) THEN sahtmlarray(i)=LEFT$(ss,(LEN(ss)-1)):GOTO cleanarrayupagain
      END IF
    NEXT i
    RETURN
    
    END FUNCTION 'PbMain
    Attached Files
    Last edited by Paul Purvis; 7 Jun 2008, 07:42 PM.
    p purvis

  • #2
    Originally posted by paul d purvis View Post
    it would be nice have something that would spit out text from html but i just do not see anything.
    Well, you could start by automating IE, using the "SaveAs" method, which can save as text.

    There are dozens of "html2txt" converters out there, just Google for them. Most use regular expressions to keep the coding to a minimum (hint, PB has some nice regex functions).

    If your html input is valid XML (it ought to be in the year 2008), a simple XSL transform will do the job.

    There's even an online converter, courtesy of www.w3.org:

    http://cgi.w3.org/cgi-bin/html2txt

    Here's a pretty good list of converters to several other formats:

    http://www.w3.org/Tools/html2things.html

    I compiled one of them (C source, using gcc under MSYS). It's a filter, so gets its input from STDIN and its output goes to STDOUT.

    To use, simply

    C:> type mypage.html | html2setext.exe > mypage.txt
    Attached Files
    Last edited by Paul Franks; 7 Jun 2008, 11:01 PM.
    --pdf

    Comment


    • #3
      A very simple filter would look for opening tag flags (<) and trailing tag flags (>) and simply ignore everything in between. However, quoted text (surrounded by double quote symbols (") are generally not to be interpreted, so if you find a "<" or ">" inside a quoted section, you would ignore them.

      the code to do both these things is very simple. We define two variables which
      we call tag and quote to track the current status of these marks
      Code:
      tag=0
      quote=0
      FOR a=1 TO LEN(httpstr$)
        aa$=MID$(httpstr$,a,1)   'deal with one character at a time
        SELECT CASE aa$
        CASE CHR$(34)    'double quote
           IF NOT tag THEN quote = NOT quote  'ignore insides of tags
        CASE "<"     'possible start of tag
           IF NOT quote THEN INCR tag  'this allows for embedded tags
         CASE ">"
           IF NOT quote AND tab>0 THEN DECR tag  'again, allows for embedded tags
         END SELECT
         IF NOT tag THEN PRINT aa$;  'anything not in a tag is fair game
      NEXT
      This is certainly not going to deal with every situation by any means.
      If you want to recover embedded URLs, you have to look for certain types of
      tags and process them. If the HTTP page has links to other documents, lists,
      tables, or whatever, you have to search them out, retrieve them, process them, and embed them in the right place. But in its simplest form, this is how
      you can remove http tags from a straightforward document and retain just the text.

      Comment

      Working...
      X