http getter with an effort to scrape out text without elements and remove bracketed stuff

    This is more of a starter webpage getter.
    Ever heard of what I call "a starter party"? That is a party where there is no food, just a place to start off, and you will not stay long.

    Code:
    'compiled with pbcc 4.04
    
    'the right way or the wrong way, this program mostly deals with making an attempt to remove all content
    'between matched opening and closing brackets in a block of code in a webpage.
    'most of the time that will be opening and closing angle brackets, usually called
    'the < "less than" and > "greater than" signs, but the function was written to take any pair of matching characters.
    'of course this program relies on the order: an opening character comes before its closing character.
    
    'i had to create a function called removeblockedhttpelement because i found blocks of code that had
    'an uneven number of < and > characters, and it was in an http element block called <style> with </style>
    'of course <script> and </script> blocks can be tough also.
    
    'when using the function removeblockedhttpelement, it is necessary that the webpage block elements do not contain any spaces
    'after the < "less than" character, so what would fail would be blocks like "< style>" and/or "</ style>"
    
    'this program makes an attempt to help find webpage html code that gives me trouble with unevenly matched
    'opening and closing characters for tags, where i just want the raw text in a programmatic manner.
    
    
    '
    #COMPILE EXE
    #DIM ALL
    #BREAK ON
    
    'basic web get code from powerbasic examples to get a http webpage and not a redirected or https webpage
    FUNCTION getwebpage(BYVAL swebsite AS STRING,BYVAL swebsitefullurl AS STRING, BYREF swebpagecontents AS STRING) AS LONG
        LOCAL sBuffer, sEntirepage AS STRING
        LOCAL iLength,nfile AS LONG
    
        FUNCTION=0
        swebpagecontents=""
        IF LEN(swebsitefullurl)=0 THEN EXIT FUNCTION
        IF LEN(swebsite)=0 THEN EXIT FUNCTION
        ' Connecting...
        nfile=FREEFILE
        TCP OPEN "http" AT swebsite AS nfile TIMEOUT 60000
        'Could we connect to site?
        IF ERR THEN
            TCP CLOSE nfile
            EXIT FUNCTION
        END IF
        ' Send the GET request...
        TCP PRINT nfile, "GET " & swebsitefullurl & " HTTP/1.0"
        TCP PRINT nfile, "User-Agent: powerbasic"
        TCP PRINT nfile, ""
        ' Retrieve the page...
        DO
            TCP RECV nfile, 4096, sBuffer
            sEntirepage = sEntirepage + sBuffer
        LOOP WHILE ISTRUE LEN(sBuffer) AND ISFALSE ERR
        'Close the TCP/IP port...
        TCP CLOSE nfile
        IF RIGHT$(sEntirepage,1)=$SPC THEN
            swebpagecontents=RTRIM$(sEntirepage)
            ELSE
            swebpagecontents=sEntirepage
        END IF
        FUNCTION=1
    END FUNCTION
    
    'replace every occurrence of one string with another, directly in the string passed in (BYREF)
    FUNCTION replaceallcharacters (BYREF swebcontents AS STRING, BYVAL charactertoreplace AS STRING, BYVAL relacementcharacter AS STRING ) AS STRING
            WHILE INSTR(swebcontents,charactertoreplace)
                REPLACE charactertoreplace WITH relacementcharacter IN swebcontents
            WEND
    END FUNCTION
    
    'remove tabs and replace them with x number of spaces, 1 should likely be your minimum
    FUNCTION replacetabswithspaces (BYVAL swebcontents AS STRING, BYVAL inumberofspacesfortabs AS LONG ) AS STRING
            LOCAL spacesinplaceoftab AS STRING
    
            spacesinplaceoftab=SPACE$(inumberofspacesfortabs)
            WHILE INSTR(swebcontents,$TAB)
                REPLACE $TAB WITH spacesinplaceoftab IN swebcontents
            WEND
           FUNCTION=swebcontents
    END FUNCTION
    
    
    'remove all empty lines, and if a tab is present in the line then the line is not empty ' see tab replacement function
    'all spaces to the right on a line will be removed with this code (right trimmed)
    'spaces on the left are retained and no left trim will take place on spaces on the left
    FUNCTION removeemptylines (BYVAL swebcontents AS STRING ) AS STRING
        IF INSTR(swebcontents,$CR) THEN
            WHILE INSTR(swebcontents,$CRLF)
                REPLACE $CRLF WITH $LF IN swebcontents
            WEND
            WHILE INSTR(swebcontents,$CR)
                REPLACE $CR WITH $LF IN swebcontents
            WEND
        END IF
        WHILE INSTR(swebcontents,$SPC+$LF)
            REPLACE $SPC+$LF WITH $LF IN swebcontents
        WEND
        WHILE INSTR(swebcontents,$LF+$LF)
            REPLACE $LF+$LF WITH $LF IN swebcontents
        WEND
        WHILE LEFT$(swebcontents,1)=$LF
             swebcontents=RIGHT$(swebcontents,LEN(swebcontents)-1)
        WEND
        IF INSTR(swebcontents,$LF) THEN REPLACE $LF WITH $CRLF IN swebcontents
        FUNCTION=swebcontents
    END FUNCTION
    
    'standardize on CR and LF together for lines
    FUNCTION standardizeendoflinecharacters (BYVAL swebcontents AS STRING ) AS STRING
        IF INSTR(swebcontents,$CR) THEN
            WHILE INSTR(swebcontents,$CRLF)
                REPLACE $CRLF WITH $LF IN swebcontents
            WEND
            WHILE INSTR(swebcontents,$CR)
                REPLACE $CR WITH $LF IN swebcontents
            WEND
        END IF
        IF INSTR(swebcontents,$LF) THEN REPLACE $LF WITH $CRLF IN swebcontents
        FUNCTION=swebcontents
    END FUNCTION
    
    'remove all webcontent that is not in the body of the webpage contents
    FUNCTION separateouthttpbody (BYVAL swebcontents AS STRING ) AS STRING
        LOCAL iresult AS LONG
        IF LEN(swebcontents)=0 THEN GOTO EXITGETWEBBODY
        IF INSTR(UCASE$(swebcontents),"<BODY") AND INSTR(UCASE$(swebcontents),"</BODY") THEN
            iresult=INSTR(UCASE$(swebcontents),"<BODY")
            swebcontents=RIGHT$(swebcontents,LEN(swebcontents)-iresult)
            iresult=INSTR(swebcontents,">")
            IF iresult THEN swebcontents=RIGHT$(swebcontents,LEN(swebcontents)-iresult)
            iresult=INSTR(UCASE$(swebcontents),"</BODY")
            IF iresult THEN swebcontents=LEFT$(swebcontents,iresult-1)
        END IF
        IF INSTR(UCASE$(swebcontents),"<HTML") AND INSTR(UCASE$(swebcontents),"</HTML") THEN
            iresult=INSTR(UCASE$(swebcontents),"<HTML")
            swebcontents=RIGHT$(swebcontents,LEN(swebcontents)-iresult)
            iresult=INSTR(swebcontents,">")
            IF iresult THEN swebcontents=RIGHT$(swebcontents,LEN(swebcontents)-iresult)
            iresult=INSTR(UCASE$(swebcontents),"</HTML")
            IF iresult THEN swebcontents=LEFT$(swebcontents,iresult-1)
        END IF
    EXITGETWEBBODY:
        FUNCTION=swebcontents
    END FUNCTION
    
    
    'remove all characters in between and including any supplied open and closing brackets characters
    FUNCTION removematchingbrackets (BYVAL swebcontents AS STRING, BYVAL sleftbacketsign AS STRING, BYVAL srightbacketsign AS STRING, _
              BYVAL ifailonnonevencountbrackets AS LONG, BYVAL iprinttoconsoleifnonevencountbrackets AS LONG ) AS STRING
        LOCAL ilooper,  iinsidetbracketsflag,  isetcharactertoremove AS LONG
        LOCAL ileftbracketcharacter, irightbracketcharacter AS LONG
        LOCAL ipointer AS BYTE PTR
        LOCAL scontentstemp AS STRING
    
        'ifailonnonevencountbrackets = 0& '1& means do not process the removal of brackets if they are uneven in count,
                                          'which means the standard out will include the brackets.
                                          '0& means make an effort to process the uneven count of brackets anyway
        'iprinttoconsoleifnonevencountbrackets = 1& '1& means print the differences in bracket count to the screen if they are uneven in count
                                                    '0& means do not print any differences in bracket count if they are found to be uneven in count
        ileftbracketcharacter = ASC(sleftbacketsign)
        irightbracketcharacter = ASC(srightbacketsign)
        iinsidetbracketsflag = -1&
    
        IF LEN(swebcontents) = 0& OR TALLY(swebcontents,CHR$(ileftbracketcharacter)) = 0& OR TALLY(swebcontents,CHR$(irightbracketcharacter)) = 0& THEN _
            GOTO EXITREMOVEMATCHINGANGLEBRACKETS
    
        'this block counts each bracket character and reports when the two counts are not equal
        IF TALLY(swebcontents,CHR$(ileftbracketcharacter)) <> TALLY(swebcontents,CHR$(irightbracketcharacter)) THEN
            IF iprinttoconsoleifnonevencountbrackets THEN
                PRINT "Brackets types of "+CHR$(ileftbracketcharacter)+" and "+CHR$(irightbracketcharacter)+" are not equal in count, difference is "+ _
                    STR$(ABS(TALLY(swebcontents,CHR$(ileftbracketcharacter))-TALLY(swebcontents,CHR$(irightbracketcharacter))))
                PRINT "The number of character "+CHR$(ileftbracketcharacter)+" is"+STR$(TALLY(swebcontents,CHR$(ileftbracketcharacter)))
                PRINT "The number of character "+CHR$(irightbracketcharacter)+" is"+STR$(TALLY(swebcontents,CHR$(irightbracketcharacter)))
            END IF
    
        END IF
        'when the counts differ and the caller asked to fail on that, return the contents unchanged
        IF TALLY(swebcontents,CHR$(ileftbracketcharacter)) <> TALLY(swebcontents,CHR$(irightbracketcharacter)) THEN
            IF ifailonnonevencountbrackets THEN GOTO EXITREMOVEMATCHINGANGLEBRACKETS
        END IF
    
        'this step looks for a single unused character to use as a placeholder; it is written over the contents that will be removed at the end of this routine
        'if an available character is not found then this process aborts and the returned contents are the same as what was passed in
        WHILE INSTR(swebcontents,CHR$(isetcharactertoremove))>0&
            INCR isetcharactertoremove
            IF isetcharactertoremove > 255& THEN GOTO EXITREMOVEMATCHINGANGLEBRACKETS 'this line should not ever execute if an available temporary character is found
        WEND
        scontentstemp=swebcontents
        iinsidetbracketsflag = 0&
        ipointer = STRPTR(scontentstemp)
        FOR ilooper = 1& TO LEN(scontentstemp)
            IF @ipointer = ileftbracketcharacter THEN
                INCR iinsidetbracketsflag
                @ipointer = isetcharactertoremove
            ELSEIF @ipointer = irightbracketcharacter THEN
                IF iinsidetbracketsflag THEN
                    DECR iinsidetbracketsflag
                    @ipointer = isetcharactertoremove
                ELSE
                    'if you choose to remove the next IF block in your program, remove or remark out the ELSE above too
                    IF iprinttoconsoleifnonevencountbrackets THEN
                        PRINT MID$(swebcontents,MAX&(1&,ilooper-6&),13)+"   A possible location of uneven bracket matching"+STR$(ilooper)
                    END IF
                END IF
            ELSE
                IF iinsidetbracketsflag THEN @ipointer = isetcharactertoremove
            END IF
            INCR ipointer
        NEXT ilooper
    EXITREMOVEMATCHINGANGLEBRACKETS:
           IF iinsidetbracketsflag <> 0& THEN
            FUNCTION=swebcontents
        ELSE
            FUNCTION=REMOVE$(scontentstemp,CHR$(isetcharactertoremove))
        END IF
    END FUNCTION
    
    'remove all specific webpage element block types like <script> with </script> and <style> with </style>
    FUNCTION removeblockedhttpelement (BYVAL swebcontents AS STRING,BYVAL selementname AS STRING) AS STRING
        LOCAL ilooper AS LONG
        LOCAL iresult1 AS LONG
        LOCAL iresult2 AS LONG
    
        swebcontents=" "+swebcontents+" "
        selementname=UCASE$(TRIM$(selementname))
        IF LEN(selementname)=0 THEN GOTO EXITREMOVEHTPELEMENTS
        WHILE INSTR(UCASE$(swebcontents),"<"+selementname) AND INSTR(UCASE$(swebcontents),"</"+selementname)
            iresult1=INSTR(UCASE$(swebcontents),"<"+selementname)
            iresult2=INSTR(UCASE$(swebcontents),"</"+selementname)
            IF iresult2<iresult1 THEN GOTO EXITREMOVEHTPELEMENTS
            'find the ">" that closes the end tag; stop if there is none so the WHILE cannot loop forever
            iresult2=INSTR(iresult2+1&,swebcontents,">")
            IF iresult2=0& THEN GOTO EXITREMOVEHTPELEMENTS
            swebcontents=LEFT$(swebcontents,iresult1-1&)+RIGHT$(swebcontents,LEN(swebcontents)-iresult2)
        WEND
    EXITREMOVEHTPELEMENTS:
        IF LEN(swebcontents)<3& THEN
            FUNCTION=""
            ELSE
            FUNCTION=MID$(swebcontents,2&,LEN(swebcontents)-2&)
        END IF
    END FUNCTION
    
    
    
    FUNCTION PBMAIN () AS LONG
        LOCAL iresult AS LONG
        LOCAL ifailonnonevencountbrackets AS LONG
        LOCAL iprinttoconsoleifnonevencountbrackets AS LONG
        LOCAL swebsite AS STRING
        LOCAL swebsitefullurl AS STRING
        LOCAL swebpagecontents AS STRING
    
        swebpagecontents=""
        'this code does not support https sites
        'i tested on a web location that was a "simple site" , you might want to google that. It had some uneven number of matched "<>" brackets characters on the webpage.
    
        swebsitefullurl = "http://www.yoursimplesite.com/"  'it is possible to need a single forward slash character to end the url line
        swebsite="www.yoursimplesite.com"
    
        'swebpagecontents will be filled with the getwebpage
        swebpagecontents=""
        iresult=getwebpage(swebsite,swebsitefullurl,swebpagecontents)
        IF iresult=0 THEN FUNCTION=0:EXIT FUNCTION
    
        'the next 2 lines are a quick test to see if the program is getting the proper webpage content
        'IF LEN(swebpagecontents) THEN STDOUT swebpagecontents
        'EXIT FUNCTION
    
    
        'remove all but the body of the webpage
        IF LEN(swebpagecontents) THEN swebpagecontents=separateouthttpbody(swebpagecontents)
        replaceallcharacters(swebpagecontents,"<BR>","<br>")
        replaceallcharacters(swebpagecontents,"<Br>","<br>")
        replaceallcharacters(swebpagecontents,"<bR>","<br>")
        replaceallcharacters(swebpagecontents,"<BR />","<br />")
        replaceallcharacters(swebpagecontents,"<BR/>","<br/>")
        replaceallcharacters(swebpagecontents,"<P>","<p>")
        replaceallcharacters(swebpagecontents,"<br>",$CRLF)
        replaceallcharacters(swebpagecontents,"<br/>",$CRLF)
        replaceallcharacters(swebpagecontents,"<br />",$CRLF)
        replaceallcharacters(swebpagecontents,"<p>",$CRLF+$CRLF)
    
        'remove all blocks of the elements named script
        IF LEN(swebpagecontents) THEN swebpagecontents=removeblockedhttpelement(swebpagecontents,"script")
    
        'remove all blocks of the elements named style
        IF LEN(swebpagecontents) THEN swebpagecontents=removeblockedhttpelement(swebpagecontents,"style")
    
        ifailonnonevencountbrackets = 0&            'see inside the function removematchingbrackets for the meaning of this parameter
        iprinttoconsoleifnonevencountbrackets = 1& 'see inside the function removematchingbrackets for the meaning of this parameter
        'remove all characters in between the bracket signs of < and >
        IF LEN(swebpagecontents) THEN        swebpagecontents=removematchingbrackets(swebpagecontents,"<",">",ifailonnonevencountbrackets,iprinttoconsoleifnonevencountbrackets)
    
        'standardize the ending of line characters from carriage return or carriage return/linefeed or line feed to a combination carriage return+line feed characters
        IF LEN(swebpagecontents) THEN        swebpagecontents=standardizeendoflinecharacters(swebpagecontents)
    
    
        'replace characters in string
        replaceallcharacters(swebpagecontents,"&nbsp;",$SPC)
        replaceallcharacters(swebpagecontents,"&gt;",">")
        replaceallcharacters(swebpagecontents,"&lt;","<")
        replaceallcharacters(swebpagecontents,"&amp;","&")
        replaceallcharacters(swebpagecontents,"&quot;",$DQ)
        replaceallcharacters(swebpagecontents,"&apos;",$SQ)
        replaceallcharacters(swebpagecontents,"&cent;","¢")
        replaceallcharacters(swebpagecontents,"&pound;","£")
        replaceallcharacters(swebpagecontents,"&yen;","¥")
        replaceallcharacters(swebpagecontents,"&euro;","€")
        replaceallcharacters(swebpagecontents,"&copy;","©")
        replaceallcharacters(swebpagecontents,"&reg;","®")
        replaceallcharacters(swebpagecontents,"a&#768;","a`")
        replaceallcharacters(swebpagecontents,"a&#769;","a´")
        replaceallcharacters(swebpagecontents,"a&#770;","a^")
        replaceallcharacters(swebpagecontents,"a&#771;","a~")
        replaceallcharacters(swebpagecontents,"O&#768;","O`")
        replaceallcharacters(swebpagecontents,"O&#769;","O´")
        replaceallcharacters(swebpagecontents,"O&#770;","O^")
        replaceallcharacters(swebpagecontents,"O&#771;","O~")
        replaceallcharacters(swebpagecontents,"&#013;",$CR)
        replaceallcharacters(swebpagecontents,"&#010;",$LF)
        replaceallcharacters(swebpagecontents,"&#009;",$TAB)   'character 9 is the tab; 11 would be a vertical tab
    
    
        replaceallcharacters(swebpagecontents,"&nbsp",$SPC)
        replaceallcharacters(swebpagecontents,"&gt",">")
        replaceallcharacters(swebpagecontents,"&lt","<")
        replaceallcharacters(swebpagecontents,"&amp","&")
        replaceallcharacters(swebpagecontents,"&quot",$DQ)
        replaceallcharacters(swebpagecontents,"&apos",$SQ)
        replaceallcharacters(swebpagecontents,"&cent","¢")
        replaceallcharacters(swebpagecontents,"&pound","£")
        replaceallcharacters(swebpagecontents,"&yen","¥")
        replaceallcharacters(swebpagecontents,"&euro","€")
        replaceallcharacters(swebpagecontents,"&copy","©")
        replaceallcharacters(swebpagecontents,"&reg","®")
        replaceallcharacters(swebpagecontents,"a&#768","a`")
        replaceallcharacters(swebpagecontents,"a&#769","a´")
        replaceallcharacters(swebpagecontents,"a&#770","a^")
        replaceallcharacters(swebpagecontents,"a&#771","a~")
        replaceallcharacters(swebpagecontents,"O&#768","O`")
        replaceallcharacters(swebpagecontents,"O&#769","O´")
        replaceallcharacters(swebpagecontents,"O&#770","O^")
        replaceallcharacters(swebpagecontents,"O&#771","O~")
        replaceallcharacters(swebpagecontents,"&#013",$CR)
        replaceallcharacters(swebpagecontents,"&#010",$LF)
        replaceallcharacters(swebpagecontents,"&#009",$TAB)   'character 9 is the tab; 11 would be a vertical tab
    
        'replace tabs with space characters, use a number of at least 1 to provide at least one space where tabs are removed
        IF INSTR(swebpagecontents,$TAB) THEN swebpagecontents=replacetabswithspaces(swebpagecontents,8&)
        'remove blank lines; if tabs are present, use the replace tabs with spaces function first.
        'all spaces to the right on lines will be removed (right trimmed)
        IF LEN(swebpagecontents) THEN        swebpagecontents=removeemptylines(swebpagecontents)
    
        IF LEN(swebpagecontents) THEN STDOUT swebpagecontents
    
    END FUNCTION
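
For readers who want to try the two removal steps outside PowerBASIC, here is a minimal Python sketch of the same idea: drop whole script and style blocks first, then drop everything between matching brackets with a depth counter. The names remove_element_blocks and remove_bracketed and the sample page are made up for illustration. It is not a line-for-line port: it uses a regular expression for the block removal, and on an unclosed opening bracket it drops the rest of the text, whereas the PowerBASIC function above returns the contents unchanged.

```python
import re


def remove_element_blocks(html: str, name: str) -> str:
    """Drop every <name ...>...</name> block, ignoring upper/lower case."""
    pattern = re.compile(
        r"<{0}\b.*?</{0}\s*>".format(re.escape(name)),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.sub("", html)


def remove_bracketed(text: str, left: str = "<", right: str = ">") -> str:
    """Drop everything between matching left/right characters (inclusive)."""
    depth = 0          # how many opening characters are currently unclosed
    kept = []
    for ch in text:
        if ch == left:
            depth += 1
        elif ch == right:
            if depth:
                depth -= 1
            else:
                kept.append(ch)   # stray closer outside any bracket: keep it
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


# Hypothetical sample page with uneven brackets inside style and script.
page = ("<p>Hello <b>world</b></p>"
        "<style>a > b { color: red }</style>"
        "<script>if (1 < 2) {}</script>!")
page = remove_element_blocks(page, "script")
page = remove_element_blocks(page, "style")
print(remove_bracketed(page))
```

With the sample page this prints Hello world! because the uneven brackets inside the style and script blocks are gone before the bracket pass runs, which is the same reason the listing calls removeblockedhttpelement before removematchingbrackets.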
    p purvis

  • #2
    function to help standardize some tags to lower case, or tags where you want to remove the spaces inside the tag

    Everything is great when http elements are coded in a proper form, like lower case and no spacing that might play havoc.
    A lot of people do not like my style of writing, but when it comes to something that has to be programmed against, like html code, loose is not always best, and
    as far as i know, lower case is basically king when coding a lot of html, but it does not always end up that way.
    This code basically strips all the spaces from the html code and lowercases it into a temporary string, so it can find the tag that you want to change without affecting the original code during the search.
    This code finds an unused character to use as a temporary character, hopefully a null character, and removes it at the end.
    The idea is to make the code as fast as possible by working on the same string rather than making string copies. If it does not find an unused character, it returns without making changes to the original html
    code contents.

    The program works on a single element, and the element provided needs to be in brackets.
    The html element tags "<pre>" and "</pre>" are good examples to use this code on.
    If you wanted to standardize the tags above and they were in an html file like "< p R e >" or " < / Pr e>", which should not happen, the
    following lines of code will convert them.
    convert_http_element_with_brackets_to_lowercase (swebpagecontents,"<pre>")
    convert_http_element_with_brackets_to_lowercase (swebpagecontents,"</pre>")
    I am not an html expert by far, but for element tags that are basically a single word with no spaces in the tag name, this should help.
    A <BR> can become <br>, or given <br/> it can turn <br /> into <br/>, which it should have been all the time, because let's face it: some of those spaces are hard to see while creating code.
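
As a point of comparison, the same clean-up can be sketched in a few lines of Python with a regular expression that allows optional whitespace between every character of the tag. The name normalize_tag and the sample string are invented for this example, and the approach is an assumption of mine rather than a port of the pointer-based routine in this post.

```python
import re


def normalize_tag(html: str, tag: str) -> str:
    """Rewrite spaced or mixed-case variants of `tag` to its lower-case form."""
    canonical = tag.lower()
    # Allow any amount of whitespace between the characters of the tag.
    chars = [re.escape(c) for c in canonical if not c.isspace()]
    pattern = re.compile(r"\s*".join(chars), re.IGNORECASE)
    # A function is used as the replacement so the tag text is inserted as-is.
    return pattern.sub(lambda match: canonical, html)


# Hypothetical input with the kinds of sloppy tags described above.
sample = "a < p R e >code< / Pr e> b <BR />"
sample = normalize_tag(sample, "<pre>")
sample = normalize_tag(sample, "</pre>")
sample = normalize_tag(sample, "<br/>")
print(sample)
```

This prints a <pre>code</pre> b <br/>. Spaces outside the tags are left alone, and only the one tag passed in is touched on each call, the same way the PowerBASIC function is called once per tag.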



    Code:
    FUNCTION convert_http_element_with_brackets_to_lowercase (BYREF swebpagecontents AS STRING,BYVAL shttpelementname AS STRING ) AS LONG
        LOCAL ilooper,istartsearchlocation, ilocation1,ilocation2,icounter AS LONG
        LOCAL ipointer AS BYTE PTR
        LOCAL isetcharactertoremove AS LONG
        LOCAL iflagalldone AS LONG
        LOCAL icountnumberofspacescharacterstolocation AS LONG
        LOCAL icountnumberofnonspaces AS LONG
        LOCAL ilastcharacterinshttpelementname AS LONG
        LOCAL stemplowercasewitoutspaces AS STRING
    
        shttpelementname=LCASE$(shttpelementname)
        stemplowercasewitoutspaces=LCASE$(REMOVE$(swebpagecontents, $SPC))
        IF INSTR(stemplowercasewitoutspaces,shttpelementname) = 0& THEN
            FUNCTION=1&
            GOTO EXITCONVERTELEMENTWITHBRACKETTOLOWERCASE
        END IF
        IF INSTR(swebpagecontents,UCASE$(shttpelementname)) THEN
            REPLACE UCASE$(shttpelementname) WITH shttpelementname IN swebpagecontents
        END IF
        IF INSTR(stemplowercasewitoutspaces,shttpelementname)=0& THEN FUNCTION=1:GOTO EXITCONVERTELEMENTWITHBRACKETTOLOWERCASE
        WHILE INSTR(swebpagecontents,CHR$(isetcharactertoremove)) > 0&
            INCR isetcharactertoremove
            IF isetcharactertoremove > 255& THEN
                FUNCTION=0
                GOTO EXITCONVERTELEMENTWITHBRACKETTOLOWERCASE 'this line hopefully should not ever execute if an available temporary character is found
            END IF
        WEND
    
        ilastcharacterinshttpelementname=ASC(RIGHT$(shttpelementname,1&))
        istartsearchlocation=1&
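        WHILE iflagalldone=0&
            ilocation1=INSTR(istartsearchlocation,stemplowercasewitoutspaces,shttpelementname)
            IF ilocation1=0& THEN iflagalldone=1&: ITERATE
            icountnumberofspacescharacterstolocation=0&
            icountnumberofnonspaces=0&
            icounter=0&     'each pass walks the original string again from its first character
            ilocation2=1&
            ipointer =STRPTR(swebpagecontents)
            'walk the original string until as many non-space characters have been passed as come before the match in the stripped copy
            'placeholders written by earlier passes sit where spaces used to be, so they are counted like spaces
            WHILE ilocation2<ilocation1
                IF @ipointer=32& OR @ipointer=isetcharactertoremove THEN
                    INCR icountnumberofspacescharacterstolocation
                ELSE
                    INCR ilocation2
                END IF
                INCR icounter
                INCR ipointer
            WEND
            'keep any spaces that sit directly in front of the tag
            WHILE @ipointer=32& OR @ipointer=isetcharactertoremove
                INCR icountnumberofspacescharacterstolocation
                INCR icounter
                INCR ipointer
            WEND
            'blank out the tag in the original string, up to and including its last character
            FOR ilooper = icounter+1& TO LEN(swebpagecontents)
                IF @ipointer=ilastcharacterinshttpelementname THEN
                    @ipointer=isetcharactertoremove
                    EXIT FOR
                END IF
                @ipointer=isetcharactertoremove
                INCR ipointer
            NEXT ilooper
            'write the clean tag at the start of the blanked-out span
            ipointer =STRPTR(swebpagecontents)+icountnumberofspacescharacterstolocation+ilocation1-1&
            FOR ilooper = 1& TO LEN(shttpelementname)
                @ipointer=ASC(MID$(shttpelementname,ilooper,1&))
                INCR ipointer
            NEXT ilooper
            'continue searching the stripped copy just past this match
            istartsearchlocation=ilocation1+LEN(shttpelementname)
         WEND
        IF INSTR(swebpagecontents,CHR$(isetcharactertoremove)) THEN swebpagecontents=REMOVE$(swebpagecontents,CHR$(isetcharactertoremove))
        FUNCTION=1&     'report success, the same value the nothing-to-change exits above return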
        EXITCONVERTELEMENTWITHBRACKETTOLOWERCASE:
        EXIT FUNCTION
    END FUNCTION
    p purvis


    • #3
      This function makes an effort to deal with line breaks inside the html <pre> and </pre> tags.
      Why? Because there are line breaks all through html code, and you might want to preserve the line breaks inside pre blocks while removing the other line breaks in the html code.

      The only way I can figure to keep the line breaks in pre tags is to convert the line feeds to something you can reverse later.
      So, in a small effort to follow some kind of standard without anybody coaching me, i just converted the line breaks in the pre tags to an entity code, like the way the space is treated.
      But because you are the coder, you can do almost anything you want, as long as you are doing the coding right and you are converting the code to something your program understands.
      Then it should be ok.

      Because the *nix systems were first, i believe it is best to convert all carriage returns or carriage return/line feed combinations to line feeds.
      You can always do a full change of characters from line feeds to CRLF at the end of your program, and the same goes if you just want to change all the entity number representations
      of the line feeds and carriage returns.

      This code was written mostly for use while a program is working with html code; it is not intended for saving files that have entity numbers in the pre tags rather than control characters.

      I am not sure if this code will make it easier to break down html code into string arrays for dealing with the code, but maybe it will help by letting you split on those line feeds and carriage returns
      into string arrays while preserving the line breaks in html pre element blocks. It is much easier to deal with a short string than the whole html as one string.
      This code could be improved by removing the inefficient string building. I will have to think about it.

      The names of the functions hopefully should be self-explanatory.

      In the original code, i actually changed the tags also, from "<pre>" to "&preprepre" and "</pre>" to "&/preprepre", but
      i rethought that because i had written the code in the prior post to standardize some tags to lower case and a proper tag,
      so i felt that if you wanted to make those changes, you can use the above code, which should work for changing <pre> tags to something you like.
      I get carried away and start deleting all those tags and everything in them when i just want the raw basic text in an html file.

      convert_linebreaks_in_pre_element_to_entity_number(swebpagecontents)
      convert_linebreaks_in_pre_element_from_entity_number(swebpagecontents)
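
The round trip described above can be sketched in Python as follows. This is only an illustration under assumptions: the function names are invented, the tags are expected to be exactly "<pre>" and "</pre>" already, and the entity strings "&#10;" and "&#13;" are my choice for the line feed and carriage return codes.

```python
import re

# Matches one <pre>...</pre> block; the tags are assumed to be lower case
# and free of attributes, as in the PowerBASIC functions of this post.
PRE_BLOCK = re.compile(r"(<pre>)(.*?)(</pre>)", re.DOTALL)


def pre_linebreaks_to_entities(html: str) -> str:
    """Inside each pre block, turn CRLF/CR into LF and store LF as '&#10;'."""
    def encode(match):
        body = match.group(2).replace("\r\n", "\n").replace("\r", "\n")
        return match.group(1) + body.replace("\n", "&#10;") + match.group(3)
    return PRE_BLOCK.sub(encode, html)


def pre_linebreaks_from_entities(html: str) -> str:
    """Inside each pre block, turn '&#13;' and '&#10;' back into line feeds."""
    def decode(match):
        body = match.group(2).replace("&#13;", "\r").replace("&#10;", "\n")
        body = body.replace("\r\n", "\n").replace("\r", "\n")
        return match.group(1) + body + match.group(3)
    return PRE_BLOCK.sub(decode, html)


page = "one\r\ntwo<pre>a\r\nb\rc\nd</pre>three\n"
packed = pre_linebreaks_to_entities(page)
# Any clean-up that flattens line breaks can now run safely on the page.
flat = packed.replace("\r\n", " ").replace("\n", " ")
restored = pre_linebreaks_from_entities(flat)
print(repr(restored))
```

The line breaks outside the pre block are flattened to spaces, while the ones inside come back as plain line feeds.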

      Code:
      FUNCTION convert_linebreaks_in_pre_element_to_entity_number(BYREF swebpagecontents AS STRING) AS LONG
          LOCAL x1,x2 AS LONG
          LOCAL stemp1,stemp2,stemp3 AS STRING
         '<pre>paul</pre>
      
          x1=INSTR(swebpagecontents,"<pre>")
          x2=INSTR(swebpagecontents,"</pre>")
          IF x1=0& THEN EXIT FUNCTION
          IF x2<x1 THEN EXIT FUNCTION
          stemp3=swebpagecontents
          WHILE x1 AND x2
              IF x1>x2 THEN x1=0&:x2=0&:ITERATE
              stemp2=MID$(stemp3,x1+5,(x2-x1)-5)
      
              WHILE INSTR(stemp2,$CR+$LF)
                  REPLACE $CR+$LF WITH $LF IN stemp2
              WEND
              WHILE INSTR(stemp2,$CR)
                  REPLACE $CR WITH $LF IN stemp2
              WEND
              WHILE INSTR(stemp2,$LF)
                  REPLACE $LF WITH "&#10;" IN stemp2
              WEND
              stemp1+=LEFT$(stemp3,x1+4)+stemp2+MID$(stemp3,x2,+6)
              stemp3=RIGHT$(stemp3,LEN(stemp3)-x2-5)
              x1=INSTR(stemp3,"<pre>")
              x2=INSTR(stemp3,"</pre>")
          WEND
         swebpagecontents=stemp1+stemp3
      END FUNCTION
      
      FUNCTION convert_linebreaks_in_pre_element_from_entity_number(BYREF swebpagecontents AS STRING) AS LONG
          LOCAL x1,x2 AS LONG
          LOCAL stemp1,stemp2,stemp3 AS STRING
         '<pre>paul</pre>
      
          x1=INSTR(swebpagecontents,"<pre>")
          x2=INSTR(swebpagecontents,"</pre>")
          IF x1=0& THEN EXIT FUNCTION
          IF x2<x1 THEN EXIT FUNCTION
          stemp3=swebpagecontents
          WHILE x1 AND x2
              IF x1>x2 THEN x1=0&:x2=0&:ITERATE
              stemp2=MID$(stemp3,x1+5,(x2-x1)-5)
              WHILE INSTR(stemp2,"&#13;")
                  REPLACE "&#13;" WITH $CR IN stemp2
              WEND
              WHILE INSTR(stemp2,"&#10;")
                  REPLACE "&#10;" WITH $LF IN stemp2
              WEND
              WHILE INSTR(stemp2,$CR+$LF)
                  REPLACE $CR+$LF WITH $LF IN stemp2
              WEND
              WHILE INSTR(stemp2,$CR)
                  REPLACE $CR WITH $LF IN stemp2
              WEND
              stemp1+=LEFT$(stemp3,x1+4)+stemp2+MID$(stemp3,x2,+6)
              stemp3=RIGHT$(stemp3,LEN(stemp3)-x2-5)
              x1=INSTR(stemp3,"<pre>")
              x2=INSTR(stemp3,"</pre>")
          WEND
         swebpagecontents=stemp1+stemp3
      END FUNCTION
      p purvis


      • #4
        Most of the work here is just to retrieve some in-house http server information.
        Internet webpages have gotten so complicated with CSS and other script-type code.
        If somebody really wants to keep a webpage, there is code for that, or they can find a way to store it as
        a pdf and then extract the text with things like pdftk.
        My use is sometimes in a program running in the background to check the condition of servers or information in real time, and
        some of that i do not even want hitting the drive, because of how the newer SSD drives work, and they are in about every workstation now.
        The newer version of "AutoIt" has really improved since i tested it years ago.

        I will be changing the webget function in the program above to httprequest API code later.
        Most additions to this thread will only post functions rather than full source code.
        p purvis
