  • Retrieve a web page

    New member here - a PB hobbyist. I began using PB with version 2.10a for DOS (actually, before that, TurboBasic 1.1 in 1987) and in October I finally upgraded to PBCC 6.03.0102. Back then I used PB to take data from the mainframe and put it into usable formats for my university department. Those programs were simply "strings in one format" to "strings in another." I am retired now and write little programs for my own entertainment and to help with some financial calcs & taxes.

    While exploring PBCC I found the TCP statement, which enables the retrieval of a web page. That looks like a handy command. I tried the sample program in the documentation, but couldn't get it to work. Looking through the discussions, I think what I learned is that HTTP has evolved to HTTPS and the command no longer works because PB doesn't handle certificates. Am I correct in this assumption?

    I found a PB sample program that is supposed to send keystrokes to a web page so one can SELECT ALL with CTRL-A, COPY with CTRL-C, and then PASTE from the clipboard. I have only worked a little with that program at this point.

    I feel like a squirrel when reading the forums, because I keep finding "nuts of interest" that distract me from searching out that TCP statement. Thanks to all before me who contributed to the forum because it is a great help to me.

    Tim

  • #2
    Howdy, Tim!

    Welcome to the forums!!


    • #3
      For my clarification, are you wanting to capture the visible text on a page? Or the entire HTML code? Or perhaps an image of the webpage?


      • #4
        Gary, all I want to do is capture text from a page. For example, my credit union has a summary page of all my holdings, but it is not "downloadable." I can download .CSV files for each individual holding (checking, money market, certificates, etc.), but not the nice summary page. Or an e-text book may be displayed page by page, with no provision to download the page/book. Current practice is to select the page, copy the selected material, paste it into Notepad, and then use a PBCC program to parse out the information I want, the way I want it.

        I have been successful in downloading entire files with a program (by Pierre Bellisle, 29 Mar 2016, 12:40 PM), which is handy for grabbing .PDF files, etc.

        Thanks

        Tim


        • #5
          Tim,

          If you want a slob's way of getting the data you want, load the URL into an open file dialog and save the resulting file wherever you want. Then parse the file to get the data you need.
          hutch at movsd dot com
          The MASM Forum

          www.masm32.com


          • #6
            > a slob's way of getting the data you want

            Yeah, my first thought was WEBGET, but that doesn't work with some database-driven pages.

            Tim, however you retrieve it, a super-easy way to parse well-formed HTML is to use:

            PARSE$(... ANY "<>")

            By using ANY, you'll find that the fields alternate between HTML tag contents and the plain text you seek (the plain text typically lands in the odd-numbered fields and the tag contents in the even ones).
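
            A minimal sketch of that idea (not Eric's exact code, and assuming the page source has already been downloaded into sHtml by whatever method you prefer):

            Code:
            #COMPILE EXE
            #DIM ALL

            FUNCTION PBMAIN () AS LONG
              LOCAL i, nFields AS LONG
              LOCAL sHtml, sField, sText AS STRING

              ' Stand-in sample; in practice sHtml holds the downloaded page source
              sHtml = "<html><body><b>Hello</b> pasty world</body></html>"

              nFields = PARSECOUNT(sHtml, ANY "<>")

              ' Every tag sits between a "<" and a ">" delimiter, so the fields
              ' alternate: odd-numbered fields are the (sometimes empty) text
              ' between tags, even-numbered fields are the tag contents.
              FOR i = 1 TO nFields STEP 2
                sField = TRIM$(PARSE$(sHtml, ANY "<>", i), ANY CHR$(9,10,13,32))
                IF LEN(sField) THEN sText = sText + sField + $CRLF
              NEXT i

              PRINT sText          ' the visible text, one fragment per line
              WAITKEY$
            END FUNCTION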
            "Not my circus, not my monkeys."


            • #7
              Originally posted by Steve Hutchesson View Post
              Tim,

              If you want a slob's way of getting the data you want, load the URL into an open file dialog and save the resulting file wherever you want. Then parse the file to get the data you need.
              Steve

              Speaking as a slob - can you give me the details of this please?

              Kerry
              I made a coding error once - but fortunately I fixed it before anyone noticed
              Kerry Farmer


              • #8
                Originally posted by Eric Pearson View Post
                > a slob's way of getting the data you want

                Yeah, my first thought was WEBGET, but that doesn't work with some database-driven pages.

                Tim, however you retrieve it, a super-easy way to parse well-formed HTML is to use:

                PARSE$(... ANY "<>")

                By using ANY, you'll find that the fields alternate between HTML tag contents and the plain text you seek (the plain text typically lands in the odd-numbered fields and the tag contents in the even ones).
                Dang! That's one I've never thought of. Nice one!

                --
                [URL="http://www.camcopng.com"]CAMCo - Applications Development & ICT Consultancy[/URL][URL="http://www.hostingpng.com"]
                PNG Domain Hosting[/URL]


                • #9
                  I have been able to use the TCP statement with the following code. I tried to incorporate the suggestion by Eric, but I'm not using the commands correctly. The web site in my example is one I visit frequently and it is safe - www.pasty.com. A pasty is a meat "pie" wrapped in a crust and was food for Cornish miners. The web site celebrates a region of Michigan in the Upper Peninsula. I think the reason this site downloads is that it is HTTP, not HTTPS. (Attachment: UPLOAD.TXT)
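
                  (UPLOAD.TXT itself is not reproduced here; the following is a minimal sketch of the kind of plain-HTTP request involved, not the actual attachment.)

                  Code:
                  #COMPILE EXE
                  #DIM ALL

                  FUNCTION PBMAIN () AS LONG
                    LOCAL sBuffer, sPage AS STRING

                    ' Plain HTTP (port 80) only; an HTTPS site needs TLS,
                    ' which the bare TCP statements cannot provide.
                    TCP OPEN "http" AT "www.pasty.com" AS #1 TIMEOUT 30000
                    IF ERR THEN
                      PRINT "Connect failed, ERR ="; ERR
                    ELSE
                      TCP PRINT #1, "GET / HTTP/1.0"        ' ask for the home page
                      TCP PRINT #1, "Host: www.pasty.com"   ' host header
                      TCP PRINT #1, ""                      ' blank line ends the request
                      DO                                    ' read until the server closes
                        TCP RECV #1, 4096, sBuffer
                        sPage = sPage + sBuffer
                      LOOP UNTIL sBuffer = ""
                      TCP CLOSE #1
                      PRINT LEFT$(sPage, 500)               ' headers plus the start of the HTML
                    END IF
                    WAITKEY$
                  END FUNCTION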


                  • #10
                    Tim,
                    Once you have the content downloaded, do you want to do anything other than read the content?

                    Specifically, would an image of the page be good enough? I've published code here in the forums for that.

                    Getting the formatted content of a web page pretty much requires that you download the entire page, including any JavaScript, CSS or other external files that modify the formatting/content. That's a daunting task.

                    Steve's idea of letting the browser download the complete page is the simplest solution but isn't a programming approach. I don't see that you responded to his suggestion, so I don't know if it meets your needs.


                    • #11
                      Kerry,
                      With Chrome open, click the menu button just to the right of the website address bar. Open "More Tools", then "Save Page As..."

                      That saves an HTML file to the PC, plus a subfolder that contains all the extra needed files.


                      • #12
                        And, Kerry,
                        Interestingly if you delete the HTML file, the subfolder deletes as well.


                        • #13
                          Thanks Gary

                          Can this process be automated so it can be run from a PB program?

                          Kerry
                          I made a coding error once - but fortunately I fixed it before anyone noticed
                          Kerry Farmer


                          • #14
                            > if you delete the HTML file, the subfolder deletes as well.

                            Yeah, it threw me off when I noticed that too.
                            "Not my circus, not my monkeys."


                            • #15
                              Kerry,
                              I'm not sure. Jose would perhaps be the one to answer that for us.


                              • #16
                                Kerry,

                                 I simply used my own plain text editor and pasted the web page URL into its File Open dialog, which then loads the page directly into the editor. Whatever is visible to a web browser is what you get as an HTML page, and you only get the page, not the supporting files.
                                hutch at movsd dot com
                                The MASM Forum

                                www.masm32.com


                                • #17
                                  Tim and Kerry, you may be able to use "webget", a commandline program that downloads web pages. I use one called wget. Free, downloadable, lots of commandline options. Your PB program would simply SHELL to the webget program with the URL you want.
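
                                  A minimal sketch of that approach (it assumes wget.exe is on the PATH; the file name and URL are just examples):

                                  Code:
                                  #COMPILE EXE
                                  #DIM ALL

                                  FUNCTION PBMAIN () AS LONG
                                    LOCAL sUrl, sFile AS STRING
                                    sUrl  = "http://www.pasty.com/"   ' page to fetch
                                    sFile = "pasty.html"              ' where wget should save it

                                    ' -q = quiet, -O = output file name. The SHELL statement
                                    ' should wait for wget to finish before the program continues
                                    ' (see the SHELL topic in the help file for the exact semantics).
                                    SHELL "wget.exe -q -O " + sFile + " " + sUrl

                                    ' sFile can now be read with ordinary file I/O and parsed,
                                    ' for example with PARSE$(... ANY "<>") as suggested above.
                                  END FUNCTION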

                                  Hutch, I didn't get what you were saying at first, nice tip about loading URLs into an editor. Works great with UltraEdit; I'll use that!
                                  Last edited by Eric Pearson; 4 Dec 2017, 03:11 AM. Reason: Wrong name.
                                  "Not my circus, not my monkeys."


                                  • #18
                                    Well, that is pretty cool. I tried it with NotePad - put "http://www.garybeene.com" into the Open File Dialog and got the HTML code, this ...

                                    Code:
                                    <html>
                                    <head><link rel="stylesheet" href="files/gbic.css"></head>
                                    <body>
                                    <table width=800 border=0 cellpadding=0 cellspacing=0>
                                    <tr><td colspan=4> <img onMouseOver="CloseMenu()" src="images/main.gif">
                                    <tr><td><script type="text/javascript" src="files/menu0.js"></script>
                                    
                                    <tr>  <!-- main row -->
                                    <td width=15 nowrap> <!-- left gap -->
                                    
                                    <td valign=top nowrap width=650> <!-- navagation line + center body -->
                                    
                                    <table width="100%" bgcolor="#FFFFFF" border="1" cellpadding="5" cellspacing="0"><tr><td>
                                    
                                    <!-- start main content -->
                                    
                                    <p>
                                    <!-- <img align=right src="images/garylive.jpg"> -->
                                    
                                    <p>
                                    <b>Welcome!</b><br>
                                    This site has over 2,000 pages of content covering an unusual blend of
                                    topics. With a million+ pages delivered from this site each year, including over
                                    250,000 downloads of my software,
                                    
                                    ... etc.


                                    • #19
                                      The WinAPI function URLDownloadToFile() might be handy.

                                      That supports "save as..."
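
                                      A minimal sketch of calling it from PB (the DECLARE below is the usual ANSI form; the URL and file name are just examples):

                                      Code:
                                      #COMPILE EXE
                                      #DIM ALL

                                      ' urlmon.dll export; returns 0 (S_OK) on success.
                                      DECLARE FUNCTION URLDownloadToFile LIB "URLMON.DLL" ALIAS "URLDownloadToFileA" ( _
                                        BYVAL pCaller AS DWORD, szURL AS ASCIIZ, szFileName AS ASCIIZ, _
                                        BYVAL dwReserved AS DWORD, BYVAL lpfnCB AS DWORD) AS LONG

                                      FUNCTION PBMAIN () AS LONG
                                        LOCAL hr AS LONG
                                        ' Note: the call may hand back a copy from the WinINet/IE cache
                                        ' (the issue Dan mentions below); calling wininet's DeleteUrlCacheEntry
                                        ' on the URL first is the usual suggestion when a fresh copy is needed.
                                        hr = URLDownloadToFile(0, "http://www.pasty.com/", "C:\Temp\pasty.html", 0, 0)
                                        IF hr = 0 THEN
                                          PRINT "Saved."
                                        ELSE
                                          PRINT "Failed, HRESULT = &H"; HEX$(hr)
                                        END IF
                                        WAITKEY$
                                      END FUNCTION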
                                      Michael Mattias
                                      Tal Systems Inc.
                                      Racine WI USA
                                      mmattias@talsystems.com
                                      http://www.talsystems.com


                                      • #20
                                        Originally posted by Michael Mattias View Post
                                        The WinAPI function URLDownloadToFile() might be handy.

                                        That supports "save as..."
                                        It's a long time since I tried to use that function, but as far as I remember I gave up because I couldn't make it not read the file from the IE cache. I'm sure there's a way, but I couldn't find it. I just used PB's TCP functions in the end.
                                        Dan
