Where does the forum database live? Who is hosting the forum?


  • #21
    I am sure that the original request was well intended, but a forum database of this type belongs to the account that has the forum. It contains a massive amount of information, including personal information about the members, so allowing anyone to get a copy of the forum database is an absolute NO NO. I guess what a number of people want is a copy of the postings and attachments, but that is a different matter to getting the database.
    hutch at movsd dot com
    The MASM Forum

    www.masm32.com



    • #22
      Who knew! Eric, any starting point for a JSON try?



      • #23
        You are right Steve, I just want to get to the posts and attachments.



        • #24
          David,
          In that case, gbThreads would be the place to go. The downloadable version is not currently up to date on the most recent threads. I'll work to have that done in the next week.



          • #25
            Originally posted by David Clarke View Post
            Who knew! Eric, any starting point for a JSON try?
            Just Google. Lots of people use JSON. Example: this is the URL for this forum's Activity Stream: https://forum.powerbasic.com/search?searchJSON={%22date%22:{%22from%22:%222%22},%22sort%22:{%22created%22:%22desc%22},%22view%22:%22%22,%22exclude%22:%2245%22,%22exclude_type%22:[%22vBForum_PrivateMessage%22]}

            Replace all %22 with double quotes to make it more readable.
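
For anyone who wants to do that replacement in code rather than by hand, here is a minimal Python sketch. It only decodes the URL quoted above and parses the result as JSON. The function name `decode_search_json` is made up for illustration, and nothing here is claimed about how the forum software itself handles the parameter.

```python
import json
from urllib.parse import unquote

# The Activity Stream URL quoted in the post above, split only for line length.
URL = (
    "https://forum.powerbasic.com/search?searchJSON="
    "{%22date%22:{%22from%22:%222%22},%22sort%22:{%22created%22:%22desc%22},"
    "%22view%22:%22%22,%22exclude%22:%2245%22,"
    "%22exclude_type%22:[%22vBForum_PrivateMessage%22]}"
)

def decode_search_json(url):
    """Return the searchJSON query value as a Python dict (hypothetical helper)."""
    encoded = url.split("searchJSON=", 1)[1]
    # unquote() turns every %22 back into a double quote.
    return json.loads(unquote(encoded))

params = decode_search_json(URL)
print(json.dumps(params, indent=2))
```

Once decoded, the structure is an ordinary JSON object with keys such as `date`, `sort` and `exclude_type`.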
            "Not my circus, not my monkeys."



            • #26
              Thanks Eric!

              Gary, am I correct in my understanding that you have downloaded all the threads by screen scraping? Your data would be great to try.



              • #27
                I used GNU Wget 1.19.4 to retrieve PB forum webpages from a batch program.

                The thread numbers do not work, but I played with it anyway.
                Each page took 2.6 seconds in this loop.
                I created a program to compare two file sizes and return 0 if they are equal.

                PB program to compare file sizes
                Code:
                'compiled with pbcc 6.04
                'filesizeequal.bas
                'determine if 2 files sizes are equal
                
                #COMPILE EXE
                #DIM ALL
                #BREAK ON
                
                FUNCTION PBMAIN () AS LONG
                    LOCAL sfilename1 AS STRING
                    LOCAL sfilename2 AS STRING
                    LOCAL ifilesize1 AS LONG
                    LOCAL ifilesize2 AS LONG
                
                    FUNCTION=1
                
                    sfilename1=COMMAND$(1)
                    sfilename2=COMMAND$(2)
                    'test for input of command line
                
                    IF LEN(sfilename1)=0 OR LEN(sfilename2)=0 THEN
                        FUNCTION=20
                        EXIT FUNCTION
                    END IF
                
                
                    IF NOT ISFILE(sfileName1) THEN
                        STDOUT sfilename1+ " file does not exist "
                        FUNCTION=11
                        EXIT FUNCTION
                    END IF
                    IF NOT ISFILE(sfileName2) THEN
                        STDOUT sfilename2+ " file does not exist "
                        FUNCTION=12
                        EXIT FUNCTION
                    END IF
                
                    'ACCESS READ is enough to query the size and also works on read-only files
                    TRY
                        OPEN sfilename1 FOR BINARY ACCESS READ LOCK SHARED AS #1
                        ifilesize1 = LOF(1)
                        CLOSE #1
                    CATCH
                        CLOSE #1
                        STDOUT "problem reading file  "+sfilename1
                        FUNCTION=16
                        EXIT FUNCTION
                    END TRY
                
                    TRY
                        OPEN sfilename2 FOR BINARY ACCESS READ LOCK SHARED AS #2
                        ifilesize2 = LOF(2)
                        CLOSE #2
                    CATCH
                        CLOSE #2
                        STDOUT "problem reading file  "+sfilename2
                        FUNCTION=17
                        EXIT FUNCTION
                    END TRY
                
                    IF ifilesize1=ifilesize2 THEN FUNCTION=0
                END FUNCTION

                Batch file to retrieve webpages and compare them.
                You can start off with 785539 to get this thread.
                VAR1 is the thread number.
                VAR2 is the page number within the thread.
                https://forum.powerbasic.com/forum/u...R1%/page%VAR2%

                The batch file retrieves one more webpage than needed. When that page compares equal in size to the previous one, it deletes the last webpage retrieved and goes to the next higher thread number.
                But webpages requested with the link above will not work.
                Even though the thread number lookup was wrong, it took 18 minutes to pull 500 threads with their pages, extra pages included.
                Maybe somebody has a way to get the proper webpages.

                Code:
                @ECHO OFF
                rem    DEL *.HTM
                SET VAR1=0
                SET VAR2=0
                SET VAR3=0
                SET VAR4=0
                
                SET /A VAR1=0
                SET /A VAR2=0
                SET /A VAR3=0
                SET /A VAR4=0
                
                SET /A VAR1=785530
                
                :LOOP1
                SET /A VAR1=%VAR1%+1
                SET /A VAR2=0
                SET /A VAR3=0
                SET /A VAR4=0
                ECHO. thread number %VAR1%
                IF %VAR1% GTR 785540 GOTO END
                
                :LOOP2
                IF %VAR2% GTR 300 GOTO END
                SET /A VAR3=%VAR2%
                SET /A VAR2=%VAR2%+1
                SET /A VAR4=0
                
                :GETPAGEAGAIN
                SET /A VAR4=%VAR4%+1
                IF %VAR4% GTR 10 GOTO LOOP1
                IF EXIST %VAR1%-%VAR2%.htm DEL %VAR1%-%VAR2%.htm
                wget   -q -O %VAR1%-%VAR2%.htm --tries=10 "https://forum.powerbasic.com/forum/user-to-user-discussions/programming/%VAR1%/page%VAR2%"
                IF not EXIST %VAR1%-%VAR2%.htm GOTO GETPAGEAGAIN
                IF %VAR2% LSS 2 GOTO LOOP2
                
                IF NOT  EXIST %VAR1%-%VAR3%.htm GOTO LOOP1
                C:\PBCC60\FILESIZEEQUAL %VAR1%-%VAR3%.htm %VAR1%-%VAR2%.htm
                IF %ERRORLEVEL% == 11  GOTO LOOP1
                IF %ERRORLEVEL% == 1   GOTO LOOP2
                IF %ERRORLEVEL% == 0   GOTO DELETELASTWEBPAGEFILE
                GOTO LOOP2
                
                :DELETELASTWEBPAGEFILE
                DEL %VAR1%-%VAR2%.htm
                GOTO LOOP1
                
                :END
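
The stopping rule used by the batch file (fetch one page too many, then drop it when its size matches the previous page) can be sketched in Python as below. This is only an illustration of the idea, not a replacement for the batch file: `fetch` is a stand-in for whatever does the download (wget in the post), and `fake_fetch` with its page contents is invented so the sketch runs without touching the forum.

```python
def pages_until_repeat(fetch, thread, max_pages=300):
    """Collect the pages of one thread until a page has the same size as the one before it."""
    pages = []
    for page in range(1, max_pages + 1):
        body = fetch(thread, page)
        if pages and len(body) == len(pages[-1]):
            # Same size as the previous page: assume we ran past the last
            # real page and discard this extra download.
            break
        pages.append(body)
    return pages

def fake_fetch(thread, page):
    """Invented stand-in: requests past page 2 return page 2 again."""
    real = {1: b"<html>page one</html>", 2: b"<html>second page!</html>"}
    return real.get(page, real[2])

kept = pages_until_repeat(fake_fetch, 785530)
print(len(kept), "pages kept")
```

Comparing sizes is a heuristic. Two different pages can happen to have the same length, and a repeated page can differ by a few bytes if the server embeds a timestamp or token, so comparing a hash of the scrubbed text may be more reliable.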

                htmlastext configuration file for the program HTMLAsText from NirSoft
                https://www.nirsoft.net/utils/htmlastext.html
                Run the batch file in directory c:\pbwebpages\retrieve
                Copy the htm files to c:\pbwebpages\retrieve\temp
                Use the command line c:\folder\htmlastext /run c:\pbwebpages\retrieve\htmlastext.cfg

                Name the file htmlastext.cfg in the c:\pbwebpages\retrieve folder.
                The configuration file can be built (saved) from inside the htmlastext program.
                Code:
                [Config]
                OpenInNotepad=0
                CharsPerLine=75
                Source=c:\pbwebpages\retrieve\temp\*.htm
                Dest=c:\pbwebpages\retrieve\temp\*.txt
                SkipTitleText=0
                AddLineUnderHeader=0
                SkipTableHeaderText=0
                TableCellDelimit=1
                HeadingLineChars=======
                HorRuleChar==
                ListChars=*[email protected]#
                ConvertMode=2
                AllowCenterText=0
                AllowRightText=0
                DLSpc=8
                LinksDisplayFormat=%T
                EncloseBoldCharsStart=<<
                EncloseBoldCharsEnd=>>
                EncloseBold=0
                SubFolders=0

                After the conversion of html (htm in this case) to txt, many of the txt files should look the same.
                Then just delete the txt files with the highest number.

                The htm files will be named like threadnumber-pagenumber.htm
                e.g. for this thread 785530-1.htm, 785530-2.htm, etc.

                The text from the webpages seems easy enough to read in order to scrub out the wanted info.
                You're not going to get the attachments as far as I know, but I have not tested that.
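
If you would rather not depend on an external converter, the html to text step can also be roughed out with Python's standard library. This is a minimal sketch and makes no attempt to match the HTMLAsText output format. The class name `TextExtractor` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)

def html_to_text(html):
    """Return the text chunks of an html string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>forum</b></p><script>var x=1;</script></body></html>"
)
print(html_to_text(sample))
```

Pulling out individual posts, authors and dates would still need rules specific to the forum's markup, which this sketch does not attempt.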





                p purvis



                • #28
                  Hi David!

                  ...threads by screen scraping
                  No, I get the threads by using the base thread URL with the URLDownloadToFile API. Then I parse each one to see how many extra pages of posts there are and download those secondary pages as well. Then I combine all of the pages of a thread into a single file.
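
Until the real code is posted, here is a rough Python sketch of that three step flow: download the base page, work out how many pages there are, then download and combine the rest. It is not gbThreads code. `fetch` stands in for the URLDownloadToFile call, the page count is guessed from `/pageN` links in the first page (an assumption about the markup), and the `example.invalid` URLs and page contents are invented so the sketch runs offline.

```python
import re

PAGE_LINK = re.compile(r"/page(\d+)")

def last_page_number(html):
    """Highest N found in any /pageN link, or 1 if there are none."""
    numbers = [int(n) for n in PAGE_LINK.findall(html)]
    return max(numbers, default=1)

def download_thread(fetch, base_url):
    """Fetch the base page plus every secondary page and join them into one string."""
    first = fetch(base_url)
    pages = [first]
    for n in range(2, last_page_number(first) + 1):
        pages.append(fetch(f"{base_url}/page{n}"))
    return "\n".join(pages)

# Invented offline data standing in for the real downloads.
BASE = "https://example.invalid/thread/1"
SITE = {
    BASE: '<a href="/thread/1/page2">2</a> <a href="/thread/1/page3">3</a> first',
    BASE + "/page2": "second",
    BASE + "/page3": "third",
}

def fake_fetch(url):
    return SITE[url]

combined = download_thread(fake_fetch, BASE)
print(combined)
```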

                  An example thread URL ...

                  I'll post some example code this weekend. I'm under a big time crunch at the moment - haven't touched gbThreads in weeks now!



                  • #29
                    Thanks Gary - No rush at all. Still just pondering. I am still thinking getting a copy of the database backup file might be a good idea.



                    • #30
                      I have some code to post on retrieving PowerBASIC forum webpages and finding whether there are duplicate links to threads, as there are many that are generated by the vBulletin software.
                      I would hate to think that something could happen where all the work put into these forums is lost, from damage by a hacker or from loss of support by many unselfish people.
                      You cannot look up threads as in version 4 of vBulletin using showthread.php, e.g. https://forum.powerbasic.com/forum/s...d.php?t=######
                      I will post most of the code in zip attachments so it should not be viewable by non-members of the forum.
                      p purvis



                      • #31
                        phase 1 of 2
                        Get the threads into html webpage file format using a batch program and 2 programs: wget and one I wrote.

                        batch process
                        Even though this starts at thread 1 and ends at thread 9999999, it should be broken down into several batches to run concurrently.
                        Right now 800000 to 999999 are of no use and, as of testing now, the same might apply to 3000000.
                        This is because thread numbers in those ranges are duplicates that point to threads with lower numbers.
                        That was likely due to webserver changes.
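
To break the run into several batches, the thread number range just needs to be cut into contiguous pieces, one per batch window. A small Python sketch of that arithmetic follows. The function name `split_range` and the choice of 4 parts are only for illustration, and the upper bound 799999 is taken from the remark above that 800000 and up are of no use.

```python
def split_range(first, last, parts):
    """Split first..last (inclusive) into up to `parts` contiguous (start, end) ranges."""
    total = last - first + 1
    size, extra = divmod(total, parts)
    ranges = []
    start = first
    for i in range(parts):
        # The first `extra` ranges get one more item so nothing is left over.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break
        ranges.append((start, start + length - 1))
        start += length
    return ranges

for low, high in split_range(1, 799999, 4):
    print(f"batch covers threads {low} to {high}")
```

Each pair could then be used as the starting VAR1 value and the GTR limit in its own copy of the batch file.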

                        Code:
                        @ECHO OFF
                        rem VAR3 is not used in this batch file
                        
                        SET VAR1=0
                        SET VAR2=0
                        REM ....................... SET VAR3=0
                        SET VAR4=0
                        
                        REM ....................... SET /A VAR1=0
                        SET /A VAR1=0
                        
                        SET /A VAR2=0
                        REM ....................... SET /A VAR3=0
                        SET /A VAR4=0
                        
                        
                        
                        :LOOP1
                        SET /A VAR1=%VAR1%+1
                        SET /A VAR2=0
                        REM ....................... SET /A VAR3=0
                        SET /A VAR4=0
                        REM ....................... ECHO. thread number %VAR1%
                        IF %VAR1% GTR 999999 GOTO END
                        
                        :LOOP2
                        IF %VAR2% GTR 300 GOTO END
                        REM ....................... SET /A VAR3=%VAR2
                        SET /A VAR2=%VAR2%+1
                        SET /A VAR4=0
                        
                        :GETPAGEAGAIN
                        SET /A VAR4=%VAR4%+1
                        IF %VAR4% GTR 10 GOTO LOOP1
                        IF EXIST %VAR1%-%VAR2%.htm DEL %VAR1%-%VAR2%.htm
                        REM ....................... wget   -q -O %VAR1%-%VAR2%.htm --tries=10 "http://powerbasic.com/support/pbforums/showthread.php?t=%VAR1%/page%VAR2%"
                        wget   -q -O %VAR1%-%VAR2%.htm --tries=10 "https://forum.powerbasic.com/forum/user-to-user-discussions/%VAR1%/page%VAR2%"
                        IF NOT EXIST %VAR1%-%VAR2%.htm GOTO GETPAGEAGAIN
                        REM ....................... IF NOT EXIST %VAR1%-%VAR2%.htm GOTO LOOP1
                        C:\PBCC60\pbwebpagecountend %VAR1%-%VAR2%.htm
                        IF %ERRORLEVEL% == 2  GOTO REMOVEHTMLFILE
                        IF %ERRORLEVEL% == 1  GOTO LOOP2
                        IF %ERRORLEVEL% == 0  GOTO LOOP1
                        GOTO LOOP2
                        
                        :REMOVEHTMLFILE
                        IF EXIST %VAR1%-%VAR2%.htm DEL %VAR1%-%VAR2%.htm
                        GOTO LOOP1
                        
                        :END

                        The code for pbwebpagecountend is in an attachment. Its function is to stop getting webpages for a thread once the last page of the thread has been retrieved.
                        Attached Files
                        p purvis



                        • #32
                          phase 2 of 2
                          process all htm files received and rename html files with a different extension added if a duplicate link is pointed to with
                          either the canonical link html tag or the meta property html tag inside the html file.
                          if a html file does appear worth of keeping then an extension of .bad is added to the end of html file name.
                          if there seems to be no duplicate link using my functions and routines to check the html file, then the file name is left unchanged.

                          there are 2 routines to check for duplicate links to choose from but a single routine is decided up compile time.
                          I guess you do both but one or the other seems to return the same results at this time from a vbullettin ver 5 webpage.

                          An example of a thread number that is a duplicate link to another numbered thread, which is its canonical link, is the webpage of 89901:

                          https://forum.powerbasic.com/forum/u...cussions/89901
                          It will return thread number 3632, which can be seen by inspecting the webpage tags and in the url of your web browser.
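
The attached PowerBASIC routines are not reproduced here, but the check itself can be sketched in Python. The sketch reads the `link rel="canonical"` tag, falls back to a `meta property="og:url"` tag (that property name is my assumption for "the meta property html tag"), and pulls the thread number out of the URL. The sample markup and its `example.invalid` URL are invented. Only the 89901 to 3632 pairing comes from the post.

```python
import re
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the first canonical URL found in link or meta tags."""

    def __init__(self):
        super().__init__()
        self.url = None

    def handle_starttag(self, tag, attrs):
        if self.url is not None:
            return
        a = dict(attrs)
        if tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.url = a.get("href")
        elif tag == "meta" and a.get("property") == "og:url":  # assumed property name
            self.url = a.get("content")

def canonical_thread_number(html):
    """Thread number at the end of the canonical URL, or None if not found."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.url:
        return None
    # Accepts .../3632, .../3632-some-title and an optional trailing /pageN.
    match = re.search(r"/(\d+)(?:-[^/]*)?(?:/page\d+)?/?$", finder.url)
    return int(match.group(1)) if match else None

def is_duplicate(requested, html):
    """True when the page's canonical thread differs from the one requested."""
    canonical = canonical_thread_number(html)
    return canonical is not None and canonical != requested

sample = '<head><link rel="canonical" href="https://example.invalid/forum/x/3632-some-title"></head>'
print(canonical_thread_number(sample), is_duplicate(89901, sample))
```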

                          So for the most part, there are a whole lot of thread numbers that point to a different thread.
                          From my research, vBulletin took a big hit of broken links, or I should say confused links, when version 5 came out and showthread.php was no longer supported.

                          Well, that will get you started in the right direction. I hope it does not come to that soon.
                          Scrubbing the html files will have to be up to somebody else.
                          I really like PowerBASIC and I have recommended it to others, but there are some hard heads out there, plenty of them.
                          Attached Files
                          p purvis



                          • #33
                            Originally posted by David Clarke View Post
                            Thanks Eric!

                            Gary, am I correct in my understanding that you have downloaded all the threads by screen scraping? Your data would be great to try.
                            http://www.httrack.com/



                            • #34
                              Hey Knuth. Gary uses the file download API
                              then enumerates through the page numbers.
                              He likely does not get any web info that requires a login.
                              From testing, which is in progress now, I might have some insight on how to reduce getting duplicate webpages.
                              It is based on acquiring the canonical tag of the first webpage, then sorting and removing any thread numbers that point to a different thread.
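
The sort and remove step could look something like this Python sketch, assuming you already have a table of requested thread number to canonical thread number taken from each first page. The function names are invented, and the sample table is made up apart from the 89901 to 3632 pair mentioned earlier in the thread.

```python
def keep_originals(canonical_by_thread):
    """Sorted thread numbers whose canonical link points back to themselves."""
    return sorted(t for t, c in canonical_by_thread.items() if c == t)

def duplicates_by_target(canonical_by_thread):
    """Map each canonical thread to the sorted list of other numbers that point at it."""
    groups = {}
    for thread, canonical in sorted(canonical_by_thread.items()):
        if canonical != thread:
            groups.setdefault(canonical, []).append(thread)
    return groups

# Sample table: requested thread number -> canonical thread number.
table = {89901: 3632, 3632: 3632, 785530: 785530}

print(keep_originals(table))
print(duplicates_by_target(table))
```

Only the numbers returned by `keep_originals` would then need their remaining pages downloaded.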
                              p purvis

