Parse Question

  • Parse Question

    I have text something like this, where there can be variable multiple spaces between words.

    one   two      three
    I'd like the PARSE functions to treat multiple spaces as a single space, but that's not how they seem to work.

    I can do this, where 20 is the max number of spaces that will happen.

    Code:
        For i = 20 To 2 Step -1
            Replace Space$(i) With " " In temp$
        Next i
    That solution seems inelegant, not to mention that I have to guess the max number of spaces. Is there a simpler way anyone uses?
    Last edited by Gary Beene; 9 Sep 2009, 10:45 AM.

  • #2
    Been there, done that:

    http://www.powerbasic.com/support/pb...ad.php?t=21402

    The same "problem" occurs with both TALLY and PARSECOUNT.

    The way I solved the double space problem is with (close to yours):

    Replace dots with spaces

    Code:
    do
    p = instr(mainstring$, "..")
    if p = 0 then exit loop
    replace ".." with "." in mainstring$
    loop
    Last edited by Mel Bishop; 9 Sep 2009, 10:52 AM.
    There are no atheists in a fox hole or the morning of a math test.
    If my flag offends you, I'll help you pack.



    • #3
      Yeah, I've done the loop thing before too. Here are three different loops that might be useful:

      Code:
      #COMPILE EXE
      #DIM ALL
      
      FUNCTION PBMAIN () AS LONG
      
          LOCAL lPos          AS LONG
          LOCAL sMainString   AS STRING
      
          ' REPLACE method
          sMainString = "one two  three   four"
          WHILE INSTR(sMainString, "  ")
              REPLACE "  " WITH " " IN sMainString
          WEND
          MSGBOX sMainString
      
          ' REGREPL method
          sMainString = "one two  three   four"
          lPos = 1
          DO
              REGREPL "[ ]+" IN sMainString WITH " " AT lPos TO lPos, sMainString
          LOOP UNTIL lPos = 0
          MSGBOX sMainString
      
          ' change delimiter method
          sMainString = "one two  three   four"
          lPos = INSTR(sMainString, " ")
          WHILE lPos
              MID$(sMainString, lPos, 1) = ","
              lPos = INSTR(lPos + 1, sMainString, ANY CHR$(0 TO 31, 33 TO 255))
              lPos = INSTR(lPos, sMainString, " ")
          WEND
          ' now you can use PARSE$ with the comma delimiter (or whatever you choose to use)
          ' just remember to do an LTRIM$ to get rid of any extra spaces
          MSGBOX sMainString
      
      END FUNCTION
      I'm not sure which is the best but I would think the last one might be most efficient as it modifies the existing string in memory instead of making a new string each time things get replaced.
      Jeff Blakeney



      • #4
        REGEXPR supports the "\b" (word-break) meta-character.
        Michael Mattias
        Tal Systems (retired)
        Port Washington WI USA
        [email protected]
        http://www.talsystems.com



        • #5
          Also, you may want to pick up the results of the First and Last "programming contest" and see how other people solved the "find words in text" problem - some 765,000 or so words as a matter of fact - the text of The Bible.

          Complete package (source, executables, and judges grades and comments) available at my web site:
          Summer 2005 Progamming Contest: Winners' Code and Judges' Scoring

          MCM


          • #6
            Actually Gary, your solution is good, your only error is in the number of loops. Remember that each loop builds on the previous one, and a starting value of 7 will result in removal of up to 27 spaces. Your "20" start value removes something like up to 120 spaces. That's way more loops than you need, and hence is much slower.

            Added: I just realized my arithmetic is way off. A starting value of 7 will remove WAY more than 27 spaces--hundreds I think, and your 20 might be in the THOUSANDS.

            Added: For example, in the code below, a starting value of only 4 removes even the 37-space part of the string.
            Code:
            #COMPILE EXE
            #DIM ALL
            
            FUNCTION PBMAIN () AS LONG
                LOCAL lineo AS STRING, ii AS LONG
                
                lineo = REPEAT$(35000, "            3    4444     3 1                11  1111                                     45 44")
                'this replaces way more than 7+6+5+4+3+2 = 27 spaces, with " " (single space) and you can
                'add or subtract from the start iteration count to handle exponentially larger spaces
                FOR ii = 4 TO 2 STEP -1
                   REPLACE SPACE$(ii) WITH " " IN lineo
                NEXT
                
                ? LEFT$(lineo, 2000)
            END FUNCTION
            Last edited by John Gleason; 9 Sep 2009, 01:18 PM.



            • #7
              Wow, I tested a couple big-space strings, and using a start value of 7 in the loop, the maximum space size that will be removed is 1426. Using a start value of 20... get this... I gave up testing at a max space size of 40,000,000!!
              Last edited by John Gleason; 9 Sep 2009, 03:52 PM. Reason: calculated exact value at 7



              • #8
                Wouldn't it be more efficient to use powers of two, successively replacing 2^n thru 2^1 spaces with a single space?



                • #9
                  Code:
                         REPLACE SPACE$(32) WITH " " IN lineo
                         REPLACE SPACE$(16) WITH " " IN lineo
                         REPLACE SPACE$(8) WITH " " IN lineo
                         REPLACE SPACE$(4) WITH " " IN lineo
                         REPLACE SPACE$(2) WITH " " IN lineo
                  Chris, the above almost works, it just needs one addition, a 3:
                  Code:
                                                                'max spaces replaced
                         REPLACE SPACE$(32) WITH " " IN lineo   '11806
                         REPLACE SPACE$(16) WITH " " IN lineo   '398
                         REPLACE SPACE$(8) WITH " " IN lineo    '38
                         REPLACE SPACE$(4) WITH " " IN lineo    '10
                         REPLACE SPACE$(3) WITH " " IN lineo    '4
                         REPLACE SPACE$(2) WITH " " IN lineo    '2
                  and it's good up to ~11,800 spaces using just 6 REPLACE statements.
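
The limits quoted in this thread can be re-derived with a small brute-force check. The sketch below is in Python rather than PowerBASIC, and it assumes that a single REPLACE makes one left-to-right pass over non-overlapping matches, which is how Python's `str.replace` behaves. The function names `collapse` and `max_safe_run` are made up for this illustration.

```python
def collapse(text, sizes):
    """Apply one replace pass per size, like a chain of REPLACE statements."""
    for size in sizes:
        text = text.replace(" " * size, " ")
    return text


def max_safe_run(sizes, limit=20000):
    """Largest n such that every run of 1..n spaces ends up as one space."""
    for n in range(1, limit + 1):
        if collapse(" " * n, sizes) != " ":
            return n - 1
    return limit  # no failure found below the search limit


print(max_safe_run([7, 6, 5, 4, 3, 2]))    # descending loop starting at 7
print(max_safe_run([32, 16, 8, 4, 3, 2]))  # powers of two plus the extra 3
print(max_safe_run([32, 16, 8, 4, 2]))     # powers of two only
```

Under that assumption the three calls give 1426, 11806 and 2, which agrees with the figures reported for the loop starting at 7 and for the six-statement version, and shows why the plain powers-of-two list stops working at a run of three spaces.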



                  • #10
                    John,

                    Thanks for digging that out for us. One line added to a loop does the trick.

                    Code:
                    For i = 5 To 1 Step -1             '32, 16, 8, 4, then 3, then 2
                          Replace Space$(2^i) With " " In temp$
                          If i = 2 Then Replace Space$(3) With " " In temp$
                    Next i
                    If I ever put more than 10K spaces in a string, someone shoot me!



                    • #11
                      I'd like the PARSE functions to treat multiple spaces as a single space, but that's not how they seem to work
                      Actually, you don't want that.

                      What if you are reading a CSV file, line by line?
                      Code:
                      A,B,,,E,F,G
                      Did you really want all three commas between B and E treated as one comma?

                      I didn't think so.

                      MCM
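
A quick way to see the CSV point is to compare field counts with and without collapsing the delimiter. This is a Python sketch for illustration only, using `str.split` as a stand-in for delimiter-by-delimiter parsing; it does not call any PowerBASIC function.

```python
line = "A,B,,,E,F,G"

# Splitting on every comma keeps the two empty fields between B and E.
fields = line.split(",")
print(len(fields), fields)

# Collapsing repeated commas first drops the empty fields,
# so E, F and G all shift two columns to the left.
squeezed = ",".join(part for part in line.split(",") if part)
print(squeezed.split(","))
```

The first split yields seven fields with `E` in the fifth position; after squeezing there are only five fields and `E` lands in the third.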


                      • #12
                        Not to sound like a broken record, but ...REGEXPR and REGREPL demo January 16, 2002

                        ... includes this deceptively-named subroutine:
                        Code:
                        FindEachWordInString:
                         TheMain = " Now is the time for    all good men" & $CRLF & "to come to the aid of his party."
                           LET mask = "[a-z]+\b"        ' \b = word boundary. CRLF is (undocumented) word boundary
                           GOSUB RegExprLoop
                             RETURN
                        !!!

                        MCM
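
For readers without the 2002 demo at hand, the same mask can be tried in Python's `re` module. This is only an approximation: Python's regex dialect is not PowerBASIC's, and the `IGNORECASE` flag is an assumption added here so that the capitalised "Now" matches.

```python
import re

text = " Now is the time for    all good men\r\nto come to the aid of his party."

# [a-z]+ grabs a run of letters; \b requires it to end on a word boundary,
# so runs of spaces, the CRLF and the final period all act as separators.
words = re.findall(r"[a-z]+\b", text, flags=re.IGNORECASE)
print(len(words), words)
```

In this sketch the irregular spacing and the line break need no special handling, because only the words themselves are matched.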


                        • #13
                          So ask for a new feature: reduce all multiple consecutive occurrences of a character to but one such occurrence. You know what you'd have? The Mercator "SQUEEZE" function, that's what...
                          Code:
                             Output = SQUEEZE (text_variable, " ")
                          MCM


                          • #14
                            Code:
                            FUNCTION Squeeze (SourceStringVar AS STRING, onechar AS STRING) AS STRING 
                            
                            LOCAL pSrc AS BYTE PTR, bLast AS BYTE
                            LOCAL S AS STRING, pDest AS BYTE PTR, L AS LONG, nChar AS LONG 
                            LOCAL I AS LONG, bTarget AS BYTE
                            
                             pSrc    = STRPTR (SourceStringVar)
                             L       =  LEN (SourceStringVar) 
                             bTarget = ASC (oneChar)
                            
                             S    =  STRING$(L, $NUL)   ' worst case, nothing to SQUEEZE 
                             pDest = STRPTR(S) 
                             bLast = 0?
                            
                             FOR I = 1 TO L 
                                IF (@pSrc XOR bTarget) OR (bLast XOR bTarget) then 
                                    ' not target char, or first occurrence of target char 
                                      INCR     nChar
                                      @pDest = @pSrc 
                                      INCR     pDest   ' where next one goes 
                                      bLast  = @pSrc
                                END IF               
                                bLast       = @pSrc 
                                INCR pSrc               ' next input char 
                            
                             NEXT
                             S = LEFT$(S, nChar) 
                             FUNCTION = S 
                            END FUNCTION
                            Seven minutes to write (and fix). Use it for a lifetime.

                            Fails if first character of source or target char = $NUL. Deal with it.
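
For anyone who finds the pointer version hard to read, the same single-pass idea can be sketched in Python. This is an illustrative rewrite, not a translation of the function above: the name `squeeze` and its parameters are chosen here, and because it starts with "no previous character" instead of a zero byte, the $NUL caveat does not carry over.

```python
def squeeze(source: str, one_char: str) -> str:
    """Collapse each run of one_char in source to a single occurrence."""
    out = []
    last = None  # no previous character yet, so a leading one_char is kept
    for ch in source:
        # Copy the character unless it repeats the target character.
        if ch != one_char or last != one_char:
            out.append(ch)
        last = ch
    return "".join(out)


print(squeeze("one two  three   four", " "))
print(squeeze("A,B,,,E", ","))
```

Each input character is examined once and the run length never matters, which is the main difference from the repeated REPLACE approaches earlier in the thread.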


                            MCM
                            Last edited by Michael Mattias; 9 Sep 2009, 05:12 PM. Reason: Had made error and needed to correct.


                            • #15
                              Jeff,

                              Thanks for the examples.

                              In your example using RegRepl, I don't believe the Do/Loop is required. In my test, this single line will reduce all lengths of spaces to a single space.

                              Code:
                              RegRepl "[ ]+" In temp$ With " " At 1 To iEnd, tmp$
                              Can you clarify why the loop is needed?
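
As a point of comparison only, here is the same mask in Python, where a substitution call replaces every non-overlapping match in one go. Whether REGREPL does the same in a given PowerBASIC version is not something this sketch can confirm; it only shows that the pattern itself does not need a loop when the engine replaces all matches.

```python
import re

temp = "one two  three   four"

# "[ ]+" matches a whole run of spaces, so each run becomes one space.
result = re.sub(r"[ ]+", " ", temp)
print(repr(result))
```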



                              • #16
                                So here are my three take-aways. Thanks for the comments, everyone.

                                I did a simple timer and with a single 10K space reduction, got essentially zero on every test.

                                When I ran the compression 500 times, the results were 0.06, 0.69, 0.03 for the three examples below.

                                Code:
                                'Example #1
                                'Credit: John Gleason / Chris Holbrook
                                For i = 5 to 1 Step -1               'good for 11806 spaces
                                   Replace Space$(2^i) With " " In temp$
                                   If i = 2 then Replace Space$(3) With " " In temp$
                                Next i
                                Dim D(ParseCount(temp$, " ")) As String
                                
                                'Example #2
                                'Credit: Jeff Blakeney
                                While Instr(temp$, "  ")
                                   Replace "  " With " " In temp$
                                Wend
                                
                                'Example#3
                                'Credit: Jeff Blakeney
                                iEnd = Len(temp$)
                                RegRepl "[ ]+" In temp$ With " " At 1 To iEnd, tmp$



                                • #17
                                  Gary, you don't have to put the condition inside the loop in example #1.
                                  Also it *may* be more efficient to use a shift operator than to use the exponentiation operator.



                                  • #18
                                    Gary,
                                    Why not make things simple? (forget spaces that you have to count, use Space$ instead so years later you are not manually counting???)

                                    Code:
                                    #COMPILE EXE
                                    #DIM ALL
                                    
                                    FUNCTION PBMAIN () AS LONG
                                         LOCAL SpaceCount AS LONG
                                         LOCAL i AS LONG
                                         LOCAL sMainString   AS STRING
                                         LOCAL TempString AS STRING
                                        ' REPLACE method
                                        sMainString = "one two  three   four" + $TAB + "Five" + $TAB + $TAB + "Six" + $TAB + $TAB + $TAB + " Seven "
                                        FOR i = 1 TO LEN(sMainString)
                                             SELECT CASE MID$(sMainString, i, 1)
                                                  CASE SPACE$(1)
                                                       SpaceCount = SpaceCount + 1
                                                  CASE ELSE
                                             END SELECT
                                         NEXT i
                                    '*** Sort Reverse Order (Most Spaces to Fewest spaces)
                                         FOR i = SpaceCount TO 1 STEP -1                   'Leave room for 1 space per division
                                              SELECT CASE SpaceCount
                                                   CASE 1
                                                   CASE ELSE
                                                        REPLACE SPACE$(i) WITH SPACE$(1) IN sMainString
                                              END SELECT
                                         NEXT i
                                    '*** Now Parse the string or msgbox or whatever
                                         MSGBOX sMainString
                                    END FUNCTION
                                    Engineer's Motto: If it aint broke take it apart and fix it

                                    "If at 1st you don't succeed... call it version 1.0"

                                    "Half of Programming is coding"....."The other 90% is DEBUGGING"

                                    "Document my code????" .... "WHYYY??? do you think they call it CODE? "



                                    • #19
                                      MCM,
                                      True, some situations call for the current PARSE action. I'd want to pick and choose when to apply the squeeze.

                                      I don't find a net reference to Mercator squeeze.

                                      Good point - the \b regex option can be especially valuable when multiple, different characters are present - such as this.

                                      one, two, three
                                      Limiting it to spaces, as in the takeaways I listed, isn't as powerful. RegEx has potential I seldom take advantage of in BASIC, although I use it fairly often in Perl.

                                      And finally, dang, your SQUEEZE function is hardly viewer friendly. Does it do something one of the other examples doesn't? Is it faster, using pointers to work on the original string?

                                      Chris,
                                      No? In a 3 space example, then a 4-action does nothing, a 2-action leaves 2 behind, and a followup 3-action does nothing - result is a 2-space string. Doesn't this mean the 3-action has to be in sequence?

                                      Cliff,
                                      Yes, I'd rather not have the quantity of spaces be a factor. So examples #2 and #3 would be safer, although #1 works for an improbably large number of delimiters.

                                      Thanks again, guys.
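
The ordering question can be checked on the 3-space case directly. The sketch below is in Python and assumes each REPLACE behaves like one `str.replace` pass; `collapse` is a helper name invented for this illustration.

```python
def collapse(text, sizes):
    """Apply one replace pass per size, in the order given."""
    for size in sizes:
        text = text.replace(" " * size, " ")
    return text


three = "a" + " " * 3 + "b"

print(repr(collapse(three, [4, 3, 2])))  # 3-pass runs before the 2-pass
print(repr(collapse(three, [4, 2, 3])))  # 3-pass runs after the 2-pass
```

With the 3-pass ahead of the 2-pass the result is `'a b'`; with the 3-pass last, two spaces remain, so the extra replace only helps if it sits between the 4 and the 2.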



                                      • #20
                                        Originally posted by Gary Beene:
                                        Chris,
                                        No? In a 3 space example, then a 4-action does nothing, a 2-action leaves 2 behind, and a followup 3-action does nothing - result is a 2-space string. Doesn't this mean the 3-action has to be in sequence?
                                        It will still be in sequence if executed after the loop.

                                        The point about 2^n vs shift left 2 n is not quite so important in your example as there are few iterations, but ISTR that the shift method is a lot (several times) faster.

