Regular Expression Problem

  • Regular Expression Problem

    Hi Folks,

    I am having a little difficulty implementing a PB/Win 9.01 regular expression. It is based on a UK Government published regular expression which is designed to validate UK postcodes. My PB version of the regex looks as though it should work to me but it just isn't matching as I expect.

    I have prepared an example program below including notes etc. and would be grateful if anyone would take a look at it to see if I am doing something wrong.

    Any comments gratefully received.

    Thanks,

    David

    Code:
    ' Module Name : postcode_regex_test.bas
    ' Platform    : PB/Win 9.01
    '               Windows XP Professional SP3
    '
    ' Purpose     : To implement a PB version of the UK BS7666
    '               postcode validation regular expression provided
    '               at...
    '
    '               http://www.govtalk.gov.uk/gdsc/schemas/bs7666-v2-0.xsd
    '
    '               The 'complex pattern' for UK postcodes
    '               (to be found under the PostCodeType node of the schema)
    '               is as follows...
    '
    ' "(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})"
    '
    '               This pattern implements the rules given at...
    '               http://www.govtalk.gov.uk/gdsc/html/frames/PostCode.htm
    '
    ' Issue       : My PB version doesn't work, am I doing something wrong?
    
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
    
        DIM a AS STRING
        DIM b AS STRING
        DIM c AS STRING
        DIM position AS LONG
        DIM length AS LONG
    
        'Un-comment the following postcode lines one at a time
        'and re-compile to test each one.
    
        'Example UK Postcode  Format
        a$ = "GIR 0AA"       '1) GIR 0AA  - matches
        'a$ = "M1 1AA"        '2) AN NAA   - no match
        'a$ = "M60 1NW"       '3) ANN NAA  - no match
        'a$ = "CR2 6XH"       '4) AAN NAA  - no match
        'a$ = "DN55 1PT"      '5) AANN NAA - no match
        'a$ = "W1A 1HQ"       '6) ANA NAA  - no match
        'a$ = "EC1A 1BB"      '7) AANA NAA - no match
    
        'test 1 - A PB compatible version of the BS7666
        'postcode validation regular expression.
        'It produces no matches on the example postcodes
        'numbered 2 to 7, but I think perhaps it should.
        '
        '*NOTE: The individual sub-patterns match data ok
        'when tested independently of the full expression.
    
        b$ =      "(GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|"
        b$ = b$ & "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
        b$ = b$ & "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
        b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"
        b$ = b$ & " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])"
    	 
        'When the regex above is tabbed out for clarity as shown
        'below, it all seems logical. However only the (GIR 0AA) pattern
        'produces a match against the example postcodes listed above.
        '
        '(
        'GIR 0AA
        ')
        '|
        '(	
        '	(			
        '		([A-PR-UWYZ][0-9][0-9]?)
        '		|
        '		(
        '			([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)
        '			|
        '			(	
        '				([A-PR-UWYZ][0-9][A-HJKSTUW])
        '				|
        '				([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])
        '			)
        '		)	
        '	)
        ' [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
        ')
    	
        RESET position& : RESET length&
        DO
          position& = position& + length&
          REGEXPR b$ IN a$ AT position& TO _
                position&, length&
            IF position& <> 0 THEN
                c$ = "Match at position : " & FORMAT$(position&,"00") _
                & "    Data : " & CHR$(34) & MID$(a$,position&,length)  & CHR$(34)
                MSGBOX c$
            END IF
        LOOP WHILE position&
    END FUNCTION
    Last edited by David Warner; 10 Apr 2009, 05:31 PM.
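
The schema's pattern uses character class subtraction (for example `[A-Z-[QVX]]`), while the PB version above spells each class out as plain ranges. As a sanity check on that translation step only, here is a minimal sketch in Python (not PowerBASIC, so it says nothing about how REGEXPR behaves). The helper name `accepted` and the `TRANSLATIONS` table are made up for this illustration.

```python
import re
import string

UPPER = set(string.ascii_uppercase)


def accepted(char_class):
    """Return the set of uppercase letters a plain character class matches."""
    pattern = re.compile(char_class)
    return {ch for ch in string.ascii_uppercase if pattern.fullmatch(ch)}


# XSD subtraction class -> (letters removed, plain-range class used in b$)
TRANSLATIONS = {
    "[A-Z-[QVX]]": ("QVX", "[A-PR-UWYZ]"),
    "[A-Z-[IJZ]]": ("IJZ", "[A-HK-Y]"),
    "[A-Z-[CIKMOV]]": ("CIKMOV", "[ABD-HJLNP-UW-Z]"),
}

for xsd_class, (removed, plain_class) in TRANSLATIONS.items():
    expected = UPPER - set(removed)
    # The plain-range class must accept exactly A-Z minus the removed letters.
    assert accepted(plain_class) == expected, xsd_class
    print(xsd_class, "->", plain_class, "ok")
```

If the assertions hold, the three hand-expanded classes are faithful to the schema, which points the finger at the nesting and alternation rather than at the classes themselves.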

  • #2
    Dunno about Regular Expressions but the reason the first a$ ("GIR 0AA") matched is that is actually in the searched b$. Dunno why the others didn't. Maybe a "Regular" guy can tell. (I suspect a coding error.)

    =========================================
    "Make people think they are thinking,
    and they will love you for it.
    Make them really think,
    and they will hate you for it."
    Attributed both to Plato and Aristotle.
    =========================================
    It's a pretty day. I hope you enjoy it.

    Gösta

    JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
    LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/


    • #3
      Hi Gosta,

      thanks for your reply.

      > the reason the first a$ ("GIR 0AA") matched is that is actually in the searched b$
      From my example, you just need to...
      Code:
          'Un-comment the following postcode lines one at a time
          'and re-compile to test each one.
      
          'Example UK Postcode  Format
          a$ = "GIR 0AA"       '1) GIR 0AA  - matches
          'a$ = "M1 1AA"        '2) AN NAA   - no match
          'a$ = "M60 1NW"       '3) ANN NAA  - no match
          'a$ = "CR2 6XH"       '4) AAN NAA  - no match
          'a$ = "DN55 1PT"      '5) AANN NAA - no match
          'a$ = "W1A 1HQ"       '6) ANA NAA  - no match
          'a$ = "EC1A 1BB"      '7) AANA NAA - no match
      The first one matches, the rest do not although I think they should.

      > (I suspect a coding error.)
      Always a possibility!

      Regards,

      David


      • #4
        Works if you change the line to:-
        b$ = a$+"|((([A-PR-UWYZ][0-9][0-9]?)|"

        As said above, you have hard-coded the postcode into b$

        Iain Johnstone
        “None but those who have experienced them can conceive of the enticements of science” - Mary Shelley


        • #5
          Hi Iain,

          Thanks for your reply.

          > As said above, you have hard-coded the postcode into b$
          The (GIR 0AA) postcode is correctly embedded into the regular expression in b$ because it is an exception to the other rules. It is a non-geographic UK postcode that relates to the UK Girobank and therefore has its own pattern in the regular expression.

          If you take a look at the original government-supplied regular expression, you will note its presence there.

          There is a reference to this at http://en.wikipedia.org/wiki/UK_postcodes

          > There are at least two exceptions (other than the overseas territories) to this format:
          >
          >   • the postcode for the formerly Post Office-owned Girobank is GIR 0AA.
          >   • the postcode for correctly addressed letters to Father Christmas is SAN TA1
          So thanks for your suggested change for

          b$ = a$+"|((([A-PR-UWYZ][0-9][0-9]?)|"

          but this incorrectly mixes the regular expression with the test data.

          My problem still stands I'm afraid.

          Regards,

          David
          Last edited by David Warner; 11 Apr 2009, 06:38 AM.


          • #6
            Here's a workaround, because the 1st and 2nd masks work, just not together for some reason I couldn't figure out.
            Code:
            #COMPILE EXE
            #DIM ALL
            
            FUNCTION PBMAIN () AS LONG
            
                LOCAL a, a2 AS STRING
                LOCAL b, b2 AS STRING
                LOCAL c AS STRING
                LOCAL position, position2 AS LONG
                LOCAL length, length2 AS LONG
            
                'Un-comment the following postcode lines one at a time
                'and re-compile to test each one.
            
                'Example UK Postcode  Format
            '    a$ = "GIR 0AA"       '1) GIR 0AA  - matches
                a$ = "M1 " :a2 = "1AA"'2) AN NAA   - no match
                a$ = "M60 ":a2 = "1NW"'3) ANN NAA  - no match
                'a$ = "CR2 6XH"       '4) AAN NAA  - no match
                'a$ = "DN55 1PT"      '5) AANN NAA - no match
                'a$ = "W1A 1HQ"       '6) ANA NAA  - no match
                'a$ = "EC1A 1BB"      '7) AANA NAA - no match
            
                'test 1 - A PB compatible version of the BS7666
                'postcode validation regular expression.
                'It produces no matches on the example postcodes
                'numbered 2 to 7, but I think perhaps it should.
                '
                '*NOTE: The individual sub-patterns match data ok
                'when tested independently of the full expression.
            
                b$ =      "(GIR 0AA)|(([A-PR-UWYZ][0-9][0-9]?)|"
                b$ = b$ & "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
                b$ = b$ & "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
                b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) "
            
                b2 = "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
            
                'When the regex above is tabbed out for clarity as shown
                'below, it all seems logical. However only the (GIR 0AA) pattern
                'produces a match against the example postcodes listed above.
                '
                '(
                'GIR 0AA
                ')
                '|
                '(
                '    (
                '       ([A-PR-UWYZ][0-9][0-9]?)
                '       |
                '       (
                '          ([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)
                '          |
                '          (
                '             ([A-PR-UWYZ][0-9][A-HJKSTUW])
                '             |
                '             ([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])
                '          )
                '       )
                '    )
                ' [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
                ')
            
                RESET position& : RESET length&
                RESET position2& : RESET length2&
                DO
                  position& = position& + length&
            '      !int 3
                  REGEXPR b$ IN a$ AT position& TO _
                        position&, length&
                    IF position& <> 0 THEN
                        REGEXPR b2 IN a2 AT position2& TO position2&, length2&
                        IF position2 <> 0 THEN
                           c$ = "Match at position : " & FORMAT$(position&,"00") _
                           & "    Data : " & CHR$(34) & MID$(a$,position&,length) & MID$(a2$,position2&,length2&) & CHR$(34)
                           MSGBOX c$
                        END IF
                    END IF
                LOOP WHILE position&
            END FUNCTION


            • #7
              By combining the two space-separated parts and ordering by increasing complexity (so simple patterns aren't found in more complex ones), and by adding a letter-letter-number option, this now seems to work. I seldom use reg expr's and now I kind of remember why.
              Code:
              #COMPILE EXE
              #DIM ALL
              
              FUNCTION PBMAIN () AS LONG
              
                  LOCAL a AS STRING
                  LOCAL b, b1, b1a, b2, b3, b4, b5 AS STRING
                  LOCAL c AS STRING
                  LOCAL position, position2 AS LONG
                  LOCAL length, length2 AS LONG
              
                  'Un-comment the following postcode lines one at a time
                  'and re-compile to test each one.
              
                  'Example UK Postcode  Format
                   a$ = "GIR 0AA"       '1) GIR 0AA  - matches
              '    a$ = "M1 1AA"        '2) AN NAA   - all ok now
              '    a$ = "M60 1NW"       '3) ANN NAA
              '    a$ = "CR2 6XH"       '4) AAN NAA << I added logic here which needs accuracy checking
              '    a$ = "DN55 1PT"      '5) AANN NAA
              '    a$ = "W1A 1HQ"       '6) ANA NAA
              '    a$ = "EC1A 1BB"      '7) AANA NAA
              
                  'test 1 - A PB compatible version of the BS7666
                  'postcode validation regular expression.
                  'It produces all matches on the example postcodes
                  b5 = "GIR 0AA"
                  b3 = "[A-PR-UWYZ][0-9][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
                  b4 = "[A-PR-UWYZ][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
                  b1 = "[A-PR-UWYZ][A-HK-Y][0-9][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
                  b1a= "[A-PR-UWYZ][A-HK-Y][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]" '<< check! may not be correct mask letters
                  b2 = "[A-PR-UWYZ][0-9][A-HJKSTUW] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
                  b  = "[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
              
                  RESET position& : RESET length&
                  RESET position2& : RESET length2&
                  DO
                    position& = position& + length&
                    position2 = position
                               
                    REGEXPR b$ IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
                    REGEXPR b1 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
                    REGEXPR b1a IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
                    REGEXPR b2 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
                    REGEXPR b3 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
                    REGEXPR b4 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
                    REGEXPR b5 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound
                    EXIT DO
                       matchFound:
                             c$ = "Match at position : " & FORMAT$(position&,"00") _
                             & "    Data : " & CHR$(34) & MID$(a$,position&,length) & CHR$(34)
                             MSGBOX c$
                  LOOP WHILE position&
              END FUNCTION
              Last edited by John Gleason; 11 Apr 2009, 12:22 PM. Reason: fixed comment from "no" to "all matches"


              • #8
                Originally posted by John Gleason:
                > I seldom use reg expr's and now I kind of remember why.
                RegExpr's can be very nice and very powerful. However, when they reach a certain level of complexity, they become quite a project to maintain. Just think about 3 years from now when some unanticipated case arises?

                I'd vote for byte-by-byte syntax directed parsing here...

                Best regards,

                Bob Zale
                PowerBASIC Inc.
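
For comparison, the character-by-character approach suggested above might look roughly like the following. This is a Python sketch rather than PowerBASIC, it encodes only the rules visible in the BS7666 pattern quoted in the first post, and every name in it (`valid_outcode`, `valid_incode`, `is_postcode`) is hypothetical.

```python
import string

UPPER = set(string.ascii_uppercase)
DIGITS = set(string.digits)

FIRST = UPPER - set("QVX")             # 1st letter of the outcode
SECOND = UPPER - set("IJZ")            # 2nd letter of the outcode, if present
THIRD = set("ABCDEFGHJKSTUW")          # trailing letter in the ANA format
FOURTH = set("ABEHMNPRVWXY")           # trailing letter in the AANA format
INCODE_LETTERS = UPPER - set("CIKMOV")


def valid_incode(incode):
    """Incode is always digit, letter, letter (NAA)."""
    return (
        len(incode) == 3
        and incode[0] in DIGITS
        and incode[1] in INCODE_LETTERS
        and incode[2] in INCODE_LETTERS
    )


def valid_outcode(outcode):
    """Outcode is one of AN, ANN, ANA, AAN, AANN, AANA."""
    if not 2 <= len(outcode) <= 4 or outcode[0] not in FIRST:
        return False
    rest = outcode[1:]
    if rest[0] in DIGITS:                      # AN, ANN or ANA
        if len(rest) == 1:
            return True
        return len(rest) == 2 and (rest[1] in DIGITS or rest[1] in THIRD)
    if rest[0] in SECOND:                      # AAN, AANN or AANA
        if len(rest) < 2 or rest[1] not in DIGITS:
            return False
        return len(rest) == 2 or rest[2] in DIGITS or rest[2] in FOURTH
    return False


def is_postcode(text):
    """Validate a whole string such as 'EC1A 1BB' (single space required)."""
    if text == "GIR 0AA":                      # the Girobank exception
        return True
    outcode, space, incode = text.partition(" ")
    return space == " " and valid_outcode(outcode) and valid_incode(incode)
```

The trade-off is the one raised in this thread: the rules become explicit and easy to step through in a debugger, but a revised BS7666 pattern would have to be re-read and re-coded by hand instead of pasted in.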


                • #9
                  David,

                  I don't understand what you are trying to achieve here (other than Regex experience). If it is a postal code lookup, wouldn't it be FAR simpler, and effectively as fast (unless tenths or hundredths of a second count), to just look up a sorted list of postal codes?

                  ===================================================
                  "There is no sincerer love than the love of food."
                  George Bernard Shaw (1856-1950)
                  ===================================================
                  Last edited by Gösta H. Lovgren-2; 11 Apr 2009, 02:52 PM.

                  Gösta



                  • #10
                    @John

                    Many thanks for posting your workarounds John, I appreciate you looking into this.

                    It is in fact perfectly possible to re-structure the patterns into a single working PB regular expression as follows...

                    Code:
                    'New BS7666 regular expression
                    'completely re-structured patterns for PowerBASIC
                    'One full pattern per postcode format
                    '
                    b$ =      "(GIR 0AA)|"
                    b$ = b$ & "([A-PR-UWYZ][0-9][A-HJKSTUW] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])|"
                    b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])|"
                    b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][0-9]? [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])|"
                    b$ = b$ & "([A-PR-UWYZ][0-9][0-9]? [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])"
                    As in your example I combined the two space separated elements (outcode and incode as they are known) into a full postcode pattern, which made it necessary to duplicate the second (incode) sub-pattern for each postcode format.

                    The original UK Gov regex defines a variety of outcode patterns and describes the incode once only.
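
One way to keep the flattened form without hand-duplicating the incode is to build the string from parts. The sketch below does that in Python (not PB; the same concatenation idea should carry over to building b$, but that is untested here). `OUTCODES`, `INCODE`, `FLAT_PATTERN` and `is_postcode` are names invented for the example.

```python
import re

# Incode is described once and appended to every outcode format.
INCODE = "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"

# One outcode pattern per postcode format, in the same order as b$ above.
OUTCODES = [
    "[A-PR-UWYZ][0-9][A-HJKSTUW]",             # ANA
    "[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]",  # AANA
    "[A-PR-UWYZ][A-HK-Y][0-9][0-9]?",          # AAN and AANN
    "[A-PR-UWYZ][0-9][0-9]?",                  # AN and ANN
]

FLAT_PATTERN = "|".join(
    ["(GIR 0AA)"] + ["(" + outcode + " " + INCODE + ")" for outcode in OUTCODES]
)
POSTCODE = re.compile(FLAT_PATTERN)


def is_postcode(text):
    """True if the whole string is a postcode under the flattened pattern."""
    return POSTCODE.fullmatch(text) is not None


for sample in ["GIR 0AA", "M1 1AA", "M60 1NW", "CR2 6XH",
               "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]:
    print(sample, is_postcode(sample))
```

If a new BS7666 pattern changed only the incode, only the `INCODE` line would need touching.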


                    @ Bob,

                    Thanks for your reply Bob,

                    As you can see in the example above, I have a working solution. However, I would prefer to use the published BS7666 regex (with minor pattern modifications) rather than having to re-structure it.

                    To address 'What happens 3 years from now when some unanticipated case arises?', as far as UK postcodes go, such a case will trigger the release of a new BS7666 regular expression. I would then like to use that with minimum modifications.

                    I would be grateful if you could answer my question (and I promise I am not asking you to write my code for me).

                    Can you see any reason why the following pattern (pulled from my example code above) does not work? It looks to me as though it should.

                    Code:
                    '(
                    'GIR 0AA
                    ')
                    '|
                    '(	
                    '	(			
                    '		([A-PR-UWYZ][0-9][0-9]?)
                    '		|
                    '		(
                    '			([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)
                    '			|
                    '			(	
                    '				([A-PR-UWYZ][0-9][A-HJKSTUW])
                    '				|
                    '				([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])
                    '			)
                    '		)	
                    '	)
                    ' [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
                    ')


                    @Gosta

                    I am not particularly looking for regex experience, I am simply attempting to use the most convenient tool available to me to implement a British Standard published regular expression.

                    There are approximately 1.71 million Postcodes in the UK and to use them you have to license them from the Royal Mail. I don't want to do that for this application.

                    Additionally, in historical databases you will find old postcodes that are not actually listed in the Royal Mail file but they do still follow a standard pattern.
                    Pattern matching comes in very handy in this situation.

                    Even more unpleasant is the case where a postcode is embedded into an address requiring it to be separated out. It is easier to say, 'Is there some data in this address line that looks like a postcode' than to check for 1.7 million specific postcodes in each line of an address.
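
The "does anything in this address line look like a postcode" check can be sketched as a pattern search. This is Python, not PB, and the `\b` word boundaries, the non-capturing groups and the sample address lines are additions for this illustration rather than part of BS7666; `find_postcodes` is a hypothetical helper name.

```python
import re

OUTCODE = (
    "(?:[A-PR-UWYZ][0-9][0-9]?"
    "|[A-PR-UWYZ][A-HK-Y][0-9][0-9]?"
    "|[A-PR-UWYZ][0-9][A-HJKSTUW]"
    "|[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])"
)
INCODE = "[0-9][ABD-HJLNP-UW-Z]{2}"

# Word boundaries stop the pattern matching inside a longer token.
POSTCODE = re.compile(rf"\b(?:GIR 0AA|{OUTCODE} {INCODE})\b")


def find_postcodes(line):
    """Return every postcode-shaped substring found in one address line."""
    return [match.group(0) for match in POSTCODE.finditer(line)]


# Made-up address lines, for illustration only.
print(find_postcodes("10 Downing Street, London SW1A 2AA"))
print(find_postcodes("Flat 2, 1 High Street, Manchester M1 1AA"))
print(find_postcodes("No postcode on this line"))
```

This only says that a substring has the right shape; it cannot tell whether the postcode was ever actually issued, which is the part that would need the Royal Mail file.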

                    I hope that helps you understand what I am trying to achieve here.

                    Regards,

                    David


                    • #11
                      >It is based on a UK Government published regular expression...

                      Just to make sure you are aware...

                      There is no one set of 'standards' for the formation of regular expressions. Simplest example is in one product, Ultra-Edit. With UE you have to choose between "unix style" and "Ultra-Edit-style" regular expressions (separate pages in UE help file and everything); there's also at least perl-style and, of course, "PowerBASIC-style" regular expression syntax.
                      Michael Mattias
                      Tal Systems (retired)
                      Port Washington WI USA
                      http://www.talsystems.com


                      • #12
                        Hi Michael,

                        thanks, yes I am aware that there are different 'flavours' of regular expressions. In fact there are a few words relating to this at the source of the regex in question.

                        http://www.govtalk.gov.uk/gdsc/schem...stCodeType.htm

                        > complex pattern for postcode, which matches definition, accepted by some parsers is: "(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})"
                        If the PB regular expression parser doesn't like this style of regex then so be it. I would like to know why though as it appears ok to me.
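
As a cross-check on the structure of the pattern (not on PB itself), the nested expression from the first post, with the subtraction classes already expanded to plain ranges, can be run through another engine. The sketch below uses Python's `re` module, a backtracking engine; how PB's REGEXPR handles nested alternation is a separate question that this does not answer.

```python
import re

# The b$ string from the first post, concatenated unchanged.
NESTED = (
    "(GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|"
    "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
    "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
    "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"
    " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])"
)

SAMPLES = ["GIR 0AA", "M1 1AA", "M60 1NW", "CR2 6XH",
           "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]


def matches(postcode):
    """True if the whole string matches the nested pattern."""
    return re.fullmatch(NESTED, postcode) is not None


for postcode in SAMPLES:
    print(postcode, matches(postcode))
```

Under this engine all seven samples match, which suggests the nesting is logically sound and that the difference lies in how the PB parser evaluates it.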

                        Regards,

                        David


                        • #13
                          Interestingly, I now have a reproducible application crash in my demo program (postcode_regex_test.bas above) when I shorten my PB version of the regex by removing the last two character classes (the two incode letters)...

                          Code:
                          b$ =      "(GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|"
                          b$ = b$ & "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
                          b$ = b$ & "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
                          b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"
                          b$ = b$ & " [0-9])"
                          When I run this expression against the 7 test postcodes, Format 1 (GIR 0AA) is matched ok but the application crashes on examples 2 to 7 with the following Windows error message...

                          > postcode_regex_test.exe has encountered a problem and needs to close. We are sorry for the inconvenience.
                          So with the longer expression the matching just fails and with the shorter expression the application crashes.

                          Can anyone else confirm this is happening?

                          Regards,

                          David


                          • #14
                            Since statements using regular expressions cannot be checked at compile time, you can get weird things happening at runtime up to and including protection faults.

                            I did not try with yours, but this has happened to me with other regular expressions in the past.
                            Michael Mattias


                            • #15
                              Originally posted by David Warner:
                              > @Gosta
                              >
                              > There are approximately 1.71 million Postcodes in the UK and to use them you have to license them from the Royal Mail. I don't want to do that for this application.

                              Don't understand the part about "licensing" them. (However, immaterial to discussion at hand). As for looking up a particular match in a 1.7m sorted list, a mere cakewalk, I'd bet well under 1/10th of a second, especially for an indexed or hashed list.

                              > Additionally, in historical databases you will find old postcodes that are not actually listed in the Royal Mail file but they do still follow a standard pattern.
                              > Pattern matching comes in very handy in this situation.

                              Okay, fair enough (I guess). Not sure I understand though. (Not important either).

                              > Even more unpleasant is the case where a postcode is embedded into an address requiring it to be separated out. It is easier to say, 'Is there some data in this address line that looks like a postcode' than to check for 1.7 million specific postcodes in each line of an address.
                              >
                              > I hope that helps you understand what I am trying to achieve here.

                              Dunno the convention in the UK but in the US, the zip code (5 or 9 numerics) is either at the end of the last address line, or is the entire last address line. Again, it's probably immaterial as you have determined (as ONLY you can, it's your app.) a Regex pattern search is the way to go. I can see that if one is searching documents for embedded addresses (for example). Just can't wrap my head around using Regex for lookups in a list, as I initially presumed this app to be.

                              ==========================
                              Patience, n.
                              A minor form of despair,
                              disguised as a virtue.
                              Ambrose Bierce (1842-1914)
                              ==========================

                              Gösta



                              • #16
                                Originally posted by Michael Mattias:
                                > Since statements using regular expressions cannot be checked at compile time, you can get weird things happening at runtime up to and including protection faults.
                                >
                                > I did not try with yours, but this has happened to me with other regular expressions in the past.

                                This is one reason I switched to pcre3. While PB's regex are powerful, they differ just enough from the "de facto standard" pcre that testing can take up a large amount of time. There are many places on the web to test regex, so that's what I do now.

                                James


                                • #17
                                  Glib's regex might be another choice. José recently translated this function-rich library to PowerBASIC. I've done some preliminary tests and so far all regex that work with pcre3 also work with the Glib incarnation.

                                  James
