Regular Expression Problem


  • jcfuller
    replied
    GLib's regex might be another choice. José recently translated this function-rich library to PowerBASIC. I've done some preliminary tests and so far every regex that works with pcre3 also works with the GLib incarnation.

    James



  • jcfuller
    replied
    Originally posted by Michael Mattias
    > Since statements using regular expressions cannot be checked at compile time, you can get weird things happening at runtime up to and including protection faults.
    >
    > I did not try with yours, but this has happened to me with other regular expressions in the past.

    This is one reason I switched to pcre3. While PB's regexes are powerful, they differ just enough from the de facto standard, PCRE, that testing can take up a large amount of time. There are many places on the web to test regexes, so that's what I do now.

    James



  • Gösta H. Lovgren-2
    replied
    Originally posted by David Warner
    > @Gosta
    >
    > There are approximately 1.71 million Postcodes in the UK and to use them you have to license them from the Royal Mail. I don't want to do that for this application.

    I don't understand the part about "licensing" them (however, that is immaterial to the discussion at hand). As for looking up a particular match in a sorted list of 1.7 million entries, that's a mere cakewalk. I'd bet well under 1/10th of a second, especially for an indexed or hashed list.

    > Additionally, in historical databases you will find old postcodes that are not actually listed in the Royal Mail file but they do still follow a standard pattern.
    > Pattern matching comes in very handy in this situation.

    Okay, fair enough (I guess). Not sure I understand, though. (Not important either.)

    > Even more unpleasant is the case where a postcode is embedded into an address requiring it to be separated out. It is easier to say, 'Is there some data in this address line that looks like a postcode' than to check for 1.7 million specific postcodes in each line of an address.
    >
    > I hope that helps you understand what I am trying to achieve here.

    I don't know the convention in the UK, but in the US the zip code (5 or 9 numerics) is either at the end of the last address line or is the entire last address line. Again, it's probably immaterial, as you have determined (as ONLY you can, it's your app) that a regex pattern search is the way to go. I can see that if one is searching documents for embedded addresses, for example. I just can't wrap my head around using regex for lookups in a list, which is what I initially presumed this app to be.
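
The "is there something in this address line that looks like a postcode" case can be sketched outside PowerBASIC. What follows is a minimal Python sketch using Python's `re` module, so it says nothing about how PB's REGEXPR behaves. The character classes are the ones used in this thread; the helper name `find_postcode` and the sample address lines are made up for illustration.

```python
import re

# Incode: one digit followed by two letters, excluding C, I, K, M, O, V.
INCODE = r"[0-9][ABD-HJLNP-UW-Z]{2}"

# Outcode alternatives, built from the classes quoted in this thread.
OUTCODE = (
    r"(?:[A-PR-UWYZ][0-9][A-HJKSTUW]"             # ANA
    r"|[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]"    # AANA
    r"|[A-PR-UWYZ][A-HK-Y][0-9][0-9]?"            # AAN, AANN
    r"|[A-PR-UWYZ][0-9][0-9]?)"                   # AN, ANN
)

# \b stops a match from starting or ending in the middle of a longer word.
POSTCODE = re.compile(r"\b(?:GIR 0AA|" + OUTCODE + " " + INCODE + r")\b")


def find_postcode(address_line):
    """Return the first postcode-shaped substring, or None (hypothetical helper)."""
    m = POSTCODE.search(address_line)
    return m.group(0) if m else None


if __name__ == "__main__":
    # Made-up address lines; the postcodes are the test values from this thread.
    for line in ["Flat 2, 14 High Street, Croydon CR2 6XH",
                 "W1A 1HQ, London",
                 "no code here"]:
        print(repr(line), "->", find_postcode(line))
```

A list lookup answers "is this exact string a known postcode", while a search like this answers "where in this line is something postcode-shaped"; the two are complementary rather than interchangeable.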

    ==========================
    Patience, n.
    A minor form of despair,
    disguised as a virtue.
    Ambrose Bierce (1842-1914)
    ==========================



  • Michael Mattias
    replied
    Since statements using regular expressions cannot be checked at compile time, you can get weird things happening at runtime up to and including protection faults.

    I did not try with yours, but this has happened to me with other regular expressions in the past.



  • David Warner
    replied
    Interestingly, I now have a reproducible application crash in my demo program (postcode_regex_test.bas above) when I shorten my PB version of the regex by removing the last two patterns...

    Code:
    b$ =      "(GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|"
    b$ = b$ & "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
    b$ = b$ & "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
    b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"
    b$ = b$ & " [0-9])"
    When I run this expression against the 7 test postcodes, Format 1 (GIR 0AA) is matched ok but the application crashes on examples 2 to 7 with the following Windows error message...

    > postcode_regex_test.exe has encountered a problem and needs to close. We are sorry for the inconvenience.

    So with the longer expression the matching just fails and with the shorter expression the application crashes.

    Can anyone else confirm this is happening?

    Regards,

    David



  • David Warner
    replied
    Hi Michael,

    thanks, yes I am aware that there are different 'flavours' of regular expressions. In fact there are a few words relating to this at the source of the regex in question.

    http://www.govtalk.gov.uk/gdsc/schem...stCodeType.htm

    > complex pattern for postcode, which matches definition, accepted by some parsers is: "(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})"

    If the PB regular expression parser doesn't like this style of regex, then so be it. I would like to know why, though, as it appears OK to me.
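
The published pattern uses character class subtraction: `[A-Z-[QVX]]` means "A to Z, except Q, V and X". That is an XML Schema regex feature and not every engine has it, which is why the PB versions in this thread spell the classes out (for example `[A-PR-UWYZ]`). A minimal Python sketch to confirm the three expansions are letter-for-letter equivalent is given here; the helper names `expand` and `allowed` are invented for this illustration, and Python's `re` is used only to enumerate the plain classes.

```python
import re
import string


def expand(subtracted):
    """Expand an XSD-style class body such as 'A-Z-[QVX]' (outer brackets
    omitted) into the set of uppercase letters it allows."""
    base, removed = subtracted.split("-[")
    removed = removed.rstrip("]")
    first, last = base.split("-")
    letters = {c for c in string.ascii_uppercase if first <= c <= last}
    return letters - set(removed)


def allowed(plain_class):
    """Return the uppercase letters accepted by an ordinary class like '[A-HK-Y]'."""
    return {c for c in string.ascii_uppercase if re.fullmatch(plain_class, c)}


# (class body from the published pattern, expanded class used in this thread)
PAIRS = [
    ("A-Z-[QVX]", "[A-PR-UWYZ]"),
    ("A-Z-[IJZ]", "[A-HK-Y]"),
    ("A-Z-[CIKMOV]", "[ABD-HJLNP-UW-Z]"),
]

if __name__ == "__main__":
    for xsd, plain in PAIRS:
        print(xsd, "->", plain, "equivalent:", expand(xsd) == allowed(plain))
```

If all three lines report equivalence, the character classes can be ruled out as the source of the difference, leaving the grouping and alternation as the place to look.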

    Regards,

    David



  • Michael Mattias
    replied
    >It is based on a UK Government published regular expression...

    Just to make sure you are aware...

    There is no single 'standard' for the formation of regular expressions. The simplest example is within one product, Ultra-Edit. With UE you have to choose between "unix style" and "Ultra-Edit-style" regular expressions (separate pages in the UE help file and everything); there is also at least Perl-style and, of course, "PowerBASIC-style" regular expression syntax.



  • David Warner
    replied
    @John

    Many thanks for posting your workarounds John, I appreciate you looking into this.

    It is in fact perfectly possible to re-structure the patterns into a single working PB regular expression as follows...

    Code:
    'New BS7666 regular expression
    'completely re-structured patterns for PowerBASIC
    'One full pattern per postcode format
    '
    b$ =      "(GIR 0AA)|"
    b$ = b$ & "([A-PR-UWYZ][0-9][A-HJKSTUW] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])|"
    b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])|"
    b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][0-9]? [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])|"
    b$ = b$ & "([A-PR-UWYZ][0-9][0-9]? [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])"
    As in your example, I combined the two space-separated elements (the outcode and incode, as they are known) into a full postcode pattern, which made it necessary to duplicate the second (incode) sub-pattern for each postcode format.

    The original UK Gov regex defines a variety of outcode patterns and describes the incode once only.
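
One way to keep the incode defined once while still ending up with one full pattern per format is to build the flat expression from parts. The following is a minimal Python sketch of that idea, checked with Python's `re` and not with PB's REGEXPR; `build_flat_pattern` is a name invented here, and the resulting string has the same alternatives, in the same order, as the restructured expression above.

```python
import re

# The incode is written once and appended to every outcode alternative.
INCODE = "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"

OUTCODES = [
    "[A-PR-UWYZ][0-9][A-HJKSTUW]",               # ANA
    "[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]",    # AANA
    "[A-PR-UWYZ][A-HK-Y][0-9][0-9]?",            # AAN, AANN
    "[A-PR-UWYZ][0-9][0-9]?",                    # AN, ANN
]


def build_flat_pattern():
    """Join '(outcode incode)' groups with '|' after the GIR 0AA special case."""
    parts = ["(GIR 0AA)"] + ["(" + o + " " + INCODE + ")" for o in OUTCODES]
    return "|".join(parts)


# The seven test postcodes from this thread.
EXAMPLES = ["GIR 0AA", "M1 1AA", "M60 1NW", "CR2 6XH",
            "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]

if __name__ == "__main__":
    flat = build_flat_pattern()
    for code in EXAMPLES:
        print(code, "matches" if re.fullmatch(flat, code) else "no match")
```

When a revised pattern is published, only `INCODE` or the `OUTCODES` list would need changing, which goes some way towards the "minimum modifications" goal even with the flat structure.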


    @Bob

    Thanks for your reply, Bob.

    As you can see in the example above, I have a working solution; however, I would prefer to use the published BS7666 regex (with minor pattern modifications) rather than having to restructure it.

    To address 'What happens 3 years from now when some unanticipated case arises?', as far as UK postcodes go, such a case will trigger the release of a new BS7666 regular expression. I would then like to use that with minimum modifications.

    I would be grateful if you could answer my question (and I promise I am not asking you to write my code for me).

    Can you see any reason why the following pattern (pulled from my example code above) does not work? It looks to me as though it should.

    Code:
    '(
    'GIR 0AA
    ')
    '|
    '(	
    '	(			
    '		([A-PR-UWYZ][0-9][0-9]?)
    '		|
    '		(
    '			([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)
    '			|
    '			(	
    '				([A-PR-UWYZ][0-9][A-HJKSTUW])
    '				|
    '				([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])
    '			)
    '		)	
    '	)
    ' [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
    ')
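
For comparison, here is a minimal Python sketch that runs the same nested expression through Python's `re` module. It only shows how a Perl-style engine treats the grouping; it does not explain what PB's REGEXPR does with it, and the helper name `check` is invented for this illustration.

```python
import re

# The nested PB-style expression from this thread, assembled piece by piece.
nested = (
    "(GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|"
    "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
    "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
    "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"
    " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])"
)

# The seven test postcodes from this thread.
EXAMPLES = ["GIR 0AA", "M1 1AA", "M60 1NW", "CR2 6XH",
            "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]


def check(pattern, codes):
    """Map each code to True/False depending on whether the whole string matches."""
    return {code: re.fullmatch(pattern, code) is not None for code in codes}


if __name__ == "__main__":
    for code, ok in check(nested, EXAMPLES).items():
        print(code, "matches" if ok else "no match")
```

Under Python's engine the parentheses balance and the alternation backtracks as the tabbed-out layout suggests, so the structure of the expression is at least consistent with the intent described above.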


    @Gosta

    I am not particularly looking for regex experience, I am simply attempting to use the most convenient tool available to me to implement a British Standard published regular expression.

    There are approximately 1.71 million Postcodes in the UK and to use them you have to license them from the Royal Mail. I don't want to do that for this application.

    Additionally, in historical databases you will find old postcodes that are not actually listed in the Royal Mail file but they do still follow a standard pattern.
    Pattern matching comes in very handy in this situation.

    Even more unpleasant is the case where a postcode is embedded into an address requiring it to be separated out. It is easier to say, 'Is there some data in this address line that looks like a postcode' than to check for 1.7 million specific postcodes in each line of an address.

    I hope that helps you understand what I am trying to achieve here.

    Regards,

    David



  • Gösta H. Lovgren-2
    replied
    David,

    I don't understand what you are trying to achieve here (other than regex experience). If it is a postal code lookup, wouldn't it be FAR simpler, and effectively as fast (unless tenths or hundredths of a second count), to just look it up in a sorted list of postal codes?

    ===================================================
    "There is no sincerer love than the love of food."
    George Bernard Shaw (1856-1950)
    ===================================================
    Last edited by Gösta H. Lovgren-2; 11 Apr 2009, 02:52 PM.



  • Bob Zale
    replied
    Originally posted by John Gleason
    > I seldom use reg expr's and now I kind of remember why.

    RegExpr's can be very nice and very powerful. However, when they reach a certain level of complexity, they become quite a project to maintain. Just think about 3 years from now when some unanticipated case arises?

    I'd vote for byte-by-byte syntax directed parsing here...

    Best regards,

    Bob Zale
    PowerBASIC Inc.



  • John Gleason
    replied
    By combining the two space-separated parts and ordering by increasing complexity (so simple patterns aren't found in more complex ones), and by adding a letter-letter-number option, this now seems to work. I seldom use reg expr's and now I kind of remember why.
    Code:
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
    
        LOCAL a AS STRING
        LOCAL b, b1, b1a, b2, b3, b4, b5 AS STRING
        LOCAL c AS STRING
        LOCAL position, position2 AS LONG
        LOCAL length, length2 AS LONG
    
        'Un-comment the following postcode lines one at a time
        'and re-compile to test each one.
    
        'Example UK Postcode  Format
         a$ = "GIR 0AA"       '1) GIR 0AA  - matches
    '    a$ = "M1 1AA"        '2) AN NAA   - all ok now
    '    a$ = "M60 1NW"       '3) ANN NAA
    '    a$ = "CR2 6XH"       '4) AAN NAA << I added logic here which needs accuracy checking
    '    a$ = "DN55 1PT"      '5) AANN NAA
    '    a$ = "W1A 1HQ"       '6) ANA NAA
    '    a$ = "EC1A 1BB"      '7) AANA NAA
    
        'test 1 - A PB compatible version of the BS7666
        'postcode validation regular expression.
        'It produces all matches on the example postcodes
        b5 = "GIR 0AA"
        b3 = "[A-PR-UWYZ][0-9][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
        b4 = "[A-PR-UWYZ][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
        b1 = "[A-PR-UWYZ][A-HK-Y][0-9][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
        b1a= "[A-PR-UWYZ][A-HK-Y][0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]" '<< check! may not be correct mask letters
        b2 = "[A-PR-UWYZ][0-9][A-HJKSTUW] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
        b  = "[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
    
        RESET position& : RESET length&
        RESET position2& : RESET length2&
        DO
          position& = position& + length&
          position2 = position
                     
          REGEXPR b$ IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
          REGEXPR b1 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
         REGEXPR b1a IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
          REGEXPR b2 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
          REGEXPR b3 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
          REGEXPR b4 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound ELSE position = position2
          REGEXPR b5 IN a$ AT position& TO position&, length&: IF position <> 0 GOTO matchFound
          EXIT DO
             matchFound:
                   c$ = "Match at position : " & FORMAT$(position&,"00") _
                   & "    Data : " & CHR$(34) & MID$(a$,position&,length) & CHR$(34)
                   MSGBOX c$
        LOOP WHILE position&
    END FUNCTION
    Last edited by John Gleason; 11 Apr 2009, 12:22 PM. Reason: fixed comment from "no" to "all matches"
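
The "try one simple mask at a time" approach has a side benefit: it can report which format matched. A minimal Python sketch of that idea is given below, again with Python's `re` and not PB's REGEXPR. The masks are listed in the same order as the REGEXPR calls above; the `FORMATS` table and the `classify` helper are names made up for this illustration.

```python
import re

INCODE = " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"

# (format label, mask), in the same order as b, b1, b1a, b2, b3, b4, b5 above.
FORMATS = [
    ("AANA NAA", "[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]" + INCODE),
    ("AANN NAA", "[A-PR-UWYZ][A-HK-Y][0-9][0-9]" + INCODE),
    ("AAN NAA",  "[A-PR-UWYZ][A-HK-Y][0-9]" + INCODE),
    ("ANA NAA",  "[A-PR-UWYZ][0-9][A-HJKSTUW]" + INCODE),
    ("ANN NAA",  "[A-PR-UWYZ][0-9][0-9]" + INCODE),
    ("AN NAA",   "[A-PR-UWYZ][0-9]" + INCODE),
    ("GIR 0AA",  "GIR 0AA"),
]


def classify(postcode):
    """Return the label of the first mask that matches the whole string, else None."""
    for label, mask in FORMATS:
        if re.fullmatch(mask, postcode):
            return label
    return None


if __name__ == "__main__":
    for code in ["GIR 0AA", "M1 1AA", "M60 1NW", "CR2 6XH",
                 "DN55 1PT", "W1A 1HQ", "EC1A 1BB"]:
        print(code, "->", classify(code))
```

Because each mask here has to match the whole string, the order only mirrors the post; ordering matters more when searching inside longer text, where a shorter mask could match part of a longer postcode.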



  • John Gleason
    replied
    Here's a workaround, because the 1st and 2nd masks work, just not together for some reason I couldn't figure out.
    Code:
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
    
        LOCAL a, a2 AS STRING
        LOCAL b, b2 AS STRING
        LOCAL c AS STRING
        LOCAL position, position2 AS LONG
        LOCAL length, length2 AS LONG
    
        'Un-comment the following postcode lines one at a time
        'and re-compile to test each one.
    
        'Example UK Postcode  Format
    '    a$ = "GIR 0AA"       '1) GIR 0AA  - matches
        a$ = "M1 " :a2 = "1AA"'2) AN NAA   - no match
        a$ = "M60 ":a2 = "1NW"'3) ANN NAA  - no match
        'a$ = "CR2 6XH"       '4) AAN NAA  - no match
        'a$ = "DN55 1PT"      '5) AANN NAA - no match
        'a$ = "W1A 1HQ"       '6) ANA NAA  - no match
        'a$ = "EC1A 1BB"      '7) AANA NAA - no match
    
        'test 1 - A PB compatible version of the BS7666
        'postcode validation regular expression.
        'It produces no matches on the example postcodes
        'numbered 2 to 7, but I think perhaps it should.
        '
        '*NOTE: The individual sub-patterns match data ok
        'when tested independently of the full expression.
    
        b$ =      "(GIR 0AA)|(([A-PR-UWYZ][0-9][0-9]?)|"
        b$ = b$ & "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
        b$ = b$ & "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
        b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])))) "
    
        b2 = "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"
    
        'When the regex above is tabbed out for clarity as shown
        'below, it all seems logical. However only the (GIR 0AA) pattern
        'produces a match against the example postcodes listed above.
        '
        '(
        'GIR 0AA
        ')
        '|
        '(
        '    (
        '       ([A-PR-UWYZ][0-9][0-9]?)
        '       |
        '       (
        '          ([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)
        '          |
        '          (
        '             ([A-PR-UWYZ][0-9][A-HJKSTUW])
        '             |
        '             ([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])
        '          )
        '       )
        '    )
        ' [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
        ')
    
        RESET position& : RESET length&
        RESET position2& : RESET length2&
        DO
          position& = position& + length&
    '      !int 3
          REGEXPR b$ IN a$ AT position& TO _
                position&, length&
            IF position& <> 0 THEN
                REGEXPR b2 IN a2 AT position2& TO position2&, length2&
                IF position2 <> 0 THEN
                   c$ = "Match at position : " & FORMAT$(position&,"00") _
                   & "    Data : " & CHR$(34) & MID$(a$,position&,length) & MID$(a2$,position2&,length2) & CHR$(34)
                   MSGBOX c$
                END IF
            END IF
        LOOP WHILE position&
    END FUNCTION
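
The same split-and-check idea, validating the outcode and the incode with two separate masks, can be sketched in Python as follows. This is a sketch under the assumption that the two halves are separated by exactly one space; it uses Python's `re`, not PB's REGEXPR, and `is_postcode` is a name invented for this illustration.

```python
import re

# Outcode alternatives (everything before the space), GIR handled separately.
OUTCODE = (
    "[A-PR-UWYZ][0-9][0-9]?"
    "|[A-PR-UWYZ][A-HK-Y][0-9][0-9]?"
    "|[A-PR-UWYZ][0-9][A-HJKSTUW]"
    "|[A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]"
)

# Incode (everything after the space).
INCODE = "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"


def is_postcode(text):
    """Validate the two halves independently instead of with one nested mask."""
    if text == "GIR 0AA":            # the Girobank special case
        return True
    outcode, sep, incode = text.partition(" ")
    if not sep:                      # no space, so there are not two halves
        return False
    return (re.fullmatch(OUTCODE, outcode) is not None
            and re.fullmatch(INCODE, incode) is not None)


if __name__ == "__main__":
    for code in ["GIR 0AA", "M1 1AA", "M60 1NW", "CR2 6XH",
                 "DN55 1PT", "W1A 1HQ", "EC1A 1BB", "M11AA"]:
        print(code, "->", is_postcode(code))
```

Splitting first keeps each mask short and free of nested groups, at the cost of assuming a fixed separator between the two halves.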



  • David Warner
    replied
    Hi Iain,

    Thanks for your reply.

    > As said above, you have hard-coded the postcode into b$

    The (GIR 0AA) postcode is correctly embedded into the regular expression in b$ because it is an exception to the other rules. It is a non-geographic UK postcode that relates to the UK Girobank and therefore has its own pattern in the regular expression.

    If you take a look at the original government-supplied regular expression, you will note its presence there.

    There is a reference to this at http://en.wikipedia.org/wiki/UK_postcodes

    > There are at least two exceptions (other than the overseas territories) to this format:
    >
    > the postcode for the formerly Post Office-owned Girobank is GIR 0AA.
    > the postcode for correctly addressed letters to Father Christmas is SAN TA1

    So thanks for your suggested change for

    b$ = a$+"|((([A-PR-UWYZ][0-9][0-9]?)|"

    but this incorrectly mixes the regular expression with the test data.

    My problem still stands I'm afraid.

    Regards,

    David
    Last edited by David Warner; 11 Apr 2009, 06:38 AM.



  • Iain Johnstone
    replied
    Works if you change the line to:-
    b$ = a$+"|((([A-PR-UWYZ][0-9][0-9]?)|"

    As said above, you have hard-coded the postcode into b$

    Iain Johnstone



  • David Warner
    replied
    Hi Gosta,

    thanks for your reply.

    > the reason the first a$ ("GIR 0AA") matched is that it is actually in the searched b$

    From my example, you just need to...
    Code:
        'Un-comment the following postcode lines one at a time
        'and re-compile to test each one.
    
        'Example UK Postcode  Format
        a$ = "GIR 0AA"       '1) GIR 0AA  - matches
        'a$ = "M1 1AA"        '2) AN NAA   - no match
        'a$ = "M60 1NW"       '3) ANN NAA  - no match
        'a$ = "CR2 6XH"       '4) AAN NAA  - no match
        'a$ = "DN55 1PT"      '5) AANN NAA - no match
        'a$ = "W1A 1HQ"       '6) ANA NAA  - no match
        'a$ = "EC1A 1BB"      '7) AANA NAA - no match
    The first one matches, the rest do not although I think they should.

    > (I suspect a coding error.)

    Always a possibility!

    Regards,

    David



  • Gösta H. Lovgren-2
    replied
    Dunno about Regular Expressions, but the reason the first a$ ("GIR 0AA") matched is that it is actually in the searched b$. Dunno why the others didn't. Maybe a "Regular" guy can tell. (I suspect a coding error.)

    =========================================
    "Make people think they are thinking,
    and they will love you for it.
    Make them really think,
    and they will hate you for it."
    Attributed to both Plato and Aristotle.
    =========================================



  • David Warner
    started a topic Regular Expression Problem

    Hi Folks,

    I am having a little difficulty implementing a PB/Win 9.01 regular expression. It is based on a UK Government published regular expression which is designed to validate UK postcodes. My PB version of the regex looks as though it should work to me but it just isn't matching as I expect.

    I have prepared an example program below including notes etc. and would be grateful if anyone would take a look at it to see if I am doing something wrong.

    Any comments gratefully received.

    Thanks,

    David

    Code:
    ' Module Name : postcode_regex_test.bas
    ' Platform    : PB/Win 9.01
    '               Windows XP Professional SP3
    '
    ' Purpose     : To implement a PB version of the UK BS7666
    '               postcode validation regular expression provided
    '               at...
    '
    '               http://www.govtalk.gov.uk/gdsc/schemas/bs7666-v2-0.xsd
    '
    '               The 'complex pattern' for UK postcodes
    '               (to be found under the PostCodeType node of the schema)
    '               is as follows...
    '
    ' "(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})"
    '
    '               This pattern implements the rules given at...
    '               http://www.govtalk.gov.uk/gdsc/html/frames/PostCode.htm
    '
    ' Issue       : My PB version doesn't work, am I doing something wrong?
    
    #COMPILE EXE
    #DIM ALL
    
    FUNCTION PBMAIN () AS LONG
    
        DIM a AS STRING
        DIM b AS STRING
        DIM c AS STRING
        DIM position AS LONG
        DIM length AS LONG
    
        'Un-comment the following postcode lines one at a time
        'and re-compile to test each one.
    
        'Example UK Postcode  Format
        a$ = "GIR 0AA"       '1) GIR 0AA  - matches
        'a$ = "M1 1AA"        '2) AN NAA   - no match
        'a$ = "M60 1NW"       '3) ANN NAA  - no match
        'a$ = "CR2 6XH"       '4) AAN NAA  - no match
        'a$ = "DN55 1PT"      '5) AANN NAA - no match
        'a$ = "W1A 1HQ"       '6) ANA NAA  - no match
        'a$ = "EC1A 1BB"      '7) AANA NAA - no match
    
        'test 1 - A PB compatible version of the BS7666
        'postcode validation regular expression.
        'It produces no matches on the example postcodes
        'numbered 2 to 7, but I think perhaps it should.
        '
        '*NOTE: The individual sub-patterns match data ok
        'when tested independently of the full expression.
    
        b$ =      "(GIR 0AA)|((([A-PR-UWYZ][0-9][0-9]?)|"
        b$ = b$ & "(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"
        b$ = b$ & "(([A-PR-UWYZ][0-9][A-HJKSTUW])|"
        b$ = b$ & "([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"
        b$ = b$ & " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z])"
    	 
        'When the regex above is tabbed out for clarity as shown
        'below, it all seems logical. However only the (GIR 0AA) pattern
        'produces a match against the example postcodes listed above.
        '
        '(
        'GIR 0AA
        ')
        '|
        '(	
        '	(			
        '		([A-PR-UWYZ][0-9][0-9]?)
        '		|
        '		(
        '			([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)
        '			|
        '			(	
        '				([A-PR-UWYZ][0-9][A-HJKSTUW])
        '				|
        '				([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY])
        '			)
        '		)	
        '	)
        ' [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
        ')
    	
        RESET position& : RESET length&
        DO
          position& = position& + length&
          REGEXPR b$ IN a$ AT position& TO _
                position&, length&
            IF position& <> 0 THEN
                c$ = "Match at position : " & FORMAT$(position&,"00") _
                & "    Data : " & CHR$(34) & MID$(a$,position&,length)  & CHR$(34)
                MSGBOX c$
            END IF
        LOOP WHILE position&
    END FUNCTION
    Last edited by David Warner; 10 Apr 2009, 05:31 PM.