First: ARRAY SORT text strings by length...

  • First: ARRAY SORT text strings by length...

    I've got an array of dynamic strings that I need to have sorted by their lengths. ARRAY SORT does not have an intrinsic keyword for this, but it DOES have a "custom" sort capability.

    So, after reading (many times over) all the info in the Help on "ARRAY SORT, USING", I created my custom sort function, and a special TYPE for passing array elements to it, etc.

    Here is the result - I hope someone may find it useful.
    -John


    Code:
    #BREAK ON   'only while testing
    #COMPILER PBCC 5
    #DIM ALL
    #COMPILE EXE 
    
    GLOBAL g_All_Words AS STRING  
    
    TYPE MyItem_Type         'for use with "Array Sort USING"
       Txt AS STRING * 20
    END TYPE
    
    FUNCTION PBMAIN() AS LONG
       LOCAL Words () AS MyItem_Type   ' <=== won't work AS STRING
       LOCAL NumWords, i AS LONG, sRet AS STRING
    
       g_All_Words = TRIM$(" BUT MOUSE HAD FOOD LOVE AND WAS BANANA FOOT WHILE FRUFRU MONEY STAPLE BAGEL CEREAL HAND ")  ' 4 each of: 3, 4, 5, and 6 chars long... (The actual list is 1.344Mbytes...)
    
       NumWords = PARSECOUNT(g_All_Words, $SPC)
       DIM Words(1 TO NumWords)
       PARSE g_All_Words, Words(), $SPC
    
       ARRAY SORT Words() , USING SortItemsByLength()  'put them in order by size
       '
    
       FOR i = 1 TO NumWords   'show results
          ? STR$(i), Words(i)
       NEXT i
    
       MOUSE 3, DOUBLE, DOWN
       MOUSE ON
       WAITSTAT      'why doesn't this register a mouse click? Both Mouse buttons and Mouse ON need to be set beforehand...
       'sRet = inkey$  'not really needed for these abbreviated tests
    
    END FUNCTION
    
    FUNCTION SortItemsByLength(Item1 AS MyItem_Type, Item2 AS MyItem_Type) AS LONG
       IF LEN(TRIM$(Item1.Txt, ANY CHR$(0,20))) < LEN(TRIM$(Item2.Txt, ANY CHR$(0,20))) THEN
          FUNCTION = -1 : EXIT FUNCTION
       END IF
       IF LEN(TRIM$(Item1.Txt, ANY CHR$(0,20))) > LEN(TRIM$(Item2.Txt, ANY CHR$(0,20))) THEN
          FUNCTION = +1 : EXIT FUNCTION
       END IF            
    END FUNCTION

    There are a lot of intertwined conditions you need to keep in mind in order to make this work. (In other words, don't expect the help file to be easy to understand...)

    Since my initial results were not as expected, I went back and re-read the Help for TYPE/END TYPE, and here's what it says under "Restrictions". (The bold/italic/underline is my formatting, since this is the statement I was misunderstanding...):

    Restrictions
    When measuring the size of a padded (aligned) UDT structure with the LEN or SIZEOF statements, the measured length includes any padding that was added to the structure. For example, the following UDT structure:

    TYPE LengthTestType DWORD
    a AS INTEGER
    END TYPE
    ...

    DIM abc AS LengthTestType

    x& = LEN(abc)

    Returns a length of 4 bytes in x&, since the UDT was padded with 2 additional bytes to enforce DWORD alignment. Note that the LEN and SIZEOF of individual UDT members will return the true size of the member without regard to padding or alignment. In the previous example, LEN(abc.a) returns 2.

    Individual UDT structures can be up to 16 MB each. Arrays within a UDT, ASCIIZ strings and fixed-length strings may occupy the full 16 MB structure size limit.

    Field strings and dynamic strings cannot be used in UDT or UNION structures. Attempting to do so results in a compile-time Error 485 ("Dynamic/Field strings not allowed").
    The sentence I highlighted seems to contradict what the first paragraph states, but then after an hour's meditation, I understood the original meaning of the phrase "true size", and I realized: If you want to find the "actual" length, you not only need to use LEN(Item1.Txt) but be sure to use: LEN(TRIM$(Item1.Txt, ANY CHR$(0)))
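    A minimal sketch of that point, assuming the member was filled by a plain assignment (which pads a fixed-length string with spaces). It trims both NULs and spaces, since CHR$(32) is the space character (hex 20) and the padding may be either, depending on how the member was filled. The test word is only an illustration:

    Code:
    #COMPILE EXE
    #DIM ALL

    TYPE MyItem_Type
       Txt AS STRING * 20
    END TYPE

    FUNCTION PBMAIN () AS LONG
       LOCAL Item AS MyItem_Type
       Item.Txt = "MOUSE"
       ? LEN(Item.Txt)                            ' the declared size of the member: 20
       ? LEN(TRIM$(Item.Txt, ANY CHR$(0, 32)))    ' length with NUL and space padding removed
       WAITKEY$
    END FUNCTION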



    Oh, and in case you're wondering why I had to create a single-member TYPE for this - "ARRAY SORT , USING" does not accommodate variable length strings...

    Note the deep meaning in the wording in the help file under "Array Sort", in the section "Sorting Custom Arrays"...

    A custom array may be user-defined types, fixed-length strings, or ASCIIZ strings. With a custom array sort, you can write your own simple function to tell PowerBASIC the correct sequence for any two array elements.
    ...
    The array to be sorted, and the function parameters, must be fixed-length strings, ASCIIZ strings, or user-defined types. PowerBASIC verifies that the size of the data and parameters are identical. However, to allow maximum flexibility, it does not require that the data types be the same. Therefore, for example, it's possible to sort an array of fixed-length strings using a function with UDT parameters as long as the data size is identical. It is the programmer's responsibility to ensure accuracy.
    In the first sentence, "may" is misleading. The later sentence that I bolded makes it clear that there is no option to use variable length strings...




    OK, my NEXT step is to figure out how to use ARRAY SCAN to find the first and last of the 5-letter words.

    (I'd like to avoid having to process each and every entry in a loop of code... I realize I can edit my general-purpose Function to push all the 5-letter entries to the front of the array!!! But before I could REDIM PRESERVE, I'd still need to SCAN to the first non-5-letter word.)

    I'm hoping ARRAY SCAN will accept something like:
    Code:
    array scan Words() , = len(5), to FirstFiver 'find the first 5-letter word
    Well, I can dream, can't I?

  • #2
    > LOCAL Words () AS MyItem_Type ' <=== won't work AS STRING

    It won't? That can't be right.

    Nuls and padding got you down?
    Code:
    TYPE MyItem_Type         'for use with "Array Sort USING"
       szTxt AS ASCIIZ * 48
    END TYPE
    Michael Mattias
    Tal Systems (retired)
    Port Washington WI USA
    http://www.talsystems.com



    • #3
      Darn!

      I was betting that you were going to pick up on this:

      Code:
      LEN(TRIM$(Item1.Txt, ANY CHR$(0,20)))
      and suggest:

      Code:
      MACRO ActualLen(x) = LEN(TRIM$(x, ANY CHR$(0,20)))
      Well, anyway, so I haven't embraced my inner ASCIIZ-ness yet. I opted for the "fixed-length-ness" with which I'm already familiar...

      -jhm



      • #4
        Try this:

        Code:
        MACRO ActualLen(x) = LEN(TRIM$(x, ANY CHR$(0,20)))
        
        FUNCTION SortItemsByLength(Item1 AS MyItem_Type, Item2 AS MyItem_Type) AS LONG
           'to push all 5-letter words to top:
           IF ActualLen(Item1.Txt) = 5 AND ActualLen(Item2.Txt) <> 5 THEN
              FUNCTION = -1 : EXIT FUNCTION
           END IF
           IF ActualLen(Item1.Txt) <> 5 AND ActualLen(Item2.Txt) = 5 THEN
              FUNCTION = +1 : EXIT FUNCTION
           END IF
           
           'these sort all items by size, no problem:
           IF ActualLen(Item1.Txt) < ActualLen(Item2.Txt) THEN
              FUNCTION = -1 : EXIT FUNCTION
           END IF
           IF ActualLen(Item1.Txt) > ActualLen(Item2.Txt) THEN
              FUNCTION = +1 : EXIT FUNCTION
           END IF
        END FUNCTION



        • #5
          Actually, if your string is left-justified and contains no embedded spaces...

          ActualLen = INSTR(str, ANY CHR$(0,20)) -1

          But if the callback will work with ASCIIZ strings
          Code:
          FUNCTION myCallback (sz1 AS ASCIIZ, sz2 AS ASCIIZ) AS LONG 
          
             IF lstrLen(sz1) < lstrlen(sz2) THEN 
                FUNCTION = 1&    ' curious how you can't say "+1&" huh?
             ELSEIF lstrlen(sz2) < lstrlen(sz1) THEN 
                FUNCTION = -1&
             ELSE
                 FUNCTION = 0&
             END IF
          
          END FUNCTION
          I may have the '+' and '-' bass-ackwards, but lstrLen() should outperform LEN on ASCIIZ variables.

          MCM
          Michael Mattias
          Tal Systems (retired)
          Port Washington WI USA
          http://www.talsystems.com



          • #6
            Interesting - I'll give the INSTR a try. Am I reading you correctly that I should change the TYPE.Txt from STRING to ASCIIZ??? (Got me wondering why - what's the diff?)

            I cut back the contents of g_All_Words for the code I posted; it's actually 1.344Mb, so it's ripe for comparison (speed) testing...

            I'll check back in on Friday...

            -jhm



            • #7
              Fill array with lengths, sort by length with TAGARRAY words

              Code:
              #COMPILE EXE
              #DIM ALL
              FUNCTION PBMAIN&()
                LOCAL NumWords&,i&,s$, x&
                s = TRIM$(" BUT MOUSE HAD FOOD LOVE AND WAS BANANA FOOT WHILE FRUFRU MONEY STAPLE BAGEL CEREAL HAND ")
                NumWords = PARSECOUNT(s, $SPC)
                REDIM Words (1 TO NumWords) AS STRING
                REDIM WordLength (1 TO NumWords) AS LONG
                PARSE s, Words(), $SPC
                FOR i = 1 TO NumWords: WordLength(i) = LEN(Words(i)): NEXT
                ARRAY SORT WordLength(), TAGARRAY Words()
                ? "Element","Word","Length"
                FOR i = 1 TO NumWords: ? FORMAT$(i),Words(i), FORMAT$(WordLength(i)):NEXT
              '[quote]'OK, my NEXT step is to figure out how to use ARRAY SCAN to find the first and last of the 5-letter words.'[/quote]
                FOR x = WordLength(1) TO WordLength(NumWords)
                  ARRAY SCAN WordLength(), = x, TO i
                  IF i THEN ? "First word with length of";x; "is in element"; i; words(i)
                NEXT
               
                WAITKEY$
                  '[quote]  Well, I can dream, can't I? [/quote]
                'LOL
              END FUNCTION
              Last edited by Mike Doty; 26 Jan 2009, 09:47 PM.
              The world is full of apathy, but who cares?



              • #8
                Originally posted by Mike Doty
                Code:
                  FOR i = 1 TO NumWords: WordLength(i) = LEN(Words(i)): NEXT
                  ARRAY SORT WordLength(), TAGARRAY Words()
                ...
                
                  FOR x = WordLength(1) TO WordLength(NumWords)
                    ARRAY SCAN WordLength(), = x, TO i
                    IF i THEN ? "First word with length of";x; "is in element"; i; words(i)
                  NEXT
                Wow!!! To my surprise, running the two For/Next loops with the tagarray is a MUCH faster approach than my original approach of "ARRAY SORT, USING". I had anticipated that the loops would be slower. (It always pays to test one's hypothesis!)

                When used against the full 150,000 word list, my original approach (which did not yet include the "SCAN for size") took over 35 seconds, but Mike Doty's approach (which does include the "SCAN for size") takes under 2 seconds!

                Very cool! Thanks Mike and Mike!
                -John


                ALSO: With reference to another recent thread I've had help with (http://www.powerbasic.com/support/pb...ad.php?t=39612), I've added these macros...

                Code:
                MACRO TB =   TIX Cycles : StartTime = TIMER
                
                MACRO TE(prm1)
                   TIX END Cycles : EndTime = TIMER
                   ? : ? "Elapsed number of CPU cycles used " prm1 ": " ; Cycles ; "("; STR$(EndTime-StartTime) ;" rough seconds)"  
                   ? "Press a key to continue..."
                   WAITKEY$
                END MACRO
                and use them this way (only one invocation shown; others are coded as needed)

                Code:
                   TB
                   NumWords = PARSECOUNT(g_All_Words, $SPC)
                   REDIM Words (1 TO NumWords) AS STRING
                   REDIM WordLength (1 TO NumWords) AS LONG  'the Tag Array
                   PARSE g_All_Words, Words(), $SPC
                   TE("for Parsecount, Dim, and Parse")
                Now that may not be special to you, but I'm very proud of my success in not only using a multi-line macro, but also parameters!!!
                Last edited by John Montenigro; 27 Jan 2009, 09:36 AM. Reason: added link to other thread



                • #9
                  Code:
                  ARRAY SCAN WordLength(), = x, TO i
                      IF i THEN ? "First word with length of";x; "is in element"; i; words(i)
                  ===>

                  Code:
                  ARRAY SCAN WordLength(), > x, TO i
                      IF i > 1 THEN ? "Last word with length of at most";x; "is in element"; i-1; words(i-1)
                  Michael Mattias
                  Tal Systems (retired)
                  Port Washington WI USA
                  http://www.talsystems.com



                  • #10
                    I don't think I mentioned it outright, but it was embedded in comments in the original code: the work of this program is to isolate and work on only the 5-letter words in a list that contains words with lengths from 2 to 28...

                    What is gained with the change from "=" to ">" ????



                    • #11
                      You also asked for the last word of a given length, so he scanned for the first word that is longer; subtract 1 from that index and you have the last word of the length you want.
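
                      A minimal sketch of the two scans together, assuming the same TAGARRAY setup as in post #7 and an array whose lower bound is 1. The names FirstFive, NextLonger and LastFive are made up for illustration:

                      Code:
                      #COMPILE EXE
                      #DIM ALL

                      FUNCTION PBMAIN () AS LONG
                        LOCAL s AS STRING
                        LOCAL NumWords, i, FirstFive, NextLonger, LastFive AS LONG

                        s = "BUT MOUSE HAD FOOD LOVE AND WAS BANANA FOOT WHILE FRUFRU MONEY STAPLE BAGEL CEREAL HAND"
                        NumWords = PARSECOUNT(s, $SPC)
                        DIM Words(1 TO NumWords) AS STRING
                        DIM WordLength(1 TO NumWords) AS LONG
                        PARSE s, Words(), $SPC
                        FOR i = 1 TO NumWords
                          WordLength(i) = LEN(Words(i))
                        NEXT i
                        ARRAY SORT WordLength(), TAGARRAY Words()   'lengths ascending, words follow along

                        ARRAY SCAN WordLength(), = 5, TO FirstFive   'first 5-letter word, 0 if none
                        ARRAY SCAN WordLength(), > 5, TO NextLonger  'first longer word, 0 if none

                        IF FirstFive = 0 THEN
                          ? "No 5-letter words found"
                        ELSE
                          IF NextLonger = 0 THEN
                            LastFive = NumWords        'nothing longer, so the 5s run to the end
                          ELSE
                            LastFive = NextLonger - 1  'the element just before the first longer word
                          END IF
                          ? "5-letter words occupy elements"; FirstFive; "to"; LastFive
                        END IF
                        WAITKEY$
                      END FUNCTION

                      With this 16-word list (four words of each length from 3 to 6), the 5-letter words should end up in elements 9 through 12.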
                      The world is full of apathy, but who cares?



                      • #12
                        Ah, got it. I thought he was editing, and I didn't realize that he was adding on.
                        Thanks,
                        -jhm



                        • #13
                          How about this? Everything is done using ARRAY SORT and ARRAY SCAN:
                          Code:
                          ' Code shown for CC5 but should work for any CC version
                          ' Sorting words according to length
                          #COMPILE EXE
                          #DIM ALL
                          
                          FUNCTION PBMAIN () AS LONG
                          
                            DIM aWordList() AS STRING
                            LOCAL sCharWeight AS STRING
                            LOCAL i, j AS LONG
                          
                            REDIM aWordList(1 TO 7)
                            aWordList(1) = "Long"
                            aWordList(2) = "Shorter"
                            aWordList(3) = "Terribly"
                            aWordList(4) = "Longest"
                            aWordList(5) = "Short"
                            aWordList(6) = "Go"
                            aWordList(7) = "Snort"
                          
                            ' Let sCharWeight make any character look like a space to ARRAY SORT.
                            ' In this way, only the length is considered
                            sCharWeight = SPACE$(256)
                          
                            ARRAY SORT aWordList(), COLLATE sCharWeight, DESCEND
                            FOR i = 1 TO UBOUND(aWordList)
                              PRINT aWordList(i)
                            NEXT i
                          
                            PRINT
                          
                            ARRAY SORT aWordList(), COLLATE sCharWeight, ASCEND
                            FOR i = 1 TO UBOUND(aWordList)
                              PRINT aWordList(i)
                            NEXT i
                          
                            i = 0
                            ' The same COLLATE string trick may be used for ARRAY SCAN.
                            ' 'i' will hold the first entry holding a 5 char word
                            ARRAY SCAN aWordList(), COLLATE sCharWeight, =SPACE$(5), TO i
                          
                            IF i > 0 THEN  ' i < 1 --> No 5-char word found
                              j = 0
                              ' 'j'-1 will hold the last entry holding a 5 char word
                              ' You must start at i - 1 since i may be the only 5-char word,
                              ' and there is BASE 0 unless you set it otherwise
                              ARRAY SCAN aWordList(), FROM i - 1 TO UBOUND(aWordList), COLLATE sCharWeight, >SPACE$(5), TO j
                            END IF
                          
                            PRINT
                            IF j > 0 THEN
                              PRINT "First 5 char entry is"; i
                              PRINT "Last 5 char entry is"; j - 1
                             ELSE
                              PRINT "The only 5 char entry is"; i
                            END IF
                            WAITKEY$
                          
                          END FUNCTION
                          If speed is important, you should avoid repeatedly evaluating expressions such as SPACE$() and UBOUND(). Compute each value once and reuse it as a 'constant' from then on, as I do with sCharWeight, instead of calculating or looking it up every time you need it.

                          ViH



                          • #14
                            Code:
                            ' Let sCharWeight make any character look like a space to ARRAY SORT.
                              ' In this way, only the length is considered
                            Now THAT is clever!

                            MCM
                            Michael Mattias
                            Tal Systems (retired)
                            Port Washington WI USA
                            http://www.talsystems.com



                            • #15
                              Originally posted by Michael Mattias
                              Code:
                              ' Let sCharWeight make any character look like a space to ARRAY SORT.
                                ' In this way, only the length is considered
                              Now THAT is clever!

                              MCM
                              I completely agree!
                              -jhm



                              • #16
                                Indeed the COLLATE idea is a thing of beauty, however it does run significantly slower than TAGARRAY on larger arrays:
                                Code:
                                #COMPILE EXE
                                #DIM ALL
                                
                                FUNCTION PBMAIN () AS LONG
                                    DIM str(150000) AS STRING, strLen(150000) AS LONG
                                    LOCAL ii AS LONG, t1, t2 AS QUAD, sCharWeight AS STRING
                                
                                    FOR ii = 0 TO 150000
                                       str(ii) = STRING$(RND(0, 800), RND(32, 126))
                                       strLen(ii) = LEN(str(ii))          
                                    NEXT
                                    ? "ok, 60MB of unsorted strings are loaded, let's sort ascending using COLLATE..."
                                
                                    sCharWeight = SPACE$(256)
                                
                                  TIX t1
                                    ARRAY SORT str(), COLLATE sCharWeight', DESCEND
                                  TIX END t1
                                  ? "That took" & STR$(t1) & " ticks."
                                  ? "Now sort the strLen array and TAGARRAY the strings..."
                                
                                    RESET str(), strLen()
                                
                                    FOR ii = 0 TO 150000
                                       str(ii) = STRING$(RND(0, 800), RND(32, 126))
                                       strLen(ii) = LEN(str(ii))          'peek(long, strptr(str(ii)) - 4)
                                    NEXT
                                    ? "ok, 60MB of NEW unsorted strings are loaded, let's sort ascending again..."
                                
                                  TIX t2
                                    ARRAY SORT strLen(), TAGARRAY str()
                                  TIX END t2
                                  ? "That took " & STR$(t2) & " ticks. So TAGARRAY was" & STR$((t1 / t2), 4) & " times faster than COLLATE."
                                '  WAITKEY$
                                
                                END FUNCTION



                                • #17
                                  > .. thing of beauty, however ...

                                  I swear, some people can find a cloud surrounding ANY silver lining...
                                  Michael Mattias
                                  Tal Systems (retired)
                                  Port Washington WI USA
                                  http://www.talsystems.com



                                  • #18
                                    Originally posted by Vidar Hanto
                                    Code:
                                      i = 0
                                      ' The same COLLATE string trick may be used for ARRAY SCAN.
                                      ' 'i' will hold the first entry holding a 5 char word
                                      ARRAY SCAN aWordList(), COLLATE sCharWeight, =SPACE$(5), TO i
                                    
                                      IF i > 0 THEN  ' i < 1 --> No 5-char word found
                                        j = 0
                                        ' 'j'-1 will hold the last entry holding a 5 char word
                                        ' You must start at i - 1 since i may be the only 5-char word,
                                        ' and there is BASE 0 unless you set it otherwise
                                        ARRAY SCAN aWordList(), FROM i - 1 TO UBOUND(aWordList), COLLATE sCharWeight, >SPACE$(5), TO j   ' <=== the line in question
                                      END IF
                                    Vidar,
                                    Thanks, that's more in line with what I had originally hoped to do, but couldn't see how to sort on length - setting the COLLATE string to spaces for the comparison is excellent.

                                    One problem in your ARRAY SCAN syntax (the marked line above), however. The FROM/TO clause selects a range of character positions within each string, whereas we want to change the range of elements that are scanned:

                                    Code:
                                          ARRAY SCAN aWordList(i - 1), COLLATE sCharWeight, >SPACE$(5), TO j
                                    Your code only ran correctly because there were only two 5-letter strings that differed in the right place in their spelling. Change the word to something completely different, or add more words, and it won't find the first word larger than 5 chars...

                                    Took me a while to figure it out, but with the change above, it works as you intended.
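
                                    A minimal sketch of scanning from a given element, using a small numeric array instead of the word list. It assumes, per my reading of the ARRAY SCAN help, that the index returned is relative to the element where the scan starts (1 = the starting element), so it has to be converted back to a subscript. The names n(), StartAt, Rel and Subscript are made up for illustration:

                                    Code:
                                    #COMPILE EXE
                                    #DIM ALL

                                    FUNCTION PBMAIN () AS LONG
                                      DIM n(1 TO 6) AS LONG
                                      LOCAL StartAt, Rel, Subscript AS LONG

                                      n(1) = 2 : n(2) = 4 : n(3) = 5
                                      n(4) = 5 : n(5) = 7 : n(6) = 8

                                      StartAt = 3                           'begin the scan at element 3
                                      ARRAY SCAN n(StartAt), > 5, TO Rel    'Rel = 0 if nothing matches
                                      IF Rel > 0 THEN
                                        Subscript = StartAt + Rel - 1       'relative index back to a subscript (assumption noted above)
                                        ? "First value above 5 is at subscript"; Subscript
                                      END IF
                                      WAITKEY$
                                    END FUNCTION

                                    If that reading is right, the same adjustment applies to j when the scan starts at aWordList(i - 1).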

                                    Thanks!!
                                    -jhm
                                    Last edited by John Montenigro; 31 Jan 2009, 12:34 PM. Reason: deleted a note that I had added; things work fine



                                    • #19
                                      Originally posted by John Gleason
                                      Indeed the COLLATE idea is a thing of beauty, however it does run significantly slower than TAGARRAY on larger arrays...
                                      OK, I'm going to have to put another yellow sticky on my monitor that reminds me of just how fast the TAGARRAY process is.

                                      And then, I'm going to have to remember to act upon that information!

                                      Thanks for the reminder!
                                      -jhm



                                      • #20
                                        Here's another (quick and easy, though probably too simple for advanced minds) method to sort by word length.

                                        '
                                        Code:
                                        'PBWIN 9.00 - WinApi 05/2008 - XP Pro SP3
                                        #Compile Exe                                
                                        #Dim All 
                                        #Include "WIN32API.INC"
                                        #Include "COMDLG32.INC"
                                        
                                        Function PBMain         
                                          ErrClear   
                                          Local s, aWordList(), awl_by_Length() As String
                                          Local ctr As Long
                                          
                                          ReDim aWordList(1 To 7), awl_by_Length(1 To 7)
                                          aWordList(1) = "Long"
                                          aWordList(2) = "Shorter"
                                          aWordList(3) = "Terribly"
                                          aWordList(4) = "Longest"
                                          aWordList(5) = "Short"
                                          aWordList(6) = "Go"
                                          aWordList(7) = "Snort"
                                          s$ = "##"
                                          'now put string length in front for sorting
                                          For ctr = LBound(aWordList()) To UBound(aWordList())
                                             awl_by_Length(ctr) = Using$(s$, Len(aWordList(ctr))) & aWordList(ctr)
                                          Next ctr   
                                        '  
                                          Array Sort awl_by_Length()
                                          
                                          Reset s$ 'Show results
                                          For ctr = LBound(aWordList()) To UBound(aWordList())
                                                      'skip over length 
                                             s$ = s$ & Mid$(awl_by_Length(ctr), 3) & $CrLf 
                                          Next ctr   
                                          
                                          ? s$,,"testing"
                                        End Function 'Applikation kerschplunckened
                                        '
                                        ======================================
                                        I don't want any yes-men around me.
                                        I want everybody to tell me the truth
                                        even if it costs them their jobs.
                                        Samuel Goldwyn
                                        ======================================
                                        It's a pretty day. I hope you enjoy it.

                                        Gösta

                                        JWAM: (Quit Smoking): http://www.SwedesDock.com/smoking
                                        LDN - A Miracle Drug: http://www.SwedesDock.com/LDN/

