No announcement yet.

Array Sort using Collate Strings

  • Filter
  • Time
  • Show
Clear All
new posts

  • Array Sort using Collate Strings

    hi fellows,

    in the source code forum i have posted two programs on this subject:

    any comments and suggestions are welcome here. information on
    correct sorting sequences for special characters in different
    languages could be very interesting.

    august 11,2001:
    the second of the two programs referred to above needs the winapi file. it has now been adjusted for the new
    version of that file.

    erik christensen, copenhagen, denmark. [email protected]

    [this message has been edited by erik christensen (edited august 11, 2001).]

  • #2
    eric --
    array sort, as i understand, simply compares hexadecimal values.
    if you need exact order of letters, you can convert strings, using translate (see below).
    of course, you need own trttable.
       #compile exe
       #register none
       #dim all
       #include ""
       sub translate (tr as string)
          register i as long, j as long
          i = len(tr): j = strptr(tr)
          ! lea ebx, trttable
          ! mov ecx, i
          ! mov edx, j
          ! add ecx, edx
          ! mov al, [edx]
          ! xlat
          ! mov [edx], al
          ! inc edx
          ! cmp edx, ecx
          ! jne lbtrt
          exit sub
          ! db 000,001,002,003,055,045,046,047,022,005,010,011,012,013,014,015
          ! db 016,017,018,019,060,061,050,038,024,025,063,039,028,029,030,031
          ! db 064,079,127,099,103,108,080,125,077,093,092,078,107,096,075,097
          ! db 240,241,242,243,244,245,246,247,248,249,122,094,076,126,110,111
          ! db 236,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214
          ! db 215,216,217,226,227,228,229,230,231,232,233,181,113,159,095,109
          ! db 081,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150
          ! db 151,152,153,162,163,164,165,166,167,168,169,065,187,071,220,255
          ! db 104,161,121,066,192,068,208,072,082,083,084,087,086,088,123,091
          ! db 224,156,158,203,106,205,219,221,223,124,252,112,177,128,191,007
          ! db 069,085,206,222,073,105,154,155,171,175,186,184,183,170,138,139
          ! db 043,044,009,033,040,101,098,100,180,056,049,052,051,176,178,036
          ! db 034,023,041,006,032,042,070,102,026,053,008,057,054,048,058,090
          ! db 140,172,114,115,116,037,117,118,119,035,021,020,004,204,120,059
          ! db 238,089,235,237,207,239,160,142,174,254,251,253,141,173,188,190
          ! db 202,143,027,185,182,067,225,157,144,189,179,218,250,234,062,074
       end sub
       function pbmain ()
          dim sjob1 as string, sjobct as string, i as long, j as long
          dim t1 as double, t2 as double, t3 as double
          %testlen = 100000: %ntest = 1000
          sjobct = space$(%testlen): sjob1 = sjobct
          j = fix(timer) mod 256
          ' random
          for i = 1 to %testlen
             mid$(sjobct, i, 1) = chr$(j)
             j = ((j * 11) + 7) mod 256
          t1 = timer: for i = 1 to %ntest: lset sjob1 = sjobct: next: t2 = timer
          for i = 1 to %ntest: lset sjob1 = sjobct: translate sjob1: next: t3 = timer
          msgbox format$(1000 * ((t3 - t2) - (t2 - t1)), "# ms")
       end function
    this sub (thanks to steve hutchesson works 100 times faster than pb replace any: see

    e-mail: [email protected]


    • #3

      Thank you for this information about very advanced and fast
      sorting methods using assembler which have previously been
      presented in the Forum. For my purposes the speed of the PB
      ARRAY SORT is sufficiently high, but it will certainly be
      quite easy to convert the edited COLLATE STRING from my
      program to the format specified in the trt.table in your program.

      A few program lines to that effect are shown below:


         LOCAL i%,j%,a$,b$
         LOCAL CS AS STRING ' collate string
         ' This user defined collate string covers the last
         ' 224 characters (from 32 to 255) of the collate string. The
         ' first 32 characters (from 0 to 31), which are not changed
         ' are added later.
         ' The double quote "" near the beginning of the string is to enable
         ' a single " to be included in the string.
         ' See the PowerBasic help file on String Data Types.
         CS=" !""#$%&'()*+,-./1579:;<=>[email protected]\fhjlvxz|~‚ŒŽ’”–ž ¢¤«¶·¸¹º»GQSW[egikuwy{}‹‘“•Ÿ¡£ª¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïð0ñ68òóôõö÷øù234úJLNP¯µ­V^`bdnp rtZ€„†Šˆ³û±˜šœ¦¨üýIKMO®´¬U]_acmoqsYƒ…‰‡²þ°—™›¥§ÿ©"
         ' transformation to trt.table format
         FOR i%=0 TO 15
            b$=b$+"! Db "
            FOR j%=i%*16 TO i%*16+15
                IF i%<2 THEN b$=b$+RIGHT$("000"+LTRIM$(STR$(j%)),3)+","
                IF i%>=2 THEN b$=b$+RIGHT$("000"+LTRIM$(STR$(ASC(CS,j%-31))),3)+","
         MSGBOX b$,,"Collate string to Trt.table format"
         ' the string b$ holding the format could easily be saved to disk
         ' and then loaded in the editor and pasted into your program

      [This message has been edited by Erik Christensen (edited February 05, 2001).]


      • #4

        I have looked more at your code, but I have still
        difficulties in seeing how it really works and how it
        should be implemented in pratical sorting problems. Like
        probably many others in this Forum, I am not programming
        in assembler. Sorry. Can the code - as you seem to
        suggest - be used to speed up array sort using your own
        defined collate string? If so, how does it work and how
        can it be implemented? Perhaps it could be useful to see
        a practical example to illustrate this. Below is an
        example using PowerBasic Array Sort to sort some random
        strings using the collate string shown in the previous

        How could your code be implemented in this setting? It
        could be interesting to see by how much the speed could
        be increased by the assembler code.



        #COMPILE EXE
        #DIM ALL
        %ID_LIST1 = 101
        GLOBAL CollateString AS STRING
            LOCAL hDlg AS LONG
            LOCAL I AS INTEGER
            LOCAL J AS INTEGER
            LOCAL Y AS LONG
            LOCAL font AS LONG
            LOCAL t1 AS DOUBLE
            LOCAL t2 AS DOUBLE
            DIM SortArray(5000) AS GLOBAL STRING
            DIALOG NEW 0,"String Array Sorting",,, 160, 252, %WS_CAPTION OR %WS_SYSMENU TO hdlg
            CONTROL ADD LISTBOX,hDlg,%ID_LIST1,SortArray(),5,5,150,245, _
            CALL MakeCharacterCollateString(CollateString)
            ' make random string array
            FOR I=0 TO 5000
                FOR J=1 TO 30
                    Y = RND(65, 255)
                    SELECT CASE Y
                        'All characters filter
                         CASE 65 TO 90,97 TO 122,192 TO 214, 216 TO 221, 224 TO 246, 248 TO 253,255
                        'English character set filter
                        'CASE 65 TO 90,97 TO 122
                        'Scandinavian character set filter
                        'CASE 65 TO 90,97 TO 122, 196 TO 198,214,216,228 TO 230,246,248
                    END SELECT
            t1 = TIMER ' START TIME OF SORTING
            ' sort array according to the collate string
            ARRAY SORT SortArray(), COLLATE CollateString, ASCEND
            t2 = TIMER ' END TIME OF SORTING
            ' put sorted array in the listbox
            FOR I=0 TO 5000
              LISTBOX ADD hDlg,%ID_LIST1,SortArray(I)
            MSGBOX FORMAT$(1000 * (t2 - t1), "# ms"),,"Sorting time:"
            DIALOG SHOW MODAL hdlg
        SUB MakeCharacterCollateString(BYREF CS AS STRING)
           LOCAL I%
           FOR I%=1 TO 256
           ' this is the same collate string as in the previous posting,
           ' just presented as ASCII codes
           DATA 000,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015
           DATA 016,017,018,019,020,021,022,023,024,025,026,027,028,029,030,031
           DATA 032,033,034,035,036,037,038,039,040,041,042,043,044,045,046,047
           DATA 049,053,055,057,058,059,060,061,062,063,064,065,066,067,068,069
           DATA 070,072,082,084,088,092,102,104,106,108,118,120,122,124,126,130
           DATA 140,142,144,146,148,150,158,160,162,164,171,182,183,184,185,186
           DATA 187,071,081,083,087,091,101,103,105,107,117,119,121,123,125,129
           DATA 139,141,143,145,147,149,157,159,161,163,170,188,189,190,191,192
           DATA 193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208
           DATA 209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224
           DATA 225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240
           DATA 048,241,054,056,242,243,244,245,246,247,248,249,050,051,052,250
           DATA 074,076,078,080,175,181,173,086,094,096,098,100,110,112,114,116
           DATA 090,128,132,134,138,136,179,251,177,152,154,156,166,168,252,253
           DATA 073,075,077,079,174,180,172,085,093,095,097,099,109,111,113,115
           DATA 089,127,131,133,137,135,178,254,176,151,153,155,165,167,255,169
        END SUB


        • #5
          Erik --
          Let's again.
          SUB Translate initially was designed to convert IBM codeset to Microsoft codeset.
          But really it converts any codeset to any codeset.

          Imagine, that you want to sort items in following sequence
          at first letters A or a; then B or b and so on

          First step: translate (using first trttable)
          A to &H00
          a to &H01
          B to &H02
          b to &H03
          It's very fast operation

          Second step: Array Sort without COLLATE
          Third step: restoring in original "codeset" (reverse trttable)

          &H00 -> ASC("A")
          &H01 -> ASC("a")

          Following code illustrates, how to build tables
             #Compile Exe
             #Dim All
             #Register None
             #Include "Win32Api.Inc"
             Function PbMain
                Dim Tr1(255) As Byte, Tr2(255) As Long
                Dim Order As String, i As Long, j As Long, w1 As String, w2 As String
                Order = "AaBbCcDdEe"
                j = Len(order)
                For i = 1 To j
                   Tr1(Asc(Order, i)) = i
                For i = 0 To 255
                   If Tr1(i) = 0 Then Tr1(i) = j: Incr j Else Decr Tr1(i)
                   If (i Mod 16) = 0 Then w1 = w1 + $CRLF + "   ! DB" + Str$(Tr1(i)) Else _
                      w1 = w1 + "," + Str$(Tr1(i))
                MsgBox Mid$(w1$, 3), , "First table"
                For i = 0 To 255: Tr2(Tr1(i)) = i: Next
                For i = 0 To 255
                   If (i Mod 16) = 0 Then w2 = w2 + $CRLF + "   ! DB" + Str$(Tr2(i)) Else _
                     w2 = w2 + "," + Str$(Tr2(i))
                MsgBox Mid$(w2$, 3), , "Second table"     
             End Function
          E-MAIL: [email protected]

          [This message has been edited by Semen Matusovski (edited February 07, 2001).]


          • #6
            In Swedish language, there are a few things to consider when working
            with ANSI character set. For some strange reason, they have put
            two letters in wrong order. "Foreign" (usually from USA) software
            often fails to correct this, which has meant that databases, etc,
            often have proved to be un-usable for us over here.

            The two I'm talking about are ASC 196 and 197 (ÄÅ), plus their lower
            case equivilants, 228 and 229 (äå). These two must be in reversed
            order, ÅÄ, åä instead. Full Swedish alphabet:

            ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ (3 last ones, ASC 197, 196, 214)
            abcdefghijklmnopqrstuvwxyzåäö (3 last ones, ASC 229, 228, 246)

            In addition to this - even if not a part of our alphabet, German Ü
            often is found in names,etc, so we have to deal with that one too,
            ASC 220, or 252 for ü. These two shall be treated as regular Uu.

            Many others to consider, like Spanish Ññ (ASC 209 and 241) that goes
            in under Nn, etc. This goes for all languages, since emigration
            and global work means people with "funny" letters in their names
            lives everywhere today..


            [This message has been edited by Borje Hagsten (edited February 07, 2001).]


            • #7
              Thank you for your fine explanation. Now I see what you mean. I
              misunderstood you at first. Sorry for being so slow.

              Fascinating to see the slight differences between our two brother
              languages. We have all the English characters in common, but we Danes
              have Æ (198) and æ (228) instead of Ä (196) and ä (228), and Ø (216)
              and ø (248) instead of Ö (214) and ö (246), but they are pronounced
              pretty much the same. In Danish Å (197) and å (229) are put at the
              end of the alphabet i.e. after Ø or ø. German Ü (220) and ü (252) are
              put together with Y and y because the pronunciations are the same in

              Danish characters and sequence:
              ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ (last 3 being ASC 198, 216, 197)
              abcdefghijklmnopqrstuvwxyzæøå (last 3 being ASC 230, 248, 229)

              Thus the “American” sequences of ÄÅ and of ÅÆ are out of touch with
              our languages. If Æ and Ä were reversed in position, these sequences
              would be correct.

              In view of the significant differences between languages as close as
              ours, it is no wonder that “standard” solutions cannot be expected to
              work satisfactorily for all. Thus the need for user defined sorting

              Best regards,




              • #8
                hi folks,

                in the source code forum i have posted another array sort program
                using index sort (“in-place” sort):


                all comments are welcome.

                i anticipate that user defined translation tables as described above
                by semen can be used in the program to modify the sorting sequence of string data
                according to need. for sorting of numerical data to be correct i
                would guess (it is just a guess!) that the translation table should
                keep all the numeric characters at their usual position.

                if you try to use translation tables with this method, it would be
                most interesting to see your experiences in this forum.




                [this message has been edited by erik christensen (edited february 11, 2001).]