Announcement

Collapse
No announcement yet.

Guys need to sort 5000000 string(s) ?

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Guys need to sort 5000000 string(s) ?

    Lazy maybe ,

    I admit

    But for sure one off your guys have a speedy routine somewhere what does what I need.

    Have a file with 5Mil. strings in it each 72 bytes long, and a lot off strings are the same in it.

    So need a routine that read it sort it and make new file with only the unique strings. For a couple off 1000 easy maybe but not for 5.000.000.
    To Ask or Not To Ask ?

  • #2
    Count unique keys and occurrences thereof in sequential file


    That's about 95% of the job already done for you.


    ADDED:

    Well, that post seems to have converted badly. Here's the original

    http://www.powerbasic.com/support/fo...ML/002991.html


    ADDED 5/13/08
    Steve Rossell fixed the version in the new forum.
    Last edited by Michael Mattias; 12 May 2008, 09:58 AM.
    Michael Mattias
    Tal Systems (retired)
    Port Washington WI USA
    [email protected]
    http://www.talsystems.com

    Comment


    • #3
      Nice to have all that memory under Win/32, huh? 5M x 72 bytes is "only" 360M bytes, well under the 2000M limit.

      For extra credit you can come up with a PB/DOS application to do the same thing.

      Sorry, entries limited to those under age forty, as this would be 'old hat' to us Baby Boomers.
      Michael Mattias
      Tal Systems (retired)
      Port Washington WI USA
      [email protected]
      http://www.talsystems.com

      Comment


      • #4
        Opttech Sort
        The world is full of apathy, but who cares?

        Comment


        • #5
          LSORT (http://www.londoncomputing.com/home.html) is free (donationware) and quite poweful. Sort it and write a trivial program to strip the duplicates.
          George

          Comment


          • #6
            Originally posted by Michael Mattias View Post
            Nice to have all that memory under Win/32, huh? 5M x 72 bytes is "only" 360M bytes, well under the 2000M limit.

            For extra credit you can come up with a PB/DOS application to do the same thing.

            Sorry, entries limited to those under age forty, as this would be 'old hat' to us Baby Boomers.
            Indeed!

            Make it a CSV and open it with ADO.
            Way more faster..
            hellobasic

            Comment


            • #7
              O btw, you are using a fixed 'record' (type)
              Using the correct schema you can use that as well.
              I mean not csv but fixed length lines.
              For ado it's all the same..

              Your type data can be written as a straight dump.
              hellobasic

              Comment


              • #8
                Real Men don't need no steenking lsort, Optech Sort, or ADO.
                Michael Mattias
                Tal Systems (retired)
                Port Washington WI USA
                [email protected]
                http://www.talsystems.com

                Comment


                • #9
                  Originally posted by Michael Mattias View Post
                  Real Men don't need no steenking lsort, Optech Sort, or ADO.
                  hellobasic

                  Comment


                  • #10
                    Adrian, here's an example of a way to do it. But you'll need a fairly large amount of memory to do this with 5 million records, and this example is only 1 million recs. So test it with 5 mil to make sure you have enough memory.

                    Also, the "print unique" part only will be quite efficient if you don't have a ton of unique records--as you indicated you don't.

                    Code:
                    #COMPILE EXE
                    #DIM ALL
                    
                    FUNCTION PBMAIN () AS LONG
                       LOCAL record AS STRING, ii AS LONG
                    
                        OPEN "c:\testFileOrig1.txt" FOR BINARY AS #1
                        record = REPEAT$(10000, STRING$(72, 50) & $CRLF)
                        FOR ii = 1 TO 10000 * 74
                           IF ii MOD 74 = 0 THEN           'create 80 or so unique records
                              POKE BYTE, STRPTR(record) + ii - 37, RND(&h30, &h7e)
                           END IF
                        NEXT
                        FOR ii = 1 TO 100                  'create 1000000 record test file
                           PUT #1, , record
                        NEXT
                    
                        CLOSE #1
                        
                        OPEN "c:\testFileOrig1.txt" FOR INPUT SHARED AS #1
                        OPEN "c:\testFileUnique1.txt" FOR OUTPUT AS #2
                        REDIM strArr(1000000) AS STRING    'This will be 5,000,000 for a 5,000,000 record database. Test file is 1,000,000
                        LINE INPUT #1, strArr()
                        ARRAY SORT strArr()
                        IF strArr(0) <> "" THEN            'filter out blank record (optional)
                           PRINT #2, strArr(0)
                        END IF
                        FOR ii = 1 TO 1000000
                           IF strArr(ii) <> strArr(ii - 1) THEN
                                 PRINT #2, strArr(ii)      'print unique records only
                           END IF
                        NEXT
                        
                        CLOSE
                        ? "File ""c:\testFileUnique1.txt"" created."
                           
                    
                    END FUNCTION
                    Last edited by John Gleason; 11 May 2008, 02:35 PM.

                    Comment


                    • #11
                      And if you wish to have more control over the data:

                      Write file 'schema.ini' in the same folder:

                      [testFileOrig1.txt]
                      ColNameHeader=False
                      Format=FixedLength
                      MaxScanRows=0
                      CharacterSet=OEM
                      Col1=F1 Char Width 37
                      Col2=F2 Char Width 1
                      Col3=F3 Char Width 35

                      Connectionstring:
                      Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\;Extended Properties=TEXT;Persist Security Info=False

                      Use for example:
                      SELECT DISTINCT F1, F2, F3 FROM [testFileOrig1#txt]
                      hellobasic

                      Comment


                      • #12
                        Thanks Guys,

                        This got me thinking again and I just remember I have somewhere a copy of good old SuperSort , BigSort And Cwsort hmm back to the archives first as plan A , they used to work for me I remember I did a 6 mil. sort with one off those already years ago, hmm looking for the manuals.

                        And if that wont work for this plan B with one of this forums routines will do the job for sure.
                        To Ask or Not To Ask ?

                        Comment


                        • #13
                          Here is the way I did it when I needed to extract unique values from large list.
                          Extract the unique values, then user array sort.

                          John Tate



                          http://www.powerbasic.com/support/pb...ad.php?t=25042

                          Comment


                          • #14
                            Rpsort V1.03
                            p purvis

                            Comment


                            • #15
                              If this isn't the most telling thread ever supporting my oft-made assertion that today's programmers don't learn fundamentals....

                              "Given a list of items, find the unique items" used to be one of the early lessons from Programming 101.

                              That so many have suggested reaching for a third party library to perform this basic task makes today a sad day.
                              Michael Mattias
                              Tal Systems (retired)
                              Port Washington WI USA
                              [email protected]
                              http://www.talsystems.com

                              Comment


                              • #16
                                FWIW, related: given TWO lists, select those only appearing in A, only those appearing in B, and those appearing in both.

                                Win32: Find Differences in WIN32API.INC and other text files April 21, 2000

                                Typical Application: merge-purge of mailing lists.
                                Michael Mattias
                                Tal Systems (retired)
                                Port Washington WI USA
                                [email protected]
                                http://www.talsystems.com

                                Comment


                                • #17
                                  so true

                                  If this isn't the most telling thread ever supporting my oft-made assertion that today's programmers don't learn fundamentals....

                                  "Given a list of items, find the unique items" used to be one of the early lessons from Programming 101.

                                  That so many have suggested reaching for a third party library to perform this basic task makes today a sad day.

                                  So true indeed.. And so sad..

                                  Makes me feel like a dinosaur.
                                  Explorations v9.10 RPG Development System
                                  http://www.explore-rpg.com

                                  Comment


                                  • #18
                                    Originally posted by Adrian Verkuil View Post
                                    So need a routine that read it sort it and make new file with only the unique strings. For a couple off 1000 easy maybe but not for 5.000.000.
                                    So simple, read an array, list, or sequential file of 5M non-unique strings. Build a binary tree of same. Reject matches as it is built. Export the tree to whatever format you want to use.

                                    Problem - not enough memory to hold the binary tree? Solution, store a checksum and record number in the source stream instead of the string itself, etc etc.

                                    Comment


                                    • #19
                                      self referential structures 101

                                      Originally posted by Chris Holbrook View Post
                                      So simple, read an array, list, or sequential file of 5M non-unique strings. Build a binary tree of same. Reject matches as it is built. Export the tree to whatever format you want to use.
                                      This example is taken straight out of my head. I would be very suprised if it is much different from the Kernighan & Ritchie example which I first saw in 1982.

                                      It can certainly be improved upon. It uses an array to hold the strings and allocates memory for each string, so that's two things to change for a start!

                                      Code:
                                      #COMPILE EXE
                                      #DIM ALL
                                      
                                      #INCLUDE "WIN32API.INC"
                                      %IDD_DIALOG1  =  101
                                      %IDC_LISTBOX1 = 1001
                                      %IDC_LISTBOX2 = 1002
                                      %IDC_LABEL1   = 1003
                                      %IDC_LABEL2   = 1004
                                      
                                      DECLARE CALLBACK FUNCTION ShowDIALOG1Proc()
                                      DECLARE FUNCTION ShowDIALOG1(BYVAL hParent AS DWORD) AS LONG
                                      
                                      GLOBAL dat() AS STRING
                                      GLOBAL gnode AS DWORD
                                      
                                      %ourstringlen = 50
                                      
                                      TYPE t
                                          pleft AS DWORD
                                          pright AS DWORD
                                          svalue AS STRING * %ourstringlen
                                      END TYPE
                                      
                                      SUB loaddat ( n AS LONG )
                                          LOCAL i, j, l AS LONG
                                      
                                          REDIM dat(1 TO n) AS STRING
                                          FOR i = 1 TO n
                                              j = RND(65, 90)
                                              l = RND(1, %ourstringlen)
                                              dat(i) = STRING$(l, j)
                                          NEXT
                                      END SUB
                                      
                                      FUNCTION it (  pnode AS DWORD, s AS STRING) AS DWORD
                                          LOCAL p, q AS t PTR
                                          LOCAL dwresult AS DWORD
                                          p = pnode
                                          IF p = 0 THEN ' make a node
                                              p = globalalloc(%GMEM_FIXED OR %GMEM_ZEROINIT, SIZEOF(t))
                                              IF gnode = 0 THEN gnode = p ' base of tree
                                              @p.svalue = s
                                              FUNCTION = p
                                              EXIT FUNCTION
                                          END IF
                                      
                                          IF TRIM$(s) = TRIM$(@p.svalue) THEN EXIT FUNCTION ' discard duplicate
                                      
                                          IF s > @p.svalue THEN ' keep looking
                                              dwresult = it ( @p.pright, s)
                                              IF dwresult <> 0 THEN @p.pright = dwresult
                                          ELSE
                                              dwresult = it ( @p.pleft, s)
                                              IF dwresult <> 0 THEN @p.pleft = dwresult
                                          END IF
                                      
                                      END FUNCTION
                                      
                                      SUB free(pt AS DWORD)
                                          LOCAL p AS t PTR
                                      
                                          p = pt
                                          IF ( @p.pleft = 0 ) AND (@p.pright = 0) THEN
                                              globalfree(pt)
                                              EXIT SUB
                                          END IF
                                          IF @p.pleft  <> 0 THEN free(@p.pleft)
                                          IF @p.pright <> 0 THEN free(@p.pright)
                                      END SUB
                                      
                                      SUB walk ( hD AS DWORD, lCtl AS LONG, pt AS DWORD)
                                          LOCAL p AS t PTR
                                          p = pt
                                      
                                          IF p <> 0 THEN
                                              walk ( hD, lCtl, @p.pleft)
                                              LISTBOX ADD hD, lCtl, @p.svalue
                                              walk ( hD, lCtl, @p.pright)
                                          END IF
                                      END SUB
                                      
                                      CALLBACK FUNCTION ShowDIALOG1Proc()
                                          LOCAL i AS LONG
                                          SELECT CASE AS LONG CBMSG
                                              CASE %WM_INITDIALOG
                                                  CONTROL ADD LISTBOX, CBHNDL, %IDC_LISTBOX1,dat() , 5, 20, 255, 290
                                      
                                                  FOR i = 1 TO 1000
                                                      it ( gnode, dat(i))
                                                  NEXT
                                                  walk ( CBHNDL, %IDC_listbox2, gnode)
                                      
                                              CASE %WM_DESTROY
                                                  ' give back all that memory
                                                  free (gnode)
                                          END SELECT
                                      END FUNCTION
                                      
                                      FUNCTION ShowDIALOG1(BYVAL hParent AS DWORD) AS LONG
                                          LOCAL lRslt AS LONG
                                          LOCAL hDlg  AS DWORD
                                      
                                          DIALOG NEW hParent, "TreeSortandDedupe", 109, 83, 532, 317, %WS_POPUP OR %WS_BORDER OR %WS_DLGFRAME OR %WS_SYSMENU OR _
                                              %WS_CLIPSIBLINGS OR %WS_VISIBLE OR %DS_MODALFRAME OR %DS_3DLOOK OR %DS_NOFAILCREATE OR %DS_SETFONT, %WS_EX_CONTROLPARENT _
                                              OR %WS_EX_LEFT OR %WS_EX_LTRREADING OR %WS_EX_RIGHTSCROLLBAR, TO hDlg
                                          CONTROL ADD LISTBOX, hDlg, %IDC_LISTBOX2, , 265, 20, 260, 295
                                          CONTROL ADD LABEL,   hDlg, %IDC_LABEL1, "before", 5, 5, 90, 10
                                          CONTROL ADD LABEL,   hDlg, %IDC_LABEL2, "after", 265, 5, 90, 10
                                          DIALOG SHOW MODAL hDlg, CALL ShowDIALOG1Proc TO lRslt
                                          FUNCTION = lRslt
                                      END FUNCTION
                                      
                                      FUNCTION PBMAIN()
                                          loaddat(10000)
                                          ShowDIALOG1 %HWND_DESKTOP
                                      END FUNCTION

                                      Comment


                                      • #20
                                        All above are helpful replies but forget the most important basic for speed. Intel architecture is little endian so first load the file in memory (its only 36 Megs so trivial on a modern PC) then reverse every 4 bytes. After that all comparisons for the sort etc can be done as dwords. The time saved for the compares far exceeds the time of the translation so a decent sort then usually takes less than half the time of a standard ARRAY SORT I posted some comparative times many years ago.
                                        I do them regularly on many 100MB record files using a modification of an ASM quick sort that Hutch posted.
                                        In fact this method can be used for very fast sorts and compares of any type of UDT as it is possible to transform any type ie singles or doubles so that all compares are done as dwords.
                                        Theoretically then for strings the time taken is 25% of the time without the transformation minus the time taken for the transform. So in your situation sort first then only output if new data does not equal last.

                                        Comment

                                        Working...
                                        X