Announcement

Collapse
No announcement yet.

Finding Unique Records in an Array

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding Unique Records in an Array

    First, I want to thank everyone for some feedback they provided me a few weeks back on how to quickly read a flat-file – everything worked great.

    My next goal is to do some cleaning of the records I have imported.

    The data has been brought into a string array variable ( lets call it A(2000,3) )
    - A(x,1) is Name
    - A(x,2) is Location
    - A(x,3) is Preference

    I know that my data in A(x,1) contains several records for each name (duplicate names, different Location or Preference). I am wanting to output just the Unique names. I was going to run the ARRAY SORT feature on A(x,1) which would sort the data, and after that I was going to go through each record, one at a time, to see if the current record equaled the previous record read, if so, skip it – if not, output that Name and move on to the next record.

    My question is.. Is this the best way? In Excel, I would have used a UNIQUE feature, but cannot seem to find an equiv feature in PBCC.

    Thoughts???

    John

  • #2
    SELECT DISTINCT without SQL

    Hi John!

    Actually, that's a tricky but interesting little problem. Here is a chunk of code I dug up that shows how to do it. Actually, this is a whole little DOS utility I wrote long ago, but the basic algorithm is somewhat generic, and I think you'll be able to figure it out.

    Code:
    TYPE Tree
      Sp As Integer
      Dbh As Integer
      Guess As Integer
      RndNum As Integer
      mDbh As Integer
      mHt As Integer
      mCull As Integer
      Block As Integer
      BkCtr As Integer
      EV As Integer
      Volume As Integer
    END TYPE
    DIM Stem AS Tree
    
    Type IndexFile
      iTNum As Integer
      iPtr As Integer
    End Type
    Dim idx As IndexFile
    
    Dim iPts As Integer, fp1 As Integer, fp2 As Integer
    Dim I As Integer, J As Integer, Count As Integer
    Dim Holder(25) As Integer
    CLS
    fp1=freefile
    Open "C:\Tallies\RawData\16200304.I3P" For Random As #fp1 Len = Len(idx)
    fp2=freefile
    Open "C:\Tallies\RawData\16200304.D3P" For Random As #fp2 Len = Len(Stem)
    iPts = LOF(fp1) / LEN(idx)
    GET #fp1, 1, idx
    Get #fp2,idx.iPtr, Stem
    Holder(1) = Stem.Sp
    Count = 1
    Print "iPts = ";iPts
    Print
    FOR I = 2 To iPts
      GET #fp1, I, idx
      Get #fp2, idx.iPtr, Stem
      FOR J = 1 TO Count
        IF Stem.Sp = Holder(J) THEN
           EXIT FOR
        END IF
        IF J = Count AND Stem.Sp <> 0 THEN
           Count = Count + 1
           Holder(Count) = Stem.Sp
        END IF
      NEXT J
    NEXT I
    Close
    Print "Number of Species = "; Count
    Print
    For I = 1 To Count
      Print I, Holder(I)
    Next I
    End
    Let me give you an idea what the program is doing. I wrote it to scan through a simple database I made in the form of two random access files. The one is an index file that simply contains the record number in the other file where the i'th record can be found (my solution to the 'deletion' problem!). What the program is trying to do is find the unique species codes in a file containing many thousands of records. The nature of the thing is that a typical number of unique species codes would be 10 or 15. That is why you see an array statically dimensioned to hold 25 elements - Holder(). That is the array into which unique species codes go. Another important variable is Count. We can be certain that if the file contains one ledgitimate record there will be at least one unique species code so count is set equal to '1' and Holder(1) gets the 1st species code put into it.

    Note that in the double For loops i loops to the end of the data file and j loops to Count (1 when the program starts). Then the program reads each record in the file and then for each record loops from 1 To the Count of however many unique species codes have been found. It loops through all the contents of Holder() looking for a match. If it finds a match in Holder() it simply exits the inner 'j' For loop and then the outer loop kicks in and reads the next record.

    If j runs from 1 to Count and doesn't find a match in Holder(), then another unique code has been found. It is added at the end of Holder(), Count is incremented, and the outer loop continues to the end of the file.

    Hope this helps. You should be able to adapt it to your needs.

    I wrote this ages ago, but its one of those things I use a couple time a year. I'm sure I've a shorter example somewhere, but wasn't able to quickly locate it so I found this for you. The nature of the algorithm is such that the gist of it is only about 10 or 12 lines of code.
    Last edited by Fred Harris; 15 May 2009, 09:17 PM.
    Fred
    "fharris"+Chr$(64)+"evenlink"+Chr$(46)+"com"

    Comment


    • #3
      It bothered me to post an algorithm embedded in a program whose context was only familiar to me, so I took a few minutes to extract out of it the necessary parts. Here is that code. I did it in PBCC 4, but it should run in 5. You know, it wouldn't be a bad idea if you would think this through yourself before looking at my code. When I first run into this a good many years back it took me a few hours to figure it out, so don't get discouraged. The general idea is that you need to read an element out of the larger data file or array and compare it against some other structure in which are contained the unique/distinct entities so far determined, and of these you must keep a running count. When each record from the larger structure has been compared to the unique names in the array, and if no match has been found, the unique count has to be incremented, and the newly found element stored at that position. If a match is found you simply do the next comparison.

      Code:
      'PBCC
      #Compile Exe
      #Dim All
      
      Sub SelectDistinct(strN() As String, strU() As String, iRet As Long)
        Register i As Long
        Register j As Long
        Local iCount As Long
      
        strU(1)=strN(1,0) : iCount=1
        For i=2 To UBound(strN,1)
          For j=1 To iCount
            If strN(i,0)=strU(j) Then
               Exit For
            End If
            If j=iCount Then
               Incr iCount
               strU(iCount)=strN(i,0)
            End If
          Next j
        Next i
        iRet=iCount
      End Sub
      
      
      Function PBMain() As Long
        Local strUnique() As String
        Local strNames() As String
        Local iReturn As Long
        Register i As Long
      
        Redim strNames(8,2) As String
        Redim strUnique(8) As String
        strNames(1,0) = "Fred"  : strNames(1,1) = "Shamokin" : strNames(1,2)="I Like Tequilla Sunrises"
        strNames(2,0) = "Mark"  : strNames(2,1) = "Numedia"  : strNames(2,2)="I Like Coke"
        strNames(3,0) = "Fred"  : strNames(3,1) = "Shamokin" : strNames(3,2)="I Like Mountain Dew"
        strNames(4,0) = "Mark"  : strNames(4,1) = "Numedia"  : strNames(4,2)="I Like Turkey Hunting"
        strNames(5,0) = "Elsie" : strNames(5,1) = "Shamokin" : strNames(5,2)="I Like Shopping (she's Fred's Wife"
        strNames(6,0) = "Fred"  : strNames(6,1) = "Shamokin" : strNames(6,2)="Eined Guten Roten Wine Trinke Ich Sehr Gern"
        strNames(7,0) = "John"  : strNames(7,1) = "Loganton" : strNames(7,2)="John Got Himself A Cow (He's My Boss)"
        strNames(8,0) = "Fred"  : strNames(8,1) = "Shamokin" : strNames(8,2)="I Also Like Enchilladas"
        Call SelectDistinct(strNames(), strUnique(), iReturn)
        Print "Found" iReturn "Unique Names!  They Are:"
        Print
        For i=1 To iReturn
          Print i, strUnique(i)
        Next i
        Erase strNames, strUnique
        Waitkey$
      
        PBMain=0
      End Function
      
      'Output
      '================================
      'Found 4 Unique Names!  They Are:
      '
      ' 1            Fred
      ' 2            Mark
      ' 3            Elsie
      ' 4            John
      Last edited by Fred Harris; 15 May 2009, 11:24 PM. Reason: Add Output
      Fred
      "fharris"+Chr$(64)+"evenlink"+Chr$(46)+"com"

      Comment


      • #4
        Just reread your idea about Array Sort. That sounds to me like it would work too. Give it a try. Back when I wrote my algorithm ( 20 years ago ) I don't believe I had an 'Array Sort'. I'm not sure which would be faster. Would have to check it out on large data sets.
        Fred
        "fharris"+Chr$(64)+"evenlink"+Chr$(46)+"com"

        Comment


        • #5
          John,

          Since this program will be used only once the speed does not matter much.
          I think ARRAY SORT is the way to go as the code will be simpler and will take less time to write.
          Old QB45 Programmer

          Comment


          • #6
            Count unique keys and occurrences thereof in sequential file

            PS:
            I was going to run the ARRAY SORT feature ... go through each record, one at a time, to see if the current record equaled the previous record....
            My question is.. Is this the best way?
            Yes.
            Last edited by Michael Mattias; 16 May 2009, 09:20 AM.
            Michael Mattias
            Tal Systems (retired)
            Port Washington WI USA
            [email protected]
            http://www.talsystems.com

            Comment


            • #7
              No Sledge Hammer Approaches, Please!

              Guy Dombrowski Wrote…

              I think ARRAY SORT is the way to go as the code will be simpler and will take less time to write.
              First, I don’t think there is any code to write, because I’ve already written it for John (although I think he should try to understand it).

              Second, I’d never recommend using a powerful and very high level language feature that is, however, inevitably slow, such as any sorting algorithm (however sophisticated) must necessarily be, to accomplish a simpler task not requiring a ‘sledge hammer’ approach. Selecting the distinct entities in a field doesn’t require the time consuming process of sorting.

              Just out of curiousity I ran a simpler version of the above code where I duplicated 8000 entries of the strName() field in a one dimensional array. I ran it using my algorithm first, then the Array Sort method. Here are both programs and the results using QueryPerformanceCounter() after each program. Not even close! (This is something I do quite a lot in the type of work I do, and I thought it was worth my while testing this - as I had never used Array Sort).

              Code:
              #Compile Exe
              #Dim All
              #Include "Win32Api.inc"
              
              Sub SelectDistinct(strN() As String, strU() As String)
                Register i As Long
                Register j As Long
                Local iCount As Long
              
                strU(1)=strN(1) : iCount=1
                For i=2 To UBound(strN,1)
                  For j=1 To iCount
                    If strN(i,0)=strU(j) Then
                       Exit For
                    End If
                    If j=iCount Then
                       Incr iCount
                       Redim Preserve strU(iCount)
                       strU(iCount)=strN(i,0)
                    End If
                  Next j
                Next i
              End Sub
              
              
              Function PBMain() As Long
                Local lpPerformanceCount1,lpPerformanceCount2 As Quad
                Local strUnique(),strName() As String
                Register i As Long
              
                Redim strName(8000) As String     'Use data set of 8000 names
                Redim strUnique(1) As String
                For i=1 To 1000                   'Duplicate a thousand times
                  strName(8*(i-1)+1)="Fred"
                  strName(8*(i-1)+2)="Mark"
                  strName(8*(i-1)+3)="Fred"
                  strName(8*(i-1)+4)="Mark"
                  strName(8*(i-1)+5)="Elsie"
                  strName(8*(i-1)+6)="Fred"
                  strName(8*(i-1)+7)="John"
                  strName(8*(i-1)+8)="Fred"
                Next i
                Call QueryPerformanceCounter(lpPerformanceCount1)
                Call SelectDistinct(strName(), strUnique())           'Call my almost original algorithm
                Call QueryPerformanceCounter(lpPerformanceCount2)
                Print "Ticks Used: " lpPerformanceCount2 - lpPerformanceCount1
                Print "====================="
                For i=1 To UBound(strUnique)
                  Print i, strUnique(i)
                Next i
                Erase strName, strUnique
                Waitkey$
              
                PBMain=0
              End Function
              
              'Ticks Used:  4201
              '=====================
              ' 1            Fred
              ' 2            Mark
              ' 3            Elsie
              ' 4            John

              Then using Array Sort

              Code:
              #Compile Exe
              #Dim All
              #Include "Win32Api.inc"
              
              Function PBMain() As Long
                Local lpPerformanceCount1,lpPerformanceCount2 As Quad
                Local strUnique(),strName() As String
                Register iCount As Long
                Register i As Long
                
                Redim strName(8000) As String
                Redim strUnique(1) As String
                For i=1 To 1000
                  strName(8*(i-1)+1)="Fred"
                  strName(8*(i-1)+2)="Mark"
                  strName(8*(i-1)+3)="Fred"
                  strName(8*(i-1)+4)="Mark"
                  strName(8*(i-1)+5)="Elsie"
                  strName(8*(i-1)+6)="Fred"
                  strName(8*(i-1)+7)="John"
                  strName(8*(i-1)+8)="Fred"
                Next i
                Call QueryPerformanceCounter(lpPerformanceCount1)
                Array Sort strName()
                For i=1 To UBound(strName,1)
                  If i=1 Then
                     iCount=1
                     strUnique(iCount)=strName(iCount)
                  Else
                     If strName(i)<>strName(i-1) Then
                        Incr iCount
                        Redim Preserve strUnique(iCount)
                        strUnique(iCount)=strName(i)
                     End If
                  End If
                Next i
                Call QueryPerformanceCounter(lpPerformanceCount2)
                Print  "Ticks Used: " lpPerformanceCount2 - lpPerformanceCount1
                Print  "===================="
                For i=1 To UBound(strUnique,1)
                  Print i, strUnique(i)
                Next i
                Erase strName,strUnique
                Waitkey$
              
                PBMain=0
              End Function
              
              'Ticks Used:  22308
              '====================
              ' 1            Elsie
              ' 2            Fred
              ' 3            John
              ' 4            Mark
              Now, I've never used Array Sort before, although I've nothing against it. It seems to work real nice and from now on when I need to sort data I'll be strongly inclined to use it. But I wouldn't use it to get the distinct data from a field.
              Last edited by Fred Harris; 18 May 2009, 05:19 PM.
              Fred
              "fharris"+Chr$(64)+"evenlink"+Chr$(46)+"com"

              Comment


              • #8
                >But I wouldn't use it to get the distinct data from a field.

                Why not? The demo to which I posted a link in post #6 just does that.

                So does this one, although from the thread title it's not quite so obvious:
                Win32: Find Differences in WIN32API.INC and other text files April 21, 2000

                That's actually a generic 'merge-purge.'

                This is amazing. On one day we have one post alleging the use of INSTR in a loop is a better way to count occurences of a token in a string than is TALLY; and here someone implying there is a problem with ARRAY SORT.

                You guys must have access to the 'super secret bug list' because I have never had a problem with either ARRAY SORT or TALLY.

                MCM
                Michael Mattias
                Tal Systems (retired)
                Port Washington WI USA
                [email protected]
                http://www.talsystems.com

                Comment


                • #9
                  Guy and MCM, I concur. Fred, your technique is the fastest in the above example, but only if there are just a few unique records. As the percentage of unique records increases, it slows dramatically. Here's code comparing 1000 unique records in the 8000 total:
                  Code:
                  #COMPILE EXE
                  #DIM ALL
                  #INCLUDE "Win32Api.inc"
                  
                  SUB SelectDistinct(strN() AS STRING, strU() AS STRING)
                    REGISTER i AS LONG
                    REGISTER j AS LONG
                    LOCAL iCount AS LONG
                  
                    strU(1)=strN(1,0) : iCount=1
                    FOR i=2 TO UBOUND(strN,1)
                      FOR j=1 TO iCount
                        IF strN(i,0)=strU(j) THEN
                           EXIT FOR
                        END IF
                        IF j=iCount THEN
                           INCR iCount
                           REDIM PRESERVE strU(iCount)
                           strU(iCount)=strN(i,0)
                        END IF
                      NEXT j
                    NEXT i
                  END SUB
                  
                  
                  FUNCTION PBMAIN() AS LONG
                    LOCAL lpPerformanceCount1,lpPerformanceCount2 AS QUAD
                    LOCAL strUnique(),strName() AS STRING
                    LOCAL i, j AS LONG
                  
                    RANDOMIZE 23.3343e8
                    REDIM strName(8000) AS STRING     'Use data set of 8000 names
                    REDIM strUnique(1) AS STRING
                    FOR i=1 TO 1000                   'Duplicate a thousand times
                      strName(8*(i-1)+1)=MKL$(RND(1000000000, 1000001000))'"Fred"
                      strName(8*(i-1)+2)=MKL$(RND(1000000000, 1000001000))'"Mark"
                      strName(8*(i-1)+3)=MKL$(RND(1000000000, 1000001000))'"Fred"
                      strName(8*(i-1)+4)=MKL$(RND(1000000000, 1000001000))'"Mark"
                      strName(8*(i-1)+5)=MKL$(RND(1000000000, 1000001000))'"Elsie"
                      strName(8*(i-1)+6)=MKL$(RND(1000000000, 1000001000))'"Fred"
                      strName(8*(i-1)+7)=MKL$(RND(1000000000, 1000001000))'"John"
                      strName(8*(i-1)+8)=MKL$(RND(1000000000, 1000001000))'"Fred"
                  
                    NEXT i
                    CALL QueryPerformanceCounter(lpPerformanceCount1)
                    CALL SelectDistinct(strName(), strUnique())           'Call my almost original algorithm
                    CALL QueryPerformanceCounter(lpPerformanceCount2)
                    PRINT "Ticks Used: " lpPerformanceCount2 - lpPerformanceCount1
                    PRINT "Unique Strings" UBOUND(strUnique)
                    PRINT "====================="
                    FOR i=1 TO 30
                      IF strUnique(i) <> "" THEN PRINT i, strUnique(i);
                    NEXT i
                    ERASE strName, strUnique
                    PRINT
                    PRINT
                    PRINT "now use array sort"
                    PRINT
                  ' Waitkey$
                  
                    RANDOMIZE 23.3343e8
                    REDIM strName(8000) AS STRING     'Use data set of 8000 names
                    REDIM strUnique(8000) AS STRING
                    FOR i=1 TO 1000                   'Duplicate a thousand times
                      strName(8*(i-1)+1)=MKL$(RND(1000000000, 1000001000))'"Fred"
                      strName(8*(i-1)+2)=MKL$(RND(1000000000, 1000001000))'"Mark"
                      strName(8*(i-1)+3)=MKL$(RND(1000000000, 1000001000))'"Fred"
                      strName(8*(i-1)+4)=MKL$(RND(1000000000, 1000001000))'"Mark"
                      strName(8*(i-1)+5)=MKL$(RND(1000000000, 1000001000))'"Elsie"
                      strName(8*(i-1)+6)=MKL$(RND(1000000000, 1000001000))'"Fred"
                      strName(8*(i-1)+7)=MKL$(RND(1000000000, 1000001000))'"John"
                      strName(8*(i-1)+8)=MKL$(RND(1000000000, 1000001000))'"Fred"
                    NEXT i
                    CALL QueryPerformanceCounter(lpPerformanceCount1)
                       ARRAY SORT strName()
                       FOR i = 1 TO 8000
                          IF strName(i) <> strName(i - 1) THEN
                             strUnique(j) = strName(i)
                             INCR j
                          END IF
                       NEXT
                    CALL QueryPerformanceCounter(lpPerformanceCount2)
                    PRINT "Ticks Used: " lpPerformanceCount2 - lpPerformanceCount1
                    PRINT "Unique Strings" j
                    PRINT "====================="
                    FOR i=1 TO 30           'UBound(strUnique)
                      PRINT i, strUnique(i);
                    NEXT i
                    ERASE strName, strUnique
                    WAITKEY$
                    
                  
                    PBMAIN=0
                  END FUNCTION
                  Last edited by John Gleason; 18 May 2009, 07:45 PM.

                  Comment


                  • #10
                    Mr. Gleason, you should have pointed out that your update of Mr. Harris' code eliminates (nUnique-1) unnecessary REDIM PRESERVE operations by initially sizing the unique array to the 'worst case scenario' of each element = unique and then re-sizing to fit only when done finding the unique items.

                    Elminating unnecessary operations like this is pure technique and has little to do with the choice of tools (eg ARRAY SORT) used. That is, it's not ARRAY SORT which is 'fast' or 'slow', it's how it's used.

                    Something about paintbrushes and artists is trying to come out here.

                    MCM
                    Michael Mattias
                    Tal Systems (retired)
                    Port Washington WI USA
                    [email protected]
                    http://www.talsystems.com

                    Comment


                    • #11
                      Unless there is a reason to keep the original data in the same order I would simply read it in an array, Array Sort it and then using the sorted array simply write the result to disk skipping the duplicated records.
                      Old QB45 Programmer

                      Comment


                      • #12
                        I would simply read it in an array, Array Sort it and then using the sorted array simply write the result to disk skipping the duplicated records.
                        You mean kind of like the way I did it in the link in post #6 above?

                        It's not like it's terribly innovative, I think people have been doing it this way for about forty years. Good enough for Grandpa, good enough for me.
                        Michael Mattias
                        Tal Systems (retired)
                        Port Washington WI USA
                        [email protected]
                        http://www.talsystems.com

                        Comment


                        • #13
                          I would give 2 to 1 odds that behind Excel's UNIQUE function is a SORT of some type. One can use a TAG array if you need to maintain original order and if not I think its actually faster to reorder strings on the tag.

                          Comment


                          • #14
                            Originally posted by Michael Mattias View Post
                            You mean kind of like the way I did it in the link in post #6 above?

                            It's not like it's terribly innovative, I think people have been doing it this way for about forty years. Good enough for Grandpa, good enough for me.
                            If it is stupid and it works it is not stupid
                            Old QB45 Programmer

                            Comment


                            • #15
                              Unique elements

                              Code:
                              'Unique elements is simple.
                              'Maintaining original order might take some work since it is not a stable sort.
                              'A stable sort with option to remove duplicates would be great.
                              #COMPILE EXE
                              #DIM ALL
                              FUNCTION PBMAIN () AS LONG
                                LOCAL x AS LONG, low AS LONG, high AS LONG, counter AS LONG, sPrevious AS STRING
                                low = 0
                                high = 7
                                REDIM sOriginal (low TO high) AS STRING
                                REDIM Tag(low TO high) AS LONG
                                FOR x = low TO high: tag(x)=x:NEXT  'fill tag array
                                ARRAY ASSIGN sOriginal() = "First","C","A","B","B","C","A","Last"
                                ? "VALUE","TAG"
                                ? "-----","---"
                                FOR x = low TO high
                                  ? sOriginal(x), x
                                NEXT
                                ARRAY SORT sOriginal(), TAGARRAY Tag()
                                REDIM Unique(high-low+1) AS LONG
                                counter = 0
                                FOR x = low TO high
                                  IF sOriginal(x) <> sPrevious THEN
                                    sPrevious = sOriginal(x)
                                    INCR counter
                                    Unique(counter) = tag(x)
                                  END IF
                                NEXT
                                ARRAY SORT tag(), TAGARRAY sOriginal() 'put back into original order
                                REDIM PRESERVE Unique(Counter)
                                ARRAY SORT unique()                 'put tags into order
                                ?
                                ? "VALUE","TAG"
                                ? "-----","---"
                                FOR x = 1 TO counter
                                  ? sOriginal(unique(x)), unique(x)
                                NEXT
                                WAITKEY$
                              END FUNCTION
                              Last edited by Mike Doty; 20 May 2009, 08:47 AM.
                              CMD shortcut to open any location:
                              %windir%\system32\cmd.exe /k " cd\ & x: & cd x:\xxxx
                              Change to run as administrator
                              How long is an idea? Write it down.

                              Comment


                              • #16
                                The way I would go about this to generate a hash which is unique then check if the hash is already in use.
                                So here we are, this is the end.
                                But all that dies, is born again.
                                - From The Ashes (In This Moment)

                                Comment


                                • #17
                                  >If it is stupid and it works it is not stupid

                                  I think I like that.
                                  Michael Mattias
                                  Tal Systems (retired)
                                  Port Washington WI USA
                                  [email protected]
                                  http://www.talsystems.com

                                  Comment

                                  Working...
                                  X