Announcement

Collapse
No announcement yet.

Bag of Strings

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bag of Strings

    Any discussions for source code found here.

    This bag class contains a unique collection of strings. Once a string is added to the bag, the index will remain stable.

    I'm seeing about 6,000 tix per added entry and 1,200 tix to find an entry tested to 1,000,000 strings.

    This code started as a port of C# code found here on CodePlex.
    LarryC
    Website
    Sometimes life's a dream, sometimes it's a scream

  • #2
    interesting code and reference, Larry

    first time I came across "string interning"
    what would you use it for?
    stanthemanstan~gmail
    Dead Theory Walking
    Range Trie Tree
    HLib ~ Free Data Container Lib ~ Arrays, Lists, Stacks, Queues, Deques, Trees, Hashes

    Comment


    • #3
      Keeps you from needing to store a string more than once. Rather than listen to my clumsy attempt, you can read wiki's take here.

      btw that wasn't why I created the classes. I prefer type safe collections and wanted an option or two besides an array and I prefer classes. Probabbly would have been easier to adapt some of Stanley's code but I also am thinking of converting some .Net code I wrote and wanted to see what the experience was like.

      Edit: Oh also I use hashset, list, stack, and dictionary a ton in .net so it feels more comfortable to me to have those tools around when thinking about problems.
      Last edited by Larry Charlton; 10 May 2011, 06:23 PM.
      LarryC
      Website
      Sometimes life's a dream, sometimes it's a scream

      Comment


      • #4
        Hi Larry.
        Any chances it can work on PBCC5?
        Can it be used to create a dictionary of unique words contained inside a text file?
        Should it be faster than an hash table or binary tree doing the same thing?

        Questions come a bit late, but I'm considering this problem now.

        Salvatore

        Comment


        • #5
          It is a hashtable and it did seem fairly quick out to the million entries I tested it with. It will take a couple of changes to compile with PBCC5.
          1. You need to remove the word COMMON from all source.
          2. Remove the SLL link from test.bas.
          3. Txt.XXX should have the console equivalent (PRINT, WAITKEY$) in test.bas.
          %INCLUDE = 1
          #If %Def(%INCLUDE)
          #Include "StringBag.bas"
          #Else
          #Link "StringBag.sll"
          #EndIf

          s/b
          #Include "StringBag.bas"
          Be sure it's large enough before you start adding words by using Expand.

          I also have another version here. The Bag is just that. It holds unique strings. If you want a count, you'd need a Map (StringLongMap).

          Stanley Durham also has some non-class versions of containers that are pretty quick. Not sure if they run on PBCC5 atm but they used to.

          If you have trouble getting it to compile, holler, I should be able to dig up the PBCC5 compiler.
          LarryC
          Website
          Sometimes life's a dream, sometimes it's a scream

          Comment


          • #6
            >Inherit IUnknown

            You might have made this DUAL so members who have not upgraded to PBCC/6 or PBWIN/10 could use it. (easily, that is).

            Also since your brothers and sisters with PB/CC 5- and PB/Win 9- don't have access to the new "iPowerCollection" object, they would be the most likely benficiaries of your library.

            And since those same brothers and sisters cannot compile that software, a precompiled DLL might be a nice addition to your "bundle."

            MCM
            Last edited by Michael Mattias; 11 Feb 2013, 09:56 AM.

            Comment


            • #7
              Originally posted by Michael Mattias View Post
              >Inherit IUnknown

              You might have made this DUAL so members who have not upgraded to PBCC/6 or PBWIN/10 could use it. (easily, that is).

              Also since your brothers and sisters with PB/CC 5- and PB/Win 9- don't have access to the new "iPowerCollection" object, they would be the most likely benficiaries of your library.

              And since those same brothers and sisters cannot compile that software, a precompiled DLL might be a nice addition to your "bundle."

              MCM
              I guess I'm not following you completely. The code is designed to compile using includes and should compile with the minor changes noted. I put a DLL version in the source code library if anyone want's it. It has the DLL in the zip file along with two PBCC test routines (include and DLL versions)
              LarryC
              Website
              Sometimes life's a dream, sometimes it's a scream

              Comment


              • #8
                I guess I'm not following you completely. The code is designed to compile using includes ..
                What I was suggesting was, for those who cannot compile the function because they do not have the same level compiler as do you, it might be nice to provide something usable like a DLL.

                I thought this particularly relevant because PBCC/6+ and PBWIN/10+ users might find the intrinisic 'IPowerCollection' object a bit more comfortable. You have a couple of features in your "bag of strings" not in iPowerCollection but if you don't need those particular features, well, you don't need those features.

                Especially with the change to "Wide Char Default" and the massive change in Windows' headers, contributors to the Source Code Forum might want to think about how your no-doubt super-duper whatever it is could be used by members not running the same compiler version(s) as do you... and as is often the case, how members who have licensed for ONE of the compilers - Console Compiler or Windows GUI compiler - can use code you have contributed, code you built using the OTHER one.

                MCM

                Comment


                • #9
                  Originally posted by Larry Charlton View Post
                  It is a hashtable and it did seem fairly quick out to the million entries I tested it with. It will take a couple of changes to compile with PBCC5.
                  1. You need to remove the word COMMON from all source.
                  2. Remove the SLL link from test.bas.
                  3. Txt.XXX should have the console equivalent (PRINT, WAITKEY$) in test.bas.
                  %INCLUDE = 1
                  #If %Def(%INCLUDE)
                  #Include "StringBag.bas"
                  #Else
                  #Link "StringBag.sll"
                  #EndIf

                  s/b
                  #Include "StringBag.bas"
                  Be sure it's large enough before you start adding words by using Expand.

                  I also have another version here. The Bag is just that. It holds unique strings. If you want a count, you'd need a Map (StringLongMap).

                  Stanley Durham also has some non-class versions of containers that are pretty quick. Not sure if they run on PBCC5 atm but they used to.

                  If you have trouble getting it to compile, holler, I should be able to dig up the PBCC5 compiler.
                  Thanks Larry for your explanations.

                  Now I can probably give it a try.

                  Michael has said something that is a good point to remember.
                  I have to say that many programmers with nice libraries of code put on public
                  domain or freeware, are often giving for granted that everybody has the last
                  version of the compilers.

                  But it could happen that this is not the case, and I often read about functions
                  or features that are not on the version I'm using [PBCC 5.05], and this makes
                  me feeling a bit confused, until I realize that maybe they are talking about new
                  features, and then I try something that has the same funtionality, or similar.

                  It would be nice, for those programmers who decide to put their code on the
                  forum, or in their websites, to state in a clear way what version are they using,
                  and a possible alternative for those with previous compilers.

                  Salvatore
                  Last edited by salvatore renda; 11 Feb 2013, 06:00 PM.

                  Comment


                  • #10
                    Hi Larry.
                    I'm trying to use your bag of strings, but I get an error during compilation:
                    ==============================
                    PowerBASIC Console Compiler
                    PB/CC Version 5.05.
                    Copyright (c) 1998-2010 PowerBasic Inc.
                    Englewood, Florida USA
                    All Rights Reserved

                    Error 526 in N:\WORDSC~1\WordsCount2.bas(1022:009): Period not allowed
                    Line 1022: tst.Add( OneString )
                    ==============================
                    Compile failed at 10:47:37 on 14/02/2013
                    I Have the include:

                    Code:
                     #COMPILE EXE
                     #DIM ALL
                    
                     #INCLUDE "win32api.inc"
                     #INCLUDE "COMDLG32.INC"
                     #INCLUDE "StringBag.inc"
                    and the DIM:
                    Code:
                    FUNCTION PBMAIN () AS LONG
                    
                     DIM tst AS iStringBag
                     tst = CLASS "cStringBag"
                    but when I try to add a string to the bag, I get the error "period not allowed"
                    Code:
                            tst.Add( OneString )
                    I've probably missed something. I'm using the includes, not the DLL, but I
                    think it should work fine as well.
                    The offending code is inside a SUB.

                    Edit: I found that it is necessary to DIM tst inside the SUB in order to compile it.
                    Last edited by salvatore renda; 14 Feb 2013, 10:00 AM.

                    Comment


                    • #11
                      If my prog has to deal with 16K of unique words, how I DIM the bag
                      to be big enough for that amount of strings?

                      Thanks

                      Salvatore

                      Comment


                      • #12
                        tst.Expand( 16000 * 1.25 )

                        Would work. A bit of background. Hash tables generally work best if the size of the hash table is prime and if when "full" they have around 25% free slots (fewer collisions). The class automatically picks a reasonable prime larger than the value you specify. It doesn't have every prime so it uses increasing gaps. If you want the gap smaller that what it will pick for you, you can add your own prime or replace the function.

                        Mostly the size is a performance thing that helps in reducing hash collisions. You might find you get great performance using 8000, 4000, 2000, or even 1000. Regardless of the value you pick in expand you should be able to store any number of entries in the table. If you're performance seems low, increase the size in expand. Increasing past the total number of values you have is unlikely to make a difference.

                        I'll see if I can find my install media tonight and go through an actual compile cycle with PBCC 5.0. Sorry it's giving you grief, not sure why it would complain about a period yet.
                        LarryC
                        Website
                        Sometimes life's a dream, sometimes it's a scream

                        Comment


                        • #13
                          Originally posted by Larry Charlton View Post
                          tst.Expand( 16000 * 1.25 )

                          Would work. A bit of background. Hash tables generally work best if the size of the hash table is prime and if when "full" they have around 25% free slots (fewer collisions). The class automatically picks a reasonable prime larger than the value you specify. It doesn't have every prime so it uses increasing gaps. If you want the gap smaller that what it will pick for you, you can add your own prime or replace the function.

                          Mostly the size is a performance thing that helps in reducing hash collisions. You might find you get great performance using 8000, 4000, 2000, or even 1000. Regardless of the value you pick in expand you should be able to store any number of entries in the table. If you're performance seems low, increase the size in expand. Increasing past the total number of values you have is unlikely to make a difference.

                          I'll see if I can find my install media tonight and go through an actual compile cycle with PBCC 5.0. Sorry it's giving you grief, not sure why it would complain about a period yet.
                          Thanks Larry for the explanations.
                          I think I understand now why the compiler complains, if I don't declare
                          Code:
                           DIM tst AS iStringBag
                           tst = CLASS "cStringBag"
                          inside the SUB in which I use it, the compiler doesn't know what tst is
                          and probably it assumes it is a variable, not a CLASS.
                          Declaring "tst" inside the SUB, the compiler doesn't complain anymore.
                          I didn't test yet for performance, and I don't actually know if it works, but
                          adapting the code to my pgm is taking a while.
                          Probably also declaring "tst" as GLOBAL could solve the compilation problem.
                          But I didn't try it.

                          Salvatore

                          Comment


                          • #14
                            Ooooopssss. Another compilation problem:
                            Code:
                             FUNCTION PBMAIN () AS LONG
                            
                             DIM tst AS iStringBag
                             tst = CLASS "cStringBag"
                             tst.Expand( 16000 * 1.25 )
                            PowerBASIC Console Compiler
                            PB/CC Version 5.05.
                            Copyright (c) 1998-2010 PowerBasic Inc.
                            Englewood, Florida USA
                            All Rights Reserved

                            Error 598 in N:\WORDSC~1\WordsCount2.bas(220:014): METHOD or PROPERTY name expected
                            Line 220: tst.Expand( 16000 * 1.25 )
                            ==============================
                            Compile failed at 02:19:12 on 15/02/2013

                            Comment


                            • #15
                              I downloaded this version and test.bas compiled without change in PBCC 5.05 on my computer.

                              Took another look at the code and Expand is not necessary. It's called automatically when needed to accomodate larger collections.

                              The second problem was trying to put an expression in a method call. Switching between languages does that to a person If you could call Expand (which you can't because it's a class method), you would have to calculate the size and pass the result or put in a constant.

                              DIM collectionSize as LONG
                              collectionSize = 16000 * 1.25
                              tst.Expand( collectionSize )

                              or

                              tst.Expand( 20000 )

                              Again neither of those work in this case because it's a class method and the calling of expand is done automatically when needed. Sorry about that.

                              I don't see code, but I'm assuming OneString is a STRING variable that's populated with a value? I altered the test code and it seems to work here, could you provide a bit more code around the error?

                              Code:
                                  FOR i=0 TO UBOUND( words() )
                                      DIM oneString AS STRING
                                      oneString = words(i)
                                      tst.Add( oneString )
                                  NEXT
                              If you want to use an instance of a class in a variable in a subroutine, and the variable is instantiated outside the subroutine, you would need to either declare the variable globally or pass it by reference to the subroutine.

                              Code:
                              Sub MySub( ByRef tst AS iStringBag )
                              ...
                              End Sub
                              If you instantiate (tst = CLASS "cStringBag") locally in the sub routine, you could use a local declaration:
                              Code:
                              Local tst AS iStringBag
                              I don't have any experience with implicitly declared variables. My feeling is implicit declarations would not work for class variables but I'm not certain.
                              LarryC
                              Website
                              Sometimes life's a dream, sometimes it's a scream

                              Comment


                              • #16
                                One last thought. It would make sense to declare and instantiate tst locally in the subroutine if the life of the collection is a single call to the subroutine that is not done a lot (say less than 10,000 times a second).

                                If you need the collection to persist across multiple calls or more particularly in multiple subroutines, you should pass it as a parameter by reference (the default).
                                LarryC
                                Website
                                Sometimes life's a dream, sometimes it's a scream

                                Comment


                                • #17
                                  Larry,

                                  the test program works, and I have adapted your string bag to the logic
                                  of my test program. It is still unfinished and unchecked. But just to show
                                  you the context in which I'm using it, I post here my test program complete.

                                  The process is simple:

                                  1) choose a text file
                                  2) choose a method to extract and count single words, and it displays the
                                  alphabetical dictionary, and the occurrences for each word found.
                                  3) for the time being you have two buttons [Console style], the first is
                                  "Use Array" and it does all the job using arrays, and the second is "Use IntStr"
                                  that uses your Strings Bag.
                                  4) probably using a Method SORT on the object to sort the Bag, and a Method
                                  COUNTFREQ for counting how many times words appear, it could do the
                                  same job than "Use Array" and I can compare the performance of the two
                                  solution.

                                  - Unzip the attachment into a folder that you create, compile and run the pgm
                                  to see what it is supposed to do.

                                  Salvatore
                                  Attached Files
                                  Last edited by salvatore renda; 15 Feb 2013, 03:38 PM.

                                  Comment


                                  • #18
                                    According to my preliminary tests, the bag of strings performs 2.5:1 better than
                                    simple arrays. I still don't know how to make the program counting for unique
                                    words recurrences, and afterwards how to sort it.
                                    While it is possible to use:
                                    Code:
                                    PRINT tst.Item(x)
                                    It is apparentely not possible to do:
                                    Code:
                                    ARRAY SORT sts.Item()
                                    I'm not used to Objects, CLASSes, Property, Methods, and so on.
                                    Maybe somebody could help me.

                                    Attached the actual development of the program, only main source,
                                    the rest has already been posted in previous post.

                                    Salvatore
                                    Attached Files

                                    Comment


                                    • #19
                                      have you posted the test data?
                                      stanthemanstan~gmail
                                      Dead Theory Walking
                                      Range Trie Tree
                                      HLib ~ Free Data Container Lib ~ Arrays, Lists, Stacks, Queues, Deques, Trees, Hashes

                                      Comment


                                      • #20
                                        Originally posted by Stanley Durham View Post
                                        have you posted the test data?
                                        Any text file, max leght 16MB can be used. I didn't post mine that is about
                                        4.6MB, too big to post.
                                        Beware, the prog has still some bugs. I'll post a correct version as soon as I
                                        discover new bugs, or add some functionality.

                                        Salvatore
                                        Last edited by salvatore renda; 16 Feb 2013, 11:42 AM.

                                        Comment

                                        Working...
                                        X