Announcement

Collapse

Forum Guidelines

This forum is for finished source code that is working properly. If you have questions about this or any other source code, please post it in one of the Discussion Forums, not here.
See more
See less

Trie Tree: find 1,000,000 strings in 0.3 seconds

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trie Tree: find 1,000,000 strings in 0.3 seconds

    “trie” comes from the word “reTRIEve” : google "trie"

    comments

    this is a string tree structure: fast as hash!

    find 10,000 random strings, 5 to 20 Bytes wide; 0.00000001339 seconds

    fastest structure for "prefix" lookup (IntelliSense)

    (compiled test app in both zips)

    Compared to Hash:
    - you can traverse the tree in Key order.
    - don’t have to worry about optimal slot count.
    - - a hash has to do a complete rebuild if its count goes way over/under its capacity.
    - can do “prefix” find

    Compared to AVL or Red Black binary tree:
    - lot faster.
    - always perfectly balanced.
    - never needs to rotate.

    This is a breakthrough implementation of a Trie; “Range” Trie

    This post serves as prior-art for the concept of an adjustable child "range" for Trie/Radix nodes or containers
    May not be patented without my permission. (there are a lot of patents on Trie schemes)

    In a Trie, every character in the Key is stored in a node.
    Keys lay on to of each other; “cat” is in “cats”
    The tree branches at un-shared characters.

    Code:
        '   Keys = (cat, cats, catastrophe, catastrophes, cattle, cattleman, cattlemen, catch, catches, caught)
        '
        '   c
        '   a
        '   t*--------------u
        '   a--c--s*--t     g
        '   s  h*     l     h
        '   t  e      e*    t*
        '   r  s*     m
        '   o         a--e
        '   p         n* n*
        '   h
        '   e*
        '   s*
    zips: Class PB 5/9 - DLL PB 4/5 ok

    Code:
     
     
    'class = PB (5/9) : DLL = PB(4/8) or later
        '-------------------------------------------------------------------
        'STRING/STRING ~ Key/Value ~ Range Trie Tree   "TRIE" = (reTRIEve)
        '-------------------------------------------------------------------
        '   google: "trie", "trie tree", "trie compression", "radix tree"
        '   currently fastest sort algorithm is Trie based
        '
        '   This binary string tree structure is bad-to-the-bone!!!
        '
        '       perfectly balanced binary string tree - !!! fast as string hash !!!
        '           traverse tree in key order (also prefix order)
        '           blazing fast set, get and remove
        '           no limit on Key or Value length
        '-------------------------------------------------------------------
        '   find 1,000,000 strings in 0.3 seconds : "1" to "1000000"        (slow neetbook)
        '   find 10,000 random strings, 5 to 20 Bytes wide; 0.00000001339 seconds
        '   add  1,000,000 strings in 2.9 seconds
        '                  (pre-built strings so that Format$() not in loop)
        '                  (compiled TrieTest.exe in zips)
        '-------------------------------------------------------------------
        '       USE: IntelliSense like lookup, prefix lookup, dictionary, balanced binary string tree,
        '               string hash, Key/Value store, associative string array
        '               replacement for AVL or Red Black string tree (faster)
        '-------------------------------------------------------------------
        '   Change: 2009-05-12: Keys stored LCase
        '-------------------------------------------------------------------
        'Breakthrough Trie Implementation!!! - "Range" Trie
        '   each Node has a dynamic child node array that adjust to lowest/highest char, Node is responsible for: "Range"
        '       the range adjust as Keys are added/deleted
        '       if a node has 'a' and 'z' in it
        '           node child array will be 26 slots wide : node.lowChar = 97, node.highChar = 122
        '               first and last elements in array will be a child node; rest will be NULL
        '           if Keys using 'z' are deletedd;
        '               node array will collapse to one element : node.lowChar = 97 node.highChar = 97
        '               (in most cases, a node will be only one char wide, a few positions from root)
        '   Basically; a "Range" Trie is:
        '       - multi-dimension arrays of multi-dimension arrays
        '       - able to hold any ASCIIZ char (NULL not allowed in this implementation)
        '       - able to hold a string of any length
        '       - the Key's characters are the index through the arrays (BYTE offset from node.lowChar)
        '       BUT:
        '       - we only build the path if we need to go there
        '       - destroy the path when no longer needed
        'NOTE: this post serves as prior-art for the concept of an adjustable child "range" for Trie/Radix nodes or containers
        '   May not be patented without my permission.
        '   (if you google around on "trie"; there are a lot of resent patents on trie based algorithms)
        '-------------------------------------------------------------------
        'SPEED:
        '   Hash: has to make at least two passes through each char in a Key
        '       - build hash to get slot position
        '       - test for collision
        '   Range Trie: makes only one pass through chars in Key
        '               (it will also abort early; as soon as a missing char encountered)
        '       - as it moves through each char; it's jumping down the tree
        '           - each char in Key serves as a childIndex to next char in Key
        '                   offset from node.lowChar
        '               if that childIndex position = NULL then Key not in tree
        '               if char not within node's range then Key not in tree
        '               if child node not NULL : jump to child node and next Key char
        '       - Value stored one node below last char in Key
        '               (which may be part of other Keys)
        '                   the Value for "cat" would be stored in the "s" node of "cats"
        '-------------------------------------------------------------------
        'Other Trie Implementations:
        '   - BYTE wide nodes
        '       - can explode into massive tree
        '   - only 'A' ~ 'Z' or other limitations
        '       - limited use
        '       - can still explode into really big tree
        '           (in most cases, the "Range" Trie will be only one char wide)
        '   - PATRICIA Trie or Compact Trie
        '       - merges un-shared chars into string containers
        '           (Range Trie could be implemented as a PATRICIA Trie)
        '   - List; using sequential search of chars in node
        '       - slow
        '   - there are a lot of research papers on various schemes to limit the size of a Trie
        '       - very complex stuff
        '       - google "judy array", "burst sort"
        '
        '-------------------------------------------------------------------
        '
        '   Keys must be unique (can't contain NULL)
        '   Value (payload) stored as ASCIIZ string (can't contain NULL)
        '       no limit on Key or Value length (shorter Key = faster search)
        '
        '   Keys compare = case ignored
        '
        '   !!! fast as STRING hash !!!
        '   Keys always in alphabetical order (traverse tree in order)
        '   fastest structure for Prefix search
        '   always a perfectly balanced binary tree
        '   unlike hash: never needs to rebuild
        '   unlike balanced binary tree: never needs to rebalance, or rotate
        '
        'CON: Key always stored LCase: Key losses its case
        '-------------------------------------------------------------------
        '
        '   Keys lay on top of each other until a difference is encountered
        '   Keys = (cat, cats, catastrophe, catastrophes, cattle, cattleman, cattlemen, catch, catches, caught)
        '
        '   c
        '   a
        '   t*--------------u
        '   a--c--s*--t     g
        '   s  h*     l     h
        '   t  e      e*    t*
        '   r  s*     m
        '   o         a--e
        '   p         n* n*
        '   h
        '   e*
        '   s*
        '
        '-------------------------------------------------------------------
        'Prefix:
        '   this is where a Trie really shines
        '   once you locate the prefix in the tree;
        '       every key word below the prefix - belongs to the prefix
        '           no further test are necessary
        '-------------------------------------------------------------------
        '    Permission :
        '    Use at your own risk.
        '    Free to use in any application.
        '    My not be sold as a stand-alone Trie implementation.
        '    If you use in a commercial application; should consider throwing the author a bone.
        '        StanTheManStan ~ gmail
        '-------------------------------------------------------------------
        'Note: this implementation isn't a "PATRICIA" Trie (in the works)
        '-------------------------------------------------------------------
        '
        'use:
        'put "TrieC_DLL-5.inc" in same folder as source code
        '#Include Once "TrieC_DLL-5.inc"
        '
        'put "TrieC_DLL-5.DLL" in same folder as source code
        '
        'LOCAL tree as TrieI
        'tree = NewCom ClsId $TrieC_GUID Lib "TrieC_DLL-5.DLL"
        '   start using
        '-------------------------------------------------------------------
    $TrieC_GUID = Guid$("{1709F7B8-2DA8-408F-9539-60E0A17E6544}")
    $TrieI_GUID = Guid$("{E97DA299-A25D-4D6A-B163-8A8C8C6AE2EF}")
     
    Interface TrieI $TrieI_GUID
        Inherit IUnknown
        Method Clear()
            'delete all data
            '   call Clear() before obj = NOTHING
            '   internal resources will immediately be freed
            '       instead of waiting till obj destroyed
     
        Method Count() As Long
            'get Key/Value count
     
        Method Set(In ByRef key As String, In ByRef value As String, ByVal replaceValue As Byte)
            'add Key/Value to tree
            '   if Key exist and replaceValue = True then old Value replaced with new Value
     
        Method Get(In ByRef key As String) As String
            'get Key's associated Value
     
        Method Contains(In ByRef key As String) As Long
            'non-null if Key in tree
            '   False if Key not in tree
     
        Method Remove(In ByRef key As String)
            'remove Key/Value from tree
     
        Method KeyGoFirst() As Long
            'move to first Key in tree - True/False success
     
        Method KeyGoNext() As Long
            'move to next Key in tree - True/False success
     
        Method GoGetKey() As String
            'get Key at current position
     
        Method GoGetValue() As String
            'get Key's associated Value at current position
     
        Method GoSetValue(In ByRef value As String)
            'change Key's associated Value at current position
     
        Method PrefixGoFirst(In ByRef prefix As String) As Long
            'move to first Key in tree that matches "prefix"
            '   True/False
     
        Method PrefixGoNext() As Long
            'move to next Key in tree that matches "prefix"
            '   "prefix" set by PrefixGoFirst()
            '       (limits cursor movement to nodes below "prefix")
            '   True/False
     
        Method Store() As String
            'store tree in string (persist)
     
        Method StoreValidate(In ByRef storedTree As String) As Long
            'validate storage string
            '   string has front and back DWord validation hash cookies
     
        Method Restore(In ByRef storedTree As String)
            'restore tree from string saved with Trie_Store()
            '   tree's current contents cleared
    End Interface
    Attached Files
    Last edited by Stanley Durham; 14 May 2009, 01:01 PM.
    stanthemanstan~gmail
    Dead Theory Walking
    Range Trie Tree
    HLib ~ Free Data Container Lib ~ Arrays, Lists, Stacks, Queues, Deques, Trees, Hashes
Working...
X