Index method suggestions?

  • Index method suggestions?

    I'm faced with the problem of a requirement to index approximately
    20 to the seventh power of roughly 60 byte string data fields .. per
    record.

    Yet the average complete record will contain no more than a hundred
    populated fields. The rest of all the million plus whatever possible
    matrix chunklets - will all be totally blank. Yet in any given record,
    there will be the possibility that any mish-mash of field order might
    be expected.

    The string data is going to be totally variable length, but not more
    than 60 bytes per field. It may well also be internally
    delimited within each string for pointering purposes, but that's not
    an issue for the main task of serving up the chunklets.

    The expected number of possible indexes and data to recover might well
    exceed 100,000 different record indexes/data matrixes, per computer server
    site. The file(s?) and index operations will have to be network
    available - perhaps 50 to 100 simultaneous workstations will have to
    be able to use the data at peak load. Average connected load may be
    on the order of one or two dozen boxes, max.

    This isn't a transaction processing operation making direct use of
    the proposed index/data hodge-podge. What I speak to will only be
    expected to furnish pointers to the transaction processing operation
    based on what is in the string data here. Hence the issue of what
    I know as roll-forward and roll-backward from the Btrieve game is
    maybe not as critical here as it is in some other work I've done.

    Any suggestions on how to approach this?

    (In PB 3.5 ... ?)


    ------------------
    Mike Luther
    [email protected]

  • #2
    index approximately 20 to the seventh power of roughly 60 byte string data fields .. per
    record. Yet the average complete record will contain no more than a hundred populated fields..
    A couple of confirmation questions:

    1. It appears you do not need to add/change/delete records in real time; this is a semi-static 'reference' database and a 'batch-type' update will be acceptable. Correct?

    2. Approx how many total logical records, each with perhaps one hundred data entries?

    (I've done something like this with both PB-DOS and MS QuickBASIC, but whether that would work here depends on the answers to the above.)

    MCM

    Michael Mattias
    Tal Systems (retired)
    Port Washington WI USA
    [email protected]
    http://www.talsystems.com



    • #3
      Thanks Michael ..

      1. It appears you do not need to add/change/delete records in real time; this is a semi-static 'reference' database and a
      'batch-type' update will be acceptable. Correct?
      Nope .. records have to be added in real time and .. far less often ..
      could need to be edited - which could increase or decrease the total
      internal field count and the variable-length text which is in them.
      Pall of gloom gathering overhead so noted.

      2. Approx how many total logical records, each with perhaps one hundred data entries?
      Upwards of a hundred thousand or so.

      Thanks for your time and thought.

      ------------------
      Mike Luther
      [email protected]


      • #4
        100,000 records with maybe 100 "things" per record?

        Code:
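        TYPE KeyType
          PrimaryKey AS STRING * 10 ' << placeholder width only: customer? Product? whatever.
          InfoType   AS INTEGER     ' What type of data are in DataString; up to 32767 different possible datatypes
        END TYPE
        TYPE RecordType
          RecKey     AS KeyType     ' the whole index key (named RecKey because KEY is also a BASIC statement name)
          DataString AS STRING * 60 ' fixed sixty-byte data area
        END TYPE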
        Create an index on this datafile using all of keytype. Total records = 100,000 x 100 = 10,000,000. PowerTree(tm) would work. So would one or more of the "BT" variations floating around.

        This wastes unused space in the sixty bytes area. But you only use a record where and when a particular data type exists for a specific primary key.

        You want a design to save the unused portion of the sixty bytes? I do that kind of thing for a living.
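
A minimal sketch of the one-record-per-existing-item layout described above, written in C because a C/C++ port is named as a possible target elsewhere in this thread. The 10-byte key width and the names `Record`, `record_set`, `record_cmp` and `record_find` are assumptions made for illustration; a real index (PowerTree or a B-tree library) would replace the sorted array used here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN  10   /* assumed primary-key width; the post leaves it open */
#define DATA_LEN 60   /* fixed 60-byte data area, as in the post */

/* One record per (primary key, info type) pair that actually exists. */
typedef struct {
    char  primary_key[KEY_LEN];   /* space-padded, not NUL-terminated */
    short info_type;              /* which kind of data data_string holds */
    char  data_string[DATA_LEN];  /* space-padded */
} Record;

/* Fill a record, space-padding both fixed-width text areas. */
void record_set(Record *r, const char *key, short info_type, const char *data)
{
    size_t klen = strlen(key), dlen = strlen(data);
    assert(klen <= KEY_LEN && dlen <= DATA_LEN);
    memset(r->primary_key, ' ', KEY_LEN);
    memcpy(r->primary_key, key, klen);
    r->info_type = info_type;
    memset(r->data_string, ' ', DATA_LEN);
    memcpy(r->data_string, data, dlen);
}

/* Order by primary key first, then by info type: the whole of the key. */
int record_cmp(const void *a, const void *b)
{
    const Record *ra = a, *rb = b;
    int c = memcmp(ra->primary_key, rb->primary_key, KEY_LEN);
    if (c != 0) return c;
    return (ra->info_type > rb->info_type) - (ra->info_type < rb->info_type);
}

/* Exact lookup in an array already sorted with record_cmp; NULL if absent. */
const Record *record_find(const Record *recs, size_t n,
                          const char *key, short info_type)
{
    Record probe;
    record_set(&probe, key, info_type, "");
    return bsearch(&probe, recs, n, sizeof recs[0], record_cmp);
}
```

Sorting the array once with `qsort(recs, n, sizeof recs[0], record_cmp)` stands in for building the index; only items that exist ever take up a record.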



        ------------------
        Michael Mattias
        Tal Systems Inc.
        Racine WI USA
        [email protected]

        www.talsystems.com


        • #5
          No Michael, I don't think we are on the same wavelength yet here..

          This is like the lilt in Mona Lisa's eyebrow, sort of.

          There are 20 to the seventh power possible combinations of up to sixty
          bytes of text possible that COULD be used to make up a picture of Mona
          Lisa. Reduced to the lowest common GROUP of them that might interact
          in how that eyebrow was raised in the final picture, any given number
          of them COULD be used to define that lilt! However, on average, maybe
          sixteen of them ARE critical to what will be displayed in those pixels
          which constitute 'the' eyebrow.

          What is important is that, for any given 'lilt', there might be only
          a specific sixteen, and only a specific PATTERN of sixteen, which are
          actually POSSIBLE to be used in creating the lilt. Each of 'the' - 'given'
          sixteen are at some discrete place in the matrix of all possible text
          lines of data, and each are possible check points on how Mona Lisa might
          be feeling come time to raise that eyebrow!

          Now the STANDARD picture of Mona Lisa has a specific lilt. We all know
          that. Or at least some of us think we do.



          But in thinking about how to examine what it really means, we have to have
          the original matrix of the entire possible sixteen strings of data describing
          how Mona Lisa felt when she raised that eyebrow. We wonder, do I feel like
          this and what are MY chances today to be involved with this as well?



          So we rapidly deploy a composite UNIQUE raster of the sixteen different and
          uniquely positioned text snips in the matrix of them and check off what are
          in line with what Mona Lisa is feeling, compared to what we are feeling!
          That check list will produce a SIMILAR, but not necessarily EXACT profile
          of how close we are to what Mona Lisa is for that given profile. If we
          find we are close enough for government work, we might raise a corresponding
          eyebrow in salute .. and .. well you get the idea.
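
A rough sketch of the "check off and compare" idea, assuming each occupied matrix position has already been reduced to a single long integer code and that both lists are sorted ascending with no duplicates. The scoring rule (percentage of the standard's positions that the subject also occupies) is purely an assumption for illustration; the post does not define one, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count positions present in both lists.
   Both lists must be sorted ascending and free of duplicates. */
size_t profile_matches(const long *standard, size_t ns,
                       const long *subject, size_t nj)
{
    size_t i = 0, j = 0, hits = 0;
    assert(standard != NULL && subject != NULL);
    while (i < ns && j < nj) {
        if (standard[i] < subject[j]) {
            i++;                      /* position only in the standard */
        } else if (standard[i] > subject[j]) {
            j++;                      /* position only in the subject */
        } else {
            hits++;                   /* position occupied in both */
            i++;
            j++;
        }
    }
    return hits;
}

/* Percentage (0..100) of the standard's positions the subject also occupies. */
int profile_score(const long *standard, size_t ns,
                  const long *subject, size_t nj)
{
    if (ns == 0) return 0;
    return (int)(100 * profile_matches(standard, ns, subject, nj) / ns);
}
```

A "close enough for government work" test then becomes a threshold on the score, while an exact match is simply a score of 100 with equal list lengths.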



          Now Mona Lisa didn't have to lilt that eyebrow. It could have been the
          other one. Which might have involved, and been set up by an entirely
          different set of TEXT definable emotions and matrix to produce that. Perhaps
          fourteen completely different descriptions in the raster of emotions will
          be involved for such an action. For which we have to have an encryption
          aware sensitivity of a checklist of those EXACT fourteen DIFFERENT sixty
          byte max text strings, if we expect to succeed in our quest to
          relate well to this issue. Maybe government work has to really be exact here!

          Plus, maybe, two raised eyebrows means, "You'd better NOT approach me!", combined
          with a squinting of the eyes that wasn't in the original composure. And in
          this case, perhaps two hundred different sixty byte maximum strings would
          be involved in explaining why that might be good advice! And recall, each
          of them has to be cascaded in the raster checklist, in precise order and in
          precise sub-order as well.

          The technique has many possible applications. We do it all the time as
          human beings. A picture is worth a thousand words, no? But, just as in
          the famous, "One if by land, and two if by sea, and I on the opposite shore
          will be!", meant "Waiting and ready to spread the alarm." We also get paid
          in life mostly by calling, in humor here, "To Arms! To Arms!, the British
          are coming!" We are rewarded for our words; not our graphic images, at least
          in our lifetimes, somewhat.

          I think I'm right to note that Leonardo da Vinci never got paid much of anything
          in his lifetime for his work. Nor do I, nor I suspect, you, have any inclination
          to, on some Starry, Starry Night, check out over not getting paid.

          But I think it is necessary to be able to tear apart a graphic image and
          reduce it to TEXT, to check off what to do to collect, when the difference
          comes to us in such varied graphical images these days. The difference
          between Mona Lisa and that lilt, or that image of Monroe, or, for example,
          that shot of the Boys bashing that right car fender into the wall of the
          building while hand pushing the Washington evidence car into it, are all
          reduced to WORDS, PHRASES, if you will. They will be the keys to which the value
          of an encounter is scored.

          I want to be able to CODIFY in a TEMPLATE, twenty different possible lines of
          sixty bytes of text in up to the seventh or eighth power of a possible array
          of any or all of them.

          I want to do this in real-time, on line, in a networked situation.

          I want to then enable many users to comment and work with the core data that
          was a part of the whole, out of a possible 100,000 or so possible totally
          different composite different matrices of this. I want each one to be able
          to grade and score a perception of any one of those composites of the whole,
          on line and in real-time. From that I want to be able to store their position
          in relation to the standard, on line and in real-time, and from that proceed
          to get paid for the effort as part of the work that has gone before.

          Your suggestion, as I understand it, provides for a single integer that might
          be used to designate that key match. But it doesn't provide for the manipulation
          of the entire data bank, from which to gain comparisons. Nor does it hold the
          precise matrix position for each of the possible sixty byte records, which in
          theory, could number a million or more, but in practice would never be more than
          perhaps a hundred for any given profile.

          However I freely admit I may not know enough about how to evaluate the expression
          you just suggested, in the full use of what it offers...



          ------------------
          Mike Luther
          [email protected]

          [This message has been edited by Mike Luther (edited December 05, 2002).]


          • #6
            Mike,

            I'm not sure I understand exactly what You want:

            1) You've variable Fields delimited with ?
            Are the strings in quotation marks ?

            2) till max 60 Bytes. What kind of Data ?
            only letters, numbers, all ASCII-signs ?

            3) One record has about 100 Fields and ends with chr$(13,10) ?

            4) You've up to thousand records - do you append and delete records ?

            5) Do You need this structure of Your file for other programs or
            is it only a ".log"-file ?

            6) Is there an 'update-field' in the record where You can see
            changes of fields ?

            7) Is the field length ever changing if the field changes ?

            8) I think You look only for one search-string at once. Right ?

            9) But You've a list of millions of possible search strings to
            look for ?

            Regards

            Matthias Kuhn


            ------------------



            • #7
              No Michael, I don't think we are on the same wavelength yet here..

              That's because you are not looking at it the way I am.

              Instead of 100,000 records, each of which may contain 20^7 data fields, but in practice contains only about 100, I am suggesting you create a separate record for each thing/datatype which exists. When you need 'em all, you read 'em all.
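
"When you need 'em all, you read 'em all" can be sketched in C as a range read over items kept in key order: find the first item for the key, then walk forward while the key still matches. The `Item` layout, the 10-byte key width and the function names are assumptions for illustration; an indexed file would do the same thing with a "get equal" followed by repeated "get next".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KEY_LEN 10  /* assumed key width, hypothetical */

typedef struct {
    char  primary_key[KEY_LEN + 1]; /* NUL-terminated here for brevity */
    short info_type;                /* which thing/datatype this item is */
    char  data[61];                 /* up to 60 bytes of text */
} Item;

/* Index of the first item whose key is >= key (items sorted by key). */
size_t lower_bound(const Item *items, size_t n, const char *key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(items[mid].primary_key, key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Store in *first where the run of items for key starts;
   return how many items belong to that key (0 if none). */
size_t read_all(const Item *items, size_t n, const char *key, size_t *first)
{
    size_t i;
    assert(items != NULL && first != NULL);
    i = lower_bound(items, n, key);
    *first = i;
    while (i < n && strcmp(items[i].primary_key, key) == 0)
        i++;
    return i - *first;
}
```

A logical record with a hundred populated fields costs a hundred items and nothing for the fields that are blank.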

              As far as all these interrelationships (eyeballs and knees, etc).. you did not say that at first. This is a different problem.

              I start the meter when the target starts moving......

              MCM

              Michael Mattias
              Tal Systems (retired)
              Port Washington WI USA
              [email protected]
              http://www.talsystems.com



              • #8
                Matthias ... I'll take these thoughtful questions one at a time, Michael,
                in that the answers to Matthias will likely help, I hope.

                1) You've variable Fields delimited with ?
                Are the strings in quotation marks ?
                My choice at the moment for the delimiter(s). I have older, COBOL-like
                code which unpacks string data delimited with several delimiters of
                'choice', and I'll be the one designing the composite string here. I
                could put the strings in quotation marks at this point. I'm not sure
                yet, but I 'think' that there won't be anything in these strings which
                uses an embedded quotation mark.
                I realize that you may be asking me this for the part about recording the
                strings to disk with embedded comma marks, for example. Let's put that
                sorta in the thinking about this category for the moment.

                2) till max 60 Bytes. What kind of Data ?
                only letters, numbers, all ASCII-signs ?
                I think I can restrict it to only printable ASCII characters, together with
                perhaps an agreed-upon by-design delimiter affair. In that manner I may be
                able to separate the data and put it together with the maximum of 60 bytes
                of what is to be the information. That seems very relevant at the moment.

                3) One record has about 100 Fields and ends with chr$(13,10) ?
                The average 'record' will have about 100 Fields and won't likely end with
                the 'conventional' chr$(13,10) carriage return/line feed. But it could. You are driving
                here, as I understand this, to a single string which could be read in from
                a sequential file ... or ... perhaps, demarked in a composite variable
                length record file which contains long integer pointers which represent the
                start/end parts of each record, and kept in some form of pointer matrix,
                correct?

                The issue here is that there will be a completely different record length
                for each record which will be needed. See below ..
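
The pointer-matrix idea in the answer above can be sketched in C as a pool of record bytes plus a table of long integer start offsets, where record i runs from `start[i]` up to `start[i + 1]`. This is an in-memory sketch only; the sizes and the names `VarStore`, `store_append` and `store_get` are assumptions, and a disk version would keep the pool and the offset table in files and would also need a policy for edited records that grow.

```c
#include <assert.h>
#include <string.h>

#define POOL_SIZE 4096  /* illustrative capacity in bytes */
#define MAX_RECS  64    /* illustrative record limit */

typedef struct {
    char pool[POOL_SIZE];      /* record bytes stored back to back */
    long start[MAX_RECS + 1];  /* start[i]..start[i + 1] bounds record i */
    int  count;                /* number of records stored */
} VarStore;

/* Prepare an empty store. */
void store_init(VarStore *s)
{
    s->count = 0;
    s->start[0] = 0;
}

/* Append one record of len bytes; returns its record number. */
int store_append(VarStore *s, const char *bytes, long len)
{
    long at = s->start[s->count];
    assert(len >= 0 && s->count < MAX_RECS && at + len <= POOL_SIZE);
    memcpy(s->pool + at, bytes, (size_t)len);
    s->start[s->count + 1] = at + len;
    return s->count++;
}

/* Point *bytes at record recno and return its length in bytes. */
long store_get(const VarStore *s, int recno, const char **bytes)
{
    assert(recno >= 0 && recno < s->count);
    *bytes = s->pool + s->start[recno];
    return s->start[recno + 1] - s->start[recno];
}
```

No terminator is needed at the end of a record, because its length falls out of two adjacent offsets.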

                4) You've up to thousand records - do you append and delete records ?
                Err ... a bit more than that. Even the test site for this will go on line
                with upwards of 15,000 records, initially, case closed, flat out. The easy
                reach site will need indexed access to over 100,000 records, all of which
                will have to be the same variable length issue.

                5) Do You need this structure of Your file for other programs or
                is it only a ".log"-file ?
                I actually need two sets of something like this. The first 'set' will be
                perhaps a bit like the old A. L. Milne 'Watchbird' children's comic manners
                trainers that were in, I think, "Child Life" in the 1940's. A record of
                that so-called standard set would be kind of like, "This is the Watchbird
                watching Mona Lisa's lilt in her eyebrow! This is the Watchbird watching
                YOU! Are you an eyebrow lilter?"



                From that, which includes all the perspectives on the reasons why Mona Lisa
                raised that eyebrow in that unique lilt, which the painter so interestingly
                captured, we would get a comparison relevant return from the original profile.

                6) Is there an 'update-field' in the record where You can see
                changes of fields ?
                Well, the Watchbird can change and update the master profile record! But the
                poor kid the Watchbird is watching cannot! All it can do is be the subject
                of the watch process! How well it is an eyebrow lilter, or how poorly it is
                an eyebrow lilter, is all in how the system matches up the profile in the
                matrix!

                7) Is the field length ever changing if the field changes ?
                Unfortunately, sayeth the Watchbird, YES! As time goes by, the Watchbird
                will notice that some children the Watchbird is watching have a common
                characteristic with that lilt, that the Watchbird missed in the original
                profile of a lilter!

                Thus, sadly, the standard matrix will change, along with the field length
                when that happens. Good case in point, but this is not the actual original
                application! For an example, we had a blood alcohol level definition of .1
                for DWI. Many years into the burn, the Texas Legislature changed the legal
                definition for DWI to a .08 level! So, "This is the Watchbird watching a
                drunk child! This is the Watchbird watching YOU! Are you a drunk child?"



                External forces conspire everywhere to do a professional in regardless of
                what is the profession. They are always, governmentally, it seems, changing
                the rules on watching anything and what it means!

                8) I think You look only for one search-string at once. Right ?
                Probably initially. Recall that the Watchbird DEFINES the standard matrix
                for the operation initially. It is a TOP DOWN introspection on what position
                in the MATRIX EACH OCCUPIED FIELD is in. If a field, at any place in
                the matrix of a million plus possible places, is occupied, then that field
                must show up in the profile of an eyebrow lilter!

                But later on...

                I'm not sure how to answer this! What this boils down to is that any of the
                first twenty strings can relate to twenty more sets of twenty more strings!
                And any of those twenty more sets of twenty more strings can relate to twenty
                more sets of twenty more sets of strings .. and on and on into the seventh
                power of this! In practice, perhaps ONE or TWO strings will be in the first
                rank of the first possible set of twenty strings. Each of those COULD relate
                to up to twenty more strings 'under' them down into the next dimension of the
                array matrix. But likely they will not. They may not relate to anything
                further down into the matrix, but they could. One of the two initial
                relationships could have only one child in this electronic genealogy sort of
                thing. The other of the top level two citations could have even twenty, which
                had twenty more, which then stopped there and never were heard from again in the
                matrix! Or they could continue all the way down to the bottom of it!
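
One way to hold a "precise matrix position" for such a top-down chain is to pack the path into a single integer, sketched here in C. Each level contributes a slot number from 1 to 20, and 0 marks "no level here", so the path becomes a seven-digit base-21 number. Since 21^7 - 1 = 1,801,088,540, the code fits in a signed 32-bit long integer. The encoding and the names `path_encode` and `path_decode` are assumptions for illustration, not something proposed in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define FANOUT    20  /* up to twenty strings under any one string */
#define MAX_DEPTH 7   /* seven levels, per the thread */

/* Pack a path of 1..MAX_DEPTH slots (each 1..FANOUT) into one integer.
   Unused trailing levels are encoded as digit 0, so a parent's code is
   always smaller than the codes of everything beneath it. */
int32_t path_encode(const int *slots, int depth)
{
    int32_t code = 0;
    int i;
    assert(depth >= 1 && depth <= MAX_DEPTH);
    for (i = 0; i < MAX_DEPTH; i++) {
        int digit = 0;
        if (i < depth) {
            assert(slots[i] >= 1 && slots[i] <= FANOUT);
            digit = slots[i];
        }
        code = code * (FANOUT + 1) + digit;
    }
    return code;
}

/* Unpack a code into slots[0..MAX_DEPTH-1]; returns the depth of the path. */
int path_decode(int32_t code, int *slots)
{
    int i, depth = 0;
    assert(code >= 0);
    for (i = MAX_DEPTH - 1; i >= 0; i--) {
        slots[i] = (int)(code % (FANOUT + 1));
        code /= (FANOUT + 1);
    }
    while (depth < MAX_DEPTH && slots[depth] != 0)
        depth++;
    return depth;
}
```

Used as the index key next to the profile number, such a code keeps each branch and everything under it in one contiguous key range, so a top-down read of a branch is a single range scan.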

                For the MOMENT, the initial project focuses only on TOP-DOWN relationships and
                the categorization of relative matrix similarity. However, like all good
                research engines, there will come some governmental super Watchbird (Watchvulture?)
                which will demand to be able to know, "Say! We have a hunch that all blonds lilt
                their eyes less than this or that! You must now tell us of all the 100,000 such
                profiles on eye lilting, what the propensity is for blonds to raise an eyebrow
                in such a pose, as contrasted to brunettes!"



                So .. TOP DOWN is not quite the answer, really. Data mining is gonna rear its
                head too.

                9) But You've a list of millions of possible search strings to
                look for ?
                Yep.

                I've said that, in the beginning, the Genesis of this thing, there will be relatively
                few total matrices of what the Watchbird thinks is relevant! I expect, because of
                the real work at creating profiles, that most sites will only spend the time to
                create maybe fifty to a hundred. They will consist of not more than a hundred total
                of these sixty byte strings and related other data. What you have to keep in mind
                is that level one of the matrix must be able to point to elements in level two,
                which point to elements in level three, and thence to level four, and so on down
                through all the possible dimensions.

                That's how we find out about the relating of someone else to Mona's lilt! But at the
                same time then there will be about fifty to a hundred other quirks about that face we
                also want to profile and check on relationships!

                What we are saying is that -- you betcha! A picture *IS* worth a thousand words!
                But so now, that admitted, exactly what thousand words are there which tie the
                suspect to the picture or a specific part or parts of it?

                The background on all this!

                For a long time I've had a paid-for professional version of the unique askSam database
                engine! It's a very interesting product. You can toss in anything you want to into it
                as free form text. Thousands of records!

                Then, at any time from the collage of text, you can go and create your own fields for
                investigation and organize a search based on what you want AFTER the thing is created!

                The askSam is, as far as I know, a fully relational database. In other words, you may
                have the word "the" in it a thousand times. But the word is only there once in the
                actual database. Every instance of the word "the" in it, is a pointer to the stored
                location and your actual output is totally generated by pointers! This product is used
                by criminal investigators, things like that. It was, for example, used in breaking the
                Manuel Noriega case in Panama by the US Government. You crank in all the mail traffic
                into it and compose any rule of field mix you want and VOILA! Out come all the who done
                what to whom and when.
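
The "each word is only there once, and every instance is a pointer" idea can be sketched in C as a small dictionary that hands back the same index for a repeated word. This only illustrates the principle as described in the post; how askSam actually stores its data is not known here, and the capacity, the linear search and the names `Dictionary` and `dict_intern` are assumptions.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 1024  /* illustrative capacity */
#define MAX_LEN   60    /* longest word or string stored */

typedef struct {
    char words[MAX_WORDS][MAX_LEN + 1]; /* each distinct word, stored once */
    int  count;                         /* number of distinct words so far */
} Dictionary;

/* Return the index of word, storing it first if it is new.
   Text is then kept as a list of these indexes instead of the words. */
int dict_intern(Dictionary *d, const char *word)
{
    int i;
    assert(strlen(word) <= MAX_LEN);
    for (i = 0; i < d->count; i++)      /* linear scan keeps the sketch short */
        if (strcmp(d->words[i], word) == 0)
            return i;
    assert(d->count < MAX_WORDS);
    strcpy(d->words[d->count], word);
    return d->count++;
}
```

With many repeated strings the same trick would let a profile store short index numbers rather than full 60-byte strings; a real version would replace the linear scan with a hash table or an index.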

                For example, I happen to have the last ten years total of all of the FidoNet Net 117
                message traffic of the official NET traffic in an askSam database here. I can tell you
                EXACTLY who said what dirty word exactly when and so on for the last ten years and all
                the thread stuff that went with it. It was partly what led to the demise of FidoNet
                locally here. There was simply no place to hide and you couldn't run from political
                mistakes in Fight-O-Net locally, grin.

                Now .. I want a simplified form of this locally, from within the PowerBASIC DOS operation.
                I want it hand coded so that I can either port it to C/C++ if I have to leave PB, or
                extend it into LINUX as needed. There won't be any WIN use of it, in all likelihood, as
                it will either have to go forward in DOS ... in OS/2 ... or to AIX in the long run.

                So there!

                ------------------
                Mike Luther
                [email protected]

                [This message has been edited by Mike Luther (edited December 05, 2002).]


                • #9
                  What this boils down to is that any of the first twenty strings can relate to twenty more sets of twenty more strings!
                  And any of those twenty more sets of twenty more strings can relate to twenty
                  more sets of twenty more sets of strings .. and on and on into the seventh
                  power of this!
                  OK, one more freebie...

                  This is not a relational database: it is hierarchical.

                  Think "tree," managed with a linked list.

                  Don Dickinson's free XML parser ( http://dickinson.basicguru.com ) includes functions which do this (So does my non-free XML parser, but it would hardly be worth it to license that source code so you could modify it when the same principles apply to something free). Or, if you are familiar with the older IBM mainframe products, both DL/1 and IMS DB are fundamentally hierarchical.
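
A minimal C sketch of a tree managed with linked lists, using the common first-child/next-sibling arrangement: each node points to its first child and to the next node under the same parent. This is not taken from either XML parser mentioned above; the `Node` layout and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TEXT_LEN 60  /* up to sixty bytes of text per node */

typedef struct Node {
    char text[TEXT_LEN + 1];
    struct Node *first_child;   /* head of this node's list of children */
    struct Node *next_sibling;  /* next node under the same parent */
} Node;

/* Allocate a node and append it to parent's child list.
   Pass parent == NULL to create a root. */
Node *node_add(Node *parent, const char *text)
{
    Node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    assert(strlen(text) <= TEXT_LEN);
    strcpy(n->text, text);
    if (parent != NULL) {
        Node **link = &parent->first_child;
        while (*link != NULL)           /* walk to the end of the child list */
            link = &(*link)->next_sibling;
        *link = n;
    }
    return n;
}

/* Count a node plus everything below it. */
size_t node_count(const Node *n)
{
    size_t total = 1;
    const Node *c;
    assert(n != NULL);
    for (c = n->first_child; c != NULL; c = c->next_sibling)
        total += node_count(c);
    return total;
}

/* Free a node and everything below it. */
void node_free(Node *n)
{
    Node *c, *next;
    if (n == NULL) return;
    for (c = n->first_child; c != NULL; c = next) {
        next = c->next_sibling;
        node_free(c);
    }
    free(n);
}
```

Nothing is allocated for the empty slots, so a branch with one child costs one node and a branch with twenty costs twenty, however deep the hierarchy could go in theory.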

                  MCM


                  [This message has been edited by Michael Mattias (edited December 06, 2002).]
                  Michael Mattias
                  Tal Systems (retired)
                  Port Washington WI USA
                  [email protected]
                  http://www.talsystems.com
