Index method suggestions?

  • Michael Mattias
    replied
    What this boils down to is that any of the first twenty strings can relate to twenty more sets of twenty more strings!
    And any of those twenty more sets of twenty more strings can relate to twenty
    more sets of twenty more sets of strings .. and on and on into the seventh
    power of this!
    OK, one more freebie...

    This is not a relational database: it is hierarchical.

    Think "tree," mangaged with a linked list.

    Don Dickinson's free XML parser ( http://dickinson.basicguru.com ) includes functions which do this. (So does my non-free XML parser, but it would hardly be worth licensing that source code just so you could modify it, when the same principles apply to something free.) Or, if you are familiar with the older IBM mainframe products, both DL/1 and IMS DB are fundamentally hierarchical.
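
    A bare-bones sketch of the "tree managed with a linked list" idea, with the
    node layout, file name, and record numbers purely illustrative:

    Code:
    ' Each node carries its text plus two record numbers used as linked-list
    ' pointers: one to its first child, one to its next sibling (0 = none).
    TYPE TreeNode
      NodeText    AS STRING * 60   ' the field text, padded to 60 bytes
      FirstChild  AS LONG          ' record number of first child, 0 = leaf
      NextSibling AS LONG          ' record number of next sibling, 0 = last
    END TYPE

    DIM Node AS TreeNode
    OPEN "TREE.DAT" FOR RANDOM AS #1 LEN = LEN(Node)

    ' Build a tiny tree: record 1 is the root, records 2 and 3 its children.
    Node.NodeText = "root":      Node.FirstChild = 2: Node.NextSibling = 0
    PUT #1, 1, Node
    Node.NodeText = "child one": Node.FirstChild = 0: Node.NextSibling = 3
    PUT #1, 2, Node
    Node.NodeText = "child two": Node.FirstChild = 0: Node.NextSibling = 0
    PUT #1, 3, Node

    ' Walk the root's child list by following the sibling links.
    GET #1, 1, Node
    ChildRec& = Node.FirstChild
    DO WHILE ChildRec& <> 0
      GET #1, ChildRec&, Node
      PRINT RTRIM$(Node.NodeText)
      ChildRec& = Node.NextSibling
    LOOP
    CLOSE #1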

    MCM


    [This message has been edited by Michael Mattias (edited December 06, 2002).]



  • Mike Luther
    replied
    Matthias ... I'll take these thoughtful questions one at a time, Michael,
    in the hope that the answers to Matthias will help here as well.

    1) You've variable fields delimited with ... what?
    Are the strings in quotation marks?
    My choice at the moment for the delimiter(s). I have an older, COBOL-like
    way of doing things which unpacks string data delimited with several
    delimiters of 'choice', and I'll be the one designing the composite
    string here. I could put the strings in quotation marks at this point.
    I'm not sure yet, but I 'think' there won't be anything in these strings
    which uses an embedded quotation mark. I realize that you may be asking
    me this because of the part about recording the strings to disk with
    embedded commas, for example. Let's put that in the 'still thinking
    about this' category for the moment.
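
    For the unpacking itself, something along these lines is what I have in
    mind - just a rough sketch, with the delimiter and all names purely
    illustrative:

    Code:
    DECLARE SUB UnpackFields (Composite$, Delim$, Fld$(), Count%)

    DIM F$(1 TO 100)
    CALL UnpackFields("alpha|beta|gamma", "|", F$(), N%)
    FOR I% = 1 TO N%
      PRINT I%; F$(I%)
    NEXT I%
    END

    ' Split Composite$ at each occurrence of the single-character Delim$,
    ' returning the pieces in Fld$() and the field count in Count%.
    SUB UnpackFields (Composite$, Delim$, Fld$(), Count%)
      Count% = 0
      Start% = 1
      DO
        P% = INSTR(Start%, Composite$, Delim$)
        Count% = Count% + 1
        IF P% = 0 THEN
          Fld$(Count%) = MID$(Composite$, Start%)           ' last field
          EXIT DO
        END IF
        Fld$(Count%) = MID$(Composite$, Start%, P% - Start%)
        Start% = P% + 1
      LOOP
    END SUB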

    2) Up to a max of 60 bytes. What kind of data?
    Only letters, numbers, all ASCII characters?
    I think I can restrict it to only printable ASCII characters, together with
    perhaps an agreed-upon, by-design delimiter. In that manner I may be able
    to separate the data and put it together within the maximum of 60 bytes
    of what is to be the information. That seems very relevant at the moment.

    3) One record has about 100 fields and ends with chr$(13,10)?
    The average 'record' will have about 100 fields and likely won't end with
    the 'conventional' chr$(13,10) carriage return/line feed, though it could.
    You are driving here, as I understand it, at a single string which could
    be read in from a sequential file ... or perhaps demarcated in a composite
    variable-length record file which contains long-integer pointers marking
    the start/end of each record, kept in some form of pointer matrix,
    correct?
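
    Something along these lines is what I picture for that second possibility -
    a rough sketch only, with the file and field names purely illustrative
    (GET$/PUT$ being PB's binary string I/O, if memory serves):

    Code:
    ' Variable-length records: the text lives in one BINARY file, and a small
    ' fixed-length index entry per record stores where it starts and how long
    ' it is.
    TYPE IndexEntry
      Offs AS LONG               ' 1-based byte offset of the record in DATA.BIN
      Size AS LONG               ' length of the record in bytes
    END TYPE

    DIM Ix AS IndexEntry
    OPEN "DATA.BIN" FOR BINARY AS #1
    OPEN "DATA.IDX" FOR RANDOM AS #2 LEN = LEN(Ix)

    ' Append one variable-length record and index it.
    Rec$ = "field one|field two|field three"
    Ix.Offs = LOF(1) + 1
    Ix.Size = LEN(Rec$)
    SEEK #1, Ix.Offs
    PUT$ #1, Rec$
    PUT #2, LOF(2) \ LEN(Ix) + 1, Ix

    ' Read record number N& back through the index.
    N& = 1
    GET #2, N&, Ix
    SEEK #1, Ix.Offs
    GET$ #1, Ix.Size, Back$
    PRINT Back$
    CLOSE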

    The issue here is that there will be a completely different record length
    for each record which will be needed. See below ..

    4) You've up to a thousand records - do you append and delete records?
    Err ... a bit more than that. Even the test site for this will go on line
    with upwards of 15,000 records initially, case closed, flat out. The easy
    reach site will need indexed access to over 100,000 records, all of which
    will have the same variable-length issue.

    5) Do you need this structure of your file for other programs, or
    is it only a ".log" file?
    I actually need two sets of something like this. The first 'set' will be
    perhaps a bit like the old A. L. Milne 'Watchbird' children's comic manners
    trainers that were in, I think, "Child Life" in the 1940's. A record of
    that so-called standard set would be kind of like, "This is the Watchbird
    watching Mona Lisa's lilt in her eyebrow! This is the Watchbird watching
    YOU! Are you an eyebrow lilter?"



    From that, which includes all the perspectives on the reasons why Mona Lisa
    raised that eyebrow in that unique lilt, which the painter so interestingly
    captured, we would get a comparison-relevant return against the original
    profile.

    6) Is there an 'update field' in the record where you can see
    changes to fields?
    Well, the Watchbird can change and update the master profile record! But the
    poor kid the Watchbird is watching cannot! All it can do is be the subject
    of the watch process! How well it is an eyebrow lilter, or how poorly it is
    an eyebrow lilter, is all in how the system matches up the profile in the
    matrix!

    7) Does the field length ever change if the field changes?
    Unfortunately, sayeth the Watchbird, YES! As time goes by, the Watchbird
    will notice that some children the Watchbird is watching have a common
    characteristic with that lilt, that the Watchbird missed in the original
    profile of a lilter!

    Thus, sadly, the standard matrix will change, along with the field length,
    when that happens. Good case in point, but this is not the actual original
    application! For an example, we had a blood alcohol level definition of .1
    for DWI. Many years into the burn, the Texas Legislature changed the legal
    definition for DWI to a .08 level! So, "This is the Watchbird watching a
    drunk child! This is the Watchbird watching YOU! Are you a drunk child?"



    External forces conspire everywhere to do a professional in, regardless of
    what the profession is. They are always, governmentally it seems, changing
    the rules on watching anything and what it means!

    8) I think you look for only one search string at a time. Right?
    Probably initially. Recall that the Watchbird DEFINES the standard matrix
    for the operation initially. It is a TOP-DOWN introspection on what position
    in the MATRIX each OCCUPIED FIELD is in. If a field at any place in
    the matrix of a million-plus possible places is occupied, then that field
    must show up in the profile of an eyebrow lilter!

    But later on...

    I'm not sure how to answer this! What this boils down to is that any of the
    first twenty strings can relate to twenty more sets of twenty more strings!
    And any of those twenty more sets of twenty more strings can relate to twenty
    more sets of twenty more sets of strings .. and on and on into the seventh
    power of this! In practice, perhaps ONE or TWO strings will be in the first
    rank of the first possible set of twenty strings. Each of those COULD relate
    to up to twenty more strings 'under' them down into the next dimension of the
    array matrix. But likely they will not. They may not relate to anything
    further down into the matrix, but they could. One of the two initial
    relationships could have only one child in this electronic geneology sort of
    thing. The other of the top level two citations could have even twenty, which
    had twenty more, which then stopped there and never were heard from again in the
    matrix! Or they could continue all the way down to the bottom of it!

    For the MOMENT, the initial project focuses only on TOP-DOWN relationships and
    the categorization of relative matrix similarity. However, like all good
    research engines, there will come some governmental super Watchbird (Watchvulture?)
    which will demand to be able to know, "Say! We have a hunch that all blonds lilt
    their eyes less than this or that! You must now tell us, of all the 100,000 such
    profiles on eye lilting, what the propensity is for blonds to raise an eyebrow
    in such a pose, as contrasted to brunettes!"



    So .. TOP DOWN is not quite the answer, really. Data mining is gonna rear its
    head too.

    9) But you've a list of millions of possible search strings to
    look for?
    Yep.

    I've said that, in the beginning, the Genesis of this thing, there will be relatively
    few total matrices of what the Watchbird thinks is relevant! I expect, because of
    the real work of creating profiles, that most sites will only spend the time to
    create maybe fifty to a hundred. They will consist of not more than a hundred total
    of these sixty-byte strings and other related data. What you have to keep in mind
    is that level one of the matrix must be able to point to elements in level two,
    which point to elements in level three, and thence to level four, and so on down
    through all the possible dimensions.

    That's how we find out how someone else relates to Mona's lilt! But at the
    same time there will be about fifty to a hundred other quirks about that face we
    also want to profile and check for relationships!

    What we are saying is that -- you betcha! A picture *IS* worth a thousand words!
    But now, that admitted, exactly which thousand words are there that tie the
    suspect to the picture, or to a specific part or parts of it?

    The background on all this!

    For a long time I've had a paid-for professional version of the unique askSam database
    engine! It's a very interesting product. You can toss anything you want into it
    as free-form text. Thousands of records!

    Then, at any time from the collage of text, you can go and create your own fields for
    investigation and organize a search based on what you want AFTER the thing is created!

    askSam is, as far as I know, a fully relational database. In other words, you may
    have the word "the" in it a thousand times, but the word is only there once in the
    actual database. Every instance of the word "the" is a pointer to the stored
    location, and your actual output is totally generated by pointers! This product is
    used by criminal investigators and the like. It was, for example, used by the US
    Government in breaking the Manuel Noriega case in Panama. You crank all the mail
    traffic into it, compose any rule of field mix you want, and VOILA! Out comes all
    the who-did-what-to-whom-and-when.
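
    Just to illustrate that store-it-once-and-point idea as I understand it
    (a toy sketch of the concept only - not of how askSam itself is built):

    Code:
    DECLARE FUNCTION WordIndex% (W$, Words$(), NWords%)

    DIM Words$(1 TO 200)          ' each distinct word, stored exactly once
    DIM Occur&(1 TO 200, 1 TO 20) ' pointers: which record numbers use the word
    DIM NOccur%(1 TO 200)
    NWords% = 0

    ' Index the word "the" as appearing in record 17: the text itself is not
    ' stored again, only a pointer to where it was used.
    I% = WordIndex%("the", Words$(), NWords%)
    NOccur%(I%) = NOccur%(I%) + 1
    Occur&(I%, NOccur%(I%)) = 17
    PRINT Words$(I%); " appears in record"; Occur&(I%, 1)
    END

    ' Return the table slot for W$, adding it the first time it is seen.
    FUNCTION WordIndex% (W$, Words$(), NWords%)
      FOR J% = 1 TO NWords%
        IF Words$(J%) = W$ THEN WordIndex% = J%: EXIT FUNCTION
      NEXT J%
      NWords% = NWords% + 1
      Words$(NWords%) = W$
      WordIndex% = NWords%
    END FUNCTION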

    For example, I happen to have the last ten years of all the official FidoNet Net 117
    NET message traffic in an askSam database here. I can tell you
    EXACTLY who said what dirty word exactly when and so on for the last ten years and all
    the thread stuff that went with it. It was partly what led to the demise of FidoNet
    locally here. There was simply no place to hide and you couldn't run from political
    mistakes in Fight-O-Net locally, grin.

    Now .. I want a simplified form of this locally, from within the PowerBASIC DOS operation.
    I want it hand coded so that I can either port it to C/C++ if I have to leave PB, or
    extend it into LINUX as needed. There won't be any WIN use of it, in all likelihood, as
    it will either have to go forward in DOS ... in OS/2 ... or to AIX in the long run.

    So there!

    ------------------
    Mike Luther
    [email protected]

    [This message has been edited by Mike Luther (edited December 05, 2002).]



  • Michael Mattias
    replied
    No Michael, I don't think we are on the same wavelength yet here..

    That's because you are not looking at it the way I am.

    Instead of 100,000 records, each of which may contain 20^7 data fields, but in practice contains only about 100, I am suggesting you create a separate record for each thing/datatype which exists. When you need 'em all, you read 'em all.

    As far as all these interrelationships (eyeballs and knees, etc).. you did not say that at first. This is a different problem.

    I start the meter when the target starts moving......

    MCM



  • Matthias Kuhn
    replied
    Mike,

    I'm confused about what exactly you want:

    1) You've variable fields delimited with ... what?
    Are the strings in quotation marks?

    2) Up to a max of 60 bytes. What kind of data?
    Only letters, numbers, all ASCII characters?

    3) One record has about 100 fields and ends with chr$(13,10)?

    4) You've up to a thousand records - do you append and delete records?

    5) Do you need this structure of your file for other programs, or
    is it only a ".log" file?

    6) Is there an 'update field' in the record where you can see
    changes to fields?

    7) Does the field length ever change if the field changes?

    8) I think you look for only one search string at a time. Right?

    9) But you've a list of millions of possible search strings to
    look for?

    Regards

    Matthias Kuhn


    ------------------



  • Mike Luther
    replied
    No Michael, I don't think we are on the same wavelength yet here..

    This is like the lilt in Mona Lisa's eyebrow, sort of.

    There are 20 to the seventh power possible combinations of up-to-sixty-byte
    strings of text that COULD be used to make up a picture of Mona
    Lisa. Reduced to the lowest common GROUP of them that might interact
    in how that eyebrow was raised in the final picture, any given number
    of them COULD be used to define that lilt! However, on average, maybe
    sixteen of them ARE critical to what will be displayed in those pixels
    which constitute 'the' eyebrow.

    What is important is that, for any given 'lilt', there might be only
    a specific sixteen, and only a specific PATTERN of sixteen, which are
    actually POSSIBLE to be used in creating the lilt. Each of 'the' given
    sixteen is at some discrete place in the matrix of all possible text
    lines of data, and each is a possible check point on how Mona Lisa might
    be feeling come time to raise that eyebrow!

    Now the STANDARD picture of Mona Lisa has a specific lilt. We all know
    that. Or at least some of us think we do.



    But in thinking about how to examine what it really means, we have to have
    the original matrix of the entire possible sixteen strings of data describing
    how Mona Lisa felt when she raised that eyebrow. We wonder, do I feel like
    this and what are MY chances today to be involved with this as well?



    So we rapidly deploy a composite UNIQUE raster of the sixteen different and
    uniquely positioned text snips in the matrix of them and check off which are
    in line with what Mona Lisa is feeling, compared to what we are feeling!
    That checklist will produce a SIMILAR, but not necessarily EXACT, profile
    of how close we are to what Mona Lisa is for that given profile. If we
    find we are close enough for government work, we might raise a corresponding
    eyebrow in salute .. and .. well, you get the idea.



    Now Mona Lisa didn't have to lilt that eyebrow. It could have been the
    other one, which might have involved, and been set up by, an entirely
    different set of TEXT-definable emotions and a different matrix. Perhaps
    fourteen completely different descriptions in the raster of emotions would
    be involved for such an action. For that we have to have an encryption-aware
    sensitivity to a checklist of those EXACT fourteen DIFFERENT sixty-byte-max
    text strings, if we expect to succeed in our quest to relate well to this
    issue. Maybe government work has to really be exact here!

    Plus, maybe, two raised eyebrows means, "You'd better NOT approach me!", combined
    with a squinting of the eyes that wasn't in the original composure. And in
    this case, perhaps two hundred different sixty byte maximum strings would
    be involved in explaining why that might be good advice! And recall, each
    of them has to be cascaded in the raster checklist, in precise order and in
    precise sub-order as well.

    The technique has many possible applications. We do it all the time as
    human beings. A picture is worth a thousand words, no? But, just as the
    famous "One if by land, and two if by sea, and I on the opposite shore
    will be!" meant "waiting and ready to spread the alarm," we also get paid
    in life mostly by calling out, in humor here, "To arms! To arms! The British
    are coming!" We are rewarded for our words, not our graphic images, at least
    in our lifetimes, somewhat.

    I think I'm right to note that Leonardo da Vinci never got paid much of anything
    in his lifetime for his work. Nor do I, nor I suspect you, have any inclination
    to, on some Starry Starry Night, check out over not getting paid.

    But I think it is necessary to be able to tear apart a graphic image and
    reduce it to TEXT, to check off what to do to collect, when the difference
    comes to us in such varied graphical images these days. The difference
    between Mona Lisa and that lilt, or that image of Monroe, or, for example,
    that shot of the Boys bashing that right car fender into the wall of the
    building while hand-pushing the Washington evidence car into it, are all
    reduced to WORDS, PHRASES, if you will. They will be the keys by which the
    value of an encounter is scored.

    I want to be able to CODIFY in a TEMPLATE, twenty different possible lines of
    sixty bytes of text in up to the seventh or eighth power of a possible array
    of any or all of them.

    I want to do this in real-time, on line, in a networked situation.

    I want to then enable many users to comment and work with the core data that
    was a part of the whole, out of a possible 100,000 or so totally different
    composite matrices of this. I want each one to be able to grade and score a
    perception of any one of those composites of the whole, on line and in
    real-time. From that I want to be able to store their position in relation
    to the standard, on line and in real-time, and from that proceed to get paid
    for the effort as part of the work that has gone before.

    Your suggestion, as I understand it, provides for a single integer that might
    be used to designate that key match. But it doesn't provide for the manipulation
    of the entire data bank, from which to gain comparisons. Nor does it hold the
    precise matrix position for each of the possible sixty byte records, which in
    theory, could number a million or more, but in practice would never be more than
    perhaps a hundred for any given profile.

    However I freely admit I may not know enough about how to evaluate the expression
    you just suggested, in the full use of what it offers...



    ------------------
    Mike Luther
    [email protected]

    [This message has been edited by Mike Luther (edited December 05, 2002).]



  • Michael Mattias
    replied
    100,000 records with maybe 100 "things" per record?

    Code:
    TYPE Keytype
      PrimaryKey AS "whatever" ' << customer? Product? whatever.
      Infotype   AS INTEGER    ' What type of data is in Datastring; up to 32767 different possible datatypes
    END TYPE
    TYPE RecordType
      Key        AS Keytype
      dataString AS STRING * 60
    END TYPE

    Create an index on this datafile using all of Keytype. Total records = 100,000 x 100 = 10,000,000. PowerTree(tm) would work. So would one or more of the "BT" variations floating around.

    This wastes the unused space in the sixty-byte area. But you only use a record where and when a particular data type exists for a specific primary key.

    You want a design to save the unused portion of the sixty bytes? I do that kind of thing for a living.
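
    Purely as an illustration of that layout (my choice of a fixed 16-byte
    string for the primary key is an assumption, not part of the suggestion),
    one concrete way it might look:

    Code:
    TYPE KeyType
      PrimaryKey AS STRING * 16   ' customer, product, profile id ... whatever
      InfoType   AS INTEGER       ' which of up to 32767 data types this is
    END TYPE

    TYPE RecordType
      RecKey     AS KeyType
      DataString AS STRING * 60
    END TYPE

    DIM Rec AS RecordType
    OPEN "PROFILE.DAT" FOR RANDOM AS #1 LEN = LEN(Rec)

    ' Store one populated field: data type 42 for primary key "MONALISA".
    Rec.RecKey.PrimaryKey = "MONALISA"
    Rec.RecKey.InfoType   = 42
    Rec.DataString        = "left eyebrow lilt, degree three"
    PUT #1, LOF(1) \ LEN(Rec) + 1, Rec
    CLOSE #1

    ' The external index (PowerTree, a "BT" library, etc.) would then be built
    ' over the full 18-byte key, so every field for one primary key reads back
    ' together, in InfoType order.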



    ------------------
    Michael Mattias
    Tal Systems Inc.
    Racine WI USA
    [email protected]

    www.talsystems.com



  • Mike Luther
    replied
    Thanks Michael ..

    1. It appears you do not need to add/change/delete records in real time; this is a semi-static 'reference' database and a
    'batch-type' update will be acceptable. Correct?
    Nope .. records have to be added in real time and, in far less frequent
    circumstances, could be required to be edited - which could increase or
    decrease the total internal field count, the variable-length text which
    is in them. Pallor of gloom gathering overhead so noted.

    2. Approx how many total logical records, each with perhaps one hundred data entries?
    Upwards of a hundred thousand or so.

    Thanks for your time and thought.

    ------------------
    Mike Luther
    [email protected]



  • Michael Mattias
    replied
    index approximately 20 to the seventh power of roughly 60 byte string data fields .. per
    record. Yet the average complete record will contain no more than a hundred populated fields..
    A couple of confirmation questions:

    1. It appears you do not need to add/change/delete records in real time; this is a semi-static 'reference' database and a 'batch-type' update will be acceptable. Correct?

    2. Approx how many total logical records, each with perhaps one hundred data entries?

    (I've done something like this with both PB-DOS and MS QuickBASIC, but whether that would work here depends on the answers above.)

    MCM



  • Mike Luther
    started a topic Index method suggestions?


    I'm faced with the problem of a requirement to index approximately
    20 to the seventh power of roughly 60 byte string data fields .. per
    record.

    Yet the average complete record will contain no more than a hundred
    populated fields. The rest of all the million-plus possible matrix
    chunklets will be totally blank. Yet in any given record, there is the
    possibility that any mish-mash of field order might be expected.

    The string data is going to be totally variable length, but not more
    than 60 bytes per field. It may also be internally delimited within
    each string for pointering purposes, but that's not an issue for the
    main task of serving up the chunklets.
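
    To make the shape of a single populated chunklet concrete, something like
    this is what I picture (field names, sizes, and the file name are only
    illustrative):

    Code:
    ' One populated chunklet: its coordinates in the 20^7 matrix plus the text
    ' itself.  Only populated chunklets ever get written out; the blank ones
    ' simply never exist on disk.
    TYPE Chunklet
      Lvl1 AS INTEGER            ' position 1..20 at level 1
      Lvl2 AS INTEGER            ' position 1..20 at level 2, 0 = unused
      Lvl3 AS INTEGER
      Lvl4 AS INTEGER
      Lvl5 AS INTEGER
      Lvl6 AS INTEGER
      Lvl7 AS INTEGER
      FieldText AS STRING * 60   ' the variable text, padded to 60 bytes
    END TYPE

    DIM C AS Chunklet
    OPEN "CHUNKS.DAT" FOR RANDOM AS #1 LEN = LEN(C)

    ' Store the field that lives at matrix position (3,17,0,0,0,0,0).
    C.Lvl1 = 3: C.Lvl2 = 17
    C.FieldText = "some sixty-byte-max field text"
    PUT #1, LOF(1) \ LEN(C) + 1, C
    CLOSE #1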

    The expected number of possible indexes and data to recover might well
    exceed 100,000 different record indexes/data matrixes, per computer server
    site. The file(s?) and index operations will have to be network
    available - perhaps 50 to 100 simultaneous workstations will have to
    be able to use the data at peak load. Average connected load may be
    on the order of one or two dozen boxes, max.

    This isn't a transaction processing operation making direct use of
    the proposed index/data hodge-podge. What I speak to will only be
    expected to furnish pointers to the transaction processing operation
    based on what is in the string data here. Hence the issue of what
    I know as roll-forward and roll-backward from the Btrieve game is
    maybe not as critical here as it is in some other work I've done.

    Any suggestions on how to approach this?

    In PB 3.5 ... ?


    ------------------
    Mike Luther
    [email protected]