uCalc Transform tip: Extract literal strings from your PB code

  • uCalc Transform tip: Extract literal strings from your PB code

    uCalc Transform is capable of easily parsing PowerBASIC code in all kinds of interesting ways.

    Part 1

    Let's explore how to extract a list of literal strings (text within double quotes) found in your source code, with just one line of uCalc Transform code.

    Problem:

    You would like to spellcheck all the literal strings in your PB source code to uncover typos that might be hiding in your message boxes, splash screens, string equates, etc. You don't want to run the entire code through a spellchecker since it would underline just about everything. So, you'd like to extract a list of only text that is in between double quotes.

    Challenge:

    Before looking at the uCalc Transform solution below, please take a moment to think about how you'd undertake this task. If you have time, create a program with PowerBASIC (or any tool) to do it, and test it on your own source code. How many lines of code did it take? How much time did it take you to write it? Did it do everything you wanted?

    uCalc Transform solution: (Part 1)

    Run uCalc Transform (download from here). Load a PB source code file into the editor. Now enter the pattern {"\q[^\q]*\q"} in the Find box, and {Self} in the Replace with box.



    Click on the Filter button, and you'll get a list of quotes extracted from your PB source code. That's it!



    Explanation:

    Some of you may immediately recognize \q[^\q]*\q as a Regular Expression (or RegEx) pattern. If not, don't worry; you can Google RegEx for more details on how that works. Basically, that pattern searches for a quote, \q, followed by 0 or more characters that are not quotes, [^\q]*, followed by another quote, \q.
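
For readers who want to try the same idea outside uCalc, here is a minimal Python sketch. It assumes that uCalc's \q simply stands for the double-quote character, so the pattern becomes "[^"]*" in ordinary regex syntax. The function name and the sample text are made up for illustration.

```python
import re

# Assumed plain-regex equivalent of the uCalc pattern \q[^\q]*\q
QUOTED = re.compile(r'"[^"]*"')

def extract_literals(source):
    """Return every double-quoted literal found in source, quotes included."""
    return QUOTED.findall(source)

sample = 'MsgBox "Helo world", , "Title"\nx$ = ""\n'
for literal in extract_literals(sample):
    print(literal)
```

Like the one-line pattern, this sketch knows nothing about PB's doubled quotes inside a literal (such as "He said ""hi"""), so a literal of that kind would come back in several pieces.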

    Although uCalc supports regular expression patterns, you can do this example using a plain uCalc pattern without any regular expression, like this: {q}[{text}]{q}



    uCalc patterns are often easier to deal with than RegEx patterns, and are very flexible. The two can be mixed together. Our RegEx pattern was actually part of a uCalc pattern, and it was denoted by being placed within quotes surrounded by curly braces.

    Anyway, the second pattern searches for a quote character, {q}, optionally followed by any tokens, {text}, until it reaches another quote, {q}. The square brackets around {text} indicate that the presence of actual text is optional, so empty literals will match as well. {q} is like a uCalc keyword, which denotes the quote character, whereas {text} is a pattern variable which could have been given any name (like {txt}, {MyLiteral}, or {QuotedText}, etc.).

    uCalc gives quotes special properties which can make them difficult to work with when strictly using uCalc patterns. You'd have to do things like turn off the Quote sensitive property to use {q}[{text}]{q} as a pattern, etc. So for this solution, we'll stick with the RegEx pattern.

    {Self} is a uCalc keyword used in the Replace with box, which represents the text that matches the entire pattern from the Find box. It replaces the pattern with itself. Leaving the box empty would have replaced the match with nothing, and your result would be a list of empty space.

    We could have given a name such as Quote to the RegEx pattern, putting {Quote:"\q[^\q]*\q"} in the Find box and {Quote} in the Replace with box.
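
As a rough analogue in Python (not uCalc syntax), a regex match can be referred to either as a whole or by a name. In this sketch, group(0) plays a role similar to {Self}, and the named group Quote is similar to {Quote}. The function name is hypothetical.

```python
import re

# (?P<Quote>...) is Python's named-group syntax; "Quote" mirrors the name used above.
PATTERN = re.compile(r'(?P<Quote>"[^"]*")')

def list_quotes(source):
    """Return the quoted literals in source, one per line."""
    lines = []
    for match in PATTERN.finditer(source):
        # Here the named group spans the whole match, so both spellings agree.
        assert match.group("Quote") == match.group(0)
        lines.append(match.group("Quote"))
    return "\n".join(lines)

print(list_quotes('a$ = "x" + "y"'))
```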



    That's all for Part 1. Comments? Questions? I'd be happy to respond before we move on to the next part.

    Next time:

    The above may very well be all you need. However, there's more you can do. Next time, we'll explore one of the following:
    • Skipping string literals within comments
    • Skipping literals that are too short, non-alphanumeric, etc.
    • Skipping duplicates
    • Including literals from "#Include" files
    • Running this in batch mode on many files
    • Adding line numbers
    • Sorting the strings: up/down, by # of chars, # of words
    • Including literals with missing quote
    • Changing all literals to upper/lower/mixed case
    • Dealing with nested quotes
    • Adding an ongoing tally
    • Changing literals to equates
    • Tallying duplicate strings
    Daniel Corbier
    uCalc Fast Math Parser
    uCalc Language Builder

  • #2
    Skipping over comments

    Part 2: Skipping over comments

    In part 1, we found out how to filter out a list of literal text strings in our code. What if you wanted to skip over quotes when they're in a commented part of your code? Simple! Just add another line with a pattern that matches comments, and set its Skip over property to True like this:



    I highlighted in yellow on the right the Skip over property, which was set to True, and circled in red the string literals that were skipped. uCalc Transform highlighted the matches in blue. Those are the ones that would make it on the list if I had clicked on Filter.

    This one uses a Regular Expression to define the pattern:
    Code:
    {Comment:"'.*"}
    which represents the single quote character, ', followed by 0 or more characters up to the end of line, .*
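
The skip-over idea can be sketched with an ordinary regex as well. This is an illustration in Python, not a description of how uCalc implements the Skip over property: one alternation matches either a comment or a quoted literal, and only the literals are kept. It only knows about ' comments (not REM), and all names are hypothetical.

```python
import re

COMMENT = r"'.*"        # apostrophe up to the end of the line
QUOTE = r'"[^"]*"'      # double-quoted literal

# Scanning left to right, whichever alternative starts first wins, so an
# apostrophe inside a literal does not start a comment, and a quoted
# string inside a comment is swallowed by the comment match.
TOKEN = re.compile(f"(?P<Comment>{COMMENT})|(?P<Quote>{QUOTE})")

def literals_outside_comments(source):
    """Return the quoted literals in source that are not inside ' comments."""
    return [m.group("Quote") for m in TOKEN.finditer(source)
            if m.group("Quote") is not None]

sample = (
    'Print "Don\'t panic" \' shows "nothing" here\n'
    '\' a "commented" literal\n'
    'x$ = "kept"\n'
)
print(literals_outside_comments(sample))
```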
    Daniel Corbier


    • #3
      Adding line numbers

      Part 3: Adding line numbers

      What if you need to know which line number each string literal appears on? Easy. Simply create a variable that keeps count of each NewLine occurrence, and insert the current count in front of each string occurrence. Remember to click Filter instead of Transform for this one.




      Explanation:

      This line defines a variable named LineNumber, and initializes it with a starting value of 1:

      Code:
      {@Var: LineNumber=1}
      Note: {@Var: . . . } is a shortcut for {@Define: Var: . . .}


      The next one simply increments the line count for each {nl} (new line character) occurrence:

      Code:
      {@Exec: LineNumber++}
      Note: {@Exec: . . .} is like {@Eval: . . .} except the return value (if there's one) is ignored.


      The line number, {@Eval: LineNumber}, is inserted in front of the literal string occurrence represented by {Self}:

      Code:
      Line #{@Eval: LineNumber}:  {Self}{nl}
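
The same counting approach, sketched in Python for comparison. The counter starts at 1, goes up by one for each newline encountered, and is printed in front of each literal. The regex and the function name are assumptions made for this illustration.

```python
import re

# A newline or a quoted literal, whichever comes first in the scan.
TOKEN = re.compile(r'(?P<NewLine>\n)|(?P<Quote>"[^"]*")')

def literals_with_line_numbers(source):
    """Return 'Line #N:  literal' for each quoted literal in source."""
    line_number = 1                       # like {@Var: LineNumber=1}
    out = []
    for match in TOKEN.finditer(source):
        if match.group("NewLine"):
            line_number += 1              # like {@Exec: LineNumber++}
        else:
            out.append(f"Line #{line_number}:  {match.group(0)}")
    return out

sample = 'a$ = "one"\n\nMsgBox "two", , "three"\n'
print("\n".join(literals_with_line_numbers(sample)))
```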
      Daniel Corbier


      • #4
        Adding ongoing tally

        Part 4: Adding ongoing tally

        What if you wanted to keep count of each literal string occurrence? Here's how:




        Explanation:

        This example is similar to the previous one, except we keep count of literal strings instead of NewLine characters.
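
A small Python sketch of a running tally, for comparison. The "#N:" output format is made up here, since the format used in the uCalc example is not given in the text.

```python
import re

QUOTED = re.compile(r'"[^"]*"')

def numbered_literals(source):
    """Prefix each quoted literal in source with its running count."""
    return [f"#{count}:  {match.group(0)}"
            for count, match in enumerate(QUOTED.finditer(source), start=1)]

print("\n".join(numbered_literals('x$ = "a" & "b"\ny$ = "c"')))
```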
        Daniel Corbier


        • #5
          Daniel,
          This is good stuff.

          Somewhere around here I've posted an example of a function to extract literal strings from PowerBASIC code. It may have been part of a spell check app that I wrote.

          I'll find the code and distill it down to just the extraction code so that we can see the differences between extraction with uCalc and extraction with home-grown code.



          • #6
            And, Daniel,

            Just so I'm clear - your screen shots are PowerBASIC apps which are using the uCalc DLL. Yes?

            So the extraction you're discussing could be done within our PowerBASIC apps, bypassing as much as we want of the GUI you've shown. Yes?



            • #7
              You can call uCalc DLL from PB

              Good question, Gary. The screenshots are of an app (uCalc Transform) created with VB.NET that calls the uCalc DLL. You can create an app with an interface of your own design with PowerBASIC and call the DLL as well.

              People had difficulty relating to uCalc patterns, the uCalc Language Builder concept, etc. So as a practical example, I decided to create an app that calls the DLL, but with an interactive GUI. This example ended up as uCalc Transform, which is now more than just a demo. Prior to morphing into uCalc Transform it was uCalc Search.

              The idea was (and still is) for people creating text editors, IDEs, and the like, to offer uCalc patterns as an option for Find, and Find/Replace operations. Text editors often add RegEx as an option for Search and Search/Replace. Now uCalc patterns can be an additional option for users. (And since uCalc patterns support RegEx, you actually wouldn't need RegEx as a separate option in your apps).
              Last edited by Daniel Corbier; 13 Nov 2013, 12:15 PM. Reason: corrected typo and added link
              Daniel Corbier


              • #8
                Bypassing uCalc Transform GUI

                Part 5: Same example using uCalc String Library

                What if you wanted to create your own app in PowerBASIC to do this? Perhaps you want to handle the interface yourself. Do you necessarily have to run uCalc Transform to do it? No. You can do this same quoted string extraction by directly calling uCalc String Library routines in the uCalc DLL, like in this PB/CC example:

                Code:
                #Include "uCalcPB.Bas"
                
                Function PBMain () As Long
                   InputCode$ = ucFile("Win32API.inc") ' ucFile() returns contents of file in 1 step
                   Outputcode$ = ucRetain(InputCode$, "{Quote:'\q[^\q]*\q'}", "{Quote}{#13#10}", ucSkip("{Comment:""'.*""}"))
                
                   Open "Quotes.out" For Output As #1
                   Print #1, Outputcode$
                   Close #1 ' close the output file explicitly
                End Function
                The essential part above was broken down into two lines to accommodate a comment, but it could have been done in one line without necessarily being too crammed:
                Code:
                Outputcode$ = ucRetain(ucFile("Win32API.inc"), "{Quote:'\q[^\q]*\q'}", "{Quote}{#13#10}", ucSkip("{Comment:""'.*""}"))
                Explanation:

                ucFile() opens a file and returns its contents as a string, which saves you a number of extra steps. This handy function can also open multiple files, concatenating them all into one string:
                Code:
                ucFile("*.Bas")
                You can also specify paths for it to search in (separate from the Windows command prompt path), for instance:
                Code:
                ucDefine("Path: C:\PB\WINAPI;C:\MyPrograms;C:\uCalc\Demo")
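
For comparison, here is roughly what that kind of helper does, sketched in Python. This is not how ucFile() is implemented; the function name, the search_paths parameter, the sorted file order, and the latin-1 encoding are all assumptions made for the illustration.

```python
import glob
import os

def read_files(pattern, search_paths=(".",)):
    """Concatenate the contents of every file matching pattern in the given directories."""
    chunks = []
    for directory in search_paths:
        # sorted() gives a predictable order; the order ucFile() uses is not stated.
        for path in sorted(glob.glob(os.path.join(directory, pattern))):
            with open(path, encoding="latin-1") as handle:
                chunks.append(handle.read())
    return "".join(chunks)
```

A call such as read_files("*.bas", ["C:/MyPrograms"]) would then return one long string holding every matching file in that folder.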
                ucRetain() is uCalc SL's counterpart of PB's Retain$() function. The difference is that the second arg is a uCalc pattern, instead of a literal string. There's no need for the "ANY" keyword in uCalc SL routines. There are various ways of emulating PB's ANY keyword:

                Code:
                i$ = Retain$(MyString$, ANY "12345") ' PB's Retain$
                i$ = ucRetain(MyString$, "{ 1 | 2 | 3 | 4 | 5 }") ' uCalc pattern
                i$ = ucRetain(MyString$, "{'[1-5]+'}") ' uCalc pattern using RegEx
                However, PB's ANY keyword is limited to just characters. With uCalc's ucRetain(), you can have it retain a list of patterns, not just characters:
                Code:
                i$ = ucRetain(MyString$, "{ This | That | Min({args}) | {'[1-5]+'} }")
                So if MyString$ is the following string, it would retain the parts that match the pattern, namely Min(a, b, c), This, 1, This, and That:

                "Print Min(a, b, c) ' This is test #1. This & That"

                ucRetain(MainString, Pattern, Filter). ucRetain requires two arguments. However, an optional 3rd argument determines which part of the match is returned. This can be any text, including (or excluding) any combination of pattern variables from the second argument. In this particular example, we wanted to return not just the quote; we also appended a {#13#10} (ASCII characters 13 and 10 for carriage return/line feed) so that each quote is featured on a separate line, instead of all being jammed together into one long line.

                ucSkip(). uCalc SL counterparts of PB string routines typically have an optional final argument that lets you configure the operation in many ways. Here we chose to have it skip over comments, with ucSkip("{Comment:""'.*""}"), when finding matches.
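
To make the roles of the arguments concrete, here is a hypothetical Python stand-in with the same overall shape: text, a pattern to keep, an output template, and an optional pattern to skip. It is only a sketch built on Python's re module; the real ucRetain() and ucSkip() take uCalc patterns and may behave differently.

```python
import re

def retain(text, keep, template=r"\g<keep_>", skip=None):
    """Return the parts of text matching keep, formatted with template,
    ignoring anything that falls inside a match of skip."""
    parts = [f"(?P<keep_>{keep})"]
    if skip is not None:
        parts.insert(0, f"(?P<skip_>{skip})")
    scanner = re.compile("|".join(parts))
    return "".join(match.expand(template)
                   for match in scanner.finditer(text)
                   if match.group("keep_") is not None)

source = 'a = "x" \' "no"\nb = "y"'
# Keep quoted literals, end each with CR/LF, and skip ' comments.
print(retain(source, r'"[^"]*"', "\\g<keep_>\r\n", skip=r"'.*"))
```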

                More on uCalc String Library

                uCalc String Library is an API that consists of string handling routines for advanced parsing. While uCalc Transform provides the end user with a visual interface, uCalc String Library has none. These are DLL routines you can call directly from your PB source code in whichever way you'd like. Many uCalc SL routines are based on similarly named PB routines (like InStr, Left$, Len, UCase$, Replace, Tally, Extract$, etc.), but with one twist. The uCalc counterparts of these routines handle uCalc patterns, instead of plain string characters (though you can mimic the latter with the ucChar property).

                Resources:

                uCalc String Library main page
                Documentation for uCalc String Library help
                Documentation for uCalc patterns

                Feedback requested:

                I would like to thank Gary Beene for his feedback so far (as well as someone else who also sent a PM related to this). Today's topic might not have occurred without this feedback. Although the first post in this thread lists other parts of this example that I plan to cover, user feedback can help make these examples even more relevant than what I originally had in mind. So keep the comments coming in.
                Daniel Corbier


                • #9
                  Sorting the quotes

                  Part 6: Sorting the list of quotes

                  What if you want to sort the quoted text alphabetically? Simply change the Sort property to True. If you want to avoid duplicates in the list while you're at it, change the Unique property to True as well. Those properties are found on the right, under the Filter category.
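
In Python terms, sorting with duplicates removed comes down to something like the sketch below. The function name is made up, and Python's default ordering is case-sensitive, which may not match the order uCalc Transform produces.

```python
def sorted_unique(quotes):
    """Return the distinct quotes in ascending order."""
    return sorted(set(quotes))

print(sorted_unique(['"beta"', '"alpha"', '"beta"']))
```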

                  This example sorts the quotes in the Win32Api.Inc file (for PB/CC 5):

                  Daniel Corbier


                  • #10
                    Sorting by length

                    Part 7: Sorting by number of characters

                    What if you wanted to sort your quoted strings by length instead of alphabetically? Simply add < Length(x) in the Sort(x, y) formula box. Or change it to > Length(x) to sort it in the other direction. You can in fact use any formula in that box as long as it evaluates to Boolean True or False.

                    Length(x) is a shortcut for Len(x, uc_Char). uCalc String Library (whose functions can be called from uCalc Transform) contains a list of string functions that would be familiar to PB users. They are similar to the PB counterparts but operate on tokens instead of characters, unless you add the optional uc_Char argument.

                    < f(x) is a shortcut. Starting with < or > implies an f(y). The formula in its full form is f(x) < f(y).
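
The idea of a sort driven by a Boolean formula in x and y can be sketched in Python with a comparison function. The helper below is hypothetical and is not how uCalc evaluates its Sort(x, y) box; it only shows that f(x) < f(y) and f(x) > f(y) produce opposite orders.

```python
from functools import cmp_to_key

def sort_by(quotes, comes_first):
    """Sort using a Boolean formula comes_first(x, y), true when x should precede y."""
    def compare(x, y):
        if comes_first(x, y):
            return -1
        if comes_first(y, x):
            return 1
        return 0
    return sorted(quotes, key=cmp_to_key(compare))

quotes = ['"ccc"', '"a"', '"bb"']
shortest_first = sort_by(quotes, lambda x, y: len(x) < len(y))   # like < Length(x)
longest_first = sort_by(quotes, lambda x, y: len(x) > len(y))    # like > Length(x)
print(shortest_first, longest_first)
```

In everyday Python the same result comes from sorted(quotes, key=len), with reverse=True for the other direction.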

                    The following example sorts the Win32API.inc quoted strings by number of characters:

                    Daniel Corbier


                    • #11
                      Tallying occurrence count for each quote

                      Part 8: Tallying occurrence count for each quote

                      Let's say some quotes appear multiple times in your source code. Perhaps instead of simply filtering out a list of unique quotes, you may also want to know exactly how many times each given quote is featured in your code. You can do this using 2 passes. In the first pass, each quote is inserted into a table. If the table already contains a particular quote, then the count for it is simply incremented. When there are multiple passes, filtering takes place in the final pass. In that pass, each quote is preceded by the occurrence number that was stored in the table.
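
The two-pass idea can be sketched in Python with a counting table. This is a simplified stand-in: the second step walks the table rather than rescanning the source, and the names and output format are assumptions.

```python
import re
from collections import Counter

QUOTED = re.compile(r'"[^"]*"')

def tally_literals(source):
    """Return 'count literal' lines, one per distinct literal, highest count first."""
    # Pass 1: build the table of occurrence counts.
    counts = Counter(QUOTED.findall(source))
    # Pass 2: emit each distinct literal preceded by its count.
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return [f"{count} {literal}" for literal, count in ordered]

print("\n".join(tally_literals('a$ = "x" & "y" & "x"')))
```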

                      Here's what this example looks like when filtering Win32API.INC:



                      Note that > Val(x) must be used here instead of > x for the Sort formula (and make sure the Sorting and Unique properties are set to True). Otherwise you end up with a head-scratcher, where the numbers are sorted by string value instead of numeric value, and you'd get something like:

                      Code:
                      5 "test..."
                      44 "test..."
                      3 "test..."
                      25 "test..."
                      1 "test..."
                      when you really want:

                      Code:
                      44 "test..."
                      25 "test..."
                      5 "test..."
                      3 "test..."
                      1 "test..."
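
The same pitfall is easy to reproduce in Python, which may help show why the numeric conversion matters. The helper name is made up; the rows are the ones listed above.

```python
rows = ['5 "test..."', '44 "test..."', '3 "test..."', '25 "test..."', '1 "test..."']

def leading_number(row):
    """Return the count at the start of a row as an integer."""
    return int(row.split(" ", 1)[0])

# Comparing the rows as text puts "5" ahead of "44", because "5" > "4".
as_text = sorted(rows, reverse=True)

# Comparing by numeric value gives the order we actually want.
as_numbers = sorted(rows, key=leading_number, reverse=True)

print(as_text)
print(as_numbers)
```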
                      Daniel Corbier