Decode An XML String


  • Decode An XML String

    I'm reading a very large XML spreadsheet and need to decode the individual data elements.

    I've created this function.

    I'd like everyone's opinion on whether I have everything correct.

    Code:
    FUNCTION Decode_XML_String( BYVAL S AS STRING ) AS STRING
    '
    LOCAL TL1 AS LONG
    LOCAL TL2 AS LONG
    LOCAL Encoded_Char AS STRING
    '
    TL1 = 1
    '
    DO
    '
    TL2 = INSTR( TL1, S, "&#" )
    '
    IF TL2 = 0 THEN
    '
    EXIT DO
    '
    END IF
    '
    Encoded_Char = MID$( S, TL2, INSTR( TL2 + 1, S, ";" ) - TL2 + 1 )
    '
    TL1 = TL2 + LEN( Encoded_Char )
    '
    S = LEFT$( S, TL2 - 1 ) + CHR$( VAL( MID$( Encoded_Char, 3 ))) + MID$( S, TL1)
    '
    LOOP
    '
    REPLACE "&apos;" WITH "'" IN S
    REPLACE "&lt;" WITH "<" IN S
    REPLACE "&gt;" WITH ">" IN S
    REPLACE "&quot;" WITH $DQ IN S
    REPLACE "&amp;" WITH "&" IN S
    '
    FUNCTION = S
    '
    END FUNCTION

  • #2
    TL1 doesn't do anything.
    "Not my circus, not my monkeys."


    • #3
      Ah yes, well spotted, Eric.

      Updated the code


      • #4
        Code:
        FUNCTION Decode_XML_String( BYVAL S AS STRING ) COMMON AS STRING
        '
        LOCAL TL1 AS LONG
        LOCAL TL2 AS LONG
        LOCAL Encoded_Char AS STRING
        '
        TL1 = 1
        '
        DO
        '
        TL2 = INSTR( TL1, S, "&#" )
        '
        IF TL2 = 0 THEN
        '
        EXIT DO
        '
        END IF
        '
        Encoded_Char = MID$( S, TL2, INSTR( TL2 + 1, S, ";" ) - TL2 + 1 )
        '
        S = LEFT$( S, TL2 - 1 ) + CHR$( VAL( MID$( Encoded_Char, 3 ))) + MID$( S, TL2 + LEN( Encoded_Char ))
        '
        TL1 = TL2 + 1
        '
        LOOP
        '
        REPLACE "&apos;" WITH "'" IN S
        REPLACE "&lt;" WITH "<" IN S
        REPLACE "&gt;" WITH ">" IN S
        REPLACE "&quot;" WITH $DQ IN S
        REPLACE "&amp;" WITH "&" IN S
        '
        FUNCTION = S
        '
        END FUNCTION


        • #5
          Hey Steve,

          I was playing around with this a bit, and found an XML file that has interesting character encoding:
          don&amp;#39;t close
          You may not be seeing this in your data, but it could make a difference to others, so I moved one line from the end of the function to immediately before the DO:

          Code:
          ...
          TL1 = 1
          
          ' convert these now, otherwise the DO loop below will miss chars encoded like this:  don&amp;#39;t close
          REPLACE "&amp;" WITH "&" IN S 
          
          DO
          ...
          -John


          • #6
            John, that looks like encoded HTML


            • #7
              Originally posted by Steve Bouffe
              John, that looks like encoded HTML
              It's a "double encoded" HTML entity
              &amp; decodes to "&"
              So it decodes to &#39; which in turn decodes to apostrophe (ASCII 39 aka &H27)
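
              A minimal sketch of those two passes on John's sample string, using the same REPLACE statement as the functions in this thread. The test harness (PBMAIN, the variable name s, the MSGBOX call) is only an illustration and assumes PowerBASIC for Windows:

              Code:
              #COMPILE EXE
              #DIM ALL

              FUNCTION PBMAIN() AS LONG
                 LOCAL s AS STRING
                 s = "don&amp;#39;t close"
                 REPLACE "&amp;" WITH "&" IN s         ' pass 1: s is now  don&#39;t close
                 REPLACE "&#39;" WITH CHR$(39) IN s    ' pass 2: s is now  don't close
                 MSGBOX s
              END FUNCTION

              Decoding &amp; first is a trade-off: it rescues double-encoded input like this, but in strictly single-encoded XML a literal "&amp;lt;" would then come out as "<" instead of "&lt;", which is why &amp; is usually decoded last.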


              • #8
                Steve,

                For testing, I gathered a set of 14 large and small XML files totaling about 30MB.

                Using the loop in the original post, the process took about 14 seconds, and I wondered if it could be made faster.

                Seeing the repeated use of LEFT$ and MID$ to place the decoded character, I came up with the following, which converted the same set of files in about 2 seconds.

                Part of the difference derives from not hitting the string-processing "engine" so hard, and part from the fact that the REPLACE statement handles the change in string length invisibly (as opposed to doing the replacement via the MID$ statement).



                Code:
                'DecodeXML.inc by Steve Bouffe
                'from: https://forum.powerbasic.com/forum/user-to-user-discussions/powerbasic-for-windows/788608-decode-an-xml-string msg#4
                '===========================
                
                FUNCTION Decode_XML_String( BYVAL sWork AS STRING, OPTIONAL lInsertCRLFs AS LONG ) AS STRING
                'Takes a string of XML and "decodes" it.
                '
                ' The optional flag specifies whether tokens <br /> and </br> should be followed by $CRLF.
                ' This option would really only be needed to clarify formatting, in the case where the decoded file
                ' was going to be printed or examined by eye...
                '
                
                   LOCAL TL1 AS LONG
                   LOCAL TL2 AS LONG
                   LOCAL Encoded_Char, Repl_Char AS STRING
                   TL1 = 1
                
                   REPLACE "&amp;" WITH "&" IN sWork ' convert these now...
                   ' ...otherwise, the loop below will miss chars encoded like this: don&amp;#39;t close
                
                   DO
                
                      TL2 = INSTR(sWork, "&#" )
                      IF TL2 = 0 THEN EXIT DO
                      Encoded_Char = MID$( sWork, TL2, INSTR( TL2 + 1, sWork, ";" ) - TL2 + 1 )
                      Repl_Char = CHR$( VAL( MID$( Encoded_Char, 3 )))
                      REPLACE Encoded_Char WITH Repl_Char IN sWork
                
                   LOOP
                
                   REPLACE "&apos;" WITH "'" IN sWork
                   REPLACE "&lt;" WITH "<" IN sWork
                   REPLACE "&gt;" WITH ">" IN sWork
                   REPLACE "&quot;" WITH $DQ IN sWork
                   REPLACE "&nbsp;" WITH " " IN sWork
                
                   IF NOT ISMISSING(lInsertCRLFs) THEN
                      IF lInsertCRLFs = %true THEN
                         REPLACE "<br />" WITH "<br />" & $CRLF IN sWork
                         REPLACE "</br>" WITH "</br>" & $CRLF IN sWork
                      END IF
                   END IF
                
                
                   FUNCTION = sWork
                
                END FUNCTION

                I find the REPLACE statement to be a fast tool when the actual location of the substring doesn't really need to be managed (in effect, we're really only working with the pattern).

                I had fun with this; I hope my mods are useful to you!

                -John
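
                A sketch of a more defensive numeric-reference loop, offered as a variant under stated assumptions rather than a replacement for the code above. It assumes only decimal references in the range 0 to 255 should be decoded into a single-byte string, and it leaves everything else untouched: a "&#" with no closing ";", hexadecimal forms such as &#x27;, or stray text such as "AT&#T". The function and variable names are hypothetical, and it has not been timed against the REPLACE version:

                Code:
                FUNCTION Decode_Numeric_Refs( BYVAL sWork AS STRING ) AS STRING
                   ' Hypothetical helper: decodes &#NNN; (decimal, 0..255) and skips anything else.
                   LOCAL lStart, lAmp, lSemi AS LONG
                   LOCAL sDigits AS STRING
                   lStart = 1
                   DO
                      lAmp = INSTR( lStart, sWork, "&#" )
                      IF lAmp = 0 THEN EXIT DO
                      lSemi = INSTR( lAmp + 2, sWork, ";" )
                      IF lSemi = 0 THEN EXIT DO          ' no terminator left, so nothing more to decode
                      sDigits = MID$( sWork, lAmp + 2, lSemi - lAmp - 2 )
                      lStart = lAmp + 2                  ' default: step past this "&#" unchanged
                      IF LEN( sDigits ) >= 1 AND LEN( sDigits ) <= 3 THEN
                         IF VERIFY( sDigits, "0123456789" ) = 0 THEN      ' digits only
                            IF VAL( sDigits ) <= 255 THEN
                               sWork = LEFT$( sWork, lAmp - 1 ) + CHR$( VAL( sDigits )) + MID$( sWork, lSemi + 1 )
                               lStart = lAmp + 1         ' resume just after the decoded character
                            END IF
                         END IF
                      END IF
                   LOOP
                   FUNCTION = sWork
                END FUNCTION

                Because the search resumes after each decoded character, a decoded "&" is never re-read as the start of a new reference, and the loop always moves forward even on malformed input.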




                • #9
                  Are you aware that there are dozens (hundreds?) of other named entities apart from the few you are testing for?


                  https://www.freeformatter.com/html-entities.html
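
                  For context, XML itself predefines only five named entities (&lt; &gt; &amp; &apos; &quot;); the longer lists come from HTML or from a document's DTD. If more names do need handling, one option is a small lookup table. The sketch below is only an illustration: the function name, the array names, and the four sample entries are assumptions, and it assumes a single-byte ANSI (Latin-1 style) string, so entities with codes above 255 would need a different approach:

                  Code:
                  FUNCTION Decode_Named_Entities( BYVAL sWork AS STRING ) AS STRING
                     ' Hypothetical table-driven decoder; extend the table from a reference list as needed.
                     LOCAL i AS LONG
                     DIM sName(1 TO 4) AS STRING
                     DIM lCode(1 TO 4) AS LONG
                     sName(1) = "nbsp"   : lCode(1) = 160   ' non-breaking space
                     sName(2) = "copy"   : lCode(2) = 169   ' copyright sign
                     sName(3) = "reg"    : lCode(3) = 174   ' registered sign
                     sName(4) = "eacute" : lCode(4) = 233   ' e with acute accent
                     FOR i = 1 TO 4
                        REPLACE "&" + sName(i) + ";" WITH CHR$( lCode(i) ) IN sWork
                     NEXT i
                     FUNCTION = sWork
                  END FUNCTION

                  Note that this maps &nbsp; to character 160 rather than to a plain space as in post #8; which is preferable depends on how the decoded text will be used.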


                  • #10
                    Hi Stuart,

                    I knew there were many more, but I do appreciate having solid reference data.

                    I just stayed within the range of items that Steve started with.

                    Thanks!
                    -John


                    • #11
                      Thanks everyone for your help
