UTF-8 to ASCII

  • UTF-8 to ASCII

    I'm using code like this:

    Code:
    Sub UTFtoASCII(tempZ As WStringZ * %Max_Path)
       If InStr(tempZ,"?UTF-8") Then
          Replace "=?UTF-8?Q?" With "" In tempZ
          Replace "=?UTF-8" With ""    In tempZ
          Replace "=20"    With $Spc   In TempZ
          Replace "=3A"    With ":"    In TempZ   '=3A is the hex code for a colon
          Replace "?="     With "?"    In TempZ
          Replace "=26"    With "&"    In TempZ
       End If
    End Sub
    to decode email header content like this ...

    Code:
    From: "=?UTF-8?Q?Yahoo?=" <[email protected]>
    It works, but surely there's an API for the conversion? With a large character set, there's no way I can create a function that covers all the possibilities.

  • #2
    Use the function below (something like this came up before) to get a wide-character string, then ACODE$ to get ASCII. Or modify the function to return ASCII directly (for pure ASCII it is just the low byte of each wide character).
    Code:
    ' file demoUTF8_3.bas
    #compile exe
    #dim all
    declare function MessageBoxW lib "User32.dll" alias "MessageBoxW" _
        (byval hWnd as dword, byval lpText as wstringz pointer, lpCaption as wstringz, _
        byval dwType as dword) as long
    '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    '  \/  \/ what you're looking for \/  \/
    '+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
    function MyUTF8_to_wide(byref UTF8In as string) as string
      local ByteCnt as long 'count of bytes in input string
      local CharCnt as long 'count of characters in input string
      local pByte as byte pointer '
      local pBytePast as long
      local OutStr as string
      local pOutWd as word pointer
      local TempWChar as word
      local CurChar, CntBits as long
      '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
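      'Size and create OutStr with nulls to avoid repeated concatenations
      '
      ByteCnt = len(UTF8In)
      if ByteCnt = 0 then exit function 'empty input: return an empty string rather than dereference a null pointer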
      pByte = strptr(UTF8In)
      pBytePast = pByte + ByteCnt
      'u8InLen = len(UTF8In)
      'u8Temp = UTF8In 'so no changes to input string
      do
        if bit(@pByte, 7) = 0 then 'character is one byte (0xxxxxxx)
          incr pByte
        else 'character is two, three or four bytes
          if bit(@pByte, 6) = 1 then
            if bit(@pByte, 5) = 0 then 'two               (110xxxxx)
               pByte += 2
            else '
              if bit(@pByte, 4) = 0 then 'three           (1110xxxx)
                pByte += 3
              else
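                if bit(@pByte, 3) = 0 then 'four          (11110xxx)
                  'pByte += 4
                  goto RangeError 'a four byte character needs more than the 16 bits a wide character can hold
                else 'invalid lead byte (11111xxx); without this branch pByte never advances
                  goto SyncError
                end if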
              end if
            end if
          else 'bit six never 0 in 1st byte if there are 2, 3 or 4 bytes; _
               'always 0 in 2nd, 3rd and 4th, here 1st byte unless sync error
            goto SyncError 'error
          end if
        end if
        incr CharCnt
        if pByte = pBytePast then
          exit loop
        elseif pByte > pBytePast then
          goto SyncError
        end if
      loop
      OutStr = string$(CharCnt * 2, $nul)
      pOutWd = strptr(OutStr)
      '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      'get size of each character and shift bits into place
      pByte = strptr(UTF8In)
      for CurChar = 1 to CharCnt
        'bytes in this character by first 0 bit position from left
        if bit(@pByte, 7) = 0 then
          CntBits = 6  'to indicate 1 byte length in select case
        else '
          for CntBits = 5 to 3 step -1
            if bit(@pByte, CntBits) = 0 then
              exit for
            end if
          next
        end if
        select case const CntBits
          case 6 'is a one byte character
            TempWChar = mak(word, @pByte, 0)
          case 5 'is a two byte character
            TempWChar = mak(word, @pByte, 0) and &b0000000000011111??
            gosub AddIn2nd3rd4th
          case 4 'is a three byte character
            TempWChar = mak(word, @pByte, 0) and &b0000000000001111??
            gosub AddIn2nd3rd4th
            gosub AddIn2nd3rd4th
          case 3 'four byte character is not used. Is past 16 bits.
                 'For future reference only.
            TempWChar = mak(word, @pByte, 0) and &b0000000000000111??
            gosub AddIn2nd3rd4th
            gosub AddIn2nd3rd4th
            gosub AddIn2nd3rd4th
        end select
        pByte += 1 'point to first (or only) byte in next character
        @pOutWd = TempWChar
        incr pOutWd
      next
      function = OutStr
      exit function 'don't "fall" into subs
      '- - SUBs and GOTO target code - - - - - - - - - - - - - - - - - - - - - - - -
      AddIn2nd3rd4th:
        pByte += 1 'point to next byte in this character
        shift left TempWChar, 6
        TempWChar or= (mak(word, @pByte, 0) and &b0000000000111111??)
      return
      SyncError:
        function = build$("E", $nul, "r", $nul, "r", $nul, "o", $nul, "r", $nul)
        exit function 'exit on error vs. return
      RangeError:
        function = build$("O", $nul, "v", $nul, "e", $nul, "r", $nul, $spc, $nul, _
           "r", $nul, "a", $nul, "n", $nul, "g", $nul, "e", $nul)
    end function
    '+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
    '^^^ You want the above for use. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ' Below here makes file to "see" bits of wide and UTF-8 to figure a
    ' manipulation for the conversion.
    ' Also, test of the proposed function.
    function pbmain () as long
      local wChrs() as word
      local wStr as wstring
      local u8Str, wHex, u8Hex, u8Bits, wBits as string
      local Sz as long
      local x as word
      local pByte as byte pointer
      local nFile as dword
      '
    
    '- - Test the function - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      local u8TestStr as string
      local ResultStr as string
      local ResultHex as string
      'Contents of string c of Eros' test
      u8TestStr = chr$(&hE3, &h83, &h8F, _ 'h30CF
                       &hE3, &h83, &hAD, _ 'h30ED
                       &hE3, &h83, &hBC, _ 'h30FC
                       &hE3, &h83, &hAF, _ 'h30EF
                       &hE3, &h83, &hBC, _ 'h30FC
                       &hE3, &h83, &hAB, _ 'h30EB
                       &hE3, &h83, &h89)   'h30C9
    
      ResultStr = MyUTF8_to_wide(u8TestStr)
      messageboxW 0, strptr(ResultStr), "Hi", 0
      ResultHex = hex$(asc(ResultStr, 1), 2) + hex$(asc(ResultStr, 2), 2) + $spc + _
                  hex$(asc(ResultStr, 3), 2) + hex$(asc(ResultStr, 4), 2) + $spc + _
                  hex$(asc(ResultStr, 5), 2) + hex$(asc(ResultStr, 6), 2) + $spc + _
                  hex$(asc(ResultStr, 7), 2) + hex$(asc(ResultStr, 8), 2) + $spc + _
                  hex$(asc(ResultStr, 9), 2) + hex$(asc(ResultStr,10), 2) + $spc + _
                  hex$(asc(ResultStr,11), 2) + hex$(asc(ResultStr,12), 2) + $spc + _
                  hex$(asc(ResultStr,13), 2) + hex$(asc(ResultStr,14), 2)
    
      ?  ResultHex
      #if %def(%pb_cc32)
        ? "Done. Any key to end."
        waitkey$
      #else
        ? "Done. OK button to end."
      #endif
    end function
    Double check: the result may be wide characters in a normal STRING.
    Dale


    • #3
      Wait a minute, PowerBASIC has a UTF8ToChr$ function and a ChrToUTF8$ function. See Help.

      Cheers,
      Dale


      • #4

        That's not UTF-8 to ASCII. You want to decode "quoted printable" MIME encoding. The =?UTF-8?Q? prefix tells you the data that follows is UTF-8 text in "quoted printable" encoding. "=20" is just the QP encoding for a space, Chr$(32) / Chr$(&H20), and "=3A" is just a colon, Chr$(58) / Chr$(&H3A). All you need to know is in RFC 2047: https://tools.ietf.org/html/rfc2047.
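
A minimal sketch of the Q decoding described in this post, written in Python rather than PowerBASIC only so that it can be run and checked easily. The name `q_decode` and its default charset are assumptions made for illustration, not part of any library. It treats "_" as a space (RFC 2047, section 4.2), turns each "=XX" into the byte with that hex value, and only then interprets the bytes in the declared charset.

```python
import re

def q_decode(payload, charset="utf-8"):
    """Decode the text field of an RFC 2047 "Q" encoded-word (hypothetical helper)."""
    # The payload of an encoded-word is pure ASCII by definition.
    data = payload.encode("ascii")
    # In header Q encoding an underscore always stands for a space.
    data = data.replace(b"_", b" ")
    # Each "=XX" is one raw byte given as two hex digits.
    raw = re.sub(rb"=([0-9A-Fa-f]{2})",
                 lambda m: bytes([int(m.group(1), 16)]),
                 data)
    # Only now are the bytes interpreted in the charset named in the header.
    return raw.decode(charset)

print(q_decode("Hi=2Cstay=20connected=20with=20Yahoo=20Mail=20mobile=20app"))
```

The same three steps should map onto a PowerBASIC loop over the string; UTF8ToChr$ (mentioned in post #3) could plausibly handle the final bytes-to-characters step, though that is untested here.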


        • #5
          Good intro to the topic here: https://ncona.com/2011/06/using-utf-...-mail-subject/


          • #6
            Hi Dale.
            Yes, I'm aware of the PowerBASIC functions but in my tests they did not do the conversion of the example I gave in the OP. As Stuart noted, they don't work on "quoted printables".

            And, Stuart,
            Yes, the intro in that first link was a very nice intro/summary. Bummer that PHP has the capability and PowerBASIC does not.

            I can't get the RFC link to come up, but I Googled and got another link. Reading RFC 2047, I can't tell exactly how many encodings it says are possible, but if the end result can only be a printable ASCII character, then the number is around 128-32? That's not so large that a function couldn't be generated to convert them all.

            I've not read through the entire RFC yet, so there are details I've yet to understand. But there sure is a lot of discussion in the RFC for something that might seem to be essentially a series of substitutions (at least for the Q encoding).


            • #7
              There's a good Stack Overflow discussion here.

              This point is particularly useful ...

              If you are decoding quoted-printable with UTF-8 encoding you will need to be aware that you cannot decode each quoted-printable sequence one-at-a-time as the others have shown if there are runs of quoted printable characters together.
              Things are rarely as simple as we'd like them to be.
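
A small sketch of why runs of escapes cannot be decoded one at a time, in Python for easy checking. Both function names are hypothetical. In UTF-8 the character "é" is the two bytes C3 A9, so it arrives as "=C3=A9"; mapping each "=XX" straight to a character, which is what a chain of Replace statements does, yields two wrong characters instead of one right one.

```python
import re

ESCAPE = r"=([0-9A-Fa-f]{2})"

def decode_each_escape_alone(text):
    # Wrong for multi-byte characters: every =XX becomes a character by itself.
    return re.sub(ESCAPE, lambda m: chr(int(m.group(1), 16)), text)

def decode_bytes_then_charset(text, charset="utf-8"):
    # Right: rebuild the whole byte sequence first, then decode it in one go.
    raw = re.sub(ESCAPE.encode("ascii"),
                 lambda m: bytes([int(m.group(1), 16)]),
                 text.encode("ascii"))
    return raw.decode(charset)

sample = "Caf=C3=A9"
print(decode_each_escape_alone(sample))   # two stray characters after "Caf"
print(decode_bytes_then_charset(sample))  # "Caf" followed by a single e-acute
```

For pure ASCII payloads such as the "Yahoo" example the two approaches agree, which is why the substitution approach appears to work until a non-ASCII name shows up.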


              • #8
                And, now that Stuart put a name to it "quoted printable", here's a PowerBASIC thread on the topic.


                • #9
                  Dale,
                  Using your code in post #2, I get a hex output string. I can't tell from your comments what to do with it. Could you say a little more about it please?

                  I also tried the code from the link I just gave but it did not decode the string from the OP either.

                  I'll take a closer look at it all. Most likely it's something I'm doing wrong or something I'm not understanding.


                  • #10
                    Originally posted by Gary Beene
                    And, now that Stuart put a name to it "quoted printable", here's a PowerBASIC thread on the topic.
                    Looks like it does just what you want:

                    This codes allows you to decode quoted printable headers for mail and
                    news. UTF-8, ISO-8859-1 and ISO-8859-15 (Euro) support.

                    Specifically:
                    FUNCTION Get_Qp2Text(BYVAL EncodedText AS STRING) AS STRING


                    • #11
                      Here's a stripped down version of the previous code: https://forum.powerbasic.com/forum/u...table-routines


                      • #12
                        I'll throw https://pbcrypto.basicaware.de/ShowAlgorithm.aspx?id=28 into the mix.


                        • #13
                          I remembered why I wrote that code. The OP (at the time) needed wide characters in a STRING (vs WSTRING).

                          The link given by Stuart in post 4 says headers must be ASCII.

                          ASCII and UTF-8 are exactly the same for values 0 to 127 inclusive.

                          Therefore you're dealing with something other than straight UTF-8. (Base64? Base64 recoded in UTF-8? ???)

                          Think I'm outa here on this thread.
                          Dale


                          • #14
                            Dale,
                            Well, don't go too far. Threads have a way of wandering into new directions...


                            • #15
                              Originally posted by Dale Yarker
                              I remembered why I wrote that code. The OP (at the time) needed wide characters in a STRING (vs WSTRING).

                              The link given by Stuart in post 4 says headers must be ASCII.

                              ASCII and UTF-8 are exactly the same for values 0 to 127 inclusive.

                              Therefore you're dealing with something other than straight UTF-8. (Base64? Base64 recoded in UTF-8? ???)

                              Think I'm outa here on this thread.
                              Actually, you're dealing with straight UTF-8 encoded as ASCII.

                              That's the whole point of QP encoding. It enables you to encode UTF-8 using just ASCII characters for transmission. That allows you to send an email with a Subject or From field containing foreign-language characters, etc.


                              • #16
                                Actually, you're dealing with straight UTF-8 encoded as ASCII.
                                Actually, you didn't read your own reference!

                                Gary, if it were just ASCII, nothing else would need to be done. Trying to "hide" a character above a value of 127 using ASCII "=" and "?" patterns is neither ASCII nor UTF-8. I do not know about that, and the likelihood of a response like post number 15 is the reason I said I could no longer contribute to this thread.

                                (somebody thinks they're MCM's replacement _______)
                                Dale


                                • #17
                                  Originally posted by Dale Yarker
                                  Trying to "hide" a character above value of 127 using ASCII "=" and "?" patterns is not ASCII nor UTF-8.
                                  Let me try to clarify again.

                                  What we have in the original post is an ASCII string transmitted as the FROM header of an email. For historical reasons, email headers can only contain ASCII characters.

                                  The content of an email is not limited to ASCII characters, but the actual transmission of the headers is.

                                  Therefore there is a need to encode a message or parts of a message so that the non-ASCII characters can be transmitted when required in the headers.

                                  RFC 2047 is the standard way to do this. It allows for various original character sets including UTF-8 and various OEM etc "high bit ASCII" character sets to be encoded just using the ASCII character set. The two main encoding methods in the RFC are Quoted Printable and Base-64.

                                  That "From:..." string in the original post is an ASCII string resulting from "Quoted Printable" encoding of a UTF-8 string. An email client needs to be able to recognise it as such and decode the QP string back to its original form, in this case UTF-8.
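
The decoding an email client has to do can be sketched as follows, in Python for easy checking. This is a simplified illustration under stated assumptions, not a complete RFC 2047 implementation: `decode_header_value` and `ENCODED_WORD` are hypothetical names, the address in the usage line is a placeholder, and the sketch does not drop the whitespace between adjacent encoded-words as the RFC asks. It finds each `=?charset?encoding?text?=` unit, undoes the Q or B transfer encoding with the standard library, and then decodes the resulting bytes using the charset named in the header.

```python
import base64
import binascii
import re

# =?charset?encoding?text?=  with encoding being Q or B (case-insensitive).
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")

def decode_header_value(value):
    """Replace every encoded-word in a header line with its decoded text."""
    def decode_word(match):
        charset, encoding, text = match.groups()
        if encoding.upper() == "Q":
            # header=True makes "_" decode to a space, as header Q encoding requires.
            raw = binascii.a2b_qp(text, header=True)
        else:
            raw = base64.b64decode(text)
        # The charset field says how to read the recovered bytes.
        return raw.decode(charset)
    return ENCODED_WORD.sub(decode_word, value)

# Placeholder address, assumed for illustration only.
print(decode_header_value('From: "=?UTF-8?Q?Yahoo?=" <sender@example.com>'))
```

Text outside the encoded-words is passed through unchanged, so a header that is already plain ASCII comes back as-is.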



                                  • #18
                                    Dale,
                                    I don't think I yet understand what you're trying to tell me.

                                    The incoming string that got me started on this thread was this, which is all ASCII characters, corresponding to what I read in RFC as encoding type "Q", or quoted-printable.

                                    Code:
                                    From: "=?UTF-8?Q?Yahoo?=" <[email protected]>
                                    Here's another from an email I just downloaded.

                                    Code:
                                    Subject: =?UTF-8?Q?Hi=2Cstay=20connected=20with=20Yahoo=20Mail=20mobile=20app?=
                                    From what I have read, the Q characters must be from characters in the range 0x21 to 0x7E, so a space, which is out of that range, has to be Q coded, where "=20" is the Q code for a space. In general, the RFC says this about the Q coding ...

                                    the techniques outlined here were designed to allow the use of non-ASCII characters in message headers
                                    Is the bottom line that the "?UTF-8" is misleading, in that combined with the "Q", there's a hybrid encoding that's taking place? It doesn't seem to me that you and Stuart are talking apples to apples.

                                    As best I know, my need is to decode type Q coded message headers.

                                    ... added ... Stuart, just missed your post. This comment of yours ...


                                    can only contain ASCII characters.
                                    ... doesn't clarify that characters 32 and below are not allowed Q characters. Do I have that correct?


                                    • #19
                                      And, wouldn't you know it, today I ran across a type B encoding ...

                                      Code:
                                      From: =?UTF-8?B?UGllciAx?= <[email protected]>
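
A minimal sketch of decoding the text field of a type B encoded-word, in Python for easy checking. The name `b_decode` is hypothetical. Type B is plain Base64, so the field is Base64-decoded to bytes and the bytes are then read in the charset named in the header (UTF-8 here).

```python
import base64

def b_decode(payload, charset="utf-8"):
    """Decode the text field of an RFC 2047 "B" encoded-word (hypothetical helper)."""
    raw = base64.b64decode(payload)  # Base64 text -> raw bytes
    return raw.decode(charset)       # raw bytes -> characters

print(b_decode("UGllciAx"))  # prints: Pier 1
```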


                                      • #20
                                        Originally posted by Gary Beene
                                        Dale,
                                        I don't think I yet understand what you're trying to tell me.

                                        The incoming string that got me started on this thread was this, which is all ASCII characters, corresponding to what I read in RFC as encoding type "Q", or quoted-printable.

                                        Code:
                                        From: "=?UTF-8?Q?Yahoo?=" <[email protected]>
                                        Here's another from an email I just downloaded.

                                        Code:
                                        Subject: =?UTF-8?Q?Hi=2Cstay=20connected=20with=20Yahoo=20Mail=20mobile=20app?=
                                        From what I have read, the Q characters must be from characters in the range 0x21 to 0x7E, so a space, which is out of that range, has to be Q coded, where "=20" is the Q code for a space. In general, the RFC says this about the Q coding ...





                                        Yes, you have that correct. Note the word "Printable" in "Quoted Printable Encoding". Only printable characters are used in QP encoding. And there are a few additional rules about certain of those characters and the need to encode them.

                                        e.g. RFC 2047 Para 5(3):
                                        In this case the set of characters that may be used in a "Q"-encoded 'encoded-word' is restricted to: <upper and lower case ASCII letters, decimal digits, "!", "*", "+", "-", "/", "=", and "_" (underscore, ASCII 95.)>.

                                        Originally posted by Gary Beene
                                        Is the bottom line that the "?UTF-8" is misleading, in that combined with the "Q", there's a hybrid encoding that's taking place? It doesn't seem to me that you and Stuart are talking apples to apples. As best I know, my need is to decode type Q coded message headers. ... added ... Stuart, just missed your post. This comment of yours ...
                                        ... doesn't clarify that characters 32 and below are not allowed Q characters. Do I have that correct?
                                        The ?UTF-8 is not misleading at all. It is telling you that the original header before QP encoding was a UTF-8 string (or was converted to that from a Windows WString). Without knowing that, you cannot convert the QP string back to its correct form (a UTF-8 string).
