Unicode Character Replacement

  • Unicode Character Replacement

    I downloaded a file and it displayed this:

    He'd
    but when I looked at the file in hex, it had this ...

    Code:
    48 65 E2 80 99 64
    where the E2 80 99 is apparently Unicode for an apostrophe

    I can't say I've run across that before. I thought the rule of thumb was that a Unicode file starts with EF BB BF, which I don't see in this file.

    This will get rid of the Unicode bytes easily enough ...
    Code:
    Replace Chr$(&HE2) + Chr$(&H80) + Chr$(&H99) With "'" In temp$
    Insight anyone?

  • #2
    Originally posted by Gary Beene
    I downloaded a file and it displayed this:

    but when I looked at the file in hex, it had this ...

    Code:
    48 65 E2 80 99 64
    ... I thought the rule of thumb was that a Unicode file starts with EF BB BF, which I don't see in this file.

    Unicode is not a file format; it is a character set that assigns a numeric code point to each character. How those code points are encoded as bytes in a file is a different matter.
    (IOW, there is not a single format for a "Unicode file").

    There are several ways that Unicode can be represented in a file, including the commonly encountered UTF-16 and UTF-8.

    It would appear that the file you have is UTF-8 encoded since some characters are encoded with a single byte. UTF-8 encodes Unicode characters with anything from 1 to 4 bytes.
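
    To make the 1-to-4-byte point concrete, here is a minimal sketch. It is in Python rather than PowerBASIC, purely because it is easy to run and check. The sample characters and the helper name utf8_len are my own illustrative picks, not anything from the file in question.

```python
# Sketch: how many bytes UTF-8 uses for a few sample characters.
# The samples are illustrative: H, e-acute, right single quote, an emoji.
samples = ["H", "\u00e9", "\u2019", "\U0001F600"]


def utf8_len(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    hex_bytes = ch.encode("utf-8").hex(" ").upper()  # bytes.hex(sep) needs Python 3.8+
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({utf8_len(ch)} byte(s))")
```

    Plain ASCII characters such as "H" stay one byte each, which is why most of the file looks like ordinary text in a hex viewer.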

    EF BB BF is the optional Byte Order Mark (BOM) for UTF-8. (For UTF-16, it is FE FF for big-endian or FF FE for little-endian.)

    To quote Wikipedia:

    BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

    IOW, it's up to you to know or determine the encoding of an unknown text file, which, as you are learning, can be difficult.
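
    One rough way to approach that guess, sketched in Python rather than PowerBASIC: look for a BOM first, and if there is none, see whether the bytes decode cleanly as strict UTF-8. The function name guess_encoding is hypothetical, and this is only a heuristic. It ignores UTF-32 and cannot tell the various 8-bit ANSI code pages apart.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: guess an encoding from a BOM, else try strict UTF-8."""
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):  # FE FF / FF FE
        return "utf-16"
    try:
        data.decode("utf-8")  # strict mode raises on invalid byte sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"  # probably some 8-bit code page; cannot say which


# The bytes from the original post: no BOM, but valid UTF-8.
sample = bytes.fromhex("48 65 E2 80 99 64")
print(guess_encoding(sample))
```

    A file that contains only ASCII will also report as "utf-8" here, which is harmless since ASCII is a subset of UTF-8.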


    Although similar in appearance, E2 80 99 is not the ASCII apostrophe. It is the 3-byte UTF-8 encoding of the Unicode code point &H2019, which is the "right single quotation mark" (those nasty things that MS Word likes to throw in and call "smart quotes").

    http://unicode.scarfboy.com/?s=U%2BE28099
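
    If you want to check this yourself, a small Python sketch (Python only because it is quick to run; the variable names are mine) decodes the six bytes from the first post and prints each code point with its official Unicode name. The last line does the replacement on the decoded text rather than on the raw bytes.

```python
import unicodedata

# The six bytes from the hex dump in the original post.
raw = bytes.fromhex("48 65 E2 80 99 64")
text = raw.decode("utf-8")  # six bytes become four characters

for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Swap the typographic quote for a plain ASCII apostrophe.
plain = text.replace("\u2019", "'")
print(plain)
```

    Working on decoded text means the same one-line replace also catches other curly quotes (for example U+2018) if you add them, without having to know their byte sequences.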


    • #3
      Load the file into a string (MyFile here), then convert it:

      Code:
      Local MyString  As String
      Local MyWString As WString

      MyString  = Utf8ToChr$(MyFile)   ' assigning to a String gives the ANSI version
      MyWString = Utf8ToChr$(MyFile)   ' assigning to a WString gives the wide version
