
Determining file saved as unicode?


  • Determining file saved as unicode?

    Is there an API to test the storage type of a (text) file?

  • #2
    Edwin, simply open the file and read the first 2 bytes. If the file was saved as Unicode (UTF-16 with a byte order mark), the first 2 bytes will be FF FE or FE FF, depending on the byte order used. To create such a file yourself, just fire up write.exe, which lets you save text files as Unicode.


    • #3
      I was confused and thought ordinary file opening skipped those bytes.
      It isn't so, and I can indeed read these bytes (tested).

      Thanks, I'll make a filter for this.


      • #4
        Something like this, I guess:
            Local T As String
            T = VD_LoadFromFile( "C:\mydoc.txt" )
            MsgBox Format$( FileIsUnicode( T ) ) & $CrLf & Left$( T, 10 )
        Function FileIsUnicode( ByVal sLeftData As String * 6 ) As Long
            Local pByte As Byte Ptr
            sLeftData = Left$( sLeftData, 3 ) & String$( 3, 0 )
            pByte = VarPtr( sLeftData )
            If ( @pByte[0] = &HFF And @pByte[1] = &HFE ) _
                Or ( @pByte[0] = &HFE And @pByte[1] = &HFF ) Then
                Function = 1
            End If
        End Function


        • #5
          Get yourself a hex editor, eheh - very useful for these sorts of situations. I use Hex Workshop.

          To test those two bytes it's best to use a WORD PTR rather than a BYTE PTR ... I'd go with something like ...
          FUNCTION IsFileUnicode(BYVAL hFile AS DWORD) AS DWORD
              LOCAL sBuf AS STRING * 2, wPtr AS WORD PTR
              ' Read the first two bytes (the default file BASE is 1)
              SEEK #hFile, 1:  GET$ #hFile, 2, sBuf
              wPtr = VARPTR(sBuf)
              ' Either byte order of the byte order mark counts as Unicode
              IF @wPtr = &h0000FFFE OR @wPtr = &h0000FEFF THEN FUNCTION = 1
          END FUNCTION

          FUNCTION PBMAIN() AS LONG
              LOCAL hFile AS DWORD
              hFile = FREEFILE
              OPEN "c:\unicode.txt" FOR BINARY ACCESS READ LOCK SHARED AS #hFile
              IF IsFileUnicode(BYVAL hFile) = 1 THEN
                  MSGBOX "File is unicode"
              ELSE
                  MSGBOX "Not unicode"
              END IF
              CLOSE #hFile
          END FUNCTION
          Last edited by Wayne Diamond; 30 Jul 2009, 11:35 AM.


          • #6
            The WinAPI function IsTextUnicode() sounds handy for your task.

            If you are on Win9x, here is what you MUST use to call this API: PB/CC: IsTextUnicode with Microsoft Unicode Layer for Win95/98/ME April 11, 2002

            Should work on anything after 9x with only minor tweaks.

            Michael Mattias
            Tal Systems (retired)
            Port Washington WI USA
            [email protected]
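
            For readers who want to try the API route, here is a minimal sketch of asking IsTextUnicode only about the two byte-order-mark tests. The names apiIsTextUnicode and UnicodeBomFlags are hypothetical, and the DECLARE is an assumption written from the documented C prototype; WIN32API.INC normally supplies its own declaration, so the local name is changed here to avoid a clash. It has not been tested on Win9x.

            ```powerbasic
            ' Assumed local declaration, mirroring the documented C prototype
            ' BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult).
            DECLARE FUNCTION apiIsTextUnicode LIB "ADVAPI32.DLL" ALIAS "IsTextUnicode" _
                (BYVAL lpBuffer AS DWORD, BYVAL cb AS LONG, lpi AS LONG) AS LONG

            %IS_TEXT_UNICODE_SIGNATURE         = &H0008   ' buffer starts with FF FE
            %IS_TEXT_UNICODE_REVERSE_SIGNATURE = &H0080   ' buffer starts with FE FF

            ' Hypothetical helper: returns 0 when no byte order mark is found,
            ' otherwise the flag of the test that passed.
            FUNCTION UnicodeBomFlags(BYVAL sData AS STRING) AS LONG
                LOCAL flags AS LONG
                IF LEN(sData) < 2 THEN EXIT FUNCTION
                ' On input, flags selects the tests to run;
                ' on output, it holds only the tests that passed.
                flags = %IS_TEXT_UNICODE_SIGNATURE OR %IS_TEXT_UNICODE_REVERSE_SIGNATURE
                CALL apiIsTextUnicode(STRPTR(sData), LEN(sData), flags)
                FUNCTION = flags
            END FUNCTION

            FUNCTION PBMAIN() AS LONG
                LOCAL hFile AS LONG, sData AS STRING
                hFile = FREEFILE
                OPEN "c:\unicode.txt" FOR BINARY ACCESS READ LOCK SHARED AS #hFile
                GET$ #hFile, LOF(hFile), sData      ' load the whole file
                CLOSE #hFile
                MSGBOX "BOM flags: " & HEX$(UnicodeBomFlags(sData))
            END FUNCTION
            ```

            Restricting the input flags to the two signature tests keeps the result comparable to the manual two-byte check; passing other flags turns on the statistical tests, which can also report text without a byte order mark.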


            • #7
              IsTextUnicode is probably the better way to go if you need specific info about the file. There's a fair amount to it at the assembly level, though, mainly because it can tell you exactly what type of file it is (see the IsTextUnicode documentation). Here is just a small fraction from the start of ntdll's RtlIsTextUnicode:

              77F612E5 >  55                   push ebp
              77F612E6    8BEC                 mov ebp, esp
              77F612E8    83EC 5C              sub esp, 5C
              77F612EB    8B4D 0C              mov ecx, dword ptr ss:[ebp+C]
              77F612EE    53                   push ebx
              77F612EF    33DB                 xor ebx, ebx
              77F612F1    56                   push esi
              77F612F2    8BF1                 mov esi, ecx
              77F612F4    D1EE                 shr esi, 1
              77F612F6    B8 00010000          mov eax, 100
              77F612FB    3BF0                 cmp esi, eax
              77F612FD    895D CC              mov dword ptr ss:[ebp-34], ebx
              77F61300    895D D0              mov dword ptr ss:[ebp-30], ebx
              77F61303    895D D4              mov dword ptr ss:[ebp-2C], ebx
              77F61306    895D D8              mov dword ptr ss:[ebp-28], ebx
              77F61309    895D DC              mov dword ptr ss:[ebp-24], ebx
              77F6130C    895D AC              mov dword ptr ss:[ebp-54], ebx
              77F6130F    895D B0              mov dword ptr ss:[ebp-50], ebx
              77F61312    895D BC              mov dword ptr ss:[ebp-44], ebx
              77F61315    895D C0              mov dword ptr ss:[ebp-40], ebx
              77F61318    895D C4              mov dword ptr ss:[ebp-3C], ebx
              77F6131B    895D C8              mov dword ptr ss:[ebp-38], ebx
              77F6131E    895D E0              mov dword ptr ss:[ebp-20], ebx
              77F61321    895D B4              mov dword ptr ss:[ebp-4C], ebx
              77F61324    895D B8              mov dword ptr ss:[ebp-48], ebx
              77F61327    895D A8              mov dword ptr ss:[ebp-58], ebx
              77F6132A    895D F0              mov dword ptr ss:[ebp-10], ebx
              77F6132D    895D E8              mov dword ptr ss:[ebp-18], ebx
              77F61330    895D EC              mov dword ptr ss:[ebp-14], ebx
              77F61333    895D E4              mov dword ptr ss:[ebp-1C], ebx
              77F61336    895D F4              mov dword ptr ss:[ebp-C], ebx
              77F61339    895D FC              mov dword ptr ss:[ebp-4], ebx
              77F6133C    8975 A4              mov dword ptr ss:[ebp-5C], esi
              77F6133F    8945 F8              mov dword ptr ss:[ebp-8], eax
              77F61342    77 03                ja short ntdll.77F61347
              77F61344    8975 F8              mov dword ptr ss:[ebp-8], esi
              77F61347    83F9 02              cmp ecx, 2
              77F6134A  ^ 0F82 0C6CFFFF        jb ntdll.77F57F5C
              77F61350    8B55 08              mov edx, dword ptr ss:[ebp+8]
              77F61353    0F84 96330200        je ntdll.77F846EF
              77F61359    83F9 02              cmp ecx, 2
              77F6135C    76 18                jbe short ntdll.77F61376
              77F6135E    3BF0                 cmp esi, eax
              77F61360    77 14                ja short ntdll.77F61376
              77F61362    F645 0C 01           test byte ptr ss:[ebp+C], 1
              77F61366    75 0E                jnz short ntdll.77F61376
              77F61368    8B45 F8              mov eax, dword ptr ss:[ebp-8]
              77F6136B    F64442 FF FF         test byte ptr ds:[edx+eax*2-1], 0FF   ; <-- Check for the FF magic byte
              77F61370    0F84 91330200        je ntdll.77F84707


              • #8
                Thanks for the API.
                I'll use my code for now until I encounter a new issue.
                I know I don't catch all types, but it isn't a big deal right now.



                • #9
                  Yo, Wayne, your "check the disk file header" code may be working now, but there is a bug waiting to bite you in a tender area....
                  OPEN "c:\unicode.txt" FOR BINARY ACCESS READ LOCK SHARED AS #hFile
                  SEEK #hFile, 0
                  You should "SEEK hFile, 1" unless you have used the "BASE=0" clause in the OPEN, because the default is BASE=1.

                  Or, you can remove the guesswork entirely (recommended):
                  SEEK #hFile, FILEATTR(hFile, -2&)
                  Last edited by Michael Mattias; 30 Jul 2009, 09:47 AM.
                  Michael Mattias


                  • #10
                    Also, if you want to get the halfword at the start of the file, there's no sense creating a string and then requesting the address of the data...

                    LOCAL FileHeaderWord AS WORD
                      SEEK ...
                      GET hFile, ,FileHeaderWord

                    You might have to tweak the order of the bytes in your comparison.

                    Not that it matters, because OPENing and reading the file takes a LOT more time than does building a temp string and this is really moot performance-wise. But I do believe fewer steps to be a tad less cryptic when you go back to this code in the future.

                    Michael Mattias
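
                    The idea above, written out as a minimal sketch. FileBomWord and its return codes are hypothetical names, and the sketch assumes the default BASE=1 on the OPEN. Because x86 is little-endian, the first byte of the file lands in the low byte of the WORD, which is why the byte order in the comparison looks swapped.

                    ```powerbasic
                    ' Hypothetical helper: 0 = no byte order mark (or file not readable),
                    ' 1 = bytes FF FE on disk, 2 = bytes FE FF on disk.
                    FUNCTION FileBomWord(BYVAL sFile AS STRING) AS LONG
                        LOCAL hFile AS LONG
                        LOCAL FileHeaderWord AS WORD
                        hFile = FREEFILE
                        OPEN sFile FOR BINARY ACCESS READ LOCK SHARED AS #hFile
                        IF ERR THEN EXIT FUNCTION       ' could not open the file
                        IF LOF(hFile) >= 2 THEN
                            SEEK #hFile, 1              ' first byte, given BASE=1
                            GET #hFile, , FileHeaderWord
                        END IF
                        CLOSE #hFile
                        ' Leading zeros keep the hex literals from being read as
                        ' negative 16-bit values.
                        IF FileHeaderWord = &H0000FEFF THEN
                            FUNCTION = 1                ' FF FE: little-endian mark
                        ELSEIF FileHeaderWord = &H0000FFFE THEN
                            FUNCTION = 2                ' FE FF: big-endian mark
                        END IF
                    END FUNCTION
                    ```

                    Called as FileBomWord("c:\unicode.txt"), any nonzero result means the file starts with a byte order mark; a Unicode file saved without one will still return 0.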


                    • #11
                      Michael, yes, you're correct: the SEEK instruction should've been set to 1 (without a BASE declaration). Well spotted, and cheers for the heads-up. I've updated the aforementioned code sample.