Recognize text file?

  • Recognize text file?

    Hi fellows,

    An application of mine has to convert a text file to a special format.
    The text file starts out as a Word document, which the user must open in MS Word and save as a plain text file before running my app.
    MS Word offers two options: save as text or save as DOS text. The latter filter apparently uses CharToOem internally to replace ANSI characters with their OEM character set equivalents.

    My question: can my app recognize which character set the text file contains? If it is ANSI, my app should call CharToOem first; otherwise this step must be skipped.

    Thanks


    ------------------
    www.basicguru.com/zijlema/

    [This message has been edited by Egbert Zijlema (edited October 08, 2001).]

    Egbert Zijlema, journalist and programmer (zijlema at basicguru dot eu)
    http://zijlema.basicguru.eu
    *** Opinions expressed here are not necessarily untrue ***

  • #2
    Egbert, I think the only real way would be to scan the file for certain characters. For example, a text file shouldn't contain CHR$(0), but it should contain a lot of CHR$(13, 10) pairs.
    Best,
    Wayne


    ------------------
    -



    • #3
      Fellows,

      Obviously I didn't make myself very clear. What I mean is this: a user may save the document in two ways, either as Windows text (ANSI) or as "DOS" text (OEM).
      Because you can never be sure which option a 'stupid user' will select (text is text, isn't it?), my app should check which character set was used.
      Is this possible?
      Or is it safe to call CharToOem anyway, hoping that nothing changes if MS Word has already done the conversion (i.e. when the text was saved as DOS text)?
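On the "call CharToOem anyway" idea, here is a small sketch in Python (not PowerBASIC) of what a second conversion does. It assumes Western code pages 1252 (ANSI) and 437 (OEM), and it uses Python's codecs as a stand-in for CharToOem, so the exact substitute characters Windows picks may differ. The helper name `ansi_to_oem` is made up for this illustration.

```python
# Assumption: ANSI = code page 1252, OEM = code page 437.
ANSI = "cp1252"
OEM = "cp437"

def ansi_to_oem(data: bytes) -> bytes:
    """Rough stand-in for CharToOem: read the bytes as ANSI, write them as OEM."""
    return data.decode(ANSI, errors="replace").encode(OEM, errors="replace")

ansi_text = "café".encode(ANSI)      # what "save as text" would produce
oem_once = ansi_to_oem(ansi_text)    # the intended single conversion
oem_twice = ansi_to_oem(oem_once)    # converting text that is already OEM

print(oem_once)   # the accented e is now byte 130
print(oem_twice)  # byte 130 was misread as an ANSI character and is lost
```

Under these assumptions the second pass does change the text, so converting blindly is not harmless.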


      ------------------
      www.basicguru.com/zijlema/

      Egbert Zijlema, journalist and programmer (zijlema at basicguru dot eu)
      http://zijlema.basicguru.eu
      *** Opinions expressed here are not necessarily untrue ***



      • #4
        Actually, there is no direct way to recognize the character set of a plain text file, so the user should always be able to specify OEM or ANSI.
        It is possible to try to detect it programmatically, but I am afraid this will not work for all languages.
        Russian OEM and ANSI are very different.
        So for Russian you can compare the number of characters in the OEM ranges (128-175, 224-239) with the number in the ANSI range (192-255).
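A rough sketch of this counting idea, written in Python rather than PowerBASIC. The byte ranges come from the post; the function name `guess_cyrillic_charset` and the choice of code pages 866 (OEM) and 1251 (ANSI) for the test strings are assumptions for illustration.

```python
def guess_cyrillic_charset(data: bytes) -> str:
    """Compare how many bytes fall in the OEM ranges vs. the ANSI range."""
    # Ranges as given in the post. Bytes 224-239 count for both sides,
    # so the bytes outside that overlap decide the result.
    oem = sum(1 for b in data if 128 <= b <= 175 or 224 <= b <= 239)
    ansi = sum(1 for b in data if 192 <= b <= 255)
    if oem == ansi:
        return "unknown"  # e.g. pure ASCII text gives no evidence either way
    return "OEM" if oem > ansi else "ANSI"

# Assumed code pages: 866 for Russian OEM, 1251 for Russian ANSI.
print(guess_cyrillic_charset("Привет".encode("cp866")))
print(guess_cyrillic_charset("Привет".encode("cp1251")))
```

This is only a heuristic: short texts or texts with few Cyrillic letters can easily come out as "unknown" or wrong, which is why a manual override remains useful.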



        ------------------
        E-MAIL: [email protected]



        • #5
          Fellows,

          I was already afraid of what Semen reports. Both text formats use the .TXT extension, so there is no visible difference.
          But now I have found the following. Say, just as an example, you want to convert an accented e (é) into the at sign (@); then your code might look as follows:
          Code:
          REPLACE CHR$(130) WITH CHR$(64) IN FileContent$
          This works for the OEM charset only. But after replacing the code with
          Code:
          REPLACE "é" WITH CHR$(64) IN FileContent$
          it works correctly for both charsets. So using a string literal instead of its character code appears to do the trick.
          Is this a reliable method under all circumstances?

          ------------------
          www.basicguru.com/zijlema/

          Egbert Zijlema, journalist and programmer (zijlema at basicguru dot eu)
          http://zijlema.basicguru.eu
          *** Opinions expressed here are not necessarily untrue ***



          • #6
            As Semen points out, it is language dependent. I once wrote something
            similar for a text editor and used the following table for some of the
            work. From this table you can see which characters to look for in your
            language. I am not sure it is 100% correct, but I think it is okay. Pick
            out the most common ones and look for them with INSTR, ANY. I remember
            using letters like üåäöñ£ from the table below for my needs.

            Another way could be to do an OemToAnsi conversion and then compare
            the result with the original, but that is slow and still not very reliable.
            Code:
              CASE 128 'Ç
              CASE 129 'ü
              CASE 130 'é
              CASE 131 'â
              CASE 132 'ä
              CASE 133 'à
              CASE 134 'å
              CASE 135 'ç
              CASE 136 'ê
              CASE 137 'ë
              CASE 138 'è
              CASE 139 'ï
              CASE 140 'î
              CASE 141 'ì
              CASE 142 'Ä
              CASE 143 'Å
              CASE 144 'É   
              CASE 145 'æ   
              CASE 146 'Æ
              CASE 147 'ô
              CASE 148 'ö
              CASE 149 'ò
              CASE 150 'û
              CASE 151 'ù
              CASE 152 'ÿ
              CASE 153 'Ö
              CASE 154 'Ü
              CASE 155 '¢
              CASE 156 '£
              CASE 157 '¥
              CASE 160 'á
              CASE 161 'í
              CASE 162 'ó
              CASE 163 'ú
              CASE 164 'ñ
              CASE 165 'Ñ
              CASE 168 '¿
              CASE 171 '½
              CASE 172 '¼
              CASE 230 'µ

            ------------------



            • #7
              Code:
              REPLACE CHR$(130) WITH CHR$(64) IN FileContent$
              This works for the OEM-charset only. But after replacing the code by
              Code:
              REPLACE "é" WITH CHR$(64) IN FileContent$
              it works correctly for both charsets.
              There must have been some error in your testing. REPLACE has no special
              magic. If "é" = CHR$(130), the results will be the same whether you use "é"
              or CHR$(130) in REPLACE.

              ------------------
              Tom Hanlin
              PowerBASIC Staff



              • #8
                Borje,

                As I pointed out, my program does not know whether the user saved the Word document as (Windows) text or as DOS text.
                In Windows (ANSI charset), for instance, the accented e (é) is CHR$(233), while in DOS (OEM charset) it is CHR$(130). See the problem?
                To put it clearly: I am not the user of the application (that would be no problem); the user is someone else, who may have made
                the 'wrong' Word-to-text conversion in advance.
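The two codes mentioned here can be checked with a short Python sketch (Python rather than PowerBASIC, and assuming code page 1252 for ANSI and code page 437 for OEM). It also shows what each byte looks like when it is read with the other character set.

```python
# Assumption: ANSI = code page 1252, OEM = code page 437.
ansi_byte = "é".encode("cp1252")[0]
oem_byte = "é".encode("cp437")[0]
print(ansi_byte, oem_byte)  # 233 130

# The same byte means something else under the other character set.
oem_byte_read_as_ansi = bytes([oem_byte]).decode("cp1252")   # a low quotation mark
ansi_byte_read_as_oem = bytes([ansi_byte]).decode("cp437")   # Greek capital theta
print(repr(oem_byte_read_as_ansi), repr(ansi_byte_read_as_oem))
```

A literal "é" in source code is stored as one of these two bytes, depending on the character set the editor saved it in, so it cannot match both variants of the file.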

                ------------------
                www.basicguru.com/zijlema/

                Egbert Zijlema, journalist and programmer (zijlema at basicguru dot eu)
                http://zijlema.basicguru.eu
                *** Opinions expressed here are not necessarily untrue ***



                • #9
                  I was probably a bit unclear; a small sample may show it better:
                  Code:
                    LOCAL txt AS STRING, p AS LONG
                    txt = "dfglihdfvnioöéergjodf"
                  
                    'CharToOem BYVAL STRPTR(txt), BYVAL STRPTR(txt) 'for OEM test..
                  
                    p = INSTR(txt, CHR$(233))
                    IF p THEN
                       MSGBOX "is ANSI"  '..or do whatever
                    ELSE
                       p = INSTR(txt, CHR$(130))
                       IF p THEN MSGBOX "is OEM"
                    END IF
                  Of course, looking for only one character is not enough, but applying
                  this test to maybe the 4-5 most common ones should work quite well.
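Extending the sample to several characters could look like the following sketch, written in Python rather than PowerBASIC. The probe characters are taken from the letters mentioned earlier in the thread; the names `PROBE_CHARS` and `guess_charset` and the choice of code pages 1252 (ANSI) and 437 (OEM) are assumptions.

```python
# Assumption: ANSI = code page 1252, OEM = code page 437.
PROBE_CHARS = "éüäöåñ"  # pick the letters most common in your language
ANSI_BYTES = set(PROBE_CHARS.encode("cp1252"))
OEM_BYTES = set(PROBE_CHARS.encode("cp437"))

def guess_charset(data: bytes) -> str:
    """Count hits for the ANSI and the OEM codes of the probe characters."""
    ansi_hits = sum(1 for b in data if b in ANSI_BYTES)
    oem_hits = sum(1 for b in data if b in OEM_BYTES)
    if ansi_hits == oem_hits:
        return "unknown"  # no accented letters found, or the evidence is balanced
    return "ANSI" if ansi_hits > oem_hits else "OEM"

print(guess_charset("über café".encode("cp1252")))
print(guess_charset("über café".encode("cp437")))
```

Counting hits on both sides, instead of stopping at the first match, makes the guess less sensitive to a single stray byte. It is still a guess, so a text without any of the probe characters stays "unknown".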


                  ------------------
