Long CSV fields


  • Originally posted by Pierre Bellisle View Post
    I found a glitch in my code above, it is now corrected.
    Rod, you have the same issue.
    For a field to be considered double-quoted,
    the first character and the last character of the field have to be double-quote.
    So a record like:
    Field 1, "Field 2",Field 3
    is not considered to have the second field as double-quoted
    because of the leading space in front of "Field 2".
    Here is another anomaly: Excel does not appear to follow that rule completely.

    Given:
    Code:
    Field 1, "Field 2",Field 3
    Field 1,"Field 2" ,Field 3
    Field 1, "Field 2" ,Field 3
    
    Excel parses it like this:
    [Field 1][ "Field 2"][Field 3]
    [Field 1][Field 2 ][Field 3]
    [Field 1][ "Field 2" ][Field 3]
    
    Pierre's code parses it like this:
    [Field 1][ "Field 2"][Field 3]
    [Field 1]["Field 2" ][Field 3]
    [Field 1][ "Field 2" ][Field 3]
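    As a side note (not part of the original posts): Python's csv module, used here purely as a third point of comparison, applies the same rule and happens to agree with Excel on all three records, including the leading-space behaviour.

```python
import csv
import io

# The three records from the example above, exactly as given.
data = ('Field 1, "Field 2",Field 3\n'
        'Field 1,"Field 2" ,Field 3\n'
        'Field 1, "Field 2" ,Field 3\n')

# A double quote only opens a quoted field when it is the very first
# character of the field; otherwise it is kept as a literal character.
rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)
```

    Rows 1 and 3 keep the quotes literally because of the leading space; row 2 unquotes the field but keeps the trailing space, matching Excel's `[Field 2 ]`.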



    • Thanks Dale.... Hard to see the CRLF with all those CRLF in there!

      Just for fun... When I open this in Excel, it looks like there are 92 columns. Interestingly, the email address always seems to show up in column 60!

      So if we could establish that these are X number of columns with $CRLF row delimiters, then SQL Server import will do this for you.
      Now, you may not know how to do a SQL Server import - BUT if it works, it proves this data (these data??) can be reliably parsed.



      • Pierre, you're absolutely right; I was ignoring rule 5... I guess I was influenced by what Microsoft Excel allows. I suppose my program could either restrict its use to rigorous adherence to RFC4180 and give the user an error message, or simply conform to what Excel will accept. Anyway, I found a couple of glitches in the myParse function, and the corrected code follows:

        Code:
        function myParse(byval s as string, n as long) as string
           local i, j, start, jSave, done, fieldNo as long
           local sField, si, sii as string
           done = 0: fieldNo = 0: j = 0
           do until done
              i = j + 1
              start = i: jSave = 0
              si = mid$(s, i, 1)
              sii = mid$(s, i, 2)
              if si = $dq and sii <> $dq2 then 'field begins with $DQ... search for next $DQ 
                 do 
                    j = instr(i+1, s, $dq) 'next $DQ
                    if j and mid$(s, j, 2) <> $dq2 then 'look for next comma
                       jSave = j 'end of $DQ span                                        
                       j = instr(jSave, s, ",") 'end of field
                       if j = 0 then j = len(s) + 1      
                       incr fieldNo: if fieldNo = n then done = 1
                    elseif j = 0 then 'end of string
                       incr fieldNo: if fieldNo = n then done = 1: j = len(s) + 1: exit DO
                    elseif mid$(s, j, 2) = $dq2 then 'it is $DQ2... keep looking for $DQ to end field 
                       s = clip$(mid s, j, 1) 'change $DQ2 to $DQ by clipping out the first $DQ
                       if j = len(s) then
                          incr fieldNo: if fieldNo = n then done = 1
                          j = len(s) + 1: exit: exit
                       else
                          i = j
                       end if
                    end if
                 loop until jSave                      
              elseif sii = $DQ2 then 'ignore beginning $DQ2
                 s = clip$(mid s, i, 2)
              else 'look for next comma to end the field
                 j = instr(i, s, ",")
                 if j = 0 then j = len(s) + 1: done = -1
                 incr fieldNo: if fieldNo = n then done = 1
              end if
           loop
           if done > 0 then
              sField = mid$(s, start, j - start)
              jSave = jSave - start + 1
              if mid$(sField, 1, 1) = $dq then 'remove $DQ's marking $DQ span
                 if jSave then sField = clip$(mid sField, jSave, 1)
                 sField = clip$(left sField, 1)
              end if                      
           end if 
           function = sField  
        end function
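        For readers who want the gist of the record-splitting logic in a more compact form, here is an illustrative Python sketch of the same RFC 4180 rules (5, 6 and 7): a quote opens a field only at its first character, commas inside a quoted span do not split the field, and a doubled quote collapses to a single quote. `parse_record` is a hypothetical name, not code from this thread, and it is a sketch of the rules rather than a port of myParse.

```python
def parse_record(s):
    """Split one CSV record into fields per RFC 4180 rules 5-7."""
    fields, field, in_quotes, i = [], [], False, 0
    while i < len(s):
        c = s[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(s) and s[i + 1] == '"':
                    field.append('"')   # rule 7: "" inside quotes -> one "
                    i += 1
                else:
                    in_quotes = False   # closing quote
            else:
                field.append(c)         # commas/CRLF here are literal (rule 6)
        elif c == '"' and not field:
            in_quotes = True            # rule 5: quote opens only at field start
        elif c == ',':
            fields.append(''.join(field))
            field = []
        else:
            field.append(c)
        i += 1
    fields.append(''.join(field))
    return fields

fields = parse_record('aaa,"b""bb","ccc"')
# fields == ['aaa', 'b"bb', 'ccc']
```

        The "leading space disables quoting" behaviour discussed above falls out of the `not field` test: once any character has been seen, a quote is just another character.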



          • When reading a non-compliant RFC4180 CSV file,
            Excel will probably try to get the data the best it can,
            and this is quite OK as far as I'm concerned; I think it's good practice.
            But of course it has limits; past a certain point, only garbage can be expected.

            From my point of view, if I remember well, in all the tests I've done,
            when a CSV file is valid, I got the right data in Excel.
            With a few exceptions: for example, a field in double quotes containing a number
            is treated as a text cell by Excel, while the same number without double quotes becomes a number cell.
            But it's more a goodie than anything else, and it all remains RFC4180 compliant.

            If a CSV file is invalid, Excel may "repair" it more or less, and again, that's fine.
            Still, on my part, my only goal is to read a valid RFC4180 file the correct way.
            If the file is invalid, I have two options:
            a) Get a new valid file, if possible.
            b) Check the file in an editor to see what can be done to go forward.



          • Here is a slightly better formulation that doesn't involve the need for $DQ2 tokens. Just read the data from Notepad or any other text editor and use the myParse function unchanged.

            Code:
            FUNCTION PBMAIN()                                              
               local i, iMax, hFile as long
               local s, sFields as string
            
               iMax = 1000
               dim a(1 to iMax) as byte
               'read in a csv record from notepad or any other text editor
               hFile = FREEFILE                          
               OPEN "notepad0.csv" FOR BINARY AS hFile
               do 
                  incr i
                  GET hFile,, a(i)
                  if a(i) = asc($cr) then exit do
                  s = s + chr$(a(i))
               loop until i = iMax
               i = 1
               do 
                  sFields = sFields + myParse(s, i) + $crlf: incr i
               loop until len(myParse(s, i)) = 0
               msgbox sFields
            END FUNCTION



            • Here's a class I wrote a few years back: http://cur-ion.net/delimited.html. Download links are at the bottom right of the page.
              LarryC
              Website
              Sometimes life's a dream, sometimes it's a scream



              • Since arrays aren't necessary, I updated post #145 here. I have also included an outer loop to read multiple CSV strings created with Notepad. I actually just wanted to edit post #145 to reflect this change, but the editing process for this forum is confusing to me... sometimes my posts have an edit icon and other times not. I looked in the community forum's help but didn't find anything on editing there.

                This code assumes the user enters string data into Notepad that is consistent with RFC 4180, especially rules 5 and 7. Perhaps later I can add code to help adhere to those rules.

                Code:
                FUNCTION PBMAIN()                                              
                   local i, limit, hFile as long
                   local s, sFields as string
                   local b as byte
                
                   limit = 1000                                                      
                   'read in some csv records from a notepad created file (or any other text editor)
                   hFile = FREEFILE                          
                   OPEN "notepad1.csv" FOR BINARY AS hFile  
                   do
                      s = "": i = 0                                                  
                      do
                         incr i
                         GET hFile,, b
                         if b = asc($cr) then exit do
                         s = s + chr$(b)
                      loop until i >= limit
                      i = 1: sFields = ""
                      do
                         sFields = sFields + myParse(s, i) + $crlf: incr i
                      loop until len(myParse(s, i)) = 0
                      msgbox sFields
                   loop until eof(hFile)
                END FUNCTION
                Last edited by Charles Dietz; 12 Aug 2017, 03:11 PM.



                • Charles
                  RFC4180 2.6: "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
                  This compliant record fails with the changes of post #145 and post #147; actually, the program gets into an endless loop:

                  Code:
                  1996,Jeep,Grand Cherokee,"MUST SELL!
                  air, moon roof, loaded","""4799.00"""
                  There is a CRLF embedded in the 4th field, after the "!".

                  Also, the last field does not come out OK: it should be ["4799.00"], but with myParse I get [4799.00"] (leading DQ missing).
                  Last edited by Rod Macia; 12 Aug 2017, 10:47 PM.
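                  Editorial aside (Python used for illustration only, not from the thread): this record is exactly why scanning for the first CR to delimit records, as the byte loop in posts #145/#147 does, cannot work; the reader has to track quote state across physical lines. A quote-aware parser sees one record of five fields:

```python
import csv
import io

# Rod's record: the 4th field contains an embedded CRLF,
# the 5th field contains doubled quotes.
data = ('1996,Jeep,Grand Cherokee,"MUST SELL!\r\n'
        'air, moon roof, loaded","""4799.00"""\r\n')

# The CRLF inside the quoted field does not terminate the record,
# so this parses as ONE record with five fields.
rows = list(csv.reader(io.StringIO(data)))
```

                  The 4th field spans two physical lines, and the 5th comes out as "4799.00" with its quotes kept, matching the expected ["4799.00"].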



                  • Pierre,

                    Using your code, if you remove the first line (titles, 92 columns) from Kerry's "confangus.csv", then since the next line only has 86 fields, your listview will only use 86 columns instead of the 92 needed.

                    Looks like you are parsing rows of varying column count correctly, but assigning listview columns based on the first row's count.

                    PS: Outlook's export does not respect RFC4180 2.4, "Each line should contain the same number of fields throughout the file," so the resulting Outlook contacts CSV contains varying column counts, from 86 to 92.



                    • Rod,
                      As stated before, I try to respect RFC4180 rules.

                      Rule number 4 is pretty clear:
                      Within the header and each record, there may be one or more fields, separated by commas.
                      Each line should contain the same number of fields throughout the file.


                      So, yes, I use the first line to count the fields. As stated before, I tried to keep
                      the code as simple as I can, so it will be easy to read for most of us.
                      I have other code that counts the maximum number of fields over all records,
                      the maximum character count in any field, etc... But I do not use it here by choice.

                      About ConfAngus.csv: true, it is not compliant.
                      That said, if you take it as is, my code will read it from start to end with no glitch.
                      All 86- and 92-column records will fall into place.
                      Excel will do the same. I guess it should be good, because that is what Kerry wants.

                      An interesting thing to know would be: can Outlook create a CSV whose first record has fewer fields than the other records?

                      Since this file is hard to read with only "X" characters,
                      I made a clone that uses "A" to "Z" to compare fields between applications.
                      Just rename it to ".csv".
                      Attached Files
                      Last edited by Pierre Bellisle; 12 Aug 2017, 08:51 PM.



                      • All 86 and 92 columns record will fall in place.
                        Excel will do the same
                        But Excel will also display the file without a header, with the proper number of columns, even if the first line has fewer columns.

                        Pierre,
                        I know your code as is will do the job required by Kerry (so does mine -- shameless plug).
                        I just pointed out the missing columns in the listview (under specific conditions) in case others might not notice. With a large number of columns and rows it can be hard to spot.


                        Now that I re-read RFC4180, Outlook should be considered compliant regarding the varying number of fields.

                        Rule number 3 and 4:
                        3. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file
                        4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

                        Should ==> not must
                        So it's highly recommended, but not an obligation. No wonder everybody is confused about how to properly parse CSVs.
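                        Editorial aside (illustrative Python, not code from the thread): since "should" leaves ragged files like the 86-to-92-column Outlook export in the wild, one pragmatic reading strategy, which Pierre alluded to with his field-count-over-all-records code, is to size the column set from the widest record rather than the first one. `load_padded` is a hypothetical helper name.

```python
import csv
import io

def load_padded(text):
    """Read CSV text and pad every row to the widest row's width.

    Sketch for ragged exports (e.g. contact files with 86 to 92
    fields per record); missing trailing fields become empty strings.
    """
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(r) for r in rows), default=0)
    return [r + [''] * (width - len(r)) for r in rows]

rows = load_padded('a,b\r\nc,d,e,f\r\n')
# both rows now have 4 fields; row 1 is padded with two empty strings
```

                        With this approach a listview sized from `width` never loses columns, whichever record happens to come first.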



                        • >But Excel will also display the file without header,
                          >with the proper amount of columns even if the first line has less columns.

                          As I said before:
                          >When reading a not compliant RFC4180 CSV file
                          >Excel will probably try to get data the best it can
                          >and this is quite OK as far as I'm concerned, I think it's good practice.
                          Also, not only the first record, but many have either 86 and 92 columns.

                          >No wonder everybody is confused...
                          I'm not. As I see it, the form "should be" is used because of this preceding statement...
                          >Surprisingly, while this format is very common, it has never been formally documented.
                          So you have to consider this informal context.
                          If you dismiss all the non-imperative statements in RFC4180, there is nothing left. Simple.

                          No offense, Rod, but I see that I'm now going in circles with this conversation,
                          repeating what's already been said. I'm afraid it's not really my cup of tea. Sorry. ;-)



                          • Pierre
                            It looks like I'm the one that offended you; if that is the case, I apologize.
                            Did you notice the smileys? I was being sarcastic in a non-confrontational way.

                            When I said everybody, I meant the internet, not you specifically. I've googled quite a bit about CSV over the past few days, and I found no real consensus. The closest thing was RFC4180. Most appear to adapt it to their specific needs. People appear to pick and choose which rules they comply with, by choice, by misunderstanding, or by need.

                            My point is: at what point of picking and choosing are you still considered mostly compliant? (This is a rhetorical question; it's not aimed at anyone.)
                            I guess that compliance is much more important when creating the CSV. When reading it, if you don't have access to the original creator, you need to adapt.



                            • Originally posted by Pierre Bellisle View Post
                              Hey Knuth,

                              In BASIC, "123" is a string and 123 is a number.
                              Those rules are not CSV rules.
                              Double-quotes are not used to differentiate between number and string.
                              Double-quotes are used to embed a separator (the comma),
                              a CRLF, or another double-quote (expressed as dual double-quotes) as part of a field,
                              so they are not confused with a line break or a field separator.

                              Reading in a CSV
                              "123",123,"ABC",ABC
                              is the same as reading
                              123,"123",ABC,"ABC"
                              as
                              123,123,ABC,ABC
                              as
                              "123","123","ABC","ABC"
                              Not enclosing string data in double quotes in CSV is like using VARIANT for every variable in BASIC; yes, it might work most of the time, but when it fails, it fails spectacularly. Most CSV importers/parsers will choke on lots of non-ASCII data in string columns if the data is not enclosed in quotes. This goes way beyond commas or double quotes. As a starter, I herewith lend you a pair of umlauts: äöüÄÖÜ. Try those in a character column without enclosing them in double quotes.
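                              Editorial aside (Python for illustration, not from the thread): Pierre's point that quotes normally carry no type information is easy to verify, and it is also true that a parser can opt into a quoting-as-typing convention; Python's csv module offers QUOTE_NONNUMERIC, which converts unquoted fields to numbers and leaves quoted fields as strings.

```python
import csv
import io

# Default reading: quotes carry no type information at all;
# "123" and 123 read back as the same string.
plain = list(csv.reader(io.StringIO('"123",123\n')))

# QUOTE_NONNUMERIC is an explicit opt-in convention layered on top
# of CSV: unquoted fields become floats, quoted fields stay strings.
typed = list(csv.reader(io.StringIO('"123",123\n'),
                        quoting=csv.QUOTE_NONNUMERIC))
```

                              So the type distinction exists only when both producer and consumer agree on that convention, which plain RFC 4180 does not mandate.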



                              • Originally posted by Rod Macia View Post

                                ...

                                Now that I re-read RFC4180, Outlook should be considered compliant regarding the varying number of fields.

                                ...

                                Should ==> not must
                                So it's highly recommended, but not an obligation. No wonder everybody is confused about how to properly parse CSVs.
                                It's not even highly recommended. The issue of "compliance" or "obligation" is moot.

                                A couple of points about RFC 4180:

                                1. It is an "Informational" RFC. It is not on the "Standards Track",
                                "Status of This Memo
                                This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited."
                                2. Section 2: "there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files. This section documents the format that seems to be followed by most implementations:" (emphasis added). The best that can be said of a "well-formed" CSV file is that it either conforms to a defined set of rules or it conforms to the recommendations of Section 2 of the RFC.

                                --
                                [URL="http://www.camcopng.com"]CAMCo - Applications Development & ICT Consultancy[/URL][URL="http://www.hostingpng.com"]
                                PNG Domain Hosting[/URL]



                                • Originally posted by Knuth Konrad View Post

                                  Not enclosing string data in double quotes in CSV is like using VARIANT for every variable in BASIC; yes, it might work most of the time, but when it fails, it fails spectacularly. Most CSV importers/parsers will choke on lots of non-ASCII data in string columns if the data is not enclosed in quotes. This goes way beyond commas or double quotes. As a starter, I herewith lend you a pair of umlauts: äöüÄÖÜ. Try those in a character column without enclosing them in double quotes.
                                  I disagree (surprise, surprise!). There should be no need to quote ANY string unless it contains a comma or a quote, regardless of whether the encoding is ASCII, extended ASCII, UTF-8/UTF-16/UTF-32 Unicode, or any encoding that the parsing program can interpret.
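                                  Editorial aside (Python for illustration, not from the thread): Stuart's rule is exactly what "minimal quoting" does in practice. For example, csv.QUOTE_MINIMAL quotes a field only when it contains the delimiter, a quote character, or a line break; non-ASCII characters such as umlauts never trigger quoting on their own.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')

# Only the fields containing a comma or a quote get quoted;
# the umlauts and the plain field pass through unquoted.
writer.writerow(['äöüÄÖÜ', 'has,comma', 'has"quote', 'plain'])
line = buf.getvalue()
# line == 'äöüÄÖÜ,"has,comma","has""quote",plain\r\n'
```

                                  Whether a downstream importer copes with unquoted non-ASCII data is, as the exchange above shows, a separate question from what the format itself requires.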



                                  --
                                  [URL="http://www.camcopng.com"]CAMCo - Applications Development & ICT Consultancy[/URL][URL="http://www.hostingpng.com"]
                                  PNG Domain Hosting[/URL]



                                  • Originally posted by Stuart McLachlan View Post

                                    I disagree (surprise, surprise!). There should be no need to quote ANY string unless it contains a comma or a quote, regardless of whether the encoding is ASCII, extended ASCII, UTF-8/UTF-16/UTF-32 Unicode, or any encoding that the parsing program can interpret.
                                    I agree with you (in theory).

                                    However, MS seems not to agree with us here. The fact that I mentioned umlauts in the first place stems from my experience trying to import unquoted CSVs into MS SQL Server through - I don't remember exactly - either ODBC or ADO. The import bombed until I quoted the string columns that may contain special characters such as umlauts or "ß". Granted, that dates back a couple of years. But that's when I decided that if I have to deal with CSVs, I always quote data in string columns. And I've never looked back.

                                    Unrelated: what makes me both chuckle and tear my hair out in despair is that in the time this whole thread has been running, producing failing parsing routines, one could have written an XML exporter for Outlook, either as a standalone executable or as an Outlook (VBA) macro, and been done with it once and for all.



                                    • I have about run out of steam trying to help Kerry. I have tried to conform to RFC4180 as best I know how, and I want to thank Pierre and Rod for pointing out the errors I made conforming to RFC4180. I have attached the source code, the executable file, and a Notepad data file for testing a PowerBASIC implementation for parsing CSV files.
                                      Attached Files
                                      Last edited by Charles Dietz; 14 Aug 2017, 06:24 PM.



                                      • Thanks for all your effort, Charles.

                                        I have been trying to implement your suggestions.

                                        I will check what you have just posted.

                                        I like to think that this discussion has been useful beyond my question.

                                        Thanks again.

                                        Kerry
                                        [I]I made a coding error once - but fortunately I fixed it before anyone noticed[/I]
                                        Kerry Farmer



                                        • Hey Charles,
                                          I did a quick test, not a glitch. Great job... :-)

