Announcement

Collapse
No announcement yet.

How to remove selected quotes from a .CSV file

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #21
    Are you going to rewrite the program. and convert the file.
    just want clean up the file.

    if you can send the data to a new file and filter it with Remove$,
    x$ = REMOVE$(MainString, [ANY] MatchString)
    in random access file you use a type files containing strings with fixed length.

    Comment


    • #22
      Originally posted by Jim Robinson View Post
      Hmmm, I've never run into that character before. I guess not too many people are writing to paper tape any more.
      Doh. I meant "nothing ABOVE CHR$(127) - I've edited my previous post

      Comment


      • #23
        Originally posted by James C Morgan View Post
        Are you going to rewrite the program. and convert the file.
        just want clean up the file.

        if you can send the data to a new file and filter it with Remove$,
        x$ = REMOVE$(MainString, [ANY] MatchString)
        in random access file you use a type files containing strings with fixed length.
        It doesn't need "cleaning" .
        You should not be removing anything from the data!
        Using fixed length strings will make the file MUCH larger.

        Comment


        • #24
          Stuart,
          Doh. I meant "nothing ABOVE CHR$(127) - I've edited my previous post
          Ah, that makes more sense. And we do deal with characters above x'127'. We don't want to mess up names like Noël, José, Nuñez, or Agnès to name a few.
          ... .... . ... . . ... ... .... . .. ... .... .. ... .... .. .... ..

          n6jah @ yahoo.com

          Comment


          • #25
            Originally posted by Jim Robinson View Post
            Stuart,
            Ah, that makes more sense. And we do deal with characters above x'127'. We don't want to mess up names like Noël, José, Nuñez, or Agnès to name a few.
            So "Almost all the files we get are ASCII format." is incorrect?

            Comment


            • #26
              Sturat,
              So "Almost all the files we get are ASCII format." is incorrect?
              Well, technically it is incorrect. I should have said:
              Almost all the files we get are in the Extended ASCII format.
              ... .... . ... . . ... ... .... . .. ... .... .. ... .... .. .... ..

              n6jah @ yahoo.com

              Comment


              • #27
                Originally posted by Jim Robinson View Post
                Sturat,

                Well, technically it is incorrect. I should have said:

                Nothing personaI!

                I will unashamedly keep beating the ASCII/ANSII and Unicode / UTF-8 / UTF-16 drum as long as I see the terms being used inaccurately. I've seen to much confusion resulting from their misuse.

                Comment


                • #28
                  There are 7,894,412 records in the file. I have discovered that 6 of the records are corrupt. By corrupt I mean that instead of 48 records for those 6 individuals there are 49. The reason for this is that some clerks recorded address information enclosed in quotes.
                  First thing, you thank your lucky stars you only have six bad records out of seven million. In my experience that would be one of the "cleanest" 'miltiple-operator-entered' CSV files ever given me to process. Second, I'd folllow the advice suggested several times above... read the entire line and spit out any with more the 48 quotes and then manually add those back without the extraneous quotation marks. Trying to work with or fix these records "in line, on the fly" is a nice dream, but that's all it is, a dream. MCM
                  Michael Mattias
                  Tal Systems (retired)
                  Port Washington WI USA
                  [email protected]
                  http://www.talsystems.com

                  Comment


                  • #29
                    Originally posted by Michael Mattias View Post
                    I'd folllow the advice suggested several times above... read the entire line and spit out any with more the 48 quotes and then manually add those back without the extraneous quotation marks. Trying to work with or fix these records "in line, on the fly" is a nice dream, but that's all it is, a dream. MCM
                    Au contraire. That is totally unnecessary and would be a major error. There are at least 96 quotes in every line

                    As has been pointed out more than once above, there are no "bad records" identified.
                    Parsing the file correctly using the appropriate delimiiter CHR$($DQ,",",$DQ) is all that is required.

                    Comment


                    • #30
                      actually a bit simpler.
                      PB PARSE$() has its own internal rules - which are used in the absence of a specified delimiter.
                      see the note in the code:

                      FUNCTION PBMAIN () AS LONG
                      OPEN "test.in.csv" FOR INPUT AS 1
                      OPEN "test.out.txt" FOR OUTPUT AS 2
                      FOR i&=1 TO 3
                      LINE INPUT #1, D$
                      PRINT "D$= ";D$
                      ct&=PARSECOUNT(D$, ",")
                      PRINT ct&
                      REDIM E$(1 TO ct&)
                      FOR x&=1 TO ct&
                      'E$(x&)=PARSE$(D$,x&) 'this does not work so well
                      E$(x&)=PARSE$(D$,",",x&) 'this produces the inner quote
                      PRINT x&,E$(x&)
                      PRINT#2, x&,E$(x&)
                      REPLACE CHR$(34) WITH "" IN E$(x&)
                      PRINT x&,E$(x&)
                      PRINT#2, x&,E$(x&)
                      NEXT x&
                      NEXT i&
                      WAITKEY$
                      END FUNCTION

                      produces (output lines are doubled, showing the effect of REPLACE)

                      1 FIRST
                      1 FIRST
                      2 LAST
                      2 LAST
                      3 ADDR2
                      3 ADDR2
                      4 ADDR1
                      4 ADDR1
                      5 CITY
                      5 CITY
                      6 ST
                      6 ST
                      7 ZIP
                      7 ZIP
                      1 "JAMES ""JIMBO"""
                      1 JAMES JIMBO
                      2 ROBINSON
                      2 ROBINSON
                      3
                      3
                      4 1234 MAIN ST
                      4 1234 MAIN ST
                      5 ANYTOWN
                      5 ANYTOWN
                      6 CA
                      6 CA
                      7 90942
                      7 90942
                      1 DAN
                      1 DAN
                      2 DRUFF
                      2 DRUFF
                      3 "DAN'S "REALLY GOOD" BARBER SHOP"
                      3 DAN'S REALLY GOOD BARBER SHOP
                      4 4321 BLONDE AVE
                      4 4321 BLONDE AVE
                      5 ANYTOWN
                      5 ANYTOWN
                      6 CA
                      6 CA
                      7 90942
                      7 90942

                      =======
                      so the question is, what do you need/want to do with the quoted bits, once you have a handle on them....

                      Comment


                      • #31
                        Originally posted by Tom Tallardy View Post
                        actually a bit simpler.
                        PB PARSE$() has its own internal rules - which are used in the absence of a specified delimiter.
                        see the note in the code:
                        ...
                        ct&=PARSECOUNT(D$, ",")
                        ...
                        'E$(x&)=PARSE$(D$,x&) 'this does not work so well
                        E$(x&)=PARSE$(D$,",",x&) 'this produces the inner quote
                        ...
                        REPLACE CHR$(34) WITH "" IN E$(x&)
                        ..
                        .
                        As has been pointed out several times, the OP should use CHR$($DQ,",",$DQ) as the delimiter. Trying to parse on a single comma will fail every time there is a comma embedded in a field. (Try changing your test date to "1234, MAIN STREET" ! )
                        Using the correct delimiter also completed resolves the issue of embedded quotes in fields.

                        Removing valid quotes in fields is not an appropriate solution. It is altering the supplied data for no justifiable reason.

                        FOR x&=1 TO ct&
                        ...
                        E$(x&)=PARSE$(D$,",",x&) 'this produces the inner quote
                        ...
                        NEXT x&
                        The PARSE instruction would replace that loop with one, much more efficient line.
                        Last edited by Stuart McLachlan; 2 Nov 2020, 08:47 PM.

                        Comment


                        • #32
                          Originally posted by Stuart McLachlan View Post
                          As has been pointed out several times, the OP should use CHR$($DQ,",",$DQ) as the delimiter. ...
                          That does leave an opening $DQ in the first field and a closing $DQ in the last field that will need to be removed.
                          Dan

                          Comment


                          • #33
                            Don't remove $DQs as asked for above. Add $DQs (probably manually when program gets confused.)
                            Code:
                            . . .  ,"4TH PLT "IMMORTALS"", . . .
                            should be
                            Code:
                            . . .  ,"4TH PLT ""IMMORTALS""", . . .
                            The format (even though the rules are loose) is Comma Separated Values, not Quote Comma Quote Separated Values. Don't expect the entry operator to double up on imbedded double quotes. The program that converted whatever to csv should have done it. You're just stuck with the result.

                            Think of $DQs as odd and even. Odd for DQ on, even for DQ off. A comma while DQ is on is imbedded and not a delimiter. A comma while DQ off is a delimiter. Two $DQs together (with nothing between) are (off/on) one imbedded $DQ. The third $DQ after IMMORTALS is DQ off for that field (so an imbedded comma would not have been a problem in the field).

                            Code:
                            . . . ,"1234", . . .
                            is text that contains only number characters (like part number).
                            Code:
                            . . . ,1234, . . .
                            is a number (like quantity).

                            (if you get a CSV from MS Excel it is a whole new world of weird. )

                            What I hinted at above means writing your own parser, versus PB's PARSE$ which works on really simple csv.

                            (all just IMO of course)

                            Cheers,
                            Dale

                            Comment


                            • #34
                              Originally posted by Dan Soper View Post
                              That does leave an opening $DQ in the first field and a closing $DQ in the last field that will need to be removed.
                              Yes it does, was addressed in post #11
                              of course its based on the CSV that I found which appears to match the OP's description of the file he has, but the OP has not supplied a link or a sample for us to properly analyze.
                              Code:
                              DIM strFields(1 TO 48) AS STRING
                              LINE INPUT #ff, strRecord                      ' Input complete line
                              strRecord = TRIM$(strRecord,$DQ)               ' get rid of first and last quote on line
                              PARSE strRecord,strFields(), CHR$($DQ,",",$DQ) ' strFields() now has all 48 records

                              Comment


                              • #35
                                anyone who has worked with .csv files knows - they have lots of variations.
                                data exported from proprietary systems via homespun report generators can be challenging, to say the least.

                                using a comma delimiter works except where you have _both_ embedded comma(s) _and_ embedded_ quotes.

                                using "," as a delimiter fails on many .csv because blank fields often show up as ",,"

                                "quoted strings" are not always "proper" quoted strings
                                Excel auto-quotes any cell with a comma when saved as .csv
                                (not sure what Excel does if the comma is used as system decimal place....)

                                for
                                D$= "1234, MAIN STREET","1235, MAIN STREET","1236, MAIN STREET"

                                for x&=1 to 3
                                PARSE$(D$,x&)
                                next
                                produces
                                1234, MAIN STREET
                                1235, MAIN STREET
                                1236, MAIN STREET

                                using
                                ","
                                as a delimited fails when blank fields are not double quoted - often the case

                                using PARSE$(D$,",",x&) fails when the quoted strings contain _both_ a comma _and_ embedded quotes

                                basically one first has to find out how many goofy situations are in the .csv file before one can "handle" it....
                                like one .csv I got with form feeds embedded every x lines....

                                Comment


                                • #36
                                  As has been pointed out more than once above, there are no "bad records" identified.
                                  Parsing the file correctly using the appropriate delimiiter CHR$($DQ,",",$DQ) is all that is required.
                                  IMO you are thinking like a government contractor. You are allowing perfection to be the enemy of the usable. This is sometimes called "paralysis by analysis."

                                  You are also assuming the ONLY error in a CSV file created by multiple operators is doubled quotes. As I pointed out above, with only six bad records out of seven million you are already luckier than the fox who fell asleep and woke up in the henhouse.

                                  Now that I am retired I should probably create that 'long read' I have thought about over the years.. the difference between "systems and utilities" programmers and "applications" programmers.

                                  MCM
                                  Michael Mattias
                                  Tal Systems (retired)
                                  Port Washington WI USA
                                  [email protected]
                                  http://www.talsystems.com

                                  Comment


                                  • #37
                                    Originally posted by Dan Soper View Post
                                    That does leave an opening $DQ in the first field and a closing $DQ in the last field that will need to be removed.
                                    That's why both Post #11 and #15 have.

                                    Code:
                                    strRecord = TRIM$(strRecord,$DQ) ' get rid of first and last quote on line
                                    PARSE strRecord,strFields(), $DQ & "," & $DQ

                                    Comment


                                    • #38
                                      Originally posted by Dale Yarker View Post
                                      Don't remove $DQs as asked for above. Add $DQs (probably manually when program gets confused.)
                                      ...
                                      The format (even though the rules are loose) is Comma Separated Values, not Quote Comma Quote Separated Values
                                      ...
                                      What I hinted at above means writing your own parser, versus PB's PARSE$ which works on really simple csv.
                                      .
                                      For the 5th (?) time! You don't need to do anything with the $DQs if you PARSE on CHR$($DQ,","$DQ) ( other than removing the leading and trailling one on each record0;

                                      The format of the data file in question is "quoted comma separated values"

                                      PARSE and PARSE$ work on whatever delimiter string you pass to them.
                                      Look at posts 11 and 15.

                                      (Doesn't anyone read other people's posts? )

                                      Comment


                                      • #39
                                        Originally posted by Rod Macia View Post

                                        Yes it does, was addressed in post #11
                                        of course its based on the CSV that I found which appears to match the OP's description of the file he has, but the OP has not supplied a link or a sample for us to properly analyze.
                                        Given that the the problem data shown by the OP matches your sample file exactly, the odds are pretty good

                                        Comment


                                        • #40
                                          (Doesn't anyone read other people's posts? )
                                          Yes! Repeating it 4 times inspired me.
                                          Dale

                                          Comment

                                          Working...
                                          X