How to remove selected quotes from a .CSV file


  • #41
    Originally posted by Michael Mattias View Post

    IMO you are thinking like a government contractor. You are allowing perfection to be the enemy of the usable. This is sometimes called "paralysis by analysis."

    You are also assuming the ONLY error in a CSV file created by multiple operators is doubled quotes. I'd bet that there are numerous data entry errors in that 7+ million records.
    Your opinion is incorrect. It's got nothing to do with "perfection". I've spent years working with a similarly sized electoral roll containing all sorts of strange characters and bad data, and there is no evidence here to suggest any such thing.

    I'm stating that there are no identified errors in the data and there is no evidence to suggest that it cannot be correctly parsed into data fields using CHR$($DQ, ",", $DQ) as the PARSE function delimiter.
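    A minimal sketch of that approach (the file name is hypothetical, and it assumes every field is quoted and no field contains an embedded quote-comma-quote sequence):

    Code:
    LOCAL strRecord AS STRING
    LOCAL strFields() AS STRING
    OPEN "voters.csv" FOR INPUT AS #1
    WHILE ISFALSE EOF(1)
        LINE INPUT #1, strRecord
        ' drop the single quote at each end of the line
        strRecord = MID$(strRecord, 2, LEN(strRecord) - 2)
        ' split on quote-comma-quote so embedded quotes survive intact
        REDIM strFields(PARSECOUNT(strRecord, $DQ & "," & $DQ) - 1)
        PARSE strRecord, strFields(), $DQ & "," & $DQ
        ' ... process strFields() here ...
    WEND
    CLOSE #1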




    • #42
      I did a job for a business that was seriously impacted by "the e-world changes" - absolute potential to kill the business. Analyzing their history vs present and modeling projections was crucial - and dang it, it really needed to be "right".

      Simple, eh? Here's the structure, here's the fields, here's the codes....

      One year of order lines was 110 million. And boy, you are really spot on about 'errors.' Each field had expected values, and on the first passes some 25% of the data was "bad".
      Frankly, the real question in my mind was: how in the world do you operate a business with that kind of "records"?

      The suggestion about "seeking perfection" is valid, under some circumstances. The point is: is it important to create code to handle it, or is it more important to 'get the job done'?
      For one-off smallish things it's usually much more effective to identify and manually correct "bad records".

      25% of 110 million....? Nah, not going there. Multiple years, multiple model projections....
      Eventually we managed to winnow down and 'auto correct' many errors, but in the end had to throw away about 5% of the data as "WTF, over?" - typically coded fields where virtually no one had any idea what that specific errant code meant, or how the system allowed mystery codes to get in there.
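      A minimal sketch of that kind of first-pass validation (the field position and expected codes are made up for illustration):

      Code:
      LOCAL strCode AS STRING
      LOCAL lBad AS LONG
      ' ... inside the record loop, after parsing strFields():
      strCode = TRIM$(strFields(5))     ' hypothetical coded field
      SELECT CASE strCode
          CASE "A", "B", "C"            ' expected values
              ' field is good
          CASE ELSE
              INCR lBad                 ' flag the record for review
      END SELECT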



      • #43
        Originally posted by Stuart McLachlan View Post

        That's why both Post #11 and Post #15 have:

        Code:
        strRecord = TRIM$(strRecord,$DQ) ' get rid of first and last quote on line
        PARSE strRecord,strFields(), $DQ & "," & $DQ
        Overlooked that, sorry. Is TRIM$() the best way to do it, e.g. if the last field ends with an embedded $DQ? I'd probably use MID$(strRecord, 2, LEN(strRecord) - 2)
        Dan



        • #44
          Dan, the idea is to not remove the embedded DQ(s), just the leading one (at the start of the 1st field) and the ending one (at the end of the last field) when the parse is on quote-comma-quote.

          (Stuart, I read the posts, just don't fully agree with parsing that way. )
          Dale



          • #45
            Originally posted by Dan Soper View Post

            Overlooked that, sorry. Is TRIM$() the best way to do it, e.g. if the last field ends with an embedded $DQ? I'd probably use MID$(strRecord, 2, LEN(strRecord) - 2)
            Yep, that's a safer way to do it in case the first field starts with or the last field ends with an embedded quote.
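            To see the difference, a small sketch (the sample record is made up):

            Code:
            ' a record whose last field ends with an embedded quote:
            ' "Tim","Taylor""   (the last field is:  Taylor")
            strRecord = $DQ & "Tim" & $DQ & "," & $DQ & "Taylor" & $DQ & $DQ & $DQ
            ' TRIM$ strips ALL leading and trailing quotes, losing the embedded one
            PRINT TRIM$(strRecord, $DQ)                   ' Tim","Taylor
            ' MID$ removes exactly one character from each end
            PRINT MID$(strRecord, 2, LEN(strRecord) - 2)  ' Tim","Taylor"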



            • #46
              Originally posted by Dale Yarker View Post
              (Stuart, I read the posts, just don't fully agree with parsing that way. )
              Why not?



              • #47
                Closer to the proposal in RFC 4180 regarding double quotes (even though everyone does what they want). And "not invented here" (because I figured out a way to implement it maybe 15 years ago!). (And I'm entitled to do it the way I want too, and I only said so once.) Cheers,
                Dale
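                Dale's implementation isn't shown here, but one common way to handle RFC 4180 style doubled quotes ("" inside a quoted field standing for a literal ") is to protect them with a placeholder that cannot occur in the data, parse, then restore. A sketch, reusing strRecord/strFields() from the read loop above:

                Code:
                LOCAL i AS LONG
                ' hide each "" pair behind a character that cannot occur in the data
                REPLACE $DQ & $DQ WITH CHR$(0) IN strRecord
                ' drop the outer quotes, then split on quote-comma-quote
                strRecord = MID$(strRecord, 2, LEN(strRecord) - 2)
                REDIM strFields(PARSECOUNT(strRecord, $DQ & "," & $DQ) - 1)
                PARSE strRecord, strFields(), $DQ & "," & $DQ
                ' restore each hidden pair as a single literal quote
                FOR i = 0 TO UBOUND(strFields())
                    REPLACE CHR$(0) WITH $DQ IN strFields(i)
                NEXT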



                • #48
                  The point is: is it important to create code to handle it, or is it more important to 'get the job done'?
                  Exactly!

                  This kind of ties in to my earlier musings about the difference between "tools and utilities" programmers and "applications" programmers. "Tools and utilities" guys* believe their software will be used often, and forever. Applications guys* can more easily handle thinking in terms of a specific job+.

                  MCM
                  * The term "guys" may be considered by some to be sexist. Tough.
                  + I am an 'applications' guy so I may be biased.
                  Michael Mattias
                  Tal Systems Inc. (retired)
                  Racine WI USA
                  [email protected]
                  http://www.talsystems.com



                  • #49
                    Stuart:
                    Yes, I DID misunderstand the problem. I realize now the data is valid and that I was using the wrong "tool/technique" to try to extract the data. Thank you for the method in post #11. I'm working with this voter data set for my own "personal enjoyment." I won't be mail bombing anybody!

                    Manuel:
                    I did realize that each record contained 48 fields by using LINE INPUT # and then using TALLY to count the quote marks. The records that gave me trouble had more than 96 quotes (a clean 48-field record has exactly 96). That's how I located the 6 records that I couldn't read correctly with INPUT.
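                    For reference, a sketch of that diagnostic pass (the file name is hypothetical; PRINT assumes PB/CC):

                    Code:
                    LOCAL strRecord AS STRING
                    LOCAL lLine AS LONG
                    OPEN "voters.csv" FOR INPUT AS #1
                    WHILE ISFALSE EOF(1)
                        LINE INPUT #1, strRecord
                        INCR lLine
                        ' a clean 48-field record has exactly 96 quotes
                        IF TALLY(strRecord, $DQ) <> 96 THEN
                            PRINT "Check line"; lLine; " quotes:"; TALLY(strRecord, $DQ)
                        END IF
                    WEND
                    CLOSE #1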

                    James:
                    When I first started playing with voter record data sets they came as RANDOM access files - up until the last 2 times. [Rod identified the web location for these files in post #8] I used those datasets to become familiar with working with random access files. I have programs that are written to read RANDOM access files I make with selected fields from the main dataset. What I have been doing with the .CSV files is read them, then write them out into random access files so they will work with my old programs. NOT VERY EFFICIENT, I know.
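                    A minimal sketch of that conversion step (the TYPE layout and field positions are made up; the real records have 48 fields):

                    Code:
                    TYPE VoterRec
                        LastName  AS STRING * 30
                        FirstName AS STRING * 20
                    END TYPE

                    LOCAL v AS VoterRec
                    LOCAL lRec AS LONG
                    OPEN "voters.dat" FOR RANDOM AS #2 LEN = SIZEOF(v)
                    ' ... inside the read loop, after parsing strFields():
                    v.LastName  = strFields(0)
                    v.FirstName = strFields(1)
                    INCR lRec
                    PUT #2, lRec, v
                    ' ... after the loop:
                    CLOSE #2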

                    MCM:
                    I had considered "hand fixing" the dataset because there are only 6 records to deal with (2 records in the previous set). However, I'm thinking the trend is that more data items will appear in the future that are quoted within a field. So I'm learning how to handle this situation now. I may even get inspired to change my voter registration FIRST_NAME to "Tim, "The Tool Man"".

                    I thought my original post would result in a simple answer. I was surprised at the amount of discussion it generated. I appreciate all who took time to post. I learned a lot from the discussion along with a solution to my original question. I'm still studying the responses and learning along the way.

                    Tim



                    • #50
                      Originally posted by Stuart McLachlan View Post
                      Originally posted by Dan Soper
                      Overlooked that, sorry. Is TRIM$() the best way to do it, e.g. if the last field ends with an embedded $DQ? I'd probably use MID$(strRecord, 2, LEN(strRecord) - 2)

                      Yep, that's a safer way to do it in case the first field starts with or the last field ends with an embedded quote.
                      If you are using PBWin10 or PBCC6, you can use the new function UNWRAP$:

                      Code:
                      strRecord = UNWRAP$(strRecord, $DQ, $DQ)
                      Purpose: Remove paired characters from the beginning and end of a string.
                      Syntax: s$ = UNWRAP$(StringExpression, LeftChar$, RightChar$)
                      Remarks: The UNWRAP$ function removes the characters in LeftChar$ from the beginning of StringExpression, if there is an exact match. It then removes the characters in RightChar$ from the end, if there is an exact match. The remaining characters are then returned. For example, UNWRAP$("<MyWord>", "<", ">") returns "MyWord". UNWRAP$ is particularly useful for removing parentheses, quotes, brackets, etc. from a text item.
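                      Unlike TRIM$, UNWRAP$ removes at most one quote from each end, so it also handles the embedded-quote edge case discussed above (the sample record is made up):

                      Code:
                      ' "Tim","Taylor""   (the last field is:  Taylor")
                      strRecord = $DQ & "Tim" & $DQ & "," & $DQ & "Taylor" & $DQ & $DQ & $DQ
                      PRINT UNWRAP$(strRecord, $DQ, $DQ)   ' Tim","Taylor"
                      PRINT TRIM$(strRecord, $DQ)          ' Tim","Taylor  (embedded quote lost)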



                      • #51
                        I thought my original post would result in a simple answer. I was surprised at the amount of discussion it generated.
                        I'm sure "CSV file parsing" is in the "All Time Top Five" (at least!) subjects in terms of "posts per thread" here.

                        It's kind of like the "correct placement of the full stop" threads in COBOL newsgroups. ( full stop = period ==> "." )
                        Michael Mattias
                        Tal Systems Inc. (retired)
                        Racine WI USA
                        [email protected]
                        http://www.talsystems.com

