How to remove selected quotes from a .CSV file


  • Michael Mattias
    replied
Originally posted by Tim Collins View Post
I thought my original post would result in a simple answer. I was surprised at the amount of discussion it generated.
    I'm sure "CSV file parsing" is in the "All Time Top Five" (at least!) subjects in terms of "posts per thread" here.

    It's kind of like the "correct placement of the full stop" threads in COBOL newsgroups. ( full stop = period ==> "." )



  • Rod Macia
    replied
    Originally posted by Stuart McLachlan View Post
    Originally posted by Dan Soper
    Overlooked that, sorry. Is TRIM$() the best way to do it, eg if the last field ends with an embedded $DQ? I'd probably use MID$(strRecord, 2, LEN(strRecord) - 2)

    Yep, that's a safer way to do it in case the first field starts with or the last field ends with an embedded quote.
If you are using PBWin10 or PBCC6, you can use the new function UNWRAP$:

Code:
strRecord = UNWRAP$(strRecord, $DQ, $DQ)
Purpose: Remove paired characters from the beginning and end of a string.
Syntax: s$ = UNWRAP$(StringExpression, LeftChar$, RightChar$)
Remarks: The UNWRAP$ function removes the characters in LeftChar$ from the beginning of StringExpression, if there is an exact match. It then removes the characters in RightChar$ from the end, if there is an exact match. The remaining characters are then returned. For example:
UNWRAP$("<MyWord>", "<", ">") returns "MyWord"
UNWRAP$ is particularly useful for removing parentheses, quotes, brackets, etc. from a text item.
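For context, a minimal sketch (variable names follow the snippets in posts #11 and #15) combining UNWRAP$ with the quote-comma-quote PARSE:

Code:
LINE INPUT #ff, strRecord                         ' read the raw line
strRecord = UNWRAP$(strRecord, $DQ, $DQ)          ' exactly one quote off each end
PARSE strRecord, strFields(), CHR$($DQ, ",", $DQ) ' split on quote-comma-quote

Unlike TRIM$, UNWRAP$ removes at most one matching quote from each end, so a record whose last field ends in a doubled (escaped) quote loses only the outer one.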



  • Tim Collins
    replied
Stuart:
    Yes, I DID misunderstand the problem. I realize now the data is valid and that I was using the wrong "tool/technique" to try and extract the data. Thank you for the method in post #11. I'm working with this voter data set for my own "personal enjoyment." I won't be mail bombing anybody!

    Manuel:
I did realize that each record contained 48 fields. By using LINE INPUT # and then TALLY to count the quote marks, I found that the records that gave me trouble had more than 96 of them. That's how I located the 6 records that I couldn't read correctly with INPUT.
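A minimal sketch of that check, assuming the 48-field layout and a hypothetical file name:

Code:
LOCAL ff AS LONG, nLine AS LONG, strRecord AS STRING
ff = FREEFILE
OPEN "voters.csv" FOR INPUT AS #ff
DO UNTIL EOF(#ff)
    LINE INPUT #ff, strRecord
    INCR nLine
    ' 48 quoted fields = exactly 96 quote marks; more means embedded quotes
    IF TALLY(strRecord, $DQ) <> 96 THEN
        PRINT "Suspect record at line"; nLine
    END IF
LOOP
CLOSE #ff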

    James:
When I first started playing with voter record data sets they came as RANDOM access files - up until the last 2 times. [Rod identified the web location for these files in post #8] I used those datasets to become familiar with working with random access files. I have programs written to read RANDOM access files that I make with selected fields from the main dataset. What I have been doing with the .CSV files is read them, then write them out into random access files so they will work with my old programs. NOT VERY EFFICIENT, I know.
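For reference, a minimal sketch of that CSV-to-RANDOM conversion step (the TYPE layout, field widths and file name here are hypothetical):

Code:
TYPE VoterRec
    sLastName  AS STRING * 30
    sFirstName AS STRING * 20
END TYPE
LOCAL rec AS VoterRec, recNo AS LONG, ffOut AS LONG
ffOut = FREEFILE
OPEN "voters.dat" FOR RANDOM AS #ffOut LEN = SIZEOF(rec)
rec.sLastName  = strFields(1)   ' fixed-length fields pad or truncate automatically
rec.sFirstName = strFields(2)
INCR recNo
PUT #ffOut, recNo, rec
CLOSE #ffOut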

    MCM:
I had considered "hand fixing" the dataset because there are only 6 records to deal with (2 records in the previous set). However, I'm thinking the trend is that more data items will appear in the future that are quoted within a field. So I'm learning how to handle this situation now. I may even get inspired to change my voter registration FIRST_NAME to "Tim, "The Tool Man"".

I thought my original post would result in a simple answer. I was surprised at the amount of discussion it generated. I appreciate all who took time to post. I learned a lot from the discussion along with a solution to my original question. I'm still studying the responses and learning along the way.

    Tim



  • Michael Mattias
    replied
Originally posted by Tom Tallardy View Post
the point is, is it important to create code to handle it, or is it more important to 'get the job done.'
    Exactly!

This is kind of tied in to my earlier musings about the difference between "tools and utilities" programmers and "applications" programmers. "Tools and Utilities" guys* believe their software will be used often, and forever. Applications guys* can more easily handle thinking in terms of a specific job+.

    MCM
    * The term "guys" may be considered by some to be sexist. Tough.
    + I am an 'applications' guy so I may be biased.



  • Dale Yarker
    replied
Closer to the proposal in RFC 4180 regarding double quotes (even though everyone does what they want). And "not invented here" (because I figured out a way to implement it maybe 15 years ago!). ((And I'm entitled to do it the way I want too, and only said so once. )) Cheers,



  • Stuart McLachlan
    replied
    Originally posted by Dale Yarker View Post
    (Stuart, I read the posts, just don't fully agree with parsing that way. )
    Why not?



  • Stuart McLachlan
    replied
    Originally posted by Dan Soper View Post

    Overlooked that, sorry. Is TRIM$() the best way to do it, eg if the last field ends with an embedded $DQ? I'd probably use MID$(strRecord, 2, LEN(strRecord) - 2)
    Yep, that's a safer way to do it in case the first field starts with or the last field ends with an embedded quote.



  • Dale Yarker
    replied
Dan, the idea is to not remove the imbedded DQ(s), just the leading (at start of 1st field) and ending (end of last field) ones when the parse is on quote comma quote.

    (Stuart, I read the posts, just don't fully agree with parsing that way. )



  • Dan Soper
    replied
    Originally posted by Stuart McLachlan View Post

That's why both posts #11 and #15 have this:

    Code:
strRecord = TRIM$(strRecord, $DQ)             ' get rid of first and last quote on line
PARSE strRecord, strFields(), $DQ & "," & $DQ
    Overlooked that, sorry. Is TRIM$() the best way to do it, eg if the last field ends with an embedded $DQ? I'd probably use MID$(strRecord, 2, LEN(strRecord) - 2)
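To illustrate the difference with a hypothetical record whose last field ends in a doubled quote: TRIM$ strips whole runs of quotes from both ends, while MID$ removes exactly one character from each end:

Code:
strRecord = $DQ & "A" & $DQ & "," & $DQ & "PLT " & $DQ & $DQ & "X" & $DQ & $DQ & $DQ
' strRecord is:  "A","PLT ""X"""
' TRIM$(strRecord, $DQ)                   ->  A","PLT ""X     (three quotes lost at the end)
' MID$(strRecord, 2, LEN(strRecord) - 2)  ->  A","PLT ""X""   (one quote off each end)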



  • Tom Tallardy
    replied
I did a job for a business that was seriously impacted by "the e-world changes" - absolute potential to kill the business. analyzing their history vs present and modeling projections was crucial - and dang it, really needed to be "right"

    simple, eh? here's the structure, here's the fields, here's the codes....

in one year, order lines totaled 110 million. and boy, you are really spot on about 'errors.' there were fields, each field had expected values, and on the first passes some 25% of the data was "bad"
    frankly the real question in my mind was how in the world do you operate a business with that kind of "records?"

the suggestion about "seeking perfection" is valid, under some circumstances. the point is, is it important to create code to handle it, or is it more important to 'get the job done.'
for one-off smallish things it's usually much more effective to identify and manually correct "bad records"

    25% of 110 million....? nah, not going there. multiple years, multiple model projections....
    eventually we managed to winnow down and 'auto correct' many errors, but in the end had to throw away about 5% of the data as "WTF, over?" - typically coded fields that virtually no one had any idea what that specific errant code meant, or how the system allowed mystery codes to get there.



  • Stuart McLachlan
    replied
    Originally posted by Michael Mattias View Post

    IMO you are thinking like a government contractor. You are allowing perfection to be the enemy of the usable. This is sometimes called "paralysis by analysis."

    You are also assuming the ONLY error in a CSV file created by multiple operators is doubled quotes.
Your opinion is incorrect. It's got nothing to do with "perfection".
    I'd bet that there are numerous data entry errors in that 7+ million records. I've spent years working with a similar sized electoral roll with all sorts of strange characters and bad data.
    There is no evidence to suggest any such thing.

I'm stating that there are no identified errors in the data and there is no evidence to suggest that it cannot be correctly parsed into data fields using CHR$($DQ, ",", $DQ) as the PARSE function delimiter.
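To make that concrete, a minimal sketch using the "4TH PLT ""IMMORTALS""" record quoted elsewhere in the thread (the surrounding field values are hypothetical):

Code:
DIM strFields(1 TO 3) AS STRING
LOCAL strRecord AS STRING
' build the record "123","4TH PLT ""IMMORTALS""","SMITH"
strRecord = $DQ & "123" & $DQ & "," & _
            $DQ & "4TH PLT " & $DQ & $DQ & "IMMORTALS" & $DQ & $DQ & $DQ & "," & _
            $DQ & "SMITH" & $DQ
strRecord = MID$(strRecord, 2, LEN(strRecord) - 2)   ' one quote off each end
PARSE strRecord, strFields(), CHR$($DQ, ",", $DQ)    ' strFields(2) = 4TH PLT ""IMMORTALS""
REPLACE $DQ & $DQ WITH $DQ IN strFields(2)           ' un-double -> 4TH PLT "IMMORTALS"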




  • Dale Yarker
    replied
Originally posted by Stuart McLachlan View Post
(Doesn't anyone read other people's posts? )
    Yes! Repeating it 4 times inspired me.



  • Stuart McLachlan
    replied
    Originally posted by Rod Macia View Post

Yes it does; it was addressed in post #11.
Of course, it's based on the CSV that I found, which appears to match the OP's description of the file he has, but the OP has not supplied a link or a sample for us to properly analyze.
Given that the problem data shown by the OP matches your sample file exactly, the odds are pretty good.



  • Stuart McLachlan
    replied
    Originally posted by Dale Yarker View Post
    Don't remove $DQs as asked for above. Add $DQs (probably manually when program gets confused.)
    ...
    The format (even though the rules are loose) is Comma Separated Values, not Quote Comma Quote Separated Values
    ...
    What I hinted at above means writing your own parser, versus PB's PARSE$ which works on really simple csv.
For the 5th (?) time! You don't need to do anything with the $DQs if you PARSE on CHR$($DQ, ",", $DQ) (other than removing the leading and trailing one on each record).

The format of the data file in question is "quoted comma separated values".

    PARSE and PARSE$ work on whatever delimiter string you pass to them.
    Look at posts 11 and 15.

    (Doesn't anyone read other people's posts? )



  • Stuart McLachlan
    replied
    Originally posted by Dan Soper View Post
    That does leave an opening $DQ in the first field and a closing $DQ in the last field that will need to be removed.
That's why both posts #11 and #15 have this:

    Code:
strRecord = TRIM$(strRecord, $DQ)             ' get rid of first and last quote on line
PARSE strRecord, strFields(), $DQ & "," & $DQ



  • Michael Mattias
    replied
Originally posted by Stuart McLachlan View Post
As has been pointed out more than once above, there are no "bad records" identified. Parsing the file correctly using the appropriate delimiter CHR$($DQ, ",", $DQ) is all that is required.
    IMO you are thinking like a government contractor. You are allowing perfection to be the enemy of the usable. This is sometimes called "paralysis by analysis."

    You are also assuming the ONLY error in a CSV file created by multiple operators is doubled quotes. As I pointed out above, with only six bad records out of seven million you are already luckier than the fox who fell asleep and woke up in the henhouse.

Now that I am retired I should probably create that 'long read' I have thought about over the years ... the difference between "systems and utilities" programmers and "applications" programmers.

    MCM



  • Tom Tallardy
    replied
    anyone who has worked with .csv files knows - they have lots of variations.
    data exported from proprietary systems via homespun report generators can be challenging, to say the least.

using a comma delimiter works except where you have _both_ embedded comma(s) _and_ embedded quotes.

    using "," as a delimiter fails on many .csv because blank fields often show up as ",,"

    "quoted strings" are not always "proper" quoted strings
    Excel auto-quotes any cell with a comma when saved as .csv
(not sure what Excel does if the comma is used as the system decimal separator....)

for
Code:
D$ = $DQ & "1234, MAIN STREET" & $DQ & "," & _
     $DQ & "1235, MAIN STREET" & $DQ & "," & _
     $DQ & "1236, MAIN STREET" & $DQ

FOR x& = 1 TO 3
    PRINT PARSE$(D$, x&)   ' default comma delimiter honours the quoted fields
NEXT
    produces
    1234, MAIN STREET
    1235, MAIN STREET
    1236, MAIN STREET

    using
    ","
as a delimiter fails when blank fields are not double quoted - often the case

    using PARSE$(D$,",",x&) fails when the quoted strings contain _both_ a comma _and_ embedded quotes

    basically one first has to find out how many goofy situations are in the .csv file before one can "handle" it....
    like one .csv I got with form feeds embedded every x lines....



  • Rod Macia
    replied
    Originally posted by Dan Soper View Post
    That does leave an opening $DQ in the first field and a closing $DQ in the last field that will need to be removed.
Yes it does; it was addressed in post #11.
Of course, it's based on the CSV that I found, which appears to match the OP's description of the file he has, but the OP has not supplied a link or a sample for us to properly analyze.
    Code:
    DIM strFields(1 TO 48) AS STRING
LINE INPUT #ff, strRecord                         ' input complete line
strRecord = TRIM$(strRecord, $DQ)                 ' get rid of first and last quote on line
PARSE strRecord, strFields(), CHR$($DQ, ",", $DQ) ' strFields() now holds all 48 fields



  • Dale Yarker
    replied
Don't remove $DQs as asked for above. Add $DQs (probably manually, when the program gets confused).
    Code:
    . . .  ,"4TH PLT "IMMORTALS"", . . .
    should be
    Code:
    . . .  ,"4TH PLT ""IMMORTALS""", . . .
    The format (even though the rules are loose) is Comma Separated Values, not Quote Comma Quote Separated Values. Don't expect the entry operator to double up on imbedded double quotes. The program that converted whatever to csv should have done it. You're just stuck with the result.

    Think of $DQs as odd and even. Odd for DQ on, even for DQ off. A comma while DQ is on is imbedded and not a delimiter. A comma while DQ off is a delimiter. Two $DQs together (with nothing between) are (off/on) one imbedded $DQ. The third $DQ after IMMORTALS is DQ off for that field (so an imbedded comma would not have been a problem in the field).
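A minimal sketch of that odd/even scan as a parser (the function name is hypothetical; the caller must DIM sFields() large enough):

Code:
FUNCTION SplitCsvLine(BYVAL sLine AS STRING, sFields() AS STRING) AS LONG
    LOCAL i AS LONG, n AS LONG, sField AS STRING
    LOCAL bQuoted AS LONG                         ' DQ state: on/off
    FOR i = 1 TO LEN(sLine)
        SELECT CASE MID$(sLine, i, 1)
            CASE $DQ
                IF bQuoted AND MID$(sLine, i + 1, 1) = $DQ THEN
                    sField = sField & $DQ         ' off/on pair = one imbedded $DQ
                    INCR i                        ' skip the second quote
                ELSE
                    bQuoted = NOT bQuoted         ' toggle DQ on/off
                END IF
            CASE ","
                IF bQuoted THEN
                    sField = sField & ","         ' comma while DQ on is imbedded
                ELSE
                    INCR n                        ' comma while DQ off is a delimiter
                    sFields(n) = sField
                    sField = ""
                END IF
            CASE ELSE
                sField = sField & MID$(sLine, i, 1)
        END SELECT
    NEXT
    INCR n
    sFields(n) = sField                           ' field after the last delimiter
    FUNCTION = n
END FUNCTION

Called as, say, nFields& = SplitCsvLine(strRecord, strFields()), it handles imbedded commas, doubled $DQs, and unquoted fields alike.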

    Code:
    . . . ,"1234", . . .
    is text that contains only number characters (like part number).
    Code:
    . . . ,1234, . . .
    is a number (like quantity).

    (if you get a CSV from MS Excel it is a whole new world of weird. )

    What I hinted at above means writing your own parser, versus PB's PARSE$ which works on really simple csv.

    (all just IMO of course)

    Cheers,



  • Dan Soper
    replied
    Originally posted by Stuart McLachlan View Post
    As has been pointed out several times, the OP should use CHR$($DQ,",",$DQ) as the delimiter. ...
    That does leave an opening $DQ in the first field and a closing $DQ in the last field that will need to be removed.

