How to remove selected quotes from a .CSV file

  • How to remove selected quotes from a .CSV file

    I have a voter registration file in .CSV format. Each individual's data is recorded in 48 fields: last name, first name, middle name, addresses, etc. There are 7,894,412 records in the file. I have discovered that 6 of the records are corrupt. By corrupt I mean that instead of 48 fields, those 6 records parse as 49. The reason for this is that some clerks recorded address information enclosed in quotes. Here's an example:

    18 fields followed by: ,"4TH PLT "IMMORTALS"", ... remaining fields. By enclosing the platoon nickname in quotes, the record becomes corrupt.

    I can use: INPUT #1, RECORDS

    because each record ends with CR/LF. However, if I try to read in each field, i.e.:
    INPUT #1, LAST_NAME
    INPUT #1, FIRST_NAME
    INPUT #1, MIDDLE_NAME, etc., all goes well until I reach the record with the quoted nickname. From that point the fields are off by one, so the next record begins with the last field of the corrupt record and its first field, LAST_NAME, is read incorrectly.

    I tried to use the PBCC REPLACE statement to get rid of the quote marks around the word 'IMMORTALS'. I've tried various methods, including using CHR$(34), for example:
    REPLACE "CHR$(34)IMMORTALS CHR$(34)" and "IMMORTALS""

    but these don't work. I am applying the REPLACE command to the entire record of 48 / 49 fields.

    So, what is a way to eliminate the extra " marks around the word IMMORTALS?

    I'm asking just to learn how to handle this situation. I'm not working for any governmental agency or political party. My alternate path forward is just to eliminate the corrupt records, but then I learn nothing about programming this problem.
    Thanks

    Tim

  • #2
    Read the file in as a binary string, then search for two double quotes and a comma together ( "", ). Step backward until you find the nearest ", replace that " with a space, then replace "", with ", and when done write the string to a new file.
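    The same steps, sketched in Python for illustration (a sketch only, not tested against the real file; note that later posts show every field in this file is quoted, so empty fields also produce "", and the search would need tightening before use):

```python
def fix_record(rec: str) -> str:
    # Repair records like ...,"4TH PLT "IMMORTALS"",... :
    # find "", then step back to the nearest stray quote, blank it,
    # and collapse the doubled closing quotes.
    while True:
        pos = rec.find('"",')
        if pos == -1:
            break
        inner = rec.rfind('"', 0, pos)             # nearest quote before the pair
        if inner == -1:
            break
        rec = rec[:inner] + ' ' + rec[inner + 1:]  # replace it with a space
        rec = rec[:pos] + rec[pos + 1:]            # turn "", into ",
    return rec
```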

    KS



    • #3
      Instead of inputting 1 item at a time, use LINE INPUT # to read an entire line at once.
      Then use PARSE$() to extract each of the 48 items from the line.
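      In Python terms (a rough analogue, using the 48-field count from the original post; not the PB code itself):

```python
EXPECTED_FIELDS = 48  # field count given in the original post

def parse_record(line: str):
    # LINE INPUT + PARSE$ analogue: read a whole line, split it once,
    # and flag a bad field count instead of letting a stray quote shift
    # every later field-by-field INPUT # out of step.
    fields = line.rstrip("\r\n").split(",")
    return fields if len(fields) == EXPECTED_FIELDS else None
```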



      • #4
        If it's a one-time thing (you know the 6 incorrect records), then use 6 REPLACE statements.
        Code:
        FUNCTION PBMAIN () AS LONG
         'create bad file
         LOCAL s AS STRING
         s = CHR$("4TH PLT ""IMMORTALS""")
         OPEN "junk.txt" FOR BINARY AS #1
         PUT$ #1,s
         CLOSE #1
         ? s,,"Bad"
        END FUNCTION
        CMD shortcut to open any location:
        %windir%\system32\cmd.exe /k " cd\ & x: & cd x:\xxxx
        Change to run as administrator
        How long is an idea? Write it down.



        • #5
          Back up first if rewriting to the same file.
          Code:
          FUNCTION PBMAIN () AS LONG
          'fix bad file
           LOCAL s AS STRING
           OPEN "junk.txt" FOR BINARY AS #1
           GET$ #1,LOF(1), s
          
           'Replace statements
           REPLACE """IMMORTALS""" WITH "IMMORTALS" IN s
          
           'rewrite
           SEEK #1,1
           PUT$ #1, s
           SETEOF #1  'file is smaller so truncate
           CLOSE #1
          
          END FUNCTION


          • #6
            Thanks guys,
            I've not worked with BINARY files before, so I'm glad to know I'll be learning something new. I will study each of your replies. Yes, this is a one-time thing and just for fun. I do wonder why the software the county clerks use allows this to happen? I suppose their databases handle this situation without problems.
            Tim



            • #7
              I suspect that you may be misunderstanding your problem.
              CSV files are a real PITA since there is no standard for them. It sounds like your data follows the common practice of quoting fields when they contain commas or quotes.
              It is quite likely that the data export routine that created the file put the quotes in, not "some clerk who recorded the addresses".

              Do the problem records contain commas or pairs of quotes in the data?
              You may well get data which looks like:
              ...,Mr,"James aka ""Jim""",Sminth,"47, High Street",Hickesville,...
              Note the doubled embedded $DQ
              ...
              You are much better off getting the data as TAB separated values with no quoting if you can. (I always insist on it when getting such data)

              If you can't get a better constructed file, then you have to jump through hoops to deal with the various potential formats.
              Here's a typical method to deal with the above.

              '
              Code:
              #COMPILE EXE
              #DIM ALL
              
              FUNCTION PBMAIN() AS LONG
              LOCAL ff, lngFields,flgDQs, lngRecNum AS LONG
              LOCAL strRecord AS STRING
              DIM strFields(1 TO 48) AS STRING
              ff = FREEFILE
              OPEN "myCSVFile.csv" FOR INPUT AS #ff
              WHILE NOT EOF(ff)
                  LINE INPUT #ff, strRecord
                  INCR lngRecNum
                  lngFields = PARSECOUNT(strRecord,",")
                  flgDQs = INSTR(strRecord,$DQ & $DQ) ' contains double quotes.
                  SELECT CASE AS LONG lngFields
                      CASE 48 'no embedded commas
                           IF flgDQs THEN 'doubled quotes, so presume some fields are quoted.
                              FixCSV strRecord
                              PARSE strRecord,strFields(), $TAB
                          ELSE
                              PARSE strRecord,strFields(), ","
                          END IF
                      CASE > 48 'probably embedded commas
                       FixCSV strRecord
                       PARSE strRecord,strFields(), $TAB
                      CASE ELSE 'bad record, not enough fields!
                          ? USING$("Insufficient fields ( # ) in Record # ",lngFields,lngRecNum)
                  END SELECT
                  '...
              WEND
              '...
              END FUNCTION
              
              FUNCTION FixCSV(BYREF s AS STRING) AS LONG
                  LOCAL x, flgInquotes AS LONG
              
                  FOR x = 1 TO LEN(s)
                      IF MID$(s,x,1) = $DQ THEN
                          IF flgInquotes AND MID$(s,x+1,1) = $DQ THEN
                              'it's first of two quotes inside a quoted field so needs to be retained.
                              'change it temporarily to a control code so that we can replace it later
                              MID$(s,x,1) = CHR$(1)
                          ELSE
                              flgInquotes = (NOT flgInquotes)
                          END IF
                          ITERATE
                      END IF
                      IF (NOT flgInquotes) AND (MID$(s,x,1) = ",") THEN
                          'change delimiter to TAB character
                          MID$(s,x,1) = $TAB
                      END IF
                  NEXT
                  s = REMOVE$(s,$DQ) ' get rid of all the unwanted quotes
                  REPLACE CHR$(1) WITH $DQ IN s
              END FUNCTION
              '
              Last edited by Stuart McLachlan; 30 Oct 2020, 10:37 PM.



              • #8
                Tim, since you're from Michigan, are you using the CSV from https://michiganvoters.info/
                http://69.64.83.144/~mi/download/20201012/
                "Entire State October.zip"

                that CSV file does have records like you described.
                Code:
                ........ "","DETROIT","MI","48228","4TH PLT "IMMORTALS"","DELTA COMPANY, 3-34 IN REGI, 11952 GOLDEN ARROW R","FORT JACKSON, SC 29207" ........
                It does not appear to be using doubled embedded $DQ as Stuart suggested (which is the proper way to embed $DQs).
                It has embedded $DQs and embedded commas.
                Every record is wrapped in DQs and comma separated, and it appears that every line is followed by 2 CRLFs.



                • #9
                  Originally posted by Rod Macia View Post
                  Tim, since you're from Michigan, are you using the CSV from https://michiganvoters.info/
                  http://69.64.83.144/~mi/download/20201012/
                  "Entire State October.zip"

                  that CSV file does have records like you described.
                  Code:
                  ........ "","DETROIT","MI","48228","4TH PLT "IMMORTALS"","DELTA COMPANY, 3-34 IN REGI, 11952 GOLDEN ARROW R","FORT JACKSON, SC 29207" ........
                  It does not appear to be using doubled embedded $DQ as Stuart suggested (which is the proper way to embed $DQs).
                  It has embedded $DQs and embedded commas.
                  Every record is wrapped in DQs and comma separated, and it appears that every line is followed by 2 CRLFs.
                  Aaaaaaahhhhhh! Yet another variant: single, un-escaped $DQs inside quoted strings.

                  My previous FixCSV function will require a bit of tweaking for that, but it shouldn't be difficult.

                  Note from: http://69.64.83.144/~mi/download/202...%20Layout..pdf

                  File contains varchar, numeric and date values. It's likely that the numeric fields are not quoted, and the dates may or may not be.
                  When I get that sort of data, I write a specific parsing function that every line is run through, with the fields put into an array as per my earlier post. Once you've got a record into the array, you can do what you want with it (fill an SQLite database?), handling each separate field as required.

                  If all fields are quoted, then a simple PARSE strRecord, strFields(), $DQ & "," & $DQ may be all that's required.
                  (Sorry, but I'm not going to download a 560MB file just to knock up an appropriate parsing routine.)



                  • #10
                    Here is a small sample, about 20 records; the second-last is the offending record.
                    For people who may not want to download the 560MB file:

                    smaller.zip



                    • #11
                      Originally posted by Rod Macia View Post
                      Here is a small sample, about 20 records, the 2nd last is the offending record.
                      For people that may not want to download the 560MB

                      smaller.zip
                      OK, nice and easy, all fields are quoted:

                      '
                      Code:
                      #COMPILE EXE
                      #DIM ALL
                      
                      FUNCTION PBMAIN() AS LONG
                      LOCAL ff, x AS LONG
                      LOCAL strRecord ,strDelimiter AS STRING
                      DIM strFields(1 TO 48) AS STRING
                      strDelimiter = $DQ & "," & $DQ
                      ff = FREEFILE
                      OPEN "smaller.csv" FOR INPUT AS #ff
                      LINE INPUT #ff, strRecord 'field names header row
                      
                      WHILE NOT EOF(ff)
                          LINE INPUT #ff, strRecord 'blank line
                          LINE INPUT #ff, strRecord
                          strRecord = TRIM$(strRecord,$DQ) ' get rid of first and last quote on line
                          PARSE strRecord,strFields(), strDelimiter
                      
                      'do something with it
                          strRecord =""
                          FOR x = 1 TO 48
                              strRecord += strFields(x) & $LF   'or $TAB instead of  $LF ?
                          NEXT
                          ? strRecord
                      
                      WEND
                      END FUNCTION
                      '
                      Last edited by Stuart McLachlan; 30 Oct 2020, 10:48 PM.



                      • #12
                        I tried the method Mike D suggested on a small test file and all worked well. Then I tried the voter file, but got an "out of memory" error. The entire file is 2,766,018 KB long.

                        I just checked back here and see more suggestions. I'll have to study the new suggestions.

                        The "corrupt" records are all bad because of the extra quote in the same fields. Each record has 5 fields (19 through 23) that are mail address info. For whatever reason the double quote shows up only in the mail address fields.

                        ,"4TH PLT "IMMORTALS"" ,
                        , "SUPMANEE HOMES"",
                        ,"6 "LINCOLN"",
                        ,""CORONA GARDENS"",
                        ,"6 "LINCOLN"", <-----same as above- related individuals?
                        ,"342 "PORTSIDE TERRACE" TRIQ SAN PAWL",

                        Last "year" the data set had 2 "corrupt" entries - this year 6. It would be very interesting to know the real reason these have shown up recently.

                        Thanks again for the help. I'll be studying these replies.

                        Tim



                        • #13
                          You can't load that much data into memory at once. You'll need to process it in blocks. Do you really need all 48 fields? What are you trying to do with the data? You could probably read in 1 line at a time, extract just the few fields you're interested in then write them back to disc as a smaller file. That smaller file might be small enough to load into memory.
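                          Sketched in Python (a sketch only; the column indices here are hypothetical, pick whichever of the 48 fields you actually need):

```python
def shrink_csv(src_path: str, dst_path: str, keep=(0, 1, 2)):
    # Stream the big file one line at a time and write out only the
    # wanted columns, so the smaller result can fit in memory later.
    with open(src_path, encoding="ascii", errors="replace") as src, \
         open(dst_path, "w", encoding="ascii") as dst:
        for line in src:
            fields = line.rstrip("\r\n").split(",")
            dst.write(",".join(fields[i] for i in keep if i < len(fields)) + "\n")
```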



                          • #14
                            Originally posted by Tim Collins View Post
                            The "corrupt" files all are bad because of the extra quote in the same fields.
                            ...
                            It would be very interesting to know the real reason these have shown up recently.
                            Just to be clear, the data is not "corrupt" and the file is not "bad". Double quote (CHR$(34)) is a standard ASCII character and should be permissible in any text field in a database.

                            As long as they are handled consistently and appropriately in import/export processes, they are not a problem.

                            The only problem is your method of handling the data file. The PB INPUT# command only works correctly with a few text delimiting schemas. Your data file is not one of those limited cases.

                            PB Help under "INPUT#" even tells you (somewhat obtusely) that INPUT# can't handle your data:
                            "Please note that data to be quoted should not contain embedded quotes."

                            See post #11 for a valid method.



                            • #15
                              Originally posted by Paul Dixon View Post
                              You can't load that much data into memory at once. You'll need to process it in blocks. Do you really need all 48 fields? What are you trying to do with the data? You could probably read in 1 line at a time, extract just the few fields you're interested in then write them back to disc as a smaller file. That smaller file might be small enough to load into memory.
                              How about creating a mail merge file to mail bomb the voters?
                              In that case, a TAB delimited file is much more useful:

                              A similar approach for physical addresses for door knocking (only print out a list for a selected ZIP code: IF strFields(17) = .... )

                              '
                              Code:
                              #COMPILE EXE
                              #DIM ALL
                              
                              FUNCTION PBMAIN() AS LONG
                              LOCAL ff,ff2, x AS LONG
                              LOCAL strRecord AS STRING
                              DIM strFields(1 TO 48) AS STRING
                              ff = FREEFILE
                              OPEN "smaller.csv" FOR INPUT AS #ff
                              ff2 = FREEFILE
                              OPEN "addresses.tsv" FOR OUTPUT AS #ff2
                              LINE INPUT #ff, strRecord 'field names header row
                              
                              WHILE NOT EOF(ff)
                                  LINE INPUT #ff, strRecord 'blank line
                                  LINE INPUT #ff, strRecord
                                  strRecord = TRIM$(strRecord,$DQ) ' get rid of first and last quote on line
                                  PARSE strRecord,strFields(), $DQ & "," & $DQ
                              
                                  'If it has an address, extract it
                                  IF strFields(19) > "" THEN
                                      strRecord = MCASE$(strFields(2) & " " & strFields(3)) & " " &  strFields(1) & $TAB
                                      FOR x = 19 TO 22
                                          strRecord += strFields(x) & $TAB
                                      NEXT
                                      strRecord += strFields(23)
                                      PRINT #ff2, strRecord
                                  END IF
                              
                              WEND
                              CLOSE #ff
                              CLOSE #ff2
                              ? "Mail merge file created"
                              END FUNCTION
                              '



                              • #16
                                Where I work, our main input file format is CSV. Interesting how there is no solid CSV format standard. Oh well.

                                Below is an example showing a function I use to convert a CSV record to TAB delimited. It handles 99% of the issues I have run into over the years. Almost all the files we get are ASCII format. When we get the occasional UTF-8 format, I change it to ASCII. That has worked for all but one time; we had to do some different programming on that project.

                                I haven't had to make any changes to it for years. As MCM has said in the past, "Sometimes, good enough is good enough." But, I'm always open to suggestions.

                                I've also attached a file that has the test CSV file with two records. (I know, pretty big.)


                                Code:
                                #DIM ALL
                                 #COMPILE EXE "aaaTest.exe"
                                #COMPILER PBCC 6
                                #OPTION ANSIAPI
                                '--------------------------------------------------------------------------------
                                '- Includes
                                '--------------------------------------------------------------------------------
                                #INCLUDE "Win32api.inc" 'José Roca's Includes
                                '--------------------------------------------------------------------------------
                                 '- Main
                                '--------------------------------------------------------------------------------
                                FUNCTION PBMAIN() AS LONG
                                LOCAL iMaxFields, iRecs AS LONG
                                LOCAL a$
                                LOCAL sFields() AS STRING
                                
                                   TRY
                                      OPEN "TEST.IN.CSV" FOR INPUT AS #1
                                   CATCH
                                      PRINT ERROR$
                                      PRINT "File: TEST.IN.CSV"
                                      PRINT "Job Canceled"
                                      EXIT FUNCTION
                                   END TRY
                                
                                   TRY
                                      OPEN "TEST.OUT.CSV" FOR OUTPUT AS #2
                                   CATCH
                                       PRINT ERROR$
                                       PRINT "File: TEST.OUT.CSV"
                                       PRINT "Job Canceled"
                                       CLOSE
                                       EXIT FUNCTION
                                   END TRY
                                
                                   '--------------------------------------------------------------------------------
                                   ' Handle header record
                                   '--------------------------------------------------------------------------------
                                   LINE INPUT #1,a$                ' Get a record.
                                   CSV2TSV a$                      ' Convert CSV to TAB delimited.
                                   iMaxFields = PARSECOUNT(a$,$TAB)' Get Field Count
                                    INCR iMaxFields                 ' Add one for the additional field out
                                   REDIM sFields(1 TO iMaxFields)  ' iMaxFields defined in header routine.
                                   PARSE a$, sFields(), $TAB       ' Parse into fields array
                                   sFields(iMaxFields) = "SEQUENCE"' Add Field Name
                                   a$ = JOIN$(sFields(),$TAB)      ' Put header
                                   PRINT #2, a$
                                
                                
                                
                                   DO  UNTIL EOF(1)
                                       '--------------------------------------------------------------------------------
                                       ' Get a record and delimit it.                                                  '
                                       '--------------------------------------------------------------------------------
                                       LINE INPUT #1,a$                ' Get a record.
                                       CSV2TSV a$                      ' Convert CSV to TAB delimited.
                                       REDIM sFields(1 TO iMaxFields)  ' iMaxFields defined in header routine.
                                       PARSE a$, sFields(), $TAB       ' Parse into fields array
                                       INCR iRecs
                                       '--------------------------------------------------------------------------------
                                       ' Put your stuff here.                                                          '
                                       '--------------------------------------------------------------------------------
                                       sFields(iMaxFields) = FORMAT$(iRecs, "000000")
                                       a$ = JOIN$(sFields(), $TAB)
                                       PRINT #2, a$
                                   LOOP
                                   '--------------------------------------------------------------------------------
                                   ' Put any remaining records in buffer, close all files.                         '
                                   '--------------------------------------------------------------------------------
                                   CLOSE
                                   '--------------------------------------------------------------------------------
                                   ' Close progress bar, print any counts, etc.                                    '
                                   '--------------------------------------------------------------------------------
                                   PRINT FORMAT$(iRecs, "#,##0") & " records processed."
                                   WAITKEY$
                                END FUNCTION
                                
                                
                                '--------------------------------------------------------------------------------
                                '-
                                '--------------------------------------------------------------------------------
                                '----------------------------------------------------------------------------------------------------------
                                SUB CSV2TSV(sRecord AS STRING, OPTIONAL BYVAL IKeepDQ AS LONG)
                                    '                                         TRUE if you want to keep internal double quotes.
                                LOCAL i, iFoundQ, iByteIndex,LenZ  AS LONG
                                LOCAL pA AS STRING POINTER
                                
                                       REPLACE ANY CHR$(09,145,146,147,148) WITH " ''""""" IN sRecord ' Replace TABs with spaces, and "smart" quotes & apostrophes with "dumb"
                                      pA = STRPTR(sRecord)                           ' point to the string
                                      LenZ = LEN(sRecord) - 1
                                      FOR i = 0 TO lenZ                              ' loop until the end of the field is reached.
                                         iByteIndex = PA + i                         ' point to current byte
                                         SELECT CASE PEEK(BYTE, iByteIndex)          ' look at this byte:
                                            CASE 34                                  ' If this byte is a QUOTE
                                               SELECT CASE i                         '
                                                   CASE 0, lenZ                      ' if a quote
                                                       iFoundQ = NOT iFoundQ                 '   switch quote flag
                                                       POKE BYTE, iByteIndex, 00             '   convert quote to null
                                                   CASE ELSE
                                                       IF ISTRUE(iFoundQ) THEN       ' if a quote has been found, look at next byte
                                                           IF PEEK(BYTE,iByteIndex + 1) = 44 THEN   ' if a comma
                                                              POKE BYTE, iByteIndex, 00             ' convert quote to null
                                                              iFoundQ = NOT iFoundQ                 '   switch quote flag off
                                                           ELSE                                     ' if not a comma,
                                                              IF ISFALSE(iKeepDQ) THEN POKE BYTE, iByteIndex, 255 '   convert quote to "ÿ"
                                                           END IF
                                                       ELSE
                                                           iFoundQ = NOT iFoundQ                 '   switch quote flag on
                                                           POKE BYTE, iByteIndex, 00             '   convert quote to null
                                                       END IF
                                               END SELECT
                                            CASE 44                                  ' If this byte is a COMMA, and QUOTE FOUND switch is OFF,
                                                IF ISFALSE iFoundQ THEN POKE BYTE, iByteIndex, 9 ' convert comma to a TAB.
                                         END SELECT                                 ' Everything else: no changes
                                      NEXT i
                                      REPLACE CHR$(00,09)   WITH CHR$(09) IN sRecord   ' Convert NULL & TAB  to TAB    (NULLTAB  > TAB)
                                      REPLACE CHR$(09,00)   WITH CHR$(09) IN srecord   ' Convert TAB  & NULL to TAB    (TABNULL  > TAB)
                                      REPLACE CHR$(00,00)   WITH $DQ IN sRecord        ' Convert NULL & NULL to quote  (NULLNULL > ")
                                      REPLACE CHR$(255)     WITH $DQ IN sRecord        ' Convert DOUBLE-QUOTE holder to DQ
                                      IF ISFALSE iKeepDQ THEN
                                          REPLACE $DQ2          WITH $DQ IN sRecord        ' Convert DOUBLE-DOUBLE quote to DQ if requested
                                      END IF
                                      sRecord = REMOVE$(sRecord, CHR$(00))           ' remove any NULL characters
                                END SUB
                                '-----------------------------------------------------------------------------------------------------------
                                ... .... . ... . . ... ... .... . .. ... .... .. ... .... .. .... ..

                                n6jah @ yahoo.com



                                • #17
                                  TEST.zip


                                  • #18
                                    Originally posted by Jim Robinson View Post
                                    Below is an example showing a function I use to convert a CSV record to TAB delimited. It handles 99% of the issues I have run into over the years. Almost all the files we get are ASCII format.
                                     Nothing above CHR$(127)?

                                    Last edited by Stuart McLachlan; 31 Oct 2020, 10:40 PM.



                                    • #19
                                       Nothing above CHR$(127)?
                                      Hmmm, I've never run into that character before. I guess not too many people are writing to paper tape any more.


                                      • #20
                                        Tim,

                                        The field separator in the CSV file seems to be CHR$(34,44,34), this is $DQ,COMMA,$DQ. If you PARSECOUNT the 7,894,413 records in the file using that separator, you will consistently get 48 fields as specified. Since the whole file is too big to be handled with EXCEL or NOTEPAD, you can split it quite easily using a few lines of PBCC code.
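                                        A sketch of that split in Python (assuming, as described, that every field is quoted and each record arrives as one line):

```python
SEP = '","'  # CHR$(34,44,34): double quote, comma, double quote

def split_record(line: str):
    # Trim the outer quotes, then split on the three-character separator.
    # Stray quotes and commas inside a field survive intact.
    s = line.rstrip("\r\n")
    s = s.removeprefix('"').removesuffix('"')
    return s.split(SEP)
```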
