Announcement

Collapse
No announcement yet.

Handling Tab Delimited Input Files

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Handling Tab Delimited Input Files

    I have a recurring set of input files with the data elements delimited by tab hex codes. Neither INPUT# nor LINE INPUT# are willing to retain those tabs when bringing the hex code into the processor. A straight forward GRAPHIC PRINT of the input line from LINE INPUT# shows the tab hex code has been ignored. It's even worse with INPUT# which is picking up part of the next character obtained via a sequential read with that command. My problem is further compounded by inclusion of commas in "some" of the data elements so trying to go with comma delimited files instead of the tab delimited files is definitely not going to work. I've had difficulty previously trying to transform such hex involved files reading as binary and then rewriting with the hex codes replaced (converting line feed hex only into the proper end of line carriage return line feed hex code). Can anybody suggest a means (hopefully other than trying again to transform via reading and rewriting as binary files) HOW I can get those properly end of lined but tab delimited input files into a program for processing?

  • #2
    I did locate a partial solution to this problem in the existence of a command line hex editor called VXI32 which is capable of working from a batch file to replace all tabs in the input file set with spaces (which can then be readily processed with LINE INPUT#). It won't handle the daily variable file names yet because the hex editor command line won't accept wild card file names, but it may be a start towards solution of not only this problem but perhaps also my earlier difficulty transforming line feed hex only into end of line carriage return line feed hex code

    Comment


    • #3
      Handling Tab Delimited Input Files


      Bob,

      Does this help? I think hex could be used as a delimiter, but do you need hex?
      Could also be done reading file into an array.
      String array used here is only for breaking the line into elements.
      Code:
      FUNCTION PBMAIN () AS LONG
       OPEN "junk.txt" FOR OUTPUT AS #1
       PRINT #1, CHR$("how",9,"now",9,"brown",9,"cow")
       CLOSE #1
       Test "junk.txt"
      END FUNCTION
      
      '-----------------------------------------------------------------------------------------------------------------------------
      SUB Test(sFile AS STRING)
      
       LOCAL sLine AS STRING
       LOCAL columns AS LONG
      
       OPEN sFile FOR INPUT AS #1
       LINE INPUT #1,sLine
       columns = PARSECOUNT(sLine,$TAB)
       REDIM sArray(1 TO columns) AS STRING
      
       PARSE sLine,sArray(), $TAB
       ? JOIN$(sArray(),$TAB),,"Join" 'replace $TAB with any output delimiter
      
       ? sArray(columns)              'last element same as
       ? PARSE$(sLine,$TAB,columns)   'last index
      
      
       CLOSE #1
      
      END SUB
      How long is an idea? Write it down.

      Comment


      • #4
        Using hex delimiter "09"
        Replacing tabs with spaces is not needed.
        You can change the delimiter.
        If need be, any characters can be replaced with any other characters with one call.

        File can be read into an array, see FILESCAN and each element/line processed like in this demo.
        The file can also be read into a string and converted back and forth between string/array using JOIN$/PARSE.

        Code:
        FUNCTION PBMAIN () AS LONG
         LOCAL sDelimit AS STRING
         sDelimit = "09"
         OPEN "junk.txt" FOR OUTPUT AS #1
         PRINT #1, CHR$("how",sDelimit,"now",sDelimit,"brown",sDelimit,"cow")
         CLOSE #1
         Test "junk.txt",sDelimit
        END FUNCTION
        
        SUB Test(sFile AS STRING,sDelimit AS STRING)
         LOCAL sLine AS STRING
         LOCAL columns AS LONG
         OPEN sFile FOR INPUT AS #1
        
         'process first line getting number of columns
         LINE INPUT #1,sLine
         columns = PARSECOUNT(sLine,sDelimit)
         REDIM sArray(1 TO columns) AS STRING
        
         PARSE sLine,sArray(), sDelimit
         ? JOIN$(sArray()," | "),,"Columns in this line"
        
         do until eof(1)
          LINE INPUT #1, sLine
          PARSE sLine,sArray(), sDelimit
          ? JOIN$(sArray()," | "),,"Columns in this line"
         loop
         CLOSE #1
        END SUB
        How long is an idea? Write it down.

        Comment


        • #5


          I'm confused. I regularly work with tab delimited data files and have never had any problem using LINE INPUT and PARSE. ( I sometimes do REPLACE $LF with $CRLF on the whole file first if it comes fro a *NIX environment) Can you post a zipped sample file?

          (And just to be picky, your use of the phrase "tab hex code" doesn't really mean anything. "hex code" is just a way of describing a byte or a series of bytes. Presumably your delimiter is an ASCII character 9)




          Comment


          • #6
            Delimiter does not matter and file read into an array

            Code:
            FUNCTION PBMAIN () AS LONG 'saved as "tsv.bas"
            
             LOCAL sFile,sDelimit, sData() AS STRING
             LOCAL x,errcode AS LONG
            
             sFile = "junk.txt"
             sDelimit = "09" 'hex tab
             OPEN sFile FOR OUTPUT AS #1
             FOR x = 1 TO 11: PRINT #1, CHR$("Apple",sDelimit,"Banana",sDelimit,"Canalope"):NEXT
             CLOSE 'created test file
            
             errcode = GetMyFile(sFile,sData())
             IF errcode THEN ? ERROR$(errcode),,:EXIT FUNCTION
            
             ? JOIN$(sData(),$CR) + $CR + $CR +_
             "First line: " + sData(1) + $CR + _
             "Last line:  " + sData(UBOUND(sData)),,"hex tab separated values"
            
            END FUNCTION
            
            FUNCTION GetMyFile(sFile AS STRING,TheData() AS STRING) AS LONG
             LOCAL counter AS LONG
             LOCAL hFile AS LONG
             OPEN sFile FOR INPUT AS #hFile
             IF ERR THEN FUNCTION = ERR:BEEP:EXIT FUNCTION
            
             FILESCAN #hFile, RECORDS TO counter
             REDIM TheData(1 TO counter) AS STRING
             LINE INPUT #hFile, TheData() TO counter
             CLOSE #hFile
            END FUNCTION
            Attached Files
            How long is an idea? Write it down.

            Comment


            • #7
              Added helper function Parseline when everything is in memory and parses any element/line of array using any delimiter.
              Stuart,
              I'm also confused by the hex tab, but any delimiter is handled in ParseLine function.
              Bob,
              Not sure if anything more could be done.
              Reading the file into an array doesn't care about delimiters and processing using LINE INPUT # (as mentioned) doesn't care about delimiters.

              I'm reading the file into an array and parsing any element using ParseLine into another array of columns for the single line/element.
              The last line of ParseLine allows formatting the hex tab columns into anything FUNCTION = JOIN$(sOneLine()," | ")
              I should have added 2-delimiters to the function, sinput delimiter, soutput delimiter.

              Code:
              FUNCTION PBMAIN () AS LONG 'saved as "tsv.bas"
              
               LOCAL sFile,sDelimit,sTemp,sData() AS STRING
               LOCAL x,errcode,columns,element AS LONG
              
               sFile = "junk.txt"
               sDelimit = "09" 'hex tab
               OPEN sFile FOR OUTPUT AS #1
               PRINT #1, CHR$("One",sDelimit,"Apple",sDelimit,"Banana")
               PRINT #1, CHR$("Two",sDelimit,"Canalope",sDelimit,"Duck")
               PRINT #1, CHR$("Three",sDelimit,"Elephant",sDelimit,"Fish")
               CLOSE
              
               errcode = GetMyFile(sFile,sData())
               IF errcode THEN ? ERROR$(errcode),,:EXIT FUNCTION
              
               FOR element = LBOUND(sData) TO UBOUND(sData)
                sTemp+=ParseLine(sData(),element,sDelimit) + $CR
               NEXT
               ? "Parse each element:" + $CR + sTemp + $CR + $CR +_
                 "Raw elements:"       + $CR +  JOIN$(sData(),$CR)
              
              END FUNCTION
              
              FUNCTION ParseLine(sData() AS STRING,element AS LONG,sDelimit AS STRING) AS STRING
               IF element < LBOUND(sData) OR element > UBOUND(sdata) THEN ? "out of bounds":EXIT FUNCTION
               LOCAL columns AS LONG
               columns = PARSECOUNT(sData(element),sDelimit)
               REDIM sOneLine(1 TO columns) AS STRING
               PARSE sData(element),sOneLine(),sDelimit
               FUNCTION = JOIN$(sOneLine()," | ")
              END FUNCTION
              
              FUNCTION GetMyFile(sFile AS STRING,TheData() AS STRING) AS LONG
               LOCAL counter AS LONG
               LOCAL hFile   AS LONG
               OPEN sFile FOR INPUT AS #hFile
               IF ERR THEN FUNCTION = ERR:BEEP:EXIT FUNCTION
              
               FILESCAN #hFile, RECORDS TO counter
               REDIM TheData(1 TO counter) AS STRING
               LINE INPUT #hFile, TheData() TO counter
               CLOSE #hFile
              END FUNCTION
              How long is an idea? Write it down.

              Comment


              • #8
                Added sOutputDelimit for output flexibility of the columns of a parsed line.
                FUNCTION ParseLine(sData() AS STRING,element AS LONG,sInputDelimit AS STRING,sOutputDelimit AS STRING) AS STRING

                Code:
                FUNCTION PBMAIN () AS LONG 'saved as "tsv.bas"
                
                LOCAL sFile,sInputDelimit,sOutPutDelimit,sTemp,sData() AS STRING
                 LOCAL x,errcode,columns,element AS LONG
                 sFile = "junk.txt"
                 sInputDelimit = "09" 'hex tab
                 sOutputDelimit = " ** "
                 OPEN sFile FOR OUTPUT AS #1
                 PRINT #1, CHR$("One",sInputDelimit,"Apple",sInputDelimit,"Banana")
                 PRINT #1, CHR$("Two",sInputDelimit,"Canalope",sInputDelimit,"Duck")
                 PRINT #1, CHR$("Three",sInputDelimit,"Elephant",sInputDelimit,"Fish")
                 CLOSE
                
                 errcode = GetMyFile(sFile,sData())
                 IF errcode THEN ? ERROR$(errcode),,"GetMyFile":EXIT FUNCTION
                
                 FOR element = LBOUND(sData) TO UBOUND(sData)
                  sTemp+=ParseLine(sData(),element,sInputDelimit,sOutPutDelimit) + $CR
                 NEXT
                 ? "Parse each element:" + $CR + sTemp + $CR + $CR +_
                   "Raw elements:"       + $CR +  JOIN$(sData(),$CR)
                
                END FUNCTION
                
                FUNCTION ParseLine(sData() AS STRING,element AS LONG,sInputDelimit AS STRING,sOutputDelimit AS STRING) AS STRING
                 IF element < LBOUND(sData) OR element > UBOUND(sdata) THEN ? "out of bounds":EXIT FUNCTION
                 LOCAL columns AS LONG
                 columns = PARSECOUNT(sData(element),sInputDelimit)
                 REDIM sOneLine(1 TO columns) AS STRING
                 PARSE sData(element),sOneLine(),sInputDelimit
                 FUNCTION = JOIN$(sOneLine(),sOutPutDelimit)
                END FUNCTION
                
                FUNCTION GetMyFile(sFile AS STRING,TheData() AS STRING) AS LONG
                 LOCAL counter AS LONG
                 LOCAL hFile   AS LONG
                 OPEN sFile FOR INPUT AS #hFile
                 IF ERR THEN FUNCTION = ERR:BEEP:EXIT FUNCTION
                
                 FILESCAN #hFile, RECORDS TO counter
                 REDIM TheData(1 TO counter) AS STRING
                 LINE INPUT #hFile, TheData() TO counter
                 CLOSE #hFile
                END FUNCTION
                How long is an idea? Write it down.

                Comment


                • #9
                  Unless the file is huge in size, what I like to do is simply read the entire file (binary mode) into a single string variable. Then you can use functions like PARSECOUNT and PARSE$ to walk through the data. If the file uses CRLF's at the end of each line, then parse for them to get a single line one at a time.
                  Chris Boss
                  Computer Workshop
                  Developer of "EZGUI"
                  http://cwsof.com
                  http://twitter.com/EZGUIProGuy

                  Comment


                  • #10
                    Simple demo of Chris' idea of grabbing the whole file and parsing lines.
                    Code:
                    #COMPILE EXE
                    #DIM ALL
                    FUNCTION PBMAIN () AS LONG
                    LOCAL FileName AS STRING
                    LOCAL LineDelimiter AS STRING
                    LOCAL columnDelimiter AS STRING
                    LOCAL strData AS STRING
                    LOCAL strLines() AS STRING
                    LOCAL strCols() AS STRING
                    LOCAL x,y AS LONG
                    LOCAL strPrint AS STRING
                    
                    LineDelimiter = $CRLF 'Window file - for *Nix use $LF
                    ColumnDelimiter = $TAB 'ASCII 9, CHR$(9), &H09 delimited. Replace as necessary if other delimiter (space, |, ~ etc)
                    Filename = EXE.PATH$ & "Test.dat"
                    
                    'Get file
                    OPEN Filename FOR BINARY AS #1
                    GET$ #1, LOF(1), strData
                    CLOSE #1
                    
                    'Get Lines
                    REDIM strLines(1 TO PARSECOUNT(strData,Linedelimiter))
                    PARSE strdata ,strLines(),LineDelimiter
                    
                    'Step through lines
                    FOR x = 1 TO UBOUND(strLines())
                        REDIM strCols(1 TO PARSECOUNT(strLines(x),ColumnDelimiter))
                        PARSE strLines(x),strCols(),ColumnDelimiter
                        strPrint = ""
                    
                        'Get columns and do something with them. n this case just display them separated by "|"s
                        FOR y = 1 TO UBOUND(strCols())
                            strPrint += " | " & strCols(y) 'Create Pipe delimited display line
                        NEXT
                    
                        ? MID$(strPrint,4) 'Trim initial pipe delimiter and display columns for current line
                    NEXT
                    END FUNCTION

                    Comment


                    • #11
                      I wound up with an extraordinarily simple solution using the existing tab delimited input files and the XVI32.exe command line executable program that I mentioned having just discovered. Had some preliminary difficulties with the SHELL command later resolved. Using FILEPATH to refer to the exact location of each of the three tab delimited input files and "prelim.xsc" being my replace ASCII 09 with ASCII 20 instruction to XVI, my final code was

                      SCOM="XVI32.exe "+FILEPATH1+" /S=prelim.xsc"
                      SHELL SCOM,1
                      SCOM="XVI32.exe "+FILEPATH2+" /S=prelim.xsc"
                      SHELL SCOM,1
                      SCOM="XVI32.exe "+FILEPATH3+" /S=prelim.xsc"
                      SHELL SCOM,1

                      The brilliant thing about XVI32.exe is that it also permanently resolved my earlier problems dealing with $LF needs to be $CR$LF in a pair of important applications. All I need now is to point the prepared command line XVI batch file at the relevant file and it nearly instantly cleans up the entire mess for me. Observation about cross platform complications: On my Windows 7 computers, the $LF only files clearly disclose themselves in NOTEPAD7 but in NOTEPAD10 on my Windows 10 computers, the operating system itself has been changed so that the $LF only "appears" to be entirely readable all the while it blows up my Power BASIC Console Compiler v6.03 created programs using the LINE INPUT# commands that I normally use under Windows 7.

                      Many thanks to those who tried to help with various suggestions while I was quandering about these complications.

                      Comment


                      • #12
                        Sorry, I missed this is a console compiler.
                        Everything we did would have worked without shelling and could have checked for $LF.
                        So they were not hex 09, but ascii 9 delimiters.
                        How long is an idea? Write it down.

                        Comment


                        • #13
                          Everything we did would have worked
                          Easily proved if only a sample file with 'problem' content had been provided by Bob G..
                          Rgds, Dave

                          Comment


                          • #14
                            ((This is Console Compiler forum. Code to detect $LS only in a file, and REPLACE $LF WITH $CRLF IN stringwholefileloadedinto is the same in PBWin or PBCC.))
                            (((And some failing code from Bob would have helped.)))
                            Dale

                            Comment


                            • #15

                              Definitely. I was referring to my displaying an array.
                              ? JOIN$(sData(),$CR) + $CR + $CR +_ "First line: " + sData(1) + $CR + _ "Last line: " + sData(UBOUND(sData)),,"hex tab separated values"
                              How long is an idea? Write it down.

                              Comment


                              • #16
                                Originally posted by Bob Grumbine View Post
                                I wound up with an extraordinarily simple solution using the existing tab delimited input files and the XVI32.exe command line executable program that I mentioned having just discovered. Had some preliminary difficulties with the SHELL command later resolved. Using FILEPATH to refer to the exact location of each of the three tab delimited input files and "prelim.xsc" being my replace ASCII 09 with ASCII 20 instruction to XVI, my final code was
                                ASCII 20? The non printable DC4 character or do you mean &H20 (the space character)?

                                And I wouldn't call having to create script files and shell out to another application "an extraordinarily simple solution" when you could do the whole same thing in PB in a few lines of code such as Post #9.


                                Comment


                                • #17
                                  Originally posted by Bob Grumbine View Post
                                  On my Windows 7 computers, the $LF only files clearly disclose themselves in NOTEPAD7 but in NOTEPAD10 on my Windows 10 computers, the operating system itself has been changed so that the $LF only "appears" to be entirely readable all the while it blows up my Power BASIC Console Compiler v6.03 created programs using the LINE INPUT# commands that I normally use under Windows 7.
                                  If you are working with text data files, forget about Notepad. It's totally inadequate (as you have discovered).

                                  Do yourself a favour, download NoteTab++ and learn to use it especially the "View =>Show symbol => Show all characters" menu item which lets you see the spaces, tabs, CRs, LFs and all the other "non-printable characters" (ASCII 0 - 31)



                                  Comment


                                  • #18
                                    I have a recurring set of input files with the data elements delimited by tab hex codes. Neither INPUT# nor LINE INPUT# are willing to retain those tabs when bringing the hex code into the processor. A straight forward GRAPHIC PRINT of the input line from LINE INPUT# shows the tab hex code has been ignored. It's even worse with INPUT# which is picking up part of the next character obtained via a sequential read with that command. My problem is further compounded by inclusion of commas in "some" of the data elements so trying to go with comma delimited files instead of the tab delimited files is definitely not going to work. I've had difficulty previously trying to transform such hex involved files reading as binary and then rewriting with the hex codes replaced (converting line feed hex only into the proper end of line carriage return line feed hex code). Can anybody suggest a means (hopefully other than trying again to transform via reading and rewriting as binary files) HOW I can get those properly end of lined but tab delimited input files into a program for processing?
                                    After reading this again, not sure if hex was involved and can't figure out why reading for binary was not wanted.
                                    Determining what type of file it is could be done with IF INSTR(s,$CRLF) THEN REPLACE this with that in s or do it without replace.
                                    Yes, 1-line of input would have been nice, but you can still do this without shelling or any other program.
                                    How long is an idea? Write it down.

                                    Comment


                                    • #19
                                      I also use notepad++

                                      If you want your own hex editor with source:
                                      https://forum.powerbasic.com/forum/u...ge-file-viewer (see post #9 for the .zip file version)
                                      How long is an idea? Write it down.

                                      Comment


                                      • #20
                                        Mike Doty Reading for binary was never wanted because each *line* of the file is its own distinct input with additional complicating commas where they confuse the relevant data. With binary treatment, the program would have to look at each byte in sequence without any relevant indication of where it fits as a *data* element. The point was to get the relevant data parts separated out (with spaces in my solution) so that I can LINE INPUT# each line and process the contents where they need to go. I have no conceivable need for my own hex editor since I had been using HxD for evaluating problems and found the XVI32.exe referred to on their web site when I went to try for a command line version of HxD. However, since I have seen multiple requests for "a line" from the original objectionable file, I am showing one below with every shown separation between the data elements I want being a tab character HEX 09 (not what "would have" appeared here as multiple spaces but for the fact that the web site converts them to single spaces rather than the tab which shows in NOTEPAD7):

                                        GCG20 1,550.5s -3.5 -0.23% 1,556.6 1,558.2 1,548.0 253,076 01/16/20

                                        After processing by my SHELL command, it looks like what shows below with each of the separators now being single spaces facilitating reading with LINE INPUT# then processing to remove the objectionable commas from the line and then using INSTR(line," ") to obtain or discard each of the parts as needed for the relevant aspects of my program:

                                        GCG20 1,550.5s -3.5 -0.23% 1,556.6 1,558.2 1,548.0 253,076 01/16/20

                                        That only *looks* the same because the web site converted the original HEX 09 characters from the original file into spaces. What I actually need out of the input file are the six data elements:

                                        GCG20 1550.5 1556.6 1558.2 1548.0 253076

                                        with every other line having its own related relevance. For me this was a problem of getting the data elements in to relate to what is needed for my processing intents, spending as little of my time as possible on this aspect of handling the input files as they exist, not to be a purist about using exclusively PBCC v6.03 to do the processing. With the original definition of DEFSTR S at the very top of the program, my resulting six lines of code are precisely what I needed to conform the three input files to what I actually need.

                                        Comment

                                        Working...
                                        X