Announcement

Collapse
No announcement yet.

Text Compression

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Text Compression

    Greetings!

    I'm attempting to understand a different way to store text. I've been told that it's stored six bits per character. If you have the text "What's possible?" it'll be stored in a file as "92 128 84 157 56 16 61 52 201 8 193 127 0" (I'm showing it as decimal base). I started digging into some of the serial communication stuff; but, I may be following the wrong path. Any ideas?
    Donnie Ewald
    [email protected]

  • #2
    You can store text any way you want. "As it happens" the printable characters can be stored in 7 bits (128 discrete characters) .. and if you can live with a reduced character set of 64 characters, you can fit everything into 6 bits.

    However, text is usually stored 8 bits per characters... using the familiar "ASCII" character set... because then it does not have to be compressed for storage or uncompressed for retrieval.

    Unless of course it's stored in Unicode... 16 bits per character.

    For serial communicatons of text, you would normally just transmit the 8-bit form. Today's modems will compress it to save on total bandwidth, but as an applications programmer, what happens between the modems is something you really don't care about all that much.
    Michael Mattias
    Tal Systems (retired)
    Port Washington WI USA
    [email protected]
    http://www.talsystems.com

    Comment


    • #3
      You might want to take a look at the PB compression code over at http://www.pbcrypto.com/pbcrypto.php for some more sophisticated algorithms.
      Paul Squires
      FireFly Visual Designer (for PowerBASIC Windows 10+)
      Version 3 now available.
      http://www.planetsquires.com

      Comment


      • #4
        The 'downside' of using compression is... now you need it on BOTH sides of the transmission...

        .. of course, that can be an 'upside', too... eg if the transmission is supposed to be private but not worth the expense (time & materials & software) of 'full-blown' encryption.
        Michael Mattias
        Tal Systems (retired)
        Port Washington WI USA
        [email protected]
        http://www.talsystems.com

        Comment


        • #5
          Greetings!

          I appreciate the responses. I'm looking to use this format specifically for use in another program. What do I need to study, or what is the history that I can research that might lead me to some understanding? I suspect that this is something I might've learned (or was supposed to) in the past; but, I've no real idea of how to start. Michael, the character set is reduced to 64; but, I've no idea what that means.

          I won't turn down any code examples; but, I'd prefer some nudges in the direction of what this means instead.
          Donnie Ewald
          [email protected]

          Comment


          • #6
            Don, can you make more and different examples like "What's possible?" above?

            Comment


            • #7
              I just don't see why you want to store text any way other than as US-ASCII.... 8 bits/character.

              If it's to save space on disk or in memory, there are many other ways to compress text which will save a lot more than the 25% you save by storing in 6 bits versus 8 bits.

              And if it's for transmission over the internet, that's even less savings because the internet is only 7 bits anyway.

              If it's for security, well, have fun coding the bit-packing and unpacking. Here's how you get started with that ...
              Code:
               
              %CAP_A  = 0 
              %CAP_B  = 1
              ...
              %CAP_Z  = 25 
              %LOWER_A  = %CAP_A + 26
              %LOWER_B  = %CAP_B + 26  
              ...
              %LOWER_Z  =  51  
              
              %NUM_0 =  52 
              %NUM_1 =  53
              ...
              %NUM_9 =  61
              Leaves you two more characters..so you probably want to eschew the lower case letters since you probably want .,;!$+&%#@ at least.

              MCM
              Michael Mattias
              Tal Systems (retired)
              Port Washington WI USA
              [email protected]
              http://www.talsystems.com

              Comment


              • #8
                John, I didn't convert it over to decimal; but, here's another one in hex:
                Code:
                20 F4 05 19 53 0C 66 C8 14 20 94 E0 5C 93 0C 80 D0 4B 16 04 CF 34 58 13 3D 25 20 3C 68 13 14 E4 C5 81 43 E0 4C F3 45 3C E1 6E 80 15 20 50 81 60 34 F3 45 39 4B 20 25 48 04 3C 54 CE 9D 48 06 3D 28 0D 16 E8 09 81 42 09 38 B8 0D 24 32 01 14 C8 08 05 38 07 25 61 4E 80 D1 60 4C F3 45 80 73 CF 12 01 09 48 50 D4 24 F3 A0 50 F8 13 50 14 94 81 42 0F 54 72 2E
                ...stands for: "HOPEFULLY, THIS WILL MAKE SOME SORT OF SENSE TO SOMEONE. AT THE MOMENT, IT DOESN'T FOR ME. I THINK MICHAEL HAS GIVEN ME SOME GOOD DIRECTION TO START THOUGH."

                ...and another:
                Code:
                30 55 27 4E 05 19 40 58 09 3A 05 08 16 00 4C 40 80 42 15 48 01 38 48 13 3C D1 60 39 53 42 15 24 E0 04 E1 20 4C 51 60 5C 80 54 81 42 01 52 03 0F 3C B4 E0 30 92 C5 EA 00 42 0C 41 46 1C 82 4A 2C C3 4E 3D 04 52 4D 45 56 5D 86 5A 83 1C B3 D3 5D B7 E3 9C
                ...stands for: "LET'S TYPE IN THE ALPHABET AND SOME NUMBERS AND SEE WHAT THAT LOOKS LIKE: ABCDEFGHIJKLMNOPQRSTUVWXYZ 1234567890"

                ...the idea of bit-packing looks promising; but, I need to do some reading up on it. Hopefully this doesn't mean I need to dig into assembly though.

                Michael, I understand that the other ways might make more sense; but, when you're coding/decoding an old game format you have to play by their rules.
                Donnie Ewald
                [email protected]

                Comment


                • #9
                  I just don't see why....
                  ...the internet is only 7 bits anyway.
                  ?? Again?

                  Apparently not relevent to Don's system, in the world of serial comms - NMEA data exchange schemes, the 'High speed" version of NMEA data transmission implemented for AIS uses a method of 8 to 6 bit encapsulation to transmit encoded data using a limited (64 character) set of allowable characters.
                  Rgds, Dave

                  Comment


                  • #10
                    Originally posted by Dave Biggs View Post
                    ?? Again?
                    I wonder if Michael does Mime conversions on his files before he FTP's them.

                    Comment


                    • #11
                      Don,

                      You could possibly establish a conversion table this way.

                      If you type the same character eight times, does that results in two sets of the same three eight bit values being stored in the file?
                      If so then then the first six bits of the first eight bit value from the file could be the table entry for the typed character.
                      Repeat for each of the 64 characters in the set and you should have a table that you can use to convert the data read back from the file.

                      Read the data, convert to a binary string, chop into six bit chunks and re-translate using the table (array scan)..?

                      You might be able to see a pattern in the table that suggests a conversion algorithm for more speed if that's an issue.
                      Rgds, Dave

                      Comment


                      • #12
                        Well, I took the words 'bit-packing' and 'unpacking' from Michael's response and spent some time on this page:

                        http://webster.cs.ucr.edu/AoA/Window...a7.html#999885

                        I ignored all the information on assembly as that makes not a lick of sense to me. After thinking for a while and tinkering, I came up with this short bit of PB/CC code:

                        Code:
                        function pbmain()
                          local sTempVar as STRING
                        
                          sTempVar = "HOPEFULLY, THIS WILL MAKE SOME SORT OF SENSE TO SOMEONE. AT THE MOMENT, IT DOESN'T FOR ME. I THINK MICHAEL HAS GIVEN ME SOME GOOD DIRECTION TO START [email protected]"
                          'sTempVar = "LET'S TYPE IN THE ALPHABET AND SOME NUMBERS AND SEE WHAT THAT LOOKS LIKE: ABCDEFGHIJKLMNOPQRSTUVWXYZ [email protected]"
                        
                          Flag& = 0
                          for I& = 1 to len(sTempVar)
                            Jack$ = Jack$ + bin$(asc(sTempVar, I&), 6)
                            incr Flag&
                        
                            if Flag& = 4 THEN
                              COutput$ = COutput$ + $SPC + hex$(val("&B" + Jack$), 6)
                              Jack$ = ""
                              Flag& = 0
                            END IF
                          NEXT
                        
                          if len(COutput$) = 0 THEN
                            print hex$(val("&B" + Jack$), len(sTempVar)*2)
                          else
                            print COutput$
                          END IF
                        
                          waitkey$
                        END FUNCTION
                        It outputs the same information that the program does. I feel pretty good about this; but, I'm curious if there's any other takes or approaches on this.
                        Donnie Ewald
                        [email protected]

                        Comment


                        • #13
                          Yes, I think that looks good. It meets the main criterion of all programming: first make it work.

                          Now, if you get into longer encodings, the string concatenation is rapidly, exponentially going to slow the algo. So that is a potential area for optimization.

                          Comment


                          • #14
                            This speeds the string concat. issue:
                            Code:
                            #COMPILE EXE
                            #DIM ALL
                            
                            FUNCTION PBMAIN()
                              LOCAL sTempVar, jack, cOutput AS STRING, flag, i, parts  AS LONG
                            
                              sTempVar = REPEAT$(10000,"HOPEFULLY, THIS WILL MAKE SOME SORT OF SENSE TO SOMEONE. AT THE MOMENT, IT DOESN'T FOR ME. I THINK MICHAEL HAS GIVEN ME SOME GOOD DIRECTION TO START [email protected]")
                              'sTempVar = "LET'S TYPE IN THE ALPHABET AND SOME NUMBERS AND SEE WHAT THAT LOOKS LIKE: ABCDEFGHIJKLMNOPQRSTUVWXYZ [email protected]"
                            
                              Flag& = 0
                              cOutput = SPACE$(LEN(sTempVar) * 2)
                              FOR I& = 1 TO LEN(sTempVar)
                                Jack$ = Jack$ + BIN$(ASC(sTempVar, I&), 6)
                                INCR Flag&
                            
                                IF Flag& = 4 THEN
                                   MID$(cOutput, parts * 7 + 1) = $SPC + HEX$(VAL("&B" + Jack$), 6)
                                   INCR parts
                            
                            '      COutput$ = COutput$ + $SPC + hex$(val("&B" + Jack$), 6)
                                  Jack$ = ""
                                  Flag& = 0
                                END IF
                              NEXT
                            
                              cOutput = RTRIM$(cOutput)
                              IF LEN(COutput$) = 0 THEN
                                PRINT HEX$(VAL("&B" + Jack$), LEN(sTempVar)*2)
                              ELSE
                                PRINT LEFT$(COutput$, 1000)
                              END IF
                                                  
                              WAITKEY$
                            END FUNCTION
                            Last edited by John Gleason; 5 Jul 2009, 08:52 AM. Reason: reduced "parts" calc

                            Comment


                            • #15
                              Packed dates

                              Here is an old piece of code I have been using to save space when all you had to store your data was floppies.

                              Code:
                              'Storing date with a 2 bytes integer by Guy Dombrowski
                              'Day is * by 1000, Year is * by 10, month 1 to 9 is last bit and neg value is used for month 10 to 12
                              
                              #COMPILE EXE
                              #BREAK ON
                               DEFLNG a-z
                              
                              FUNCTION PBMAIN () AS LONG
                                
                               d=31: y=2099: m=12  ' Test data
                               dym=(d*1000)+((y-2000)*10): IF m<10 THEN dym=dym+m ELSE dym=dym+m MOD 10: dym=-dym
                               PRINT "Packed date is "; dym
                              
                               IF dym<0 THEN m=10: dym=-dym ELSE m=0
                               d=dym\1000: dym=dym MOD 1000: y=dym\10: dym=dym MOD 10: m=m+dym: y=y+2000
                               PRINT "Unpacked date is "; USING$("##/##/####",d,m,y)
                              
                               WAITKEY$
                              
                              END FUNCTION
                              Old QB45 Programmer

                              Comment


                              • #16
                                This might help when it comes to unpacking the data:
                                Code:
                                  
                                       6-Bit ASCII         Standard ASCII               6-Bit ASCII         Standard ASCII
                                 
                                 Chr   Dec Hex  Binary    Dec Hex  Binary       Chr   Dec Hex  Binary    Dec Hex  Binary
                                  @    0   00   00 0000   64  40   0100 0000     !    33  21   10 0001   33  21   0010 0001
                                  A    1   01   00 0001   65  41   0100 0001     ”    34  22   10 0010   34  22   0010 0010
                                  B    2   02   00 0010   66  42   0100 0010     #    35  23   10 0011   35  23   0010 0011
                                  C    3   03   00 0011   67  43   0100 0011     $    36  24   10 0100   36  24   0010 0100
                                  D    4   04   00 0100   68  44   0100 0100     %    37  25   10 0101   37  25   0010 0101
                                  E    5   05   00 0101   69  45   0100 0101     &    38  26   10 0110   38  26   0010 0110
                                  F    6   06   00 0110   70  46   0100 0110     `    39  27   10 0111   39  27   0010 0111
                                  G    7   07   00 0111   71  47   0100 0111     (    40  28   10 1000   40  28   0010 1000
                                  H    8   08   00 1000   72  48   0100 1000     )    41  29   10 1001   41  29   0010 1001
                                  I    9   09   00 1001   73  49   0100 1001     *    42  2A   10 1010   42  2A   0010 1010
                                  J   10   0A   00 1010   74  4A   0100 1010     +    43  2B   10 1011   43  2B   0010 1011
                                  K   11   0B   00 1011   75  4B   0100 1011     ,    44  2C   10 1100   44  2C   0010 1100
                                  L   12   0C   00 1100   76  4C   0100 1100     -    45  2D   10 1101   45  2D   0010 1101
                                  M   13   0D   00 1101   77  4D   0100 1101     .    46  2E   10 1110   46  2E   0010 1110
                                  N   14   0E   00 1110   78  4E   0100 1110     /    47  2F   10 1111   47  2F   0010 1111
                                  O   15   0F   00 1111   79  4F   0100 1111     0    48  30   11 0000   48  30   0011 0000
                                  P   16   10   01 0000   80  50   0101 0000     1    49  31   11 0001   49  31   0011 0001
                                  Q   17   11   01 0001   81  51   0101 0001     2    50  32   11 0010   50  32   0011 0010
                                  R   18   12   01 0010   82  52   0101 0010     3    51  33   11 0011   51  33   0011 0011
                                  S   19   13   01 0011   83  53   0101 0011     4    52  34   11 0100   52  34   0011 0100
                                  T   20   14   01 0100   84  54   0101 0100     5    53  35   11 0101   53  35   0011 0101
                                  U   21   15   01 0101   85  55   0101 0101     6    54  36   11 0110   54  36   0011 0110
                                  V   22   16   01 0110   86  56   0101 0110     7    55  37   11 0111   55  37   0011 0111
                                  W   23   17   01 0111   87  57   0101 0111     8    56  38   11 1000   56  38   0011 1000
                                  X   24   18   01 1000   88  58   0101 1000     9    57  39   11 1001   57  39   0011 1001
                                  Y   25   19   01 1001   89  59   0101 1001     :    58  3A   11 1010   58  3A   0011 1010
                                  Z   26   1A   01 1010   90  5A   0101 1010     ;    59  3B   11 1011   59  3B   0011 1011
                                  [   27   1B   01 1011   91  5B   0101 1011     <    60  3C   11 1100   60  3C   0011 1100
                                  \   28   1C   01 1100   92  5C   0101 1100     =    61  3D   11 1101   61  3D   0011 1101
                                  ]   29   1D   01 1101   93  5D   0101 1101     >    62  3E   11 1110   62  3E   0011 1110
                                  ^   30   1E   01 1110   94  5E   0101 1110     ?    63  3F   11 1111   63  3F   0011 1111
                                  –   31   1F   01 1111   95  5F   0101 1111
                                Space 32   20   10 0000   32  20   0010 0000
                                 
                                When converting packed data back to standard Ascii:
                                For characters with a 6-bit value  <&h20 the 8-Bit Ascii value = 6-bit AND &h40.
                                For characters with a 6-bit value =>%h20 the 8-Bit Ascii value = 6-bit.
                                Rgds, Dave

                                Comment


                                • #17
                                  >I wonder if Michael does Mime conversions on his files before he FTP's them.

                                  The underlying Internet is 7 data bits. Thats WHY you have to MIME files to transmit them.. you have to make all your data fit in 7 bits. So if you want to send more than 128 discrete characters, you have to do some kind of encoding/decoding.

                                  Needless to say, it is rare for an end user to even think about this, as all your modern software handles it for you... but trust me, it *is* being handled for you.
                                  Michael Mattias
                                  Tal Systems (retired)
                                  Port Washington WI USA
                                  [email protected]
                                  http://www.talsystems.com

                                  Comment


                                  • #18
                                    Originally posted by Michael Mattias View Post
                                    The underlying Internet is 7 data bits.
                                    AHAHAHAAHHAH!!!!!
                                    Sorry if we dare to not trust you on this!

                                    Bye!
                                    Last edited by Marco Pontello; 6 Jul 2009, 08:29 AM.
                                    -- The universe tends toward maximum irony. Don't push it.

                                    File Extension Seeker - Metasearch engine for file extensions / file types
                                    Online TrID file identifier | TrIDLib - Identify thousands of file formats

                                    Comment


                                    • #19
                                      A long time ago in the internet's infancy there were 7 bit connects but as far as I know, they are pretty rare today. As to MIME encoding files in e-mail, that is because the e-mail transfer protocols were written when there were a lot of 7 bit connections and the protocol itself only guarantees transferring data using 7 bits so to ensure that your data arrives intact, you need to encode it to 7 bits.

                                      HTTP and other protocols do not MIME encode data, they just send the raw 8 bit data.
                                      Jeff Blakeney

                                      Comment


                                      • #20
                                        Originally posted by Michael Mattias View Post
                                        >I wonder if Michael does Mime conversions on his files before he FTP's them.

                                        The underlying Internet is 7 data bits. Thats WHY you have to MIME files to transmit them.. you have to make all your data fit in 7 bits. So if you want to send more than 128 discrete characters, you have to do some kind of encoding/decoding.

                                        Needless to say, it is rare for an end user to even think about this, as all your modern software handles it for you... but trust me, it *is* being handled for you.
                                        Are you serious, you actually do it? I was joking. The internet just moves binary packages, it doesnt care if they are 7 bit, 8 bit or 153bit words in the data section of the package (encryption wouldn't work if it did). The headers have format requiremnts (8 bit) after that forget it.

                                        Comment

                                        Working...
                                        X