How does the gzip algorithm work?

The first codes in this array are the literal bytes, which appeared in the input that was supplied to the LZ77 compression algorithm in the first place. These are followed by the special "stop" code and up to 29 length codes, indicating backpointers. Finally, there will be up to 32 distance code lengths (although only 30 distance codes are actually defined). This is a bit dense, but it implements the logic for using the code-lengths Huffman tree to build the two new Huffman trees that can then be used to decode the final LZ77-compressed data.
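
The canonical-code assignment at the heart of that step is short enough to sketch in full. The routine below follows the procedure in RFC 1951, section 3.2.2; the function name and array conventions are mine, not the article's listings:

#define MAX_BITS 15

/* Assign canonical Huffman codes, given only the bit length of each
 * symbol's code, per RFC 1951 section 3.2.2.  Illustrative sketch. */
void assign_codes(const int *lengths, int n, unsigned int *codes) {
    int bl_count[MAX_BITS + 1] = {0};
    unsigned int next_code[MAX_BITS + 1];
    unsigned int code = 0;
    int bits, i;

    /* Count how many codes there are of each bit length */
    for (i = 0; i < n; i++)
        bl_count[lengths[i]]++;
    bl_count[0] = 0;    /* a length of 0 means "symbol not used" */

    /* Compute the smallest code value for each bit length */
    for (bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Hand out numerically consecutive codes within each bit length */
    for (i = 0; i < n; i++)
        if (lengths[i] != 0)
            codes[i] = next_code[lengths[i]]++;
}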

Check to see if it's a literal (a code less than 256). If it's the stop code (exactly 256), the block is complete. If it's a length code (a code greater than 256), a backpointer follows. Interpreting backpointers is the most complex part of inflating gzipped input. Similar to the "sliding scales" that were used by the code-lengths Huffman tree - where a 17 was followed by three bits indicating the actual repeat count, and an 18 was followed by 7 bits - different length codes are followed by variable numbers of extra bits. If the length code is between 257 and 264, subtract 254 from it - this is the length of the backpointed range (3 to 10 bytes), without any extra length bits.
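
Before tackling the extra bits, it's worth seeing the shape of this dispatch in C. Every function named in the sketch below is a hypothetical stand-in for the corresponding routine in the article's listings, shown only to make the control flow concrete:

/* The shape of the main decode loop.  All of the helpers named here
 * are hypothetical stand-ins for the article's listings. */
extern int next_literal_code(void);    /* one code via the literals/lengths tree */
extern int next_distance_code(void);   /* one code via the distances tree */
extern int decode_length(int code);    /* base length plus extra bits, see below */
extern int decode_distance(int code);  /* base distance plus extra bits */
extern void emit_literal(int byte);
extern void copy_match(int distance, int length);

void inflate_block(void) {
    int code;
    while ((code = next_literal_code()) != 256) {   /* 256 is the stop code */
        if (code < 256) {
            emit_literal(code);        /* codes 0-255 are literal bytes */
        } else {
            /* codes 257-285 declare a match length; a distance code
             * always follows immediately */
            int length = decode_length(code);
            int distance = decode_distance(next_distance_code());
            copy_match(distance, length);
        }
    }
}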

Any extra length bits are added to a base value that depends on the length code itself to get the actual length of the range. In this way, very large backpointers can be represented, but the common case of short lengths can be coded efficiently. A length code is always followed by a distance code, indicating how far back in the input buffer the matched range was found.
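
The base values and extra-bit counts are fixed by the deflate specification; the tables below are transcribed from RFC 1951, section 3.2.5. The function names are mine - something like these could sit behind the decode_length and decode_distance placeholders sketched earlier:

/* Length codes 257..285: base lengths and extra-bit counts,
 * from RFC 1951, section 3.2.5. */
static const int length_base[29] = {
      3,   4,   5,   6,   7,   8,   9,  10,  11,  13,
     15,  17,  19,  23,  27,  31,  35,  43,  51,  59,
     67,  83,  99, 115, 131, 163, 195, 227, 258
};
static const int length_extra_bits[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0
};

/* Distance codes 0..29: base distances and extra-bit counts. */
static const int dist_base[30] = {
       1,    2,    3,    4,    5,    7,    9,    13,    17,    25,
      33,   49,   65,   97,  129,  193,  257,   385,   513,   769,
    1025, 1537, 2049, 3073, 4097, 6145, 8193, 12289, 16385, 24577
};
static const int dist_extra_bits[30] = {
    0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
    7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13
};

/* Code 257 means length 3 with no extra bits; code 265 means length
 * 11 or 12, disambiguated by one extra bit; and so on up to code 285,
 * which always means 258. */
int match_length(int length_code, int extra_bits_value) {
    return length_base[length_code - 257] + extra_bits_value;
}

int match_distance(int distance_code, int extra_bits_value) {
    return dist_base[distance_code] + extra_bits_value;
}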

The length codes can represent match lengths from 3 to 258 bytes, and the distance codes distances from 1 to 32,768 bytes back - which means that, while decompressing, it's necessary to keep track of at least the previous 32,768 input characters. Listing 17 should be fairly straightforward to understand at this point - read Huffman codes, one after another, and interpret them as literals or backpointers as described previously.
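
The copy itself deserves a moment's thought, because a match is allowed to overlap the bytes it is producing - a distance of one with a length of ten legitimately repeats a single byte ten times. A minimal body for the copy_match placeholder, with the output buffer passed explicitly, copies byte by byte for exactly this reason:

#include <stddef.h>

/* Resolve one backpointer: copy `length` bytes starting `distance`
 * bytes back in the output produced so far.  The byte-by-byte copy is
 * deliberate: when length exceeds distance, the match overlaps itself
 * and re-reads bytes it has just written. */
void copy_match(unsigned char *output, size_t *out_pos,
                size_t distance, size_t length) {
    size_t from = *out_pos - distance;
    while (length--)
        output[(*out_pos)++] = output[from++];
}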

Note that this routine, as implemented, can only decode a fixed maximum amount of output, and that it assumes that the output is ASCII-formatted text. A more general-purpose gunzip routine would handle larger volumes of data and would return the result to the caller for interpretation.

If you've made it this far, you're through the hard parts. Everything else involved in unzipping a gzipped file is boilerplate. A gzipped file starts with a header that declares it as a gzipped file and supplies some metadata about the file itself. Go ahead and declare a main routine, as shown in listing 19, that expects to be passed the name of a gzipped file, reads the header, and outputs some of its metadata.

I won't go over this in detail; refer to RFC 1952 [5] if you want more specifics. Note that a gzip file must begin with the magic bytes 1F 8B, or it's not considered a valid gzipped file. Finally, the header can optionally be protected by a CRC16, although this is rare. Once the header is read, the compressed data, as described by the routines above, follows. However, there's one extra layer of metadata: a gzipped file actually consists of one or more blocks of deflated data (usually there's just one block, but the file format permits more).
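
To make that concrete, here's a minimal sketch of an RFC 1952 header reader, independent of the article's listing 19 and with only token error handling - it verifies the magic bytes and the compression method, then skips the optional fields:

#include <stdio.h>

#define FTEXT    0x01
#define FHCRC    0x02
#define FEXTRA   0x04
#define FNAME    0x08
#define FCOMMENT 0x10

/* Read past the gzip header (RFC 1952).  Returns 0 on success,
 * -1 if the file doesn't look like a gzip file. */
int read_gzip_header(FILE *in) {
    int c, flags;
    if (fgetc(in) != 0x1f || fgetc(in) != 0x8b)
        return -1;                  /* missing the 1F 8B magic bytes */
    if (fgetc(in) != 8)
        return -1;                  /* compression method must be deflate */
    flags = fgetc(in);
    fseek(in, 6, SEEK_CUR);         /* skip MTIME (4 bytes), XFL and OS */
    if (flags & FEXTRA) {           /* optional extra field, 2-byte length */
        int xlen = fgetc(in);
        xlen |= fgetc(in) << 8;
        fseek(in, xlen, SEEK_CUR);
    }
    if (flags & FNAME)              /* original file name, NUL-terminated */
        while ((c = fgetc(in)) != 0 && c != EOF)
            ;
    if (flags & FCOMMENT)           /* file comment, NUL-terminated */
        while ((c = fgetc(in)) != 0 && c != EOF)
            ;
    if (flags & FHCRC)              /* the rare optional header CRC16 */
        fseek(in, 2, SEEK_CUR);
    return 0;
}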

The blocks themselves can be either uncompressed or compressed according to the deflate specification described previously. Also, in a nod to space efficiency, very small inputs can use a boilerplate set of Huffman codes rather than declaring their own Huffman tables at all. This way, even when declaring targeted Huffman tables would make the compressed data larger than the original input, the data can still be compressed reasonably.

The GZIP file format, then, after the standard header, consists of a series of blocks. Notice that the bit stream declared in listing 11 is initialized here. As you can imagine, this fixed Huffman tree isn't optimal for any particular input data set, but if the input is small (less than a few hundred bytes), it's better to use the less efficient fixed Huffman tree to encode the literals and lengths than to use up potentially a few hundred bytes declaring a more targeted one.
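
To make the block layer concrete, here's a minimal least-significant-bit-first bit reader along with the three header bits that start every deflate block. It parallels the bit stream of listing 11, but is an independent sketch:

#include <stddef.h>

typedef struct {
    const unsigned char *buf;   /* the deflated data */
    size_t byte_pos;            /* index of the current byte */
    int bit_pos;                /* 0..7: next bit within that byte */
} bitstream;

/* Deflate packs its non-Huffman fields least-significant bit first */
int next_bit(bitstream *bs) {
    int bit = (bs->buf[bs->byte_pos] >> bs->bit_pos) & 1;
    if (++bs->bit_pos == 8) {
        bs->bit_pos = 0;
        bs->byte_pos++;
    }
    return bit;
}

int read_bits(bitstream *bs, int n) {
    int i, value = 0;
    for (i = 0; i < n; i++)
        value |= next_bit(bs) << i;   /* low-order bits arrive first */
    return value;
}

/* Every block begins with BFINAL (1 = this is the last block) and a
 * 2-bit BTYPE: 0 = stored, 1 = fixed Huffman codes, 2 = dynamic. */
void read_block_header(bitstream *bs, int *bfinal, int *btype) {
    *bfinal = next_bit(bs);
    *btype  = read_bits(bs, 2);
}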

Notice also that there's no distance tree - when the fixed Huffman codes are used, the distances are always specified in five bits, not Huffman coded. Notice, too, that these hardcoded distances are still subject to "extra bit" interpretation just like the dynamic ones are. That's almost it. Insert a call to "inflate" at the end of the main routine, as shown in the listing.
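
For reference, the fixed literal/length code lengths are spelled out in RFC 1951, section 3.2.6; feeding this array to the same canonical-code routine sketched earlier reproduces the fixed Huffman tree:

/* The fixed literal/length code lengths mandated by RFC 1951,
 * section 3.2.6.  Codes 286 and 287 never appear in compressed data,
 * but they do participate in code construction. */
void fixed_literal_lengths(int lengths[288]) {
    int i;
    for (i = 0;   i <= 143; i++) lengths[i] = 8;
    for (i = 144; i <= 255; i++) lengths[i] = 9;
    for (i = 256; i <= 279; i++) lengths[i] = 7;
    for (i = 280; i <= 287; i++) lengths[i] = 8;
    /* Distances get no tree at all in a fixed-Huffman block: each
     * distance code is just the next five bits read as an integer,
     * still followed by its extra bits. */
}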

In fact, that could be it - except that there's one remaining problem with compressing data. With uncompressed data, one bit error affects at best one byte and at worst a contiguous chunk of data - but a flipped bit in uncompressed data doesn't render the whole document unreadable.

Compressed data, due to its natural push to make the most efficient use of space down to the bit level, is highly susceptible to minor errors. A single bit flipped in transmission will change the meaning of the entire remainder of the document.

Although bit-flipping errors are rarer and rarer on modern hardware, you still want some assurance that what you decompressed is what the author compressed in the first place. Whereas with ASCII text it's obvious to a human reader when the data didn't decompress correctly, imagine a binary document that represents, for instance, weather meter readings.

The readings can vary wildly, and if the document decompressed incorrectly, the reader would have no way to detect the error. For this reason, the gzip format appends both a CRC32 of the uncompressed data and its length to the end of the file; the decompressor can compare both to the output and verify that they're correct before blessing the document as correctly inflated. A CRC is sort of like a checksum, but a more reliable one.
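
Reading those two trailer values back is simple. A sketch, assuming the file holds a single gzip stream so that the trailer occupies the final eight bytes:

#include <stdio.h>
#include <stdint.h>

/* The last eight bytes of a gzip file are the CRC32 of the
 * uncompressed data and its length modulo 2^32 (ISIZE), both stored
 * little-endian (RFC 1952). */
int read_gzip_trailer(FILE *in, uint32_t *crc32, uint32_t *isize) {
    unsigned char t[8];
    if (fseek(in, -8, SEEK_END) != 0 || fread(t, 1, 8, in) != 8)
        return -1;
    *crc32 = t[0] | (t[1] << 8) | ((uint32_t) t[2] << 16) | ((uint32_t) t[3] << 24);
    *isize = t[4] | (t[5] << 8) | ((uint32_t) t[6] << 16) | ((uint32_t) t[7] << 24);
    return 0;
}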

A checksum is a simple mechanism to give the receiver some assurance that what was sent is what was received - the concept is to sum all of the input bytes modulo an agreed-upon base (typically 2^32) and transmit the resulting sum. This is simple for each side to compute quickly, and if the two sides disagree on the result, something has gone wrong in transmission. However, checksums aren't optimal when it comes to detecting bit errors.

If a single bit is flipped, the checksums won't match, as you would want. However, if two bits are flipped, it's possible for them to flip in such a way as to cancel each other out, so that the checksum won't detect the error.
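
Here's a small self-contained demonstration of both the checksum and the cancellation problem; the two flipped bits below change the text but leave the sum untouched:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sum of all input bytes; the "modulo 2^32" happens for free when a
 * 32-bit unsigned integer overflows. */
uint32_t checksum(const unsigned char *data, size_t len) {
    uint32_t sum = 0;
    while (len--)
        sum += *data++;
    return sum;
}

int main(void) {
    unsigned char good[] = "weather";
    unsigned char bad[sizeof(good)];
    memcpy(bad, good, sizeof(good));

    bad[0] ^= 0x04;   /* 'w' (0x77) -> 's' (0x73): subtracts 4 from the sum */
    bad[2] ^= 0x04;   /* 'a' (0x61) -> 'e' (0x65): adds the same 4 back */

    printf("'%s' sums to %u\n", good, (unsigned) checksum(good, 7));
    printf("'%s' sums to %u\n", bad, (unsigned) checksum(bad, 7));   /* identical! */
    return 0;
}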

The CRC, developed by W. Wesley Peterson in 1961 [6], is a more robust check mechanism that is less susceptible to this kind of error cancellation. In a future installment of this series, I'll cover the CRC32 computation and add support for CRC32 validation to the gunzip routine presented here. In the meantime, I'd encourage you to download the source code, use it to decompress a few small documents, and step through the code to see if you can understand what's going on and how it all works.

Of course, if you want to do any real decompression, use a real tool such as GNU gunzip. However, I've found the utility described here very helpful in understanding how the algorithm works and how GZIP compression goes about doing its job. I believe that the code in this article is free from errors, but I haven't tested it extensively - please, whatever you do, don't use it in a production environment; use GNU gzip instead! If you do spot an error, though, please let me know about it.


Imagine that you were dealing with a four-character "alphabet" with four variable-length codes:

1. A: 0
2. B: 1
3. C: 10
4. D: 11

Example 2: Invalid variable-length code assignment

The problem with this assignment is that the codes are ambiguous: the bit sequence 10, for example, could be decoded either as a single C or as a B followed by an A. A valid Huffman coding of the four-character alphabet above, then, is:

1. A: 0
2. B: 10
3. C: 110
4. D: 111

Example 3: Valid prefix-coding variable-length code assignment

This means that there can only be one 1-bit code, one two-bit code, two three-bit codes, four four-bit codes, etc.

Figure 3: a prefix code tree

Once such a tree structure has been constructed, decoding is simple - start at the top of the tree, read in a bit of data, and follow the branch it selects; when a leaf is reached, emit its symbol and start over at the top. The resulting Huffman table is a simple reserved-prefix assignment (example 4). This can be extended to any number of bit-lengths - each code length reserves a prefix that all of the longer codes must begin with.
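
A bare-bones version of that tree walk, hardcoding the valid assignment from example 3 (A = 0, B = 10, C = 110, D = 111) - an illustrative sketch, not the article's code:

#include <stdio.h>

typedef struct node {
    int symbol;               /* meaningful only at a leaf */
    struct node *child[2];    /* child[0] on a 0 bit, child[1] on a 1 bit */
} node;

static node A = {'A', {NULL, NULL}}, B = {'B', {NULL, NULL}};
static node C = {'C', {NULL, NULL}}, D = {'D', {NULL, NULL}};
static node n11  = {0, {&C, &D}};     /* reached after reading "11" */
static node n1   = {0, {&B, &n11}};   /* reached after reading "1"  */
static node root = {0, {&A, &n1}};

/* Follow one branch per input bit; emit a symbol at each leaf and
 * restart from the top of the tree. */
static void decode(const int *bits, int nbits) {
    const node *n = &root;
    int i;
    for (i = 0; i < nbits; i++) {
        n = n->child[bits[i]];
        if (n->child[0] == NULL) {
            putchar(n->symbol);
            n = &root;
        }
    }
}

int main(void) {
    /* 0 10 110 111 0 decodes unambiguously as ABCDA */
    int bits[] = {0, 1,0, 1,1,0, 1,1,1, 0};
    decode(bits, (int) (sizeof(bits) / sizeof(bits[0])));
    putchar('\n');
    return 0;
}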
