Compression in the presence of shared data
Abstract
Lempel-Ziv based compression algorithms find textual resemblance among different parts of the processed file, and replace a later segment in the file by a pointer to an earlier appearance of it in the file. The pointer replacing the segment can be viewed as a reference into a (dynamic) dictionary that includes segments of the already-processed prefix of the file. The longer the segment replaced by a dictionary reference, the larger the contribution made to a good overall compression ratio. Since the reference dictionary includes only segments of the already-processed portion of the file, no pre-existing dictionary is needed nor referenced. Often, though, both, compressor and decompressor, even when they reside on different sites, share knowledge of files (e.g., devices managed by a server, or software customers which are known to hold older releases of the product). For such cases, we suggest to include shared files in the reference dictionary, as well as the already-processed prefix of the file being compressed. Preferably, files to be included are those which resemble the processed file. Extending the Lempel-Ziv dictionary can only lengthen the segments of the processed file that match entries in the dictionary, and can thus improve the achieved compression ratio. Additionally, long matches (and thus good compression) can be expected also for small files, which happens less with the original Lempel-Ziv dictionary that does not built large enough when the processed file is short. Working with a dictionary that is much larger than the original Lempel-Ziv dictionary has two potential drawbacks which we must address: (i) dictionary look-ups are more computationally expensive and (ii) every reference into the dictionary consumes more bits, which can hurt the gained compression. Additionally, there is the challenge of identifying the subset of shared files which resemble the file to be compressed. We address these issues, and conclude, backed by experimental results, that the scheme suggested is advantageous in the presence of shared data. © 2001 Elsevier Science Inc.