Bounds on expansion in LZ'77-like coding
Abstract
We investigate the maximum increase in number of phrases that results from changing k consecutive symbols in a string x having length n parsed using an LZ'77-like algorithm. We consider a class of compression algorithms that partition a sequence into a collection y of nonoverlapping, variable-length phrases and encode them. Each phrase either is a singleton or matches a substring that starts to its left. We show that changing a single symbol of x in position i can yield an expansion that is of order O(n - i)2/3 as (n - i)) → ∞. Our lower bound requires an alphabet size of O(n - i)1/3. We also show that hanging k consecutive symbols starting from position i can yield an expansion having a similar but somewhat more involved form. The paper contains both analytically derived upper and lower bounds, and algorithms for numerically computing tighter bounds. While deriving the bounds, we provide a detailed analysis of how expansion can arise when changing consecutive symbols. This problem is motivated by management policies for computer systems, such as the IBM Memory eXpansion Technology (MXT) or the IBM iSeries compressed disks, that use LZ'77-like coding on small compression units, such as 1-4 kbyte, and store the compressed data in memory or on disk tracks. Here, when a change of a portion of the compression unit occurs, for example, an L2 cache line, or a 512-byte disk sector, the data is recompressed and potentially stored in a different location. Knowing the maximum expansion, rather than the average expansion, is an important factor for designing policies for allocation and management of memory or disk space. © 2006 IEEE.