Performance of Delta Huffman Compression Variants
MetadataShow full metadata
This thesis examines an algorithm that is a combination of the Elias Delta and Huffman algorithms known as δ-Huffman. Exploiting the fact that Elias Delta codes are uniquely decodable and tend to allocate fewer bits to small integers, while Huffman coding is a lossless compression method that provides compressed symbols with a bit rate that is relatively close to the entropy of the source.
The thesis “Dynamic Unbounded Integer Compression” by Naga Sai Gowtham Mulpuri explored achievable compression ratios in the context of δ-Huffman encoding and used the frequency of inputs as the estimation for the probability distribution of the δ-Huffman algorithm on synthetic data and real-world posting list data.
Our work focuses on different assumptions about the way to estimate the probability function for the δ-Huffman algorithm. Assumptions such as using a Geometric Probability Mass Function in the estimation of the probability distribution of the δ-Huffman algorithm are made and methods such as slice and reset with the -δ Huffman algorithm are introduced.
We examine the practical and theoretical bounds on the compression capabilities of integer compression methods. Additionally, we devise methods for efficient compression of integers regardless of their specific probability distribution function. The focus of our work has been on the δ-Huffman algorithm’s compression, rather than on its computation complexity.
Twelve experiments with twelve variants of the dynamic δ-Huffman algorithm are performed. The experiments use one or more of three source data sets: synthetic data from several Geometric Probability Mass Functions (GPMF) and Poisson, real-world data from sorted inverted index gaps from Wikipedia, and a benchmark that attempts to represent realistic workload data from Silesia. The entropy estimates of these data sets and the average bit rate of the δ-Huffman algorithms offer a way to construct a comparative evolution of the performance of these algorithms.
From these experiments, the best variant out of the twelve δ-Huffman algorithms for synthetic data, real-world data, and benchmark data are determined.
Finally, given the results of the experiments, assuming GPMF without knowledge concerning the exact value of parameters, the best variant out of the twelve δ-Huffman algorithms is identified.