Let's look at the trivial approach, which in many cases can be optimal:
It would store:
- The number of elements in the range (with 31 bits, since lists in Java are limited to 2^31 - 1 elements).
- Followed by 1 bit for each index, indicating whether the element is present in the set or not.
This is the implementation that the first version of the fractal image compressor had, to store the data from the splits and merges of the triangulations.
Although it wasn't optimal, since as iterations of the splits progressed, the number of elements to be stored decreased, becoming a very small percentage of the total, in which case the trivial encoding can be substantially improved.
Now the goal is to improve that trivial encoding, to see if a more compressed encoding in terms of the number of bits used can be obtained.
First approach: When the number of elements in the set is much smaller than the size of the range, we can consider storing the indices present.
In this case, the usage would be:
- 31 bits for the number of elements in the range.
- 31 bits for the number of elements present in the set.
- 31 bits * the number of elements present in the set.
Second approach. We improve the number of bits per stored element:
- 31 bits for the number of elements in the range. The number of bits of the largest element in the range (numBitsElem) would be calculated.
- 5 bits to store numBitsElem. So, if the number of elements in the range were 1023, a 10 would be stored in these 5 bits.
- numBitsElem bits, to store the number of elements in the Set.
- numBitsElem bits * the number of elements present in the Set.
Well, little by little we are getting closer to a better encoding...
Third approximation. We realize that if we sort the set in ascending order, the deltas (differences between consecutive elements) can likely be encoded with fewer bits than an absolute element of the range:
- 31 bits for the number of elements in the range.
- We calculate the smallest numBitsElem to encode an element. We also calculate the list of deltas (list of differences between consecutive elements of the sorted list of elements in the set). And finally, we also calculate numBitsDelta
- 5 bits to store numBitsElem.
- 5 bits to store numBitsDelta.
- numBitsElem bits, to store the number of elements in the Set.
- numBitsDelta bits * the number of elements in the Set.
Well, it seems we're already reaching the optimal encoding, right? ...
Let's continue ...
Fourth approximation. We focus on the fact that there can be a much larger number of smaller deltas, and perhaps only a few much larger ones, that break the `numBitsDelta`.
Under these conditions, we might find a `numOptimumBitsDelta` smaller than the maximum `numMaxBitsDelta`, which in turn is less than or equal to `numBitsElem`.
But how do we encode a delta element that exceeds `numOptimumBitsDelta`?
Well... since deltas will never be 0 (there are no repeated elements in a Set), we can use the value 0 to indicate that it's a special value.
And that 0 would be followed by the value of the delta, but encoded this time with `numMaxBitsDelta`. The total number of bits could be calculated as follows:
- 31 bits for the number of elements in the range. We calculate the smallest numBitsElem to be able to encode one element. We also calculate the list of deltas (list of differences between consecutive elements of the ordered list of elements in the Set).
- 31 bits, to store the number of elements in the Set.
- 5 bits to store numOptimumBitsDelta.
- 5 bits to store numMaxBitsDelta.
- numOptimumBitsDelta bits * the number of elements in the Set.
- numMaxBitsDelta bits * the number of deltas that exceed numOptimumBitsDelta.
Have we already reached the optimal encoding?
Fifth approximation. Well... since we have already programmed all this... We could think about the complementary situation... Would it be possible that encoding the complement of the set with the same algorithm would produce more compressed results?
The complement of the set is understood as another set, disjoint from the original, whose union with the original set is the entire range.
That is, the complementary set would have all the elements of the range absent in the original set, and none of those in the original set.
The beauty is that from a set, it is very easy to obtain the complementary set, and therefore we could obtain the original set, having encoded the complementary one.
The obvious application is a set that has all the elements of the range except one, or a few...
Well... if we store only the absent ones (the complementary set), it could be stored in a much smaller size, right?
The encoding in this case could be like this:
- 31 bits for the number of elements in the range. We calculate the smallest numBitsElem to encode an element. We also calculate the list of deltas (list of differences between consecutive elements of the ordered list of elements in the Set).
- 1 bit, to store whether we store the original Set, or the complement. This is the extra bit.
- 31 bits, to store the number of elements in the Set.
- 5 bits to store numOptimumBitsDelta.
- 5 bits to store numMaxBitsDelta.
- numOptimumBitsDelta bits * the number of elements in the Set.
- numMaxBitsDelta bits * the number of deltas that exceed numOptimumBitsDelta, in the positions where they appear.
Well... now we're talking.
So that's where I got stuck... But I'm aware that if all those wonders can be done, what couldn't be done by looking for other patterns, or simply using a lossless compressor, like zip or 7z...
That'll have to wait, on the back burner for next time...