Skip to content

Latest commit

 

History

History
93 lines (58 loc) · 5.42 KB

README.md

File metadata and controls

93 lines (58 loc) · 5.42 KB

Introduction

This is a lightweight, pure python version of a rANSCoder. A rANSCoder is an entropy coder, which was first described in Jarek Duda's paper (http://arxiv.org/abs/1311.2540). This variant of rANS works on 64 bits and emits 32-bit words at the same time. It is a port of Fabien Giesen 64 bit rANSCoder to python. It should be compatible with the functions rANS64XXXXXX in Giesen's project - i.e. you could theoretically encode or decode with this project, and do the other step via Giesen's code.

Installation

This repository is available via pip. You can install it as follows.

pip install py_rans

Purpose

The focus of this repository is not performance, but rather rapid prototyping. This entropy coder will provide compresion performance which is close to entropy at a reasonable speed. If you want the same compression performance and higher compression speed you should look at a C/C++ library fufilling those needs - I can recommend either Giesen's or Bonfield's implementation of rANS. The following are the Github links to their two projects:

API by simple example

The API is deliberately kept very simple, with a single way to use the rANSCoder. The interface consists of two classes "Encoder" and Decoder".

To encode something, you would do the following:

import rans.rANSCoder as rANS
import random
import numpy as np

encoder = rANS.Encoder()
probs = np.array([0.1, 0.2, 0.3, 0.05, 0.05, 0.15, 0.05, 0.1], dtype=np.float32)
data = np.array([random.randint(0,7) for i in range(10000)], dtype=np.int32)
# Use numpy arrays for better performance! Numba works better with numpy arrays than vanilla python arrays.

for symbol in data:
    encoder.encode_symbol(probs, symbol)
encoded = encoder.get_encoded()
print("The number of bytes needed is: " + str(len(encoded) * 4))

# Encoded is a list of 32-bit words.
# Do things afterwards with encoded - e.g. save it into a file or send it somewhere.

And later, to decode:

import rans.rANSCoder as rANS

# Given: length_decoded, encoded, probs
# Assuming the previous code snippet was how it was encoded, then:
# length_decoded = len(data) 
# encoded = encoded
# probs = probs

decoder = rANS.Decoder(encoded)
decoded_data = []
for _ in range(length_decoded):
    decoded_data.append(decoder.decode_symbol(probs))
    
# decoded_data is the original data, in REVERSE ORDER.

If you run the two code snippets in order, then you will need to call decoded_data.reverse() before running the check data == decoded_data.

API more formally

This python package consists of two classes, Encoder and Decoder. There are also a few helper functions outside of the two classes - but they are not meant to be used as part of the public API - this package is meant to be interfaced with solely through the two classes. All of the code is wrapped in numba wrappers, which compiles the code just in time before use. The first call to any method will take longer than any calls following it.

The Encoder is responsible for taking the uncompressed data and compressing it. It is an entropy coder, so it expects data and a corresponding model (i.e. probability distribution which says how frequent a given symbol is). The Encoder's constructor takes no arguments, so you may simply construct it via encoder = rANS.Encoder().

There are two methods:

Encoder.encode_symbol(probs, symb)

This method takes a symbol and adds it to the state inside of the Encoder which represents the currently compressed data. In order to compress it, the encoder requires a probability distribution which says how likely a given symb will occur. So, for example, given probs = [0.15, 0.8, 0.05], the symbol 0 occurs with 15% probability, 1 occurs with 80% probability, and 2 with 5% probability. So note that symb is a number with the bounds 0 ≤ symb < len(probs).

For this method to work properly, sum(probs) should be close to 1 - i.e. you need to pass a proper probability distribution. Note that you may use different probs for each succesive symbol you encode in an Encoder. This is allowed and intentional, as your model might change while you're encoding.

Encoder.get_encoded()

This method flushes the state in the encoder, and returns an array of 32-bit words which represents the compressed data. Note that you should not use the Encoder after calling this method - if you want to encode more data later on, you should then create a new Encoder.

The Decoder is responsible for taking the compressed data generated by an Encoder, and transforming it back into the original data. Note that the data is output in reversed order. The constructor expects the encoded data as input, i.e. you may construct the Decoder as decoder = rANS.Decoder(encoded_data).

The Decoder only has one method:

Decoder.decode_symbol(probs)

This method takes the probability distribution that the corresponding symbol was encoded with, and retrieves this symbol. The next time this method is called, the next symbol is decoded. Notice that these symbols are decoded in reverse order - i.e. this coder follows a FILO principle - the last symbol encoded is the first symbol decoded.

Further note that the Decoder does not know how many symbols are left - you will either need to have messages/data of fixed length, or you will have to store the length with the compressed data in order to later be able to decompress it.