BenchmarksΒΆ
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
- dict(unicode words -> word lenghts): about 600M
- list(unicode words) : about 300M
- BaseTrie from datrie library: about 70M
marisa_trie.RecordTrie
: 11Mmarisa_trie.Trie
: 7M
Note
Lengths of words were stored as values in datrie.BaseTrie
and marisa_trie.RecordTrie
. RecordTrie
compresses
similar values and the key compression is better so it uses
much less memory than datrie.BaseTrie
.
marisa_trie.Trie
provides auto-assigned IDs. It is not possible
to store arbitrary values in marisa_trie.Trie
so it uses less
memory than RecordTrie
.
Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):
dict building 2.919M words/sec
Trie building 0.394M words/sec
BytesTrie building 0.355M words/sec
RecordTrie building 0.354M words/sec
dict __getitem__ (hits) 8.239M ops/sec
Trie __getitem__ (hits) not supported
BytesTrie __getitem__ (hits) 0.498M ops/sec
RecordTrie __getitem__ (hits) 0.404M ops/sec
dict get() (hits) 4.410M ops/sec
Trie get() (hits) not supported
BytesTrie get() (hits) 0.458M ops/sec
RecordTrie get() (hits) 0.364M ops/sec
dict get() (misses) 4.869M ops/sec
Trie get() (misses) not supported
BytesTrie get() (misses) 0.849M ops/sec
RecordTrie get() (misses) 0.816M ops/sec
dict __contains__ (hits) 8.053M ops/sec
Trie __contains__ (hits) 1.018M ops/sec
BytesTrie __contains__ (hits) 0.605M ops/sec
RecordTrie __contains__ (hits) 0.618M ops/sec
dict __contains__ (misses) 6.489M ops/sec
Trie __contains__ (misses) 2.047M ops/sec
BytesTrie __contains__ (misses) 1.079M ops/sec
RecordTrie __contains__ (misses) 1.123M ops/sec
dict items() 57.248 ops/sec
Trie items() not supported
BytesTrie items() 11.691 ops/sec
RecordTrie items() 8.369 ops/sec
dict keys() 217.920 ops/sec
Trie keys() 19.589 ops/sec
BytesTrie keys() 14.849 ops/sec
RecordTrie keys() 15.369 ops/sec
Trie.prefixes (hits) 0.594M ops/sec
Trie.prefixes (mixed) 1.874M ops/sec
Trie.prefixes (misses) 1.447M ops/sec
RecordTrie.prefixes (hits) 0.103M ops/sec
RecordTrie.prefixes (mixed) 0.458M ops/sec
RecordTrie.prefixes (misses) 0.164M ops/sec
Trie.iter_prefixes (hits) 0.588M ops/sec
Trie.iter_prefixes (mixed) 1.470M ops/sec
Trie.iter_prefixes (misses) 1.170M ops/sec
Trie.keys(prefix="xxx"), avg_len(res)==415 5.044K ops/sec
Trie.keys(prefix="xxxxx"), avg_len(res)==17 89.363K ops/sec
Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3 258.732K ops/sec
Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 293.199K ops/sec
Trie.keys(prefix="xxx"), NON_EXISTING 1169.524K ops/sec
RecordTrie.keys(prefix="xxx"), avg_len(res)==415 3.836K ops/sec
RecordTrie.keys(prefix="xxxxx"), avg_len(res)==17 73.591K ops/sec
RecordTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3 229.515K ops/sec
RecordTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 269.228K ops/sec
RecordTrie.keys(prefix="xxx"), NON_EXISTING 1071.433K ops/sec
Tries from marisa_trie
are static and uses less memory, tries from
datrie are faster and can be updated.
You may also give DAWG a try - it is usually faster than
marisa-trie
and sometimes can use less memory (depending on data).
Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.