Bloom Filter Interface
probabilistic with one-sided errors:
* if the key is not in the set, contains will return false
(no false negatives)
* if the key is in the set, contains may return true or false
(false positives are possible)
10 bits for 1% false positive
Multiple Hash Functions
number of elements in set: n
* filter size (in bits): m
* number of hash functions: k
* insert sets all k bits to 1, remove not possible
* contains returns true if all k bits are 1, false otherwise
Blocked Bloom Filter
Count-Distinct Estimation: Basic Idea
Count-Distinct Estimation: problem and solution
two problems:
* using only one maximum leads to estimates with high variance
* a single outlier that happens to have many leading zeros will lead to severe
overestimation
solution:
* use multiple sketches (maxima), partitioned by hash
Succinct data structure: space consumption
space bound L
2L is compact
L + √L is succinct
L + 3 is implicit
Level-Ordered Unary Degree Sequence (LOUDS)
Fast Succinct Trie (FST)
FST: LOUDS-Dense
four arrays:
* 256-bit sequence D-Labels: branch present?
* 256-bit sequence D-HasChild: child or leaf?
* 256-bit sequence D-IsPrefixKey: is the key a prefix?
* sequence for the values (D-Values)
FST: LOUDS-Sparse
four arrays:
* byte sequence S-Labels: records all the branching labels for each trie node
* bit sequence S-HasChild: inner or leaf node?
* bit sequence S-LOUDS: 1 bit for each label, indicates node boundaries
* sequence for the values (S-Values)
Learned Indexes
k: number of hash functions
n: number of elements in set
m: filter size in bits
blocked bloom filter
Learned indexes: issues