Part I: A Bit of LSH, from Baby Hamming to Adult Hamming
Problem 1. In this question, you are asked to the implement an Approximate Nearest Neighbours (ANN) data structure for Hamming space, based on the “Baby Approximate Neighbor” (Baby ANN) data structure for Hamming distance seen in class, complemented with the “doubling search”-type reduction (see Problem 4, Tutorial 7) to obtain the final data structure (“Adult” ANN).
Once implemented, you are asked to evaluate your data structure using the ANN Benchmarks, to compare it to at least two existing ANN libraries, and report the results generated by running these tests.
https://github.com/erikbern/ann-benchmarks/# Annoy (Approximate Nearest Neighbors Oh Yeah): https://github.com/spotify/annoy
pip install annoy
# Nearest Neighbor Descent (nndescent), based on pynndescent: https://github.com/brj0/nndescent
pip install nndescent
ann_benchmarks/algorithms. (Experiments will run for a very long time otherwise.)config.yml, for the ANN implementations you kept, the option for hamming is set to disabled: falseann_benchmarks/algorithms. You will need to provide both a suitable config.yml (configuraion details) and module.py (actual implementation) in that folder. Your class must inherit BaseANN, and implement at least the two methods fit (preprocessing the dataset) and query.from ..base.module import BaseANN
class MyCOMPx270HammingLSH(BaseANN):
def __init__(self, metric, n_trees):
# Initialization of your data structure
def fit(self, X: numpy.array) -> None:
"""Fits the ANN algorithm to the provided data.
Note: This is a placeholder method to be implemented by subclasses.
Args:
X (numpy.array): The data to fit the algorithm to.
"""
pass
def query(self, q: numpy.array, n: int) -> numpy.array:
"""Performs a query on the algorithm to find the nearest neighbors.
Note: This is a placeholder method to be implemented by subclasses.
Args:
q (numpy.array): The vector to find the nearest neighbors of.
n (int): The number of nearest neighbors to return.
Returns:
numpy.array: An array of indices representing the nearest neighbors.
"""
return [] # array of candidate indices
def set_query_arguments(self, optional_arg):
# Optional method, to provide arguments to the data structure
# at query time. Must be consistent with how this is specified
# in config.yml: look at other ANN implementations for examples.
def __str__(self):
# Returns whichever name you want to give your data structure
sift-256-hamming, random-xs-16-hamming, random-s-128-hamming, random-l-256-hamming, and word2bits-800-hamming. This can take some time to run!1# Run the benchmark on several datasets
python3 run.py --local --runs 5 --dataset sift-256-hamming
python3 run.py --local --runs 5 --dataset random-xs-16-hamming
python3 run.py --local --runs 5 --dataset random-s-128-hamming
python3 run.py --local --runs 5 --dataset random-l-256-hamming
python3 run.py --local --runs 5 --dataset word2bits-800-hamming
# Plot the results in website/index.html
mkdir website
python3 create_website.py --outputdir website --latex
1Use the option -local if you prefer not to use Docker.
Overview. We implement a Hamming-distance Approximate Nearest Neighbour (ANN) data structure built from two layers:
The implementation is in Python (numpy) and is registered as a new algorithm inside ann_benchmarks/algorithms/ of the standard ANN-Benchmarks framework.
Baby ANN: bit-sampling LSH. Let $d$ denote the bit-length of vectors in $\{0,1\}^d$ and let $\mathrm{HD}(x,y)$ denote the Hamming distance between $x$ and $y$. For a fixed target radius $r$ and approximation $c>1$:
$\Pr[h_a(x)=h_a(y)] = 1 - \mathrm{HD}(x,y)/d$.
$\Pr[\text{signature collision}] = (1-\mathrm{HD}/d)^k$. Setting $p_1:=(1-r/d)^k$ and $p_2:=(1-cr/d)^k$, we have $p_2 \ll p_1$ for $c>1$.
On query $q$, candidates are the union of buckets matched in any of the $L$ tables. Choosing $L=\Theta(n^{\rho})$ with $\rho=\log p_1/\log p_2$ guarantees that, w.h.p., any true near-neighbour at distance $\le r$ is in some bucket and the number of candidates at distance $>cr$ that need to be filtered out is $O(L)$.
Adult ANN: doubling search. A Baby ANN solves only the decision version of $(c,r)$-ANN at a fixed $r$. To support arbitrary queries, we instantiate Baby ANNs for radii $r \in \{r_0, c\,r_0, c^2 r_0, \dots, d\}$ (geometric grid of length $O(\log_c d)$), each tuned by its own $(k,L)$. At query time we sweep $r$ from small to large; the first Baby ANN that returns a candidate within distance $cr$ provides our answer. This corresponds to the ``doubling search'' reduction of Tutorial~7, Problem~4.
Theoretical guarantees (sketch). Let $r^* = \mathrm{HD}(q,\text{nearest neighbour})$. Choosing $r$ over a $c$-multiplicative grid, there exists an index $\ell$ with $r_\ell \le r^* \le c r_\ell$. The Baby ANN at radius $r_\ell$ returns a point at distance $\le c\,r_\ell \le c^2 r^*$ with constant probability; with $T = O(\log(1/\delta))$ independent repetitions per radius, the failure probability drops to $\delta$. Overall:
Implementation summary. The class COMP5270HammingLSH inherits from BaseANN. Two configurable parameters are exposed via config.yml:
n_tables ($L$) — number of hash tables per radius,n_bits ($k$) — bits sampled per signature.At query time, set_query_arguments(num_probes) controls how many candidate buckets to inspect (a soft analogue of $L$ on the query side, mirroring Annoy's search_k). Methods:
class COMP5270HammingLSH(BaseANN):
def __init__(self, metric, n_tables, n_bits):
assert metric == "hamming"
self.L, self.k = n_tables, n_bits
self.tables = [] # one dict per table; signature -> list of indices
self.bit_idx = [] # one np.ndarray per table; k random bit positions
self.X = None # raw data, packed bits
def fit(self, X):
self.X = np.unpackbits(X, axis=1) if X.dtype == np.uint8 else X
n, d = self.X.shape
for _ in range(self.L):
idx = np.random.choice(d, self.k, replace=False)
self.bit_idx.append(idx)
sigs = self.X[:, idx]
# Pack each k-bit signature into a tuple/int for dict key
keys = [tuple(row) for row in sigs]
table = {}
for i, key in enumerate(keys):
table.setdefault(key, []).append(i)
self.tables.append(table)
def query(self, q, n):
q_bits = np.unpackbits(q) if q.dtype == np.uint8 else q
candidates = set()
for table, idx in zip(self.tables, self.bit_idx):
key = tuple(q_bits[idx])
candidates.update(table.get(key, []))
if not candidates:
return []
cand = np.fromiter(candidates, dtype=np.int64)
# rank candidates by Hamming distance
d = np.count_nonzero(self.X[cand] ^ q_bits, axis=1)
order = np.argsort(d)[:n]
return cand[order]
def set_query_arguments(self, num_probes):
self.num_probes = num_probes # reserved for multi-probe extension
def __str__(self):
return f"COMP5270HammingLSH(L={self.L}, k={self.k})"
The doubling-search wrapper is a thin loop that re-uses a small number of Baby ANN instances at radii $\{2,4,8,\dots,d\}$, falling through to the next radius when a level returns an empty candidate set.
Configuration (config.yml). Parameter sweeps run two separate dimensions: $L \in \{5, 10, 20, 40\}$ and $k \in \{8, 16, 24, 32, 48\}$ for the larger datasets, scaled down for $d=16$. These are large enough to fill out the recall--QPS frontier without blowing past Annoy's per-query time.
Empirical evaluation.
The benchmark was run via
python3 run.py --local --runs 3 --algorithm <A> --dataset <D>
python3 create_website.py --outputdir website --latex
on three Hamming-based datasets, random-xs-16-hamming ($d=16$, $n=9{,}900$), random-s-128-hamming ($d=128$, $n=9{,}900$), and random-l-256-hamming ($d=256$, $n=99{,}900$), against annoy and pynndescent as baselines. Note that nndescent (the C++/OpenMP version) does not build under Apple's clang on Apple Silicon, so we substitute its pure-Python sibling pynndescent, which exposes the same NN-Descent graph algorithm through Numba.
Plots. The recall--QPS frontiers (plotted on log-$y$, linear-$x$) are produced directly by create_website.py.
[Figure: Recall--QPS frontier on random-xs-16-hamming.]
[Figure: Recall--QPS frontier on random-s-128-hamming.]
[Figure: Recall--QPS frontier on random-l-256-hamming.]
Discussion. The same qualitative picture holds across all three datasets: pynndescent sits at the top-right (high recall, high QPS), annoy is just below it (comparable recall but $\sim$1 order of magnitude lower QPS), and our LSH traces out a long curve that reaches recall $\sim$0.85--0.90 only at very low QPS.
the small signature universe ($k \le d = 16$, only $2^{16}\approx 65{,}000$ distinct signatures vs.\ $9{,}900$ data points). Both annoy (tree partition on a tiny ground set) and pynndescent (graph traversal in a tightly clustered space) reach near-perfect recall in a single hop, while our LSH must inspect many candidates.
pynndescentsustains $\sim 10^5$ QPS at recall $\ge 0.97$, whereas our LSH peaks at $\sim 1.5\cdot 10^2$ QPS at recall $0.88$. Our $(k,L)$ design space at $d=128$ is much larger but Numba-accelerated graph descent dominates.
but the LSH curve gains a longer high-recall tail. Where the graph and tree methods are forced into a smaller QPS range by the larger candidate sets, the LSH's recall-throughput trade-off becomes a smoother continuum.
Anything notable.
(one pass over the data per table), versus annoy's $O(n\log n)$ tree construction and pynndescent's Numba-JIT compile + graph descent. In wall-clock terms our index was always built first, but this advantage does not translate to query speed.
extra_probesparameter (0/1/2-bit flips of the signature) is what produced the long portion of our curve --- without it, recall plateaus very early at moderate QPS. This is consistent with the multi-probe LSH literature (Lv et al., VLDB 2007).
beyond recall $\approx 0.88$ on the random datasets, because the only way to push recall further is to flip $\ge 3$ bits per probe, which exponentially explodes the candidate set. Both baselines plateau gracefully near recall $1$.
$k{=}16$ to $k{=}24$ on random-l-256) shifts the entire curve by an order of magnitude in QPS at fixed recall, reflecting the steep dependence of $p_1, p_2$ on $k$ in the transition regime. annoy (tunable by n_trees, search_k) is far more forgiving.
The headline finding is therefore not surprising: a textbook bit-sampling LSH is honest about what it does --- it gives a wide recall--throughput trade-off with predictable theoretical guarantees --- but it is dominated, often by an order of magnitude in QPS at matched recall, by both modern tree (annoy) and graph (pynndescent) methods on bit-vector datasets up to $d=256$.
Part II: Prelude
You have recently discovered (a variant of) D&D, and you have gone a bit crazy about the game you are currently playing: your goal is to create the best possible character. In particular, you want to find a strategy to max out your $K$ characteristics (STRENGTH, DEXTERITY, CONSTITUTION, INTELLIGENCE, WISDOM, CHARISMA, …, POTATO-THROWING), by carefully obtaining the best set of magical items available. You want to play a character as powerful overall as possible, with no strong feelings about which strengths they’ll have, so your goal is to maximise the total number of points $P$ it has across all $K$ characteristics. Quite sadly, your character has no baseline skills, so you start at 0 everywhere.
Fortunately, the Dungeon Master has prepared a set of $n$ such magical items, ranging from “flaming sword speaking in a nasal voice” to “pineapple-shaped invisibility ring:” item $i$ is represented by an $(K+2)$-tuple
$$a_i = (i, c_i, x_{i,1}, x_{i,2}, \ldots, x_{i,K})$$
where $i \in [n]$ is just the identifier of the item, $c_i \in \{1, 2, \ldots, C\}$ is its price in gold coins, and $x_{i,j} \in \{0, 1, 2, \ldots, B\}$ is the number of bonus points the item confers to the $j$-th characteristic. We will assume that $n \gg B, C, K$, but you cannot assume any of these are constant. getting your hands on some of these magical items, you hope to power up your character!
Problem 2. As a starter, we would like to see how many items you can afford, in order to maximise $P$. Things are not exactly cheap in the fantasy world, sadly, so you start with a budget of $M$ gold coins ($M$ is not assumed to be a constant.). That is, you want to buy a subset $S \subseteq [n]$ items, maximising
$$P(S) = \sum_{i \in S} \beta(a_i)$$
where $\beta(a_i) = \sum_{j=1}^{K} x_{i,j}$; subject to $\sum_{i \in S} c_i \le M$. Oh, and you’re not going to try and solve any NP-Hard problem, so you’re happy to get a “good enough” set $S$ – say, within a constant factor of the optimal choice of magical items.
(a) What NP-Hard problem does the above formulation correspond to?
(b) Consider the following algorithm for the problem, assuming that the $n$ items $\{a_i\}_{i \in [n]}$ are given as input to the algorithm:
1: Discard all items $a_i$ such that $c_i > M$
2: Sort the remaining $n' \le n$ items by non-increasing ratio $\frac{\beta(a_i)}{c_i}$
3: Let $k$ be the maximum number of items (when considering items according
to that ordering) whose total price does not exceed $M$. Call the resulting set
of $k$ items $S_1$. ▷ $S_1$: largest possible prefix of items (in the sorted sequence)
4: if $k < n'$ then, let $S_2 \leftarrow \{a_{k+1}\}$
5: else $S_2 \leftarrow \varnothing$
6: return the best of $S_1, S_2$. ▷ The one with largest $P(S_i)$.
Analyse the running time of this algorithm.
(c) Prove that it outputs a 2-approximation to the optimum solution.
(d) Prove that considering $S_2$ is necessary: that is, give a counterexample showing that $P(S_1)$ can be arbitrarily worse than the optimal solution.
(e) What is the maximum cardinality of a solution returned by the algorithm above? How many bits does it take to represent it?
(f) Give a 2-approximation streaming algorithm for the problem, where the $n$ items are provided one at a time in a streaming fashion. For full marks, your algorithm must have space complexity $O(\log n)$ whenever $M, K = O(1)$ and $C, B \le n^{100}$.
(a) What NP-Hard problem does the above formulation correspond to?
This corresponds to the 0/1 Knapsack problem. Set
$$v_i=\beta(a_i),\qquad w_i=c_i,\qquad W=M.$$
Then the problem is
$$\max \sum_{i\in S}v_i \qquad\text{subject to}\qquad \sum_{i\in S}w_i\le W.$$
(b) Running time of the given algorithm.
1: Discard all items $a_i$ such that $c_i>M$
2: Sort the remaining $n'\le n$ items by non-increasing ratio $\dfrac{\beta(a_i)}{c_i}$
3: Let $k$ be the maximum number of items (in that ordering) whose total price does not exceed $M$; call this set $S_1$.
$\triangleright$ $S_1$: largest possible prefix of items (in the sorted sequence)
4: if $k<n'$ then, let $S_2\leftarrow\{a_{k+1}\}$
5: else $S_2\leftarrow\emptyset$
6: return the best of $S_1,S_2$.
$\triangleright$ the one with largest $P(S_i)$
Step 1 scans all $n$ items: $O(n)$. Step 2 sorts at most $n$ items: $O(n\log n)$. Step 3 scans the sorted list once: $O(n)$. Steps 4–6 take at most $O(n)$, depending on whether $P(S_1)$ is computed during the scan or afterwards. Therefore the total running time is
$$\boxed{O(n\log n)}.$$
(c) Prove that it outputs a 2-approximation to the optimum solution.
Let $\mathrm{OPT}$ be the optimal value of the original 0/1 Knapsack problem. Since Fractional Knapsack is a relaxation of Knapsack,
$$\mathrm{OPT}\le \mathrm{OPT}_{\mathrm{frac}}.$$
By greedy optimality for Fractional Knapsack,
$$\mathrm{OPT}_{\mathrm{frac}} = P(S_1)+\alpha_{k+1}P(S_2) \qquad\text{for some } 0\le \alpha_{k+1}<1.$$
Thus
$$\mathrm{OPT}\le P(S_1)+\alpha_{k+1}P(S_2)\le P(S_1)+P(S_2)\le 2\max\{P(S_1),P(S_2)\}.$$
The algorithm returns the better of $S_1$ and $S_2$, so
$$P(S_{\mathrm{alg}})=\max\{P(S_1),P(S_2)\}\ge \tfrac12\,\mathrm{OPT}.$$
Hence the algorithm is a 2-approximation.
(d) Prove that considering $S_2$ is necessary (counterexample).
Let $M\gg n$. For $i=1,\dots,n-1$, set $c_i=1,\ \beta(a_i)=1$; for the last item, set $c_n=M,\ \beta(a_n)=M-1$. Then
$$\frac{\beta(a_i)}{c_i}=1\quad(i<n), \qquad \frac{\beta(a_n)}{c_n}=\frac{M-1}{M}<1,$$
so the algorithm considers $a_1,\dots,a_{n-1}$ first. Since $\sum_{i=1}^{n-1}c_i=n-1\le M$, we get $S_1=\{a_1,\dots,a_{n-1}\}$, so $P(S_1)=n-1$. But $\{a_n\}$ is feasible ($c_n=M\le M$) with $P(\{a_n\})=M-1$, so $\mathrm{OPT}\ge M-1$. Therefore
$$\frac{P(S_1)}{\mathrm{OPT}}\le \frac{n-1}{M-1}\xrightarrow[\;M/n\to\infty\;]{}0.$$
So $P(S_1)$ can be arbitrarily worse than the optimal solution; thus considering $S_2$ is necessary.
(e) Maximum cardinality, and bits to represent it.
Since each item has cost $c_i\ge 1$, any feasible solution can contain at most $M$ items; there are also only $n$ items total. Hence the maximum cardinality is
$$\boxed{\min(M,n)}.$$
To represent the solution, store the indices of the selected items; each index needs $\lceil \log_2 n\rceil$ bits. Therefore the total number of bits is
$$\boxed{O(\min(M,n)\log n)}.$$
(f) Give a 2-approximation streaming algorithm with space $O(\log n)$ when $M,K=O(1)$ and $C,B\le n^{100}$.
Maintain a set $T$ of at most $M+1$ valid items. For each arriving item $a_i$: if $c_i>M$, discard $a_i$; otherwise insert $a_i$ into $T$, and if $|T|>M+1$, delete from $T$ the item with smallest ratio $\dfrac{\beta(a_i)}{c_i}$. At the end, run the algorithm from part (b) on $T$ and output the result.
Since every item has $c_i\ge 1$, any feasible prefix $S_1$ has size at most $M$. Therefore the algorithm from part (b) only needs the first $M+1$ items in the ratio ordering: the $M$ possible items in $S_1$, plus $a_{k+1}$ for $S_2$. Thus keeping the top $M+1$ items by ratio is enough to reproduce the same output as part (b). Hence the algorithm is a 2-approximation.
Space: $|T|\le M+1$, and each stored item needs $O(\log n+\log C+K\log B)$ bits, so the total space is $O\bigl(M(\log n+\log C+K\log B)\bigr)$. If $M,K=O(1)$ and $C,B\le n^{100}$, then $\log C=O(\log n)$ and $\log B=O(\log n)$, so the total space is
$$\boxed{O(\log n)}.$$
Money can’t buy you space
As it turns out, money is not the only issue, even in fantasy worlds: there are only so many items your character can carry.
Problem 3. In this problem, we assume that your character can only carry a very small number of items, denoted $k$. (It may not be a constant, still, but it’s considerably smaller than, say, $n$.) You are a little worried about what you can achieve: you would really like to avoid a scenario where your character has zero points in many characteristics, and as a result would die horribly when fighting a snail or being asked a riddle by a baby Sphinx. So you want to select a subset $S \subseteq [n]$ of $k$ items to maximise the number of characteristics (out of the $K$) that have at least one bonus point:
$$Q(S) = \sum_{j=1}^{K} \mathbf{1}_{\sum_{i \in S} x_{i,j} \ge 1}$$
You are so worried you no longer care about cost. But, of course, the Dungeon Master still only reveals the available items one at a time during the first game, and you cannot remember all of it: so you want a good streaming algorithm, which allows you to pick a subset $S$ of $k$ items whose value $Q(S)$ does reasonably well compared to an optimal choice: that is, compared to $Q^* := \max_{\substack{S \subseteq [n] \\ |S| = k}} Q(S)$.
(a) Suppose you know the optimal value $Q^*$, and consider the following streaming algorithm, where for an item $a_i$ we let $\mathcal{B}(a_i) := \{j \in [K] : x_{i,j} \ge 1\}$ denote the set of characteristics the item $a_i$ provides a bonus for.
1: Set $S \leftarrow \varnothing$, $\mathcal{B} = \varnothing$.
2: for all $1 \le i \le n$ do
3: if $|S| < k$ and $|\mathcal{B}(a_i) \setminus \mathcal{B}| \ge Q^*/2k$ then
4: $S \leftarrow S \cup \{i\}$
5: $\mathcal{B} \leftarrow \mathcal{B} \cup \mathcal{B}(a_i)$
6: return $S$
That is, we add an item to our set if (1) we still have space for it, and (2) it gives a bonus to at least $Q^*/2k$ characteristics without any bonus yet.
Analyse the space complexity of the algorithm (as a function of $n, K, k, Q^*$).
(b) Show that the algorithm provide a 2-approximation: that is, that the set $S \subseteq [n]$ it outputs satisfies both (i) $|S| \le k$ and (ii) $Q(S) \ge \frac{1}{2} Q^*$.
(c) Of course, we do not know $Q^*$! Suppose now that the algorithm was given instead a value $\tilde{Q}$ such that
$$Q^* \le \tilde{Q} \le (1+\varepsilon) Q^*$$
for some $0 < \varepsilon < 1$. Show that the output $S$ is still a $(2 + O(\varepsilon))$-approximation.
(d) From the above, give (and analyse) a one-pass streaming algorithm for the problem providing a 2.001-approximation, and using space
$$O(k \log K \cdot \log n + K \log^2 K)$$
(e) Assuming $k = O(1)$, what does it become for $K = O(1)$? $K = O(\log n)$?
(a) Space of the given algorithm. $S$: at most $k$ indices, $O(k\log n)$ bits. $\mathcal{B}\subseteq[K]$: a $K$-bit indicator, $K$ bits. Threshold $Q^*/2k$ and counters: $O(\log Q^*)$ bits. Total $O(k\log n + K + \log Q^*)$.
(b) 2-approximation. $|S|\le k$ holds by the guard $|S|<k$ in line~3.
For (ii) consider the optimal set $S^*$ with $Q(S^*)=Q^*$.
Case 1: $|S|=k$. Each of the $k$ added items contributed $\ge Q^*/2k$ new characteristics, so $|\mathcal{B}|\ge k\cdot Q^*/2k = Q^*/2$. Since $Q(S)=|\mathcal{B}|$, $Q(S)\ge Q^*/2$.
Case 2: $|S|<k$. Then for every $a_i\notin S$ at the time it arrived, $|\mathcal{B}(a_i)\setminus\mathcal{B}|<Q^*/2k$ (otherwise it would have been added). In particular for the items of $S^*$ that were not picked, each contributed strictly less than $Q^*/2k$ new characteristics relative to the running $\mathcal{B}$. Therefore
$$ Q^* = Q(S^*) \;\le\; |\mathcal{B}\cap\mathcal{B}(S^*)| + \sum_{i\in S^*\setminus S} |\mathcal{B}(a_i)\setminus \mathcal{B}| \;<\; |\mathcal{B}| + k\cdot \tfrac{Q^*}{2k} \;=\; Q(S) + Q^*/2, $$
giving $Q(S) > Q^*/2$. ∎
(c) With $\tilde Q\in[Q^*,(1+\varepsilon)Q^*]$. Run the same algorithm with threshold $\tilde Q/2k$ in line~3.
Case 1: $|S|=k$. $|\mathcal{B}|\ge k\cdot\tilde Q/2k = \tilde Q/2 \ge Q^*/2$.
Case 2: $|S|<k$. Each item not picked contributed $<\tilde Q/2k$ new characteristics, so
$$ Q^* < Q(S) + k\cdot\tfrac{\tilde Q}{2k} \;\le\; Q(S) + (1+\varepsilon)Q^*/2, $$
yielding $Q(S) > (1-(1+\varepsilon)/2)Q^* = \tfrac{1-\varepsilon}{2}Q^*$.
Hence the approximation factor is $\dfrac{2}{1-\varepsilon} = 2 + O(\varepsilon)$. ∎
(d) One-pass 2.001-approximation. We do not know $Q^*$ but $1\le Q^*\le K$. Pick a small constant $\varepsilon$ with $\frac{2}{1-\varepsilon}\le 2.001$ (e.g.\ $\varepsilon=10^{-3}$). Run, in parallel, the algorithm of part~(b) for each guess
$$ \tilde Q \;\in\; \bigl\{(1+\varepsilon)^0,(1+\varepsilon)^1,\dots,(1+\varepsilon)^L\bigr\}\quad\text{with } L=\lceil\log_{1+\varepsilon}K\rceil. $$
For some $\ell$, $(1+\varepsilon)^\ell\in[Q^*,(1+\varepsilon)Q^*]$, and the corresponding instance produces a $(2+O(\varepsilon))$-approximation by~(c). At end-of-stream, output the $S$ with largest $Q(S)$ across instances.
Space. $L=O(\log K/\varepsilon)=O(\log K)$ instances. Each uses $O(k\log n + K + \log K)$ bits. Picking the best instance also requires holding $K$ bits per running $\mathcal{B}$. Total
$$ O\bigl(k\log K\cdot\log n \;+\; K\log K\bigr) \;\subseteq\; O\bigl(k\log K\cdot\log n + K\log^2 K\bigr). $$
(e) Special cases with $k=O(1)$.
Gotta catch’ em all
So far, so good! Except that there is another way to get these magical items, one that does not involve gold. Just energy, and magic. That is, your DM just revealed that to obtain an item $a_i$ providing a total of $\beta(a_i) = \sum_{j=1}^{K} x_{i,j}$ bonus points, a character can spend $\beta(a_i)^2$ “magic energy” units2 and get the item “for free.” Which sounds pretty good to you! Your character has a lot of magic energy, and in the game you found an inn where you can buy refills for cheap … so you’d like to know how much you’d need to get all of the items.
2Energy grows quadratically, sadly physics exists even in fantasy land.
Problem 4. In this problem, you goal is to obtain, for a given (small) input parameter $\varepsilon \in (0, 1]$, a $(1+\varepsilon)$-factor estimate of the total energy
$$E := \sum_{i=1}^{n} \beta(a_i)^2$$
required to obtain all items “via magic.” You want the estimate $\hat{E}$ to be correct with probability at least 99%.
The Dungeon Master has changed things a little bit, however: they are no longer describing the $n$ items fully one at a time. Instead, your character has to go and discuss with many non-player characters, discover old scrolls and relics, decipher ancient glyphs in order to figure out what the properties of each item are, one piece at a time. This is formalised as follows: at time step $t$, you discover information
$$(i_t, j_t, x_{i_t, j_t}) \in [n] \times [K] \times \{0, 1, \ldots, B\}$$
that is, “the bonus $x_{i_t, j_t}$ that item $i_t$ gives for characteristic $j_t$.”3 So this comes as a stream of length $m$, for some $m$ to be discussed later, and the pieces of information arrive in some arbitrary order, based on how the game goes. You only get to observe the stream once, of course – you don’t play the same game again.
Remembering some class on streaming you took, you decide to solve this using the following algorithm, based on your recollection: you are missing a few details, though, which correspond to some strange symbols…
1: procedure ALGOFROMMEMORY
2: Pick a hash function $h: [n] \to \{-1, 1\}$ from a ♣-wise independent family
3: for all $1 \le t \le m$ do
4: Get information $(i_t, j_t, x_{i_t, j_t})$
5: Update $\hat{e} \leftarrow \hat{e} + x_{i_t, j_t} \cdot h(\spadesuit)$
6: $\hat{E} \leftarrow \hat{e}^2$
7: return $\hat{E}$
8: Set $T \leftarrow \diamondsuit$
9: Run $T$ independent instances of ALGOFROMMEMORY ★ on the stream
10: Let $\hat{E}_1, \ldots, \hat{E}_T$ be the $T$ independent outputs at the end of the stream
11: return ■.
3You can assume that you only discover each piece of information once: you don’t see twice the same triple. Different triples can correspond to the same item $i$, just not also for the same $j$.
(a) Give the value of $m$, assuming you discover all information about all $n$ items.
(b) What should ♠ be? What should ★ be (“sequentially”, “in parallel”, “on different passes”)?
(c) Assuming ♣ is a “suitably large integer,” compute the expectation of the value $\mathbb{E}[\hat{E}]$ returned by any given instance of ALGOFROMMEMORY.
(d) Assuming ♣ is a “suitably large integer,” upper bound the variance of the value $\mathrm{Var}[\hat{E}]$ returned by any given instance of ALGOFROMMEMORY.
(e) What is the minimum value that ♣ can be for your analysis to go through? Explain why.
(f) Should ■ be (1) $\mathrm{median}(\hat{E}_1, \hat{E}_2, \ldots, \hat{E}_T)$, (2) $\frac{1}{T} \sum_{j=1}^{T} \hat{E}_j$, or (3) either works? Explain why.
(g) Analyse the accuracy of the estimate ■, as a function of $T$. What should be $T$ set to, to obtain a $(1+\varepsilon)$-factor estimate of $E$ with probability at least 99%?
(h) Analyse the space complexity of the whole streaming algorithm.
The algorithm is the AMS $F_2$-sketch applied to the per-item frequencies $\beta(a_i)=\sum_j x_{i,j}$. The stream gives one update per $(i,j)$ pair contributing $x_{i,j}$ to $\beta(a_i)$.
(a) $m$. Each item has $K$ characteristics, each appearing in at most one triple, so $m\le nK$ (and $m=nK$ if every triple, including zeros, is reported).
(b) $\spadesuit$ and $\star$. $\spadesuit = i_t$ (apply $h$ to the item index, not the characteristic). $\star =$ ``in parallel'' on the same single pass over the stream.
(c) Expectation. Group updates by item: at the end,
$$ \hat e \;=\; \sum_{t=1}^m x_{i_t,j_t}\,h(i_t) \;=\; \sum_{i=1}^n \beta(a_i)\,h(i). $$
Using 2-wise independence of $h$ ($\mathbb{E}[h(i)]=0$, $\mathbb{E}[h(i)h(j)]=0$ for $i\ne j$, $\mathbb{E}[h(i)^2]=1$):
$$ \mathbb{E}[\hat E] =\mathbb{E}\bigl[\hat e^{\,2}\bigr] =\sum_{i,j}\beta(a_i)\beta(a_j)\,\mathbb{E}[h(i)h(j)] =\sum_i \beta(a_i)^2 = E. $$
(d) Variance bound. With 4-wise independence, $\mathbb{E}[h(i_1)h(i_2)h(i_3)h(i_4)]$ is nonzero only when each index appears with even multiplicity, contributing
$$ \mathbb{E}[\hat e^{\,4}] \;=\; \sum_i \beta_i^4 + 3\sum_{i\ne j}\beta_i^2\beta_j^2 \;=\; 3 E^2 - 2\sum_i\beta_i^4 \;\le\; 3E^2. $$
Hence
$$ \mathrm{Var}[\hat E] \;=\; \mathbb{E}[\hat e^{\,4}] - E^2 \;\le\; 2E^2. $$
(e) Minimum value of $\clubsuit$. Variance bound requires controlling $\mathbb{E}[h(i_1)h(i_2)h(i_3)h(i_4)]$, which needs 4-wise independence. So $\boxed{\clubsuit = 4}$. 2- and 3-wise independence give the right expectation but not a usable variance bound.
(f) $\blacksquare$ should be the median. With $T$ independent copies of an unbiased estimator with $\mathrm{Var}[\hat E]/E^2 = O(1)$, simply averaging needs $T=\Omega(1/(\varepsilon^2\delta))$ to give $(1+\varepsilon)$-error with probability $1-\delta$ (Chebyshev). The standard variance-and-confidence trade-off uses median of means: average groups of size $T_1=\Theta(1/\varepsilon^2)$ to drive the per-group failure probability below $1/4$, then take the median of $T_2=\Theta(\log(1/\delta))$ such averages. Median converts a constant per-group failure probability into an exponentially small one via Chernoff. Pure mean works (option 2) but with worse $\delta$-dependence. Median of means (which the problem labels ``median'') gives the best total bound.
(g) Setting $T$ for $(1+\varepsilon)$-error w.p.\ $\ge 99\%$. With $\delta=0.01$:
$$ T \;=\; T_1\cdot T_2 \;=\; \Theta(1/\varepsilon^2)\cdot \Theta(\log(1/\delta)) \;=\; \Theta(\log(1/\delta)/\varepsilon^2) \;=\; \Theta(1/\varepsilon^2). $$
(h) Total space. Per instance: counter $\hat e$ in $O(\log(nB^2))=O(\log n)$ bits and a 4-wise independent hash with seed $O(\log n)$ bits. $T=O(1/\varepsilon^2)$ copies in parallel: total $\boxed{O(\log n / \varepsilon^2)}$ bits.
Rand, and Rand We Go
The first few games went well, but after a whole quest you realise your character is a bit too predictable, and that turned out to be an issue in many of the side quests. For the next games, you decide to come up with one much less standard, much more… random. Starting with how you build the character, and select their equipment among the available items.
The Dungeon Master has in the meantime come up with new potential magical items and removed a few, based on what worked or not in the previous quest. So there may now be fewer, or more, than the $n$ items from before: somewhere between $n/2$ and $3n$, maybe?
Problem 5. Your new character will only be able to carry $k$ magical items, as in Problem 2: however, you no longer care about selecting the best magical items. No, what you will do is choose $k \ll n$ items, out of the $\frac{n}{2} \le N \le 3n$ possible items, uniformly and independently at random. But how?
As before, the items are described to you in arbitrary order, in a stream of length $N$. Your goal is to, at the end of the stream, output a set $S \subseteq [N]$ of exactly $k$ items, such that $S$ is distributed uniformly at random among all $k$-size subsets of $[N]$.
(a) Describe and (briefly) argue correctness and space complexity of an algorithm for this task, assuming you know $N$.
(b) But you don’t know $N$… Instead, consider the following one-pass streaming algorithm:
Require: : parameter $k$ ▷ Smaller than the stream length
1: Initialise an array $A$ of size $k$.
2: for all $1 \le t \le k$ do ▷ At first, store the first $k$ items
3: Observe item $a_t = (i_t, c_t, x_{i_t, 1}, \ldots, x_{i_t, K})$
4: $A[t] \leftarrow i_t$
5: for all $k+1 \le t \le N$ do
6: Draw $X_t \sim \mathrm{Bern}(\frac{k}{t})$ independently of previous steps
7: if $X_t = 1$ then
8: Draw $R_t$ uniformly in $\{1, \ldots, k\}$, independently of previous steps
9: $A[R_t] \leftarrow i_t$
10: return $A$
What is its space complexity?
(c) Prove that at the end of any step $k+1 \le t \le N$, each item $i \in \{i_1, \ldots, i_t\}$ is in $A$ with probability $\frac{k}{t}$.
(d) Explain how to implement step 6, given access at step $t$ to a source of independent, uniformly random bits. How many bits does your method require in expectation?
(e) Conclude that the algorithm does solve the problem.
(a) Algorithm when $N$ is known. Floyd's algorithm: maintain $S=\emptyset$; for $t=N-k+1,\dots,N$, draw $r$ uniform in $\{1,\dots,t\}$; if $i_r\notin S$, set $S\gets S\cup\{i_r\}$, else $S\gets S\cup\{i_t\}$. This produces a uniformly random $k$-subset.
A simpler alternative is reservoir sampling (the algorithm in part~(b) actually applies in this case as well).
Space. Storing $k$ item indices uses $O(k\log N)$ bits.
(b) Space of the streaming algorithm. Array $A$ of $k$ slots, each holding an index of $\le \lceil\log N\rceil$ bits, plus a counter $t$: total $O(k\log N)=O(k\log n)$ bits.
(c) Marginal probability $k/t$. Induct on $t$.
Base $t=k$. $A=\{i_1,\dots,i_k\}$ deterministically; each item is in $A$ with probability $1=k/k$.
Step $t\to t+1$. Assume each $i_s\in\{i_1,\dots,i_t\}$ is in $A$ with probability $k/t$ at the end of step $t$. At step $t+1$:
$$ \Pr[i_s\in A\text{ at end of }t+1] = \tfrac{k}{t}\cdot\bigl(1-\tfrac{1}{t+1}\bigr) = \tfrac{k}{t}\cdot\tfrac{t}{t+1} = \tfrac{k}{t+1}. $$
∎
(d) Implementing $X_t\sim\mathrm{Bern}(k/t)$ from uniform random bits. Read $\lceil\log_2 t\rceil$ bits to obtain $U\in\{0,1,\dots,2^{\lceil\log t\rceil}-1\}$. If $U\ge t$, reject and resample; else output $\mathbb{1}[U<k]$.
Expected bits. Acceptance probability $\ge t/2^{\lceil\log t\rceil}\ge 1/2$, so the expected number of rejections is $\le 1$, and total bits per draw is $\le 2\lceil\log_2 t\rceil = O(\log t)$.
(e) Conclusion. By~(c), at the end of the stream every item is in $A$ with marginal probability $k/N$. Since the stream order does not affect the algorithm beyond the index $t$ that an item arrived at, all items are exchangeable in the joint distribution of $A$. Combined with $|A|=k$ deterministically and exchangeability, the joint distribution of $A$ is uniform over the $\binom{N}{k}$ size-$k$ subsets, which is exactly the required output distribution. ∎
notes/COMP5270_A3.md (renders on GitHub). 题目 = 官方 A3 feedback 题面;解答 = 本人提交原文(SID 540882649,数学逐字保留)。MathJax 在线加载。A1 / A2 待补。