Rejection sampling, order statistics, and Dungeons and Dragons

$\newcommand{\E}{\operatorname{E}}\newcommand{\Xmin}{X_{(1)}}\newcommand{\Fmin}{F_{(1)}}\newcommand{\fmin}{f_{(1)}}\newcommand{\eff}{\operatorname{eff}}$In this post I’m going to look at two probability questions that arose in the context of a friendly game of Dungeons and Dragons.

(1): Supposing I only have 6-sided dice, how can I replicate rolls of dice with a different number of sides, and what’s the efficiency of this procedure? For example, if I only have a 6-sided die, how can I mimic a single roll of a 4-sided or 20-sided die?

(2): One operation that comes up during character creation is to roll four 6-sided dice, discard the smallest value, and sum the remaining values. What’s the expected value of this process?

Question (1) will lead to a discussion of how to uniformly sample from a set given uniform samples from another set. Question (2) will involve computing the expected value of the minimum of a discrete uniform sample.

Question 1

I’ll suppose that I have a $p$-sided die and I want to mimic a $q$-sided die for $p,q\in\mathbb N$. I’ll use the notation $J_k = \{1,\dots,k\}$ so I can reframe this problem as asking how I can use $X\sim\text{Unif}(J_p)$ to produce samples that are $\text{Unif}(J_q)$.

The main theorem that I’ll be using here is as follows:

Result 1: Rejection sampling.

Let $S$ be a Borel set and let $\nu$ be the counting measure if $S$ is finite and the Lebesgue measure otherwise. Assume $\nu(S) < \infty$.

For a random variable $X$ supported on $S$, I’ll define $X$ to be uniformly distributed over $S$ iff for any Borel $A$
$$
P(X\in A) = \frac{\nu(A\cap S)}{\nu(S)}
$$
(also note that $S$ can’t be countably infinite as $\sigma$-additivity implies there is no uniform distribution over such a set). My use of $\nu$ here unifies the discrete and continuous uniform distributions.

Claim: if $X\sim\text{Unif}(S)$ and $B$ is a Borel set with $P(X\in B) > 0$ then
$$
X\mid X\in B \sim \text{Unif}(B \cap S).
$$
Pf: let $A$ be Borel. Then
$$
\begin{aligned}
& P(X\in A \mid X\in B) = \frac{P(X \in A\cap B)}{P(X \in B)} \\
&= \frac{\nu(A\cap B\cap S)/ \nu(S)}{\nu(B\cap S) / \nu(S)} = \frac{\nu(A\cap B\cap S)}{\nu(B\cap S)}.
\end{aligned}
$$
$B$ and $S$ are fixed so by definition $X\mid X\in B\sim\text{Unif}(B\cap S)$.

$\square$

With this result in hand I can immediately produce samples that are $\text{Unif}(J_q)$ when $q\leq p$ by just taking $S = J_p$ and $B = J_q$. For example, if $p=6$ and $q=4$ I can sample uniformly over $J_4$ by sampling uniformly over $J_6$ and rejecting realizations that don’t land in $J_4$. This is how I could replicate a 4-sided die using a 6-sided die.

Just to manually confirm the computations here: supposing $X\sim\text{Unif}(J_6)$, then for $j \in J_4$
$$
P(X = j \mid X\in J_4) = \frac{P(X = j)}{P(X \in J_4)} = \frac{1/6}{4/6} = \frac 14
$$
so $X\mid X\in J_4$ is uniform over $J_4$.
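
Here’s a minimal Python sketch of this rejection step (the helper names are just illustrative, not from any particular library):

```python
import random

def roll_d6():
    """One roll of a fair six-sided die: uniform over {1, ..., 6}."""
    return random.randint(1, 6)

def roll_d4_from_d6():
    """Rejection sampling: re-roll the d6 until it lands in J_4 = {1, 2, 3, 4}."""
    while True:
        x = roll_d6()
        if x <= 4:
            return x

# sanity check: each of 1..4 should appear with frequency about 1/4
samples = [roll_d4_from_d6() for _ in range(100_000)]
print({v: samples.count(v) / len(samples) for v in range(1, 5)})
```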

I’ll define the efficiency of this procedure as the probability of accepting a realization, so
$$
P(X\in J_q) = \frac{|J_q|}{|J_p|} = \frac qp.
$$
In my example above my efficiency is $\frac qp = \frac 23$.

But if $q > p$ I’ll need to change my strategy so that I’m initially sampling uniformly over some larger envelope $S \supseteq J_q$ and then I can reject down as before. I can do this by rolling my die multiple times to sample uniformly over the Cartesian product of $J_p$ with itself. For example, if $p=6$ but $q=20$, so I’m trying to mimic a 20-sided die, I can roll my die twice to sample uniformly over $J_6^2$ (which has 36 elements) and then I can embed $J_{20}$ in $J_6^2$ after which I can use rejection sampling. This embedding won’t be unique but both sets are well-ordered so I can just enumerate them and pair elements up until I exhaust $J_q$.
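
Here’s a sketch of how this could look for $p=6$ and $q=20$, using the pairing $(a,b)\mapsto 6(a-1)+b$ as one possible enumeration of $J_6^2$ (the specific enumeration is my choice and any bijection would do):

```python
import random

def roll_d20_from_d6():
    """Embed J_20 in J_6^2: enumerate the 36 ordered pairs (a, b) as
    6*(a - 1) + b in {1, ..., 36} and reject anything above 20."""
    while True:
        a, b = random.randint(1, 6), random.randint(1, 6)
        value = 6 * (a - 1) + b        # uniform over {1, ..., 36}
        if value <= 20:
            return value
```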

This procedure works but it can be inefficient. For this analysis I’m imagining $p$ being fixed, since the motivation is that I have a $p$-sided die available, and I’ll increase $q$ and see what happens. I sample from the smallest $J_p^r$ big enough to have $J_q$ embedded in it, so as a function of $q$ the efficiency is
$$
\eff(q) = \frac{q}{\min \{p^r : r\in\mathbb N \text{ and } p^r \geq q\}}.
$$
The denominator is essentially a discrete logarithm: I’m looking for the smallest integer $r$ such that $p^r \geq q$, i.e. $r \geq \log_p q$, so I can write this as
$$
\eff(q) = \frac{q}{p^{\lceil\log_pq\rceil}}.
$$
On every interval like $(p^{r-1}, p^{r}]$, $\eff$ increases linearly with a slope of $p^{-r}$ and attains a maximum of $1$ at $q=p^{r}$, after which the efficiency drops to $\frac{p^r + 1}{p^{r+1}} =p^{-1} + p^{-r-1}$ when the power of the denominator increases by one, corresponding to needing to increase the dimension of the sample space by $1$. Thus $\eff$ has a sawtooth-like shape with the slopes of the teeth decreasing exponentially so the number of steps it takes to reach the next $q$ with maximal efficiency also increases exponentially (because these are exactly when $q = p^r$ for $r\in\mathbb N$). This is shown in the figure below.

[Figure: the sawtooth efficiency curve $\eff(q)$ as a function of $q$.]
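
A curve like this can be computed directly from the definition; here’s a short sketch that uses integer arithmetic to avoid floating-point issues with $\lceil\log_p q\rceil$:

```python
def eff(q, p=6):
    """Efficiency q / p**r, where r is the smallest integer with p**r >= q."""
    envelope = p
    while envelope < q:
        envelope *= p
    return q / envelope

# sawtooth: efficiency 1 at q = 6, 36, 216, ... and roughly 1/6 just past each of those
print([round(eff(q), 3) for q in (4, 6, 7, 20, 36, 37, 216, 217)])
```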

Since $\eff(q) =1$ infinitely often I have $\limsup_q \eff(q) = 1$, and the efficiency drops to a local minimum of $p^{-1} + p^{-r-1}$ on the step after every local maximum, so this shows $\liminf_q \eff(q) = p^{-1}$. This means that the smaller $p$ is, the better the worst-case efficiency is, which makes sense because if $p$ is small then a particular $q$ of interest is likely to be closer to the next larger power of $p$, which is when I have good efficiency (i.e. the smaller the positive gap $p^r - q$ tends to be, the better things will be for me).

This analysis doesn’t include the fact that if I can only produce a single $X\sim\text{Unif}(J_p)$ at a time, then it takes $r$ draws to produce a single sample in $J_p^r$, so even if the probability of acceptance is bounded from below by $p^{-1}$ it’ll still take me longer and longer since it’ll take more and more rolls to produce a single initial sample from $\text{Unif}(J_p^r)$. In particular, let $N$ be the number of tuples I produce until one is accepted. The probability of acceptance is $\eff(q)$ so I’ll have
$$
N \sim \text{Geom}(\eff(q))
$$
i.e. $N$ follows a geometric distribution with success probability $\eff(q)$ (I’m using a geometric distribution with a support of $\{1,2,\dots\}$). Each tuple requires $\lceil \log_p q\rceil$ rolls, so I’ll actually be rolling $N \cdot \lceil \log_p q\rceil$ dice, and the expected number of $\text{Unif}(J_p)$ rolls I have to make is
$$
\E\left[N \cdot \lceil \log_p q\rceil\right] = \frac{\lceil \log_p q\rceil}{\eff(q)} = \frac{\lceil \log_p q\rceil p^{\lceil \log_p q\rceil}}{q} \geq \lceil \log_p q\rceil \underset{q\to\infty}{\to} \infty.
$$
When the efficiency is maximized I’ll need exactly $\log_p q$ rolls since I’m guaranteed to accept, but in the worst case of $\eff(q) \approx p^{-1}$ I’ll need around $p \lceil \log_p q\rceil$ rolls. In either case this quantity is $\Omega(\log q)$ where I’m using big-$\Omega$ notation.

To conclude this section, the probability of accepting a particular sample from $J_p^r$ is bounded from below by $p^{-1}$ but because I have to roll more and more dice to produce each $J_p^r$ sample, the total number of die rolls I need to make grows logarithmically as $q$ increases even though the probability of accepting doesn’t decrease below a certain point.
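
Here’s a rough Monte Carlo check of the expected roll count for the $p=6$, $q=20$ example (the helper functions are just illustrative):

```python
import random

def envelope_power(q, p):
    """Smallest r with p**r >= q, computed with integer arithmetic."""
    r, envelope = 1, p
    while envelope < q:
        r += 1
        envelope *= p
    return r

def rolls_until_accept(q, p=6):
    """Total number of individual d_p rolls used to produce one accepted Unif(J_q) draw."""
    r = envelope_power(q, p)
    rolls = 0
    while True:
        rolls += r
        value = 0
        for _ in range(r):             # encode the r rolls as one value in {0, ..., p**r - 1}
            value = value * p + random.randint(1, p) - 1
        if value + 1 <= q:             # shift to {1, ..., p**r} and accept if it lands in J_q
            return rolls

q, p = 20, 6
r = envelope_power(q, p)
sims = [rolls_until_accept(q, p) for _ in range(100_000)]
print(sum(sims) / len(sims))           # Monte Carlo estimate of the expected roll count
print(r * p**r / q)                    # exact value from the formula: 2 * 36 / 20 = 3.6
```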

There’s one quick extension of this that makes it a little more efficient. Let $d$ be a divisor of $p$ so $p = dm$ for some $m\in\mathbb N$. I can group the elements of $J_p$ into $d$ groups of size $m$, which I’ll label $1,2,\dots,d$, and I then return the group label of whatever group $X\sim\text{Unif}(J_p)$ lands in. This lets me directly sample from $J_d$ for any $d\mid p$. So now instead of needing to only consider sampling from envelopes of size $p^r$, I can now sample from envelopes with a size given by any finite product of factors of $p$. If $p$ is prime this doesn’t help, but the more highly divisible $p$ is, the closer I’ll be able to get to being just barely above $q$. In my earlier example with $p=6$ and $q=20$, the best I could do was to embed $J_q$ in $J_6^2$ so my efficiency is $\frac{20}{36} \approx 56\%$. But using this strategy I can instead use $J_2\times J_2 \times J_6$ as my envelope, and this only has $24$ elements so now my efficiency is up to $\frac{20}{24}\approx 83\%$. The downside to this is that the actual implementation will be more complicated.
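
Here’s a sketch of one way to implement this for $p=6$ and $q=20$, using the parity of a d6 roll as the group label for each $J_2$ factor (the particular grouping and enumeration are my choices):

```python
import random

def roll_d2_from_d6():
    """Split J_6 into two groups of three (odd vs even) and return the group label."""
    return 1 if random.randint(1, 6) % 2 == 1 else 2

def roll_d20_via_d2_d2_d6():
    """Envelope J_2 x J_2 x J_6 (24 equally likely outcomes), rejected down to J_20."""
    while True:
        a, b, c = roll_d2_from_d6(), roll_d2_from_d6(), random.randint(1, 6)
        value = (a - 1) * 12 + (b - 1) * 6 + c    # uniform over {1, ..., 24}
        if value <= 20:
            return value
```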

 

Question 2

I’ll now move to the second topic of this post. As a reminder, I want to compute the expected value of the result of rolling several independent dice, removing the lowest value, and summing the rest. In general this is a problem of the form
$$
\E \sum_{i=k}^n X_{(i)}
$$
where $X_{(i)}$ denotes the $i$th order statistic of the sample $(X_1,\dots,X_n)$. In the special case where $k=2$, which is where I’ll be focusing my attention, I can simplify this problem by noting that
$$
S := \sum_{i=2}^n X_{(i)} = \left(\sum_{i=1}^n X_i\right) - X_{(1)}
$$
so by the linearity of expectation
$$
\E S = n \E X_1 - \E X_{(1)}.
$$
This makes life easier because $\E X_1$ will be easy (it’s just the expected value of a discrete uniform distribution) and then I just have to deal with $\Xmin$ rather than arbitrary order statistics.

In general, for the minimum of an iid sample I have
$$
\begin{aligned}
&\Fmin(x) = P(X_{(1)} \leq x) \\
&= 1 - P(\Xmin > x) \\
&= 1 - P(X_1 > x, X_2 > x, \dots, X_n > x) \\
&= 1 - P(X_1 > x)^n \\
&= 1 - [1 - F(x)]^n.
\end{aligned}
$$

I’ll now prove the following result which I’ll use to compute $\E \Xmin$. This result is great because it spares me from needing to compute $\fmin$ and I can just use $\Fmin$, which I already have.

Result 2: for a non-negative integer-valued RV $X$,
$$
\E X = \sum_{x=0}^\infty P(X > x).
$$
Pf: $X$ is discrete so the existence of the Radon-Nikodym derivative $f := \frac{\text dP_X}{\text dc}$ is guaranteed (since $c \gg P_X$ and $c$ is $\sigma$-finite on a countable space), where $c$ denotes the counting measure and $P_X$ is the distribution of $X$. This lets me write
$$
\sum_{x=0}^\infty P(X > x)= \sum_{x=0}^\infty \int_{(x, \infty)} f\,\text dc = \sum_{x=0}^\infty \sum_{t=x+1}^\infty f(t).
$$
$f \geq 0$ so by Tonelli’s theorem I can switch the sums. I’m summing over a triangle in the first quadrant so when I switch the sums the limits have $t = 1,2,3,\dots$ as the outer sum and the inner sum has $x = 0, 1,\dots,t-1$. Thus
$$
\sum_{t=1}^\infty \sum_{x=0}^{t-1} f(t) = \sum_{t=1}^\infty t f(t) = \sum_{t=0}^\infty t f(t) = \E X.
$$

$\square$
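
As a quick sanity check of Result 2, here’s a small computation on an arbitrary non-negative integer-valued distribution (the pmf is just something I made up for the check):

```python
from fractions import Fraction

# a toy non-negative integer-valued distribution, chosen arbitrarily for the check
pmf = {0: Fraction(1, 8), 1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 8)}

direct = sum(x * p for x, p in pmf.items())
tails = sum(sum(p for t, p in pmf.items() if t > x) for x in range(max(pmf)))
print(direct, tails)   # both 13/8
```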

I have $X_i \stackrel{\text{iid}}\sim\text{Unif}(J_k)$, which means $\Xmin$ is also non-negative and integer-valued. I worked out $\Fmin$ in terms of $F$ so I have everything I need to use this result to get $\E \Xmin$.

$$
\begin{aligned}
& \E \Xmin = \sum_{x\geq 0} P(\Xmin > x) \\
&=\sum_{x\geq 0} \left[1 - \Fmin(x)\right] \\
&= \sum_{x\geq 0} [1 - F(x)]^n.
\end{aligned}
$$
Now I can use the fact that the CDF of a discrete uniform over $J_k$ has the form $F(x) = \frac xk$ on $J_k$ so
$$
\begin{aligned}
&\sum_{x\geq 0} [1 - F(x)]^n = \sum_{x=0}^k \left[1 - \frac xk\right]^n \\
&= k^{-n}\sum_{x=0}^k \left[k-x\right]^n \\
&= k^{-n}\sum_{j=1}^k j^n.
\end{aligned}
$$
This shows that $\E \Xmin$ can be computed from the sum of the $n$th powers of the first $k$ positive integers.
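
Before deriving a closed form, this identity can be sanity-checked by brute force over all $k^n$ equally likely outcomes; here’s a small sketch (the function names are illustrative):

```python
from fractions import Fraction
from itertools import product

def expected_min_brute_force(k, n):
    """Average the minimum over all k**n equally likely outcomes."""
    total = sum(min(roll) for roll in product(range(1, k + 1), repeat=n))
    return Fraction(total, k ** n)

def expected_min_power_sum(k, n):
    """E[X_min] = k^{-n} * sum_{j=1}^{k} j^n."""
    return Fraction(sum(j ** n for j in range(1, k + 1)), k ** n)

print(expected_min_brute_force(6, 4), expected_min_power_sum(6, 4))   # both 2275/1296
```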

I’ll introduce the notation
$$
T_k^{(n)} = \sum_{j=1}^k j^n
$$
so in my case I need a formula for $T_k^{(4)}$. I’ll be able to derive it in terms of $T_k^{(j)}$ for $j\leq 3$, so I’ll need closed forms for those. The following lemma gives these.

Lemma:
$$
T_k^{(n)} = \begin{cases}
\frac{k(k+1)}2 & n=1 \\
\frac{k(k+1)(2k+1)}6 & n=2 \\
\frac{k^2(k+1)^2}4 & n=3
\end{cases}
$$
Pf: For $T_k^{(1)}$ I’ll use a geometric proof. Consider a $(k+1)\times (k+1)$ grid of unit squares. This figure has an area of $(k+1)^2$. I can split this figure into three pieces: the diagonal, which has an area of $k+1$, and the upper and lower triangles. The upper and lower triangles each have an area of
$$
1 + 2 + \dots + k = T_k^{(1)}
$$
so this means
$$
\begin{aligned}
&(k+1)^2 = 2T_k^{(1)} + k+1 \\
&\implies T_k^{(1)} = \frac{k(k+1)}{2}.
\end{aligned}
$$
For $T_k^{(2)}$ and $T_k^{(3)}$ I’ll just use induction. Trivially the formula holds for $k=1$ in both cases, so I’ll assume it holds for $1,\dots, k$ and I’ll prove it for $k+1$:
$$
\begin{aligned}
& T_{k+1}^{(2)} = T_{k}^{(2)} + (k+1)^2 \\
&= \frac{k(k+1)(2k+1) + 6(k+1)^2}6 \\
&= \frac{(k+1)(k+2)(2k+3)}{6} \\
&= \frac{(k+1)([k+1]+1)(2[k+1]+1)}{6}
\end{aligned}
$$
which establishes that result, and
$$
\begin{aligned}
& T_{k+1}^{(3)} = T_{k}^{(3)} + (k+1)^3 \\
&= \frac{k^2(k+1)^2}4 + (k+1)^3 \\
&= (k+1)^2 \left[\frac{k^2 + 4k + 4}{4}\right] \\
&= \frac{(k+1)^2(k+2)^2}{4}
\end{aligned}
$$
and that result is established too. These inductive proofs aren’t particularly insightful but they at least get the job done.

$\square$

Result 3:
$$
T_k^{(4)} = \frac{k^5}5 + \frac{k^4}2 + \frac {k^3}3 - \frac k{30}.
$$

Pf: this is a standard telescoping-sum argument. I can express
$$
(k+1)^5 - 1
$$
as a telescoping sum with
$$
(k+1)^5 - 1 = \sum_{j=1}^k \left[(j+1)^5 - j^5\right].
$$
Expanding $(j+1)^5 - j^5$ (perhaps via the binomial theorem) I have
$$
(k+1)^5 - 1 = \sum_{j=1}^k \left[5j^4 + 10j^3 + 10j^2 + 5j + 1\right]
$$
and now I can see a 4th power in there. Distributing the sum I have
$$
(k+1)^5 - 1 = 5 T_k^{(4)} + 10 T_k^{(3)} + 10 T_k^{(2)} + 5 T_k^{(1)} + k
$$
so I can solve for $T_k^{(4)}$ to get
$$
5T_k^{(4)} = (k+1)^5 - 10 T_k^{(3)} - 10 T_k^{(2)} - 5 T_k^{(1)} - k - 1.
$$
Plugging in my equations from the previous lemma I have
$$
\begin{aligned}
& 5T_k^{(4)} = (k+1)^5 - \frac{5k^2(k+1)^2}{2} - \frac{5k(k+1)(2k+1)}{3} - \frac{5k(k+1)}{2} - (k+1) \\
&= \frac{k+1}6\left[6(k+1)^4 - 15k^2(k+1) - 10k(2k+1) - 15k - 6\right] \\
&= \frac{k+1}6\left[6k^4+9k^3+k^2-k\right] \\
&= \frac 16\left[6k^5 + 15k^4 + 10k^3 - k\right] \\
&\implies T_k^{(4)} = \frac{k^5}{5} + \frac{k^4}2 + \frac{k^3}3 - \frac k{30}
\end{aligned}
$$
as desired. $k$ was arbitrary so this proves the result.

$\square$
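
A quick check of this closed form against direct summation, using exact rational arithmetic:

```python
from fractions import Fraction

def T4(k):
    """Closed form T_k^(4) = k^5/5 + k^4/2 + k^3/3 - k/30."""
    k = Fraction(k)
    return k ** 5 / 5 + k ** 4 / 2 + k ** 3 / 3 - k / 30

assert all(T4(k) == sum(j ** 4 for j in range(1, k + 1)) for k in range(1, 200))
print(T4(6))   # 2275 = 1 + 16 + 81 + 256 + 625 + 1296
```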

Applying this result to $\E \Xmin$ with $n=4$ (since my actual problem involves rolling 4 dice) I have
$$
\E \Xmin = k^{-n}T_k^{(n)} = k^{-4}\left(\frac{k^5}{5} + \frac{k^4}2 + \frac{k^3}3 - \frac k{30}\right).
$$
When I plug in $k=6$ I obtain $\E \Xmin = \frac{2275}{1296} \approx 1.755$.

To finish this off, I can also use the above lemma to easily compute $\E X_1$ when $X\sim\text{Unif}(J_k)$:
$$
\E X_1 = \sum_{x=1}^k x \frac{1}{k} = \frac 1k T_k^{(1)} = \frac{k+1}{2}.
$$
This means
$$
\begin{aligned}
& \E S = n \E X_1 - \E \Xmin \\
&= \frac{n(k+1)}{2} - k^{-n}T_k^{(n)} \\
&= 4(3.5) - \frac{2275}{1296} = \frac{15869}{1296} \approx 12.2446.
\end{aligned}
$$
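
Here’s a small simulation of the 4d6-drop-lowest procedure as a check on this exact value (the number of trials is arbitrary):

```python
import random

def roll_4d6_drop_lowest():
    """Roll four d6, drop the smallest, sum the rest."""
    rolls = [random.randint(1, 6) for _ in range(4)]
    return sum(rolls) - min(rolls)

n_trials = 200_000
estimate = sum(roll_4d6_drop_lowest() for _ in range(n_trials)) / n_trials
print(estimate, 15869 / 1296)   # the estimate should land near 12.2446
```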

This gave me an exact result, but I can derive an approximation by noting that
$$
\begin{aligned}
& \E \Xmin = k^{-n}\sum_{j=1}^k j^n = \sum_{j=1}^k \left(\frac jk\right)^n \\
&= k \sum_{j=1}^k \frac 1k \cdot \left(\frac jk\right)^n \\
&\approx k \int_0^1 x^n\,\text dx \\
&= \frac{k}{n+1}
\end{aligned}
$$
when $k$ is sufficiently large. The plot below shows the error of this approximation.

[Figure: relative error of the $k/(n+1)$ approximation to $\E\Xmin$ as a function of $k$.]
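
A plot like the one above can be produced by comparing the exact value with $k/(n+1)$; here’s a sketch of the comparison for a few values of $k$ (the particular values are arbitrary):

```python
from fractions import Fraction

def exact_expected_min(k, n):
    """E[X_min] = k^{-n} * sum_{j=1}^{k} j^n for n iid Unif(J_k) draws."""
    return Fraction(sum(j ** n for j in range(1, k + 1)), k ** n)

n = 4
for k in (6, 20, 100, 1000):
    exact = exact_expected_min(k, n)
    approx = Fraction(k, n + 1)
    print(k, float(abs(approx - exact) / exact))   # relative error shrinks as k grows
```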

As an aside, I never ended up needing the pmf of $\Xmin$ since I was able to compute the expected value directly from $\Fmin$. But if I did need the pmf I could have gotten it in the following way. When $X_1$ has a continuous distribution I can differentiate to get
$$
\fmin(x) = n(1-F(x))^{n-1}f(x).
$$
For a discrete distribution I can still differentiate but now I’m taking discrete derivatives:
$$
P(\Xmin = x) = \fmin(x) = \Fmin(x) - \Fmin(x-1) = \frac{\Fmin(x) - \Fmin(x-1)}{x - (x-1)}.
$$
If the support of $X$ is bounded from below, which is the case here, I can view this in the following way. WLOG I’ll take $1 = \min\{x\in\mathbb Z : P(X = x) > 0\}$, i.e. I’m assuming $1$ is the minimum of the support of $X$ (and this minimum is well defined by my assumption of the support being bounded below and the well-ordering of the natural numbers). Then
$$
\Fmin(1) = \fmin(1)
$$
and
$$
\Fmin(2) = \fmin(1) + \fmin(2)
$$
and so on, so in general
$$
\Fmin = L \fmin
$$
where the linear operator $L$ is given by the infinite matrix
$$
L = \begin{bmatrix}
1 & 0 & 0 & 0 & \cdots \\
1 & 1 & 0 & 0 & \cdots \\
1 & 1 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \ddots &
\end{bmatrix}.
$$
By inspection I can see that I can turn $L$ into the identity by sequentially differencing the columns, so $L$ is invertible with
$$
L^{-1} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 &\cdots \\
-1 & 1 & 0 & 0 & 0 & \cdots \\
0 & -1 & 1 & 0 & 0 & \cdots \\
0 & 0 & -1 & 1 & 0 & \cdots \\
\vdots & \vdots & & \ddots & &
\end{bmatrix}.
$$
This means
$$
\fmin = L^{-1}\Fmin
$$
which leads to
$$
\fmin(x) = \Fmin(x) - \Fmin(x-1)
$$
as before. The cumulative summation done by $L$ is exactly integration w.r.t. the counting measure, so unsurprisingly $L^{-1}$ is the discrete differentiation corresponding to differentiating w.r.t. the counting measure. This is just a confirmation of the fact that there is a bijection between PMFs and CDFs (and since PMFs are densities w.r.t. the counting measure, there’s no issue with almost everywhere equivalence since the only null set of the counting measure is $\emptyset$).
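
For completeness, here’s a sketch of computing the pmf of $\Xmin$ by differencing $\Fmin$ in exactly this way and confirming it recovers $\E\Xmin$ for the 4d6 case:

```python
from fractions import Fraction

def F_min(x, k=6, n=4):
    """CDF of the minimum of n iid Unif(J_k) draws: 1 - (1 - F(x))^n with F(x) = x/k on {0, ..., k}."""
    F = Fraction(min(max(x, 0), k), k)
    return 1 - (1 - F) ** n

k, n = 6, 4
pmf = {x: F_min(x, k, n) - F_min(x - 1, k, n) for x in range(1, k + 1)}
print(sum(pmf.values()))                      # 1
print(sum(x * p for x, p in pmf.items()))     # 2275/1296, matching E[X_min] from before
```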

 

 
