Alternate Option for CE
Consider the following setup:
- Probability space $(\Omega, \mathcal{F}, \mathbb{P})$
- Random variable $U: \Omega \to \{0, 1\}^{k}$ with uniform distribution.
- Measurable functions
    - Encoder $f: \{0, 1\}^{k} \to \{0,1\}^{n}$
    - Noise function $q: \{0,1\}^{n} \times \Omega \to \mathbb{R}^{n}$
    - Likelihood function $l: \mathbb{R}^{n} \to [0,1]^{k}$. That is, $l(\mathbf{y})_{i} = \mathbb{P}(U_{i} = 1 \mid Y=\mathbf{y})$.
    - Generic soft decoder $g: \mathbb{R}^{n} \to [0,1]^{k}$. $g$ returns a vector of probability mass functions on $\{0, 1\}$. %%5. Hard thresholder $h: [0,1]^{k} \to \{0, 1\}^{k}$, where $h(x)_{i} = \begin{cases} 1 &\text{if }x_{i} \geq 0.5\\ 0 &\text{otherwise} \end{cases}$%%
- Denote random variables
    - $X = f \circ U$. This is the encoded value.
    - $Y(\omega) = q(X(\omega), \omega)$. This is the noisy value.
    - $L = l \circ Y$. This is a random variable of the likelihood that $U_{i}=1$ conditioned on $Y$.
    - $R = g \circ Y$. This is a random variable of the soft decoded value conditioned on $Y$.
    - $\hat{U} \sim (\{0, 1\}^{k}, L)$. This is a random variable representing a decoded value sampled from the likelihood.
    - $\hat{V} \sim (\{0, 1\}^{k}, R)$. This is a random variable representing a decoded value sampled from the soft decoded value.
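A minimal simulation sketch of this setup. The concrete choices below (a rate-$1/3$ repetition encoder, BPSK over AWGN, the noise level, and $k$) are illustrative assumptions, not part of the definitions above; here $g$ is taken to be the exact likelihood $l$ for that channel.

```python
import numpy as np

rng = np.random.default_rng(0)
k, rep, sigma = 2, 3, 0.8            # hypothetical parameters: rate-1/3 repetition code over AWGN
n = k * rep

def f(u):                            # encoder f: repeat each bit `rep` times (one possible choice)
    return np.repeat(u, rep)

def q(x, rng):                       # noise q: BPSK map {0,1} -> {-1,+1} plus Gaussian noise
    return (2 * x - 1) + sigma * rng.normal(size=x.shape)

def l(y):                            # exact bitwise posterior P(U_i = 1 | Y = y) for this channel
    llr = 2 * y.reshape(k, rep).sum(axis=1) / sigma**2   # sum of per-symbol LLRs
    return 1 / (1 + np.exp(-llr))

g = l                                # soft decoder; any measurable g into [0,1]^k would do

u = rng.integers(0, 2, size=k)       # U ~ Uniform({0,1}^k)
y = q(f(u), rng)                     # Y = q(X, omega)
r_soft = g(y)                        # realization of R = g(Y)
v_hat = (rng.random(k) < r_soft).astype(int)   # V-hat sampled bitwise from R
```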
Recall the following definitions
Definition 1: The Block Error Rate of an encoder $f$ and soft decoder $g$ given some noise model $q$ is $$BLER(f, g|q) = \mathbb{E}\big[ 1_{\hat{V} \neq U} \big]$$
Definition 2: The Bit Error Rate of an encoder $f$ and soft decoder $g$ given some noise model $q$ is $$BER(f, g|q) = \mathbb{E}\Big[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{\hat{V}_{i} \neq U_{i}} \Big]$$
Definition 3: The Binary Cross-Entropy of an encoder $f$ and soft decoder $g$ given some noise model $q$ is $$BCE(f, g|q) = \mathbb{E}\Big[-\frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1} \log g(Y)_{i} + 1_{U_{i} = 0} \log (1 - g(Y)_{i}) \Big]$$
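A Monte Carlo sketch of Definitions 1–3 under the assumed setup above (reusing the illustrative `f`, `q`, `g`, `k`, and `rng`; the clipping constant is only there to keep the logarithms finite):

```python
def estimate_metrics(trials=100_000):
    bler = ber = bce = 0.0
    for _ in range(trials):
        u = rng.integers(0, 2, size=k)
        y = q(f(u), rng)
        p = np.clip(g(y), 1e-12, 1 - 1e-12)       # soft output, clipped away from {0,1}
        v = (rng.random(k) < p).astype(int)       # V-hat sampled bitwise from the soft output
        bler += float(np.any(v != u))             # Definition 1: block error indicator
        ber += np.mean(v != u)                    # Definition 2: fraction of bit errors
        bce += -np.mean(u * np.log(p) + (1 - u) * np.log(1 - p))   # Definition 3
    return bler / trials, ber / trials, bce / trials
```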
Let $r(u) = \mathbb{E}(g(Y)|U=u)$. Then
Definition 4: The Modified BCE of an encoder $f$ and soft decoder $g$ given some noise model $q$ is $$BCE'(f, g|q) = \mathbb{E}\Big[-\frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1} \log r(U)_{i} + 1_{U_{i} = 0} \log (1- r(U)_{i}) \Big]$$
Recall that we can write $$\begin{align*} BER(f, g|q) &= \mathbb{E}\Big[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{\hat{V}_{i} \neq U_{i}} \Big]\\ &=\mathbb{E}\Big[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1}1_{\hat{V}_{i}=0} + 1_{U_{i} = 0}1_{\hat{V}_{i}=1} \Big]\\ &=\mathbb{E}\Bigg[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1} \mathbb{E}(1_{\hat{V}_{i}=0}|U) + 1_{U_{i} = 0}\mathbb{E}(1_{\hat{V}_{i}=1}|U) \Bigg]\\ &=\mathbb{E}\Bigg[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1} \mathbb{E}(\mathbb{E}(1_{\hat{V}_{i}=0}|Y)|U) + 1_{U_{i} = 0}\mathbb{E}(\mathbb{E}(1_{\hat{V}_{i}=1}|Y)|U) \Bigg]\\ &=\mathbb{E}\Bigg[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1} \mathbb{E}(1 - g(Y)_{i}|U) + 1_{U_{i} = 0}\mathbb{E}(g(Y)_{i}|U) \Bigg]\\ &=\mathbb{E}\Bigg[ \frac{1}{k} \sum\limits_{i=1}^{k} 1_{U_{i} = 1} (1 - r(U)_{i}) + 1_{U_{i} = 0}r(U)_{i} \Bigg] \end{align*}$$
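A quick numerical check of the last identity, again under the assumed setup above: estimate $r(u) = \mathbb{E}(g(Y)\mid U=u)$ by averaging over noise realizations for each of the $2^{k}$ messages, then compare the resulting expression with the Monte Carlo BER from `estimate_metrics` (they should agree up to sampling error).

```python
def estimate_r(u, trials=50_000):
    # r(u) = E[g(q(f(u), .))], estimated by averaging over noise realizations
    return np.mean([g(q(f(u), rng)) for _ in range(trials)], axis=0)

messages = [np.array(b) for b in np.ndindex(*(2,) * k)]   # all 2^k messages, U is uniform
rhs = np.mean([np.mean(np.where(u == 1, 1 - estimate_r(u), estimate_r(u)))
               for u in messages])
# rhs approximates E[(1/k) sum_i 1{U_i=1}(1 - r(U)_i) + 1{U_i=0} r(U)_i]
```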
TODO: Prove this: Recall that the random Bayes classifier minimizes $BER$ and that $l$ induces a random Bayes classifier. So choosing $g=l$ minimizes $BER$. TODO (does choosing $g = l$…
No, the issue is that we also have to choose the encoder $f$. I don't think these are the same, because this is susceptible to the same issue as the counterexample showing that optimizing BLER is not the same as optimizing CE. The cost functions weight $[0,1]$ differently, and the sum will allow me to take advantage of this.
First note that $g(Y) = \omega \mapsto g(q(f(U(\omega)), \omega))$, so given $U=u$, we get $r(u) = \mathbb{E}[\omega \mapsto g(q(f(u), \omega))]$. Letting