Dark Dwarf Blog

Homework 2

2. Probability Potpourri

Q1

Concisely, $\Sigma = \mathbb{E}\big[(Z-\mu)(Z-\mu)^{T}\big]$, where $\mu$ is the mean value of the (column) vector $Z$. Show that the covariance matrix is always positive semidefinite (PSD).

Proof: It suffices to show that $v^T \Sigma v \ge 0$ holds for every nonzero vector $v$.

$$\begin{aligned} v^T\Sigma v &= v^T\,\mathbb{E}\big[(Z-\mu)(Z-\mu)^T\big]\,v \\ &= \mathbb{E}\big[v^T(Z-\mu)(Z-\mu)^T v\big] \end{aligned}$$

Let $Y = Z - \mu$. Then:

$$\begin{aligned} \text{LHS} &= \mathbb{E}\big[v^T Y Y^T v\big] \\ &= \mathbb{E}\big[(v^T Y)(Y^T v)\big] \end{aligned}$$

Since $v$ is an $n\times 1$ column vector, $v^T$ is a $1\times n$ row vector, and $Y$ is an $n\times 1$ column vector, so their product $v^T Y$ is a $1\times 1$ scalar. The same holds for $Y^T v$ (in fact, $Y^T v = (v^T Y)^T$, and the transpose of a scalar is itself). Therefore:

$$(v^T Y)(Y^T v) = (v^T Y)^2$$

Hence $\mathbb{E}\big[(v^T Y)(Y^T v)\big]$ is the expectation of a nonnegative random variable, which is clearly nonnegative. $\blacksquare$
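As a quick numerical sanity check (a sketch with arbitrary sample data, not part of the proof), we can estimate a covariance matrix from samples and confirm that $v^T \Sigma v \ge 0$ for random vectors $v$:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 samples of a correlated 4-dimensional random vector Z.
Z = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))

# Sigma = E[(Z - mu)(Z - mu)^T], estimated from the samples.
mu = Z.mean(axis=0)
Y = Z - mu
Sigma = Y.T @ Y / len(Z)

# v^T Sigma v should be nonnegative for every v.
for _ in range(100):
    v = rng.normal(size=4)
    assert v @ Sigma @ v >= 0

# Equivalently, all eigenvalues of Sigma are >= 0 (up to float error).
assert np.linalg.eigvalsh(Sigma).min() >= -1e-12
```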

Q4

Consider a discrete random variable $X$ that takes on a value from a finite sample space $\mathcal{Z}$. For a value $x \in \mathcal{Z}$, let $p(x)$ be the probability that $X$ takes the value $x$. The entropy of $X$ is

$$H(X) = -\sum_{x\in\mathcal{Z}} p(x)\,\ln p(x).$$

(i) First, let’s consider a random variable from a Bernoulli distribution, which has only two possible states. Let $X \sim \mathrm{Bernoulli}(p)$, $p \in (0, 1)$. Show that $H(X)$ is concave in $p$. (That is, $-H(X)$ is a convex function of $p$.)

Proof: The entropy of a Bernoulli distribution is:

$$H(p) = -\big[\,p \ln p + (1-p)\ln(1-p)\,\big]$$

Its first derivative is $H'(p) = \ln\frac{1-p}{p}$, and its second derivative is:

$$H''(p) = -\left(\frac{1}{p} + \frac{1}{1-p}\right) < 0$$

Since $H''(p) < 0$ on $(0, 1)$, the Bernoulli entropy is concave. $\blacksquare$
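A quick numerical illustration (a sketch on a grid, not a proof): concavity means $H\big(\frac{p_1+p_2}{2}\big) \ge \frac{H(p_1)+H(p_2)}{2}$, which we can check directly:

```python
import numpy as np

def H(p):
    """Entropy of a Bernoulli(p) random variable (natural log)."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Midpoint concavity: H at the midpoint dominates the midpoint of H values.
ps = np.linspace(0.01, 0.99, 99)
for p1 in ps:
    for p2 in ps:
        assert H((p1 + p2) / 2) >= (H(p1) + H(p2)) / 2 - 1e-12

# The maximum is at p = 1/2, where H = ln 2.
assert abs(H(0.5) - np.log(2)) < 1e-12
```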

(ii) Consider a sample space $\mathcal{Z}$ with $n$ states, a discrete distribution over those states, and a random variable $X$ drawn from that distribution. Show that among all possible PMFs over $n$ states, the entropy $H(X)$ is maximized by the uniform distribution. What is $H(X)$ for that distribution?

Answer: From the problem statement, we first set up the following optimization problem:

  • Given: a sample space with $n$ states and a PMF $p = (p_1, p_2, \dots, p_n)$
  • Optimization problem:

$$\begin{aligned} \max_p\quad & H(p) = -\sum_{i} p_i \ln p_i \\ \text{subject to}\quad & \sum_{i} p_i = 1,\quad p_i \ge 0 \;\; \forall i \end{aligned}$$

We introduce a reference distribution $q = (q_1, q_2, \dots, q_n)$ with $q_i = \frac{1}{n}$ for all $i$, and use the KL divergence to measure the gap between an arbitrary distribution $p$ and this candidate $q$.

The KL divergence has the following property: the gap between two distributions is always nonnegative, and is zero only when the two distributions are identical.

By the definition of KL divergence:

$$D_{\mathrm{KL}}(p\|q) = \sum_{i} p_i \ln\frac{p_i}{q_i} \ge 0$$

Meanwhile:

$$\begin{aligned} \sum_{i} p_i \ln\frac{p_i}{q_i} &= \sum_{i} p_i \ln p_i - \sum_{i} p_i \ln q_i \\ &= -H(p) + \ln n \qquad \text{(since } \ln q_i = -\ln n \text{ and } \textstyle\sum_i p_i = 1\text{)} \end{aligned}$$

Therefore:

$$H(p) \le \ln n$$

And when $p_i = q_i = \frac{1}{n}$ for all $i$, we have $H(p) = \ln n$, so the bound is attained. $\blacksquare$
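The same bound can be checked numerically (a sketch with random PMFs): entropy never exceeds $\ln n$, and the uniform distribution attains it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

def entropy(p):
    # 0 * ln(0) is taken as 0, the usual convention for entropy.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Random PMFs over n states never beat ln(n) ...
for _ in range(1000):
    p = rng.random(n)
    p /= p.sum()
    assert entropy(p) <= np.log(n) + 1e-12

# ... and the uniform distribution attains it exactly.
uniform = np.full(n, 1 / n)
assert abs(entropy(uniform) - np.log(n)) < 1e-12
```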

I don’t know how to do the linear algebra part, so I’m skipping it.

4. Matrix/Vector Calculus

Q1

We compute the gradients of $g(A) = \sin\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)$ and $h(A) = x^T A y$ separately.

g(A)g(A) 有:

$$\begin{aligned} \frac{\partial g}{\partial A_{11}} &= \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,\frac{\partial}{\partial A_{11}}\big(A_{11}^2 + e^{A_{11}+A_{22}}\big) \\ &= \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,\Big(2A_{11} + e^{A_{11}+A_{22}}\,\frac{\partial}{\partial A_{11}}(A_{11}+A_{22})\Big) \\ &= \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,\big(2A_{11} + e^{A_{11}+A_{22}}\big) \end{aligned}$$

$$\frac{\partial g}{\partial A_{12}} = 0, \qquad \frac{\partial g}{\partial A_{21}} = 0$$

$$\begin{aligned} \frac{\partial g}{\partial A_{22}} &= \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,\frac{\partial}{\partial A_{22}}\big(A_{11}^2 + e^{A_{11}+A_{22}}\big) \\ &= \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,e^{A_{11}+A_{22}}\,\frac{\partial}{\partial A_{22}}(A_{11}+A_{22}) \\ &= \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,e^{A_{11}+A_{22}} \end{aligned}$$

Hence the gradient matrix of $g(A)$ is:

$$\nabla_A g(A) = \begin{bmatrix} \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,\big(2A_{11} + e^{A_{11}+A_{22}}\big) & 0 \\ 0 & \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,e^{A_{11}+A_{22}} \end{bmatrix}$$

h(A)h(A)xTAyxᵀAy 是一个标量,我们先将其展开:

$$\begin{aligned} x &= \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\ Ay &= \begin{bmatrix} A_{11}y_1 + A_{12}y_2 \\ A_{21}y_1 + A_{22}y_2 \end{bmatrix} \\ x^T A y &= x_1\,(A_{11}y_1 + A_{12}y_2) + x_2\,(A_{21}y_1 + A_{22}y_2) \\ h(A) &= x_1 y_1 A_{11} + x_1 y_2 A_{12} + x_2 y_1 A_{21} + x_2 y_2 A_{22} \end{aligned}$$

Taking the partial derivative of the expansion with respect to each entry:

$$\begin{aligned} \frac{\partial h}{\partial A_{11}} &= x_1 y_1 \\ \frac{\partial h}{\partial A_{12}} &= x_1 y_2 \\ \frac{\partial h}{\partial A_{21}} &= x_2 y_1 \\ \frac{\partial h}{\partial A_{22}} &= x_2 y_2 \end{aligned}$$

Hence the gradient matrix of $h(A)$ is:

$$\begin{aligned} \nabla_A h(A) &= \begin{bmatrix} x_1 y_1 & x_1 y_2 \\ x_2 y_1 & x_2 y_2 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\begin{bmatrix} y_1 & y_2 \end{bmatrix} = x\,y^{T} \end{aligned}$$

This is a very commonly used matrix-calculus identity: $\nabla_A (x^T A y) = x y^T$.

Therefore:

$$\begin{aligned} \nabla_A f(A) &= \nabla_A g(A) + \nabla_A h(A) \\ &= \begin{bmatrix} \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,\big(2A_{11}+e^{A_{11}+A_{22}}\big) + x_1 y_1 & x_1 y_2 \\ x_2 y_1 & \cos\big(A_{11}^2 + e^{A_{11}+A_{22}}\big)\,e^{A_{11}+A_{22}} + x_2 y_2 \end{bmatrix} \end{aligned}$$