Transformer explained

Check out this great illustration of the Transformer.

Cost formulation background

  • Entropy: the intrinsic coding length of a distribution, i.e. the optimal average number of bits per message drawn from p. $$ H(p)=\sum_{x}p(x)\log_2(\frac{1}{p(x)}) $$
  • Cross entropy: the average coding length when messages (samples) drawn from p are encoded with a code optimized for distribution q. $$H_q(p)=\sum_{x}p(x)\log_2(\frac{1}{q(x)})$$
  • Kullback–Leibler divergence: the coding-efficiency gap relative to the true message (sample) distribution p, i.e. the cross entropy rebased to zero; see the numerical sketch after this list. $$D_q(p)=H_q(p) - H(p) = \sum_{x}p(x)\log_2(\frac{p(x)}{q(x)})$$
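
A minimal numerical sketch of the three quantities above, using NumPy and two hypothetical three-outcome distributions p and q (the values are illustrative, not taken from any dataset):

```python
import numpy as np

# Hypothetical discrete distributions over the same three outcomes.
p = np.array([0.5, 0.25, 0.25])   # true message (sample) distribution
q = np.array([0.25, 0.25, 0.5])   # coding / model distribution

# Entropy H(p): optimal average coding length (in bits) under p itself.
entropy = np.sum(p * np.log2(1.0 / p))

# Cross entropy H_q(p): average coding length when samples from p
# are encoded with a code optimized for q.
cross_entropy = np.sum(p * np.log2(1.0 / q))

# KL divergence D_q(p): the extra bits paid for coding with q instead of p.
kl = cross_entropy - entropy
assert np.isclose(kl, np.sum(p * np.log2(p / q)))

print(f"H(p)={entropy:.3f}  H_q(p)={cross_entropy:.3f}  D_q(p)={kl:.3f}")
```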

Transformer practice

PyTorch implementation: The Annotated Transformer.

Open my forked playground in Colab.

GitHub repo of x-transformers by lucidrains.
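
A minimal usage sketch along the lines of the x-transformers README; the vocabulary size, sequence length, and model dimensions below are illustrative assumptions, not settings from this post:

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# Illustrative hyperparameters; not tuned for any particular task.
model = TransformerWrapper(
    num_tokens = 20000,        # assumed vocabulary size
    max_seq_len = 1024,        # assumed maximum sequence length
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
    ),
)

x = torch.randint(0, 20000, (1, 1024))  # a batch of random token ids
logits = model(x)                        # shape: (1, 1024, 20000)
```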