Transformer
Transformer explained
Check out this great illustration of the transformer.
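To complement the linked illustration, below is a minimal sketch of scaled dot-product attention, the core operation inside every transformer block. The function name, tensor shapes, and optional mask argument are illustrative assumptions, not taken from the linked material.

```python
# Minimal scaled dot-product attention sketch (illustrative, not a full transformer).
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention weights sum to 1 over keys
    return weights @ v                                   # weighted sum of value vectors

# Example: one batch, two heads, five tokens, head dimension 8
q = k = v = torch.randn(1, 2, 5, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 5, 8])
```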
Cost formulation background
- Entropy: the expected coding length intrinsic to a distribution p. $$ H(p)=\sum_{x}p(x)\log_2(\frac{1}{p(x)}) $$
- Cross entropy: the expected coding length when messages (samples) from distribution p are encoded as if they were drawn from distribution q. $$H_q(p)=\sum_{x}p(x)\log_2(\frac{1}{q(x)})$$
- Kullback-Leibler divergence: the coding-efficiency penalty for encoding with q instead of the true message (sample) distribution p; i.e. the cross entropy re-based to zero at q = p (see the numeric check after this list). $$D_q(p)=H_q(p) - H(p) = \sum_{x}p(x)\log_2(\frac{p(x)}{q(x)})$$
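A quick numeric check of the three formulas above, using an arbitrary pair of discrete distributions p and q chosen purely for illustration:

```python
# Toy check of entropy, cross entropy, and KL divergence on discrete distributions.
# p is the "true" distribution, q the coding/model distribution (both arbitrary).
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.25, 0.25, 0.5])   # distribution assumed by the code/model

entropy       = np.sum(p * np.log2(1.0 / p))   # H(p)
cross_entropy = np.sum(p * np.log2(1.0 / q))   # H_q(p)
kl_divergence = np.sum(p * np.log2(p / q))     # D_q(p) = H_q(p) - H(p)

print(entropy)        # 1.5  bits
print(cross_entropy)  # 1.75 bits
print(kl_divergence)  # 0.25 bits, equals cross_entropy - entropy
```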
Transformer practice
PyTorch version: The Annotated Transformer.
Open my forked playground in Colab.
GitHub repo of x-transformers by lucidrains (a minimal usage sketch follows).
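A minimal usage sketch in the style of the x-transformers README; the class and argument names (TransformerWrapper, Decoder, num_tokens, max_seq_len, dim, depth, heads) are assumed from that repo's documentation and may vary with the library version.

```python
# Decoder-only transformer built with x-transformers (argument names assumed
# from the repo's documented example; adjust to your installed version).
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens=20000,          # vocabulary size
    max_seq_len=1024,          # maximum sequence length
    attn_layers=Decoder(
        dim=512,               # model width
        depth=6,               # number of transformer blocks
        heads=8,               # attention heads per block
    ),
)

tokens = torch.randint(0, 20000, (1, 1024))
logits = model(tokens)         # (1, 1024, 20000) logits over the vocabulary
```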