Topics

Mixture of Experts

Activating only a small subset of a model's parameters (the "experts") for each input, so model capacity grows without a proportional increase in compute per token.
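
As a rough illustration of the idea, here is a minimal sketch of top-k routing in plain Python. It assumes a softmax over the selected gate scores and scalar toy experts. The names `top_k_route`, `moe_forward`, and the example logits are hypothetical and chosen for illustration; real systems use learned routers and neural-network experts.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest gate logits.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    out = 0.0
    calls = 0  # number of experts actually evaluated
    for i, w in top_k_route(gate_logits, k):
        out += w * experts[i](x)
        calls += 1
    return out, calls

# Four toy "experts" (assumption: scalar functions stand in for sub-networks).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # hypothetical router scores for one input

y, calls = moe_forward(3.0, experts, gate_logits, k=2)
print(y, calls)
```

Only two of the four experts run for this input, so adding more experts would increase the total parameter count while the per-input work stays fixed at `k` expert evaluations plus the routing step.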