<ul><li><a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/papers/2204.10628.pdf">Autoregressive Search Engines: Generating Substrings as Document Identifiers</a></li>
<li><a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/papers/2203.15556.pdf">Training Compute-Optimal Large Language Models</a></li></ul>
<p>That is, we pick the most probable tokens until the sum of their probabilities is less than <span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.19444em;"></span><span class="mord coloredeq eqe" style=""><span class="mord mathnormal" style="">p</span></span></span></span></span>.</p>
<p>Then we sample from the selected tokens.</p>
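The procedure above can be sketched in a few lines of pure Python. This is a minimal illustration, not the repository's implementation (which presumably works on batched PyTorch tensors); the function name and signature are hypothetical:

```python
import math
import random

def nucleus_sample(logits, p=0.9, rng=random):
    """Top-p (nucleus) sampling sketch: keep the most probable tokens
    until their cumulative probability reaches p (always keeping at
    least one), then sample from the renormalised set."""
    # Softmax over the logits (subtract the max for numerical stability).
    m = max(logits)
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Token indices sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Accumulate tokens until the cumulative probability reaches p.
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)  # the crossing token is included, so the set is never empty
        cum += probs[i]
        if cum >= p:
            break
    # Sample from the selected tokens, weighted by their probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With a sharply peaked distribution and a small `p`, the nucleus collapses to the single most probable token, so the sample is deterministic.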
<p>Here's an <a href="experiment.html">experiment</a> that uses these sampling techniques.</p>
<p>Here we sample from the following probability distribution, where <span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord coloredeq eqd" style=""><span class="mord mathnormal" style="margin-right:0.22222em">V</span></span></span></span></span> is the vocabulary, <span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.7857599999999999em;vertical-align:-0.3551999999999999em;"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.34480000000000005em;"><span style="top:-2.5198em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span><span class="mrel mtight">:</span><span class="mord mtight">∣</span><span class="mord mtight coloredeq eqd" style=""><span class="mord mathnormal mtight" style="margin-right:0.22222em">V</span></span><span class="mord mtight">∣</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.3551999999999999em;"><span></span></span></span></span></span></span></span></span></span> are the logits of the distribution, and T is the temperature:</p>
<p>P(x<sub>i</sub>) = exp(u<sub>i</sub> / T) / ∑<sub>j</sub> exp(u<sub>j</sub> / T)</p>
<p><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">T</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.64444em;vertical-align:0em;"></span><span class="mord">1</span></span></span></span> is normal random sampling.</p>
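The distribution above can be sampled with a short pure-Python sketch. The function name and signature are hypothetical, and this is a plain softmax-with-temperature illustration rather than the repository's batched tensor code:

```python
import math
import random

def temperature_sample(logits, temperature=1.0, rng=random):
    """Temperature sampling sketch: sample index i with probability
    exp(u_i / T) / sum_j exp(u_j / T).  T = 1 recovers normal random
    sampling from the softmax; T -> 0 approaches greedy argmax."""
    # Scale the logits by the inverse temperature.
    scaled = [u / temperature for u in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(u - m) for u in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index according to the resulting distribution.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Lowering the temperature sharpens the distribution: at a very small T the most probable token dominates completely, while a large T flattens the distribution toward uniform.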