tantivy/master/src/utf8_ranges/lib.rs.html
2018-05-03 07:28:39 +00:00

1057 lines
68 KiB
HTML

<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="generator" content="rustdoc"><meta name="description" content="Source to the Rust file `/home/travis/.cargo/registry/src/github.com-1ecc6299db9ec823/utf8-ranges-1.0.0/src/lib.rs`."><meta name="keywords" content="rust, rustlang, rust-lang"><title>lib.rs.html -- source</title><link rel="stylesheet" type="text/css" href="../../normalize.css"><link rel="stylesheet" type="text/css" href="../../rustdoc.css" id="mainThemeStyle"><link rel="stylesheet" type="text/css" href="../../dark.css"><link rel="stylesheet" type="text/css" href="../../light.css" id="themeStyle"><script src="../../storage.js"></script></head><body class="rustdoc source"><!--[if lte IE 8]><div class="warning">This old browser is unsupported and will most likely display funky things.</div><![endif]--><nav class="sidebar"><div class="sidebar-menu">&#9776;</div></nav><div class="theme-picker"><button id="theme-picker" aria-label="Pick another theme!"><img src="../../brush.svg" width="18" alt="Pick another theme!"></button><div id="theme-choices"></div></div><script src="../../theme.js"></script><nav class="sub"><form class="search-form js-only"><div class="search-container"><input class="search-input" name="search" autocomplete="off" placeholder="Click or press S to search, ? for more options…" type="search"><a id="settings-menu" href="../../settings.html"><img src="../../wheel.svg" width="18" alt="Change settings"></a></div></form></nav><section id="main" class="content"><pre class="line-numbers"><span id="1"> 1</span>
<span id="2"> 2</span>
<span id="3"> 3</span>
<span id="4"> 4</span>
<span id="5"> 5</span>
<span id="6"> 6</span>
<span id="7"> 7</span>
<span id="8"> 8</span>
<span id="9"> 9</span>
<span id="10"> 10</span>
<span id="11"> 11</span>
<span id="12"> 12</span>
<span id="13"> 13</span>
<span id="14"> 14</span>
<span id="15"> 15</span>
<span id="16"> 16</span>
<span id="17"> 17</span>
<span id="18"> 18</span>
<span id="19"> 19</span>
<span id="20"> 20</span>
<span id="21"> 21</span>
<span id="22"> 22</span>
<span id="23"> 23</span>
<span id="24"> 24</span>
<span id="25"> 25</span>
<span id="26"> 26</span>
<span id="27"> 27</span>
<span id="28"> 28</span>
<span id="29"> 29</span>
<span id="30"> 30</span>
<span id="31"> 31</span>
<span id="32"> 32</span>
<span id="33"> 33</span>
<span id="34"> 34</span>
<span id="35"> 35</span>
<span id="36"> 36</span>
<span id="37"> 37</span>
<span id="38"> 38</span>
<span id="39"> 39</span>
<span id="40"> 40</span>
<span id="41"> 41</span>
<span id="42"> 42</span>
<span id="43"> 43</span>
<span id="44"> 44</span>
<span id="45"> 45</span>
<span id="46"> 46</span>
<span id="47"> 47</span>
<span id="48"> 48</span>
<span id="49"> 49</span>
<span id="50"> 50</span>
<span id="51"> 51</span>
<span id="52"> 52</span>
<span id="53"> 53</span>
<span id="54"> 54</span>
<span id="55"> 55</span>
<span id="56"> 56</span>
<span id="57"> 57</span>
<span id="58"> 58</span>
<span id="59"> 59</span>
<span id="60"> 60</span>
<span id="61"> 61</span>
<span id="62"> 62</span>
<span id="63"> 63</span>
<span id="64"> 64</span>
<span id="65"> 65</span>
<span id="66"> 66</span>
<span id="67"> 67</span>
<span id="68"> 68</span>
<span id="69"> 69</span>
<span id="70"> 70</span>
<span id="71"> 71</span>
<span id="72"> 72</span>
<span id="73"> 73</span>
<span id="74"> 74</span>
<span id="75"> 75</span>
<span id="76"> 76</span>
<span id="77"> 77</span>
<span id="78"> 78</span>
<span id="79"> 79</span>
<span id="80"> 80</span>
<span id="81"> 81</span>
<span id="82"> 82</span>
<span id="83"> 83</span>
<span id="84"> 84</span>
<span id="85"> 85</span>
<span id="86"> 86</span>
<span id="87"> 87</span>
<span id="88"> 88</span>
<span id="89"> 89</span>
<span id="90"> 90</span>
<span id="91"> 91</span>
<span id="92"> 92</span>
<span id="93"> 93</span>
<span id="94"> 94</span>
<span id="95"> 95</span>
<span id="96"> 96</span>
<span id="97"> 97</span>
<span id="98"> 98</span>
<span id="99"> 99</span>
<span id="100">100</span>
<span id="101">101</span>
<span id="102">102</span>
<span id="103">103</span>
<span id="104">104</span>
<span id="105">105</span>
<span id="106">106</span>
<span id="107">107</span>
<span id="108">108</span>
<span id="109">109</span>
<span id="110">110</span>
<span id="111">111</span>
<span id="112">112</span>
<span id="113">113</span>
<span id="114">114</span>
<span id="115">115</span>
<span id="116">116</span>
<span id="117">117</span>
<span id="118">118</span>
<span id="119">119</span>
<span id="120">120</span>
<span id="121">121</span>
<span id="122">122</span>
<span id="123">123</span>
<span id="124">124</span>
<span id="125">125</span>
<span id="126">126</span>
<span id="127">127</span>
<span id="128">128</span>
<span id="129">129</span>
<span id="130">130</span>
<span id="131">131</span>
<span id="132">132</span>
<span id="133">133</span>
<span id="134">134</span>
<span id="135">135</span>
<span id="136">136</span>
<span id="137">137</span>
<span id="138">138</span>
<span id="139">139</span>
<span id="140">140</span>
<span id="141">141</span>
<span id="142">142</span>
<span id="143">143</span>
<span id="144">144</span>
<span id="145">145</span>
<span id="146">146</span>
<span id="147">147</span>
<span id="148">148</span>
<span id="149">149</span>
<span id="150">150</span>
<span id="151">151</span>
<span id="152">152</span>
<span id="153">153</span>
<span id="154">154</span>
<span id="155">155</span>
<span id="156">156</span>
<span id="157">157</span>
<span id="158">158</span>
<span id="159">159</span>
<span id="160">160</span>
<span id="161">161</span>
<span id="162">162</span>
<span id="163">163</span>
<span id="164">164</span>
<span id="165">165</span>
<span id="166">166</span>
<span id="167">167</span>
<span id="168">168</span>
<span id="169">169</span>
<span id="170">170</span>
<span id="171">171</span>
<span id="172">172</span>
<span id="173">173</span>
<span id="174">174</span>
<span id="175">175</span>
<span id="176">176</span>
<span id="177">177</span>
<span id="178">178</span>
<span id="179">179</span>
<span id="180">180</span>
<span id="181">181</span>
<span id="182">182</span>
<span id="183">183</span>
<span id="184">184</span>
<span id="185">185</span>
<span id="186">186</span>
<span id="187">187</span>
<span id="188">188</span>
<span id="189">189</span>
<span id="190">190</span>
<span id="191">191</span>
<span id="192">192</span>
<span id="193">193</span>
<span id="194">194</span>
<span id="195">195</span>
<span id="196">196</span>
<span id="197">197</span>
<span id="198">198</span>
<span id="199">199</span>
<span id="200">200</span>
<span id="201">201</span>
<span id="202">202</span>
<span id="203">203</span>
<span id="204">204</span>
<span id="205">205</span>
<span id="206">206</span>
<span id="207">207</span>
<span id="208">208</span>
<span id="209">209</span>
<span id="210">210</span>
<span id="211">211</span>
<span id="212">212</span>
<span id="213">213</span>
<span id="214">214</span>
<span id="215">215</span>
<span id="216">216</span>
<span id="217">217</span>
<span id="218">218</span>
<span id="219">219</span>
<span id="220">220</span>
<span id="221">221</span>
<span id="222">222</span>
<span id="223">223</span>
<span id="224">224</span>
<span id="225">225</span>
<span id="226">226</span>
<span id="227">227</span>
<span id="228">228</span>
<span id="229">229</span>
<span id="230">230</span>
<span id="231">231</span>
<span id="232">232</span>
<span id="233">233</span>
<span id="234">234</span>
<span id="235">235</span>
<span id="236">236</span>
<span id="237">237</span>
<span id="238">238</span>
<span id="239">239</span>
<span id="240">240</span>
<span id="241">241</span>
<span id="242">242</span>
<span id="243">243</span>
<span id="244">244</span>
<span id="245">245</span>
<span id="246">246</span>
<span id="247">247</span>
<span id="248">248</span>
<span id="249">249</span>
<span id="250">250</span>
<span id="251">251</span>
<span id="252">252</span>
<span id="253">253</span>
<span id="254">254</span>
<span id="255">255</span>
<span id="256">256</span>
<span id="257">257</span>
<span id="258">258</span>
<span id="259">259</span>
<span id="260">260</span>
<span id="261">261</span>
<span id="262">262</span>
<span id="263">263</span>
<span id="264">264</span>
<span id="265">265</span>
<span id="266">266</span>
<span id="267">267</span>
<span id="268">268</span>
<span id="269">269</span>
<span id="270">270</span>
<span id="271">271</span>
<span id="272">272</span>
<span id="273">273</span>
<span id="274">274</span>
<span id="275">275</span>
<span id="276">276</span>
<span id="277">277</span>
<span id="278">278</span>
<span id="279">279</span>
<span id="280">280</span>
<span id="281">281</span>
<span id="282">282</span>
<span id="283">283</span>
<span id="284">284</span>
<span id="285">285</span>
<span id="286">286</span>
<span id="287">287</span>
<span id="288">288</span>
<span id="289">289</span>
<span id="290">290</span>
<span id="291">291</span>
<span id="292">292</span>
<span id="293">293</span>
<span id="294">294</span>
<span id="295">295</span>
<span id="296">296</span>
<span id="297">297</span>
<span id="298">298</span>
<span id="299">299</span>
<span id="300">300</span>
<span id="301">301</span>
<span id="302">302</span>
<span id="303">303</span>
<span id="304">304</span>
<span id="305">305</span>
<span id="306">306</span>
<span id="307">307</span>
<span id="308">308</span>
<span id="309">309</span>
<span id="310">310</span>
<span id="311">311</span>
<span id="312">312</span>
<span id="313">313</span>
<span id="314">314</span>
<span id="315">315</span>
<span id="316">316</span>
<span id="317">317</span>
<span id="318">318</span>
<span id="319">319</span>
<span id="320">320</span>
<span id="321">321</span>
<span id="322">322</span>
<span id="323">323</span>
<span id="324">324</span>
<span id="325">325</span>
<span id="326">326</span>
<span id="327">327</span>
<span id="328">328</span>
<span id="329">329</span>
<span id="330">330</span>
<span id="331">331</span>
<span id="332">332</span>
<span id="333">333</span>
<span id="334">334</span>
<span id="335">335</span>
<span id="336">336</span>
<span id="337">337</span>
<span id="338">338</span>
<span id="339">339</span>
<span id="340">340</span>
<span id="341">341</span>
<span id="342">342</span>
<span id="343">343</span>
<span id="344">344</span>
<span id="345">345</span>
<span id="346">346</span>
<span id="347">347</span>
<span id="348">348</span>
<span id="349">349</span>
<span id="350">350</span>
<span id="351">351</span>
<span id="352">352</span>
<span id="353">353</span>
<span id="354">354</span>
<span id="355">355</span>
<span id="356">356</span>
<span id="357">357</span>
<span id="358">358</span>
<span id="359">359</span>
<span id="360">360</span>
<span id="361">361</span>
<span id="362">362</span>
<span id="363">363</span>
<span id="364">364</span>
<span id="365">365</span>
<span id="366">366</span>
<span id="367">367</span>
<span id="368">368</span>
<span id="369">369</span>
<span id="370">370</span>
<span id="371">371</span>
<span id="372">372</span>
<span id="373">373</span>
<span id="374">374</span>
<span id="375">375</span>
<span id="376">376</span>
<span id="377">377</span>
<span id="378">378</span>
<span id="379">379</span>
<span id="380">380</span>
<span id="381">381</span>
<span id="382">382</span>
<span id="383">383</span>
<span id="384">384</span>
<span id="385">385</span>
<span id="386">386</span>
<span id="387">387</span>
<span id="388">388</span>
<span id="389">389</span>
<span id="390">390</span>
<span id="391">391</span>
<span id="392">392</span>
<span id="393">393</span>
<span id="394">394</span>
<span id="395">395</span>
<span id="396">396</span>
<span id="397">397</span>
<span id="398">398</span>
<span id="399">399</span>
<span id="400">400</span>
<span id="401">401</span>
<span id="402">402</span>
<span id="403">403</span>
<span id="404">404</span>
<span id="405">405</span>
<span id="406">406</span>
<span id="407">407</span>
<span id="408">408</span>
<span id="409">409</span>
<span id="410">410</span>
<span id="411">411</span>
<span id="412">412</span>
<span id="413">413</span>
<span id="414">414</span>
<span id="415">415</span>
<span id="416">416</span>
<span id="417">417</span>
<span id="418">418</span>
<span id="419">419</span>
<span id="420">420</span>
<span id="421">421</span>
<span id="422">422</span>
<span id="423">423</span>
<span id="424">424</span>
<span id="425">425</span>
<span id="426">426</span>
<span id="427">427</span>
<span id="428">428</span>
<span id="429">429</span>
<span id="430">430</span>
<span id="431">431</span>
<span id="432">432</span>
<span id="433">433</span>
<span id="434">434</span>
<span id="435">435</span>
<span id="436">436</span>
<span id="437">437</span>
<span id="438">438</span>
<span id="439">439</span>
<span id="440">440</span>
<span id="441">441</span>
<span id="442">442</span>
<span id="443">443</span>
<span id="444">444</span>
<span id="445">445</span>
<span id="446">446</span>
<span id="447">447</span>
<span id="448">448</span>
<span id="449">449</span>
<span id="450">450</span>
<span id="451">451</span>
<span id="452">452</span>
<span id="453">453</span>
<span id="454">454</span>
<span id="455">455</span>
<span id="456">456</span>
<span id="457">457</span>
<span id="458">458</span>
<span id="459">459</span>
<span id="460">460</span>
<span id="461">461</span>
<span id="462">462</span>
<span id="463">463</span>
<span id="464">464</span>
<span id="465">465</span>
<span id="466">466</span>
<span id="467">467</span>
<span id="468">468</span>
<span id="469">469</span>
<span id="470">470</span>
<span id="471">471</span>
<span id="472">472</span>
<span id="473">473</span>
<span id="474">474</span>
<span id="475">475</span>
<span id="476">476</span>
<span id="477">477</span>
<span id="478">478</span>
<span id="479">479</span>
<span id="480">480</span>
<span id="481">481</span>
<span id="482">482</span>
<span id="483">483</span>
<span id="484">484</span>
<span id="485">485</span>
<span id="486">486</span>
<span id="487">487</span>
<span id="488">488</span>
<span id="489">489</span>
<span id="490">490</span>
<span id="491">491</span>
<span id="492">492</span>
<span id="493">493</span>
<span id="494">494</span>
<span id="495">495</span>
<span id="496">496</span>
<span id="497">497</span>
<span id="498">498</span>
<span id="499">499</span>
<span id="500">500</span>
<span id="501">501</span>
<span id="502">502</span>
<span id="503">503</span>
<span id="504">504</span>
<span id="505">505</span>
<span id="506">506</span>
<span id="507">507</span>
<span id="508">508</span>
<span id="509">509</span>
<span id="510">510</span>
<span id="511">511</span>
<span id="512">512</span>
<span id="513">513</span>
<span id="514">514</span>
<span id="515">515</span>
<span id="516">516</span>
<span id="517">517</span>
<span id="518">518</span>
<span id="519">519</span>
<span id="520">520</span>
<span id="521">521</span>
<span id="522">522</span>
<span id="523">523</span>
<span id="524">524</span>
<span id="525">525</span>
<span id="526">526</span>
<span id="527">527</span>
</pre><pre class="rust ">
<span class="doccomment">/*!
Crate `utf8-ranges` converts ranges of Unicode scalar values to equivalent
ranges of UTF-8 bytes. This is useful for constructing byte-based automata
that need to embed UTF-8 decoding.
See the documentation on the `Utf8Sequences` iterator for more details and
an example.
# Wait, what is this?
This is simplest to explain with an example. Let&#39;s say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is `[0400-04FF]`. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:
```ignore
[D0-D3][80-BF]
```
This is simple enough: encode the boundaries (`0400` encodes to
`D0 80` and `04FF` encodes to `D3 BF`) and create ranges from each
corresponding pair of bytes: `D0` to `D3` and `80` to `BF`.
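As a quick check of the boundary encodings above, here is a minimal sketch
that uses only the standard library method `char::encode_utf8`. The helper
name `utf8_hex` is hypothetical and is not part of this crate.

```rust
/// Formats the UTF-8 encoding of `c` as space-separated uppercase hex bytes.
/// (`utf8_hex` is an illustrative helper, not part of this crate.)
fn utf8_hex(c: char) -> String {
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    let mut out = String::new();
    for (i, b) in encoded.bytes().enumerate() {
        if i > 0 {
            out.push(' ');
        }
        out.push_str(&format!("{:02X}", b));
    }
    out
}

fn main() {
    println!("{}", utf8_hex('\u{0400}')); // D0 80
    println!("{}", utf8_hex('\u{04FF}')); // D3 BF
}
```

Pairing the bytes position by position then gives `[D0-D3][80-BF]`.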
However, what if you wanted to add the Cyrillic Supplementary characters to
your range? Your range might then become `[0400-052F]`. The same procedure
as above doesn&#39;t quite work because `052F` encodes to `D4 AF`. The byte ranges
you&#39;d get from the previous transformation would be `[D0-D4][80-AF]`. However,
this isn&#39;t correct because the range misses many characters,
for example `04FF` (because its last byte, `BF`, isn&#39;t in the range `80-AF`).
Instead, you need multiple sequences of byte ranges:
```ignore
[D0-D3][80-BF] # matches codepoints 0400-04FF
[D4][80-AF] # matches codepoints 0500-052F
```
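To see why a single pair of ranges fails here, the following sketch hard-codes
the byte ranges from the text as `(start, end)` tuples and checks a few
encodings against them. The names `NAIVE`, `SPLIT`, `seq_matches` and
`any_matches` are hypothetical and only mirror the idea of testing each byte
against its successive range. They are not this crate API, and unlike
`Utf8Sequence::matches` the helper requires an exact length rather than a
prefix.

```rust
// Naive ranges built from the encoded boundaries of 0400 and 052F.
const NAIVE: &[&[(u8, u8)]] = &[&[(0xD0, 0xD4), (0x80, 0xAF)]];
// The two sequences given in the text above.
const SPLIT: &[&[(u8, u8)]] = &[
    &[(0xD0, 0xD3), (0x80, 0xBF)],
    &[(0xD4, 0xD4), (0x80, 0xAF)],
];

/// Returns true if `bytes` has exactly one byte per range and each byte
/// falls inside its corresponding inclusive range.
fn seq_matches(ranges: &[(u8, u8)], bytes: &[u8]) -> bool {
    ranges.len() == bytes.len()
        && ranges.iter().zip(bytes).all(|(r, b)| r.0 <= *b && *b <= r.1)
}

/// Encodes `c` as UTF-8 and tests it against every sequence in `seqs`.
fn any_matches(seqs: &[&[(u8, u8)]], c: char) -> bool {
    let mut buf = [0u8; 4];
    let n = c.encode_utf8(&mut buf).len();
    seqs.iter().any(|s| seq_matches(s, &buf[..n]))
}

fn main() {
    println!("naive matches 04FF: {}", any_matches(NAIVE, '\u{04FF}'));
    println!("split matches 04FF: {}", any_matches(SPLIT, '\u{04FF}'));
}
```

Under these assumptions the naive table rejects `04FF` (encoded as `D3 BF`),
while the split table accepts it.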
This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequence of byte
ranges for the basic multilingual plane (`[0000-FFFF]`) looks like this:
```ignore
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
```
Note that the byte ranges above will *not* match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.
And, of course, for all of Unicode (`[000000-10FFFF]`):
```ignore
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]
```
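As a sanity check on the table above, this sketch hard-codes the nine
sequences as `(start, end)` tuples and verifies that the encoding of every
Unicode scalar value matches exactly one of them, while an encoded surrogate
and an overlong encoding match none. The names `ALL_UNICODE`, `seq_matches`,
`count_matches` and `every_scalar_matches_once` are hypothetical and are not
part of this crate. The helper compares whole byte strings, not prefixes.

```rust
// The nine sequences listed above, one (start, end) pair per byte position.
const ALL_UNICODE: &[&[(u8, u8)]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Returns true if `bytes` has exactly one byte per range and each byte
/// falls inside its corresponding inclusive range.
fn seq_matches(ranges: &[(u8, u8)], bytes: &[u8]) -> bool {
    ranges.len() == bytes.len()
        && ranges.iter().zip(bytes).all(|(r, b)| r.0 <= *b && *b <= r.1)
}

/// Counts how many sequences in the table match `bytes` in full.
fn count_matches(bytes: &[u8]) -> usize {
    ALL_UNICODE.iter().filter(|s| seq_matches(s, bytes)).count()
}

/// Encodes every Unicode scalar value and checks it matches one sequence.
fn every_scalar_matches_once() -> bool {
    let mut buf = [0u8; 4];
    (0..=0x10FFFFu32)
        .filter_map(char::from_u32) // skips the surrogate gap D800-DFFF
        .all(|c| {
            let n = c.encode_utf8(&mut buf).len();
            count_matches(&buf[..n]) == 1
        })
}

fn main() {
    println!("all scalar values match once: {}", every_scalar_matches_once());
    // ED A0 80 would be the encoding of the surrogate D800, and C0 80 is an
    // overlong encoding of 00. Neither is valid UTF-8.
    println!("surrogate matches: {}", count_matches(&[0xED, 0xA0, 0x80]));
    println!("overlong matches: {}", count_matches(&[0xC0, 0x80]));
}
```

Because the sequences are disjoint in their first byte, a match count of one
per scalar value is what the table should produce.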
This crate automates the process of creating these byte ranges from ranges of
Unicode scalar values.
# Why would I ever use this?
You probably won&#39;t ever need this. In 99% of cases, you just decode the byte
sequence into a Unicode scalar value and compare scalar values directly.
However, this explicit decoding step isn&#39;t always possible. For example, the
construction of some finite state machines may benefit from converting ranges
of scalar values into UTF-8 decoder automata (e.g., for character classes in
regular expressions).
# Lineage
I got the idea and general implementation strategy from Russ Cox in his
[article on regexps](https://web.archive.org/web/20160404141123/https://swtch.com/~rsc/regexp/regexp3.html) and RE2.
Russ Cox got it from Ken Thompson&#39;s `grep` (no source, folklore?).
I also got the idea from
[Lucene](https://github.com/apache/lucene-solr/blob/ae93f4e7ac6a3908046391de35d4f50a0d3c59ca/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java),
which uses it for executing automata on their term index.
*/</span>
<span class="attribute">#![<span class="ident">deny</span>(<span class="ident">missing_docs</span>)]</span>
<span class="attribute">#[<span class="ident">cfg</span>(<span class="ident">test</span>)]</span> <span class="kw">extern</span> <span class="kw">crate</span> <span class="ident">quickcheck</span>;
<span class="kw">use</span> <span class="ident">std</span>::<span class="ident">char</span>;
<span class="kw">use</span> <span class="ident">std</span>::<span class="ident">fmt</span>;
<span class="kw">use</span> <span class="ident">std</span>::<span class="ident">slice</span>;
<span class="kw">use</span> <span class="ident">char_utf8</span>::<span class="ident">encode_utf8</span>;
<span class="kw">const</span> <span class="ident">MAX_UTF8_BYTES</span>: <span class="ident">usize</span> <span class="op">=</span> <span class="number">4</span>;
<span class="kw">mod</span> <span class="ident">char_utf8</span>;
<span class="doccomment">/// Utf8Sequence represents a sequence of byte ranges.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// To match a Utf8Sequence, a candidate byte sequence must match each</span>
<span class="doccomment">/// successive range.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// For example, if there are two ranges, `[C2-DF][80-BF]`, then the byte</span>
<span class="doccomment">/// sequence `\xDD\x61` would not match because `0x61 &lt; 0x80`.</span>
<span class="attribute">#[<span class="ident">derive</span>(<span class="ident">Copy</span>, <span class="ident">Clone</span>, <span class="ident">Eq</span>, <span class="ident">PartialEq</span>)]</span>
<span class="kw">pub</span> <span class="kw">enum</span> <span class="ident">Utf8Sequence</span> {
<span class="doccomment">/// One byte range.</span>
<span class="ident">One</span>(<span class="ident">Utf8Range</span>),
<span class="doccomment">/// Two successive byte ranges.</span>
<span class="ident">Two</span>([<span class="ident">Utf8Range</span>; <span class="number">2</span>]),
<span class="doccomment">/// Three successive byte ranges.</span>
<span class="ident">Three</span>([<span class="ident">Utf8Range</span>; <span class="number">3</span>]),
<span class="doccomment">/// Four successive byte ranges.</span>
<span class="ident">Four</span>([<span class="ident">Utf8Range</span>; <span class="number">4</span>]),
}
<span class="kw">impl</span> <span class="ident">Utf8Sequence</span> {
<span class="doccomment">/// Creates a new UTF-8 sequence from the encoded bytes of a scalar value</span>
<span class="doccomment">/// range.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// This assumes that `start` and `end` have the same length.</span>
<span class="kw">fn</span> <span class="ident">from_encoded_range</span>(<span class="ident">start</span>: <span class="kw-2">&amp;</span>[<span class="ident">u8</span>], <span class="ident">end</span>: <span class="kw-2">&amp;</span>[<span class="ident">u8</span>]) <span class="op">-&gt;</span> <span class="self">Self</span> {
<span class="macro">assert_eq</span><span class="macro">!</span>(<span class="ident">start</span>.<span class="ident">len</span>(), <span class="ident">end</span>.<span class="ident">len</span>());
<span class="kw">match</span> <span class="ident">start</span>.<span class="ident">len</span>() {
<span class="number">2</span> <span class="op">=&gt;</span> <span class="ident">Utf8Sequence</span>::<span class="ident">Two</span>([
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">0</span>], <span class="ident">end</span>[<span class="number">0</span>]),
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">1</span>], <span class="ident">end</span>[<span class="number">1</span>]),
]),
<span class="number">3</span> <span class="op">=&gt;</span> <span class="ident">Utf8Sequence</span>::<span class="ident">Three</span>([
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">0</span>], <span class="ident">end</span>[<span class="number">0</span>]),
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">1</span>], <span class="ident">end</span>[<span class="number">1</span>]),
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">2</span>], <span class="ident">end</span>[<span class="number">2</span>]),
]),
<span class="number">4</span> <span class="op">=&gt;</span> <span class="ident">Utf8Sequence</span>::<span class="ident">Four</span>([
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">0</span>], <span class="ident">end</span>[<span class="number">0</span>]),
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">1</span>], <span class="ident">end</span>[<span class="number">1</span>]),
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">2</span>], <span class="ident">end</span>[<span class="number">2</span>]),
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">start</span>[<span class="number">3</span>], <span class="ident">end</span>[<span class="number">3</span>]),
]),
<span class="ident">n</span> <span class="op">=&gt;</span> <span class="macro">unreachable</span><span class="macro">!</span>(<span class="string">&quot;invalid encoded length: {}&quot;</span>, <span class="ident">n</span>),
}
}
<span class="doccomment">/// Returns the underlying sequence of byte ranges as a slice.</span>
<span class="kw">pub</span> <span class="kw">fn</span> <span class="ident">as_slice</span>(<span class="kw-2">&amp;</span><span class="self">self</span>) <span class="op">-&gt;</span> <span class="kw-2">&amp;</span>[<span class="ident">Utf8Range</span>] {
<span class="kw">use</span> <span class="self">self</span>::<span class="ident">Utf8Sequence</span>::<span class="kw-2">*</span>;
<span class="kw">match</span> <span class="kw-2">*</span><span class="self">self</span> {
<span class="ident">One</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="ident">slice</span>::<span class="ident">from_ref</span>(<span class="ident">r</span>),
<span class="ident">Two</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="kw-2">&amp;</span><span class="ident">r</span>[..],
<span class="ident">Three</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="kw-2">&amp;</span><span class="ident">r</span>[..],
<span class="ident">Four</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="kw-2">&amp;</span><span class="ident">r</span>[..],
}
}
<span class="doccomment">/// Returns the number of byte ranges in this sequence.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// The length is guaranteed to be in the closed interval `[1, 4]`.</span>
<span class="kw">pub</span> <span class="kw">fn</span> <span class="ident">len</span>(<span class="kw-2">&amp;</span><span class="self">self</span>) <span class="op">-&gt;</span> <span class="ident">usize</span> {
<span class="self">self</span>.<span class="ident">as_slice</span>().<span class="ident">len</span>()
}
<span class="doccomment">/// Returns true if and only if a prefix of `bytes` matches this sequence</span>
<span class="doccomment">/// of byte ranges.</span>
<span class="kw">pub</span> <span class="kw">fn</span> <span class="ident">matches</span>(<span class="kw-2">&amp;</span><span class="self">self</span>, <span class="ident">bytes</span>: <span class="kw-2">&amp;</span>[<span class="ident">u8</span>]) <span class="op">-&gt;</span> <span class="ident">bool</span> {
<span class="kw">if</span> <span class="ident">bytes</span>.<span class="ident">len</span>() <span class="op">&lt;</span> <span class="self">self</span>.<span class="ident">len</span>() {
<span class="kw">return</span> <span class="bool-val">false</span>;
}
<span class="kw">for</span> (<span class="kw-2">&amp;</span><span class="ident">b</span>, <span class="ident">r</span>) <span class="kw">in</span> <span class="ident">bytes</span>.<span class="ident">iter</span>().<span class="ident">zip</span>(<span class="self">self</span>) {
<span class="kw">if</span> <span class="op">!</span><span class="ident">r</span>.<span class="ident">matches</span>(<span class="ident">b</span>) {
<span class="kw">return</span> <span class="bool-val">false</span>;
}
}
<span class="bool-val">true</span>
}
}
<span class="kw">impl</span><span class="op">&lt;</span><span class="lifetime">&#39;a</span><span class="op">&gt;</span> <span class="ident">IntoIterator</span> <span class="kw">for</span> <span class="kw-2">&amp;</span><span class="lifetime">&#39;a</span> <span class="ident">Utf8Sequence</span> {
<span class="kw">type</span> <span class="ident">IntoIter</span> <span class="op">=</span> <span class="ident">slice</span>::<span class="ident">Iter</span><span class="op">&lt;</span><span class="lifetime">&#39;a</span>, <span class="ident">Utf8Range</span><span class="op">&gt;</span>;
<span class="kw">type</span> <span class="ident">Item</span> <span class="op">=</span> <span class="kw-2">&amp;</span><span class="lifetime">&#39;a</span> <span class="ident">Utf8Range</span>;
<span class="kw">fn</span> <span class="ident">into_iter</span>(<span class="self">self</span>) <span class="op">-&gt;</span> <span class="self">Self</span>::<span class="ident">IntoIter</span> {
<span class="self">self</span>.<span class="ident">as_slice</span>().<span class="ident">into_iter</span>()
}
}
<span class="kw">impl</span> <span class="ident">fmt</span>::<span class="ident">Debug</span> <span class="kw">for</span> <span class="ident">Utf8Sequence</span> {
<span class="kw">fn</span> <span class="ident">fmt</span>(<span class="kw-2">&amp;</span><span class="self">self</span>, <span class="ident">f</span>: <span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="ident">fmt</span>::<span class="ident">Formatter</span>) <span class="op">-&gt;</span> <span class="ident">fmt</span>::<span class="prelude-ty">Result</span> {
<span class="kw">use</span> <span class="self">self</span>::<span class="ident">Utf8Sequence</span>::<span class="kw-2">*</span>;
<span class="kw">match</span> <span class="kw-2">*</span><span class="self">self</span> {
<span class="ident">One</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;{:?}&quot;</span>, <span class="ident">r</span>),
<span class="ident">Two</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;{:?}{:?}&quot;</span>, <span class="ident">r</span>[<span class="number">0</span>], <span class="ident">r</span>[<span class="number">1</span>]),
<span class="ident">Three</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;{:?}{:?}{:?}&quot;</span>, <span class="ident">r</span>[<span class="number">0</span>], <span class="ident">r</span>[<span class="number">1</span>], <span class="ident">r</span>[<span class="number">2</span>]),
<span class="ident">Four</span>(<span class="kw-2">ref</span> <span class="ident">r</span>) <span class="op">=&gt;</span> <span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;{:?}{:?}{:?}{:?}&quot;</span>,
<span class="ident">r</span>[<span class="number">0</span>], <span class="ident">r</span>[<span class="number">1</span>], <span class="ident">r</span>[<span class="number">2</span>], <span class="ident">r</span>[<span class="number">3</span>]),
}
}
}
<span class="doccomment">/// A single inclusive range of UTF-8 bytes.</span>
<span class="attribute">#[<span class="ident">derive</span>(<span class="ident">Clone</span>, <span class="ident">Copy</span>, <span class="ident">PartialEq</span>, <span class="ident">Eq</span>)]</span>
<span class="kw">pub</span> <span class="kw">struct</span> <span class="ident">Utf8Range</span> {
<span class="doccomment">/// Start of byte range (inclusive).</span>
<span class="kw">pub</span> <span class="ident">start</span>: <span class="ident">u8</span>,
<span class="doccomment">/// End of byte range (inclusive).</span>
<span class="kw">pub</span> <span class="ident">end</span>: <span class="ident">u8</span>,
}
<span class="kw">impl</span> <span class="ident">Utf8Range</span> {
<span class="kw">fn</span> <span class="ident">new</span>(<span class="ident">start</span>: <span class="ident">u8</span>, <span class="ident">end</span>: <span class="ident">u8</span>) <span class="op">-&gt;</span> <span class="self">Self</span> {
<span class="ident">Utf8Range</span> { <span class="ident">start</span>: <span class="ident">start</span>, <span class="ident">end</span>: <span class="ident">end</span> }
}
<span class="doccomment">/// Returns true if and only if the given byte is in this range.</span>
<span class="kw">pub</span> <span class="kw">fn</span> <span class="ident">matches</span>(<span class="kw-2">&amp;</span><span class="self">self</span>, <span class="ident">b</span>: <span class="ident">u8</span>) <span class="op">-&gt;</span> <span class="ident">bool</span> {
<span class="self">self</span>.<span class="ident">start</span> <span class="op">&lt;=</span> <span class="ident">b</span> <span class="op">&amp;&amp;</span> <span class="ident">b</span> <span class="op">&lt;=</span> <span class="self">self</span>.<span class="ident">end</span>
}
}
<span class="kw">impl</span> <span class="ident">fmt</span>::<span class="ident">Debug</span> <span class="kw">for</span> <span class="ident">Utf8Range</span> {
<span class="kw">fn</span> <span class="ident">fmt</span>(<span class="kw-2">&amp;</span><span class="self">self</span>, <span class="ident">f</span>: <span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="ident">fmt</span>::<span class="ident">Formatter</span>) <span class="op">-&gt;</span> <span class="ident">fmt</span>::<span class="prelude-ty">Result</span> {
<span class="kw">if</span> <span class="self">self</span>.<span class="ident">start</span> <span class="op">==</span> <span class="self">self</span>.<span class="ident">end</span> {
<span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;[{:X}]&quot;</span>, <span class="self">self</span>.<span class="ident">start</span>)
} <span class="kw">else</span> {
<span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;[{:X}-{:X}]&quot;</span>, <span class="self">self</span>.<span class="ident">start</span>, <span class="self">self</span>.<span class="ident">end</span>)
}
}
}
<span class="doccomment">/// An iterator over ranges of matching UTF-8 byte sequences.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// The iteration represents an alternation of comprehensive byte sequences</span>
<span class="doccomment">/// that match precisely the set of UTF-8 encoded scalar values.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// A byte sequence corresponds to one of the scalar values in the range given</span>
<span class="doccomment">/// if and only if it completely matches exactly one of the sequences of byte</span>
<span class="doccomment">/// ranges produced by this iterator.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// Each sequence of byte ranges matches a unique set of bytes. That is, no two</span>
<span class="doccomment">/// sequences will match the same bytes.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// # Example</span>
<span class="doccomment">///</span>
<span class="doccomment">/// This shows how to match an arbitrary byte sequence against a range of</span>
<span class="doccomment">/// scalar values.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// ```rust</span>
<span class="doccomment">/// use utf8_ranges::{Utf8Sequences, Utf8Sequence};</span>
<span class="doccomment">///</span>
<span class="doccomment">/// fn matches(seqs: &amp;[Utf8Sequence], bytes: &amp;[u8]) -&gt; bool {</span>
<span class="doccomment">/// for range in seqs {</span>
<span class="doccomment">/// if range.matches(bytes) {</span>
<span class="doccomment">/// return true;</span>
<span class="doccomment">/// }</span>
<span class="doccomment">/// }</span>
<span class="doccomment">/// false</span>
<span class="doccomment">/// }</span>
<span class="doccomment">///</span>
<span class="doccomment">/// // Test the basic multilingual plane.</span>
<span class="doccomment">/// let seqs: Vec&lt;_&gt; = Utf8Sequences::new(&#39;\u{0}&#39;, &#39;\u{FFFF}&#39;).collect();</span>
<span class="doccomment">///</span>
<span class="doccomment">/// // UTF-8 encoding of &#39;a&#39;.</span>
<span class="doccomment">/// assert!(matches(&amp;seqs, &amp;[0x61]));</span>
<span class="doccomment">/// // UTF-8 encoding of &#39;☃&#39; (`\u{2603}`).</span>
<span class="doccomment">/// assert!(matches(&amp;seqs, &amp;[0xE2, 0x98, 0x83]));</span>
<span class="doccomment">/// // UTF-8 encoding of `\u{10348}` (outside the BMP).</span>
<span class="doccomment">/// assert!(!matches(&amp;seqs, &amp;[0xF0, 0x90, 0x8D, 0x88]));</span>
<span class="doccomment">/// // Tries to match against a UTF-8 encoding of a surrogate codepoint,</span>
<span class="doccomment">/// // which is invalid UTF-8, and therefore fails, despite the fact that</span>
<span class="doccomment">/// // the corresponding codepoint (0xD800) falls in the range given.</span>
<span class="doccomment">/// assert!(!matches(&amp;seqs, &amp;[0xED, 0xA0, 0x80]));</span>
<span class="doccomment">/// // And fails against plain old invalid UTF-8.</span>
<span class="doccomment">/// assert!(!matches(&amp;seqs, &amp;[0xFF, 0xFF]));</span>
<span class="doccomment">/// ```</span>
<span class="doccomment">///</span>
<span class="doccomment">/// If this example seems circuitous, that&#39;s because it is! It&#39;s meant to be</span>
<span class="doccomment">/// illustrative. In practice, you could just try to decode your byte sequence</span>
<span class="doccomment">/// and compare it with the scalar value range directly. However, this is not</span>
<span class="doccomment">/// always possible (for example, in a byte based automaton).</span>
<span class="kw">pub</span> <span class="kw">struct</span> <span class="ident">Utf8Sequences</span> {
<span class="ident">range_stack</span>: <span class="ident">Vec</span><span class="op">&lt;</span><span class="ident">ScalarRange</span><span class="op">&gt;</span>,
}
<span class="kw">impl</span> <span class="ident">Utf8Sequences</span> {
<span class="doccomment">/// Create a new iterator over UTF-8 byte ranges for the scalar value range</span>
<span class="doccomment">/// given.</span>
<span class="kw">pub</span> <span class="kw">fn</span> <span class="ident">new</span>(<span class="ident">start</span>: <span class="ident">char</span>, <span class="ident">end</span>: <span class="ident">char</span>) <span class="op">-&gt;</span> <span class="self">Self</span> {
<span class="kw">let</span> <span class="kw-2">mut</span> <span class="ident">it</span> <span class="op">=</span> <span class="ident">Utf8Sequences</span> { <span class="ident">range_stack</span>: <span class="macro">vec</span><span class="macro">!</span>[] };
<span class="ident">it</span>.<span class="ident">push</span>(<span class="ident">start</span> <span class="kw">as</span> <span class="ident">u32</span>, <span class="ident">end</span> <span class="kw">as</span> <span class="ident">u32</span>);
<span class="ident">it</span>
}
<span class="doccomment">/// reset resets the scalar value range.</span>
<span class="doccomment">/// Any existing state is cleared, but resources may be reused.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// N.B. Benchmarks say that this method is dubious.</span>
<span class="attribute">#[<span class="ident">doc</span>(<span class="ident">hidden</span>)]</span>
<span class="kw">pub</span> <span class="kw">fn</span> <span class="ident">reset</span>(<span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="self">self</span>, <span class="ident">start</span>: <span class="ident">char</span>, <span class="ident">end</span>: <span class="ident">char</span>) {
<span class="self">self</span>.<span class="ident">range_stack</span>.<span class="ident">clear</span>();
<span class="self">self</span>.<span class="ident">push</span>(<span class="ident">start</span> <span class="kw">as</span> <span class="ident">u32</span>, <span class="ident">end</span> <span class="kw">as</span> <span class="ident">u32</span>);
}
<span class="kw">fn</span> <span class="ident">push</span>(<span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="self">self</span>, <span class="ident">start</span>: <span class="ident">u32</span>, <span class="ident">end</span>: <span class="ident">u32</span>) {
<span class="self">self</span>.<span class="ident">range_stack</span>.<span class="ident">push</span>(<span class="ident">ScalarRange</span> { <span class="ident">start</span>: <span class="ident">start</span>, <span class="ident">end</span>: <span class="ident">end</span> });
}
}
<span class="kw">struct</span> <span class="ident">ScalarRange</span> {
<span class="ident">start</span>: <span class="ident">u32</span>,
<span class="ident">end</span>: <span class="ident">u32</span>,
}
<span class="kw">impl</span> <span class="ident">fmt</span>::<span class="ident">Debug</span> <span class="kw">for</span> <span class="ident">ScalarRange</span> {
<span class="kw">fn</span> <span class="ident">fmt</span>(<span class="kw-2">&amp;</span><span class="self">self</span>, <span class="ident">f</span>: <span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="ident">fmt</span>::<span class="ident">Formatter</span>) <span class="op">-&gt;</span> <span class="ident">fmt</span>::<span class="prelude-ty">Result</span> {
<span class="macro">write</span><span class="macro">!</span>(<span class="ident">f</span>, <span class="string">&quot;ScalarRange({:X}, {:X})&quot;</span>, <span class="self">self</span>.<span class="ident">start</span>, <span class="self">self</span>.<span class="ident">end</span>)
}
}
<span class="kw">impl</span> <span class="ident">Iterator</span> <span class="kw">for</span> <span class="ident">Utf8Sequences</span> {
<span class="kw">type</span> <span class="ident">Item</span> <span class="op">=</span> <span class="ident">Utf8Sequence</span>;
<span class="kw">fn</span> <span class="ident">next</span>(<span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="self">self</span>) <span class="op">-&gt;</span> <span class="prelude-ty">Option</span><span class="op">&lt;</span><span class="self">Self</span>::<span class="ident">Item</span><span class="op">&gt;</span> {
<span class="lifetime">&#39;TOP</span>:
<span class="kw">while</span> <span class="kw">let</span> <span class="prelude-val">Some</span>(<span class="kw-2">mut</span> <span class="ident">r</span>) <span class="op">=</span> <span class="self">self</span>.<span class="ident">range_stack</span>.<span class="ident">pop</span>() {
<span class="lifetime">&#39;INNER</span>:
<span class="kw">loop</span> {
<span class="kw">if</span> <span class="kw">let</span> <span class="prelude-val">Some</span>((<span class="ident">r1</span>, <span class="ident">r2</span>)) <span class="op">=</span> <span class="ident">r</span>.<span class="ident">split</span>() {
<span class="self">self</span>.<span class="ident">push</span>(<span class="ident">r2</span>.<span class="ident">start</span>, <span class="ident">r2</span>.<span class="ident">end</span>);
<span class="ident">r</span>.<span class="ident">start</span> <span class="op">=</span> <span class="ident">r1</span>.<span class="ident">start</span>;
<span class="ident">r</span>.<span class="ident">end</span> <span class="op">=</span> <span class="ident">r1</span>.<span class="ident">end</span>;
<span class="kw">continue</span> <span class="lifetime">&#39;INNER</span>;
}
<span class="kw">if</span> <span class="op">!</span><span class="ident">r</span>.<span class="ident">is_valid</span>() {
<span class="kw">continue</span> <span class="lifetime">&#39;TOP</span>;
}
<span class="kw">for</span> <span class="ident">i</span> <span class="kw">in</span> <span class="number">1</span>..<span class="ident">MAX_UTF8_BYTES</span> {
<span class="kw">let</span> <span class="ident">max</span> <span class="op">=</span> <span class="ident">max_scalar_value</span>(<span class="ident">i</span>);
<span class="kw">if</span> <span class="ident">r</span>.<span class="ident">start</span> <span class="op">&lt;=</span> <span class="ident">max</span> <span class="op">&amp;&amp;</span> <span class="ident">max</span> <span class="op">&lt;</span> <span class="ident">r</span>.<span class="ident">end</span> {
<span class="self">self</span>.<span class="ident">push</span>(<span class="ident">max</span> <span class="op">+</span> <span class="number">1</span>, <span class="ident">r</span>.<span class="ident">end</span>);
<span class="ident">r</span>.<span class="ident">end</span> <span class="op">=</span> <span class="ident">max</span>;
<span class="kw">continue</span> <span class="lifetime">&#39;INNER</span>;
}
}
<span class="kw">if</span> <span class="kw">let</span> <span class="prelude-val">Some</span>(<span class="ident">ascii_range</span>) <span class="op">=</span> <span class="ident">r</span>.<span class="ident">as_ascii</span>() {
<span class="kw">return</span> <span class="prelude-val">Some</span>(<span class="ident">Utf8Sequence</span>::<span class="ident">One</span>(<span class="ident">ascii_range</span>));
}
<span class="kw">for</span> <span class="ident">i</span> <span class="kw">in</span> <span class="number">1</span>..<span class="ident">MAX_UTF8_BYTES</span> {
<span class="kw">let</span> <span class="ident">m</span> <span class="op">=</span> (<span class="number">1</span> <span class="op">&lt;&lt;</span> (<span class="number">6</span> <span class="op">*</span> <span class="ident">i</span>)) <span class="op">-</span> <span class="number">1</span>;
<span class="kw">if</span> (<span class="ident">r</span>.<span class="ident">start</span> <span class="op">&amp;</span> <span class="op">!</span><span class="ident">m</span>) <span class="op">!=</span> (<span class="ident">r</span>.<span class="ident">end</span> <span class="op">&amp;</span> <span class="op">!</span><span class="ident">m</span>) {
<span class="kw">if</span> (<span class="ident">r</span>.<span class="ident">start</span> <span class="op">&amp;</span> <span class="ident">m</span>) <span class="op">!=</span> <span class="number">0</span> {
<span class="self">self</span>.<span class="ident">push</span>((<span class="ident">r</span>.<span class="ident">start</span> <span class="op">|</span> <span class="ident">m</span>) <span class="op">+</span> <span class="number">1</span>, <span class="ident">r</span>.<span class="ident">end</span>);
<span class="ident">r</span>.<span class="ident">end</span> <span class="op">=</span> <span class="ident">r</span>.<span class="ident">start</span> <span class="op">|</span> <span class="ident">m</span>;
<span class="kw">continue</span> <span class="lifetime">&#39;INNER</span>;
}
<span class="kw">if</span> (<span class="ident">r</span>.<span class="ident">end</span> <span class="op">&amp;</span> <span class="ident">m</span>) <span class="op">!=</span> <span class="ident">m</span> {
<span class="self">self</span>.<span class="ident">push</span>(<span class="ident">r</span>.<span class="ident">end</span> <span class="op">&amp;</span> <span class="op">!</span><span class="ident">m</span>, <span class="ident">r</span>.<span class="ident">end</span>);
<span class="ident">r</span>.<span class="ident">end</span> <span class="op">=</span> (<span class="ident">r</span>.<span class="ident">end</span> <span class="op">&amp;</span> <span class="op">!</span><span class="ident">m</span>) <span class="op">-</span> <span class="number">1</span>;
<span class="kw">continue</span> <span class="lifetime">&#39;INNER</span>;
}
}
}
<span class="kw">let</span> <span class="kw-2">mut</span> <span class="ident">start</span> <span class="op">=</span> [<span class="number">0</span>; <span class="ident">MAX_UTF8_BYTES</span>];
<span class="kw">let</span> <span class="kw-2">mut</span> <span class="ident">end</span> <span class="op">=</span> [<span class="number">0</span>; <span class="ident">MAX_UTF8_BYTES</span>];
<span class="kw">let</span> <span class="ident">n</span> <span class="op">=</span> <span class="ident">r</span>.<span class="ident">encode</span>(<span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="ident">start</span>, <span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="ident">end</span>);
<span class="kw">return</span> <span class="prelude-val">Some</span>(<span class="ident">Utf8Sequence</span>::<span class="ident">from_encoded_range</span>(
<span class="kw-2">&amp;</span><span class="ident">start</span>[<span class="number">0</span>..<span class="ident">n</span>], <span class="kw-2">&amp;</span><span class="ident">end</span>[<span class="number">0</span>..<span class="ident">n</span>]));
}
}
<span class="prelude-val">None</span>
}
}
<span class="kw">impl</span> <span class="ident">ScalarRange</span> {
<span class="doccomment">/// split splits this range if it overlaps with a surrogate codepoint.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// Either or both ranges may be invalid.</span>
<span class="kw">fn</span> <span class="ident">split</span>(<span class="kw-2">&amp;</span><span class="self">self</span>) <span class="op">-&gt;</span> <span class="prelude-ty">Option</span><span class="op">&lt;</span>(<span class="ident">ScalarRange</span>, <span class="ident">ScalarRange</span>)<span class="op">&gt;</span> {
<span class="kw">if</span> <span class="self">self</span>.<span class="ident">start</span> <span class="op">&lt;</span> <span class="number">0xE000</span> <span class="op">&amp;&amp;</span> <span class="self">self</span>.<span class="ident">end</span> <span class="op">&gt;</span> <span class="number">0xD7FF</span> {
<span class="prelude-val">Some</span>((<span class="ident">ScalarRange</span> {
<span class="ident">start</span>: <span class="self">self</span>.<span class="ident">start</span>,
<span class="ident">end</span>: <span class="number">0xD7FF</span>,
}, <span class="ident">ScalarRange</span> {
<span class="ident">start</span>: <span class="number">0xE000</span>,
<span class="ident">end</span>: <span class="self">self</span>.<span class="ident">end</span>,
}))
} <span class="kw">else</span> {
<span class="prelude-val">None</span>
}
}
<span class="doccomment">/// is_valid returns true if and only if start &lt;= end.</span>
<span class="kw">fn</span> <span class="ident">is_valid</span>(<span class="kw-2">&amp;</span><span class="self">self</span>) <span class="op">-&gt;</span> <span class="ident">bool</span> {
<span class="self">self</span>.<span class="ident">start</span> <span class="op">&lt;=</span> <span class="self">self</span>.<span class="ident">end</span>
}
<span class="doccomment">/// as_ascii returns this range as a Utf8Range if and only if all scalar</span>
<span class="doccomment">/// values in this range can be encoded as a single byte.</span>
<span class="kw">fn</span> <span class="ident">as_ascii</span>(<span class="kw-2">&amp;</span><span class="self">self</span>) <span class="op">-&gt;</span> <span class="prelude-ty">Option</span><span class="op">&lt;</span><span class="ident">Utf8Range</span><span class="op">&gt;</span> {
<span class="kw">if</span> <span class="self">self</span>.<span class="ident">is_ascii</span>() {
<span class="prelude-val">Some</span>(<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="self">self</span>.<span class="ident">start</span> <span class="kw">as</span> <span class="ident">u8</span>, <span class="self">self</span>.<span class="ident">end</span> <span class="kw">as</span> <span class="ident">u8</span>))
} <span class="kw">else</span> {
<span class="prelude-val">None</span>
}
}
<span class="doccomment">/// is_ascii returns true if the range is ASCII only (i.e., takes a single</span>
<span class="doccomment">/// byte to encode any scalar value).</span>
<span class="kw">fn</span> <span class="ident">is_ascii</span>(<span class="kw-2">&amp;</span><span class="self">self</span>) <span class="op">-&gt;</span> <span class="ident">bool</span> {
<span class="self">self</span>.<span class="ident">is_valid</span>() <span class="op">&amp;&amp;</span> <span class="self">self</span>.<span class="ident">end</span> <span class="op">&lt;=</span> <span class="number">0x7f</span>
}
<span class="doccomment">/// encode writes the UTF-8 encoding of the start and end of this range</span>
<span class="doccomment">/// to the corresponding destination slices.</span>
<span class="doccomment">///</span>
<span class="doccomment">/// The slices should have room for at least `MAX_UTF8_BYTES`.</span>
<span class="kw">fn</span> <span class="ident">encode</span>(<span class="kw-2">&amp;</span><span class="self">self</span>, <span class="ident">start</span>: <span class="kw-2">&amp;</span><span class="kw-2">mut</span> [<span class="ident">u8</span>], <span class="ident">end</span>: <span class="kw-2">&amp;</span><span class="kw-2">mut</span> [<span class="ident">u8</span>]) <span class="op">-&gt;</span> <span class="ident">usize</span> {
<span class="kw">let</span> <span class="ident">cs</span> <span class="op">=</span> <span class="ident">char</span>::<span class="ident">from_u32</span>(<span class="self">self</span>.<span class="ident">start</span>).<span class="ident">unwrap</span>();
<span class="kw">let</span> <span class="ident">ce</span> <span class="op">=</span> <span class="ident">char</span>::<span class="ident">from_u32</span>(<span class="self">self</span>.<span class="ident">end</span>).<span class="ident">unwrap</span>();
<span class="kw">let</span> <span class="ident">n</span> <span class="op">=</span> <span class="ident">encode_utf8</span>(<span class="ident">cs</span>, <span class="ident">start</span>).<span class="ident">unwrap</span>();
<span class="kw">let</span> <span class="ident">m</span> <span class="op">=</span> <span class="ident">encode_utf8</span>(<span class="ident">ce</span>, <span class="ident">end</span>).<span class="ident">unwrap</span>();
<span class="macro">assert_eq</span><span class="macro">!</span>(<span class="ident">n</span>, <span class="ident">m</span>);
<span class="ident">n</span>
}
}
<span class="kw">fn</span> <span class="ident">max_scalar_value</span>(<span class="ident">nbytes</span>: <span class="ident">usize</span>) <span class="op">-&gt;</span> <span class="ident">u32</span> {
<span class="kw">match</span> <span class="ident">nbytes</span> {
<span class="number">1</span> <span class="op">=&gt;</span> <span class="number">0x007F</span>,
<span class="number">2</span> <span class="op">=&gt;</span> <span class="number">0x07FF</span>,
<span class="number">3</span> <span class="op">=&gt;</span> <span class="number">0xFFFF</span>,
<span class="number">4</span> <span class="op">=&gt;</span> <span class="number">0x10FFFF</span>,
<span class="kw">_</span> <span class="op">=&gt;</span> <span class="macro">unreachable</span><span class="macro">!</span>(<span class="string">&quot;invalid UTF-8 byte sequence size&quot;</span>),
}
}
<span class="attribute">#[<span class="ident">cfg</span>(<span class="ident">test</span>)]</span>
<span class="kw">mod</span> <span class="ident">tests</span> {
<span class="kw">use</span> <span class="ident">std</span>::<span class="ident">char</span>;
<span class="kw">use</span> <span class="ident">quickcheck</span>::{<span class="ident">TestResult</span>, <span class="ident">quickcheck</span>};
<span class="kw">use</span> <span class="ident">char_utf8</span>::<span class="ident">encode_utf8</span>;
<span class="kw">use</span> {<span class="ident">MAX_UTF8_BYTES</span>, <span class="ident">Utf8Range</span>, <span class="ident">Utf8Sequences</span>};
<span class="kw">fn</span> <span class="ident">rutf8</span>(<span class="ident">s</span>: <span class="ident">u8</span>, <span class="ident">e</span>: <span class="ident">u8</span>) <span class="op">-&gt;</span> <span class="ident">Utf8Range</span> {
<span class="ident">Utf8Range</span>::<span class="ident">new</span>(<span class="ident">s</span>, <span class="ident">e</span>)
}
<span class="kw">fn</span> <span class="ident">never_accepts_surrogate_codepoints</span>(<span class="ident">start</span>: <span class="ident">char</span>, <span class="ident">end</span>: <span class="ident">char</span>) {
<span class="kw">let</span> <span class="kw-2">mut</span> <span class="ident">buf</span> <span class="op">=</span> [<span class="number">0</span>; <span class="ident">MAX_UTF8_BYTES</span>];
<span class="kw">for</span> <span class="ident">cp</span> <span class="kw">in</span> <span class="number">0xD800</span>..<span class="number">0xE000</span> {
<span class="kw">let</span> <span class="ident">c</span> <span class="op">=</span> <span class="kw">unsafe</span> { ::<span class="ident">std</span>::<span class="ident">mem</span>::<span class="ident">transmute</span>(<span class="ident">cp</span>) };
<span class="kw">let</span> <span class="ident">n</span> <span class="op">=</span> <span class="ident">encode_utf8</span>(<span class="ident">c</span>, <span class="kw-2">&amp;</span><span class="kw-2">mut</span> <span class="ident">buf</span>).<span class="ident">unwrap</span>();
<span class="kw">for</span> <span class="ident">r</span> <span class="kw">in</span> <span class="ident">Utf8Sequences</span>::<span class="ident">new</span>(<span class="ident">start</span>, <span class="ident">end</span>) {
<span class="kw">if</span> <span class="ident">r</span>.<span class="ident">matches</span>(<span class="kw-2">&amp;</span><span class="ident">buf</span>[<span class="number">0</span>..<span class="ident">n</span>]) {
<span class="macro">panic</span><span class="macro">!</span>(<span class="string">&quot;Sequence ({:X}, {:X}) contains range {:?}, \
which matches surrogate code point {:X} \
with encoded bytes {:?}&quot;</span>,
<span class="ident">start</span> <span class="kw">as</span> <span class="ident">u32</span>, <span class="ident">end</span> <span class="kw">as</span> <span class="ident">u32</span>, <span class="ident">r</span>, <span class="ident">cp</span>, <span class="kw-2">&amp;</span><span class="ident">buf</span>[<span class="number">0</span>..<span class="ident">n</span>]);
}
}
}
}
<span class="attribute">#[<span class="ident">test</span>]</span>
<span class="kw">fn</span> <span class="ident">codepoints_no_surrogates</span>() {
<span class="ident">never_accepts_surrogate_codepoints</span>(<span class="string">&#39;\u{0}&#39;</span>, <span class="string">&#39;\u{FFFF}&#39;</span>);
<span class="ident">never_accepts_surrogate_codepoints</span>(<span class="string">&#39;\u{0}&#39;</span>, <span class="string">&#39;\u{10FFFF}&#39;</span>);
<span class="ident">never_accepts_surrogate_codepoints</span>(<span class="string">&#39;\u{0}&#39;</span>, <span class="string">&#39;\u{10FFFE}&#39;</span>);
<span class="ident">never_accepts_surrogate_codepoints</span>(<span class="string">&#39;\u{80}&#39;</span>, <span class="string">&#39;\u{10FFFF}&#39;</span>);
<span class="ident">never_accepts_surrogate_codepoints</span>(<span class="string">&#39;\u{D7FF}&#39;</span>, <span class="string">&#39;\u{E000}&#39;</span>);
}
<span class="attribute">#[<span class="ident">test</span>]</span>
<span class="kw">fn</span> <span class="ident">single_codepoint_one_sequence</span>() {
<span class="comment">// Tests that every range of scalar values that contains a single</span>
<span class="comment">// scalar value is recognized by one sequence of byte ranges.</span>
<span class="kw">for</span> <span class="ident">i</span> <span class="kw">in</span> <span class="number">0x0</span>..(<span class="number">0x10FFFF</span> <span class="op">+</span> <span class="number">1</span>) {
<span class="kw">let</span> <span class="ident">c</span> <span class="op">=</span> <span class="kw">match</span> <span class="ident">char</span>::<span class="ident">from_u32</span>(<span class="ident">i</span>) {
<span class="prelude-val">None</span> <span class="op">=&gt;</span> <span class="kw">continue</span>,
<span class="prelude-val">Some</span>(<span class="ident">c</span>) <span class="op">=&gt;</span> <span class="ident">c</span>,
};
<span class="kw">let</span> <span class="ident">seqs</span>: <span class="ident">Vec</span><span class="op">&lt;</span><span class="kw">_</span><span class="op">&gt;</span> <span class="op">=</span> <span class="ident">Utf8Sequences</span>::<span class="ident">new</span>(<span class="ident">c</span>, <span class="ident">c</span>).<span class="ident">collect</span>();
<span class="macro">assert_eq</span><span class="macro">!</span>(<span class="ident">seqs</span>.<span class="ident">len</span>(), <span class="number">1</span>);
}
}
<span class="attribute">#[<span class="ident">test</span>]</span>
<span class="kw">fn</span> <span class="ident">qc_codepoints_no_surrogate</span>() {
<span class="kw">fn</span> <span class="ident">p</span>(<span class="ident">s</span>: <span class="ident">char</span>, <span class="ident">e</span>: <span class="ident">char</span>) <span class="op">-&gt;</span> <span class="ident">TestResult</span> {
<span class="kw">if</span> <span class="ident">s</span> <span class="op">&gt;</span> <span class="ident">e</span> {
<span class="kw">return</span> <span class="ident">TestResult</span>::<span class="ident">discard</span>();
}
<span class="ident">never_accepts_surrogate_codepoints</span>(<span class="ident">s</span>, <span class="ident">e</span>);
<span class="ident">TestResult</span>::<span class="ident">passed</span>()
}
<span class="ident">quickcheck</span>(<span class="ident">p</span> <span class="kw">as</span> <span class="kw">fn</span>(<span class="ident">char</span>, <span class="ident">char</span>) <span class="op">-&gt;</span> <span class="ident">TestResult</span>);
}
<span class="attribute">#[<span class="ident">test</span>]</span>
<span class="kw">fn</span> <span class="ident">bmp</span>() {
<span class="kw">use</span> <span class="ident">Utf8Sequence</span>::<span class="kw-2">*</span>;
<span class="kw">let</span> <span class="ident">seqs</span> <span class="op">=</span> <span class="ident">Utf8Sequences</span>::<span class="ident">new</span>(<span class="string">&#39;\u{0}&#39;</span>, <span class="string">&#39;\u{FFFF}&#39;</span>)
.<span class="ident">collect</span>::<span class="op">&lt;</span><span class="ident">Vec</span><span class="op">&lt;</span><span class="kw">_</span><span class="op">&gt;&gt;</span>();
<span class="macro">assert_eq</span><span class="macro">!</span>(<span class="ident">seqs</span>, <span class="macro">vec</span><span class="macro">!</span>[
<span class="ident">One</span>(<span class="ident">rutf8</span>(<span class="number">0x0</span>, <span class="number">0x7F</span>)),
<span class="ident">Two</span>([<span class="ident">rutf8</span>(<span class="number">0xC2</span>, <span class="number">0xDF</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>)]),
<span class="ident">Three</span>([<span class="ident">rutf8</span>(<span class="number">0xE0</span>, <span class="number">0xE0</span>), <span class="ident">rutf8</span>(<span class="number">0xA0</span>, <span class="number">0xBF</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>)]),
<span class="ident">Three</span>([<span class="ident">rutf8</span>(<span class="number">0xE1</span>, <span class="number">0xEC</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>)]),
<span class="ident">Three</span>([<span class="ident">rutf8</span>(<span class="number">0xED</span>, <span class="number">0xED</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0x9F</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>)]),
<span class="ident">Three</span>([<span class="ident">rutf8</span>(<span class="number">0xEE</span>, <span class="number">0xEF</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>), <span class="ident">rutf8</span>(<span class="number">0x80</span>, <span class="number">0xBF</span>)]),
]);
}
<span class="attribute">#[<span class="ident">test</span>]</span>
<span class="kw">fn</span> <span class="ident">scratch</span>() { <span class="comment">// debugging aid: prints each sequence and asserts nothing</span>
<span class="kw">for</span> <span class="ident">range</span> <span class="kw">in</span> <span class="ident">Utf8Sequences</span>::<span class="ident">new</span>(<span class="string">&#39;\u{0}&#39;</span>, <span class="string">&#39;\u{FFFF}&#39;</span>) {
<span class="macro">println</span><span class="macro">!</span>(<span class="string">&quot;{:?}&quot;</span>, <span class="ident">range</span>);
}
}
}
</pre>
</section><section id="search" class="content hidden"></section><section class="footer"></section><aside id="help" class="hidden"><div><h1 class="hidden">Help</h1><div class="shortcuts"><h2>Keyboard Shortcuts</h2><dl><dt><kbd>?</kbd></dt><dd>Show this help dialog</dd><dt><kbd>S</kbd></dt><dd>Focus the search field</dd><dt><kbd>&#8593;</kbd></dt><dd>Move up in search results</dd><dt><kbd>&#8595;</kbd></dt><dd>Move down in search results</dd><dt><kbd>&#8633;</kbd></dt><dd>Switch tab</dd><dt><kbd>&#9166;</kbd></dt><dd>Go to active search result</dd><dt><kbd>+</kbd></dt><dd>Expand all sections</dd><dt><kbd>-</kbd></dt><dd>Collapse all sections</dd></dl></div><div class="infos"><h2>Search Tricks</h2><p>Prefix searches with a type followed by a colon (e.g. <code>fn:</code>) to restrict the search to a given type.</p><p>Accepted types are: <code>fn</code>, <code>mod</code>, <code>struct</code>, <code>enum</code>, <code>trait</code>, <code>type</code>, <code>macro</code>, and <code>const</code>.</p><p>Search functions by type signature (e.g. <code>vec -&gt; usize</code> or <code>* -&gt; vec</code>)</p><p>Search multiple things at once by splitting your query with comma (e.g. <code>str,u8</code> or <code>String,struct:Vec,test</code>)</p></div></div></aside><script>window.rootPath = "../../";window.currentCrate = "utf8_ranges";</script><script src="../../aliases.js"></script><script src="../../main.js"></script><script defer src="../../search-index.js"></script></body></html>