Forbidding \r\n line endings in the encoding just sort of sinks the whole idea. The first couple ideas are nice, but then you suddenly get normative with what characters are allowed to be encoded? That creates a very large initial hurdle to clear to get people to use your encoding. Suddenly you need to forbid specific texts, instead of just handling everything. Why put such a huge footgun in your system when it's not necessary?
Yeah it doesn’t make much sense. In addition to being the default line ending on Windows, \r\n is part of the syntax of many text-based protocols (e.g. SMTP and IMAP) that support UTF-8, so clients/servers of all these protocols would be broken.
Many things here make sense to me, but as we can all guess, this will never become a thing :(
But the "magic number" thing to me is a waste of space. If this standard is accepted, if no magic number you have corrected UTF-8.
As for \r\n, not a big deal to me. I would like to see it forbidden, if only to force Microsoft to use \n like UN*X and Apple. I still need to deal with \r\n showing up in files every so often.
"If this standard is accepted, if no magic number you have corrected UTF-8."
That's true only if "corrected UTF-8" is accepted and existing UTF-8 becomes obsolete. That can't happen. There's too much existing UTF-8 text that will never be translated to a newer standard.
You do realize that it's the UNIX people who are the strange ones here? CRLF has been used as the line delimiter by everyone (except IBM, who always lived in their own special EBCDIC land) since the late sixties, but then Thompson decided that he'd rather do LF-to-CRLF translation in the kernel tty driver than store the text on disk as-is, like literally every other OS did (and continued to do).
Besides, the terminal emulators nowadays speak UTF-8 natively; and they absolutely do behave differently for naked LF and CRLF, and you can see it for yourself if you exec "stty -onlcr" and then try to echo or cat some stuff. Sure, you can try to persuade every single terminal emulator's author to adopt "automatic carriage return" but most will refuse to; and you will also need to somehow persuade people to stop emitting CR+LF combination in the raw mode... but then you'll need to give them back the old LF functionality (go down one line, scroll if necessary) somehow. Now, such functionality exists as the IND character — which is in the now forbidden C1 block. Simply amazing!
I gotta side with Thompson on this one.
There's no point in a carriage return without a newline. So why have both just because of the 1933 teletype's hardware implementation? It's purely a hardware thing. That's why Multics used \n, and that's likely why Thompson chose to continue that practice.
When ASCII came about, it wasn't really about text files. Computers didn't talk to each other back then. ASCII was about sending characters between devices, and for compatibility reasons a lot of devices copied \r\n from the teletype. But there were a lot of devices that didn't as well. Putting it in the driver makes perfect sense from the point of view of someone developing a system in the 1960s.
> There's no point in a carriage return without a newline.
Progress bars, and TUIs in general; CR without LF is still quite handy for them. But even for paper teletypes, it has a marginal use for correcting typos: return the carriage, print spaces over the correct text, and overtype the corrections on top of the mistyped letters.
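For concreteness, here's a tiny Rust sketch of the progress-bar trick (just the usual \r-overwrite idiom; nothing from the article):

    use std::io::{self, Write};
    use std::thread;
    use std::time::Duration;

    fn main() -> io::Result<()> {
        for pct in 0..=100 {
            // CR alone returns the cursor to column 0 without advancing a line,
            // so each update overwrites the previous one.
            print!("\rProgress: {pct:3}%");
            io::stdout().flush()?;
            thread::sleep(Duration::from_millis(20));
        }
        println!(); // finish with an actual newline
        Ok(())
    }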
> When ASCII came about, it wasn't really about text files. Computers didn't talk to each other back then.
By the time it was finished being standardized, they did. One of the earliest RFCs, RFC 20, talks, among other things, about using ASCII in "HOST-HOST primary connections". Actually, reading the descriptions of the "format effectors" in that RFC is quite illuminating: it explicitly mentions "display devices", that is, "glass" terminals. It also talks about the possibility of using LF as NL but notes that it requires exact matching of the semantics in both sender and receiver. But even without networks, exchanging data on magnetic/paper tapes, punch cards etc. between computers was already a well-established thing. After all, they did not invent and standardize ASCII simply because they had nothing better to do with their time!
Don't get me wrong, I too think that having a single-character newline delimiter/line terminator for use in text files is better than using a two-character sequence cobbled together. But many disagreed, and all of the RFC-documented protocols up until very recently used CRLF as the line separator, so this convention obviously used to have rather broad support. Now, whether LF-to-CRLF translation, and line discipline in general, belongs in the kernel is a different question; I personally think it should've been lifted out of there and not conflated with serial port management, but alas, it is what it is.
> "like literally every other OS did (and continued to do)."
If I remember correctly, Macs used a bare carriage return as the line delimiter.
So the trick when you got a text document was to figure out where it came from.
windows = crlf mac = cr unix = lf
I suspect that nowadays (I don't have a Mac, so this is a guess), because Macs are more or less a Unix system, they default to linefeeds.
Yeah, Macs IIRC just stored the keyboard input instead (the Enter key generates CR, which is also an old and venerable tradition, that's why there is ICRNL flag for the terminal input).
Obviously, in a perfect world we would have a single NL character for storing in text files in memory/on disk/in transit, and terminals would use entirely different CR and IND control codes, and internally translate NL to CR+IND combination when printing text, and send NL to the host as the keycode of the Enter key when it's pressed. Alas, that train has sailed long ago (and let's not even start on choosing BS versus DEL).
Magic numbers do appear a lot in C# programs. The default text encoder will output a BOM marker.
I don't expect anyone to adopt this. Listing complaints about a heavily used standard, and proposing something else incompatible won't gain any traction.
Compare to WTF-8, which solves a different problem (representing invalid 16-bit characters within an 8-bit encoding).
Yeah, WTF-8 is a very straightforward case of "the spec semi-artificially says we can't do this one thing, and that prevents you from using UTF-8 under the hood to represent JS and Java strings, which allow unpaired UTF-16 surrogates; so in practice UTF-8-except-this-one-thing is the only way to do an in-memory representation in anything that wants to implement or round-trip interop with those".
It's literally the exact opposite of this proposal, in that there's an actual concrete problem and how to make it not a problem. This one is a list of weird grievances that aren't actually problems for anyone, like the max code point number.
I came up with a scheme a number of years ago that takes advantage of the illegality of overlong encodings [0].
Obviously UTF-8 has 256 code units (<00> to <FF>). 128 of them are always valid within a UTF-8 string (ASCII, <00> to <7F>), leaving 128 code units that could be invalid within a UTF-8 string (<80> to <FF>).
There also happen to be exactly 128 2-byte overlong representations (overlong representations of ASCII characters).
Basically, any byte in some input that can't be interpreted as valid UTF-8 can be replaced with a 2-byte overlong representation. This can be used as an extension of WTF-8 so that UTF-16 and UTF-8 errors can both be stored in the same stream. I called the encoding WTF-8b [2], though I'd be interested to know if someone else has come up with the same scheme.
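For illustration, here's a rough Rust sketch of one plausible mapping (the exact pairing WTF-8b uses may differ; the function names are made up): an un-decodable byte 0x80..0xFF is stored as the two-byte overlong form of that byte minus 0x80.

    /// Encode a byte that can't be interpreted as UTF-8 (0x80..=0xFF) as a
    /// two-byte overlong sequence. This maps byte b to the overlong form of
    /// b - 0x80; the exact pairing is an assumption for illustration.
    fn encode_invalid_byte(b: u8) -> [u8; 2] {
        assert!(b >= 0x80);
        let v = b - 0x80;                      // 0x00..=0x7F
        [0xC0 | (v >> 6), 0x80 | (v & 0x3F)]   // overlong 2-byte form of v
    }

    /// Recover the original byte, if the pair is one of these overlong forms.
    fn decode_invalid_byte(pair: [u8; 2]) -> Option<u8> {
        let lead_overlong = pair[0] == 0xC0 || pair[0] == 0xC1; // only C0/C1 leads are overlong
        let cont_ok = (pair[1] & 0xC0) == 0x80;
        if lead_overlong && cont_ok {
            Some(0x80 + (((pair[0] & 0x01) << 6) | (pair[1] & 0x3F)))
        } else {
            None
        }
    }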
This should technically be "fine" WRT Unicode text processing, since it involves transforming invalid Unicode into other invalid Unicode. This principle is already used by WTF-8.
I used it to improve preservation of invalid Unicode (ie, random 8-bit data in UTF-8 text or random 16-bit data in JSON strings) in jq, though I suspect the PR [1] won't be accepted. I still find the changes very useful personally, so maybe I'll come up with a different approach some time.
[0] https://github.com/Maxdamantus/jq/blob/911d01aaa5bd33137fadf...
[1] https://github.com/jqlang/jq/pull/2314
[2] I think I used the name "WTF-8b" as an allusion to UTF-8b/surrogateescape/PEP-383 which also encodes ill-formed UTF-8, though UTF-8b is less efficient storage-wise and is not compatible with WTF-8.
That's brilliant, tbh. I guess the challenge is how you represent those in the decoded character space. Maybe they should allocate 128 characters somewhere and define them as "invalid byte values".
In my jq PR I used negative numbers to represent them (the original byte, negated), since they're already just using `int` to represent a decoded code point, and it's somewhat normal to return distinguishable errors as negative numbers in C. I think it would also make sense to represent the UTF-16 errors ("unpaired surrogates") as negative numbers, though I didn't make that change internally (maybe because they're already used elsewhere). I did make it so that they are represented as negatives in `explode` however, so `"\uD800" | explode` emits `[-0xD800]`.
In something other than C, I'd expect they should be distinguished as members of an enumeration or something, eg:
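Something along these lines in Rust, say (names invented for illustration, not taken from the jq PR):

    // A hypothetical "decoded unit" type: either a real scalar value or one of
    // the two kinds of preserved errors discussed above.
    enum DecodedUnit {
        Scalar(char),            // a well-formed Unicode scalar value
        InvalidByte(u8),         // a byte 0x80..=0xFF that wasn't valid UTF-8
        UnpairedSurrogate(u16),  // an unpaired UTF-16 surrogate, 0xD800..=0xDFFF
    }

    fn describe(u: &DecodedUnit) -> String {
        match u {
            DecodedUnit::Scalar(c) => format!("U+{:04X}", *c as u32),
            DecodedUnit::InvalidByte(b) => format!("invalid byte 0x{b:02X}"),
            DecodedUnit::UnpairedSurrogate(s) => format!("unpaired surrogate 0x{s:04X}"),
        }
    }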
I'm surprised this doesn't mandate one of the Unicode Normalization Forms. Normalization is both obscure and complex. Unicode should have a single canonical binary encoding for all character sequences.
It's a missed opportunity that this isn't already the case - but if you're going to replace UTF-8, we should absolutely mandate one of the normalization forms along the way.
https://unicode.org/reports/tr15/
This proposal seems like an attempt to reverse-engineer a normalization form into an encoding form. At face value, an encoding form that technically can't even represent denormalized text sounds like a good thing. But once you read the details of all the normalization forms and get into the weeds and edge cases of why normal forms are locale specific, and why they are so complex even beyond that, you start to question whether a "single canonical binary encoding for character sequences" is possible at all, and you start to appreciate why the normal forms are algorithms at a layer above the raw binary encoding rather than something built into the binary encoding form itself.
Normalization is annoying but understandable - you have common characters that are clearly SOMETHING + MODIFIER, and they are common enough that you want to represent them as a single character to avoid byte explosion. SOMETHING and MODIFIER are also both useful on their own, potentially combining with other less common characters that are less valuable to encode (infrequent, but still valuable).
If you skip all the modifiers, you end up with an explosion in code space. If you skip all the precomposed characters, you end up with an explosion in bytes.
There's no good solution here, so normalization makes sense. But then the committee says ".. and what about this kind of normalization" and then you end up.. here.
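For anyone who hasn't run into it, the classic composed/decomposed example, sketched in Rust with the third-party unicode-normalization crate:

    use unicode_normalization::UnicodeNormalization;

    fn main() {
        let decomposed = "e\u{0301}";          // 'e' + COMBINING ACUTE ACCENT (2 codepoints, 3 bytes)
        let composed: String = decomposed.nfc().collect();
        assert_eq!(composed, "\u{00E9}");      // precomposed 'é' (1 codepoint, 2 bytes)
        assert_eq!(decomposed.len(), 3);
        assert_eq!(composed.len(), 2);
        // Both render identically and are canonically equivalent, which is
        // exactly why a normalization step (or a mandated form) matters.
    }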
Right. But if we had a chance for a do-over, it'd be really nice if we all just agreed on a normalization form and used it from the start in all our software. Seems like a missed opportunity not to.
I think NFC is the agreed-upon normalization form, is it not? The only real exception I can think of is HFS+ but that was corrected in APFS (which uses NFC now like the rest of the world).
I don't think you can mandate that in this kind of encoding. This just encodes code points, with some choices so certain invalid code points are unable to be encoded.
But normalized forms are about sequences of code points that are semantically equivalent. You can't make the non-normalized code point sequences unencodable in an encoding that only looks at one code point at a time. You wouldn't want to anchor the encoding to any particular version of Unicode either.
Normalized forms have to happen at another layer. That layer is often omitted for efficiency or because nobody stopped to consider it, but the code point encoding layer isn't the right place.
He got very close to killing SOH (U+0001), which is useful in various technical specifications. He seems to still want to put the boot in.
I don't understand the desire to make existing characters unrepresentable for the sake of what? Shifting used characters earlier in the byte sequence?
This scheme skips over 80 through 9F because they claim it's never appropriate to send those control characters through interchangeable text, but it just seems like a very brave proposal to intentionally have codepoints that can't be encoded.
I think the offset scheme should only be used to fix overlength encodings, not to patch over an ad hoc hole at the same time. It seems safer to make it possible to encode all codepoints, whether those codepoints should be used or not. Unicode already has holes in various ranges anyways.
> I would like to discard almost all of the C0 controls as well—preserving only U+0000 and U+000A
What's wrong with horizontal tab?
1) Adding offsets to multi-byte sequences breaks compatibility with existing UTF-8 text, while generating text which can be decoded (incorrectly) as UTF-8. That seems like a non-starter. The alleged benefit of "eliminating overlength encodings" seems marginal; overlength encodings are already invalid. It also significantly increases the complexity of encoders and decoders, especially in dealing with discontinuities like the UTF-16 surrogate "hole".
2) I really doubt that the current upper limit of U+10_FFFF is going to need to be raised. Past growth in the Unicode standard has primarily been driven by the addition of more CJK characters; that isn't going to continue indefinitely.
3) Disallowing C0 characters like U+0009 (horizontal tab) is absurd, especially at the level of a text encoding.
4) BOMs are dumb. We learned that lesson in the early 2000s - even if they sound great as a way of identifying text encodings, they have a nasty way of sneaking into the middle of strings and causing havoc. Bringing them back is a terrible idea.
Yes, it should be completely incompatible with UTF-8, not only partially. As in, anything beyond ASCII should be invalid and not decodable as UTF-8.
If you do need the expansion of code point space, https://ucsx.org/ is the definitive answer; it was designed by actual Unicode contributors.
I was going to object to using something new at all, but their recommendation for up to 31 bits is the same as the original UTF-8. They only add new logic for sequences starting with FF.
I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.
The proposed extension mechanism itself is quite extensible in my understanding, so you should be able to define UCS-T and UCS-P (for tera and peta respectively) with minimal changes. The website offers an FAQ for this very topic [1], too.
[1] https://ucsx.org/why#3.1
That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions, I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.
The length encoding is fun to think about, if you want it to go all the way up to infinity, and avoid wasting bytes.
My thought: Bytes C2–FE begin 1 to 6 continuation bytes as usual. "FF 80+x", for x ≤ 0x3E, begins an (x+7)-byte sequence. "FF BF 80+x", again for x ≤ 0x3E, begins an (x+2)-byte length for the following sequence, offset as necessary to avoid overlong length encodings. (Length bits are expressed in the same 6-bit "80+x" encoding as the codepoint itself.) "FF BF BF 80+x" begins an (x+2)-byte length for the encoded length of the sequence. And so on, where the number of initial BF bytes denotes the number of length levels past the first. (I believe there's a name for this sort of representation, but I cannot find it.)
Assuming offsets are used properly, decoders would have an easy time jumping off the wagon at whatever point the lengths would become too long for them to possibly work with. In particular, you can get a simple subset up to 222 codepoint bits by just using "FF 80" through "FF BE" as simple lengths, and leaving "FF BF" reserved.
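If it helps, here's a rough Rust sketch of reading just the header of such a sequence, covering the classic lead bytes plus the simple FF 80+x subset. I'm assuming "an (x+7)-byte sequence" means x+7 continuation bytes follow the two header bytes (my reading, not necessarily the intended one), and the nested FF BF length-of-length forms are left out:

    /// How many continuation bytes follow, given the lead byte and (when the
    /// lead is FF) the next byte. Assumption: "FF 80+x" is followed by x+7
    /// continuation bytes; FF BF and the recursive forms are not handled.
    fn continuation_count(lead: u8, second: Option<u8>) -> Option<usize> {
        match lead {
            0x00..=0x7F => Some(0),            // ASCII
            0xC2..=0xDF => Some(1),            // classic 1..6 continuation bytes,
            0xE0..=0xEF => Some(2),            // as in the original FSS-UTF design
            0xF0..=0xF7 => Some(3),
            0xF8..=0xFB => Some(4),
            0xFC..=0xFD => Some(5),
            0xFE => Some(6),
            0xFF => match second {
                Some(x @ 0x80..=0xBE) => Some((x - 0x80) as usize + 7),
                _ => None,                     // FF BF and beyond: reserved here
            },
            _ => None,                         // continuation byte or invalid lead
        }
    }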
What about encoding it in such a way that we don't need huge tables to figure out the category of each code point?
It means that you are encoding those categories into the code point itself, which is a waste for every single use of the character encoding.
It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8.
It’s also not clear to me that the code point is a good abstraction in the design of UTF8. Usually, what you want is either the byte or the grapheme cluster.
> Usually, what you want is either the byte or the grapheme cluster.
Exactly! That's what I understood after reading this great post https://tonsky.me/blog/unicode/
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use-cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been handled with a dedicated 32-bit codepoint...)
Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.
But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composing rules?
You still get the combinatoric explosion, but you have more bits to work with. Imagine if you could combine any 9 jamo into a single Hangul syllable block. (The real combinatorics is more complicated, and I don't know if it's this bad.) Encoding just the 24 jamo and a control character requires 25 codepoints. Giving each syllable block its own codepoint would require 24^9 ≈ 2.6 × 10^12 codepoints, far more than the 2^32 ≈ 4.3 × 10^9 available.
> Giving each syllable block its own codepoint
That's the thing - you wouldn't do that! Only a small subset of frequently used combos would get its own id; the rest would only be composable.
Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.
For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, they belong to two separate pairs with their own uppercase and lowercase versions: dotless U+0049/U+0131 ("I" / "ı") and dotted U+0130/U+0069 ("İ" / "i").
Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailored casing would make tailored casing harder.
> but that would break ASCII compatibility, which is a step too far.
By the way, is there any great modern encoding design that is not afraid to do this?
A magic prefix (similar to a byte order mark, BOM) also kills the idea. The reason for the success of any standard is the ability to establish consensus while navigating existing constraints. UTF-8 won over codepages and UTF-16/32 by being purely ASCII-compatible. A magic prefix kills that compatibility.
"UTF-16 is now obsolete."? That's news to me.
I wish it were true, but it's not.
Yeah, for example it's how Java stores strings to this day. But I think it's more or less never transmitted over the network.
Even if all wire format encoding is utf8, you wouldn't be able to decode these new high codepoints into systems that are semantically utf16. Which is Java and JS at least, hardly "obsolete" targets to worry about.
And even Swift is designed so the strings can be utf8 or utf16 for cheap objc interop reasons.
Discarding compatibility with 2 of the top ~5 most widely used languages kind of reflects how disconnected the author is from the technical realities that would decide whether any fixed UTF-8 is feasible outside of the most toy use cases.
Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html - IETF approved and will have an RFC number in a few weeks.
Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.
The most important bit of that is the “Unicode Assignables” subset <https://www.ietf.org/archive/id/draft-bray-unichars-15.html#...>:
This is really helpful - thanks. I write a CRDT library for text editing. I should probably restrict the characters that I transport to the "Unicode Assignables" subset. I can't think of any sensible reason to let people insert characters like U+0000 into a collaborative text document.
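In case it's useful, here's a rough Rust sketch of that kind of filter. The exact ranges should come from the draft itself; the ones below are only my approximation (keep tab/LF/CR, drop the other C0/C1 controls and DEL, drop noncharacters; surrogates can't occur in a Rust char anyway):

    /// Approximate "is this codepoint reasonable to interchange" check.
    /// NOT the draft's exact definition of Unicode Assignables; consult
    /// draft-bray-unichars for the authoritative ranges.
    fn is_probably_assignable(c: char) -> bool {
        let cp = c as u32;
        let legacy_control = matches!(cp,
            0x00..=0x08 | 0x0B..=0x0C | 0x0E..=0x1F | 0x7F..=0x9F);
        let noncharacter = matches!(cp, 0xFDD0..=0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        !legacy_control && !noncharacter
    }

    fn sanitize(input: &str) -> String {
        input.chars().filter(|&c| is_probably_assignable(c)).collect()
    }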
> The original design of UTF-8 (as "FSS-UTF," by Pike and Thompson; standardized in 1996 by RFC 2044) could encode codepoints up to U+7FFF FFFF. In 2003 the IETF changed the specification (via RFC 3629) to disallow encoding any codepoint beyond U+10 FFFF. This was purely because of internal ISO and Unicode Consortium politics; they rejected the possibility of a future in which codepoints would exist that UTF-16 could not represent. UTF-16 is now obsolete, so there is no longer any reason to stick to this upper limit, and at the present rate of codepoint allocation, the space below U+10 FFFF will be exhausted in something like 600 years (less if private-use space is not reclaimed). Text encodings are forever; the time to avoid running out of space is now, not 550 years from now.
UTF-16 is integral to the workings of Windows, Java, and JavaScript, so it's not going away anytime soon. To make things worse, those systems don't even support surrogates correctly, to the point where we had to build WTF-8, a system for handling malformed UTF-8 converted from these UTF-16 early adopters. Before we can start talking about characters beyond plane 16, we need to find an answer for how those existing systems should handle characters beyond U+10FFFF.
I can't think of a good way for them to do this, though:
1. Opting in to an alternate UTF-8 string type to migrate these systems off UTF-16 means loads of old software that just chokes on new characters. Do you remember how MySQL decided you had to opt into utf8mb4 encoding to use astral characters in strings? And how basically nobody bothered to do this up until emoji forced everyone's hand? Do you want to do that dance again, but for the entire Windows API?
2. We can't just "rip out UTF-16" without breaking compatibility. WCHAR strings in Windows are expected to be sequences of 16-bit units holding Unicode codepoints, and programs can index those directly. JavaScript strings are a bit better in that they could be UTF-8 internally, but they still have length and indexing semantics inherited from Unicode 1.0 (see the small illustration after this list).
3. If we don't "rip out UTF-16" though, then we need some kind of representation of characters beyond plane 16. There is no space left in the BMP for this; we already used a good chunk of it for surrogates. Furthermore, it's a practical requirement of Unicode that all encodings be self-synchronizing. Deleting or inserting a byte shouldn't change the meaning of more than one or two characters.
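As a small illustration of the length/indexing point in item 2, here's how an astral character looks under UTF-16 semantics (a Rust sketch):

    fn main() {
        let s = "\u{1D11E}";                       // MUSICAL SYMBOL G CLEF, above the BMP
        assert_eq!(s.chars().count(), 1);          // one scalar value
        assert_eq!(s.encode_utf16().count(), 2);   // two UTF-16 units (a surrogate pair)
        assert_eq!(s.len(), 4);                    // four UTF-8 bytes
        // A UTF-16-based length/index sees 2 here, which is what JS strings and
        // WCHAR-style APIs expose; superastral codepoints would make this worse.
    }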
The most practical way forward for >U+10FFFF "superastrals" would be to reserve space for super-surrogates in the currently unused planes 4-13. A plane for low surrogates and half a plane for high would give us 31 bits of coding, but they'd already be astral characters. This yields the rather comical result of requiring 8 bytes to represent a 4-byte codepoint, because of two layers of surrogacy.
If we hadn't already dedicated codepoints to the first layer of surrogates, we could have had an alternative with unlimited coding range like UTF-8. If I were allowed to redefine 0xD800-0xDFFF, I'd change them from low and high surrogates to initial and extension surrogates, as such:
- 2-word initial surrogate: 0b1101110 + 9 bits of initial codepoint index (U+10000 through U+7FFFF)
- 3-word initial surrogate: 0b11011110 + 8 bits of initial codepoint index (U+80000 through U+FFFFFFF)
- 4-word initial surrogate: 0b110111110 + 7 bits of initial codepoint index (U+10000000 through U+1FFFFFFFFF)
- Extension surrogate: 0b110110 + 10 bits of additional codepoint index
U+80000 to U+10FFFF now take 6 bytes to encode instead of 4, but in exchange we now can encode U+110000 through U+FFFFFFF in the same size. We can even trudge on to 37-bit codepoints, if we decided to invent a surrogacy scheme for UTF-32[0] and also allow FE/FF to signal very long UTF-8 sequences as suggested in the original article. Suffice it to say this is a comically overbuilt system.
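For concreteness, here's a Rust sketch of just the 2-word case above, assuming the 19 codepoint bits are split directly (no offset): the top 9 bits go into the initial surrogate and the low 10 bits into the extension surrogate.

    /// 2-word case only (U+10000..=U+7FFFF). Initial surrogates carry prefix
    /// 0b1101110 (0xDC00..=0xDDFF), extension surrogates carry prefix 0b110110
    /// (0xD800..=0xDBFF), per the scheme described above.
    fn encode_2word(cp: u32) -> Option<[u16; 2]> {
        if !(0x1_0000..=0x7_FFFF).contains(&cp) {
            return None; // 3- and 4-word forms would handle larger codepoints
        }
        let initial = 0xDC00 | (cp >> 10) as u16;     // 9 payload bits
        let extension = 0xD800 | (cp & 0x3FF) as u16; // 10 payload bits
        Some([initial, extension])
    }

    fn decode_2word(words: [u16; 2]) -> Option<u32> {
        let init_ok = (words[0] & 0xFE00) == 0xDC00; // 0b1101110x xxxxxxxx
        let ext_ok = (words[1] & 0xFC00) == 0xD800;  // 0b110110xx xxxxxxxx
        if init_ok && ext_ok {
            Some(((words[0] as u32 & 0x1FF) << 10) | (words[1] as u32 & 0x3FF))
        } else {
            None
        }
    }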
Of course, the feasibility of this is also debatable. I just spent a good while explaining why we can't touch UTF-16 at all, right? Well, most of the stuff that is married to UTF-16 specifically ignores surrogates, treating them as a headache for the application developer. In practice, mispaired surrogates never break things; that's why we had to invent WTF-8 to clean up after that mess.
You may have noticed that initial surrogates in my scheme occupy the coding space for low surrogates. Existing surrogates are supposed to be sent in the order high, low. So an initial, extension pair is actually the opposite surrogate order from what existing code expects. Unfortunately this isn't quite self-synchronizing in the world we currently live in. Deleting an initial surrogate will change the meaning of all following 2-word pairs to high/low pairs, unless you have some out of band way to signal that some text is encoded with initial / extension surrogates instead of high / low pairs. So I wouldn't recommend sending anything like this on the wire, and UTF-16 parsers would need to forbid mixed surrogacy ordering.
But then again, nobody sends UTF-16 on the wire anymore, so I don't know how much of a problem this would be. And of course, there's the underlying problem that the demand for codepoints beyond U+10FFFF is very low. Hell, the article itself admits that at the current Unicode growth rate we have about 600 years before running into this problem.
[0] Un(?)fortunately this would not be able to reuse the existing surrogate space for UTF-16, meaning we'd need to have a huge amount of the superastral planes reserved for even more comically large expansion.