Not everybody will be happy about giving up convenience for correctness, as this breaks the terrible, terrible PEP 383 hack...
While I agree that this would be a reasonable design if we started from scratch today, I'm a bit wary of changing it at this point. I honestly have no idea what sort of assumptions might break by introducing this change now. We would at the very least need to document the change loudly and add a note to infelicities.rst documenting the departure from the Report.
@JaffaCake, @simonpj what is your opinion on this?
Some background: I had a user of a serialisation library I work on complain that the library wasn't roundtripping Strings. It turned out the problem was that the strings weren't valid Unicode strings, containing surrogate code points which can't be represented by UTF-8. Sadly GHC currently lets users construct such string literals without even a warning and Enum, etc. will generate such invalid characters. This patch fixes that.
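To make the problem concrete, here is a minimal sketch of how a lone surrogate can end up in a `Char` on a GHC without this patch. It uses only `Data.Char` from base; the names `loneSurrogate`, `steppedInto` and `isSurrogateChar` are made up for illustration.

```haskell
import Data.Char (GeneralCategory (Surrogate), chr, generalCategory)

-- Illustrative name: a lone high surrogate built with 'chr'.
-- Unpatched GHC accepts this without an error or a warning.
loneSurrogate :: Char
loneSurrogate = chr 0xD800

-- Illustrative name: the Enum instance steps from the last code point
-- before the surrogate block (U+D7FF) straight into it.
steppedInto :: Char
steppedInto = succ '\xD7FF'

-- Illustrative helper: True for code points in U+D800 .. U+DFFF.
isSurrogateChar :: Char -> Bool
isSurrogateChar c = generalCategory c == Surrogate
```

Loaded into GHCi, `isSurrogateChar loneSurrogate` and `steppedInto == loneSurrogate` should both evaluate to `True`, which is exactly the kind of value a UTF-8 encoder has no valid representation for.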
Whether this would represent a departure from the Report depends on what "Unicode character" is supposed to mean. The Haskell Report doesn't use exact terminology here, IMHO:
The character type Char is an enumeration whose values represent Unicode characters
The character type Char is an enumeration whose values represent Unicode (or equivalently ISO/IEC 10646) characters (see http://www.unicode.org/ for details).
This boils down to whether a single code point in the surrogate range (U+D800 to U+DFFF) represents a "Unicode character". Unicode talks about things like
- Code points (U+0000 .. U+10FFFF)
- Scalar Values (all code-points sans [U+D800 .. U+DFFF])
- Code units (e.g. 8-bit for UTF-8, 32-bit for UTF-32 encoding form)
- Graphemes ("What a user thinks of as a character", can be composed of one or more code-units)
- Glyph (the font image)
with more or less well-defined meanings, but "character" itself seems to be an overloaded term with no exact definition that can mean many things.
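The distinction between code points and scalar values from the list above can be written down as a small predicate. This is only a sketch of the definition, not the code in the patch; `isScalarValue`, `chrScalar` and `scalarValueCount` are hypothetical names that do not exist in base.

```haskell
import Data.Char (chr)

-- Hypothetical helper: a code point is a scalar value iff it lies in
-- U+0000 .. U+10FFFF and outside the surrogate range U+D800 .. U+DFFF.
isScalarValue :: Int -> Bool
isScalarValue n = (0 <= n && n < 0xD800) || (0xDFFF < n && n <= 0x10FFFF)

-- Hypothetical checked variant of 'chr' that rejects surrogates and
-- out-of-range values instead of constructing (or erroring on) them.
chrScalar :: Int -> Maybe Char
chrScalar n
  | isScalarValue n = Just (chr n)
  | otherwise       = Nothing

-- Number of scalar values: 0x110000 code points minus 0x800 surrogates.
scalarValueCount :: Int
scalarValueCount = length (filter isScalarValue [0 .. 0x10FFFF])
```

Under this definition `chrScalar 0x41` gives `Just 'A'`, `chrScalar 0xDC00` gives `Nothing`, and `scalarValueCount` is 1112064. Whether `Char` should be restricted to exactly this set is the question under discussion.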