WIP: Make `Char` & literals Unicode-correct by construction
Needs Revision · Public

Authored by hvr on Aug 3 2017, 7:20 AM.

Details

Reviewers
bgamari
austin
Summary

Not everybody will be happy about giving up convenience for correctness as this breaks the terrible terrible PEP383-hack...

related: https://github.com/well-typed/cborg/issues/135#issuecomment-319923107

Test Plan

Breaks TEST="decodingerror002 encoding002 encoding004 encoding005" for good reason

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
master
Lint
Lint Warnings
Excuse: pre-existing condition
Severity  Location                        Code  Message
Warning   compiler/parser/Lexer.x:1661    TXT3  Line Too Long
Warning   compiler/parser/Lexer.x:1662    TXT3  Line Too Long
Warning   compiler/parser/Lexer.x:1663    TXT3  Line Too Long
Warning   libraries/base/GHC/Char.hs:20   TXT3  Line Too Long
Warning   libraries/base/GHC/Enum.hs:299  TXT3  Line Too Long
Warning   libraries/base/GHC/Enum.hs:303  TXT3  Line Too Long
Warning   libraries/base/GHC/Enum.hs:341  TXT3  Line Too Long
Warning   libraries/base/GHC/Enum.hs:347  TXT3  Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 17039
Build 31882: [GHC] Linux/amd64: Patch building
Build 31881: [GHC] OSX/amd64: Continuous Integration
Build 31880: [GHC] Windows/amd64: Continuous Integration
Build 31879: arc lint + arc unit
hvr created this revision. Aug 3 2017, 7:20 AM
hvr edited the summary of this revision. Aug 3 2017, 7:22 AM
bgamari edited edge metadata. Aug 18 2017, 7:36 AM
bgamari added a subscriber: simonpj.

While I agree that this would be a reasonable design if we started from scratch today, I'm a bit wary of changing it at this point. I honestly have no idea what sort of assumptions might break by introducing this change now. We would at the very least need to document the change loudly and add a note to infelicities.rst documenting the departure from the Report.

@JaffaCake, @simonpj what is your opinion on this?

Some background: I had a user of a serialisation library I work on complain that the library wasn't roundtripping Strings. It turned out the problem was that the strings weren't valid Unicode strings, containing surrogate code points which can't be represented by UTF-8. Sadly GHC currently lets users construct such string literals without even a warning and Enum, etc. will generate such invalid characters. This patch fixes that.
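
To make the problem concrete, here is a minimal sketch of the behaviour described above, assuming a GHC without this patch applied. The names isSurrogate, badLiteral, walksIntoSurrogates and isValidUnicodeString are made up for illustration and are not part of the patch or of base.

```haskell
-- Illustrative sketch only; none of these names come from the patch or base.

-- | True for code points in the surrogate range U+D800 .. U+DFFF.
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

-- | A string literal containing a lone surrogate. GHC without this patch
-- accepts it without a warning.
badLiteral :: String
badLiteral = "abc\xD800"

-- | The Enum instance steps from U+D7FF straight into the surrogate range.
walksIntoSurrogates :: Bool
walksIntoSurrogates = isSurrogate (succ '\xD7FF')

-- | Hypothetical check a serialisation library could run before encoding
-- a String as UTF-8.
isValidUnicodeString :: String -> Bool
isValidUnicodeString = not . any isSurrogate
```

Loaded into GHCi, isValidUnicodeString badLiteral evaluates to False, which is the kind of string that cannot round-trip through a UTF-8 encoder.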

bgamari requested changes to this revision. Aug 18 2017, 7:42 AM

Bumping out of review queue.

This revision now requires changes to proceed. Aug 18 2017, 7:42 AM
hvr added a comment. Aug 30 2017, 9:43 AM

We would at the very least need to document the change loudly and add a note to infelicities.rst documenting the departure from the Report.

Whether this would represent a departure from the Report depends on what "Unicode character" is supposed to mean. The Haskell Report doesn't use exact terminology here IMHO:

The character type Char is an enumeration whose values represent Unicode characters

The character type Char is an enumeration whose values represent Unicode (or equivalently ISO/IEC 10646) characters (see http://www.unicode.org/ for details).

This boils down to whether a single code point in the surrogate range (U+D800 to U+DFFF) represents a "Unicode character". Unicode talks about things like

  • Code points (U+0000 .. U+10FFFF)
  • Scalar values (all code points sans [U+D800 .. U+DFFF])
  • Code units (e.g. 8-bit for the UTF-8, 32-bit for the UTF-32 encoding form)
  • Graphemes ("What a user thinks of as a character", can be composed of one or more code points)
  • Glyphs (the font image)

with more or less well-defined meanings, but "character" itself seems to be an overloaded term with no exact definition.
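
The distinction between code points, scalar values and code units can be sketched in Haskell as follows. This is only an illustration of the terminology above, not code from the patch; isCodePoint, isScalarValue, toScalarChar and utf8CodeUnits are hypothetical names.

```haskell
import Data.Char (chr, ord)

-- | Any Unicode code point: U+0000 .. U+10FFFF.
isCodePoint :: Int -> Bool
isCodePoint n = n >= 0 && n <= 0x10FFFF

-- | A scalar value is a code point outside the surrogate range
-- U+D800 .. U+DFFF.
isScalarValue :: Int -> Bool
isScalarValue n = isCodePoint n && not (n >= 0xD800 && n <= 0xDFFF)

-- | Hypothetical smart constructor: only yields a Char for scalar values.
toScalarChar :: Int -> Maybe Char
toScalarChar n
  | isScalarValue n = Just (chr n)
  | otherwise       = Nothing

-- | Number of 8-bit code units UTF-8 needs for a Char, or Nothing for a
-- surrogate, which UTF-8 cannot represent.
utf8CodeUnits :: Char -> Maybe Int
utf8CodeUnits c
  | not (isScalarValue n) = Nothing
  | n < 0x80              = Just 1
  | n < 0x800             = Just 2
  | n < 0x10000           = Just 3
  | otherwise             = Just 4
  where
    n = ord c
```

Under this reading, making Char "Unicode-correct by construction" roughly means that every Char would satisfy isScalarValue . ord, so a function like utf8CodeUnits would never need its Nothing case.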

austin resigned from this revision. Nov 9 2017, 5:36 PM
lelf added a subscriber: lelf. Sat, Jul 14, 4:20 PM