Fix a bug in 'alexInputPrevChar'
Closed, Public

Authored by harpocrates on Oct 18 2017, 1:07 AM.

Details

Summary

The lexer hacks around unicode by squishing any character into a 'Word8'
and then storing the actual character in its state. This happens at
'alexGetByte'.

That is all well and good, but we ought to be careful that the characters we
retrieve via 'alexInputPrevChar' also fit this convention.

In fact, Trac #13986 exposes nicely what can go wrong: the regex in the left
context of the type application rule uses the '$idchar' character set
which relies on the unicode hack. However, a left context corresponds
to a call to 'alexInputPrevChar', and we end up passing full-blown
unicode characters to '$idchar', despite it not being equipped to deal
with these.
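
To make the mismatch concrete, here is a minimal sketch of the idea, not the
actual code in GHC's lexer. The 'Input' type, the helper name 'adjustChar',
and the placeholder byte values are assumptions made for illustration; only
'alexGetByte' and 'alexInputPrevChar' are names from the summary above. The
point is that both functions go through the same squishing step, so a left
context never sees a raw non-ASCII character.

```haskell
import Data.Char (GeneralCategory (..), chr, generalCategory, ord)
import Data.Word (Word8)

-- Hypothetical, simplified lexer input (the real lexer tracks a source
-- location and a string buffer instead).
data Input = Input
  { prevChar :: Char    -- last character consumed, stored un-squished
  , rest     :: String  -- input still to be lexed
  }

-- Squish any character into a 'Word8'. Printable ASCII maps to itself;
-- every non-ASCII character maps to a small placeholder byte standing for
-- its character class. The placeholder values here are illustrative only.
adjustChar :: Char -> Word8
adjustChar c = fromIntegral (ord adj)
  where
    adj
      | c <= '\x07' = nonGraphic
      | c <= '\x7f' = c
      | otherwise = case generalCategory c of
          UppercaseLetter -> upper
          LowercaseLetter -> lower
          DecimalNumber   -> digit
          _               -> other
    nonGraphic = '\x00'
    upper      = '\x01'
    lower      = '\x02'
    digit      = '\x03'
    other      = '\x04'

-- The generated lexer only ever sees the squished byte; the real
-- character is kept in the state as the new "previous character".
alexGetByte :: Input -> Maybe (Word8, Input)
alexGetByte (Input _ [])       = Nothing
alexGetByte (Input _ (c : cs)) = Just (adjustChar c, Input c cs)

-- Left contexts call this. Applying the same squishing here keeps the
-- result inside the range that byte-based character sets can handle.
alexInputPrevChar :: Input -> Char
alexInputPrevChar (Input pc _) = chr (fromIntegral (adjustChar pc))
```

Under these assumptions, a previous character such as a Greek lowercase
lambda is reported to the left context as the 'lower' placeholder rather than
as the raw code point, which is the convention the rest of the lexer rules
already rely on.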

Test Plan

Added a regression test case

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
harpocrates created this revision. Oct 18 2017, 1:07 AM

Hmm, tricky. Indeed this looks right, although it took a bit of staring to figure out what was going on.

@harpocrates, do you think you understand this code well enough to write a longer Note explaining what is happening here?

  • Add a Note explaining unicode handling in Alex
This revision is now accepted and ready to land. Oct 25 2017, 1:25 PM
This revision was automatically updated to reflect the committed changes.