13383 – UTF-8 occurring when it is not the expected charset can lock up terminal window

Status: RESOLVED: INVALID
Priority: Medium
Severity: minor
Product: Xfce4-terminal
Component: General

Comments

Description Mark Brader 2017-02-25 12:25:37 CET

Created attachment 7024 
Illustration of terminal window when locked up, and how it got there

When using xfce4-terminal, I normally work in an environment with the ISO
8859-1 (Latin-1) character encoding and therefore have Preferences ->
Advanced -> Encoding set to "Default (ISO-8859-1)".  Until recently I
was using Fedora 23 Linux with xfce4-terminal 0.6.3, but have recently
changed to Fedora 25 with xfce4-terminal 0.8.4, or to be exact,
Fedora RPM "xfce4-terminal-0.8.4-1.fc25.x86_64".

Although I prefer to work in Latin-1, I sometimes come across text that's
in UTF-8 (for example, in incoming email).  Naturally, I don't expect this
to display correctly, but I expect the terminal to keep working.  But since
I changed to Fedora 25, I find that sometimes it doesn't.  (And if I change
xfce4-terminal's settings so it expects UTF-8, then it works.  So on those
grounds I'm filing this as an xfce4-terminal bug.)

The really weird part is that the effect not only depends on what characters
are involved, but also on how they are being written to the terminal.

Please refer to the attached screenshot "shot.png".  The file "ouch"
is one line from an incoming email message, which contains UTF-8
directional single quotes.  If I "cat" the file in my usual environment,
it displays those characters unhelpfully but does not lock anything up.
If I use "Terminal -> Set Encoding" to change the encoding to UTF-8
and then "cat" the file, the directional quotes are there.

Now, with the encoding on UTF-8, I go into "script" and repeat the "cat"
command.  It works in the same way.

If I switch the encoding back to 8859-1 and "cat" the file again, it
displays as before.

All as expected.  But now look: I go into "script" and repeat the "cat"
command one more time, and this time my terminal window locks up.  The
green blob you see is my terminal cursor.

"Terminal -> Reset" does not unlock it.  "Terminal -> Clear Scrollback
and Reset" does not unlock it.

What DOES unlock it is if I open another window, use "ps" to find the PID
of the relevant process -- in this case, the "sh -i" command invoked by
"script", and kill that process.  If I do that, the output resumes as if
nothing happened.

Besides "script", the other command I've found that leads to the same
failure mode is my usual mail reader -- a version of mailx running on
a BSD UNIX machine.  Which, of course, is how I found the problem.

One more piece of information: apparently it's not the full UTF-8 byte
sequence that causes this effect, but specifically the octal 230 (hex 98)
byte.  In 8859-1 this byte is considered a control character but I don't
know how it's supposed to be used.  So maybe this behavior is a feature,
not a bug... but if so, I can't imagine how it could be useful.  As I say,
in my normal environment that character never used to do anything special.
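
To make the byte-level point concrete, here is a minimal Python sketch. It assumes the directional quotes in "ouch" are U+2018 and U+2019, which is inferred from the hex 98 and hex 99 bytes mentioned in this report rather than confirmed from the file itself. The helper name c1_bytes is made up for this illustration.

```python
# Assumption: the directional single quotes are U+2018 and U+2019.
LEFT_QUOTE = "\u2018"
RIGHT_QUOTE = "\u2019"


def c1_bytes(text):
    """Return the UTF-8 bytes of `text` that fall in the C1 range 0x80-0x9F.

    A terminal set to ISO 8859-1 sees each byte on its own, so these are
    the bytes it may treat as control characters instead of printing them.
    """
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


for ch in (LEFT_QUOTE, RIGHT_QUOTE):
    raw = ch.encode("utf-8")
    flagged = [hex(b) for b in c1_bytes(ch)]
    print(f"U+{ord(ch):04X} -> {raw.hex(' ')}  C1 bytes: {flagged}")
```

Under that assumption, the left quote encodes to e2 80 98 and the right quote to e2 80 99, so each quote carries two bytes in the C1 range, including the hex 98 byte singled out above.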

Comment 1 Egmont Koblinger 2017-02-25 13:08:54 CET

Is this by any chance the same as
https://bugzilla.gnome.org/show_bug.cgi?id=777733
https://bugzilla.gnome.org/show_bug.cgi?id=737792
?

Comment 2 Igor 2017-02-26 15:40:38 CET

Egmont, thanks for stepping in! To me this one really seems related to vte.

Mark, do you think bugs that Egmont has mentioned are similar to yours?

Comment 3 Mark Brader 2017-03-02 10:31:02 CET

Looking at https://bugzilla.gnome.org/show_bug.cgi?id=777733, it's clearly
related.  Two similar features stand out: [1] it was apparently triggered
by a single C1 control character (though in his example this was hex 90;
in mine it was 98) and [2] terminating the process that was outputting
the characters ended the problem.

There was mention of trying different terminal emulators.  I didn't have
any others on my machine, but just now I tried installing xterm.  I first
tried just starting it with "LC_ALL=en_US.iso88591 xterm" and tried the
same example as in my illustration, and it did not lock up, but it also
omitted the words "Or So It Seemed" that were between the hex 98 and hex 99
characters!  I then used the Shift-Right-Click menu to set "UTF-8 encoding"
and the file displayed properly with directional quotes.  But I then used
Shift-Right-Click again and turned *off* "UTF-8 encoding", and now the file
displayed the way it did in xfce4-terminal!

Anyway, I could not make xterm lock up by catting that file inside "script",
but I *could* by making an ssh connection to the UNIX machine and running
my mail reader.

So in short, xterm behaves differently from xfce4-terminal, but neither one
seems to handle this situation correctly.

Comment 4 Mark Brader 2017-03-02 10:32:00 CET

As for https://bugzilla.gnome.org/show_bug.cgi?id=737792, I downloaded the
file "localtime" and tried "cat localtime" inside my xfce4-terminal -- and
it did not lock up.  I then started "script" and did "cat localtime" again
-- and this time it *did* lock up.  As before, killing the "sh -i" that
script had started was sufficient to unlock things.  But this time, changing
the encoding in xfce4-terminal between 8859-1 and UTF-8 had no effect on
the behavior.

I find this all mystifying; I know nothing about what goes on behind the
scenes in terminal emulators.

Comment 5 Egmont Koblinger 2017-03-02 11:03:37 CET

I've also taken a closer look at your issue since I made my previous comment, and I'm sure these are essentially the same.

The exact details are indeed confusing, to the extent that I myself don't know exactly how xterm and vte behave in all these possible circumstances and configurations; I would have to look it up or experiment.

The core of the problem, in your case, is that you mix two encodings, and something that is a printable character in one is a control character in the other, waiting for its terminating sequence. As such, technically ...

> but I expect the terminal to keep working

... the terminal emulator keeps working, it has just entered a special mode.

What I believe you actually expect is for the emulator not to switch to this special mode, and this is a false expectation from you if the emulator actually encounters these bytes.
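
The "special mode" described here can be pictured with a toy state machine. The Python sketch below is not VTE's or xterm's parser. It assumes an 8-bit (Latin-1) terminal, takes the C1 "string opener" and terminator codes from ECMA-48, and treats ESC as ending the string only because that is what is reported later in this thread. The function name visible_output is hypothetical.

```python
ESC = 0x1B
ST = 0x9C  # C1 "String Terminator"
# C1 controls that open a string: DCS, SOS, OSC, PM, APC.
STRING_OPENERS = {0x90, 0x98, 0x9D, 0x9E, 0x9F}


def visible_output(data):
    """Toy model: return only the bytes that would reach the screen."""
    shown = bytearray()
    in_string = False
    for b in data:
        if in_string:
            # Inside a control string everything is swallowed until a
            # terminator arrives. Simplification: ESC is just dropped here,
            # whereas a real emulator would go on to parse the escape sequence.
            if b == ST or b == ESC:
                in_string = False
            continue
        if b in STRING_OPENERS:
            in_string = True
        else:
            shown.append(b)
    return bytes(shown)


print(visible_output(b"before \x98 after"))  # b'before '
print(visible_output(b"a\x98lost\x9cb"))     # b'ab'
```

In this model a lone hex 98 makes all later output disappear, which looks like a lock-up even though the parser is simply waiting for a terminator.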

VTE could add an API for disallowing C1 control characters, and then frontends (incl. xfce4-terminal) could make a checkbox for it. However VTE tends to be an emulator that does not implement and expose such kinds of various possible behaviors, unless really-really required. Probably xterm has such an option. You might have success with layers like screen or tmux as well, not sure.

> Although I prefer to work in Latin-1

The Linux world switched to UTF-8 as the default, for good reason, about 10 years ago. Some new terminal emulators don't even support other encodings. Of course you're free to go against the wind, but it doesn't sound like a wise decision to me (without knowing your environment, of course), and I would recommend that you switch to UTF-8 as soon as possible.

> I sometimes come across text that's in UTF-8 (for example, in incoming email)
> [...] a version of mailx running on a BSD UNIX machine

Email clients are responsible for decoding the mail according to its mime type (character set, transfer encoding etc.) and converting to the terminal emulator's charset, so you end up seeing the proper symbols as intended by the sender of the email. Such apps are responsible for not sending out bytes that screw with the terminal emulator. I'm not aware of mailx, maybe that's an ancient piece of crap^H^H^H^Hsoftware not knowing how to handle various character sets.

Of course low-level debugging like grepping is a different story, but that shouldn't be the normal usage.

> I sometimes come across text that's in UTF-8 (for example, in incoming email).
> Naturally, I don't expect this to display correctly

This is soooo wrong! You _should_ expect all your emails to display correctly, no matter what charset they are encoded in. If your environment cannot do this, you should upgrade or switch to a modern one that can do it for you. (And having a UTF-8 terminal emulator is a prerequisite for this if you actually care about seeing out-of-latin1 glyphs.)

> So in short, xterm behaves differently from xfce4-terminal, but neither one
> seems to handle this situation correctly.

Not sure what you expect as the correct behavior. Indeed, xterm differs from VTE and Konsole; apparently even the authors of these disagree on the desired behavior, especially with UTF-8 and C1.

Comment 6 Mark Brader 2017-03-03 10:50:30 CET

First, I've suddenly realized why the behavior of "cat ouch" in the
shell is different from "cat ouch" inside "script" -- this is one of
the things that was really bothering me.  It's because, when this
bad state occurs, it goes away when the process that wrote the C1
control character terminates.  In the one case, that process is "cat",
so it terminates almost instantly; in the other case, it's the shell
started by "script".

> The core of the problem, in your case, is that you mix two encodings,
> and something that is a printable character in one is a control character
> in the other, waiting for its terminating sequence.

As I see it, mixing two encodings is the way that this accident happened
*this* time, and is the reason why it depended on the encoding that I set
in xfce4-terminal.  But the same accident could still happen at any time
that binary output was accidentally directed to the terminal -- as in the
"cat localtime" case.  If I had been working in UTF-8, the binary output
might still contain a C1 control character in UTF-8.


> ... the terminal emulator keeps working, it has just entered a special mode.

> What I believe you actually expect is for the emulator not to switch
> to this special mode, and this is a false expectation from you if the
> emulator actually encounters these bytes.

I would be satisfied if resetting my xfce4-terminal would switch the
emulator *out* of the special mode, so I can unstick it without having
to kill processes.  Is this also an unrealistic expectation?  (If so,
I think we're done here.)

> VTE could add an API for disallowing C1 control characters, and
> then frontends (incl. xfce4-terminal) could make a checkbox for
> it. However VTE tends to be an emulator that does not implement
> and expose such kinds of various possible behaviors, unless
> really-really required.

Is VTE part of xfce or is it someone else's project that xfce4-terminal
depends on?

> Probably xterm has such an option.

In fact the Control-Left-Click menu in xterm has an option "8-bit Controls".
I have not explored what it does.

> I'm not aware of mailx, maybe that's an ancient piece of crap...

It is indeed, but I've been using it for enough decades that I stay
and put up with the nuisances.

Thanks for your other comments.

Comment 7 Igor 2017-03-03 10:53:34 CET

(In reply to Mark Brader from comment #6)
> Is VTE part of xfce or is it someone else's project that xfce4-terminal
> depends on?

VTE is part of GNOME. It's a terminal widget being used by multiple terminal apps, such as gnome-terminal, xfce4-terminal, terminix, and others.

Comment 8 Egmont Koblinger 2017-03-03 11:25:38 CET

(In reply to Mark Brader from comment #6)

> But the same accident could still happen at any time
> that binary output was accidentally directed to the terminal -- as in the
> "cat localtime" case.  If I had been working in UTF-8, the binary output
> might still contain a C1 control character in UTF-8.

Yup, and it can also happen with the standard 7-bit C0 control characters. That's how terminal emulators have always worked from the very beginning. They are heavily stateful.

> I would be satisfied if resetting my xfce4-terminal would switch the
> emulator *out* of the special mode, so I can unstick it without having
> to kill processes.  Is this also an unrealistic expectation?  (If so,
> I think we're done here.)

I think it's a pretty reasonable feature request to discuss. Not sure how easily implementable, and might have downsides as well, but definitely worth investigating. Filed as https://bugzilla.gnome.org/show_bug.cgi?id=779518.

In the meantime, I recommend as a workaround to have a shell prompt that contains some (harmless) escape sequence at its beginning, since the escape character seems to terminate these sequences and hence get the terminal "unstuck".
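
One way to build such a prompt is sketched below in Python. It assumes the shell is bash, which the thread does not state, and it picks SGR 0 ("reset attributes") as the harmless sequence. The names HARMLESS_PREFIX and unsticking_ps1 are hypothetical.

```python
# SGR 0 ("reset attributes") in bash prompt notation. \e stands for ESC, and
# \[ ... \] tells bash that the enclosed characters take up no screen columns.
HARMLESS_PREFIX = r"\[\e[0m\]"


def unsticking_ps1(prompt=r"\$ "):
    """Return a PS1 value that starts with a harmless escape sequence."""
    return HARMLESS_PREFIX + prompt


# Emit a line that could be pasted into a bash startup file.
print(f"PS1='{unsticking_ps1()}'")
```

The resulting prompt sends an ESC each time it is drawn, so a stray string-opening control character from the previous command is terminated as soon as the shell prompts again.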

Comment 9 Mark Brader 2017-03-03 13:40:32 CET

> Yup, and it can also happen with the standard 7-bit C0 control
> characters. That's how terminal emulators have always worked from the
> very beginning. They are heavily stateful.

C0 control characters sent unexpectedly *to* the terminal may send the
cursor to unexpected places, and escape sequences may do all sorts of
things, but locking it up isn't one of them as far as I know, and whether
it is or not, in my experience resetting the terminal emulator clears any
wrong state.

> Filed as https://bugzilla.gnome.org/show_bug.cgi?id=779518.

Thanks.

> In the meantime, I recommend as a workaround to have a shell prompt
> that contains some (harmless) escape sequence at its beginning, since
> the escape character seems to terminate these sequences and hence get
> the terminal "unstuck".

Huh, so it does.  In fact my usual shell prompt *does* contain an escape
sequence, for a color change, but I normally use a different prompt when
inside a subshell, as in "script", and I turned off the colors when
generating the example above.  Using the colored prompt inside "script"
means that "cat ouch" no longer gets hung.  Thanks again.

(And I find that mailx has an option to set the prompt, but, sadly, the
version on the UNIX machine apparently strips out escape characters for
my own protection.  So I can't use the same trick there.  Oh well.)

Comment 10 Egmont Koblinger 2017-03-03 14:13:31 CET

(In reply to Mark Brader from comment #9)

> C0 control characters sent unexpectedly *to* the terminal may send the
> cursor to unexpected places, and escape sequences may do all sorts of
> things, but locking it up isn't one of them as far as I know, and whether
> it is or not, in my experience resetting the terminal emulator clears any
> wrong state.

Every C1 has a C0 equivalent (it's not true the other way around). 0x80 (or, more precisely, in VTE's and Konsole's UTF-8 mode U+0080 encoded in UTF-8 as two bytes) is the same as ESC @, 0x81 is the same as ESC A etc. 
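
The correspondence described above is simple arithmetic: subtract 0x40 from the C1 byte and prefix ESC. A small Python sketch, with the hypothetical helper name c1_to_escape, is:

```python
ESC = 0x1B


def c1_to_escape(byte):
    """Map an 8-bit C1 control (0x80-0x9F) to its 7-bit ESC-prefixed form."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError(f"0x{byte:02X} is not a C1 control character")
    return bytes([ESC, byte - 0x40])


for b in (0x80, 0x81, 0x90, 0x98):
    seq = c1_to_escape(b)
    utf8 = chr(b).encode("utf-8").hex(" ")
    print(f"0x{b:02X} == ESC {chr(seq[1])}   (U+{b:04X} in UTF-8: {utf8})")
```

For the two bytes discussed in these bug reports, hex 90 maps to ESC P and hex 98 maps to ESC X. The last column shows the two-byte UTF-8 form, for example c2 80 for U+0080.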

You keep talking about "locking" and "wrong state", nope, technically nothing is locked or is in a wrong state. Things are in a perfectly valid state, waiting for further input as the parameter to an escape sequence. I understand this sucks from the user's point of view.

Resetting shouldn't differentiate between C0 and C1 escape sequences. If one "locks" as you call it, so should the other, and vice versa.

> and I turned off the colors when generating the example above

This might have changed the behavior ;)

Comment 11 Igor 2018-03-27 20:12:48 CEST

VTE bug (https://bugzilla.gnome.org/show_bug.cgi?id=779518) has been resolved upstream.

Reported by: Mark Brader
Reported on: 2017-02-25
Last modified on: 2018-03-27
Assignee: Igor
CC List: 2 users
Version: 0.8.4
Target Milestone: Future

Attachments

Illustration of terminal window when locked up, and how it got there (12.16 KB, image/png)
2017-02-25 12:25 CET, Mark Brader