! Please note that this is a snapshot of our old Bugzilla server, which is read only since May 29, 2020. Please go to gitlab.xfce.org for our new server !
Deadlock during loading of folder contents
Status:
RESOLVED: FIXED

Comments

Description Markus Elfring 2017-12-26 17:37:34 CET
I stumbled on the following software situation once more on 2017-12-24. The windows of this file manager did not respond any more as expected. Their status lines displayed the message “Ordnerinhalt wird geladen …” (Loading folder contents...).

elfring@Sonne:~> ps -ef|grep Thunar
elfring   3346  3336  0 10:49 tty2     00:00:02 Thunar --sm-client-id 2e621412a-b89a-4db2-9ba2-bd3cc0e1d1ce --daemon
…
elfring@Sonne:~> strace -p 3346
…
futex(0x55df5419db28, FUTEX_WAIT_PRIVATE, 2, NULL
elfring@Sonne:~> gdb -p 3346
…
(gdb) bt
#0  0x00007fbafaec7fd9 in syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fbafb43febc in g_mutex_lock_slowpath (mutex=mutex@entry=0x55df5419db28) at gthread-posix.c:1313
#2  0x00007fbafb440702 in g_mutex_lock (mutex=mutex@entry=0x55df5419db28) at gthread-posix.c:1337
#3  0x000055df5230aa49 in thunar_thumbnailer_queue_async_reply (proxy=0x55df53804550, handle=2, error=0x0, user_data=0x55df53eb63a0) at thunar-thumbnailer.c:617
#4  0x000055df5230a63e in thunar_thumbnailer_proxy_queue_async_callback (proxy=0x55df53804550, call=<optimized out>, user_data=0x55df5418e790) at ../thunar/thunar-thumbnailer-proxy.h:36
#5  0x00007fbafbcc98a2 in complete_pending_call_and_unlock (connection=connection@entry=0x55df534fa1e0, pending=0x55df543e4da0, message=message@entry=0x55df543e4310) at dbus-connection.c:2331
#6  0x00007fbafbccd21f in dbus_connection_dispatch (connection=connection@entry=0x55df534fa1e0) at dbus-connection.c:4652
#7  0x00007fbafbf124f5 in message_queue_dispatch (source=source@entry=0x55df534fc090, callback=<optimized out>, user_data=<optimized out>) at dbus-gmain.c:90
#8  0x00007fbafb3faf97 in g_main_dispatch (context=0x55df534818b0) at gmain.c:3148
#9  0x00007fbafb3faf97 in g_main_context_dispatch (context=context@entry=0x55df534818b0) at gmain.c:3813
#10 0x00007fbafb3fb1d0 in g_main_context_iterate (context=0x55df534818b0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3886
#11 0x00007fbafb3fb4e2 in g_main_loop_run (loop=0x55df54142fc0) at gmain.c:4082
#12 0x00007fbafd308ad7 in gtk_main () at /usr/lib64/libgtk-x11-2.0.so.0
#13 0x000055df522b3799 in main (argc=<optimized out>, argv=<optimized out>) at main.c:312

The directory contents were correctly displayed when I started another instance of the application “thunar 1.6.13-124.1” on my openSUSE Tumbleweed system.

But a software component got stuck after my login for this XFCE session.


How often can it happen that the call of the macro “_thumbnailer_lock” will be reached in the implementation of the function “thunar_thumbnailer_queue_async_reply” before the corresponding statement “_thumbnailer_unlock” will be completely executed?
https://git.xfce.org/xfce/thunar/tree/thunar/thunar-thumbnailer.c?id=89de3451891c36b1b86cb626de820d11e0f4bd9e#n265
Comment 1 afdw 2020-05-08 01:57:26 CEST
I think I have finally understood the problem here.

I suppose is it a use after free. So here is what I think happens:
1. The Queue method of Tumbler is called: https://gitlab.xfce.org/xfce/thunar/-/blob/112a0753c90a0fe5259f540cf0d6b5fc262bca82/thunar/thunar-thumbnailer.c#L475
2. It returns 0, which is wrong
3. This causes the `job->handle` field to remain at 0: https://gitlab.xfce.org/xfce/thunar/-/blob/112a0753c90a0fe5259f540cf0d6b5fc262bca82/thunar/thunar-thumbnailer.c#L350
4. Next, some other job starts, but does not receive a handle yet. This results in 2 jobs having their `job->handle` field set to 0
5. During this time, the job with handle 0 finishes executing, so it gets deallocated (`handle` function argument here is 0): https://gitlab.xfce.org/xfce/thunar/-/blob/112a0753c90a0fe5259f540cf0d6b5fc262bca82/thunar/thunar-thumbnailer.c#L873 But instead of real job with the handle 0, the one started on step 4 is found (one without a handle assigned), so it gets deallocated
6. Enqueuing of the second job finally finishes:  https://gitlab.xfce.org/xfce/thunar/-/blob/112a0753c90a0fe5259f540cf0d6b5fc262bca82/thunar/thunar-thumbnailer.c#L327 But this job is already deallocated on step 4
7. So already freed memory is accessed and everything goes wrong
I am not sure if these steps are exactly correct.

Here is how it looks in the debugger: after setting these breakpoints in GDB (line numbers as of Thunar commit 112a0753c90a0fe5259f540cf0d6b5fc262bca82):
> break thunar-thumbnailer.c:330
> command
> silent
> printf "thunar_thumbnailer_queue_async_reply: %p\n", job
> continue
> end
> break thunar-thumbnailer.c:865
> command
> silent
> printf "thunar_thumbnailer_thumbnailer_finished: %p %d\n", job, job->handle
> continue
> end
> break thunar-thumbnailer.c:306
> command
> silent
> printf "thunar_thumbnailer_free_job: %p\n", job
> continue
> end
This is a possible output:
> thunar_thumbnailer_thumbnailer_finished: 0x555555a12e40 0
> thunar_thumbnailer_free_job: 0x555555a12e40
> thunar_thumbnailer_queue_async_reply: 0x555555a12e40
> 
> (thunar:1339130): GLib-GObject-WARNING **: 00:09:57.296: invalid uninstantiatable type '(null)' in cast to 'ThunarThumbnailer'
> 
> (thunar:1339130): thunar-CRITICAL **: 00:09:57.296: thunar_thumbnailer_queue_async_reply: assertion '(((__extension__ ({ GTypeInstance *__inst = (GTypeInstance*) ((thumbnailer)); GType __t = ((thunar_thumbnailer_get_type ())); gboolean __r; if (!__inst) __r = (0); else if (__inst->g_class && __inst->g_class->g_type == __t) __r = (!(0)); else __r = g_type_check_instance_is_a (__inst, __t); __r; }))))' failed

I am not really sure not to reproduce the issue, but what I did was repetitively navigating between folders with many files in Thunar and terminating Tumbler; this way the chance of reproducing was fairly high.

I think that just making Tumbler follow the specification should be enough to fix this issue, so here is the bug report (fix included): https://bugzilla.xfce.org/show_bug.cgi?id=16814

Hopefully, all of this makes some sense.
Comment 2 Markus Elfring 2020-05-08 09:15:22 CEST
(In reply to afdw from comment #1)
> 2. It returns 0, which is wrong

Thanks for your discovery.


> 4. Next, some other job starts, but does not receive a handle yet.

I interpreted the implementation of the function “thunar_thumbnailer_queue_async_reply” in the way that there is a specific lock type applied.
I imagined that a deadlock can only happen then if this function is executed once more before an unlock would be performed.
* Does the aspect “reentrancy” matter here?
* Is there a need to use recursive locks?

Is a job queue management relevant for the affected use case?
Comment 3 afdw 2020-05-08 10:09:47 CEST
(In reply to Markus Elfring from comment #2)
> (In reply to afdw from comment #1)
> > 4. Next, some other job starts, but does not receive a handle yet.
> 
> I interpreted the implementation of the function
> “thunar_thumbnailer_queue_async_reply” in the way that there is a specific
> lock type applied.
> I imagined that a deadlock can only happen then if this function is executed
> once more before an unlock would be performed.
> * Does the aspect “reentrancy” matter here?
> * Is there a need to use recursive locks?
I do not think that locking is really the problem here. https://gitlab.xfce.org/xfce/thunar/-/blob/112a0753c90a0fe5259f540cf0d6b5fc262bca82/thunar/thunar-thumbnailer.c#L317-334 :
> static void
> thunar_thumbnailer_queue_async_reply (GObject      *proxy,
>                                       GAsyncResult *res,
>                                       gpointer      user_data)
> {
>   ThunarThumbnailerJob *job = user_data;
>   ThunarThumbnailer    *thumbnailer;
>   GError               *error = NULL;
>   guint                 handle;
> 
>   _thunar_return_if_fail (THUNAR_IS_THUMBNAILER_DBUS (proxy));
>   _thunar_return_if_fail (job != NULL);
> 
>   thumbnailer = THUNAR_THUMBNAILER (job->thumbnailer);
> 
>   _thunar_return_if_fail (THUNAR_IS_THUMBNAILER (thumbnailer));
> 
>   _thumbnailer_lock (thumbnailer);
But the time we got to the `_thumbnailer_lock (thumbnailer);` line, the use after free has already occurred, because the `user_data` is might be a dangling pointer by now, so `thumbnailer = THUNAR_THUMBNAILER (job->thumbnailer);`, which means accessing `((ThunarThumbnailerJob *) user_data)->thumbnailer`, is already wrong.

Here is how this pointer behaves in the debugger (trying to get the type name and trying to inspect data behind it):
> (gdb) p g_type_name(((GTypeInstance *) job)->g_class->g_type)
> $1 = (const gchar *) 0x0
> (gdb) p *((ThunarThumbnailerJob *) user_data)
> $2 = {thumbnailer = 0x7ffff7eb09c0, cancelled = 0, lazy_checks = 1,
> files = 0x555555c01700 = {0x8, <error reading variable>
Compare that to a good pointer (other argument of this function):
> (gdb) p g_type_name(((GTypeInstance *) res)->g_class->g_type)
> $3 = (const gchar *) 0x7ffff73f1279 "GTask"
> (gdb) p *((GTask *) res)
> $4 = {parent_instance = {g_type_instance = {g_class = 0x55555567fd90 [g_type: None]},
> ref_count = 2, qdata = 0x0}, source_object = 0x55555585ce20,
> source_tag = 0x7ffff730e8b0 <g_dbus_proxy_call_internal>, name = 0x0, task_data = 0x0,
> task_data_destroy = 0x0, context = 0x55555568c800, creation_time = 341128963267,
> priority = 0, cancellable = 0x0, callback = 0x5555555f76f0 <thunar_thumbnailer_queue_async_reply>,
> callback_data = 0x5555559ffae0, task_func = 0x0, lock = {p = 0x0, i = {0, 0}},
> cond = {p = 0x0, i = {0, 0}}, thread_cancelled = 0, thread_complete = 0, return_on_cancel = 0,
> completed = 0, had_error = 0, result_set = -1, ever_returned = -1, check_cancellable = -1,
> synchronous = 0, blocking_other_task = 0, error = 0x0, result = {pointer = 0x555555a4cec0,
> size = 93824997445312, boolean = 1436864192}, result_destroy = 0x7ffff7306dc0 <reply_data_free>}
I have seen 3 outcomes of the bug occurring randomly:
1. Just warning in the output
2. Freeze while loading folder contents
3. Segmentation fault
So problems with locking can occur (as this code can lock on some random data from memory), but I think it is not the root cause of the issue.
Comment 4 Theo Linkspfeifer editbugs 2020-05-08 10:57:31 CEST
Is bug 15211 related to this issue?
Comment 5 Markus Elfring 2020-05-08 11:00:32 CEST
(In reply to afdw from comment #3)
> So problems with locking can occur (as this code can lock on some random
> data from memory), but I think it is not the root cause of the issue.
Is this software designed in the way that multiple jobs (like generation of thumbnails) can be executed in parallel?
Comment 6 afdw 2020-05-08 11:14:37 CEST
(In reply to Theo Linkspfeifer from comment #4)
> Is bug 15211 related to this issue?

Most likely this one and https://bugs.launchpad.net/ubuntu/+source/thunar/+bug/1832992 are exactly the same!
Comment 7 afdw 2020-05-08 11:18:38 CEST
(In reply to Markus Elfring from comment #5)
> (In reply to afdw from comment #3)
> > So problems with locking can occur (as this code can lock on some random
> > data from memory), but I think it is not the root cause of the issue.
> Is this software designed in the way that multiple jobs (like generation of
> thumbnails) can be executed in parallel?

Yes, I think it does. In fact, the generation of thumbnails is delegated with asynchronous requests to a DBus service that can have multiple implementations. One from Xfce is Tumbler, and I think the root cause of the problem lies in it.
Comment 8 Markus Elfring 2020-05-08 12:54:23 CEST
(In reply to afdw from comment #7)
Does the file manager check for questionable job handles (according to involved services)?
Comment 9 afdw 2020-05-08 13:10:11 CEST
(In reply to Markus Elfring from comment #8)
> (In reply to afdw from comment #7)
> Does the file manager check for questionable job handles (according to
> involved services)?

No, I do not think it does, it just trusts that thumbnailing service works as intended and that incoming handles are correct. Maybe adding more checks is not a bad idea?
Comment 10 alexxcons editbugs 2020-05-09 00:08:00 CEST
Reading the comments on Bug #16814, I guess this Bug can be closed as a duplicate ?

(In reply to afdw from comment #9)
> Maybe adding more checks is not a bad idea?
If you have an idea on what best to check to stay safe, patches would be very welcome !
Comment 11 afdw 2020-05-10 04:05:24 CEST
(In reply to alexxcons from comment #10)
> Reading the comments on Bug #16814, I guess this Bug can be closed as a
> duplicate ?

They are not exactly duplicates of each other. There are 2 problems here:
1. Tumbler returns 0 handles
2. Thunar does not check for them, which leads to undefined behavior

> (In reply to afdw from comment #9)
> > Maybe adding more checks is not a bad idea?
> If you have an idea on what best to check to stay safe, patches would be
> very welcome !

Merge Request: https://gitlab.xfce.org/xfce/thunar/-/merge_requests/4
Comment 12 alexxcons editbugs 2020-05-10 11:27:21 CEST
(In reply to afdw from comment #11)
> Merge Request: https://gitlab.xfce.org/xfce/thunar/-/merge_requests/4

Thanks alot, much appreciated  !
Comment 13 Git Bot editbugs 2020-05-10 13:06:40 CEST
Anton Danilkin referenced this bugreport in commit 98394ef960884af83b801824757a9a69b9bd8da0

Add checks for 0 handles (Bug #14122)

https://gitlab.xfce.org/xfce/thunar/commit/98394ef960884af83b801824757a9a69b9bd8da0
Comment 14 Git Bot editbugs 2020-05-10 13:12:41 CEST
Anton Danilkin referenced this bugreport in commit 480c0663af64753131182a37c7896988586b607c

Add checks for 0 handles (Bug #14122)

https://gitlab.xfce.org/xfce/thunar/commit/480c0663af64753131182a37c7896988586b607c
Comment 15 alexxcons editbugs 2020-05-10 13:14:54 CEST
Pushed to master and 4.14 branch.
To be released in 1.9.0 and 1.8.15

So I gues the thunar part is fine now.

Thank alot for working on it, afdw !
Comment 16 Theo Linkspfeifer editbugs 2020-05-13 11:25:36 CEST
When I kill tumblerd and navigate to a folder containing an image file, I get this log output:

ThunarThumbnailer: got 0 handle (Queue)
ThunarThumbnailer: got 0 handle (Error or Ready)
ThunarThumbnailer: got 0 handle (Finished)

Is this supposed be written to the session log every time tumblerd is (re)started?
Comment 17 alexxcons editbugs 2020-05-13 14:35:14 CEST
(In reply to Theo Linkspfeifer from comment #16)
> When I kill tumblerd and navigate to a folder containing an image file, I
> get this log output:
> 
> ThunarThumbnailer: got 0 handle (Queue)
> ThunarThumbnailer: got 0 handle (Error or Ready)
> ThunarThumbnailer: got 0 handle (Finished)
> 
> Is this supposed be written to the session log every time tumblerd is
> (re)started?
Yes .. though isnt the text prefixed with "Warning" or "Error" ?  Guess we should do that.

These messges should vanish after tumbler got patched: Bug #16814
https://bugzilla.xfce.org/show_bug.cgi?id=16814

Bug #14122

Reported by:
Markus Elfring
Reported on: 2017-12-26
Last modified on: 2020-05-13

People

Assignee:
Xfce Bug Triage
CC List:
5 users

Version

Version:
1.6.13

Attachments

Additional information