Thread Rating:
  • 0 Votes - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
ohNet hang after TimerManager thread crashes
12-04-2013, 07:35 PM
Post: #1
ohNet hang after TimerManager thread crashes
I've been trying to debug an ohNet hang. I've got as far as establishing that it was caused by the TimerManager thread having crashed while holding the TMCB mutex. This causes all other threads to block when they do a Timer::Cancel call (at line 72 of Timer.cpp). The native thread dump shows 5 of these blocked threads. This is running on Linux ARM.

I have a native thread dump with line numbers, a Java thread dump, and a partial trace log (with all log messages except Timer and SsdpMulticast). Which of these would be useful to you for problem determination? The trace log is large (about 50MB).
Find all posts by this user
12-04-2013, 09:55 PM
Post: #2
RE: ohNet hang after TimerManager thread crashes
(12-04-2013 07:35 PM)simoncn Wrote:  I've been trying to debug an ohNet hang. I've got as far as establishing that it was caused by the TimerManager thread having crashed while holding the TMCB mutex. This causes all other threads to block when they do a Timer::Cancel call (at line 72 of Timer.cpp). The native thread dump shows 5 of these blocked threads. This is running on Linux ARM.

I have a native thread dump with line numbers, a Java thread dump, and a partial trace log (with all log messages except Timer and SsdpMulticast). Which of these would be useful to you for problem determination? The trace log is large (about 50MB).

I'm surprised you're seeing a hang that follows a crash - wouldn't a crash abort the process, taking out all the other threads that later hang inside Timer::Cancel? I could imagine other threads hanging if a timer callback threw an exception that caused TimerManager::Fire to exit without calling CallbackUnlock but I'm not sure how this would happen without Thread::EntryPoint catching the exception and aborting the app.

Any more info about the crash would be very useful in helping me understand better. I imagine the native thread dump would be interesting too. There might be more questions to follow later Smile
Find all posts by this user
12-04-2013, 11:07 PM (This post was last modified: 12-04-2013 11:08 PM by simoncn.)
Post: #3
RE: ohNet hang after TimerManager thread crashes
(12-04-2013 09:55 PM)simonc Wrote:  I'm surprised you're seeing a hang that follows a crash - wouldn't a crash abort the process, taking out all the other threads that later hang inside Timer::Cancel? I could imagine other threads hanging if a timer callback threw an exception that caused TimerManager::Fire to exit without calling CallbackUnlock but I'm not sure how this would happen without Thread::EntryPoint catching the exception and aborting the app.

Any more info about the crash would be very useful in helping me understand better. I imagine the native thread dump would be interesting too. There might be more questions to follow later Smile

Perhaps 'crash' is the wrong term. What I meant is that the TimerManager thread is no longer running, and none of the running threads is holding the TMCB mutex. I presume this means something bad happened to the TimerManager thread while it was holding this mutex, and it exited without releasing the mutex and without terminating the whole process.

I'm attaching the native thread dump and the last 5000 lines of the trace log. I'm also attaching the Java thread dump, for reasons explained below.

Important: The Java thread numbers in the trace log do not match the thread numbers in the native thread dump. (This confused me greatly at first.) To see which Java threads match which native threads, you need to use the Java thread dump to do the translation.

This is best explained with an example. The last line of the trace log is:
11:44:44.080 Thread-7: ohNet: OpenHome/Network.cpp:255: >Socket::Receive H = 4479096, MAX = 65536

In the Java thread dump, the matching thread is:
"Thread-7" daemon prio=10 tid=0x00033c00 nid=0x1f52 runnable [0x00000000]

You need to take the nid value of this thread (0x1f52) and translate this to decimal (8018). Now you can find the corresponding thread in the native thread dump by looking for LWP 8018, which is Thread 14:

Thread 14 (Thread 0x4f397490 (LWP 8018)):
#0 0x4002e1b4 in pthread_mutex_lock () from /share/HDA_DATA/.qpkg/MinimServer/libsys/libpthread.so.0
#1 0x4ef76428 in OsMutexLock (aMutex=0xfffffe00) at Os/Posix/Os.c:438
#2 0x4ef6b17c in OpenHome::Mutex::Wait (this=0x1af334) at Build/Include/OpenHome/OsWrapper.inl:53
#3 0x4ef6bce0 in OpenHome::Timer::Cancel (this=0x0) at OpenHome/Timer.cpp:72
#4 0x4ef6bd60 in ~Timer (this=0xfffffe00) at OpenHome/Timer.cpp:93
#5 0x4ef27fa0 in OpenHome::ReaderHttpRequest::Read (this=0x1b5df0, aTimeoutMs=0) at OpenHome/Http.cpp:325
#6 0x4ef098f4 in OpenHome::Net::DviSessionUpnp::Run (this=0x1b5b68) at OpenHome/Net/Device/Upnp/DviServerUpnp.cpp:546
#7 0x4ef5208c in OpenHome::SocketTcpSession::Start (this=0x1b5b68) at OpenHome/Network.cpp:613
#8 0x4ef556e8 in OpenHome::MemberTranslator<OpenHome::SocketTcpSession, void (OpenHome::SocketTcpSession::*)()>::Thunk (
aFb=<value optimized out>) at Build/Include/OpenHome/Functor.h:76
#9 0x4ef6a62c in OpenHome::ThreadFunctor::Run (this=<value optimized out>) at Build/Include/OpenHome/Functor.h:34
#10 0x4ef6aacc in OpenHome::Thread::EntryPoint (aArg=<value optimized out>) at OpenHome/Thread.cpp:142
#11 0x4ef75f14 in threadEntrypoint (aArg=0x14ab28) at Os/Posix/Os.c:512
#12 0x4002a990 in start_thread () from /share/HDA_DATA/.qpkg/MinimServer/libsys/libpthread.so.0
#13 0x4013b5c8 in clone () from /lib/libc.so.6


Attached File(s)
.zip  qnaphang.zip (Size: 33.82 KB / Downloads: 2)
Find all posts by this user
14-04-2013, 12:38 PM
Post: #4
RE: ohNet hang after TimerManager thread crashes
(12-04-2013 11:07 PM)simoncn Wrote:  Perhaps 'crash' is the wrong term. What I meant is that the TimerManager thread is no longer running, and none of the running threads is holding the TMCB mutex. I presume this means something bad happened to the TimerManager thread while it was holding this mutex, and it exited without releasing the mutex and without terminating the whole process.

There are only two ways for TimerManager thread to exit - TimerManager::Stop being called or a timer callback throwing an exception and not catching it. The former will only happen during stack shutdown; the latter will be caught by Thread::EntryPoint which should log an error then abort the process.

The only way I can think of that'd leave ohNet running without the timer thread would be if you set InitParams.setFatalErrorHandler and attempted to keep running after it was called. Do you do this in MinimServer?

(This doesn't change the fact that there'll be a bug in ohNet I need to fix; understanding how you got into this state would help ensure I don't spend too much time looking in the wrong areas for a fix though.)
Find all posts by this user
14-04-2013, 01:42 PM
Post: #5
RE: ohNet hang after TimerManager thread crashes
(14-04-2013 12:38 PM)simonc Wrote:  There are only two ways for TimerManager thread to exit - TimerManager::Stop being called or a timer callback throwing an exception and not catching it. The former will only happen during stack shutdown; the latter will be caught by Thread::EntryPoint which should log an error then abort the process.

The only way I can think of that'd leave ohNet running without the timer thread would be if you set InitParams.setFatalErrorHandler and attempted to keep running after it was called. Do you do this in MinimServer?

Thanks for looking into this. MinimServer does set an ohNet fatal error handler, which logs a message, creates a crash dump, then returns to ohNet. I'm confident that this wasn't called, because no fatal error message was logged and no crash dump was created.

My first implementation of this fatal error handler terminated the process instead of returning to ohNet. This caused problems (with the MinimServer shutdown thread hanging, I think), so I looked at the ohNet code and it seemed to me that ohNet would always do the process termination itself if the MinimServer fatal error handler returned to it. Based on this, I changed the code in MinimServer to return to ohNet after doing the fatal error logging. Is this a valid approach, or have I misunderstood something?
Find all posts by this user
15-04-2013, 08:49 AM
Post: #6
RE: ohNet hang after TimerManager thread crashes
I would be interested to see the full trace log. Is there anywhere you can upload it? If not let us know and we can find an FTP server you can upload to. I would concur with Simon that the only obvious ways for the timer thread to stop running are a call to Stop or an unhandled exception, and I think only an unhandled exception could cause the timer thread to exit without releasing the TMCB mutex. As far as I am aware, if you do not terminate the process in the fatal error handler, ohNet will not do it for you - returning from the handler could be expected to give this behaviour. It would be interesting to look for log entries from "Thread::Entry" to see if any threads exited unexpectedly.

I'm interested that you say that terminating the process in the fatal error handler caused hangs. Are you trying to perform a clean shutdown? I would suggest that a fatal error handler should do as little as possible before calling exit (or System.exit for Java). Since ohNet is in an unexpected and unknown state, it cannot guarantee that it won't hang or crash when you ask it to do anything. If you are calling System.exit in the handler and the process still hangs, there might be an issue with ohNet's Java bindings.
Visit this user's website Find all posts by this user
15-04-2013, 08:52 AM (This post was last modified: 15-04-2013 08:58 AM by simonc.)
Post: #7
RE: ohNet hang after TimerManager thread crashes
(12-04-2013 07:35 PM)simoncn Wrote:  I've been trying to debug an ohNet hang. I've got as far as establishing that it was caused by the TimerManager thread having crashed while holding the TMCB mutex. This causes all other threads to block when they do a Timer::Cancel call (at line 72 of Timer.cpp). The native thread dump shows 5 of these blocked threads. This is running on Linux ARM.

How do you determine that the TimerManager thread has crashed/exited? It is the first thread ohNet creates (in the Environment ctor) so often appears as thread #2 in a dump of native threads. (It may appear later in a dump if your app creates native threads before initialising ohNet.) Your native thread dump doesn't have any gaps in thread numbering to suggest a thread has exited so I wonder if the timer thread is still running as one of the threads without any usable call stack.

In your experience of debugging on QNAP, does this sound plausible? Or are the thread numbers here not consistent through the lifetime of an app? (i.e. If a low numbered thread exits, would you expect all higher numbered threads to have changed their number in the next thread dump? Or will they retain their numbers with a gap left by the exited thread?)


(14-04-2013 01:42 PM)simoncn Wrote:  Thanks for looking into this. MinimServer does set an ohNet fatal error handler, which logs a message, creates a crash dump, then returns to ohNet. I'm confident that this wasn't called, because no fatal error message was logged and no crash dump was created.

My first implementation of this fatal error handler terminated the process instead of returning to ohNet. This caused problems (with the MinimServer shutdown thread hanging, I think), so I looked at the ohNet code and it seemed to me that ohNet would always do the process termination itself if the MinimServer fatal error handler returned to it. Based on this, I changed the code in MinimServer to return to ohNet after doing the fatal error logging. Is this a valid approach, or have I misunderstood something?

The code for CallFatalErrorHandler in Exception.cpp suggests that a user-specified callback is run and no default behaviour is applied. It is left to the callback to abort the process. Can you try doing this again in your callback please? I'm happy to help diagnose any problems this work causes.
Find all posts by this user
15-04-2013, 05:56 PM
Post: #8
RE: ohNet hang after TimerManager thread crashes
I was chatting to Simon and Gregg about this over lunch, and I had another idea of something to check - could your fatal error handler have thrown an exception (e.g. a NullPointerException or IllegalFormatException) when it was invoked by ohNet? I'm not entirely sure what JNI and our bindings do with unhandled Java exceptions, but it's conceivable that the exception might be ignored. (The only things I think it *could* realistically do are to ignore it or terminate the process.) That could be a way for the timer thread to terminate without a fatal error message being logged or a crash dump being created. It should still log the thread termination in Thread::EntryPoint (see https://github.com/openhome/ohNet/blob/m...d.cpp#L156 ), though, which is a reason I'm keen to see the full log. I think it should log something similar to "Thread::Entry(): Thread::Run returned, exiting thread TIMM(300f3804)" if the timer thread ever exits.
Visit this user's website Find all posts by this user
15-04-2013, 06:48 PM
Post: #9
RE: ohNet hang after TimerManager thread crashes
(15-04-2013 05:56 PM)andreww Wrote:  I was chatting to Simon and Gregg about this over lunch, and I had another idea of something to check - could your fatal error handler have thrown an exception (e.g. a NullPointerException or IllegalFormatException) when it was invoked by ohNet? I'm not entirely sure what JNI and our bindings do with unhandled Java exceptions, but it's conceivable that the exception might be ignored. (The only things I think it *could* realistically do are to ignore it or terminate the process.) That could be a way for the timer thread to terminate without a fatal error message being logged or a crash dump being created. It should still log the thread termination in Thread::EntryPoint (see https://github.com/openhome/ohNet/blob/m...d.cpp#L156 ), though, which is a reason I'm keen to see the full log. I think it should log something similar to "Thread::Entry(): Thread::Run returned, exiting thread TIMM(300f3804)" if the timer thread ever exits.

I'm very busy at the moment with something else. I'll send a full response as soon as I can. Some brief points:

1) There are no Thread::Entry messages in the log.

2) For various reasons, I'm sure that my fatal error handler wasn't called. (More details later.)

3) My fatal error handler does have a problem in that it would return after an unhandled exception instead of terminating the process. If it's called by an assertion failure, returning is OK because ohNet then calls abort().

4) The main reason I am currently returning rather than terminating the process from Java is that an abort() call from C++ launches Visual Studio on Windows. This is important for me to be able to debug whatever in ohNet caused the fatal error. If I call System.exit() from Java, Visual Studio won't be launched.

Sorry this is so brief. More later....
Find all posts by this user
16-04-2013, 08:10 PM (This post was last modified: 16-04-2013 08:26 PM by simoncn.)
Post: #10
RE: ohNet hang after TimerManager thread crashes
(15-04-2013 08:49 AM)andreww Wrote:  I would be interested to see the full trace log. Is there anywhere you can upload it? If not let us know and we can find an FTP server you can upload to.

I've sent you a PM with a location where you can download this.

The log was produced with an ohNet debug mask of 0x7FFFFFCF (see line 87), so it doesn't have any Timer log messages (unfortunately).

Quote:I would concur with Simon that the only obvious ways for the timer thread to stop running are a call to Stop or an unhandled exception, and I think only an unhandled exception could cause the timer thread to exit without releasing the TMCB mutex. As far as I am aware, if you do not terminate the process in the fatal error handler, ohNet will not do it for you - returning from the handler could be expected to give this behaviour. It would be interesting to look for log entries from "Thread::Entry" to see if any threads exited unexpectedly.

You are correct that MinimServer's current behaviour of not terminating the process won't do the right thing if the fatal error is caused by an unhandled exception. I've made a test version of ohNet that deliberately throws an unhandled exception in the TimerManager thread, and running MinimServer with this test version produces a fatal error message in the MinimServer log and then returns to ohNet.

By running this test, I've confirmed that an unhandled exception from ohNet does successfully produce a message in the MinimServer log. There is no such message in the full log of the hang, and there are no Thread::Entry messages either. As the debug mask is enabled for Thread::Entry messages, I think this provides clear evidence that the hang wan't caused by an unhandled exception, ulness there's some kind of unhandled exception that could bypass calling the fatal error error handler.

Quote:I'm interested that you say that terminating the process in the fatal error handler caused hangs. Are you trying to perform a clean shutdown? I would suggest that a fatal error handler should do as little as possible before calling exit (or System.exit for Java). Since ohNet is in an unexpected and unknown state, it cannot guarantee that it won't hang or crash when you ask it to do anything. If you are calling System.exit in the handler and the process still hangs, there might be an issue with ohNet's Java bindings.

My orignal code for this called System.exit(). MinimServer has a Java shutdown thread that runs after System.exit() and other process termination actions, and this shutdown thread attempts to shut things down cleanly. For most error conditions, this is the right thing to do if ohNet isn't in a broken state. In this specific case, it causes problems because the shutdown logic makes calls to ohNet.

I could modify this logic to set a flag that ohNet is broken and do a minimal shutdown instead. However, this wouldn't invoke the Visual Studio debugger on Windows. To do that, some native code must issue an abort() call. Would you be willing for ohNet to expose Os::Quit as a call that could be invoked via the Java bindings for this purpose?

(15-04-2013 08:52 AM)simonc Wrote:  How do you determine that the TimerManager thread has crashed/exited? It is the first thread ohNet creates (in the Environment ctor) so often appears as thread #2 in a dump of native threads. (It may appear later in a dump if your app creates native threads before initialising ohNet.) Your native thread dump doesn't have any gaps in thread numbering to suggest a thread has exited so I wonder if the timer thread is still running as one of the threads without any usable call stack.

I have never seen anything like that, but I can't be 100% certain that it couldn't happen.

Quote:In your experience of debugging on QNAP, does this sound plausible? Or are the thread numbers here not consistent through the lifetime of an app? (i.e. If a low numbered thread exits, would you expect all higher numbered threads to have changed their number in the next thread dump? Or will they retain their numbers with a gap left by the exited thread?)

The thread numbers at the left of the native thread dump (1, 2, 3, ... 30) have their numbers changed to automatically fill any gaps when a thread exits. From the Java thread dump, you can see "Thread-378", which indicates that a total of at least 378 threads have been created during the lifetime of the process.

The LWP numbers in the native thread dump ("nid" in the Java thread dump) should be consecutive originally and shouldn't change when threads are created and destroyed, so these might contain gaps that give useful clues. However, some gaps in the sequence might be for threads created by other processes, so it's very difficult to be sure whether a gap is significant or not.

(14-04-2013 01:42 PM)simoncn Wrote:  The code for CallFatalErrorHandler in Exception.cpp suggests that a user-specified callback is run and no default behaviour is applied. It is left to the callback to abort the process. Can you try doing this again in your callback please? I'm happy to help diagnose any problems this work causes.

See my response to Andrew's post with more detail on this. Clearly I need to make a change here. If MinimServer could invoke an ohNet API that calls Os::Quit, this would be a good solution for me.

(15-04-2013 05:56 PM)andreww Wrote:  I was chatting to Simon and Gregg about this over lunch, and I had another idea of something to check - could your fatal error handler have thrown an exception (e.g. a NullPointerException or IllegalFormatException) when it was invoked by ohNet?

I've run a test (see my response to your previous post) that seems to rule out this possibility.

Quote:I'm not entirely sure what JNI and our bindings do with unhandled Java exceptions, but it's conceivable that the exception might be ignored. (The only things I think it *could* realistically do are to ignore it or terminate the process.) That could be a way for the timer thread to terminate without a fatal error message being logged or a crash dump being created. It should still log the thread termination in Thread::EntryPoint (see https://github.com/openhome/ohNet/blob/m...d.cpp#L156 ), though, which is a reason I'm keen to see the full log. I think it should log something similar to "Thread::Entry(): Thread::Run returned, exiting thread TIMM(300f3804)" if the timer thread ever exits.

There is nothing like this in the log.
Find all posts by this user


Forum Jump: