Analyzing asterisk coredumps with gdb

If you have been using an asterisk solution for some time, you probably know that asterisk may crash (yes, it happens!).
And when it crashes (provided you have configured it to do so), asterisk will write a coredump file.
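
If no core file shows up after a crash, a minimal setup along these lines is usually enough (paths, users and init scripts vary between systems, so treat this as a sketch rather than the XiVO-specific procedure):

# Remove the core size limit for the shell that starts asterisk
ulimit -c unlimited

# The -g option tells asterisk to dump core when it crashes
/usr/sbin/asterisk -g -U asterisk

# Optionally choose where core files are written (system-wide setting)
echo "/var/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern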

Having used asterisk in our solution for quite some time (from asterisk 1.4 up to 13.17.2 in our latest LTS version, XiVO Polaris) and having deployed it at hundreds of customers, we have been, and still are, confronted from time to time with an asterisk crash.

When an asterisk crash occurs we basically try to answer two questions:

  1. What went wrong? That is, how can we work around it?
  2. What triggered the crash? That is, how can we reproduce or prevent it?

In this short article we will show, through an example, how one can use gdb to analyze an asterisk coredump and try to answer these two questions.

To analyze an asterisk coredump you’ll need the gdb utility, the asterisk binary in the exact same version as the one that crashed, and the asterisk debugging symbols (for a XiVO, the procedure is described here).
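
By the way, if you are not sure which binary produced a given core file, the file utility will tell you. The output will look roughly like this (the core file name depends on your system’s core_pattern setting):

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/asterisk -g -U asterisk'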

Some gdb basics

The first step is to open the coredump: gdb asterisk <core file>

It will display a lot of information, ending with:

...
[New LWP 8725]
[New LWP 8731]
[New LWP 8756]
[New LWP 8758]
[New LWP 8927]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/asterisk -g -U asterisk'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f2d45ab6029 in handle_hangup (userdata=0x19198d0, sub=<optimized out>, msg=<optimized out>) at app_queue.c:7321
7321 app_queue.c: No such file or directory.

From this output you already understand that the crash occurred in app_queue.c.

The next step is to display the backtrace. In gdb you use the bt command:

(gdb) bt
#0 0x00007f2d45ab6029 in handle_hangup (userdata=0x19198d0, sub=<optimized out>, msg=<optimized out>) at app_queue.c:7321
#1 0x00000000005bdb88 in router_dispatch (data=0x1e05da0, sub=0x18673b0, message=0x185eb90) at stasis_message_router.c:201
#2 0x00000000005b226a in subscription_invoke (message=0x185eb90, sub=0x18673b0) at stasis.c:434
#3 dispatch_exec_async (local=<optimized out>) at stasis.c:715
#4 0x00000000005ca2c8 in ast_taskprocessor_execute (tps=tps@entry=0x18c5b20) at taskprocessor.c:963
#5 0x00000000005d144e in execute_tasks (data=0x18c5b20) at threadpool.c:1322
#6 0x00000000005ca21d in ast_taskprocessor_execute (tps=0x149e120) at taskprocessor.c:965
#7 0x00000000005d1cc0 in threadpool_execute (pool=0x14a2870) at threadpool.c:351
#8 worker_active (worker=0x7f2d880012e0) at threadpool.c:1105
#9 worker_start (arg=arg@entry=0x7f2d880012e0) at threadpool.c:1024
#10 0x00000000005db26c in dummy_start (data=<optimized out>) at utils.c:1235
#11 0x00007f2d9eabe064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f2d9dda662d in clone () from /lib/x86_64-linux-gnu/libc.so.6

It displays the last function calls in the thread that crashed.
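
Note that bt only covers the thread that crashed. Two standard variants are often worth a look as well: bt full also prints the local variables of each frame, and thread apply all bt prints the backtrace of every thread of the process, which is handy when the crash involves interaction between threads (locks, taskprocessors, ...):

(gdb) bt full
(gdb) thread apply all bt
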
Then you can display the details of the last frame (the one at line #0) with the frame command:

(gdb) frame
#0 0x00007f2d45ab6029 in handle_hangup (userdata=0x19198d0, sub=<optimized out>, msg=<optimized out>) at app_queue.c:7321
7321 in app_queue.c

Which is basically what gdb told us when opening the coredump.
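
frame without an argument shows the currently selected frame; you can also move around the backtrace with frame <n> (or up and down), and every subsequent info or print command then applies to that frame. For example, to inspect the caller of handle_hangup():

(gdb) frame 1
#1 0x00000000005bdb88 in router_dispatch (data=0x1e05da0, sub=0x18673b0, message=0x185eb90) at stasis_message_router.c:201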

Our last command in gdb showed us that the crash happened at line 7321 of app_queue.c, in the handle_hangup() function. One can look at the code (for an asterisk shipped by XiVO you need to download the source via apt-get source asterisk so that all the patches are applied):

static void handle_hangup(void *userdata, struct stasis_subscription *sub,
		struct stasis_message *msg)
{
	struct queue_stasis_data *queue_data = userdata;
	struct ast_channel_blob *channel_blob = stasis_message_data(msg);
	RAII_VAR(struct ast_channel_snapshot *, caller_snapshot, NULL, ao2_cleanup);
	RAII_VAR(struct ast_channel_snapshot *, member_snapshot, NULL, ao2_cleanup);
	RAII_VAR(struct ast_channel *, chan, NULL, ao2_cleanup);
	enum agent_complete_reason reason;

	ao2_lock(queue_data);

	if (queue_data->dying) {
		ao2_unlock(queue_data);
		return;
	}

	if (!strcmp(channel_blob->snapshot->uniqueid, queue_data->caller_uniqueid)) {
		reason = CALLER;
	} else if (!strcmp(channel_blob->snapshot->uniqueid, queue_data->member_uniqueid)) {
		reason = AGENT;
	} else {
		ao2_unlock(queue_data);
		return;
	}

	chan = ast_channel_get_by_name(channel_blob->snapshot->name);
	if (chan && (ast_channel_has_role(chan, AST_TRANSFERER_ROLE_NAME) ||
		!ast_strlen_zero(pbx_builtin_getvar_helper(chan, "ATTENDEDTRANSFER")) ||
		!ast_strlen_zero(pbx_builtin_getvar_helper(chan, "BLINDTRANSFER")))) {
		/* Channel that is hanging up is doing it as part of a transfer.
		 * We'll get a transfer event later
		 */
		ao2_unlock(queue_data);
		return;
	}

	caller_snapshot = ast_channel_snapshot_get_latest(queue_data->caller_uniqueid);
	member_snapshot = ast_channel_snapshot_get_latest(queue_data->member_uniqueid);

	ao2_unlock(queue_data);

	ast_debug(3, "Detected hangup of queue %s channel %s\n", reason == CALLER ? "caller" : "member",
			channel_blob->snapshot->name);

	ast_queue_log(queue_data->queue->name, caller_snapshot->uniqueid, queue_data->member->membername,
			reason == CALLER ? "COMPLETECALLER" : "COMPLETEAGENT", "%ld|%ld|%d",
			(long) (queue_data->starttime - queue_data->holdstart),
			(long) (time(NULL) - queue_data->starttime), queue_data->caller_pos);

	update_queue(queue_data->queue, queue_data->member, queue_data->callcompletedinsl,
			time(NULL) - queue_data->starttime);
	send_agent_complete(queue_data->queue->name, caller_snapshot, member_snapshot, queue_data->member,
			queue_data->holdstart, queue_data->starttime, queue_data->caller_uniqueid, queue_data->queue->wrapuptime, reason);
	remove_stasis_subscriptions(queue_data);
}

The real line #7321 is line #46 of our extract: we therefore learn that asterisk crashed when calling the ast_queue_log() function, which uses the queue_data and caller_snapshot structures.
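
Instead of reading the file by hand, you can also let gdb display the source directly: point it at the downloaded sources with the directory command and then use list from the crashing frame. The path below is only an example, it depends on where apt-get source unpacked the code:

(gdb) directory /usr/src/asterisk-13.17.2/apps
(gdb) frame 0
(gdb) list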

What is interesting with gdb is that you can display further details about the frame, such as the local variables of the function. First, the info frame command:

(gdb) info frame
Stack level 0, frame at 0x7f2d3adc1cb0:
rip = 0x7f2d45ab6029 in handle_hangup (app_queue.c:7321); saved rip = 0x5bdb88
called by frame at 0x7f2d3adc1ce0
source language c.
Arglist at 0x7f2d3adc1c48, args: userdata=0x19198d0, sub=<optimized out>, msg=<optimized out>
Locals at 0x7f2d3adc1c48, Previous frame's sp is 0x7f2d3adc1cb0
Saved registers:
rbx at 0x7f2d3adc1c78, rbp at 0x7f2d3adc1c80, r12 at 0x7f2d3adc1c88, r13 at 0x7f2d3adc1c90, r14 at 0x7f2d3adc1c98, r15 at 0x7f2d3adc1ca0, rip at 0x7f2d3adc1ca8

And with the info locals command we display the values of the local variables at the time of the crash:

(gdb) info locals
queue_data = 0x19198d0
channel_blob = <optimized out>
caller_snapshot = 0x0
member_snapshot = 0x7f2d7401b898
chan = 0x0
reason = CALLER
__PRETTY_FUNCTION__ = "handle_hangup"

And here we see that caller_snapshot and chan are NULL (lines 4 and 6 of this output)! This is most probably the cause of our crash: line 7321 dereferences caller_snapshot->uniqueid while caller_snapshot is NULL.
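
You can cross-check by printing the content of queue_data itself, whose fields appear in the source extract above (a sketch; the exact fields depend on the asterisk version):

(gdb) print queue_data->caller_uniqueid
(gdb) print *queue_data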

More information with gdb

Now we want to find which call scenario led to these NULL values and this crash.
We use gdb again to display the value of member_snapshot with the print command. Notice:

  • the set print pretty command, which displays the structure content in a readable way,
  • and the star (*) before the structure name, which dereferences the pointer and displays the whole structure content.
(gdb) set print pretty on
(gdb) print *member_snapshot
$6 = {
__field_mgr_pool = 0x7f2d7401baa0,
name = 0x7f2d7401bac2 "SIP/cobs82-00000001",
uniqueid = 0x7f2d7401bade "1502980085.1",
linkedid = 0x7f2d7401baee "1502980083.0",
appl = 0x7f2d7401bafe "AppQueue",
data = 0x7f2d7401bb0a "(Outgoing Line)",
context = 0x7f2d7401bb1c "default",
exten = 0x7f2d7401bb26 "s",
accountcode = ...
...

And here we find that the channel name is SIP/cobs82-00000001. So now we can trace this call in the asterisk full log to see what happened and try to analyze the scenario.
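
On the machine that produced the coredump, a simple grep on the channel name (or on the uniqueid 1502980085.1) pulls the call out of the log, assuming the default log location /var/log/asterisk/full:

grep 'SIP/cobs82-00000001' /var/log/asterisk/full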

That’s it. I’m not at all an expert in this field, but here’s what I gathered from my little journey with gdb and an asterisk coredump :-)
In the end we were able to find the exact scenario leading to the crash and were also able to provide a fix.

