
So you think you can core?

Posted by Will Rosecrans
    Segmentation fault (core dumped)

Bad luck: Your software just crashed. It happens to all of us, and when it does Linux will write a core dump file with some information about what was happening in the address space of the program when it crashed. If you are feeling a little adventurous, you can do some fun and interesting things with exactly how and where those cores get dumped.

    $ man core

The manual is always a good place to start when learning something new. The man page for “core” has lots of details. You can control the behavior by putting values into the files in /proc/sys/kernel/. One of the most useful is /proc/sys/kernel/core_pattern, which lets you control where the core dump file is written and its name. The documentation also mentions one neat feature that even many experienced Linux administrators aren’t very familiar with:

“If the first character of this file is a pipe symbol (|), then the remainder of the line is interpreted as a program to be executed.”

This means that you aren’t limited to just adjusting the parameters that the kernel exposes: you can literally do whatever you want when a program crashes. Naturally I was quite interested in this feature, because “whatever I want” is my favorite thing to do! Since most people don’t write their own core handlers, and practical information about doing so is pretty sparse, let’s take a look at some of the things you can do with a custom core handler like the one in use here at Verizon Digital Media Services.

     $ df -h

With a core_pattern set to “/tmp/core-%e-%t”, core dump files will be named after the program that crashed plus a timestamp indicating when the crash happened. Keeping every crash around lets you see whether the program is always crashing in the same spot or whether several different things are making it crash. But since you are keeping multiple core dump files on disk, they will accumulate until you delete them, and eventually the disk fills up. You can run a cleanup script from cron to clean out the /tmp directory every night, or every hour, or however often you like. But it’s still theoretically possible to fill up the disk very quickly if there are a lot of crashes. And with a large enough network, all the stuff that’s possible in theory tends to happen in practice sooner or later. Indeed, we used to have occasional problems where computers would fill up their disks with core dumps despite having a periodic cleanup process.
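If you’d rather apply that setting from a script than type it into a shell, a minimal sketch (run as root; the pattern is just the example above) looks like this:

# Minimal sketch, run as root: name cores after the crashing program plus a timestamp.
with open('/proc/sys/kernel/core_pattern', 'w') as f:
  f.write('/tmp/core-%e-%t')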

You can set your core_pattern to something like “/tmp/core” and you’ll only ever have exactly one core dump on the system. You won’t accumulate a bunch of files since the system will keep overwriting that file over and over when something crashes. The last program to crash will be there in a file called “/tmp/core” regardless of the name of the program that crashed. It’ll be great fun debugging a crashing program when you don’t even know which program was crashing. What’s worse, even a single really large crash can still fill up your disk.

And that leads us to the idea of writing a core handler that simply won’t ever fill up a disk. Ever. No matter what.

      $ echo "|/my/core_handler %" > /proc/sys/kernel/core_pattern

So, you’ve decided to write yourself a custom core handler! You thoroughly read the core man page, and then turn it on with the above command. But what should it actually do? For the simplest implementation, you just need to read from stdin and write the data to some file. Here’s some Python that accomplishes that task. (This blog post is about getting you started with ideas to write your own, so the examples aren’t fully fleshed out and ready to drop into production. Hopefully it’s enough to point you in an interesting direction if you want to tinker with this sort of thing in your own environment.)

import os
import sys

def write_core_from_stdin(core_file_path):
  chunk_size = 4096
  # The kernel pipes the core image to our stdin; os.read() returns bytes,
  # so open the destination file in binary mode.
  with open(core_file_path, 'wb') as f:
    while True:
      data = os.read(sys.stdin.fileno(), chunk_size)
      if not data:
        break
      f.write(data)

write_core_from_stdin('/tmp/core-...')

That was easy! But it also doesn’t do anything very clever yet. It literally just writes the core dump to a file, just like if you had set a filename in core_pattern. But with that much in place as a first step, the exciting possibilities are limited only by your imagination! If you want to make it a little less eager to fill up your disk, you can use something like this:

import os
import sys
import syslog

def write_core_from_stdin(core_file_path, size_limit=None):
  with open(core_file_path, 'wb') as f:
    chunk_size = 4096
    total_size = 0
    while True:
      # Count the next chunk before reading it, so we stop just before
      # crossing the limit rather than just after.
      total_size += chunk_size
      if size_limit is not None and total_size > size_limit:
        syslog.syslog('ERROR: failed to write core dump at: ' + core_file_path +
                      ' It was larger than ' + str(size_limit))
        break
      data = os.read(sys.stdin.fileno(), chunk_size)
      if not data:
        break
      f.write(data)

free_space = get_available_space_for_cores()
if free_space > 0:
  write_core_from_stdin('/tmp/core-...', free_space)

You can also check for free space inside the loop that writes the core file if you are really paranoid about something else writing a lot of data after you start writing the core, but checking repeatedly after every chunk you read from the pipe will probably slow things down. At this point, the core handler will avoid filling up the disk, assuming that your get_available_space_for_cores function leaves enough margin for error. But you’ll eventually accumulate a bunch of old cores, and then you’ll stop writing new cores to disk. One good way to handle this is to clean up old core dump files right before you write the new one. You can keep the last five core dumps so you always have a few to compare, but you’ll never end up with hundreds of accumulated old core dump files sitting around using up all your space, no matter how quickly things are crashing. So it looks a little more like this:

cleanup_old_cores(path_to_where_you_store_your_coredumps, limit=5)
free_space = get_available_space_for_cores()
if free_space > 0:
  write_core_from_stdin('/tmp/core-...', free_space)
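get_available_space_for_cores is just a name in these snippets; a minimal sketch of what it might do, assuming cores land in /tmp and you want to hold back a fixed safety margin for everything else on the filesystem, is:

import os

# Hypothetical sketch: report how many bytes a new core may use, holding back a
# safety margin so a core dump can never completely fill the filesystem.
def get_available_space_for_cores(core_dir='/tmp', reserve_bytes=1024 * 1024 * 1024):
  stats = os.statvfs(core_dir)
  free_bytes = stats.f_bavail * stats.f_frsize  # space available to unprivileged users
  return max(0, free_bytes - reserve_bytes)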

Implementation of the cleanup function is left as an exercise for the reader. If you care about digging into your core dumps, you can probably manage to delete some old files. I have abundant faith in you, dear reader. The thing to keep in mind is to do the cleanup right before you write the core dump, rather than in some cron job. This sort of event-driven cleanup is usually better than polling for stuff to do on a fixed interval. You never waste CPU time or IOPS running a periodic cleanup job when there’s nothing to clean up, and you never lack space for your latest core dump because the cron job hasn’t run yet. The old core dumps stay around for as long as possible, but no longer, so you have the biggest window to look at what went wrong.
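If you do want a starting point for that cleanup function, a minimal sketch, assuming all core dumps land in a single directory and are named with a core- prefix, might be:

import glob
import os

# Hypothetical sketch: keep only the newest `limit` core dumps in core_dir,
# deleting the oldest ones first.
def cleanup_old_cores(core_dir, limit=5):
  cores = sorted(glob.glob(os.path.join(core_dir, 'core-*')), key=os.path.getmtime)
  for old_core in cores[:-limit]:
    try:
      os.remove(old_core)
    except OSError:
      pass  # another process may have already removed it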

But what happens if there still isn’t enough free space to write the core dump file after you clean up the old ones? Before you got into this custom core handler business, you probably had some monitoring system that would alert you when it found a bunch of core dumps on one of your systems. If a disk is too full, you don’t get a core file, your alerting system will insist that nothing is crashing, and you’ll be completely oblivious to what’s actually happening. That’s no good. You should probably add a feature that pings a monitoring system regardless of whether or not you write the core dump file to disk. Python has a simple interface for logging to syslog, and you can use rsyslog to filter out those messages and forward them to a central server for alerting. You can also set up some sort of server dedicated to handling this kind of information; we use a server with an HTTP API to aggregate the core dump alerts. Using the format specifiers in the core man page, you can pass lots of details about the core dump to your script as command line parameters to use in your logic and your error messages.

import syslog

# program, pid, sig, host, and core_time come from the command line parameters
# that the kernel fills in from your core_pattern; core_dir and core_file are
# wherever your handler decides to write.
msg = 'coredump: %s (%s) killed by signal: %s on host: %s at %s core dumped to: %s/%s' % (
  program, pid, sig, host, core_time, core_dir, core_file)
syslog.syslog(msg)
notify_internal_monitoring_system(msg)
cleanup_old_cores(path_to_where_you_store_your_coredumps, limit=5)
free_space = get_available_space_for_cores()
if free_space > 0:
  write_core_from_stdin('/tmp/core-...', free_space)
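To make the variables above concrete: %e, %p, %s, %h, and %t are real specifiers from the core man page, but exactly which ones you pass, and in what order, is up to you. A hypothetical wiring would be a core_pattern of “|/my/core_handler %e %p %s %h %t” paired with something like:

import sys

# Hypothetical sketch: with core_pattern set to "|/my/core_handler %e %p %s %h %t",
# the kernel passes the expanded values to the handler as argv[1] through argv[5].
program, pid, sig, host, core_time = sys.argv[1:6]

# core_dir and core_file are the handler's own choices, not kernel specifiers.
core_dir = '/tmp'
core_file = 'core-%s-%s' % (program, core_time)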

Now you can reliably get paged at 3:00 am about something crashing even if a disk is full! (You should probably also have an alert about the disk being full, but that’s a different blog post.) Admittedly, this could be something of a mixed blessing.

      $ echo "Food for thought..."

The core handler always runs as root, so it can write a core file anywhere that root can write to, and do anything that root can do. Keep that in mind because when some nefarious non-root user crashes something, they are triggering your program with input that they control. Try not to make your system so flexible that a user can take advantage of your core handler to do things like overwrite system files. Also keep the fact that it runs as root in mind if you want to use an NFS mounted file system for storing core dumps or core metadata. Root won’t normally be permitted to write to an NFS mount.
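For example, if any part of the destination filename is derived from something the crashing process controls, a cheap sanity check (the function name here is hypothetical) is to resolve the final path and refuse to write anything that escapes your core directory:

import os

# Hypothetical sketch: refuse to write a core anywhere outside the directory we
# expect, even if part of the filename came from untrusted input.
def safe_core_path(core_dir, core_file):
  path = os.path.realpath(os.path.join(core_dir, core_file))
  if not path.startswith(os.path.realpath(core_dir) + os.sep):
    raise ValueError('refusing to write core outside of ' + core_dir)
  return path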

You also want to be careful that your core handler won’t dump core itself. If you implement yours in native code using a language like C, be particularly careful about this sort of thing. (Did you ever watch the movie Inception, where people could get trapped in a dream within a dream within a dream? Avoid core dump inception.) The examples in this blog post are in Python because it’s difficult to get the Python interpreter to crash in normal use. Normally a bug in your code will just end the execution of your script, not cause the interpreter to dump core. Keep your implementation as simple as you can, and don’t do anything too clever with importing unstable modules. The problem can be somewhat mitigated by setting /proc/sys/kernel/core_pipe_limit to 1. This controls how many core dumps can be piped to handlers simultaneously, so you can’t end up with a zillion parallel core handlers dumping cores for core handlers, effectively fork bombing the system. Any core dump that occurs while your script is busy with the previous core will be skipped, but it will be logged with a “kernel: Skipping core dump” message in the syslog. You can ensure that every core dump is still counted by monitoring the syslog for that message.

In my implementation, core_pipe_limit is set to 0 (unlimited), but the core handler uses flock() locking so that only one core dump is ever being written to disk at a time. If it can’t get the lock, it still does the other tasks that notify monitoring. This helps avoid a situation where the system wouldn’t be able to properly log a core dump of a crashing core handler because it thinks it is already dumping the core for something else. Ignorance is bliss, but it doesn’t lead to bug fixes. Using a lock to ensure that only one process is ever writing a core dump also puts a limit on how much IO will be dedicated to core dumps on a busy system. You probably want most of your IOPS being used for something other than core dumps.
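A rough sketch of that locking scheme, with a hypothetical lock file path and reusing the helper names from the earlier snippets, could look like this:

import fcntl
import syslog

# Hypothetical sketch: only write a core to disk if no other handler instance is
# already writing one, but notify monitoring either way.
def handle_core(core_dir, core_file_path, msg):
  syslog.syslog(msg)
  notify_internal_monitoring_system(msg)
  with open('/var/run/core_handler.lock', 'w') as lock_file:
    try:
      # Non-blocking exclusive lock; fails immediately if another handler holds it.
      fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
      syslog.syslog('Skipping core write, another core dump is already in progress')
      return
    cleanup_old_cores(core_dir, limit=5)
    free_space = get_available_space_for_cores()
    if free_space > 0:
      write_core_from_stdin(core_file_path, free_space)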

Once you have your core dumps being aggregated in a centralised place, you can plug the information into whatever system you use for alerting. We use a bot that posts messages about core dumps into a Slack channel, since we use Slack for a lot of internal communication. Here’s an example of a notification we got when we ran into a bug in the perl interpreter:

Even perl crashes sometimes
