Cache Trashing in Linux

I work with huge files a lot. It’s one of my “hobbies”. Many of these files are about 4.5GB, rar’ed and often par’ed.

Now, in case you wonder what RAR and PAR are… RAR is a compressed archive format, and PAR stands for Parity Archive, a format used to repair “damaged” files. These two file formats are used a lot for transferring huge files across Usenet and stuff. But don’t tell anyone I told you that.

So why this page?
Well, I have a problem with the way Linux deals with these files.

Linux loves to cache I/O. It likes it so much that it’ll try to cache virtually everything you throw at it. Generally, this is a good thing. Caching a file means that a copy of the data that’s on the harddisk is kept in system memory. And trust me, system memory is a lot faster than any harddisk, so repeated access to that file will be a LOT faster.
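
If you want to see this happening on your own system, the kernel reports the current size of this cache in /proc/meminfo. Here’s a tiny illustration of my own (not something you need for the rest of this page) that just prints the relevant line:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *mi = fopen("/proc/meminfo", "r");
    if (!mi) {
        perror("fopen");
        return 1;
    }

    char line[128];
    while (fgets(line, sizeof line, mi)) {
        if (strncmp(line, "Cached:", 7) == 0) { /* page cache size in kB */
            fputs(line, stdout);
            break;
        }
    }

    fclose(mi);
    return 0;
}

Run it before, during and after reading a big file and you’ll see the number climb.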

Of course, a 4.5GB file will never fit inside my 512MB of system memory. So why try to cache it? I don’t know. Maybe Linux doesn’t know it’s dealing with an insanely huge file that gains nothing from being cached.

So what if it’s cached, why should I care?
Well, imagine that you’ve loaded all your favorite apps. Their files are now cached in system memory, so starting an application for the second time will be a lot faster than the first time.

However, when unrar’ing or repairing a 4.5GB file, all of this cache is simply trashed, because part of this huge file will be stored in memory, replacing the cache that was there already. As a result, switching between programs, starting an application you closed 5 minutes ago, etc., will be horribly slow because they are no longer cached.

Okay, whatever. So what does whining about it do?
I tried to find a “fix” for this problem. After some research, I learned there’s a way of dealing with files that prevents the kernel from caching them. This stuff is low-level, and requires modifying the source code of the applications involved. Luckily, I have all the sources that I need, so I was ready to try some things.

So what’s the trick?
Well, it’s actually really simple. When opening a file using the open() system call, you pass it the O_DIRECT flag. That way, reads and writes on that file descriptor bypass the kernel’s page cache. For example:

int fd = open("hugefile", O_DIRECT);
/* process the file */
close (fd);
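
One thing the snippet above glosses over: with O_DIRECT the kernel expects the user buffer, the file offset and the transfer size to be aligned to the logical block size, otherwise read() and write() fail with EINVAL. Here’s a minimal sketch of what a read loop could look like; the 4096-byte alignment and the "hugefile" name are assumptions on my part, so check what your filesystem actually wants.

#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("hugefile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* both the buffer address and the transfer size are 4096-byte aligned */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        close(fd);
        return 1;
    }

    ssize_t n;
    while ((n = read(fd, buf, 4096)) > 0) {
        /* process n bytes; none of this data ends up in the page cache */
    }

    free(buf);
    close(fd);
    return 0;
}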

Many applications use fopen() instead, which gives you a buffered file stream. To still benefit from O_DIRECT, you can open the file yourself and wrap the descriptor with fdopen():

int fd = open("hugefile", O_DIRECT | O_CREAT | O_RDWR | O_LARGEFILE);
FILE *f = fdopen(fd, "w+"); /* used to be fopen("hugefile, "w+"); */
/* work on f */
fclose(f);
close(fd);
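
A word of caution: the alignment rules mentioned above also apply to the read() and write() calls that stdio makes behind the scenes, and stdio’s default buffer isn’t guaranteed to be suitably aligned. One thing that may help (this is a sketch of an idea, not something from the patches below; the 4096-byte alignment and the 1MB buffer size are guesses) is handing the stream a block-aligned buffer with setvbuf() before doing any I/O on it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("hugefile", O_DIRECT | O_CREAT | O_RDWR | O_LARGEFILE, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    FILE *f = fdopen(fd, "w+");
    if (!f) {
        close(fd);
        return 1;
    }

    /* must be done before any other operation on the stream */
    char *buf = NULL;
    if (posix_memalign((void **)&buf, 4096, 1024 * 1024) == 0)
        setvbuf(f, buf, _IOFBF, 1024 * 1024);

    /* work on f */

    fclose(f); /* closes the underlying fd as well */
    free(buf); /* the buffer has to outlive the stream, so free it last */
    return 0;
}

Whether this is enough depends on the filesystem and kernel version; if stdio still issues transfers the kernel doesn’t like, they will fail with EINVAL.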

Of course, you’ll need to adapt it a bit for the application that you want to modify.

Applications I changed

  • unrar
  • par2cmdline

Application I still need to change

  • nzbperl

Unfortunately, nzbperl is written in Perl. Not sure how I will deal with this yet.

The result?
I can now continue using my laptop as if nothing is happening while unrar’ing a huge archive or repairing one.

So, with this little article I showed how I improved my own experience. I won’t claim that this solution works for everyone the way it worked for me. If you keep unrar’ing the same file over and over again, there is a real benefit to caching it, and in that case you shouldn’t change anything.

Hopefully someone else will benefit from it too.

Disclaimer
This page is very rough. There may be errors, and you should never treat this as a reliable source or reference. If you’d like to point out errors I’ve made, feel free to email me at remenic@gmail.com.