Override Hadoop’s Default Compression Codec Selections

If you're using a standard input format, Hadoop will automatically choose a codec for reading your data by examining the file extension. So, for example, if you have a file with a ".gz" extension, Hadoop will recognized that it's gzipped and load it with the gzip codec. This is all well and good... until you're trying to work with a bunch of compressed files that don't have the proper extension. Then the feature suddenly becomes a burden.

Codec Logo

Apparently, "codec" was the name of a 1980's grocery store in France with a hideously '80s logo.

I recently found myself in just this situation, and scoured the internet looking for tips on how to override Hadoop's codec choices. I couldn't find any good resources, so I went digging in the source to build the solution myself. Hopefully, this post will save somebody else the trouble!

Continue reading