If you're using a standard input format, Hadoop will automatically choose a codec for reading your data by examining the file extension. So, for example, if you have a file with a ".gz" extension, Hadoop will recognize that it's gzipped and load it with the gzip codec. This is all well and good... until you're trying to work with a bunch of compressed files that don't have the proper extension. Then the feature suddenly becomes a burden.
I recently found myself in just this situation, and scoured the internet looking for tips on how to override Hadoop's codec choices. I couldn't find any good resources, so I went digging in the source to build the solution myself. Hopefully, this post will save somebody else the trouble!
How Hadoop (Usually) Chooses Codecs
Note: this post assumes the use of the new "mapreduce" API (as opposed to "mapred"). I haven't checked whether things work quite the same way in "mapred."
As you're likely aware, Hadoop comes with a handful of input formats for reading different types of data. Each of these input formats is responsible for providing a "record reader," which hands individual records out of the file to the Map step in a MapReduce task. For common formats like the TextInputFormat, the record reader is just a version of the LineRecordReader class, which handles a bunch of functionality common to most record readers.
As it turns out, one of those bits of common functionality is picking a codec, which happens when the record reader's "initialize" method is called.
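For reference, here's roughly what that method looks like in the Hadoop source (lightly abridged from the version current at the time of writing; the details shift between releases, so check the source for your own version):

```java
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  Configuration job = context.getConfiguration();
  this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                  Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();

  // Here's the interesting part: the codec is chosen from the file's path
  compressionCodecs = new CompressionCodecFactory(job);
  final CompressionCodec codec = compressionCodecs.getCodec(file);

  // Open the file and seek to the start of the split
  FileSystem fs = file.getFileSystem(job);
  FSDataInputStream fileIn = fs.open(split.getPath());
  boolean skipFirstLine = false;
  if (codec != null) {
    // Compressed file: wrap the stream in the codec's decompressing stream
    in = new LineReader(codec.createInputStream(fileIn), job);
    end = Long.MAX_VALUE;
  } else {
    // Uncompressed file: seek to this split's chunk of the file
    if (start != 0) {
      skipFirstLine = true;
      --start;
      fileIn.seek(start);
    }
    in = new LineReader(fileIn, job);
  }
  if (skipFirstLine) {  // skip the first line and re-establish "start"
    start += in.readLine(new Text(), 0,
                         (int) Math.min((long) Integer.MAX_VALUE, end - start));
  }
  this.pos = start;
}
```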
As you can see starting about 1/3 of the way through this code, the method creates a Hadoop "CompressionCodecFactory" and then passes the path of the current file to its "getCodec" method. This, as you might imagine, returns a codec for reading the file, and things proceed from there. Of course, if you open up the source for CompressionCodecFactory and take a look at getCodec, you'll find that it's simply making its decision based on known file extensions. So, all we have to do is bypass this little check in the LineRecordReader!
Thankfully, as the CompressionCodecFactory docs point out, the codec factory also has a method "getCodecByClassName" that lets us request a codec using the class of codec we're interested in. We'll exploit this to grab our codec of interest.
Getting the Codec We Want
For the purposes of this walk-through, I'm going to assume we want to force Hadoop to use the gzip codec for all of our files, regardless of the extension. Modify the code to suit your needs, whether that's parsing the directory structure, grabbing a different codec, whatever.
The first thing we need to do is build our own record reader. In your MapReduce project, simply copy and paste the code for the LineRecordReader class (grab the Hadoop source here), giving it a new name (like "GzipLineRecordReader" or "Tom"). You can leave everything in the class untouched, except for the initialize method. (Note that you can't simply extend LineRecordReader, because you'd need to access private variables, which isn't possible...)
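A sketch of what the replacement initialize method might look like — identical to the original except that the codec is requested by class name, and the uncompressed-file branch is gone (gzip isn't splittable, so each file arrives as a single split anyway):

```java
public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  Configuration job = context.getConfiguration();
  this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                  Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();

  // Instead of letting the factory guess from the file extension,
  // ask for the gzip codec explicitly by class name
  compressionCodecs = new CompressionCodecFactory(job);
  final CompressionCodec codec =
      compressionCodecs.getCodecByClassName(GzipCodec.class.getName());

  // Open the file and set up the decompressing stream, same as the
  // compressed-file branch of the original method
  FileSystem fs = file.getFileSystem(job);
  FSDataInputStream fileIn = fs.open(split.getPath());
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
  this.pos = start;
}
```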
Then, simply rewrite the initialize method (you'll need to import org.apache.hadoop.io.compress.GzipCodec). Rather than passing the file path to getCodec, we pass getCodecByClassName the name of the GzipCodec class. This gets us the codec we're interested in, and we use the same code as usual to set up the decompressor and input stream. So, now we've got a LineRecordReader capable of reading gzip files with any extension!
Of course, we still need to actually use this fancy new record reader in an input format. That's as simple as overriding the createRecordReader method in an extended version of the input format in question, so that it returns our new record reader instead of a plain LineRecordReader. That's all there is to it!
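The wrapper input format is tiny. A sketch for TextInputFormat, assuming the record reader above was named GzipLineRecordReader (I've also overridden isSplitable: TextInputFormat decides splittability from the extension too, so without this, a gzipped file lacking the ".gz" extension would get split mid-stream and read as garbage):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class GzipTextInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Hand back our extension-agnostic reader instead of LineRecordReader
    return new GzipLineRecordReader();
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Every input file is gzipped, and gzip streams can't be split
    return false;
  }
}
```

In your driver, just swap it in with job.setInputFormatClass(GzipTextInputFormat.class).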
Hope this post gets somebody on their merry way without having to do quite as much digging in the source as I did! Let me know if you've got any questions, and good luck out there!