Many of the challenges of using Hadoop with small files are well-documented. But there's one thing I haven't seen a lot of discussion about: optimizing the maximum split size for the CombineFileInputFormat and its derivatives (e.g., CombineTextInputFormat). It's actually a pretty big issue - improper configuration can cause out-of-memory errors or degraded performance.
Thankfully, with a little bit of knowledge about your input data and your cluster, you can determine a value for this setting that will keep your jobs running along happily.
Just FYI before we get started... everything I'm talking about in this post has been tested with Hadoop 2.6.0 and the newer mapreduce API (org.apache.hadoop.mapreduce). If you apply it to other versions (especially the old mapred API), your mileage may vary.
Non-optimal configurations for the maximum split size can cause problems in (at least) two ways. First, memory errors. In a Hadoop job, the actual input splits are calculated by the Hadoop client, which runs on the master node. By default, this client gets 1 GB of RAM. If you've got a large enough collection of files, the client can run out of RAM just trying to keep track of all your file locations and assign them to splits, resulting in errors like "java.lang.OutOfMemoryError: GC overhead limit exceeded." Of course, you can adjust "HADOOP_HEAPSIZE" to give the client more RAM, but this only goes so far - eventually you're limited by the amount of RAM in the master node.
The second problem is that poor configurations may result in a non-optimal number of splits being created. For example, if a maximum input split size isn't specified at all, Hadoop will combine all files in a rack into a single split. That's unlikely to be the optimum configuration, especially if your job is CPU intensive.
The really simple solution here is to say "set your maximum input split size." This is easy enough to do: in your job's main method, call the static setMaxInputSplitSize(job, sizeInBytes) method on your input format (e.g., CombineTextInputFormat).
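Here's a minimal sketch of what that call might look like, assuming the newer mapreduce API and the Hadoop client libraries on your classpath. The class name, job name, and the 128 MB value are placeholders I picked for illustration, and the rest of the job setup is left out.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Hypothetical driver class; only the split-size setup is shown.
public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-files-job");
        job.setJarByClass(SmallFilesJob.class);

        // Combine many small files into each split.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Maximum split size in bytes (placeholder value: 128 MB).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // ... set the mapper, reducer, and input/output paths here,
        // then submit the job as usual.
    }
}
```

The same setting can also be supplied through the mapreduce.input.fileinputformat.split.maxsize configuration property if you'd rather not hard-code it.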
Of course, the real key is picking the value to use for the maximum size (which is expressed in bytes). This isn't all that hard either: you need to satisfy two different constraints.
First, you want to set your max split size small enough that you don't cause your Hadoop client to run out of RAM by keeping track of too many files at the same time. In my (limited) experience, the Hadoop client can reliably handle somewhere around 80,000 files under its default RAM limit of 1 GB. You should experiment a bit, as this might depend on the length of your file names, the configuration of your cluster, etc. But we'll stick with 80,000 for the purposes of illustration.
Essentially, you want to set a max split size that will result in splits being created from a maximum of 80,000 files at a time. This requires some simple math - the maximum split size you'd want to use is your average file size times 80,000. So, if your average file is 1.6 KB, you can set a max split size of roughly 128 MB (1.6 KB x 80,000 = 128,000 KB). Easy.
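If you'd rather not do this by hand, here's a small sketch of the same arithmetic in Java. The class and method names are made up for illustration, and I'm treating 1.6 KB as 1,600 bytes; the 80,000-file limit is just the rough figure from my own experience, so plug in whatever number your own testing supports.

```java
// Hypothetical helper: upper bound on split size so that a split
// covers at most maxFilesPerSplit files of average size.
final class FileCountBound {
    static long maxSplitBytes(long averageFileBytes, long maxFilesPerSplit) {
        if (averageFileBytes <= 0 || maxFilesPerSplit <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        // multiplyExact throws instead of silently overflowing.
        return Math.multiplyExact(averageFileBytes, maxFilesPerSplit);
    }

    public static void main(String[] args) {
        // 1.6 KB average file size, 80,000 files per split.
        System.out.println(maxSplitBytes(1_600L, 80_000L)); // prints 128000000
    }
}
```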
The second requirement is that you should set a split size small enough to produce whatever you've decided is your target number of mappers (e.g., the number of CPU cores in your cluster). For this, you'll need to know (at least roughly) the total size of your files. The max split size you want is the total size of your files divided by your target number of mappers. So, if you have 4 GB of files and you want to have at least 16 mappers, you'll need a max split size of 256 MB (4,096 MB / 16). Again, easy.
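The same calculation as a quick Java sketch, again with names I've invented for illustration. It rounds down, so you should end up with at least the target number of splits, though the exact count will depend on how Hadoop packs your files.

```java
// Hypothetical helper: upper bound on split size so that the input
// is divided into at least targetMappers splits.
final class MapperCountBound {
    static long maxSplitBytes(long totalInputBytes, int targetMappers) {
        if (totalInputBytes <= 0 || targetMappers <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        // Integer division rounds down; never return less than 1 byte.
        return Math.max(1L, totalInputBytes / targetMappers);
    }

    public static void main(String[] args) {
        long fourGb = 4L * 1024 * 1024 * 1024; // 4 GB of input
        System.out.println(maxSplitBytes(fourGb, 16)); // prints 268435456 (256 MB)
    }
}
```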
Of course, these calculations are very likely to produce two very different numbers. Simply take the smaller of the two, and you're good to go.
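To round out the example numbers from this post, here's a tiny sketch of that last step. The class and method names are hypothetical; the two inputs are the file-count bound (about 128 MB) and the mapper-count bound (256 MB) from the examples above.

```java
// Hypothetical helper: pick the final max split size from the two bounds.
final class SplitSizeChooser {
    static long choose(long fileCountBoundBytes, long mapperCountBoundBytes) {
        // The smaller value satisfies both constraints at once.
        return Math.min(fileCountBoundBytes, mapperCountBoundBytes);
    }

    public static void main(String[] args) {
        long byFileCount = 128_000_000L;   // 1,600 bytes x 80,000 files
        long byMapperCount = 268_435_456L; // 4 GB / 16 mappers
        System.out.println(choose(byFileCount, byMapperCount)); // prints 128000000
    }
}
```

The result is the value you'd pass (in bytes) as your maximum input split size.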
That's it! Please leave a comment if this has been helpful, or if you notice something that's out of place. I appreciate the feedback!