Optimizing Split Sizes for Hadoop’s CombineFileInputFormat

Many of the challenges of using Hadoop with small files are well-documented. But there's one thing I haven't seen a lot of discussion about: optimizing the maximum split size for the CombineFileInputFormat and its derivatives (e.g., CombineTextInputFormat). But this is actually a pretty big issue - improper configuration can cause out-of-memory errors or degraded performance.

An Elephant never forgets. But, apparently, it can run out of memory.

Thankfully, with a little bit of knowledge about your input data and your cluster, you can determine a value for this setting that will keep your jobs running along happily.

Continue reading