Programming and Debugging Large-Scale Data Processing Workflows
Abstract: This talk gives an overview of my team's work on large-scale data processing at Yahoo! Research. The talk begins with overviews of two data processing systems we helped develop: PIG, a dataflow programming environment and Hadoop-based runtime, and NOVA, a workflow manager for Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what can be done before, during and after execution of a data processing operation:
- Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key factors (realism, conciseness and completeness) of the example data it produces.
- INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We implemented a dozen user-requested tools, ranging from data integrity checks to crash cause investigation to performance profiling, each in just a few hundred lines of code.
- IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g. tables vs. records; relational operators vs. reduce task attempts) and offer disparate ways of querying it. IBIS integrates this metadata and presents a uniform and powerful query interface to users.
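To make the "sampling and synthesis" idea from the first item concrete, here is a minimal sketch in Python. It is not Pig's actual algorithm. The function `example_rows`, the helper `make_passing`, and the toy user table are all hypothetical, and only a single FILTER step is considered. Real rows are sampled first (realism), one row per filter outcome is kept (conciseness), and a row is synthesized only when no sampled row passes the filter (completeness).

```python
import random

def example_rows(rows, predicate, make_passing, sample_size=3, seed=0):
    """Pick a tiny example set for one FILTER step.

    Returns a list of (row, origin) pairs, where origin is "sampled"
    or "synthesized". Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    # Realism: start from real input rows.
    sample = rng.sample(rows, min(sample_size, len(rows)))
    passing = [r for r in sample if predicate(r)]
    failing = [r for r in sample if not predicate(r)]

    examples = []
    # Conciseness: keep at most one row per filter outcome.
    if failing:
        examples.append((failing[0], "sampled"))
    if passing:
        examples.append((passing[0], "sampled"))
    elif sample:
        # Completeness: no sampled row passes the filter, so derive one
        # from a real row. (Only the passing side is synthesized here.)
        examples.append((make_passing(sample[0]), "synthesized"))
    return examples

# Toy input: nobody is older than 60, so sampling alone would show
# an empty filter output.
users = [
    {"user": "a", "age": 25},
    {"user": "b", "age": 31},
    {"user": "c", "age": 19},
    {"user": "d", "age": 42},
]

def is_senior(row):
    return row["age"] > 60

def force_senior(row):
    # Copy the row and change only the field the filter looks at.
    return dict(row, age=61)

examples = example_rows(users, is_senior, force_senior)
for row, origin in examples:
    print(origin, row, "passes" if is_senior(row) else "filtered out")
```

With this input the result contains one sampled row that the filter drops and one synthesized row that it keeps, so both outcomes of the filter are visible in a two-row example.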