Keynote Speaker

Programming and Debugging Large-Scale Data Processing Workflows

Christopher Olston, Bionica Human Computing


This talk gives an overview of my team's work on large-scale data processing at Yahoo! Research. The talk begins with overviews of two data processing systems we helped develop: PIG, a dataflow programming environment and Hadoop-based runtime, and NOVA, a workflow manager for Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what can be done before, during and after execution of a data processing operation:
  • Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key factors (realism, conciseness and completeness) of the example data it produces.
  • INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We implemented a dozen user-requested tools, ranging from data integrity checks to crash cause investigation to performance profiling, each in just a few hundred lines of code.
  • IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g. tables vs. records; relational operators vs. reduce task attempts) and offer disparate ways of querying it. IBIS integrates this metadata and presents a uniform and powerful query interface to users.
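
To make the sampling-plus-synthesis idea behind the example data generator more concrete, here is a minimal Python sketch for a single filter step. It is an illustration only, not Pig's actual algorithm or API: the function example_data, its parameters, and the sample records are all hypothetical. Sampling real records provides realism, synthesizing a record when no sampled one passes the filter provides completeness, and keeping at most one passing and one failing record provides conciseness.

```python
import random


def example_data(records, keep, synthesize_passing, sample_size=3, seed=0):
    """Build a tiny example set for one filter step (hypothetical helper).

    records            -- real input records (used for realism)
    keep               -- the filter predicate under inspection
    synthesize_passing -- callable returning a made-up record that should pass
    """
    rng = random.Random(seed)
    # Realism: start from a small random sample of real records.
    sample = rng.sample(records, min(sample_size, len(records)))
    passing = [(r, "sampled") for r in sample if keep(r)]
    failing = [(r, "sampled") for r in sample if not keep(r)]

    # Completeness: if no sampled record passes, synthesize one that does,
    # so the filter's output is not empty in the illustration.
    if not passing:
        fake = synthesize_passing()
        if not keep(fake):
            raise ValueError("synthesized record does not pass the filter")
        passing = [(fake, "synthesized")]

    # Conciseness: show at most one passing and one failing record.
    return passing[:1] + failing[:1]


# Hypothetical input: no real record satisfies the filter "clicks > 100".
visits = [
    {"user": "amy", "clicks": 3},
    {"user": "bob", "clicks": 12},
    {"user": "cat", "clicks": 40},
    {"user": "dan", "clicks": 7},
]


def many_clicks(record):
    return record["clicks"] > 100


examples = example_data(
    visits,
    many_clicks,
    synthesize_passing=lambda: {"user": "synthetic", "clicks": 101},
)

for record, origin in examples:
    print(origin, record)
```

Because none of the real records passes the filter here, the first example is synthesized and the second is a real record that the filter rejects, so a reader can see both outcomes of the step from just two rows.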


Christopher Olston is a web data researcher, and co-founder of Bionica Human Computing. His past affiliations include Yahoo! Research (principal research scientist) and Carnegie Mellon (assistant professor). He holds computer science degrees from Stanford (2003 Ph.D., M.S.; funded by NSF and SGF fellowships) and UC Berkeley (B.S. with highest honors). While at Yahoo, Olston won the 2009 SIGMOD Best Paper Award and co-created Apache Pig, which is used for large-scale data processing by LinkedIn, Twitter, Yahoo and others, comes in Cloudera's standard Hadoop bundle, and is offered by Amazon as a cloud service.