Hadoop Streaming Explained: A Look at Data Processing with Unix Utilities

Understanding Hadoop Data Streaming Processes

Hadoop Streaming, a powerful feature available since Hadoop version 0.14.1, allows developers to write MapReduce programs in languages such as Ruby, Perl, Python, and C++. It removes the original limitation that Hadoop MapReduce jobs had to be written in Java, enabling rapid prototyping and the reuse of legacy code in many languages without changing the underlying Hadoop infrastructure.

The Hadoop framework itself is written entirely in Java, but programs that run on Hadoop do not have to be. With Hadoop Streaming, mapper and reducer programs read data from standard input (STDIN) and write results to standard output (STDOUT). A list of key-value pairs is fed to the Map phase, and the Mapper generates intermediate key-value pairs. These external mapper processes, written in languages such as Python, plug into Hadoop Streaming to process the key-value pairs.
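
As a concrete sketch of this model, here is a minimal external mapper in Python for a hypothetical word-count job; the file name mapper.py is an assumption for this example, and the tab separator matches Hadoop Streaming's default key-value delimiter.

#!/usr/bin/env python3
# mapper.py - minimal word-count mapper sketch for Hadoop Streaming.
# Reads raw text lines from STDIN and emits one tab-separated
# "word<TAB>1" pair per word on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")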

The Reducer in Hadoop Streaming processes the intermediate key-value pairs. External reducer processes run separately from the framework and communicate with it through STDIN and STDOUT. For instance, a compiled C++ program can serve as a Reducer that aggregates and summarizes the sort-shuffled mapper output received via STDIN and writes its result back via STDOUT.
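
The same contract can be sketched in Python rather than C++ for brevity; this example assumes the word-count mapper above and relies on the framework delivering the intermediate pairs sorted by key.

#!/usr/bin/env python3
# reducer.py - minimal word-count reducer sketch for Hadoop Streaming.
# STDIN delivers sort-shuffled "word<TAB>count" lines, so counts can be
# accumulated over each run of identical keys and flushed on key change.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")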

The input reader in Hadoop Streaming reads the input data and produces the list of key-value pairs handed to the mappers. To read a specific type of data, a matching input format must be configured for the input reader. In this way, Hadoop Streaming can consume many kinds of input, such as .csv files, database tables, image data (.jpg, .png), and audio data.
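
For illustration, a streaming job is submitted with the hadoop jar command, and the -inputformat flag selects the input format class. The input and output paths below are assumptions, the streaming jar location varies by distribution, and TextInputFormat is spelled out explicitly even though it is the default.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/data/input \
    -output /user/data/output \
    -inputformat org.apache.hadoop.mapred.TextInputFormat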

The Hadoop Streaming feature thus allows developers to write MapReduce programs in their preferred languages. This flexibility is achieved by treating user programs as black-box mappers and reducers that exchange data through standard streams. No special Hadoop API bindings are required in these languages, since the communication happens purely through standard input/output streams.
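
Because the contract is nothing more than lines flowing through STDIN and STDOUT, the scripts can even be tested without a cluster by emulating the shuffle with the Unix sort utility, using the example files assumed above:

cat input.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py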

In the Map phase, a Python script can be a Mapper that reads lines from STDIN, processes them, and outputs intermediate key-value pairs. Similarly, Ruby scripts can be plugged in as mapper or reducer stages, communicating via standard streams.

After the shuffle and sort phase, the intermediate key-value pairs are fed to the external reducer, and the output the reducer generates is gathered and stored in HDFS. Hadoop Streaming launches the user's scripts as external processes on the cluster nodes; each receives its input line by line via STDIN, processes it, and writes key-value pairs to STDOUT to be consumed by the next stage of the MapReduce pipeline.
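
Once the job finishes, the reducer output sits in part files under the job's output directory and can be inspected with the HDFS shell (the path below is the one assumed in the submission sketch):

hdfs dfs -cat /user/data/output/part-*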

In summary, Hadoop Streaming enables data processing in different programming languages by treating user programs as black-box mappers and reducers that exchange data through standard streams, allowing Python, C++, Ruby, and other languages to participate fully in Hadoop MapReduce workflows.
