Word Frequency Counting Application Based on the MapReduce Framework (Python)
### Python MapReduce Framework Word Frequency Counting Application Example
In a typical implementation using the Hadoop Streaming API with Python, two primary components are involved: Mapper and Reducer scripts. For word frequency counting:
The mapper script reads input lines from standard input (stdin), splits each line into words, and outputs tab-separated key-value pairs whose key is a word and whose value is the constant count "1". This process can be illustrated as follows[^1]:
```python
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # Remove leading/trailing whitespace characters such as '\n'
    line = line.strip()
    # Split the line into words on whitespace
    words = line.split()
    # Iterate over all extracted words
    for word in words:
        # Write tab-separated key-value pairs to stdout
        print(f"{word}\t1")
```
The reducer receives these intermediate pairs on stdin, sorted by key so that all records for the same word arrive consecutively. Its task is to sum the counts for each unique word and print the final tallies.
The reducer code is given below:
```python
#!/usr/bin/env python3
import sys

current_word = None
count_sum = 0

# Read data from stdin one line at a time
for line in sys.stdin:
    # Strip off any extra spaces/newlines
    line = line.strip()
    # Parse the incoming record in 'key\tvalue' format
    try:
        word, count = line.rsplit('\t', 1)
        # Convert the string representation back into an integer
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue
    # Keys arrive sorted, so a change in word marks the end of a group
    if current_word == word:
        count_sum += count
    else:
        if current_word:
            print(f'{current_word}\t{count_sum}')
        current_word = word
        count_sum = count

# Emit the tally for the last word, if any input was seen
if current_word:
    print(f'{current_word}\t{count_sum}')
```
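The reducer logic above only works because Hadoop's shuffle/sort phase delivers the mapper output sorted by key, so all pairs for the same word arrive one after another. The following is a minimal in-memory sketch of that map → sort → reduce flow; the sample lines are illustrative assumptions, not part of the original scripts.
```python
#!/usr/bin/env python3
# In-memory illustration of the map -> shuffle/sort -> reduce flow.
# The sample lines below are made up; this mirrors, but does not replace,
# mapper.py and reducer.py above.

sample_lines = ["hello world", "hello mapreduce", "world of data"]

# Map phase: emit one (word, 1) pair per word, as mapper.py does
pairs = [(word, 1) for line in sample_lines for word in line.split()]

# Shuffle/sort phase: Hadoop sorts pairs by key before the reducer sees them,
# so identical words become adjacent
pairs.sort(key=lambda kv: kv[0])

# Reduce phase: sum counts over each run of identical keys, as reducer.py does
current_word, count_sum = None, 0
for word, count in pairs:
    if word == current_word:
        count_sum += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{count_sum}")
        current_word, count_sum = word, count
if current_word is not None:
    print(f"{current_word}\t{count_sum}")
```
Because identical keys are adjacent, the reducer only needs to remember the current key and its running total, which is why the streaming reducer avoids building a full dictionary in memory.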
To run this program as a Hadoop Streaming job, save the two scripts above as `mapper.py` and `reducer.py`, and make them executable (e.g. `chmod +x mapper.py reducer.py`). Then invoke the `hadoop jar` command with the Hadoop Streaming jar, passing the HDFS input/output paths and the paths to your mapper and reducer scripts. A quick way to sanity-check the pipeline locally before submitting to a cluster is sketched below.
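One possible local check, assuming a Unix shell with `sort` is available and that `mapper.py` and `reducer.py` are executable in the current directory (the sample text is made up for illustration):
```python
#!/usr/bin/env python3
# Local sanity check: chain the scripts the way Hadoop Streaming would,
# with `sort` standing in for the shuffle phase.
import subprocess

sample_text = "hello world\nhello mapreduce\nworld of data\n"

pipeline = "./mapper.py | sort -k1,1 | ./reducer.py"
result = subprocess.run(pipeline, input=sample_text, capture_output=True,
                        text=True, shell=True, check=True)

print(result.stdout, end="")
# Expected tab-separated output:
# data    1
# hello   2
# mapreduce    1
# of      1
# world   2
```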