Hadoop in Practice: Counting Words - liurhyme - ChinaUnix Blog

Hadoop in Practice: Counting Words

1. Running the simple word-count program

First, prepare two text files by running these commands in a shell:

echo "hello hadoop word count">/tmp/test_file1.txt

echo "hello hadoop,I'm a vegetable bird">/tmp/test_file2.txt

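Copy the two files into HDFS by running:

bin/hadoop dfs -mkdir test-in (creates the directory test-in)

bin/hadoop dfs -copyFromLocal /tmp/test*.txt test-in (copies both files into test-in)

bin/hadoop dfs -ls test-in (checks that the copy succeeded)

The last command prints the following listing: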

Found 2 items
-rw-r--r-- 1 hadoop supergroup 24 2011-01-21 18:40 /user/hadoop/test-in/test_file1.txt
-rw-r--r-- 1 hadoop supergroup 34 2011-01-21 18:40 /user/hadoop/test-in/test_file2.txt
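The sizes in the listing (24 and 34 bytes) are the echoed text plus the newline that echo appends. As a rough local check, the sketch below recreates the two files in a scratch directory (the path comes from mktemp and is only an assumption of this example) and prints their byte counts:

```shell
# Recreate the two input files in a scratch directory instead of /tmp directly.
workdir=$(mktemp -d)
echo "hello hadoop word count" > "$workdir/test_file1.txt"
echo "hello hadoop,I'm a vegetable bird" > "$workdir/test_file2.txt"

# wc -c counts bytes; tr strips the padding some wc implementations add.
size1=$(wc -c < "$workdir/test_file1.txt" | tr -d ' ')
size2=$(wc -c < "$workdir/test_file2.txt" | tr -d ' ')
echo "test_file1.txt: $size1 bytes"
echo "test_file2.txt: $size2 bytes"
```

If the numbers differ from the HDFS listing, the upload probably picked up other files matching /tmp/test*.txt.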

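Note: test-in here is a directory in HDFS, not on the local file system. Because the path is relative, it is resolved against the user's HDFS home directory, so its absolute path is "hdfs://localhost:9000/user/hadoop/test-in".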
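A relative HDFS path such as test-in is resolved against the home directory /user/<username>. The rule can be sketched in a few lines of shell. resolve_hdfs_path is a hypothetical helper written for this note, not part of Hadoop, and the defaults (user hadoop, file system hdfs://localhost:9000) are simply the values of this setup:

```shell
# Hypothetical helper: expand an HDFS path the way the listing above suggests.
# Arguments: path, user name (default: hadoop), file system URI (default: this setup's).
resolve_hdfs_path() {
  path=$1
  user=${2:-hadoop}
  fs=${3:-hdfs://localhost:9000}
  case "$path" in
    hdfs://*) printf '%s\n' "$path" ;;                         # already fully qualified
    /*)       printf '%s%s\n' "$fs" "$path" ;;                 # absolute path on the default file system
    *)        printf '%s/user/%s/%s\n' "$fs" "$user" "$path" ;; # relative to the home directory
  esac
}

resolve_hdfs_path test-in
resolve_hdfs_path /tmp/data
```

On a cluster with a different user or NameNode address, pass them as the second and third arguments.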

Run the example with the following command:

bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount test-in test-out (writes the results to test-out)

The screen shows:

11/01/21 18:50:16 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/01/21 18:50:17 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/01/21 18:50:17 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/01/21 18:50:17 INFO input.FileInputFormat: Total input paths to process : 2
11/01/21 18:50:17 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/01/21 18:50:17 INFO mapreduce.JobSubmitter: number of splits:2
11/01/21 18:50:18 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/01/21 18:50:18 INFO mapreduce.Job: Running job: job_201101211705_0001
11/01/21 18:50:19 INFO mapreduce.Job: map 0% reduce 0%
11/01/21 18:50:35 INFO mapreduce.Job: map 100% reduce 0%
11/01/21 18:50:44 INFO mapreduce.Job: map 100% reduce 100%
11/01/21 18:50:47 INFO mapreduce.Job: Job complete: job_201101211705_0001
11/01/21 18:50:47 INFO mapreduce.Job: Counters: 33
  FileInputFormatCounters
    BYTES_READ=58
  FileSystemCounters
    FILE_BYTES_READ=118
    FILE_BYTES_WRITTEN=306
    HDFS_BYTES_READ=300
    HDFS_BYTES_WRITTEN=68
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  Job Counters
    Data-local map tasks=2
    Total time spent by all maps waiting after reserving slots (ms)=0
    Total time spent by all reduces waiting after reserving slots (ms)=0
    SLOTS_MILLIS_MAPS=22290
    SLOTS_MILLIS_REDUCES=6539
    Launched map tasks=2
    Launched reduce tasks=1
  Map-Reduce Framework
    Combine input records=9
    Combine output records=9
    Failed Shuffles=0
    GC time elapsed (ms)=642
    Map input records=2
    Map output bytes=94
    Map output records=9
    Merged Map outputs=2
    Reduce input groups=8
    Reduce input records=9
    Reduce output records=8
    Reduce shuffle bytes=124
    Shuffled Maps =2
    Spilled Records=18
    SPLIT_RAW_BYTES=242
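Several of these counters can be cross-checked against the input with ordinary shell tools. The sketch below is only a local approximation, assuming the example splits words on whitespace: the token count should correspond to "Map output records", the number of distinct tokens to "Reduce input groups" and "Reduce output records", and the total byte count to BYTES_READ. The scratch directory is an assumption of this example.

```shell
# Recreate the two input files locally.
workdir=$(mktemp -d)
printf '%s\n' "hello hadoop word count" > "$workdir/test_file1.txt"
printf '%s\n' "hello hadoop,I'm a vegetable bird" > "$workdir/test_file2.txt"

# Whitespace-separated tokens in both files (compare: Map output records).
tokens=$(cat "$workdir"/test_file*.txt | wc -w | tr -d ' ')
# Distinct tokens (compare: Reduce input groups, Reduce output records).
distinct=$(cat "$workdir"/test_file*.txt | tr -s ' \t' '\n\n' | sort -u | wc -l | tr -d ' ')
# Total input bytes (compare: BYTES_READ).
bytes=$(cat "$workdir"/test_file*.txt | wc -c | tr -d ' ')

echo "tokens=$tokens distinct=$distinct bytes=$bytes"
```

"Combine input records" and "Combine output records" are both 9, which is consistent with each file being handled by its own map task: no word repeats within a single file, so the combiner has nothing to merge, and the two "hello" records only meet in the reducer.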

Check the job output:

bin/hadoop dfs -ls test-out

This shows:

Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2011-01-21 18:50 /user/hadoop/test-out/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 68 2011-01-21 18:50 /user/hadoop/test-out/part-r-00000

View the final counts by running:

bin/hadoop dfs -cat test-out/part-r-00000 (prints the result: the number of times each word occurs in the input files)

a 1
bird 1
count 1
hadoop 1
hadoop,I'm 1
hello 2
vegetable 1
word 1
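Note that "hadoop,I'm" is counted as one word, separate from "hadoop", which suggests that the example splits on whitespace only and leaves punctuation alone. The same result can be approximated locally with a short pipeline. This is a sketch of the counting logic, not of how Hadoop runs the job; wordcount_local is a hypothetical name and the scratch directory is an assumption of this example.

```shell
# Recreate the two input files locally.
workdir=$(mktemp -d)
printf '%s\n' "hello hadoop word count" > "$workdir/test_file1.txt"
printf '%s\n' "hello hadoop,I'm a vegetable bird" > "$workdir/test_file2.txt"

# Hypothetical local stand-in for the wordcount example:
# split on spaces/tabs, sort bytewise, count duplicates, print "word<TAB>count".
wordcount_local() {
  cat "$@" \
    | tr -s ' \t' '\n\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

result=$(wordcount_local "$workdir"/test_file*.txt)
printf '%s\n' "$result"
```

Piping the printed result through wc -c should give 68 bytes, the size of part-r-00000 in the listing above.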