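Hadoop in Practice: Counting Words
1. Running the simple word-count program
First, prepare two text files by running the following commands on the command line: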
echo "hello hadoop word count">/tmp/test_file1.txt
echo "hello hadoop,I'm a vegetable bird">/tmp/test_file2.txt
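Copy the two files into DFS by running these commands:
bin/hadoop dfs -mkdir test-in (creates the directory test-in)
bin/hadoop dfs -copyFromLocal /tmp/test*.txt test-in (copies the two files into test-in)
bin/hadoop dfs -ls test-in (checks that the copy succeeded) The following listing is displayed: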
- Found 2 items
- -rw-r--r-- 1 hadoop supergroup 24 2011-01-21 18:40 /user/hadoop/test-in/test_file1.txt
- -rw-r--r-- 1 hadoop supergroup 34 2011-01-21 18:40 /user/hadoop/test-in/test_file2.txt
Note: test-in here is actually a directory in HDFS; its absolute path is "hdfs://localhost:9000/user/hadoop/test-in".
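The way the relative path expands can be sketched in shell. This is only an illustration: resolve_hdfs_path is a hypothetical helper, not a Hadoop command, and the user name hadoop and the URI hdfs://localhost:9000 are simply the values that appear in the listing and note above. It assumes the usual HDFS rule that a relative path is resolved against the home directory /user/<username>.

```shell
# Hypothetical helper (not part of Hadoop): mimics how a path given to
# "bin/hadoop dfs" is expanded. The defaults are taken from this walkthrough.
resolve_hdfs_path() (
  path="$1"
  user="${2:-hadoop}"                     # assumed: the user running the commands
  namenode="${3:-hdfs://localhost:9000}"  # assumed: the configured default file system
  case "$path" in
    hdfs://*) printf '%s\n' "$path" ;;                                # already fully qualified
    /*)       printf '%s%s\n' "$namenode" "$path" ;;                  # absolute path in HDFS
    *)        printf '%s/user/%s/%s\n' "$namenode" "$user" "$path" ;; # relative to the home directory
  esac
)

resolve_hdfs_path test-in
# prints hdfs://localhost:9000/user/hadoop/test-in
```

Under these assumptions, test-in, /user/hadoop/test-in and the full hdfs:// URI all name the same directory.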
Run the example with the following command:
bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount test-in test-out (writes the results to test-out) The screen shows:
- 11/01/21 18:50:16 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
- 11/01/21 18:50:17 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
- 11/01/21 18:50:17 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
- 11/01/21 18:50:17 INFO input.FileInputFormat: Total input paths to process : 2
- 11/01/21 18:50:17 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
- 11/01/21 18:50:17 INFO mapreduce.JobSubmitter: number of splits:2
- 11/01/21 18:50:18 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
- 11/01/21 18:50:18 INFO mapreduce.Job: Running job: job_201101211705_0001
- 11/01/21 18:50:19 INFO mapreduce.Job: map 0% reduce 0%
- 11/01/21 18:50:35 INFO mapreduce.Job: map 100% reduce 0%
- 11/01/21 18:50:44 INFO mapreduce.Job: map 100% reduce 100%
- 11/01/21 18:50:47 INFO mapreduce.Job: Job complete: job_201101211705_0001
- 11/01/21 18:50:47 INFO mapreduce.Job: Counters: 33
- FileInputFormatCounters
- BYTES_READ=58
- FileSystemCounters
- FILE_BYTES_READ=118
- FILE_BYTES_WRITTEN=306
- HDFS_BYTES_READ=300
- HDFS_BYTES_WRITTEN=68
- Shuffle Errors
- BAD_ID=0
- CONNECTION=0
- IO_ERROR=0
- WRONG_LENGTH=0
- WRONG_MAP=0
- WRONG_REDUCE=0
- Job Counters
- Data-local map tasks=2
- Total time spent by all maps waiting after reserving slots (ms)=0
- Total time spent by all reduces waiting after reserving slots (ms)=0
- SLOTS_MILLIS_MAPS=22290
- SLOTS_MILLIS_REDUCES=6539
- Launched map tasks=2
- Launched reduce tasks=1
- Map-Reduce Framework
- Combine input records=9
- Combine output records=9
- Failed Shuffles=0
- GC time elapsed (ms)=642
- Map input records=2
- Map output bytes=94
- Map output records=9
- Merged Map outputs=2
- Reduce input groups=8
- Reduce input records=9
- Reduce output records=8
- Reduce shuffle bytes=124
- Shuffled Maps =2
- Spilled Records=18
- SPLIT_RAW_BYTES=242
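Several of the counters above can be cross-checked locally without Hadoop. The sketch below assumes that the example splits lines on whitespace only and that each input file is processed by its own map task (the log reports 2 splits); the variable and function names are made up for illustration.

```shell
# The same two lines that the echo commands wrote to the input files.
f1='hello hadoop word count'
f2="hello hadoop,I'm a vegetable bird"

# One token per line, splitting on whitespace only (assumed tokenizer behaviour).
tokens() { printf '%s\n' "$@" | tr -s '[:space:]' '\n' | sed '/^$/d'; }

# Bytes in both files, including the newline that echo appends to each.
input_bytes=$(printf '%s\n' "$f1" "$f2" | wc -c | tr -d ' ')
# Every token becomes one map output record.
map_output_records=$(tokens "$f1" "$f2" | wc -l | tr -d ' ')
# A combiner can only merge duplicates inside a single map task's output.
combine_out_1=$(tokens "$f1" | sort -u | wc -l | tr -d ' ')
combine_out_2=$(tokens "$f2" | sort -u | wc -l | tr -d ' ')
# The reducer sees one group per distinct word across both files.
reduce_groups=$(tokens "$f1" "$f2" | sort -u | wc -l | tr -d ' ')

echo "BYTES_READ             = $input_bytes"
echo "Map output records     = $map_output_records"
echo "Combine output records = $((combine_out_1 + combine_out_2))"
echo "Reduce input groups    = $reduce_groups"
```

This prints 58, 9, 9 and 8, matching the log. The combine step does not shrink the data here because no word repeats within a single file; the only repeated word, hello, occurs once in each file, so the two records are merged only in the reduce phase.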
List the output of the job:
bin/hadoop dfs -ls test-out displays:
- Found 2 items
- -rw-r--r-- 1 hadoop supergroup 0 2011-01-21 18:50 /user/hadoop/test-out/_SUCCESS
- -rw-r--r-- 1 hadoop supergroup 68 2011-01-21 18:50 /user/hadoop/test-out/part-r-00000
View the final counts by running this command:
bin/hadoop dfs -cat test-out/part-r-00000 prints the result, i.e. the number of times each word occurs in the input files:
- a 1
- bird 1
- count 1
- hadoop 1
- hadoop,I'm 1
- hello 2
- vegetable 1
- word 1
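As a sanity check, the same table can be reproduced with ordinary Unix tools. This is a minimal sketch rather than what Hadoop runs internally: wordcount_local is a hypothetical helper, and it assumes that words are split on whitespace only and that keys are sorted in byte order. The files are written to a temporary directory so the originals under /tmp are left alone.

```shell
# Hypothetical local cross-check; needs no Hadoop installation.
wordcount_local() {
  # Split on whitespace, sort in byte order, count duplicates,
  # then print "word<TAB>count" like the part-r-00000 file.
  cat "$@" | tr -s '[:space:]' '\n' | sed '/^$/d' \
    | LC_ALL=C sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Re-create the two input files in a scratch directory.
tmpdir=$(mktemp -d)
echo "hello hadoop word count" > "$tmpdir/test_file1.txt"
echo "hello hadoop,I'm a vegetable bird" > "$tmpdir/test_file2.txt"

wordcount_local "$tmpdir"/test_file*.txt
```

Under the whitespace-only assumption, hadoop,I'm stays a single token, which is why hadoop is counted once rather than twice in the result above. Counting it twice would require stripping or splitting on punctuation before counting.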