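Mahout Algorithm Source Code Analysis: Parallel Frequent Pattern Mining (1) Hands-On
This series analyzes the source code of Parallel Frequent Pattern Mining. This first post starts with a hands-on run that follows the material on the Mahout website. The main goal is to compare how long the sequential and mapreduce modes take to process the same data. The data set is retail.dat, in which each line is one transaction made up of space-separated integer item IDs. Its first few lines are: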
[plain]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
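To make the input format concrete, here is a minimal sketch of how lines like these could be read as transactions and how per-item support could be counted. It is not Mahout's code: the class `TransactionSupport` and its methods are hypothetical names made up for illustration, and the input is lines two to six of the sample above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Mahout.
public class TransactionSupport {

    // Split one retail.dat-style line into its item IDs (whitespace separated).
    static List<String> parseTransaction(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    // Count in how many transactions each item appears.
    static Map<String, Integer> countSupport(List<String> lines) {
        Map<String, Integer> support = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // The set makes sure an item repeated within one line is counted once.
            for (String item : new LinkedHashSet<>(parseTransaction(line))) {
                support.merge(item, 1, Integer::sum);
            }
        }
        return support;
    }

    // Keep only the items whose count is at least minSupport.
    static Map<String, Integer> frequentItems(Map<String, Integer> support, int minSupport) {
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Integer> e : support.entrySet()) {
            if (e.getValue() >= minSupport) {
                frequent.put(e.getKey(), e.getValue());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "30 31 32",
                "33 34 35",
                "36 37 38 39 40 41 42 43 44 45 46",
                "38 39 47 48",
                "38 39 48 49 50 51 52 53 54 55 56 57 58");
        Map<String, Integer> support = countSupport(lines);
        System.out.println(frequentItems(support, 3)); // prints {38=3, 39=3}
    }
}
```

In these five transactions only items 38 and 39 occur three times, so they are the only ones that survive a threshold of 3.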
The main program for Parallel Frequent Pattern Mining is org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver. Running the FPGrowthDriver class with no arguments, or with the -h option, prints the following usage information:
[plain]
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>               comma separated archives to be unarchived
                                 on the compute machines.
 -conf <configuration file>      specify an application configuration file
 -D <property=value>             use value for given property
 -files <paths>                  comma separated files to be copied to the
                                 map reduce cluster
 -fs <local|namenode:port>       specify a namenode
 -jt <local|jobtracker:port>     specify a job tracker
 -libjars <paths>                comma separated jar files to include in
                                 the classpath.
 -tokenCacheFile <tokensFile>    name of the file with the tokens
Job-Specific Options:
  --input (-i) input                          Path to job input directory.
  --output (-o) output                        The directory pathname for output.
  --minSupport (-s) minSupport                (Optional) The minimum number of times a
                                              co-occurrence must be present.
                                              Default Value: 3
  --maxHeapSize (-k) maxHeapSize              (Optional) Maximum Heap Size k, to denote
                                              the requirement to mine top K items.
                                              Default value: 50
  --numGroups (-g) numGroups                  (Optional) Number of groups the features
                                              should be divided in the map-reduce
                                              version. Doesn't work in sequential
                                              version Default Value:1000
  --splitterPattern (-regex) splitterPattern  Regular Expression pattern used to split
                                              given string transaction into itemsets.
                                              Default value splits comma separated
                                              itemsets.
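
The --splitterPattern option matters for retail.dat because its items are separated by spaces, while the help text says the default pattern splits comma-separated itemsets. The sketch below shows the difference using plain `java.util.regex`. The class `SplitterDemo` is hypothetical, and the `COMMA` pattern is only a stand-in for a comma-based default. The exact default regex used by Mahout is not shown in the help output, so it is assumed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it only illustrates how a splitter regex behaves.
public class SplitterDemo {

    // Assumption: a simple comma splitter standing in for the documented default.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // A pattern that matches a single space, as needed for retail.dat-style lines.
    static final Pattern SPACE = Pattern.compile("[ ]");

    // Split one transaction string into items with the given pattern.
    static List<String> split(String transaction, Pattern splitter) {
        return Arrays.asList(splitter.split(transaction));
    }

    public static void main(String[] args) {
        String line = "38 39 47 48";
        // No comma in the line, so the whole line comes back as one "item".
        System.out.println(split(line, COMMA).size()); // prints 1
        // Splitting on spaces recovers the four item IDs.
        System.out.println(split(line, SPACE).size()); // prints 4
    }
}
```

With a comma-only splitter, every line of retail.dat would be treated as a single item, so a space-matching pattern would presumably have to be passed through -regex when running against this data set.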