seq2sparse对应于mahout中的org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles,从昨天跑的算法中的任务监控界面可以看到这一步包含了7个Job信息,分别是:(1)DocumentTokenizer(2)WordCount(3)MakePartialVectors(4)MergePartialVectors(5)VectorTfIdf Document Frequency Count(6)MakePartialVectors(7)MergePartialVectors。打印SparseVectorsFromSequenceFiles的参数帮助信息可以看到如下的信息:
[java]
Usage:
[--minSupport <minSupport> --易做图yzerName <易做图yzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
--minSupport (-s) minSupport (Optional) Minimum Support. Default
Value: 2
--易做图yzerName (-a) 易做图yzerName The class name of the 易做图yzer
--chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000 MB
--output (-o) output The directory pathname for output.
--input (-i) input Path to job input directory.
--minDF (-md) minDF The minimum document frequency. Default
is 1
--maxDFSigma (-xs) maxDFSigma What portion of the tf (tf-idf) vectors
to be used, expressed in times the
standard deviation (sigma) of the
document frequencies of these vectors.
Can be used to remove really high
frequency terms. Expressed as a double
value. Good value to be specified is 3.0.
In case the value is less then 0 no
vectors will be filtered out. Default is
-1.0. Overrides maxDFPercent
--maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF.
Can be used to remove really high
frequency terms. Expressed as an integer
between 0 and 100. Default is 99. If
maxDFSigma is also set, it will override
this value.
--weight (-wt) weight The kind of weight to use. Currently TF
or TFIDF
--norm (-n) norm The norm to use, expressed as either a
float or "INF" if you want to use the
Infinite norm. Must be greater or equal
to 0. The default is not to normalize
--minLLR (-ml) minLLR (Optional)The minimum Log Likelihood
Ratio(Float) Default is 1.0
补充:软件开发 , Java ,