sam_cunnin at yahoo
Nov 2, 2011, 10:33 AM
Mahout In Action - Bayes/CBayes Classification returns NaN
My objective is to classify news documents into these classes:
Sports, Entertainment, Politics, Business, etc. Here are the steps I took:
- Used prepare20newsgroups command (page 277 - Mahout In Action) to prepare
the training data set (one long document ~5MB per class).
- Moved training dataset to HDFS and ran trainclassifier command (page 278)
and created the model
- Moved the model from HDFS to local FS and ran Classify.java (at
on a sample document
- The result is NaN for every class. The classifier apparently cannot assign
any class to this document, so it falls back to the default category: unknown.
I know the program works with the 20news dataset. I also know I am training
correctly and my dataset is pretty realistic. What might be the reason that
it cannot classify? I tried a few other documents and the result is the same:
NaN.
Just to note, when I run the prepare20newsgroups command on my training
documents, it writes a single line per class: one target variable followed by
one very long document (Sports - tab - a long single document). Could this be
the reason? I ask because I know the 20news dataset contains many lines with
repeated target variables, each carrying a document.
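To make the layout difference concrete, here is a minimal sketch of how the prepared training file could be checked. It is not part of Mahout: the class name TrainingFileCheck and the method countDocsPerLabel are hypothetical, and it assumes every line has the form label, tab, document text as described above. It only counts how many lines each label has, so a count of 1 per label would confirm the "one long document per class" layout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Mahout class.
public class TrainingFileCheck {

    // Counts how many training lines (documents) each label has.
    // Assumption: each line looks like "<label>\t<document text>".
    static Map<String, Integer> countDocsPerLabel(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // skip blank lines and lines without a label
            }
            String label = line.substring(0, tab);
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Layout I get from my data: one very long line per class.
        List<String> mine = Arrays.asList(
                "Sports\tone very long document ...");
        // Layout I expect from 20news: the same label on many lines.
        List<String> like20news = Arrays.asList(
                "Sports\tdoc one",
                "Sports\tdoc two",
                "Politics\tdoc three");

        System.out.println(countDocsPerLabel(mine));
        System.out.println(countDocsPerLabel(like20news));
    }
}
```

In practice the lines would be read from the prepared file (for example with java.nio.file.Files.readAllLines) instead of the inline sample lists used here.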
Please help. Thanks,
View this message in context: http://lucene.472066.n3.nabble.com/Mahout-In-Action-Bayes-CBayes-Classification-returns-NaN-tp3474535p3474535.html
Sent from the Lucene - General mailing list archive at Nabble.com.