What is Feature Selection?

Feature Selection: The goal of feature selection is to pick the features that you believe will improve your classifier's ability to separate data into the different categories. For email classification, candidate features include:

  1. Contents of the subject line.

  2. Other email header data points

  3. Body of the email (excluding stop words)

Should we consider the date the email was sent? No, because it is not going to help classify whether an email is spam or not. Also, don't use words that appear very rarely.
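The filtering described above (drop stop words, drop very rare words) can be sketched in a few lines of Python. This is only an illustration, not Mahout code: the stop-word list, the "subject" and "body" keys, the `select_features` helper and the `min_doc_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "you"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def select_features(emails, min_doc_count=2):
    """Return the vocabulary kept after dropping stop words and rare words.

    emails: list of dicts with hypothetical "subject" and "body" keys.
    min_doc_count: a word must appear in at least this many emails.
    """
    doc_counts = Counter()
    for email in emails:
        # Use a set so each word is counted once per email.
        words = set(tokenize(email["subject"] + " " + email["body"]))
        doc_counts.update(words - STOP_WORDS)
    return {word for word, n in doc_counts.items() if n >= min_doc_count}

emails = [
    {"subject": "Win a free prize", "body": "Claim the free prize now"},
    {"subject": "Free offer", "body": "You win a prize in the draw"},
    {"subject": "Meeting notes", "body": "Notes for the project meeting"},
]
vocabulary = select_features(emails)
print(sorted(vocabulary))  # ['free', 'prize', 'win']
```

Counting the number of emails that contain a word, rather than raw occurrences, is one possible way to define "rare"; a real pipeline would tune the threshold on its own data.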

Classification in Mahout can be executed sequentially or via MapReduce. All of the Mahout supervised learning algorithms can be executed sequentially, but only a few of them have parallel execution models. Naive Bayes is one of them.

Naive Bayes classifier: In simple terms, a naive Bayes classifier assumes that the value of a particular feature (for example, a word in a sentence) is unrelated to the presence or absence of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of the presence or absence of the other features.
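Written as a formula, the independence assumption looks roughly like this. The symbols are introduced here for illustration only: C is the class (for example "apple") and x_1, ..., x_n are the feature values (red, round, about 3 inches).

```latex
P(C \mid x_1, \ldots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

Each feature contributes its own factor P(x_i | C), and the factors are simply multiplied together; that is the "naive" part.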

To classify emails, a naive Bayes classifier examines the words that occur in an email and checks whether they are more likely to occur in the spam or the ham category. If a word is used frequently in an email, and the spam category has also seen that word frequently (while the ham category has seen it less frequently), then the word is treated as stronger evidence of spam than of ham. Bayes' theorem comes into play once a spam probability and a ham probability have been calculated for each word: it combines them to form the overall probability that the email is ham or spam.
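The idea can be sketched as a toy classifier in plain Python. This is a minimal illustration and not how Mahout implements it: the `train` and `classify` helpers and the tiny training set are made up for the example, and it assumes add-one (Laplace) smoothing and log probabilities, which are common choices but are not taken from the text above.

```python
import math
from collections import Counter

def train(labeled_emails):
    """Count word occurrences per category from (words, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    email_counts = Counter()
    for words, label in labeled_emails:
        word_counts[label].update(words)
        email_counts[label] += 1
    return word_counts, email_counts

def classify(words, word_counts, email_counts):
    """Return the category with the highest log posterior score.

    Assumes each category has at least one training email.
    """
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(email_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Prior: fraction of training emails in this category.
        score = math.log(email_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            # Summing logs is equivalent to multiplying the probabilities.
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (list of words, label).
training = [
    (["free", "prize", "win"], "spam"),
    (["free", "offer", "win"], "spam"),
    (["meeting", "notes", "project"], "ham"),
    (["project", "schedule", "notes"], "ham"),
]
word_counts, email_counts = train(training)
print(classify(["free", "win"], word_counts, email_counts))       # spam
print(classify(["project", "notes"], word_counts, email_counts))  # ham
```

Working with sums of logarithms instead of products of probabilities keeps the scores from underflowing when an email contains many words.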


