This article shows how to apply a Naïve Bayes classification model to a Natural Language Processing (NLP) problem in Python.
Here are the steps we will cover:
- Download a sample dataset
- Split the dataset into training and test data
- Vectorize the data
- Build and measure the accuracy of the model
As an example, we will use a publicly available spam-detection dataset of 5,572 SMS messages, each labeled as ham (legitimate) or spam. Here’s how we’ll approach it:
Step 1: Download the dataset from this site and extract the files.
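A minimal sketch of this step, assuming the dataset is distributed as a zip archive. The function name `download_and_extract` and the archive file name are choices made for this example, and the URL is left as a parameter because it depends on the download page you use.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir):
    """Download a zip archive from `url` and extract it into `dest_dir`.

    Returns the list of file names contained in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Save the archive next to the extracted files (file name is arbitrary).
    archive_path = dest / "dataset.zip"
    urllib.request.urlretrieve(url, str(archive_path))

    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage (hypothetical arguments; substitute the dataset's actual download link):
# files = download_and_extract("<dataset zip URL>", "data")
```

Downloading and extracting by hand in a browser works just as well; the function only makes the step repeatable.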
Step 2: Import the text dataset and provide column names.
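One way to do this with pandas is sketched below. It assumes the extracted file is tab-separated with the label first and the message second, and has no header row. The column names `label` and `message` are chosen for this example, and the two rows are made-up stand-ins for the real file.

```python
import io

import pandas as pd

# Two made-up rows in the assumed layout (label <TAB> message).
# With the real data, pass the path of the extracted file instead.
sample_file = io.StringIO(
    "ham\tSee you at lunch?\n"
    "spam\tWIN a free prize now!\n"
)

# The file has no header row, so the column names are supplied explicitly.
df = pd.read_csv(sample_file, sep="\t", header=None, names=["label", "message"])

print(df.shape)   # (rows, columns)
print(df.head())
```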
Step 3: Convert labels (ham and spam) to numbers (0 and 1).
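A small sketch of the label conversion, assuming the labels sit in a pandas column called `label` (a name chosen for this example). The new column `label_num` and the toy rows are also illustrative.

```python
import pandas as pd

# Toy rows standing in for the real dataset.
df = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham"],
        "message": ["See you at lunch?", "WIN a free prize now!", "Call me later"],
    }
)

# Map each text label to a number: ham -> 0, spam -> 1.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

print(df[["label", "label_num"]])
```

Any label missing from the mapping becomes `NaN`, so checking `df["label_num"].isna().sum()` is a quick way to catch unexpected label values.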
Step 4: Split the dataset into training and test sets.
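The split can be sketched with scikit-learn's `train_test_split`. The 25% test share and `random_state=1` are assumptions made for this example, not values taken from the article, and the eight messages are toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: eight made-up messages with numeric labels (0 = ham, 1 = spam).
df = pd.DataFrame(
    {
        "message": [
            "See you at lunch",
            "WIN a free prize now",
            "Call me later",
            "Free cash, claim now",
            "Are we meeting today",
            "Urgent: claim your prize",
            "Running late, sorry",
            "Free entry, text now",
        ],
        "label_num": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

X = df["message"]    # input text
y = df["label_num"]  # target labels

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print(len(X_train), len(X_test))
```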
Step 5: Vectorize the data to convert words to numerical structures. You can read more on this here.
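To make "numerical structures" concrete, here is a sketch that assumes a bag-of-words representation built with scikit-learn's `CountVectorizer`; the article does not say which vectorizer was used, so treat that choice as an assumption. Each column is a word from the vocabulary and each cell counts how often that word appears in a document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents for illustration.
docs = ["free prize now", "call me now now"]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)  # sparse document-term matrix

# Vocabulary words ordered by their column index in the matrix.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
counts = dtm.toarray()

print(vocab)   # ['call', 'free', 'me', 'now', 'prize']
print(counts)  # [[0 1 0 1 1]
               #  [1 0 1 2 0]]
```

With its default settings, `CountVectorizer` lowercases the text and ignores single-character tokens.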
Step 6: Vectorize the training dataset.
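A sketch of this step, again assuming `CountVectorizer`. Calling `fit_transform` on the training messages both learns the vocabulary and converts the messages into a document-term matrix. The three messages are toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training messages standing in for the real training set.
X_train = ["free prize now", "see you at lunch", "claim your free prize"]

vect = CountVectorizer()

# Learn the vocabulary from the training data and build the matrix in one call.
X_train_dtm = vect.fit_transform(X_train)

# Rows = training messages, columns = distinct words seen during fitting.
print(X_train_dtm.shape)  # (3, 9)
```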
Step 7: Vectorize the test dataset.
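The test data should be converted with `transform`, not `fit_transform`, so that it uses the vocabulary learned from the training data and has the same columns. The sketch below assumes `CountVectorizer` and uses toy messages; words that never appeared in training (here "tomorrow") are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data standing in for the real training and test sets.
X_train = ["free prize now", "see you at lunch", "claim your free prize"]
X_test = ["free lunch tomorrow"]

# Fit on the training data only.
vect = CountVectorizer()
vect.fit(X_train)

# Reuse the fitted vocabulary for the test data.
X_test_dtm = vect.transform(X_test)

print(X_test_dtm.shape)  # (1, 9): same 9 columns as the training matrix
print(X_test_dtm.sum())  # 2: only "free" and "lunch" are in the vocabulary
```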
Step 8: Build the Naïve Bayes classification model. If you want to learn more about Naïve Bayes, check out this post.
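A sketch of the model-building step, assuming the multinomial variant (`MultinomialNB`), which is a common choice for word-count features; the article does not name the variant, so this is an assumption. The training messages and labels are toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (1 = spam, 0 = ham).
X_train = [
    "win a free prize now",
    "free cash claim now",
    "see you at lunch",
    "are we meeting today",
]
y_train = [1, 1, 0, 0]

# Vectorize the training messages.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)

# Train the classifier on the word counts.
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# Classify a new message using the same fitted vectorizer.
pred = nb.predict(vect.transform(["claim your free prize"]))
print(pred)  # [1]
```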
Step 9: Measure the accuracy on the test data.
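Accuracy is the share of test messages whose predicted label matches the true label. The sketch below computes it with `accuracy_score`, assuming `CountVectorizer` and `MultinomialNB` as the vectorizer and model. The data is toy data, so the resulting score says nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy training and test data (1 = spam, 0 = ham).
X_train = [
    "win a free prize now",
    "free cash claim now",
    "see you at lunch",
    "are we meeting today",
]
y_train = [1, 1, 0, 0]
X_test = ["free prize waiting claim now", "lunch today"]
y_test = [1, 0]

# Vectorize: fit on the training data, reuse the vocabulary for the test data.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Train, predict, and compare predictions with the true test labels.
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

Because spam is usually the minority class, it can also be worth looking at `sklearn.metrics.confusion_matrix` to see which kind of error the model makes.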
References
I adapted the code from the following sites, modifying it where needed:
https://radimrehurek.com/data_science_python/
https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/
https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Further reference materials:
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/
https://www.geeksforgeeks.org/applying-multinomial-naive-bayes-to-nlp-problems/
https://towardsdatascience.com/naive-bayes-document-classification-in-python-e33ff50f937e
I personally found this post very helpful: https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/
You can find sample datasets on this site: https://blog.cambridgespark.com/50-free-machine-learning-datasets-natural-language-processing-d88fb9c5c8da