Machine learning isn’t rocket science. It can be easy if you learn it the right way, but finding the right way is the challenge.
In supervised machine learning, we train a model (the machine) on a large dataset of labelled examples, and based on this data it can then predict the output for a given input.
Suppose we have a dataset of numbers and their squares. We can train a model on this data and then ask it to predict the square of a given number.
Note that we do not give the model an equation, nor does it try to derive one; it simply predicts the output based on what it has learned from the dataset.
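The squares idea can be sketched in a few lines. This is only an illustration, assuming scikit-learn is installed; the choice of DecisionTreeRegressor and the training range of 0 to 99 are assumptions made for the sketch, not something the example in this post depends on.

```python
from sklearn.tree import DecisionTreeRegressor

# Training data: each number paired with its square (the range is an arbitrary choice).
numbers = [[n] for n in range(100)]    # features must be 2-D: one row per sample
squares = [n * n for n in range(100)]  # labels

model = DecisionTreeRegressor(random_state=0)
model.fit(numbers, squares)

# The model was never given y = x * x; it answers from the examples it has seen.
seen = model.predict([[12]])[0]
print(seen)

# Far outside the training range the tree cannot extrapolate, so the answer is wrong.
unseen = model.predict([[500]])[0]
print(unseen)
```

The second prediction is a useful reminder that the model only knows what the dataset showed it: it does well on inputs that resemble the training data and poorly on inputs that do not.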
The example in this post, a classifier for book review comments, is written in Python using the scikit-learn library.
Step 1 : Reading Dataset
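For this example, I took a JSON dataset, used reviewText as the feature, and used the review rating (the overall field in the dataset) to derive the sentiment, i.e. whether a review is positive or negative. Ratings of 3 or below count as negative.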
| Comment (Feature)   | Rating (Feature) | Sentiment (Label) |
| This is a good book | 5                | POSITIVE          |
| This book is bad    | 1                | NEGATIVE          |
Below is the code snippet that does the above step.
    import json

    dataset_file_name = "./BooksReview.json"


    class Sentiment:
        NEGATIVE = "NEGATIVE"
        POSITIVE = "POSITIVE"


    class Review:
        def __init__(self, reviewText, reviewScore):
            self.reviewText = reviewText
            self.reviewScore = reviewScore
            self.sentiments = self.getSentiments()

        def getSentiments(self):
            # Ratings of 3 or below count as negative, anything higher as positive.
            if self.reviewScore <= 3:
                return Sentiment.NEGATIVE
            else:
                return Sentiment.POSITIVE


    # The file holds one JSON object per line, so parse it line by line.
    reviews = []
    with open(dataset_file_name) as dataset:
        for line in dataset:
            review = json.loads(line)
            reviews.append(Review(review['reviewText'], review['overall']))
Step 2 : Splitting Dataset
The second step is to separate the features from the labels. If you also want to test the model on the same dataset, you need to set aside part of it as test data.
Below is the code snippet that does it.
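    from sklearn.model_selection import train_test_split

    # Hold out 1% of the reviews for testing; random_state makes the split repeatable.
    train, test = train_test_split(reviews, test_size=0.01, random_state=42)

    # Separate the features (review text) from the labels (sentiment).
    train_x = [x.reviewText for x in train]
    train_y = [y.sentiments for y in train]
    test_x = [x.reviewText for x in test]
    test_y = [y.sentiments for y in test]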
I must say, the train, test = train_test_split(...) line above would scare any Java developer. Well, it’s Python; anything is possible.
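To see what the split does without loading the real dataset, here is a small sketch. The (text, sentiment) pairs are hypothetical stand-ins for the Review objects, and test_size=0.25 is chosen only so that the numbers are easy to follow.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the Review objects: (text, sentiment) pairs.
samples = [
    ("good book", "POSITIVE"),
    ("bad book", "NEGATIVE"),
    ("great read", "POSITIVE"),
    ("boring plot", "NEGATIVE"),
    ("loved it", "POSITIVE"),
    ("waste of time", "NEGATIVE"),
    ("well written", "POSITIVE"),
    ("poorly edited", "NEGATIVE"),
]

# test_size=0.25 holds out a quarter of the samples; random_state fixes the shuffle.
train, test = train_test_split(samples, test_size=0.25, random_state=42)

# Separate the features (text) from the labels (sentiment).
train_x = [text for text, _ in train]
train_y = [label for _, label in train]
test_x = [text for text, _ in test]
test_y = [label for _, label in test]

print(len(train_x), len(test_x))
```

With 8 samples, 6 land in the training set and 2 in the test set. Running it again with the same random_state gives the same split, which helps when comparing classifiers later.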
Step 3 : Vectorising Dataset
The next step is to vectorise the training data, that is, to turn each comment into an array of numbers that a classifier can work with.
Let’s try to understand how we vectorise the comments from the table above.
a) We start by creating an array of all the words present in the dataset. This is done programmatically by the fit() method of the scikit-learn vectoriser. Words are lower-cased, and one-letter words such as "a" are ignored by default.
["this", "is", "good", "book", "bad"]
b) This array is usually referred to as a "Bag of Words". Our next step is to vectorise the review comments.
That is done by iterating over each comment and counting how many times each word in our bag of words appears in it.
Suppose we had the comments below to vectorise:
["This is a good book", "This book is bad"]
then the transformed vector array for it would be
[ [ 1, 1, 1, 1, 0 ], [ 1, 1, 0, 1, 1 ] ]
The above step is handled by the transform() method, but the fit_transform() method does both steps in one call.
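Here is a small sketch of the same two comments run through CountVectorizer, so you can compare the library's result with the hand-worked arrays above. One thing to expect: scikit-learn orders the vocabulary alphabetically, so the columns come out in a different order than in the hand-worked example, although the counts are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = ["This is a good book", "This book is bad"]

vectorizer = CountVectorizer()
# fit() learns the vocabulary, transform() counts the words; fit_transform() does both.
counts = vectorizer.fit_transform(comments)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
# ['bad', 'book', 'good', 'is', 'this']

# One row per comment, one column per vocabulary word.
print(counts.toarray().tolist())
# [[0, 1, 1, 1, 1], [1, 1, 0, 1, 1]]
```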
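    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # CountVectorizer gives raw word counts; TfidfVectorizer also down-weights common words.
    # vectorizer = CountVectorizer()
    vectorizer = TfidfVectorizer()

    # Learn the vocabulary from the training comments and vectorise them.
    train_input_vector = vectorizer.fit_transform(train_x)

    # Reuse that vocabulary for the test comments and for two hand-written ones.
    test_input_vector = vectorizer.transform(test_x)
    test_input_vector_2 = vectorizer.transform(["not good", "very good"])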
Note that the test data must be transformed before we send it to the model for prediction, but we must not fit the vectoriser on it. The test comments have to be encoded with the same bag of words that was built from the training data, so that each column stands for the same word the model was trained on.
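The difference between fitting and only transforming can be seen in a short sketch. It uses CountVectorizer and the two comments from the table as a tiny stand-in for the training data, which is an assumption made to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_comments = ["This is a good book", "This book is bad"]
new_comments = ["not good", "very good"]

vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(train_comments)  # learns the vocabulary
new_vectors = vectorizer.transform(new_comments)          # reuses it, learns nothing new

# Both matrices have one column per vocabulary word, so the model sees the same layout.
print(train_vectors.shape, new_vectors.shape)
# (2, 5) (2, 5)

# "not" and "very" were never seen during fit, so they are silently dropped.
print(new_vectors.toarray().tolist())
# [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0]]
```

With such a tiny vocabulary, "not good" and "very good" end up with identical vectors, so no classifier could tell them apart. A real dataset would normally contain both words, which is one reason a large training set matters.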
Step 4 : Predicting Using Classifiers
This is the final step, and it is where all the magic happens. sklearn provides different classifiers, or rather algorithms, that do the predictions.
In the segment below, we test the model with our own review comments, and the model predicts whether the entered review is positive or negative.
We will run the test on SVM, Decision Tree, and Logistic Regression classifiers.
SVM Classifier
    from sklearn import svm

    clf_svm = svm.SVC(kernel='linear')
    clf_svm.fit(train_input_vector, train_y)

    prediction = clf_svm.predict(test_input_vector_2[1])
    print(prediction)
Decision Tree
    from sklearn.tree import DecisionTreeClassifier

    clf_deciTree = DecisionTreeClassifier()
    clf_deciTree.fit(train_input_vector, train_y)

    # Predict the sentiment of "very good" (row 1 of test_input_vector_2).
    print(clf_deciTree.predict(test_input_vector_2[1]))
Logistic Regression
    from sklearn.linear_model import LogisticRegression

    clf_log = LogisticRegression()
    clf_log.fit(train_input_vector, train_y)

    # Predict the sentiment of "very good" (row 1 of test_input_vector_2).
    print(clf_log.predict(test_input_vector_2[1]))
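The held-out test data from Step 2 can also be used to compare the three classifiers. The sketch below is one way to do that with the score() method, which reports the fraction of test comments labelled correctly. The handful of comments here is hypothetical toy data so that the snippet runs on its own; the numbers it prints say nothing about the real book review dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the book reviews.
train_x = [
    "This is a good book", "I loved this book",
    "A great and moving story", "Really good writing",
    "This book is bad", "I hated this book",
    "A boring and dull story", "Really bad writing",
]
train_y = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4
test_x = ["a good story", "a bad story"]
test_y = ["POSITIVE", "NEGATIVE"]

# Fit the vocabulary on the training comments only, then reuse it for the test comments.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_x)
test_vectors = vectorizer.transform(test_x)

classifiers = {
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
}

accuracy = {}
for name, clf in classifiers.items():
    clf.fit(train_vectors, train_y)
    # score() returns the fraction of test comments labelled correctly (0.0 to 1.0).
    accuracy[name] = clf.score(test_vectors, test_y)
    print(name, accuracy[name])
```

On the real dataset you would pass train_input_vector, train_y, test_input_vector, and test_y from the earlier steps instead of the toy lists.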
Some useful links
My Github repo for this project: https://github.com/thatsrohitnaik/Machine-Learning-Comment-Review
Fit_transform method explanation : https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Decision Tree Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
SVM Classifier : https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Logistic Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Hope you liked the post!
Happy Coding.