Supervised Machine Learning with the scikit-learn Library in Python: Covers Dataset Reading, Splitting, Vectorising, and Predicting Using Classifiers

Machine learning isn't rocket science. It can be easy if you learn it the right way, but finding the right way to learn it is a challenge.

In supervised machine learning, we train a model (the machine) with a large dataset of known inputs and outputs. Based on this data, it can then predict the output for a given input.

Suppose we have a dataset of numbers and their squares. We can train a model with this data and then ask it to predict the square of a given number; how close it gets depends on the model and on how well the training data covers that number.

Note that we are not giving the model any equation, nor does it try to derive one; it just predicts the output based on what it has learned from the dataset.
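Here is a minimal sketch of the number-and-square idea. It assumes a decision tree regressor as the model and the numbers 1 to 100 as training data; both are my choices for illustration, not something the idea requires.

```python
from sklearn.tree import DecisionTreeRegressor

# Training data: the numbers 1..100 (inputs) and their squares (outputs)
numbers = [[n] for n in range(1, 101)]
squares = [n * n for n in range(1, 101)]

# No equation is given to the model; it only sees input/output pairs
model = DecisionTreeRegressor(random_state=0)
model.fit(numbers, squares)

# A number that was part of the training data
seen = model.predict([[12]])[0]

# A number far outside the training data
unseen = model.predict([[150]])[0]

print(seen, unseen)
```

For 12 the tree returns the square it memorised. For 150 it can only fall back on the closest thing it has seen, the square of 100, which shows that the model has not derived the equation.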

The example in this post is written in Python using the scikit-learn library.

Step 1 : Reading Dataset

For this example, I took a JSON dataset of book reviews. The review text (the reviewText field) is the feature, and the rating (the overall field) is used to derive the sentiment, i.e. whether the review is positive or negative.

Comment (Feature)	Rating (used to derive the label)	Sentiments (Label)
This is a good book	5	POSITIVE
This book is bad	1	NEGATIVE

Below is the code snippet that does the above step.

import json

dataset_file_name = "./BooksReview.json"


class Sentiment:
    NEGATIVE = "NEGATIVE"
    POSITIVE = "POSITIVE"


class Review:
    def __init__(self, reviewText, reviewScore):
        self.reviewText = reviewText
        self.reviewScore = reviewScore
        self.sentiments = self.getSentiments()

    def getSentiments(self):
        # Ratings of 3 or lower count as negative, 4 and 5 as positive
        if self.reviewScore <= 3:
            return Sentiment.NEGATIVE
        else:
            return Sentiment.POSITIVE


reviews = []

# The file holds one JSON object per line, so parse it line by line
with open(dataset_file_name) as dataset:
    for line in dataset:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
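To make the expected file layout concrete, here is a small sketch that parses two made-up records from an in-memory file instead of BooksReview.json. The sample records and the to_sentiment helper are hypothetical; only the reviewText and overall field names come from the code above.

```python
import io
import json

# Two made-up records, one JSON object per line
sample_file = io.StringIO(
    '{"reviewText": "This is a good book", "overall": 5.0}\n'
    '{"reviewText": "This book is bad", "overall": 1.0}\n'
)


def to_sentiment(score):
    # Same rule as Review.getSentiments: 3 or lower is negative
    return "NEGATIVE" if score <= 3 else "POSITIVE"


records = []
for line in sample_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines instead of failing on them
    data = json.loads(line)
    records.append((data["reviewText"], to_sentiment(data["overall"])))

print(records)
```

Each line is parsed on its own, which is why json.loads() is called inside the loop rather than once on the whole file.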

Step 2 : Splitting Dataset

The second step is to split the dataset into features and labels. If you also want to test the model using the same dataset, you will have to set aside part of it as test data as well.

Below is the code snippet that does it:

from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.01, random_state=42)

train_x = [x.reviewText for x in train]
train_y = [y.sentiments for y in train]

test_x = [x.reviewText for x in test]
test_y = [y.sentiments for y in test]

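I must say, the second line of code above (train, test = ...), which unpacks two return values in a single assignment, would scare any Java developer. Well, it's Python; anything is possible.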
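To see what the split does, here is a small sketch on ten made-up samples. The sample data and the test_size of 0.2 are assumptions chosen so the sizes are easy to check; the post itself holds back 1% of the reviews.

```python
from sklearn.model_selection import train_test_split

# Ten made-up (text, label) pairs standing in for the Review objects
samples = [("review %d" % i, "POSITIVE" if i % 2 else "NEGATIVE") for i in range(10)]

# Hold back 20% for testing; random_state makes the shuffle repeatable
train, test = train_test_split(samples, test_size=0.2, random_state=42)

# Separate features and labels, as in the post
train_x = [text for text, label in train]
train_y = [label for text, label in train]
test_x = [text for text, label in test]
test_y = [label for text, label in test]

# Running the split again with the same random_state gives the same result
train_again, test_again = train_test_split(samples, test_size=0.2, random_state=42)

print(len(train), len(test))
```

Every sample ends up in exactly one of the two lists, so the model is never tested on a review it was trained on.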

Step 3 : Vectorising Dataset

The next step is to vectorise the training data, that is, to turn each comment into an array of numbers.

Let's try to understand how we would vectorise the comments from the table above.

a) We start by creating an array of all the words present in the dataset. This is done programmatically using the fit() method provided by the scikit-learn library. With its default settings, single-character words such as "a" are left out.

  ["this", "is", "good", "book", "bad"]

b) This array is usually referred to as a "Bag of Words". After creating it, our next step is to vectorise the review comments.

This is done by iterating over each comment and counting how many times each word from our bag of words appears in it.

If we had the comments below to vectorise

["This is a good book", "This book is bad"]

then the transformed vector array for it would be

[ [ 1, 1, 1, 1, 0 ], [ 1, 1, 0, 1, 1 ] ]
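Steps a) and b) can be sketched by hand in plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the tokenize helper is hypothetical and assumes words are lowercased and must be at least two characters long.

```python
import re

comments = ["This is a good book", "This book is bad"]


def tokenize(text):
    # Lowercase the text and keep words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())


# Step a: collect every distinct word, in order of first appearance
bag_of_words = []
for comment in comments:
    for word in tokenize(comment):
        if word not in bag_of_words:
            bag_of_words.append(word)

# Step b: for each comment, count how often each word of the bag occurs
vectors = [
    [tokenize(comment).count(word) for word in bag_of_words]
    for comment in comments
]

print(bag_of_words)  # ['this', 'is', 'good', 'book', 'bad']
print(vectors)       # [[1, 1, 1, 1, 0], [1, 1, 0, 1, 1]]
```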

The above step is handled by the transform() method, but by using the fit_transform() method you can do both steps in one call.
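As a quick check, here is what CountVectorizer returns for the two example comments. One difference from the hand-worked example: scikit-learn keeps its vocabulary in alphabetical order, so the columns are arranged differently even though the counts are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = ["This is a good book", "This book is bad"]

vectorizer = CountVectorizer()

# fit_transform() builds the vocabulary and vectorises in one call
count_matrix = vectorizer.fit_transform(comments)

# vocabulary_ maps each word to its column index; sort words by that index
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# The result is a sparse matrix; convert it to plain lists for printing
counts = count_matrix.toarray().tolist()

print(vocabulary)  # ['bad', 'book', 'good', 'is', 'this']
print(counts)      # [[0, 1, 1, 1, 1], [1, 1, 0, 1, 1]]
```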

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

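# CountVectorizer produces the raw word counts described above
#vectorizer = CountVectorizer()
# TfidfVectorizer also down-weights words that appear in many reviews
vectorizer = TfidfVectorizer()

# Learn the vocabulary from the training data and vectorise it
train_input_vector = vectorizer.fit_transform(train_x)
# Vectorise the test data with that same vocabulary (no fitting)
test_input_vector = vectorizer.transform(test_x)
# Two hand-written comments to try the classifiers on
test_input_vector_2 = vectorizer.transform(["not good", "very good"])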

Note that the test data needs to be transformed before we can send it to the model for prediction, but we must not fit the vectoriser on it. The model expects vectors built from the bag of words learned from the training data, so we reuse that vocabulary; words that were not seen during training are simply ignored.
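A small sketch of why transform() alone is enough for new comments, using the two example comments as stand-in training data (an assumption for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# The vocabulary is learned from the training comments only
train_vectors = vectorizer.fit_transform(["This is a good book", "This book is bad"])

# "very" was never seen during training, "good" was
new_vectors = vectorizer.transform(["very good"])

# Both matrices have one column per word of the training vocabulary
same_width = train_vectors.shape[1] == new_vectors.shape[1]

# Number of non-zero entries in the new vector: only "good" is recognised
known_words = new_vectors.nnz

print(same_width, known_words)
```

Because the new vector has the same columns as the training vectors, it can be passed straight to a classifier trained on them.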

Step 4 : Predicting Using Classifiers

Here is where all the magic happens, and it is the final step. sklearn provides different classifiers, or rather algorithms, that do the predictions.

In the segment below, we will test the model with our own review comments, and the model will predict whether the entered review is positive or negative.

We will be running the test on the SVM, Decision Tree, and Logistic Regression classifiers.

SVM Classifier

from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_input_vector, train_y)
# Row 1 of test_input_vector_2 is the comment "very good"
prediction = clf_svm.predict(test_input_vector_2[1])

print(prediction)

Decision Tree

from sklearn.tree import DecisionTreeClassifier

clf_deciTree = DecisionTreeClassifier()
clf_deciTree.fit(train_input_vector, train_y)
print(clf_deciTree.predict(test_input_vector_2[1]))

Logistic Regression

from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_input_vector, train_y)
print(clf_log.predict(test_input_vector_2[1]))
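The test split from Step 2 can also be used to compare the three classifiers. The sketch below is one way to do that, assuming accuracy as the measure; the tiny dataset is made up so that the example runs on its own, and the scores it produces say nothing about the real book reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test data, for illustration only
train_x = [
    "This is a good book", "A very good read", "Great story and good writing",
    "I loved this book", "This book is bad", "A very bad read",
    "Boring story and bad writing", "I hated this book",
]
train_y = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4
test_x = ["a good story", "bad and boring"]
test_y = ["POSITIVE", "NEGATIVE"]

# Fit on the training data, then reuse the vocabulary for the test data
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_x)
test_vectors = vectorizer.transform(test_x)

classifiers = {
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(),
}

# Train each classifier and record the share of test labels it gets right
scores = {}
for name, clf in classifiers.items():
    clf.fit(train_vectors, train_y)
    predictions = clf.predict(test_vectors)
    scores[name] = accuracy_score(test_y, predictions)
    print(name, scores[name])
```

With the real dataset you would pass train_input_vector, train_y, test_input_vector, and test_y from the earlier steps instead of the made-up lists.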

Some useful links

My GitHub repo for this project: https://github.com/thatsrohitnaik/Machine-Learning-Comment-Review

Fit_transform method explanation : https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Decision Tree Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

SVM Classifier : https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Logistic Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Hope you liked the post!

Happy Coding.