#Gaming #VideoGame #MachineLearning #AI #NLP #NaturalLanguageProcessing #Amazon #Reviews #Python #DataAnalysis #FeatureEngineering
The first step in any Machine Learning or Data Analysis project is retrieving the data. (https://jmcauley.ucsd.edu/data/amazon/)
About the Data: The data contains user ratings and reviews of different Video Game products on Amazon before 2016.
(GitHub for the ipynb file: https://github.com/Nath9319/VideoGame_Sentiment_Analysis_Amazon_Review/blob/main/VideoGame_Sentiment_Analysis_Amazon_Review.ipynb)
First 5 rows of the Data:
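A minimal sketch of loading the reviews and looking at the first rows, assuming the download is a gzipped JSON-lines file. The file name and the column names (`overall`, `verified`, `reviewText`, `summary`) are assumptions for illustration and may differ in your copy; a tiny sample file is written first so the snippet runs on its own.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records. The field names are assumed, not taken
# from the original notebook, and may differ in the real download.
sample_records = [
    {"overall": 5.0, "verified": True, "reviewText": "Great game!", "summary": "Love it"},
    {"overall": 1.0, "verified": False, "reviewText": "Broken disc.", "summary": "Bad"},
]

# Write a tiny gzipped JSON-lines file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

# One JSON object per line, hence lines=True.
df = pd.read_json(sample_path, lines=True, compression="gzip")
print(df.head())
```

For the real data, point `sample_path` at the downloaded file instead of the generated sample.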
The first step towards building a Machine Learning approach is cleaning the data.
As we are only analyzing the sentiment of the text, we do not need most of the features here. But we also need to make sure we do not drop any important features that could give better results.
Data Analysis:
First, let's check how the rating counts are distributed for Verified and Non-Verified accounts.
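One way to get these counts is a cross-tabulation. This is a sketch on made-up rows, assuming a boolean `verified` column and a numeric `overall` rating column (both names are assumptions).

```python
import pandas as pd

# Made-up rows for illustration only; not the real review data.
df = pd.DataFrame({
    "overall": [5, 5, 4, 1, 5, 2, 3, 5],
    "verified": [True, True, True, True, False, False, False, False],
})

# Rows: verified flag, columns: star rating, cells: number of reviews.
rating_counts = pd.crosstab(df["verified"], df["overall"])
print(rating_counts)

# Same table as shares within each group, which is easier to compare
# when the two groups have very different sizes.
rating_share = pd.crosstab(df["verified"], df["overall"], normalize="index")
print(rating_share.round(2))
```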
From this we can conclude two things.
1) Most users are satisfied with what they got, probably because buyers of a game tend to know the game or read reviews of it beforehand.
2) For the Non-Verified users we see a similar trend, but the margin is smaller. This could point to impulse buying or a more impromptu approach by the customer. With further analysis of these Non-Verified users, and with more information, we might be able to find a way to increase the number of such impromptu buyers. Here, however, we only perform NLP on our data.
Here we create our dependent variable.
Our data has 5 rating levels, from 1 to 5, with 5 being the highest and 1 the lowest. To identify the sentiment of the reviews/texts, I need to split the data into positive and negative texts. Usually reviews below 3 are negative and reviews above 3 are positive. Hence I dropped the 3-star reviews and transformed the data into a binary classification dataset. Now the distribution looks like:
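The labelling rule above can be sketched as follows. The column names `overall` and `sentiment` are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up ratings for illustration only.
df = pd.DataFrame({
    "overall": [1, 2, 3, 4, 5, 5],
    "reviewText": ["awful", "meh", "okay", "good", "great", "superb"],
})

# Drop the neutral 3-star reviews, then label: above 3 -> 1 (positive),
# below 3 -> 0 (negative).
df = df[df["overall"] != 3].copy()
df["sentiment"] = (df["overall"] > 3).astype(int)

# Class distribution of the resulting binary target.
print(df["sentiment"].value_counts())
```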
I have mainly used the text data in this model, and added some features of my own.
Some of the features are:
Number of negative words, number of special characters, etc.
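A sketch of how two such features could be computed. The word list, the helper names, and the column names are hypothetical; a real negative-word lexicon would be much larger.

```python
import re

import pandas as pd

# Hypothetical, deliberately tiny negative-word list for illustration.
NEGATIVE_WORDS = {"bad", "broken", "boring", "worst", "waste"}


def count_negative_words(text):
    """Count tokens that appear in the negative-word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in NEGATIVE_WORDS for token in tokens)


def count_special_characters(text):
    """Count characters that are not letters, digits, or whitespace."""
    return len(re.findall(r"[^A-Za-z0-9\s]", text))


df = pd.DataFrame({
    "reviewText": [
        "Worst game ever!!! Total waste of money.",
        "Great fun, works well.",
    ]
})
df["n_negative_words"] = df["reviewText"].apply(count_negative_words)
df["n_special_chars"] = df["reviewText"].apply(count_special_characters)
print(df)
```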
To check how much information these features add to the model, their feature importance is shown below.
The created features add some information to the model.
However, some of these features are highly correlated with each other.
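One way to spot highly correlated features is to scan the absolute correlation matrix for pairs above a threshold. The feature names, values, and the 0.9 threshold here are all hypothetical.

```python
import pandas as pd

# Made-up feature values for illustration only.
features = pd.DataFrame({
    "n_words": [10, 20, 30, 40, 50],
    "n_chars": [52, 101, 149, 203, 251],
    "n_special_chars": [3, 1, 4, 1, 5],
})

# Absolute Pearson correlation between every pair of features.
corr = features.corr().abs()

THRESHOLD = 0.9  # assumed cut-off, tune for your own data
correlated_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > THRESHOLD
]
print(correlated_pairs)
```

Pairs that show up here are candidates for dropping one of the two features.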
Text Cleaning:
As with almost all text available on the internet, this text also needs to be cleaned before we can proceed.
Hence I deconstructed all the phrases and removed all the links and special characters from the text data.
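A sketch of such a cleaning function, assuming that "deconstructing phrases" means expanding contractions such as "can't". The contraction map and the function name are hypothetical and far from complete.

```python
import re

# Hypothetical, incomplete contraction map. Specific forms come first so
# they are replaced before the generic "n't" rule.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "can not",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
    "'m": " am",
}


def clean_text(text):
    """Lowercase, strip links, expand contractions, drop special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces


print(clean_text("I can't stop playing!!! See https://example.com/review #1"))
```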
The next step is stemming the data. The exact form of each word is not important here, as we only want to understand the sentiment of the text.
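A minimal stemming sketch, assuming NLTK's `PorterStemmer`; the original notebook may use a different stemmer, and `stem_text` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_text(text):
    """Stem every whitespace-separated token and join them back."""
    return " ".join(stemmer.stem(token) for token in text.split())


# Different inflections of the same word collapse to one stem.
print(stem_text("playing played plays"))
```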
To understand how the words in the two classes differ from each other, I used a WordCloud representation. There is not much difference in the text summary, but there is a lot of difference in the actual review.
Negative Reviews:
Positive Reviews:
The next step is to convert our text into numbers.
There are a number of ways to go about this; the two I have used here are TF-IDF and BagOfWords. To find which one is the better fit for our models, we can use visualization techniques like TSNE or PCA, and then proceed based on those results.
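A sketch of both transformations with scikit-learn, followed by a 2D PCA projection that could be plotted per class. The corpus is made up, and converting to a dense array with `toarray()` is only practical for a small sample like this one.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up, already cleaned reviews for illustration only.
corpus = [
    "great game love it",
    "love the story great fun",
    "waste of money broken disc",
    "boring game total waste",
]

# Bag of Words: raw token counts per document.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)

# TF-IDF: counts re-weighted so that common words matter less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Project each representation to 2D; the points can then be scattered
# and colored by class to compare how well the classes separate.
points_bow = PCA(n_components=2).fit_transform(X_bow.toarray())
points_tfidf = PCA(n_components=2).fit_transform(X_tfidf.toarray())
print(points_bow.shape, points_tfidf.shape)
```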
Metric: As this is an imbalanced dataset, I decided to go with ROC-AUC instead of accuracy. I have also kept an eye on Precision and Recall, and on Log-Loss to see how close the predicted probabilities are to the true classes.
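These metrics are all available in scikit-learn. A sketch on made-up labels and predicted probabilities, with an assumed 0.5 decision threshold for Precision and Recall:

```python
from sklearn.metrics import log_loss, precision_score, recall_score, roc_auc_score

# Made-up imbalanced labels (1 = positive) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 1, 0, 1]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.95]

# Precision and Recall need hard labels; 0.5 is an assumed threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

auc = roc_auc_score(y_true, y_prob)        # ranking quality, threshold-free
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
loss = log_loss(y_true, y_prob)            # penalizes confident mistakes

print(f"AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f} log-loss={loss:.3f}")
```

ROC-AUC and Log-Loss are computed from the probabilities, while Precision and Recall depend on the chosen threshold.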
Performance: The model performs best on the 2-gram BOW transformation.
I tried many different classification algorithms; XGBoost gave the best results, with an AUC score of 0.98.
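A sketch of this kind of pipeline on made-up data. Two assumptions to note: "2-gram BOW" is read here as unigrams plus bigrams (`ngram_range=(1, 2)`), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the snippet needs no extra package. It does not reproduce the reported score.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Made-up, already cleaned reviews and labels (1 = positive).
texts = [
    "great game love it",
    "really fun and great story",
    "love the graphics great fun",
    "best game ever love it",
    "waste of money broken disc",
    "boring and broken waste of time",
    "worst game ever total waste",
    "bad controls boring story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag of Words over unigrams and bigrams, then gradient-boosted trees
# (a stand-in for XGBoost, not the model used in the article).
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    GradientBoostingClassifier(random_state=0),
)
model.fit(texts, labels)

# Score on the training texts only to show the API; a real evaluation
# needs a held-out test set.
train_probabilities = model.predict_proba(texts)[:, 1]
train_auc = roc_auc_score(labels, train_probabilities)
print(f"training AUC: {train_auc:.3f}")
```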
I would recommend that everyone reading this try it out for themselves. My GitHub repo might help you with more on feature engineering and model selection. I have tried to make this as simple as I can. Please reach out to me with any feedback.