MongoDB for Machine Learning

Earlier this year (May 2018) Microsoft announced ML.NET, an open source and cross-platform machine learning framework built for .NET developers. Being able to integrate custom machine learning into .NET/C# applications is exciting news. Although ML.NET is still a preview release (version 0.5.0 at the time of writing), you can already test drive it to explore the framework’s potential.

There are already a number of tutorials for ML.NET available from Microsoft and third parties. However, their example data sources are mostly flat files in TSV (Tab Separated Values) format. This post is written for the many datasets that are available in JSON format, for unstructured datasets from web events, and for datasets that are already stored in MongoDB.

MongoDB’s flexible document model makes it a convenient place to prepare data for machine learning applications. Datasets can be analysed and reshaped inside the database, and the output of that analysis can be used to train machine learning models.

This post focuses on how to develop an ML.NET sentiment analysis classifier using data stored in MongoDB. It is based on Microsoft’s Tutorial: Use ML.NET in a sentiment analysis binary classification, with these notable differences:

The training dataset is in JSON format.
It reads from MongoDB as its data source instead of a file.
It uses .NET Core (Ubuntu/Linux).

The full code example and data can be found on github.com/sindbach/mlnet_mongodb. I would recommend reviewing Microsoft’s tutorial for more information.

The Data

A good machine learning journey always starts with a good dataset. The dataset used here comes from the Yelp Dataset Challenge, which ends on 31 December 2018. The data is ~2.9GB in size and, most importantly, in JSON format.

The part of the dataset that is of interest is the yelp_academic_dataset_review.json file. The sentiment analysis model will be trained on the Yelp reviews to predict whether a review has a positive or negative sentiment.

The following is an example JSON structure from the file:

Two fields in this structure are important: text and stars. The text field contains a user’s review comment, and the stars field contains the star rating, which indicates whether the review is positive or not.
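As a rough illustration, a single review document could be inspected from C# as sketched below. The field names are assumed to match the Yelp review file, but every value is invented for this example, and the class name ReviewShape is hypothetical.

```csharp
using System;
using MongoDB.Bson;

class ReviewShape
{
    static void Main()
    {
        // Invented sample values; only the field names are meant to mirror the Yelp review file.
        // The driver's JSON parser accepts single-quoted strings, which keeps the literal readable.
        string json = "{ 'review_id': 'abc123', 'user_id': 'u456', 'business_id': 'b789', "
                    + "'stars': 4, 'date': '2018-01-15', "
                    + "'text': 'Friendly staff and great coffee.', "
                    + "'useful': 0, 'funny': 0, 'cool': 0 }";

        BsonDocument review = BsonDocument.Parse(json);

        // The two fields that matter for sentiment analysis.
        string text = review["text"].AsString;
        int stars = review["stars"].ToInt32();

        Console.WriteLine($"{stars} stars: {text}");
    }
}
```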

The Database

Time to load the review data into a database. For this post, the data will be loaded into MongoDB Atlas, a cloud hosted database-as-a-service for MongoDB. You can follow MongoDB’s tutorial to create an Atlas FREE tier if you would like to test the data loading as well.

The data can be loaded to MongoDB Atlas using mongoimport. For example, the following command will import a file called yelp_academic_dataset_review.json into the review collection in the yelp database:

Once the import has completed, use either the mongo shell or MongoDB Compass to check the data.

One more preparation step is needed before jumping into the code. Since we’re trying to create a binary classification, we need a binary value that marks each review as positive (1) or negative (0). Fortunately, every document contains a star rating from 1 to 5, where 1 indicates the most negative review and 5 the most positive.

The MongoDB Aggregation Pipeline can be used to add a new field called sentiment to the dataset, with a value derived from the stars rating. The sentiment value is determined with the following logic: any review with a stars value greater than 3 is positive, and any review with a stars value less than or equal to 3 is negative.

For example, use the $addFields stage to add the new field and $out stage to store the output into a separate collection:
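The same two stages can also be issued from C# with the MongoDB .NET/C# driver. The sketch below is one possible way to express them, not necessarily the pipeline used for this post. It assumes the source collection is review in the yelp database, that the output goes to review_train, and that the connection string is supplied through a hypothetical MONGODB_URI environment variable.

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class AddSentimentField
{
    static void Main()
    {
        // Assumption: MONGODB_URI is a hypothetical environment variable holding the Atlas URI.
        string uri = Environment.GetEnvironmentVariable("MONGODB_URI");
        var client = new MongoClient(uri);
        var reviews = client.GetDatabase("yelp").GetCollection<BsonDocument>("review");

        // $addFields: sentiment = 1 when stars > 3, otherwise 0.
        var addSentiment = new BsonDocument("$addFields",
            new BsonDocument("sentiment",
                new BsonDocument("$cond", new BsonArray
                {
                    new BsonDocument("$gt", new BsonArray { "$stars", 3 }),
                    1,
                    0
                })));

        // $out: write the result into a separate collection.
        var writeOut = new BsonDocument("$out", "review_train");

        PipelineDefinition<BsonDocument, BsonDocument> pipeline =
            new[] { addSentiment, writeOut };

        // Running the pipeline creates (or replaces) the review_train collection.
        reviews.Aggregate(pipeline).ToList();
    }
}
```

Because $out replaces the target collection, point it at a collection you are happy to overwrite.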

Note: You can also find a small portion of the JSON data on github.com/sindbach/mlnet_mongodb: data. The training data consists of 5000 positive reviews and 5000 negative reviews.

The Code

This post uses .NET Core, a free and open-source managed framework for Windows, macOS and Linux. The only two dependencies for the project are:

MongoDB .NET/C# driver version 2.7.0
ML.NET version 0.5.0

The SentimentData class is modified as follows to serialize and/or deserialize the review document structure from MongoDB:
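A minimal sketch of such a class is shown below. It is an approximation based on the description in this post rather than a copy of the repository code. It assumes that ML.NET 0.5.0 reads public fields by name, so the sentiment field is exposed as a field called Label, and the field name SentimentText is borrowed from Microsoft’s tutorial.

```csharp
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Skip any document fields that have no matching member in this class.
[BsonIgnoreExtraElements]
public class SentimentData
{
    // MongoDB's _id, read as a string instead of an ObjectId.
    [BsonId]
    [BsonRepresentation(BsonType.ObjectId)]
    public string Id;

    // The "sentiment" field (1 or 0), exposed to ML.NET as the Label column.
    [BsonElement("sentiment")]
    public float Label;

    // The "text" field, holding the review comment.
    [BsonElement("text")]
    public string SentimentText;
}
```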

The BsonIgnoreExtraElements attribute tells the driver to ignore any document fields that have no matching class member, which leaves only _id, sentiment (mapped to Label), and text. These are the fields we will use for training.

Next, we instantiate a MongoClient object to connect to MongoDB using a connection string URI:

Using the MongoClient object, we can access the data in the yelp database and review_train collection:
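The connection and collection lookup might look like the following sketch. It reads the connection string from a hypothetical MONGODB_URI environment variable and uses BsonDocument as the document type so that the example stands on its own; the class name ConnectExample is also hypothetical.

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class ConnectExample
{
    static void Main()
    {
        // Assumption: MONGODB_URI is a hypothetical environment variable holding the Atlas URI.
        string uri = Environment.GetEnvironmentVariable("MONGODB_URI");
        var client = new MongoClient(uri);

        IMongoDatabase database = client.GetDatabase("yelp");
        IMongoCollection<BsonDocument> collection =
            database.GetCollection<BsonDocument>("review_train");

        // Fetch a single document as a sanity check that the data is reachable.
        BsonDocument sample = collection.Find(new BsonDocument()).Limit(1).FirstOrDefault();
        Console.WriteLine(sample == null ? "review_train is empty" : sample.ToJson());
    }
}
```

For training, the collection would be typed instead, for example `database.GetCollection<SentimentData>("review_train")`, so that the driver deserializes each document into a SentimentData object.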

The ML.NET LearningPipeline requires an enumerable object, which we can easily get by invoking Find() on the collection:
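One way to wire this together is sketched below as a hypothetical generic helper, SentimentTrainer.Train. The featurizer and trainer settings are borrowed from Microsoft’s tutorial for ML.NET 0.5.0 as an assumption; this post does not state which ones were used. The data class is expected to expose a Label field, and textColumn names its review text field.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using MongoDB.Driver;

public static class SentimentTrainer
{
    // Hypothetical helper: loads every document from the collection and trains a
    // binary classifier on it using the ML.NET 0.5.0 LearningPipeline API.
    public static PredictionModel<TData, TPrediction> Train<TData, TPrediction>(
        IMongoCollection<TData> collection, string textColumn)
        where TData : class
        where TPrediction : class, new()
    {
        // An empty filter matches every document; ToList() materialises them in memory.
        List<TData> reviews = collection.Find(FilterDefinition<TData>.Empty).ToList();

        var pipeline = new LearningPipeline();

        // Use the in-memory collection as the data source instead of a text file.
        pipeline.Add(CollectionDataSource.Create(reviews));

        // Turn the review text into a numeric feature vector.
        pipeline.Add(new TextFeaturizer("Features", textColumn));

        // Assumption: trainer and settings taken from Microsoft's tutorial.
        pipeline.Add(new FastTreeBinaryClassifier
        {
            NumLeaves = 5,
            NumTrees = 5,
            MinDocumentsInLeafs = 2
        });

        return pipeline.Train<TData, TPrediction>();
    }
}
```

With the SentimentData class and a prediction class such as the tutorial’s SentimentPrediction, a call could look like `SentimentTrainer.Train<SentimentData, SentimentPrediction>(collection, "SentimentText")`. Loading everything with ToList() is reasonable for the 10,000-review training set, but a larger collection may call for a filter or a limit.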

To test the sentiment analysis model, we’ll fetch four current reviews displayed on Yelp for restaurants in Sydney, Australia:

“Very bad service and low quality of coffee too. Waiting for so long even tried to rush them already.”
“This place is amazing!! I had the classic cheese burger with fries. Hands down the best burger I have ever had.”
“If I could give zero stars I would. Terribly overpriced. Dried over cooked barramundi with no seasoning or flavor at all.”
“Small menu but the food is quite good. It’s fast and easy, one of the better options around the area. We had the seafood laksa and seafood Pad Kee Mao.”

The prediction results are:

Note: You can find the full code example on github.com/sindbach/mlnet_mongodb: sentiment.

Loading and reading data from MongoDB as an ML.NET data source is straightforward. The potential of using ML.NET to integrate machine learning with datasets stored in MongoDB is exciting, and I’m looking forward to future releases of ML.NET.