As machine learning has become more accessible to businesses and the number of products on the market has risen, we are regularly asked, as leaders in data and analytics, to recommend or at least provide insight into some of those products. For this use case, we chose Amazon Web Services and Microsoft Azure because the two are comparable in the market and both are cloud-based offerings, which gives businesses shorter setup times and lower ongoing costs. The choice was also based on our familiarity with Microsoft and Amazon Web Services products.
Both Microsoft and Amazon can achieve outcomes in different ways. Any perceived strengths and weaknesses highlighted here are based on my personal experience using the products. Do note that there are frequent enhancements carried out by both Microsoft and Amazon. This document is based on research conducted in September 2016. It is possible that there may have been some changes since the time of writing this.
Use case – The use case is centred around a common retail-related problem. A business offers services to clients. The sales promotion activities within that business were not targeted at specific groups of clients, because the business didn’t have any in-depth knowledge of which services to push to which segment of clients. Below is a comparison of Amazon’s and Microsoft’s Machine Learning solutions based on my findings when trying to solve this use case.
Algorithms – Azure Machine Learning allows multiple ways to try and solve a given problem
Azure offers a wide range of algorithms to choose from, whereas Amazon had only one. A simple Google search for an Azure Machine Learning cheat sheet will provide you with good guidance on which algorithms would be best for a particular problem. Amazon, however, is limited to logistic regression, as confirmed in the Amazon Machine Learning Frequently Asked Questions page (https://aws.amazon.com/machine-learning/faqs/). While logistic regression has its place, it is only ideal when there is a single decision boundary. Most use cases will require trialling multiple algorithms to achieve a good outcome.
Findings – If you need to predict whether or not a client will make another purchase, then logistic regression would be ideal. However, when you need to understand which services some clients are more likely to purchase, more capability is needed from the algorithms to generate a good outcome. This is where Azure led the way.
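The single-boundary limitation can be illustrated with a small sketch. The Python below is illustrative only and is not Amazon's implementation: it searches a grid of weights for a linear decision rule (the rule that logistic regression applies at a 0.5 probability threshold) and reports the best accuracy it can reach on two toy datasets. The datasets, the weight grid and the function names are assumptions made for this example.

```python
from itertools import product

# Toy data (hypothetical): each item is ((feature_1, feature_2), label).
AND_POINTS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_POINTS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def predict(weights, bias, x):
    """Linear decision rule: predict 1 when the weighted sum is positive."""
    score = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if score > 0 else 0


def best_linear_accuracy(points, grid):
    """Try every (w1, w2, bias) on the grid and keep the best accuracy."""
    best = 0.0
    for w1, w2, bias in product(grid, repeat=3):
        correct = sum(predict((w1, w2), bias, x) == y for x, y in points)
        best = max(best, correct / len(points))
    return best


# Candidate weights from -4.0 to 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]

and_accuracy = best_linear_accuracy(AND_POINTS, GRID)  # one boundary is enough
xor_accuracy = best_linear_accuracy(XOR_POINTS, GRID)  # one boundary is not enough

print(and_accuracy, xor_accuracy)
```

The first dataset can be separated by a single straight line, so a linear rule classifies every point correctly. The second needs two boundaries, so no single line gets all four points right, which is the kind of problem where trialling other algorithms pays off.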
Azure Machine Learning Cheat Sheet
Categorise Data – Amazon Machine Learning categorises the source data for you
Amazon automatically pre-processes the data and categorises each field. The possible categories are Categorical, Numeric, Binary or Text. If a user wishes to, they can change the category of the data which may have an effect on the final outcome. If the data contains a row identifier, it can be declared and will be ignored when making predictions.
A binary field is one which can only have one of two values. In Amazon, a binary field can only contain one of the following pairs of values (case insensitive):
- Y, N
- Yes, No
- True, False.
It is not capable of understanding any other combination even if only two values exist in the field (e.g. paid, unpaid).
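One way to work around this restriction is to recode a two-valued field before uploading the data. The following is a minimal sketch, assuming the field is available as a list of strings; the function name and the sample values are hypothetical, and the Yes/No tokens are taken from the list above.

```python
def recode_binary(values, positive, negative):
    """Map a two-valued field onto the Yes/No tokens listed above."""
    mapping = {positive.lower(): "Yes", negative.lower(): "No"}
    recoded = []
    for value in values:
        key = value.strip().lower()
        if key not in mapping:
            # Fail loudly rather than silently guessing a category.
            raise ValueError(f"unexpected value: {value!r}")
        recoded.append(mapping[key])
    return recoded


# Hypothetical sample of a morning/afternoon indicator column.
am_pm = ["AM", "pm", "PM", "am "]
recoded = recode_binary(am_pm, positive="PM", negative="AM")
print(recoded)
```

Here "Yes" means an afternoon purchase. Which value to treat as positive is a modelling choice and should be recorded, because it determines how the predicted label is read later.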
In Azure, there is no option to classify data fields. The classification is done automatically. While this is very intuitive, I would prefer to have the ability to manipulate fields as required. It is worth noting that Azure had no difficulty understanding that a field containing only two values was a binary data set. This reduced the amount of data manipulation that was required before creating a model and made the output easier to read.
AWS automatically proposed data categories
Findings – I had a column which indicated whether a purchase was made in the morning or the afternoon (AM/PM). Amazon wasn’t able to see this as a binary field. Forcing it to be a binary field and then trying to predict whether a service was required in the morning or afternoon caused an error.
Source Data – Amazon gives good visibility of the content within the source data
Amazon grouped the fields into the categories identified above. The target field is the field that we are attempting to predict using machine learning. It was possible to view each field and understand its correlation to the target field. The distribution view for categorical data was useful because it identified the top 10 attributes and the number of occurrences of each. A bar chart helped to understand the distribution of attributes in each field.
AWS source data visualisation of a single field
Azure too has the ability to show each field and the distribution of attributes within it. It only showed the top 10 attributes for each field and therefore it was not easy to understand the proportion of the data that did not make the top 10. The correlation of a field to the target was not shown in the example that I worked on. This is most probably due to the availability of multiple algorithms. The correlation of a field to the target field would be different for each algorithm.
Azure source data visualisation
Findings – There were more than ten types of popular services. In Azure, I was unable to understand whether the services that didn’t make the top ten were significant or not. This was crystal clear in Amazon because it grouped all services which were not in the top ten and displayed them on the chart as ‘Others’.
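The ‘Others’ grouping is easy to reproduce outside either tool when checking a field by hand. The sketch below is an assumption about how such a summary could be built and is not taken from either product; the function name and the sample service names are hypothetical.

```python
from collections import Counter


def top_n_with_others(values, n=10):
    """Count each attribute, keep the n most common, and pool the rest."""
    counts = Counter(values)
    top = counts.most_common(n)
    remainder = sum(counts.values()) - sum(count for _, count in top)
    if remainder > 0:
        top.append(("Others", remainder))
    return top


# Hypothetical service column, summarised with n=2 to keep the output short.
services = ["haircut"] * 5 + ["colour"] * 3 + ["massage"] * 2 + ["manicure"]
summary = top_n_with_others(services, n=2)
print(summary)
```

Seeing the pooled count next to the top entries shows at a glance how much of the data sits outside the top ten.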
Training a Model – Amazon has a fixed 70/30 split when training a model; Azure allows you to select your desired split
In Amazon, the split between the training and evaluation data sets was fixed at 70% for training and 30% for evaluating. The only method of changing this was to use a different data set for evaluating the model. Based on your use case you may need a different split. In Amazon, you need to do this by manually creating the training and evaluation datasets.
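Creating the two datasets manually can be scripted before upload. This is a minimal sketch, assuming the rows are already loaded into a list; the shuffle, the seed and the function name are choices made for this example rather than anything either product prescribes.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and cut them into two datasets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 data rows
train_70, evaluate_30 = split_rows(rows, train_fraction=0.7)
train_80, evaluate_20 = split_rows(rows, train_fraction=0.8)
print(len(train_70), len(evaluate_30), len(train_80), len(evaluate_20))
```

Each half can then be saved as its own file and supplied as a separate data source, which is how a split other than 70/30 can be trialled.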
In Azure, it was possible to specify the desired split between the data available for training and scoring the model. I would consider this to be an essential feature as the split would need to change based on the problem that you are trying to solve.
Amazon can only do a 70/30 split
Findings – A 70/30 split didn’t cause an issue. However, the ability to trial different splits in Azure was useful for me to understand its impact.
Evaluating a Model – Amazon automatically evaluates the machine learning model
Evaluation happens automatically in Amazon; it occurs immediately after the Machine Learning model is created. In a binary classification scenario, Amazon lets the user change the trade-off, false positive rate, precision, recall and accuracy. Each of these attributes has an effect on all the others. Tweaking them makes it easy to visualise the outcome immediately.
In Azure, evaluation can be added into the experiment as necessary. In a binary classification scenario, it was only possible to change the trade-off threshold which in turn impacts other factors.
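The way the threshold drives the other metrics can be seen in a few lines of code. The sketch below uses the standard definitions of precision, recall, false positive rate and accuracy; the scores and labels are made-up sample data, and the function name is hypothetical.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Classify each score against the threshold and derive the metrics."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(labels),
    }


# Hypothetical model scores and true labels for eight clients.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

at_050 = metrics_at_threshold(scores, labels, 0.5)
at_070 = metrics_at_threshold(scores, labels, 0.7)
print(at_050)
print(at_070)
```

On this sample, raising the threshold from 0.5 to 0.7 removes the false positive, which lifts precision, but it also misses one more genuine purchaser, which lowers recall.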
Amazon lets you change all metrics
Findings – I would consider it to be critical that a machine learning model is evaluated while in development. Doing some manipulation in the data improved the quality of the model. Many iterations are usually required before landing on a good model.
Predictions – Making predictions using Machine Learning
Both Amazon and Azure have options to manually test, batch process and create endpoints for real-time predictions. Batch processing is the more common method of creating predictions from machine learning. It is usually run against a large set of records and therefore takes some time to process.
Testing predictions in Amazon – The easiest way to test a model is to manually enter some values and get a prediction. This can be done by either typing the values into a web form or pasting values separated by commas. The predicted label is displayed on the screen but to see the confidence of the prediction, you need to filter through some code. An excerpt of an example output is below:
Amazon batch predictions – To do this, you first need to convert the data to a CSV (Comma Separated Values) format and upload it to Amazon S3 (Simple Storage Service). Predictions can be created based on this file and will be output to another CSV file in S3. The output file was compressed and saved as a GZ file, a format commonly used in Unix environments. A drawback of the output was that there was no way of finding out which row of the source data matched which row of the output. The only way that I found to get around this was to add a row number column when creating the machine learning model. The row number then appeared in the predictions. To marry up the two, you would need to use a method such as VLOOKUP, which was a bit frustrating. Trial and error proved that the rows in the output were in the same sequence as the rows in the source file.
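The matching step can also be scripted rather than done with VLOOKUP. The following is a minimal sketch, assuming the source file and the decompressed output both carry the row number column; the column names (row_id, predicted_label, score) and the sample rows are hypothetical and will differ from the real files.

```python
import csv
import gzip
import io


def read_csv(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_row_id(source_text, predictions_gz, key="row_id"):
    """Attach each prediction to its source row using the shared row number."""
    predictions = read_csv(gzip.decompress(predictions_gz).decode("utf-8"))
    by_id = {row[key]: row for row in predictions}
    joined = []
    for row in read_csv(source_text):
        match = by_id.get(row[key])
        if match is None:
            raise KeyError(f"no prediction for {key}={row[key]}")
        joined.append({**row, **match})
    return joined


# Hypothetical stand-ins for the source CSV and the GZ output file.
source_text = "row_id,client_segment,time_of_day\n1,A,AM\n2,B,PM\n"
predictions_gz = gzip.compress(b"row_id,predicted_label,score\n1,0,0.12\n2,1,0.87\n")

joined = join_on_row_id(source_text, predictions_gz)
for row in joined:
    print(row)
```

Joining on the row number does not rely on the two files being in the same order, so it stays correct even if that ordering is not guaranteed.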
Amazon endpoint – An endpoint creates an Amazon web service which can be accessed via an API (Application Programming Interface). This is useful when there is a requirement to get predictions on a real-time or ad-hoc basis. Amazon claims that a query will be responded to in 100 milliseconds. They also claim to be able to process up to 200 queries per second. Any extra queries will be queued up and responded to. Higher capacities can be accommodated by contacting Amazon.
Testing predictions in Azure – You are presented with a test option where you can type data into a web form. There was no option to simply paste comma separated values into the webpage like in Amazon. However, the option to download a customised Excel workbook was very intuitive. This workbook consisted of parameters and predicted values side by side. It was very easy to use and responses were received within a second. This method is great if the data set has a small number of fields but may become progressively harder to use if there are many fields.
Azure batch predictions – Similar to the test option, this option also made use of an Excel workbook. However, it was more flexible, allowing you to pick a cell range for the input and output data. I set up my output data on another sheet in the same workbook. I tested 300 rows of predictions and got the results back within one second.
Endpoint in Azure – The endpoint in Azure appeared much more user friendly. There was an API help page which had sample code for a request and response using the data set which you are working on. It also contained sample code for C#, Python and R, which would reduce the complexity of having to write code from scratch.
Azure batch predictions in Excel
Findings – In my use case, I didn’t require real-time predictions. I used batch predictions and it was a hassle to have to manually marry up the original data with the predictions from Amazon.
Costs – Azure’s better features and user interface cost more. Up to five times more!
Please note that pricing may vary based on many factors. These include but are not limited to region, the complexity of the solution, the computing tier chosen, the size/nature of your organisation and any other negotiations entered into with either Microsoft or Amazon.
I performed some high-level cost calculations based on the data set used. An Amazon solution costs approximately USD 100 per month when using 20 hours of computing time and 890,000 predictions. Real-time predictions cost more than batch predictions. However, the difference was not significant for the model that I used (USD 104.84 for real-time vs USD 97.40 for batch predictions).
The same solution using Azure came up to an estimated total cost of just under USD 500.
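The batch figure can be checked with simple arithmetic. The sketch below is an assumption rather than the author's actual calculation: the two rates (USD 0.42 per compute hour and USD 0.10 per 1,000 batch predictions) are not stated in this article and are assumed here because they reproduce the USD 97.40 quoted above. The Azure value is the "just under USD 500" estimate rounded up to 500.

```python
# Assumed rates (not stated in the article); confirm against current pricing.
COMPUTE_RATE_PER_HOUR = 0.42  # USD per hour of compute
BATCH_RATE_PER_1000 = 0.10    # USD per 1,000 batch predictions


def amazon_batch_cost(compute_hours, predictions):
    """Monthly cost as compute time plus a per-prediction fee."""
    compute = compute_hours * COMPUTE_RATE_PER_HOUR
    prediction_fee = predictions / 1000 * BATCH_RATE_PER_1000
    return round(compute + prediction_fee, 2)


amazon_batch = amazon_batch_cost(compute_hours=20, predictions=890_000)
azure_estimate = 500.0  # upper bound taken from the estimate in the text
ratio = round(azure_estimate / amazon_batch, 1)
print(amazon_batch, ratio)
```

Under these assumptions the prediction fee, not the compute time, makes up most of the Amazon bill, and the Azure estimate works out at roughly five times the Amazon batch cost.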
The prices are my observations only and should always be confirmed with the licence provider. For more information on pricing, please see the links below;
Audience – Amazon appears to be more focused towards technical people
Overall, Amazon appeared to be aimed at users who are more technically minded and more comfortable with programming. When learning to use Amazon Machine Learning, I only came across one example, which I would consider to be very limited.
On the other hand, Azure Machine Learning appeared more suited for power users within a business, as well as the technically minded. It provides a more familiar graphical drag-and-drop interface. These components are pre-programmed and are grouped such that they are easy to understand. The examples provided are a great way of getting accustomed to the tool and there are good sample datasets available.
Azure would appeal to most users
If you are more into coding, have technical resources to support you and can manage with only the logistic regression algorithm, then Amazon would be ideal for you. With its lower cost and range of other cloud services it is definitely worth considering. For the rest of us, Azure is clearly the better choice.
It is worth remembering that the many algorithms and the ease of use come with a price tag attached.
Findings – I almost felt spoilt for choice with the number of algorithms in Azure. I spent a lot of time trying to get a good model working in Amazon. The process was much easier when I used the same dataset in Azure.
In conclusion, using Azure I was able to gain an understanding of which groups of clients are most inclined to purchase certain services. This makes it possible to target specific clients and make the best use of the marketing budget. The ability to easily view the predictions along with the fields used for the predictions was very helpful. The predictions will be used to target clients who are more likely to purchase the services. In turn this will create a better outcome for both the business and its clients. Overall, Azure proved to be the more mature offering for someone who wants to solve a particular use case while avoiding deep coding.
We will keep a close eye on the Amazon offering to see when it catches up.