Intelligent algorithms can automatically recognize patterns and correlations in data and use the extracted knowledge to make decisions or calculate predictions. This sounds like enormous potential that can be unleashed at the push of a button, without any human intervention. But is that really the case?
The human factor
The success of a learning algorithm relies on many factors – especially the human factor. After all, it is we who define the problem to be solved, select the appropriate algorithm and collect the necessary data. The data must fit – not only the task itself, but also the chosen algorithm and its underlying mathematical model. A close fit and high data quality are therefore the top priorities in machine learning – and the human being is the master of ceremonies.
Man & machine are only strong as a team
For machine learning we therefore need a deep understanding of the data, which can be gained through visual and statistical analysis. In the best case, we also have expert or industry knowledge of the domain of the task to be solved. Only then can we select suitable algorithms and process the data accordingly. At the end of this phase there is a complete pipeline for the data flow – from the storage location of the data all the way to the predictions or decisions obtained.
But anyone who thinks that this is the whole story is mistaken. This is where the actual work begins. The features already present in the data can and should be used to create new ones. We call this procedure Feature Engineering – it goes hand in hand with visual exploration of the data and constant analysis of the results, and it strongly influences how successful the algorithm can actually be at completing the task. In this continuous improvement process, human know-how is the biggest driver – even when further learning algorithms are used.
Added value of data analysis for customers
The added value of machine learning for companies can best be demonstrated with a concrete example. For this purpose, the sales data of an online retailer from Great Britain were examined. Since storage space is expensive and on-time delivery is important for competitiveness, it would be a tremendous benefit to be able to predict how many units of a product will be sold per day. With meaningful data, this task can be solved with the help of machine learning, and the retailer’s warehousing can be optimized accordingly.
The showcase data contains information about the transactions carried out during a whole year – from December 2010 to December 2011:
- the exact date of the transaction,
- the ID and the country of the customer,
- a description of the purchased product and a stock code ID as warehouse number,
- the number of units purchased,
- the price of a unit of a product,
- an invoice number.
The first impression counts
The first step is to obtain an overview of the quality of the data. Are there missing entries, and why? What characteristics do the attributes show, and how are they distributed? Are there outliers in these distributions, and why? This initial analysis already provides an understanding of how the data was collected and which challenges need to be mastered later on. The data from our online retailer shows many conspicuous features. For example, the customer ID is missing in 25% of all entries. Are these guest orders placed without registration that are not specifically flagged? A small proportion of the data (0.3%) lacks a description of the purchased product; in these entries, the customer ID is also missing and the price of a product unit is zero. This is very odd and indicates that the retailer seems to have lost track of some transactions.
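Such a first quality check is quick to sketch. The snippet below is a minimal illustration – the field names and sample records are invented, not taken from the retailer’s data – counting missing customer IDs, missing descriptions, and the suspicious entries that lack both a customer and a sensible price:

```python
# Illustrative sample of transaction records (field names are assumptions).
transactions = [
    {"customer_id": "17850", "description": "WHITE HANGING HEART", "unit_price": 2.55},
    {"customer_id": None,    "description": "REGENCY CAKESTAND",   "unit_price": 12.75},
    {"customer_id": None,    "description": None,                  "unit_price": 0.0},
    {"customer_id": "13047", "description": "KNITTED UNION FLAG",  "unit_price": 3.39},
]

n = len(transactions)
missing_customer = sum(1 for t in transactions if t["customer_id"] is None)
missing_description = sum(1 for t in transactions if t["description"] is None)
# Entries with no customer AND a zero price are the suspicious "lost" transactions.
lost = sum(1 for t in transactions
           if t["customer_id"] is None and t["unit_price"] == 0)

print(f"missing customer ID: {missing_customer / n:.0%}")
print(f"missing description: {missing_description / n:.0%}")
print(f"suspicious entries:  {lost}")
```

On the real data, the same counts would of course be computed over hundreds of thousands of rows rather than four.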
A more detailed analysis of the product descriptions reveals that the data also includes cancellations, claims, special payments and refunds. However, these are not clearly marked and must be filtered out of the data with considerable effort. For example, the descriptions contain terms such as “damaged”, “??” or “found” that have nothing to do with the name of the associated product.
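Since these entries are not flagged, one pragmatic (and admittedly crude) approach is keyword filtering on the description. The keyword list below is illustrative and would grow with manual inspection of the data:

```python
# Hypothetical keyword list; in practice it is built up by inspecting the data.
SUSPECT_KEYWORDS = {"damaged", "??", "found", "lost"}

def is_product_entry(description):
    """Keep only entries whose description does not look like an internal note."""
    if description is None:
        return False
    text = description.strip().lower()
    return not any(keyword in text for keyword in SUSPECT_KEYWORDS)

descriptions = ["WHITE HANGING HEART", "damaged", "??", "found by warehouse", "JAM JAR"]
kept = [d for d in descriptions if is_product_entry(d)]
print(kept)  # → ['WHITE HANGING HEART', 'JAM JAR']
```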
The historical quantities sold per transaction and the prices of the product units are also particularly important. What happens if they are saved incorrectly? This could disrupt the learning process of the algorithm. An indicator of such errors is extreme outliers in the data, which can indeed be found at the online retailer. In order to be able to work successfully with the data, all entries with such anomalies were excluded.
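One common way to exclude such extreme outliers is the interquartile-range rule; the 1.5 × IQR multiplier below is a widespread convention, not a detail from the showcase:

```python
from statistics import quantiles

# Illustrative per-transaction quantities with one obviously corrupted entry.
quantities = [1, 2, 3, 2, 4, 3, 2, 1, 5, 3, 2, 80000]

q1, _, q3 = quantiles(quantities, n=4)   # first and third quartile
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [q for q in quantities if lower <= q <= upper]
print(cleaned)
```

Whether a flagged entry is a genuine error or just an unusually large order still has to be judged by a human or checked against domain knowledge.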
Possibilities for improvement
Following this first analysis, we have already gained a very important insight: if retailers want to use machine learning to forecast their product sales, they have to develop a consistent approach for special transactions. They need easily identifiable, uniform codes for claims, returns, cancellations and special payments. The first step towards the successful use of machine learning is taken when quality is already assured during data collection. But that’s not all! By identifying the anomalies described above, the retailer is in a position to manage their warehousing processes more efficiently and effectively.
The Machine Learning Pipeline
A pipeline describes the flow of data from the source to the forecast and includes all the stages that can be executed automatically. This includes cleansing and processing the data, but also the learning phase of the selected algorithms and the verification of the learning objectives. The latter is necessary in order to check how well the acquired knowledge generalizes to previously unknown data. In the showcase, all data before December 6, 2011 were used for the learning phase and all data after that for validation.
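Such a chronological split can be sketched in a few lines; the records and the cut-off date here are purely illustrative:

```python
from datetime import date

# Illustrative (date, units sold) records.
records = [
    (date(2011, 3, 14), 12),
    (date(2011, 9, 2), 7),
    (date(2011, 12, 5), 30),
    (date(2011, 12, 10), 25),
]

CUTOFF = date(2011, 12, 6)  # illustrative cut-off between training and validation
train = [r for r in records if r[0] < CUTOFF]
validation = [r for r in records if r[0] >= CUTOFF]
print(len(train), len(validation))  # → 3 1
```

The important point is that the split follows time, not a random shuffle: the model must prove itself on days it has never seen.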
CatBoost was selected to predict the daily sales figures of the individual products. This algorithm is based on an ensemble of consecutive decision trees: within this ensemble, each decision tree tries to further reduce the error of its predecessor. This procedure is called gradient boosting and can also be applied to other machine learning algorithms. Algorithms that use this method have proven particularly powerful in many applications.
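The core idea of gradient boosting can be illustrated with a toy sketch: each weak learner – here a single-split “stump” on one feature – is fitted to the residuals of the ensemble built so far. CatBoost applies the same principle with full decision trees and many refinements:

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best predicts the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit each new stump to the residuals of the ensemble built so far."""
    ensemble = []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * s(x) for s in ensemble)

model = boost([1, 2, 3, 4, 5, 6], [2, 2, 2, 8, 8, 8])
print(round(model(1.5)), round(model(5.0)))  # → 2 8
```

In practice one would use the CatBoost library itself rather than this sketch, but the residual-fitting loop is the essence of gradient boosting.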
As with many learning algorithms, however, the success of CatBoost also depends on external parameters that have to be set by a human. For example, you can choose the maximum number of decision trees the ensemble may consist of, or how strongly the individual trees may react to the errors of their predecessors. It is often not easy to select these so-called hyperparameters optimally, which is why various automated methods exist for finding a good solution. The search for them can span the entire pipeline like an umbrella and include all work steps.
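One of the simplest of these automated methods is random search: draw hyperparameter configurations at random, evaluate each on the validation data, and keep the best. The `evaluate` function below is a stand-in for running the whole pipeline; its internals are invented for illustration:

```python
import random

def evaluate(n_trees, learning_rate):
    # Stand-in for "train on the training set, score on the validation set".
    # The shape of this toy objective is an assumption, not the real error surface.
    return abs(n_trees - 200) / 200 + abs(learning_rate - 0.1)

random.seed(0)
best_error, best_config = float("inf"), None
for _ in range(50):
    config = {
        "n_trees": random.randint(50, 500),
        "learning_rate": random.uniform(0.01, 0.3),
    }
    error = evaluate(**config)
    if error < best_error:
        best_error, best_config = error, config

print(best_config)
```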
Feature Engineering: Wisdom through repetition
As mentioned at the beginning, feature engineering has to be repeated until CatBoost can optimally read the correlations out of the data. The pipeline is therefore not a linear process, but contains a continuous cycle to improve the learning success and the quality of the predictions or decisions. In the case of the online retailer, even simple features based on the following questions can help:
- How many products of a warehouse id were sold on the same weekday last week and the week before last?
- What is the difference?
- What is the already fixed average price of a product unit for the coming week per warehouse number?
- What was the price in the past?
- How many transactions were carried out on average per warehouse id in the past months?
- How many different customers have carried out transactions for a warehouse id in the past?
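Features of this kind are straightforward to compute once daily sales per warehouse ID are available. The sketch below answers the first two questions; the dates and values are invented:

```python
from datetime import date, timedelta

# Illustrative daily sales for one warehouse ID (all Mondays).
daily_sales = {
    date(2011, 11, 21): 14,
    date(2011, 11, 28): 18,
    date(2011, 12, 5): 21,
}

def lag_features(day):
    """Sales on the same weekday one and two weeks earlier, and their difference."""
    last_week = daily_sales.get(day - timedelta(days=7), 0)
    two_weeks_ago = daily_sales.get(day - timedelta(days=14), 0)
    return {
        "sold_last_week_same_weekday": last_week,
        "sold_two_weeks_ago_same_weekday": two_weeks_ago,
        "difference": last_week - two_weeks_ago,
    }

print(lag_features(date(2011, 12, 5)))
```

The remaining questions (average prices, transaction counts, distinct customers per warehouse ID) follow the same pattern of aggregating over past windows.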
This catalogue of questions can be supplemented and extended almost indefinitely. This calls for human creativity – although approaches to automate feature engineering already exist. Even a few new features can significantly increase the success of CatBoost.
In addition, the algorithm provides us with valuable information: which features in the data are particularly important, and which interact strongly with each other? We learn, for example, that the retailer’s pricing policy and the already known average sales figures for the pre-Christmas period are particularly important for the forecast. The retailer can use this knowledge in a targeted manner to set the prices of their products. Furthermore, these insights serve as a starting point for generating new features in the data and further improving the forecasts.
And what happens next? New features are created until either the possibilities are exhausted or we are satisfied with the result. During this process we gain ever deeper insights into our data. What needs do our customers have? Which products are currently in vogue? Can we group customers according to their purchasing behavior? In the end, we do not just get forecasts – we can better understand processes and significantly optimize them. Only through the interaction between man and machine can the potential of machine learning be fully exploited in the future.