As part of my studies, I did an internship in data science at Micromata. Using machine learning models, we investigated how foods can be sorted into groups, so-called clusters, based on their properties. For this purpose, we worked with real-world and sometimes erroneous data, namely the data from Open Food Facts.
What is the Open Food Facts Project?
The Open Food Facts app is an open-source app that shows you the most important information about supermarket food at a glance. For example, users can scan the barcode of a grocery item and directly access its ingredients list, nutrition table and other information; the Nutri-Score, for instance, shows how healthy a product is, visualized as a traffic light. If the app does not yet know a product, a user can add it.
The procedure by which products are added to the database is a major source of error. Since all data must be entered manually (i.e., every single nutritional value), input errors occur easily, whether due to typos or the user's waning motivation. As a result, the data contain many missing or faulty entries, which quickly reduces both the usefulness of the app and the fun of using it.
How can we help Open Food Facts with Machine Learning?
It is essential to prevent such errors so that the app can provide users with accurate Nutri-Scores and recommendations, e.g., hints about allergens. Errors could be prevented, for example, by asking users for feedback, such as: “Is this product pasta?” This would also allow existing errors to be corrected. A prerequisite for this is grouping all foods into so-called clusters according to their nutritional values.
We decided to work with the Gaussian Mixture Model in order to find incorrect entries in the data and group products according to naturally occurring patterns. This learning algorithm can both cluster data and detect anomalies in it. To train the model, we used the most common and meaningful features of the nutrition table: fat, carbohydrates, sugar, salt, proteins and energy.
How can we find natural patterns in the data?
The Gaussian Mixture Model searches for hidden structures in the data. In our case, these are the product categories we are looking for, such as pasta, candy, or beverages. To find these structures, the model needs to be told how many hidden categories to expect, so the number of groups is fixed in advance. We chose 20 groups because we expected many coarse product categories, but only a few overlaps and correlations between the individual groups.
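To make the idea of a fixed number of groups more concrete, here is a minimal sketch of how a Gaussian mixture can be fitted with the expectation-maximization (EM) procedure. It is a toy version on a single feature and is not the code used in the project: the function names (`log_gauss`, `fit_gmm_1d`), the starting values and the sugar numbers are all made up for illustration, and two groups are used instead of 20 to keep it readable.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_gmm_1d(data, k, n_iter=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM (toy version)."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start (an assumption of this sketch): means at evenly
    # spaced quantiles, one shared variance, equal weights.
    means = [xs[n * (2 * j + 1) // (2 * k)] for j in range(k)]
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [max(grand_var, min_var)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: how strongly does each component "claim" each point?
        resp = []
        for x in xs:
            logs = [math.log(w) + log_gauss(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)  # subtract the maximum for numerical stability
            probs = [math.exp(l - top) for l in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty groups
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2
                        for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, min_var)
    return weights, means, variances

# Made-up sugar values (g per 100 g): a low-sugar and a high-sugar group.
sugar = [1.0, 2.0, 3.0, 2.5, 1.5, 58.0, 60.0, 62.0, 61.0, 59.0]
weights, means, variances = fit_gmm_1d(sugar, k=2)
print([round(m, 1) for m in means])
```

The number of components `k` has to be passed in from outside; the procedure only decides where the groups lie, not how many there are.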
The advantage of the Gaussian Mixture Model compared to some other models is the so-called “soft clustering”: the algorithm does not simply assign a single group to a product, but calculates the “responsibility” of each modelled cluster for that data point, i.e., the probability that the point belongs to that cluster, and finally assigns the point to the cluster with the highest responsibility.
In our clusters, we found interesting groupings, for example a cluster of nuts & seeds and a cluster of yogurt & milk, as you can see in the word clouds.
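The soft assignment can be sketched in a few lines. The example below assumes two already-fitted clusters described only by fat and proteins; the weights, means and variances in `CLUSTERS`, the function `responsibilities` and the sample product are invented for illustration and are not values from our model.

```python
import math

# Hypothetical clusters on two features (fat, proteins), in g per 100 g.
# Each entry: (mixing weight, per-feature means, per-feature variances).
CLUSTERS = {
    "yogurt & milk": (0.6, (3.0, 4.0), (4.0, 2.0)),
    "nuts & seeds": (0.4, (50.0, 20.0), (100.0, 25.0)),
}

def responsibilities(point, clusters):
    """Return the share of responsibility each cluster takes for `point`."""
    log_scores = {}
    for name, (weight, means, variances) in clusters.items():
        # Weighted log density of a Gaussian with diagonal covariance.
        log_p = math.log(weight)
        for x, m, v in zip(point, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        log_scores[name] = log_p
    # Normalize so that the responsibilities sum to one.
    top = max(log_scores.values())
    unnorm = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}

almonds = (49.0, 21.0)  # made-up fat and protein values
resp = responsibilities(almonds, CLUSTERS)
best = max(resp, key=resp.get)  # hard assignment: highest responsibility
print(best, resp)
```

Keeping the full dictionary of responsibilities, rather than only `best`, is what distinguishes soft clustering from a hard assignment: a product that sits between two groups shows up as such.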
How can we find special entries in the data?
After manually excluding obvious input errors, e.g., negative values or entries of more than 100 g per 100 g, we searched for entries that deviated strongly from most values in the database. These outliers can be rare, special products or actual errors, which is why they should be checked. With the Gaussian Mixture Model, we estimated how densely the data populate each region of the nutritional space, and thus also where rare data or errors are likely to be found. This allowed us to find abnormal entries: rare products such as protein bars, and errors such as butter flavorings listed without calories or entries with excessively high energy values.
Finding abnormal products can improve the Open Food Facts app considerably. In addition, feedback questions to users can be used to check rare or faulty products. In this way, mistakes could be corrected or prevented as early as the input stage.
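Both steps, the plausibility check and the density-based search, can be illustrated with a small sketch. Everything in it is an assumption made for the example: the field names, the sample entries, the two-cluster mixture on fat and proteins, and the cut-off `LOG_DENSITY_THRESHOLD`. In a real analysis, the mixture would come from the fitted model and the threshold would be chosen from the data.

```python
import math

GRAM_FIELDS = ("fat", "carbohydrates", "sugar", "salt", "proteins")

def is_plausible(entry):
    """Reject negative values and more than 100 g of a nutrient per 100 g."""
    if entry["energy"] < 0:
        return False
    return all(0 <= entry[field] <= 100 for field in GRAM_FIELDS)

def mixture_log_density(point, components):
    """Log density of `point` under a mixture of diagonal Gaussians."""
    logs = []
    for weight, means, variances in components:
        log_p = math.log(weight)
        for x, m, v in zip(point, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        logs.append(log_p)
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Hypothetical mixture on (fat, proteins): (weight, means, variances).
COMPONENTS = [
    (0.6, (3.0, 4.0), (4.0, 2.0)),       # e.g. yogurt & milk
    (0.4, (50.0, 20.0), (100.0, 25.0)),  # e.g. nuts & seeds
]
LOG_DENSITY_THRESHOLD = -20.0  # made-up cut-off for "rare"

# Made-up entries; energy in kJ per 100 g, everything else in g per 100 g.
entries = [
    {"name": "plain yogurt", "fat": 3.5, "carbohydrates": 4.7, "sugar": 4.7,
     "salt": 0.1, "proteins": 3.8, "energy": 260},
    {"name": "typo entry", "fat": 350.0, "carbohydrates": 4.7, "sugar": 4.7,
     "salt": 0.1, "proteins": 3.8, "energy": 260},
    {"name": "negative entry", "fat": 3.5, "carbohydrates": 4.7, "sugar": -4.0,
     "salt": 0.1, "proteins": 3.8, "energy": 260},
    {"name": "protein bar", "fat": 10.0, "carbohydrates": 30.0, "sugar": 5.0,
     "salt": 0.5, "proteins": 45.0, "energy": 1600},
]

# Step 1: drop entries that cannot be right.
plausible = [e for e in entries if is_plausible(e)]
# Step 2: flag entries that lie in a sparsely populated region.
flagged = [
    e["name"] for e in plausible
    if mixture_log_density((e["fat"], e["proteins"]), COMPONENTS)
    < LOG_DENSITY_THRESHOLD
]
print([e["name"] for e in plausible], flagged)
```

A flagged entry is not necessarily wrong. As with the protein bar here, it may simply be an unusual product, which is why such entries are candidates for a check rather than for automatic deletion.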
What did I learn during my internship?
My internship gave me a deep and exciting insight into how machine learning works and what it can do in practice. I was also able to see how data can be examined visually and statistically, and to improve my programming skills at the same time. It was also a lot of fun to participate in an open-source project and to make a charitable contribution for all users of the Open Food Facts app. If you want a deeper look into the project, you can follow the analysis on the data science platform Kaggle.