Globus is a fashion retailer that, besides its stores (such as the flagship at Zurich Bahnhofstrasse), also sells fashion products through its online shop.
The online shop has thousands of product images from the fashion department, together with their descriptions. Before this project was delivered, classifying these images according to their place in the Globus merchandise hierarchy (e.g. men -> jackets -> real leather) and describing their features (e.g. color, material) was largely done by hand. The overall aim of this project was to use Image Recognition to automate as much of this process as possible. To this end, different Neural Network architectures were investigated and fine-tuned. A related goal was to identify which product categories (e.g. women’s clothing, men’s clothing, purses, shoes) are best suited for feature prediction. That depended both on a properly designed and trained model and on the quality and quantity of the input data available for each category.
This project was the second iteration of such an image classifier. The model previously developed by SIT Academy students used a ResNet-50 architecture to build a two-output Neural Network. One output branch predicted the merchandise hierarchy of a given input image, and the other predicted the probability distribution of that image over all product features. The model was trained on a dataset of more than 35’000 images downloaded from a Globus-provided API endpoint that delivered its data in JSON format. The JSON files contained information on both the merchandise hierarchy and the product features. In the current project, our students set out to improve the accuracy of the second output branch, the one that predicts product features.
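To make the two training targets concrete, the following minimal sketch turns JSON records into one class index for the hierarchy branch and one multi-hot vector for the feature branch. The record layout (`image`, `hierarchy`, `features`), the sample values and the helper names `build_vocab` and `encode` are assumptions for illustration only; the actual schema of the Globus API is not described here, and the original model may have encoded its targets differently.

```python
import json

# Hypothetical record layout; the real Globus API schema is not given in the text.
RAW = """
[
  {"image": "a.jpg", "hierarchy": ["men", "jackets", "real leather"],
   "features": {"color": "black", "material": "leather"}},
  {"image": "b.jpg", "hierarchy": ["women", "purses"],
   "features": {"color": "red", "material": "leather"}}
]
"""


def build_vocab(records):
    """Collect every hierarchy path and every attribute=value pair seen in the data."""
    hierarchies = sorted({" > ".join(r["hierarchy"]) for r in records})
    features = sorted({f"{k}={v}" for r in records for k, v in r["features"].items()})
    return hierarchies, features


def encode(record, hierarchies, features):
    """Return (class index for the hierarchy branch, multi-hot list for the feature branch)."""
    h_index = hierarchies.index(" > ".join(record["hierarchy"]))
    active = {f"{k}={v}" for k, v in record["features"].items()}
    f_vector = [1 if name in active else 0 for name in features]
    return h_index, f_vector


records = json.loads(RAW)
hierarchies, features = build_vocab(records)
targets = [encode(r, hierarchies, features) for r in records]
```

Each entry in `targets` pairs one label per output branch, which is the shape of supervision a two-output network needs.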
The first milestone was to train a single-output Neural Network on the product category with the most data and to maximize the accuracy of its product feature predictions. The model output was constrained to the features relevant for the chosen product category. This constrained-output model served as a benchmark.
The second step was to investigate a model trained on multiple product categories with unconstrained output. This meant that predictions were made over all product features present in the input data. The goal was to check whether such an unconstrained-output model can produce product features that are both accurate and consistent. Consistency in this context means that features not belonging to an image's product category should be predicted with a probability close to zero (e.g. a man’s jacket should not have features of a woman’s purse). The unconstrained-output model was compared to the benchmark in order to recommend either multiple constrained-output models (i.e. one model per product category) or a single unconstrained-output model (i.e. one model for all product categories).
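One way to put a number on consistency as defined above is to measure how much of the predicted probability falls on features outside the image's product category. The sketch below is an assumed metric, not necessarily the one used in the project; the function name `off_category_mass`, the feature names and the probabilities are all hypothetical.

```python
def off_category_mass(probs, allowed):
    """Share of the total predicted probability placed on features outside the category.

    probs:   dict mapping feature name -> predicted probability
    allowed: set of feature names that are valid for the product category
    """
    total = sum(probs.values())
    if total == 0:
        return 0.0  # nothing predicted, so nothing is inconsistent
    off = sum(p for name, p in probs.items() if name not in allowed)
    return off / total


# Hypothetical prediction for a men's jacket image (values invented for illustration).
prediction = {"color=black": 0.6, "material=leather": 0.3, "strap=chain": 0.1}
jacket_features = {"color=black", "color=brown", "material=leather"}

score = off_category_mass(prediction, jacket_features)  # roughly 0.1 here
```

A score near zero indicates a consistent prediction; averaging the score over a validation set would give a single figure for comparing the unconstrained model with the benchmark.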
Next, a post-processing step was investigated in which the predicted feature probabilities were adjusted for consistency. The final step was to investigate whether feeding the merchandise hierarchy of each image into the model, in addition to the product image itself, leads to better and more consistent feature predictions in both the constrained and the unconstrained models.
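The text does not say how the probabilities were adjusted. A simple possibility, shown here purely as an assumption, is to zero out every feature that does not belong to the image's product category and rescale the remaining probabilities so they sum to one again. The function name `enforce_consistency` and all values are hypothetical, and the rescaling only makes sense if the model outputs a single distribution over features rather than independent per-feature scores.

```python
def enforce_consistency(probs, allowed):
    """Zero out features outside the category and renormalize the rest.

    probs:   dict mapping feature name -> predicted probability
    allowed: set of feature names that are valid for the product category
    """
    kept = {name: p for name, p in probs.items() if name in allowed}
    total = sum(kept.values())
    if total == 0:
        # No allowed feature received any probability; return all zeros.
        return {name: 0.0 for name in probs}
    return {name: (kept[name] / total if name in kept else 0.0) for name in probs}


# Hypothetical prediction for a men's jacket image (values invented for illustration).
prediction = {"color=black": 0.6, "material=leather": 0.2, "strap=chain": 0.2}
jacket_features = {"color=black", "material=leather"}

adjusted = enforce_consistency(prediction, jacket_features)
```

In this example the purse-specific feature `strap=chain` drops to zero and the two jacket features share the full probability in their original 3:1 ratio.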
Results included detailed documentation of all data preprocessing steps and of all model versions tested in each project step. Final recommendations on how to fully achieve feature prediction with a trained model were also provided. Overall, a very high prediction rate was achieved: with our students' model, 9 out of 10 features of an object were predicted correctly, which was an excellent result. After all, there are 72 different shades of white, which, naturally, could not all be covered in training.