Sales Forecasting


Konstantin Burkin


Introduction


This project was written in Google Colab using Python 3.7.13. The work is available in my Github repository, where you can download the Colab notebook, inspect the code, and reproduce the analysis.

Assignment

Predicting the number of sales is an important business problem for grocery stores, fast-food restaurants, and delivery services. Forecasting the number of orders can not only speed up delivery, but also simplify the company's logistics, help maintain the quality of products with a short shelf life, and raise the business's rating, thereby increasing its competitiveness. Predictive analysis thus allows a business to substantially reduce resource consumption and operating costs.

The goal of this project is to predict the number of sales in Delivery Club for one week in February. This week is described by the "test" dataframe, which contains information about the date, weather conditions, product types, stores, and their locations. The sales history for the previous 7 months is available in the "train" dataframe. The predictions may be made with any kind of ML model, but must not exceed a mean absolute error of 4.1. Both dataframes, describing Delivery Club sales from 160 stores in 10 Russian cities, were provided by the supervisor.

The result of this project is a dataframe containing product ids and the corresponding numbers of predicted sales. The accuracy of the predictions was evaluated on the Yandex.contest platform.

Brief outline

  • Setup notebook environment
    • Import data and libraries
    • Dataframe conversion to English format
  • Exploration of data
    • Data types, number of unique values, description of columns, missing values analysis
    • Statistical evaluation of numeric columns, initial correlation between columns
  • Visualisation of patterns in the dataframe
  • Data transformation. Visualisation of dependencies between sales column and other columns
  • Comparison of ML models to choose the one with minimal absolute error
  • Building predictions and visualisation of the result


Import of data and Python libraries


The following data analysis uses several Python libraries for data handling, building ML models, plotting data, etc.:
  • Numpy
  • Pandas
  • Plotly
  • Sklearn
  • Seaborn
  • Google colab
Two dataframes (train.csv and test.csv) were downloaded and read from my Github repository. The train dataframe is used to examine the data, find patterns and factors that correlate with the number of sales, add new variables, and test ML models. The test dataframe is used to build the predictions that are later evaluated in Yandex.contest.
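As a minimal, self-contained sketch of the loading step (the repository URL is a placeholder, so a tiny inline CSV stands in for train.csv here):

```python
import io
import pandas as pd

# Stand-in for train.csv; in the notebook this would be
# pd.read_csv("https://raw.githubusercontent.com/<user>/<repo>/main/train.csv")
csv_text = """id,date,city_name,store_id,price,sales
1,2021-07-29,Moscow,1,4.79,26
2,2021-07-30,Moscow,1,4.79,37
"""

# parse_dates converts the date column to datetime for later feature engineering
df_train = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df_train.head())
print(df_train.shape)
```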

The head of the dataframe

id date city_name store_id category_id product_id price weather_desc humidity temperature pressure sales
1 2021-07-29 Moscow 1 1 1 4.79 partly cloudy, light rain 61.9375 23.1875 741.0000 26
2 2021-07-30 Moscow 1 1 1 4.79 partly cloudy, light rain 70.2500 22.1875 740.3125 37
3 2021-07-31 Moscow 1 1 1 4.79 partly cloudy 52.6250 21.8125 741.6250 25
4 2021-08-01 Moscow 1 1 1 4.79 cloudy, light rain 87.4375 20.0625 743.3125 26
5 2021-08-02 Moscow 1 1 1 4.79 partly cloudy 66.1875 23.4375 739.6250 22


Exploratory Data Analysis


Statistical data description

The two dataframes (train.csv and test.csv) share a common list of 11 variables; the "sales" variable is present only in the train dataframe.

Table of variables

Variable Description Datatype NAs Unique values
1 id Unique identifier representing a bundle (product_id, store_id, date); each id occurs only once in the data int64 0 666676
2 date Date of sale object 0 200
3 city_name Name of the city where the sale took place object 0 10
4 store_id Unique identifier for each store int64 0 160
5 category_id Product category int64 0 9
6 product_id Unique identifier for each type of product int64 0 32
7 price Price of the product float64 0 29
8 weather_desc Weather description object 0 16
9 humidity Humidity in the city on the day of sale float64 0 916
10 temperature Temperature in the city on the day of sale float64 0 505
11 pressure Atmosphere pressure in the city on the day of sale float64 0 344
12 sales Number of product sales (this is what I should predict) int64 0 249
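A per-column summary like the table above (datatype, NA count, unique values) can be assembled in one pass with pandas; a sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for the train dataframe
df = pd.DataFrame({
    "city_name": ["Moscow", "Moscow", "Kazan"],
    "price": [4.79, 4.79, 3.10],
    "sales": [26, 37, 5],
})

# One row per column: datatype, number of missing values, number of unique values
summary = pd.DataFrame({
    "Datatype": df.dtypes.astype(str),
    "NAs": df.isna().sum(),
    "Unique values": df.nunique(),
})
print(summary)
```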

There are 666,676 observations in the train dataframe and 24,836 observations in the test dataframe.
The train dataframe covers roughly a 7-month period, from July 29 to February 13 (200 unique dates).
The test dataframe covers a one-week period from February 14 to February 20.
The dataframes contain information from 10 cities:
Moscow, St. Petersburg, Krasnodar, Samara, Nizhny Novgorod, Rostov-on-Don, Volgograd, Voronezh, Kazan, and Yekaterinburg.

The table below describes the columns with numeric data. The second table shows the Pearson correlation between the sales column and the other numeric columns.

Statistical description of numeric data

price humidity temperature pressure sales
mean 5.1 74.3 4.9 751 10
min 1.9 13.8 -24.0 710 0
max 18.6 100.0 34.3 779 275
25% 3.0 59.8 -3.3 745 2
50% 4.1 79.7 4.4 751 5
75% 6.0 92.4 11.8 758 12

Pearson correlation between sales and other columns

Column name sales
sales 1.00
product_id 0.14
humidity 0.13
pressure -0.06
temperature -0.07
id -0.08
store_id -0.09
category_id -0.11
price -0.19
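A correlation table of this kind can be computed directly with pandas; a sketch on toy data (the numbers will not match the table above):

```python
import pandas as pd

# Toy numeric columns mimicking the train dataframe
df = pd.DataFrame({
    "sales":    [26, 37, 25, 26, 22],
    "price":    [4.79, 4.79, 4.79, 3.50, 3.50],
    "humidity": [61.9, 70.2, 52.6, 87.4, 66.2],
})

# Pearson correlation of every numeric column with "sales", sorted descending
corr = df.corr(numeric_only=True)["sales"].sort_values(ascending=False)
print(corr)
```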

Data visualisation

All graphs in this work are interactive: you can select any of the lines shown on a graph and zoom to an interval along either axis to explore the data. To return a graph to its initial state, double-click it or press the "Reset axes" button in the top right corner.

It seems intuitive to start looking for patterns with data exploration of how the cumulative number of sales changes each day in each city. The graph below shows 7-months period in 10 cities. There are several features that are evident:
  • First, the number of sales noticeably changes throughout the week in each city.
  • Second thing is that in different cities there is a different rate of sales. In larger cities, like Moscow, the sales rate is the highest.
  • Third, the noticeable sharp decrease in sales can be observed around New Year, which is an important national holiday in Russia, during which people generally stay at home with their families and most of the stores are closed. This effect cannot be considered as an outlier and should be extrapolated to other major public holidays. However, no such holidays are present in February.
  • Finally, there is a noticeable tendency for the total number of sales to start rising in winter.

The next step is to observe another time dependency, but with the aggregated number of sales in each month. The graph below shows that in every city the total number of sales slowly dropped at the beginning of autumn and then started to grow again at the beginning of winter. On this graph the difference between the total number of sales in different cities is much clearer. This is not unusual, since Russian cities are quite disproportionate in population size and salary levels (supplementary information is shown in the table below). Information about population size and salary levels was found here and here.

Population and salary levels in cities

City Population, M Salary, K
1 Moscow 12.66 111.1
2 St. Petersburg 5.38 76.0
3 Krasnodar 0.95 40.8
4 Nizhny Novgorod 1.26 41.5
5 Volgograd 1.00 38.1
6 Kazan 1.26 44.9
7 Samara 1.14 42.9
8 Rostov-on-Don 1.14 39.1
9 Voronezh 1.05 40.9
10 Yekaterinburg 1.50 48.4
Since the number of sales changes drastically from city to city, it is interesting to compare the mean daily sales in each city. Moreover, the gap between the mean number of sales on different days of the week should be large. It is also important to check whether there is any strong correlation of price vs sales and weather vs sales.

In the set of graphs below it can be seen that the mean of sales is low at the beginning of the week and rises towards the weekend. The dependence is weak but noticeable, so it is reasonable to add an additional variable "Day of the week" to the dataframe, filled with the number of the day of the week (from 1 to 7). Another variable, "Weekend", was added with values 0 and 1, showing whether the day is a weekend.

The rate of sales varies depending on price: there are many products with low prices, and the cheaper the product, the more often it is sold. On the bottom left of the graph a dependence can be observed in the mean daily sales in each city. The column of city names was converted to numeric format with the one-hot encoding approach, since it has a substantive correlation with sales. Lastly, there is a weak dependence between sales and weather conditions, so it was also converted to numeric format with one-hot encoding, although this dependence is hard to interpret properly, and some weather types are more frequent than others in different cities.
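The calendar and one-hot transformations described above can be sketched with pandas (the exact column names in the notebook may differ; the toy frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-29", "2021-07-31", "2021-08-01"]),
    "city_name": ["Moscow", "Moscow", "Kazan"],
})

# Day of the week, 1 = Monday ... 7 = Sunday, as in the text above
df["day_of_week"] = df["date"].dt.dayofweek + 1
# Weekend flag: 1 for Saturday/Sunday, 0 otherwise
df["weekend"] = (df["day_of_week"] >= 6).astype(int)
# One-hot encoding of the city column
df = pd.get_dummies(df, columns=["city_name"], prefix="city")
print(df)
```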
To build sales predictions it is necessary to find data that correlates strongly with the number of sales. For that, several lag features were added to the dataframe: three columns with the sales of each product in each store during the three previous weeks, plus a column with the average of those three values. Furthermore, the all-time mean of sales of the product for a given day of the week in each store was added. A correlation matrix was obtained for the new dataframe. The new columns show a high correlation with the number of sales and should significantly improve the predictions of future sales.
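The lag features can be built with grouped shifts. A sketch under assumptions: the lag column names (lag_1w, lag_2w, lag_3w) are hypothetical, lag_days_mean and day_product_mean are the names used later in the text, and the toy frame has one row per week per product, so each shift is by one row (on the real daily data the shift would be by 7 rows per week within each group):

```python
import pandas as pd

# Toy frame: one product in one store, observed on five consecutive Thursdays
df = pd.DataFrame({
    "store_id":   [1] * 5,
    "product_id": [1] * 5,
    "date": pd.to_datetime(["2021-07-29", "2021-08-05", "2021-08-12",
                            "2021-08-19", "2021-08-26"]),
    "sales": [26, 30, 24, 28, 31],
})

df = df.sort_values("date")
g = df.groupby(["store_id", "product_id"])["sales"]
# Sales of the same product in the same store 1, 2 and 3 weeks earlier
for weeks in (1, 2, 3):
    df[f"lag_{weeks}w"] = g.shift(weeks)
# Mean of the three lag columns
df["lag_days_mean"] = df[["lag_1w", "lag_2w", "lag_3w"]].mean(axis=1)

# All-time mean of sales per (store, product, day of week)
df["day_of_week"] = df["date"].dt.dayofweek + 1
df["day_product_mean"] = df.groupby(
    ["store_id", "product_id", "day_of_week"])["sales"].transform("mean")
print(df)
```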
For further evaluation of ML models, mean absolute error (MAE) is used as the quality metric. MAE shows by how much the forecast is wrong on average and is very easy to interpret: a metric value of 5 means that the model is on average wrong by 5 sales. The metric cannot be negative, since absolute values of the errors are taken; for an ideal model it equals 0. MAE is less sensitive to outliers than squared-error metrics.
$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$
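The formula translates directly into a few lines of NumPy (sklearn's mean_absolute_error computes the same quantity):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|, as in the formula above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# Errors are |3-2|, |5-5|, |2-4| = 1, 0, 2, so MAE = 3/3 = 1.0
print(mae([3, 5, 2], [2, 5, 4]))
```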


Building ML models


Two dataframes (df_train, df_test) were obtained with all the modifications described previously. The first three weeks in df_train were dropped, since they have NAs in the lag feature columns. First, several ML models were trained on df_train and the model with the minimal MAE score was chosen. For this, df_train was split into four parts: x_train, y_train, x_test, y_test.
The ML models used in this work are listed below. For each model, a MAE score was calculated.
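The split into four parts can be sketched with sklearn; the exact split used in the notebook is not shown, so this is one common way to do it (the toy feature columns are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_train with engineered features and the target
df = pd.DataFrame({
    "price":         [4.79, 4.79, 3.5, 3.5, 2.0, 2.0, 5.0, 5.0],
    "lag_days_mean": [25, 30, 8, 9, 3, 4, 12, 14],
    "sales":         [26, 31, 7, 10, 3, 5, 13, 15],
})

x = df.drop(columns="sales")
y = df["sales"]
# Hold out 25% of rows for evaluating the candidate models
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)
print(x_train.shape, x_test.shape)
```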

ML models:

  • k-nearest neighbors (KNN)
  • Linear Regression
  • Decision Tree Regression
  • Ensemble models
    • Random forest
    • Gradient Boosting
    • Mean of all regressors
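A comparison loop of this kind can be sketched on synthetic data; since the toy target is linear, the relative scores here will not match the project's table (where the Decision Tree wins on the real features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: y is linear in two features plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 2))
y = 3 * x[:, 0] + x[:, 1] + rng.normal(0, 0.5, size=200)
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

models = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(x_test))
    print(f"{name}: MAE = {scores[name]:.2f}")
```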
The MAE scores of the models are presented in the table below. Predictions and actual data are visualised in the graph. The Decision Tree model gives the best predictions, which is supported by its minimal MAE score.

Model MAE
Decision Tree 3.07
Linear Regression 3.78
KNN 4.17
Random Forest 3.09
Gradient Boosting 3.63
Mean of all regressors 3.37


Prediction


For predictions on the df_test data, the Decision Tree model is used. To visualise the predicted values, a graph was created showing the mean daily sales in each city. It looks similar to what was observed earlier in the exploratory data analysis: the same difference in sales rates between cities and the same tendency throughout the week. Below is the head of the dataframe with the predicted number of sales.

The first 5 rows of the dataframe with predicted sales

id prediction
666677 17
666678 28
666679 26
666680 22
666681 25
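The final step can be sketched as fitting the chosen model on the training data and assembling the submission frame; the feature names and toy values below are illustrative, not the notebook's actual feature set:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy stand-ins for df_train / df_test with engineered features
df_train = pd.DataFrame({
    "price":         [4.79, 3.5, 2.0, 5.0],
    "lag_days_mean": [25.0, 8.0, 3.0, 12.0],
    "sales":         [26, 7, 3, 13],
})
df_test = pd.DataFrame({
    "id": [666677, 666678],
    "price":         [4.79, 3.5],
    "lag_days_mean": [27.0, 9.0],
})

features = ["price", "lag_days_mean"]
model = DecisionTreeRegressor(random_state=42)
model.fit(df_train[features], df_train["sales"])

# Submission frame: product id and the rounded integer prediction
submission = pd.DataFrame({
    "id": df_test["id"],
    "prediction": model.predict(df_test[features]).round().astype(int),
})
print(submission)
```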


Results


In this project, sales predictions were made for stores across 10 Russian cities for the third week of February.
  • Exploratory analysis revealed patterns in the data that can be easily interpreted with Russian population and salary distribution statistics.
  • Correlation between the data and sales rates was improved by using one-hot encoding for object data, such as city names.
  • Correlation was further improved by the use of lag features of sales. The highest correlation was observed for the mean number of sales for each day of the week for products in each store (day_product_mean) and the mean of the three lag values (lag_days_mean): 0.90 and 0.88 respectively.
  • 6 ML models were tested and evaluated using the MAE score. The Decision Tree model had the lowest MAE score (3.07) and was chosen for the predictions.



