Sales Forecasting


Konstantin Burkin


Introduction


This project was written in Google Colab using Python 3.7.13. The work is available in my Github repository, where you can download the Colab notebook, inspect the code, and reproduce the analysis.

Assignment

Predicting the number of sales is an important business problem for grocery stores, fast-food restaurants, and delivery services. Forecasting the number of orders can not only speed up delivery, but also simplify the company's logistics, help maintain the quality of products with a short shelf life, and raise the business's rating, thereby increasing its competitiveness. Predictive analysis thus allows a business to substantially reduce resource consumption and operating costs.

The goal of this project is to predict the number of sales in Delivery Club for one week in February. This week is described by the "test" dataframe, which contains information about the date, weather conditions, product types, stores, and their locations. The sales history for the previous 7 months is available in the "train" dataframe. The predictions may be made with any kind of ML model, but must not exceed a mean absolute error of 4.1. Both dataframes, describing Delivery Club sales from 160 stores in 10 Russian cities, were provided by the supervisor.

The result of this project is a dataframe containing product ids and the corresponding numbers of predicted sales. The accuracy of the predictions was evaluated on the Yandex.contest platform.

Brief outline

  • Setup notebook environment
    • Import data and libraries
    • Dataframe conversion to English format
  • Exploration of data
    • Data types, number of unique values, description of columns, missing values analysis
    • Statistical evaluation of numeric columns, initial correlation between columns
  • Visualisation of patterns in the dataframe
  • Data transformation. Visualisation of dependencies between sales column and other columns
  • Comparison of ML models to choose the one with minimal absolute error
  • Building predictions and visualisation of the result


Import of data and Python libraries


The following data analysis uses several Python libraries for data handling, building ML models, plotting data, etc.:
  • Numpy
  • Pandas
  • Plotly
  • Sklearn
  • Seaborn
  • Google colab
Two dataframes (train.csv and test.csv) were downloaded and read from my Github repository. The train dataframe is used to examine the data, find patterns and factors that correlate with the number of sales, add new variables, and test ML models. The test dataframe is used to build the predictions that are later evaluated in Yandex.contest.
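As a minimal, self-contained sketch of the loading step (the repository URL is a placeholder, so a tiny inline CSV stands in for train.csv here):

```python
import io
import pandas as pd

# Stand-in for train.csv; in the notebook this would be
# pd.read_csv("https://raw.githubusercontent.com/<user>/<repo>/main/train.csv")
csv_text = """id,date,city_name,store_id,price,sales
1,2021-07-29,Moscow,1,4.79,26
2,2021-07-30,Moscow,1,4.79,37
"""

# parse_dates converts the date column to datetime for later feature engineering
df_train = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df_train.head())
print(df_train.shape)
```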

The head of the dataframe

id date city_name store_id category_id product_id price weather_desc humidity temperature pressure sales
1 2021-07-29 Moscow 1 1 1 4.79 partly cloudy, light rain 61.9375 23.1875 741.0000 26
2 2021-07-30 Moscow 1 1 1 4.79 partly cloudy, light rain 70.2500 22.1875 740.3125 37
3 2021-07-31 Moscow 1 1 1 4.79 partly cloudy 52.6250 21.8125 741.6250 25
4 2021-08-01 Moscow 1 1 1 4.79 cloudy, light rain 87.4375 20.0625 743.3125 26
5 2021-08-02 Moscow 1 1 1 4.79 partly cloudy 66.1875 23.4375 739.6250 22


Exploratory Data Analysis


Statistical data description

The two dataframes (train.csv and test.csv) share a common list of 11 variables; the "sales" variable is present only in the train dataframe.

Table of variables

Variable Description Datatype NAs Unique values
1 id Unique identifier representing a bundle (product_id, store_id, date); each id occurs only once in the data int64 0 666676
2 date Date of sale object 0 200
3 city_name Name of the city where the sale took place object 0 10
4 store_id Unique identifier for each store int64 0 160
5 category_id Product category int64 0 9
6 product_id Unique identifier for each type of product int64 0 32
7 price Price of the product float64 0 29
8 weather_desc Weather description object 0 16
9 humidity Humidity in the city on the day of sale float64 0 916
10 temperature Temperature in the city on the day of sale float64 0 505
11 pressure Atmosphere pressure in the city on the day of sale float64 0 344
12 sales Number of product sales (this is what I should predict) int64 0 249
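A per-column summary like the table above (datatype, NA count, unique values) can be assembled in one pass with pandas; a sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for the train dataframe
df = pd.DataFrame({
    "city_name": ["Moscow", "Moscow", "Kazan"],
    "price": [4.79, 4.79, 3.10],
    "sales": [26, 37, 5],
})

# One row per column: datatype, number of missing values, number of unique values
summary = pd.DataFrame({
    "Datatype": df.dtypes.astype(str),
    "NAs": df.isna().sum(),
    "Unique values": df.nunique(),
})
print(summary)
```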

There are 666,676 observations in the train dataframe and 24,836 observations in the test dataframe.
The train dataframe covers roughly a 7-month period, from July 29 to February 13 (200 unique dates).
The test dataframe covers a one-week period from February 14 to February 20.
The dataframes contain information from 10 cities:
Moscow, St. Petersburg, Krasnodar, Samara, Nizhny Novgorod, Rostov-on-Don, Volgograd, Voronezh, Kazan, and Yekaterinburg.

The table below describes the columns with numeric data. The second table shows the Pearson correlation between the sales column and the other numeric columns.

Statistical description of numeric data

price humidity temperature pressure sales
mean 5.1 74.3 4.9 751 10
min 1.9 13.8 -24.0 710 0
max 18.6 100.0 34.3 779 275
25% 3.0 59.8 -3.3 745 2
50% 4.1 79.7 4.4 751 5
75% 6.0 92.4 11.8 758 12

Pearson correlation between sales and other columns

Column name sales
sales 1.00
product_id 0.14
humidity 0.13
pressure -0.06
temperature -0.07
id -0.08
store_id -0.09
category_id -0.11
price -0.19
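A correlation table of this kind can be computed directly with pandas; a sketch on toy data (the numbers will not match the table above):

```python
import pandas as pd

# Toy numeric columns mimicking the train dataframe
df = pd.DataFrame({
    "sales":    [26, 37, 25, 26, 22],
    "price":    [4.79, 4.79, 4.79, 3.50, 3.50],
    "humidity": [61.9, 70.2, 52.6, 87.4, 66.2],
})

# Pearson correlation of every numeric column with "sales", sorted descending
corr = df.corr(numeric_only=True)["sales"].sort_values(ascending=False)
print(corr)
```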

Data visualisation

All graphs in this work are interactive: you can select any of the lines shown on a graph and zoom to an interval along either axis to explore the data. To return a graph to its initial state, double-click it or press the "Reset axes" button in the top right corner.

It seems intuitive to start looking for patterns with data exploration of how the cumulative number of sales changes each day in each city. The graph below shows 7-months period in 10 cities. There are several features that are evident:
  • First, the number of sales noticeably changes throughout the week in each city.
  • Second thing is that in different cities there is a different rate of sales. In larger cities, like Moscow, the sales rate is the highest.
  • Third, the noticeable sharp decrease in sales can be observed around New Year, which is an important national holiday in Russia, during which people generally stay at home with their families and most of the stores are closed. This effect cannot be considered as an outlier and should be extrapolated to other major public holidays. However, no such holidays are present in February.
  • Finally, there is a noticeable tendency for the total number of sales to start rising in winter.

The next step is to observe another time dependency, but with the aggregated number of sales in each month. The graph below shows that in every city the total number of sales slowly dropped at the beginning of autumn and then started to grow again at the beginning of winter. On this graph the difference between the total number of sales in different cities is much clearer. This is not unusual, since Russian cities are quite disproportionate in population size and salary levels (supplementary information is shown in the table below). Information about population size and salary levels was found here and here.

Population and salary levels in cities

City Population, M Salary, K
1 Moscow 12.66 111.1
2 St. Petersburg 5.38 76.0
3 Krasnodar 0.95 40.8
4 Nizhny Novgorod 1.26 41.5
5 Volgograd 1.00 38.1
6 Kazan 1.26 44.9
7 Samara 1.14 42.9
8 Rostov-on-Don 1.14 39.1
9 Voronezh 1.05 40.9
10 Yekaterinburg 1.50 48.4
Since the number of sales changes drastically from city to city, it is interesting to compare the mean daily sales in each city. Moreover, the gap between the mean number of sales on different days of the week should be large. It is also important to check whether there is any strong correlation of price vs sales and weather vs sales.

In the set of graphs below it can be seen that the mean of sales is low at the beginning of the week and rises towards the weekend. The dependence is weak but noticeable, so it is reasonable to add an additional variable "Day of the week" to the dataframe, filled with the number of the day of the week (from 1 to 7). Another variable, "Weekend", was added with values 0 and 1, showing whether the day is a weekend.

The rate of sales varies depending on price: there are many products with low prices, and the cheaper the product, the more often it is sold. On the bottom left of the graph a dependence can be observed in the mean daily sales in each city. The column of city names was converted to numeric format with the one-hot encoding approach, since it has a substantive correlation with sales. Lastly, there is a weak dependence between sales and weather conditions, so it was also converted to numeric format with one-hot encoding, although this dependence is hard to interpret properly, and some weather types are more frequent than others in different cities.
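The calendar and one-hot transformations described above can be sketched with pandas (the exact column names in the notebook may differ; the toy frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-29", "2021-07-31", "2021-08-01"]),
    "city_name": ["Moscow", "Moscow", "Kazan"],
})

# Day of the week, 1 = Monday ... 7 = Sunday, as in the text above
df["day_of_week"] = df["date"].dt.dayofweek + 1
# Weekend flag: 1 for Saturday/Sunday, 0 otherwise
df["weekend"] = (df["day_of_week"] >= 6).astype(int)
# One-hot encoding of the city column
df = pd.get_dummies(df, columns=["city_name"], prefix="city")
print(df)
```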
To build sales predictions it is necessary to find data that correlates strongly with the number of sales. For that, several lag features were added to the dataframe: three columns with the sales of each product in each store during the three previous weeks, plus a column with the average of those three values. Furthermore, the all-time mean of sales of the product for a given day of the week in each store was added. A correlation matrix was obtained for the new dataframe. The new columns show a high correlation with the number of sales and should significantly improve the predictions of future sales.
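The lag features can be built with grouped shifts. A sketch under assumptions: the lag column names (lag_1w, lag_2w, lag_3w) are hypothetical, lag_days_mean and day_product_mean are the names used later in the text, and the toy frame has one row per week per product, so each shift is by one row (on the real daily data the shift would be by 7 rows per week within each group):

```python
import pandas as pd

# Toy frame: one product in one store, observed on five consecutive Thursdays
df = pd.DataFrame({
    "store_id":   [1] * 5,
    "product_id": [1] * 5,
    "date": pd.to_datetime(["2021-07-29", "2021-08-05", "2021-08-12",
                            "2021-08-19", "2021-08-26"]),
    "sales": [26, 30, 24, 28, 31],
})

df = df.sort_values("date")
g = df.groupby(["store_id", "product_id"])["sales"]
# Sales of the same product in the same store 1, 2 and 3 weeks earlier
for weeks in (1, 2, 3):
    df[f"lag_{weeks}w"] = g.shift(weeks)
# Mean of the three lag columns
df["lag_days_mean"] = df[["lag_1w", "lag_2w", "lag_3w"]].mean(axis=1)

# All-time mean of sales per (store, product, day of week)
df["day_of_week"] = df["date"].dt.dayofweek + 1
df["day_product_mean"] = df.groupby(
    ["store_id", "product_id", "day_of_week"])["sales"].transform("mean")
print(df)
```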
For further evaluation of ML models, mean absolute error (MAE) is used as the quality metric. MAE shows by how much the forecast is wrong on average and is very easy to interpret: a metric value of 5 means that the model is on average wrong by 5 sales. The metric cannot be negative, since absolute values of the errors are taken; for an ideal model it equals 0. MAE is less sensitive to outliers than squared-error metrics.
$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$
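The formula translates directly into a few lines of NumPy (sklearn's mean_absolute_error computes the same quantity):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|, as in the formula above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# Errors are |3-2|, |5-5|, |2-4| = 1, 0, 2, so MAE = 3/3 = 1.0
print(mae([3, 5, 2], [2, 5, 4]))
```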


Building ML models


Two dataframes (df_train, df_test) were obtained with all the modifications described previously. The first three weeks in df_train were dropped, since they have NAs in the lag feature columns. First, several ML models were trained on df_train and the model with the minimal MAE score was chosen. For this, df_train was split into four parts: x_train, y_train, x_test, y_test.
The ML models used in this work are listed below. For each model, a MAE score was calculated.
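The split into four parts can be sketched with sklearn; the exact split used in the notebook is not shown, so this is one common way to do it (the toy feature columns are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_train with engineered features and the target
df = pd.DataFrame({
    "price":         [4.79, 4.79, 3.5, 3.5, 2.0, 2.0, 5.0, 5.0],
    "lag_days_mean": [25, 30, 8, 9, 3, 4, 12, 14],
    "sales":         [26, 31, 7, 10, 3, 5, 13, 15],
})

x = df.drop(columns="sales")
y = df["sales"]
# Hold out 25% of rows for evaluating the candidate models
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)
print(x_train.shape, x_test.shape)
```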

ML models:

  • k-nearest neighbors (KNN)
  • Linear Regression
  • Decision Tree Regression
  • Ensemble models
    • Random forest
    • Gradient Boosting
    • Mean of all regressors
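A comparison loop of this kind can be sketched on synthetic data; since the toy target is linear, the relative scores here will not match the project's table (where the Decision Tree wins on the real features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: y is linear in two features plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 2))
y = 3 * x[:, 0] + x[:, 1] + rng.normal(0, 0.5, size=200)
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

models = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(x_test))
    print(f"{name}: MAE = {scores[name]:.2f}")
```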
The MAE scores of the models are presented in the table below. Predictions and actual data are visualised in the graph. The Decision Tree model gives the best predictions, which is supported by its minimal MAE score.

Model MAE
Decision Tree 3.07
Linear Regression 3.78
KNN 4.17
Random Forest 3.09
Gradient Boosting 3.63
Mean of all regressors 3.37


Prediction


For predictions on the df_test data, the Decision Tree model is used. To visualise the predicted values, a graph was created showing the mean daily sales in each city. It looks similar to what was observed earlier in the exploratory data analysis: the same difference in sales rates between cities and the same tendency throughout the week. Below is the head of the dataframe with the predicted number of sales.

The first 5 rows of the dataframe with predicted sales

id prediction
666677 17
666678 28
666679 26
666680 22
666681 25
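The final step can be sketched as fitting the chosen model on the training data and assembling the submission frame; the feature names and toy values below are illustrative, not the notebook's actual feature set:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy stand-ins for df_train / df_test with engineered features
df_train = pd.DataFrame({
    "price":         [4.79, 3.5, 2.0, 5.0],
    "lag_days_mean": [25.0, 8.0, 3.0, 12.0],
    "sales":         [26, 7, 3, 13],
})
df_test = pd.DataFrame({
    "id": [666677, 666678],
    "price":         [4.79, 3.5],
    "lag_days_mean": [27.0, 9.0],
})

features = ["price", "lag_days_mean"]
model = DecisionTreeRegressor(random_state=42)
model.fit(df_train[features], df_train["sales"])

# Submission frame: product id and the rounded integer prediction
submission = pd.DataFrame({
    "id": df_test["id"],
    "prediction": model.predict(df_test[features]).round().astype(int),
})
print(submission)
```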


Results


In this project, sales predictions were made for stores across 10 Russian cities for the third week of February.
  • Exploratory analysis revealed patterns in the data that can be easily interpreted with Russian population and salary distribution statistics.
  • Correlation between the data and sales rates was improved by using one-hot encoding for object data, such as city names.
  • Correlation was further improved by the use of lag features of sales. The highest correlation was observed for the mean number of sales for each day of the week for products in each store (day_product_mean) and the mean of the three lag values (lag_days_mean): 0.90 and 0.88 respectively.
  • 6 ML models were tested and evaluated using the MAE score. The Decision Tree model had the lowest MAE score (3.07) and was chosen for the predictions.



