The head of the dataframe
id | date | city_name | store_id | category_id | product_id | price | weather_desc | humidity | temperature | pressure | sales |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2021-07-29 | Moscow | 1 | 1 | 1 | 4.79 | partly cloudy, light rain | 61.9375 | 23.1875 | 741.0000 | 26 |
2 | 2021-07-30 | Moscow | 1 | 1 | 1 | 4.79 | partly cloudy, light rain | 70.2500 | 22.1875 | 740.3125 | 37 |
3 | 2021-07-31 | Moscow | 1 | 1 | 1 | 4.79 | partly cloudy | 52.6250 | 21.8125 | 741.6250 | 25 |
4 | 2021-08-01 | Moscow | 1 | 1 | 1 | 4.79 | cloudy, light rain | 87.4375 | 20.0625 | 743.3125 | 26 |
5 | 2021-08-02 | Moscow | 1 | 1 | 1 | 4.79 | partly cloudy | 66.1875 | 23.4375 | 739.6250 | 22 |
The two dataframes (train.csv and test.csv) share a common list of 11 variables. The "sales" variable is present only in the train dataframe.
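The column overlap described above can be checked with a few lines of pandas. This is a minimal sketch: the helper name `compare_columns` is hypothetical, and the two small frames only stand in for train.csv and test.csv (a few columns, values borrowed from the rows shown above).

```python
import pandas as pd


def compare_columns(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the shared columns and the columns found in only one frame."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "common": sorted(train_cols & test_cols),
        "train_only": sorted(train_cols - test_cols),
        "test_only": sorted(test_cols - train_cols),
    }


# Toy stand-ins for train.csv and test.csv (illustrative subset of columns)
toy_train = pd.DataFrame({"id": [1, 2], "price": [4.79, 4.79], "sales": [26, 37]})
toy_test = pd.DataFrame({"id": [3], "price": [4.79]})

column_report = compare_columns(toy_train, toy_test)
print(column_report)
```

On the real files, the frames would be loaded with `pd.read_csv` first; `train_only` is then expected to contain just the target column.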
Table of variables
№ | Variable | Description | Datatype | NAs | Unique values |
---|---|---|---|---|---|
1 | id | Unique identifier representing a bundle (product_id, store_id, date). Each id occurs only once in the data | int64 | 0 | 666676 |
2 | date | Date of sale | object | 0 | 200 |
3 | city_name | Name of the city where the sale took place | object | 0 | 10 |
4 | store_id | Unique identifier for each store | int64 | 0 | 160 |
5 | category_id | Product category | int64 | 0 | 9 |
6 | product_id | Unique identifier for each type of product | int64 | 0 | 32 |
7 | price | Price of the product | float64 | 0 | 29 |
8 | weather_desc | Weather description | object | 0 | 16 |
9 | humidity | Humidity in the city on the day of sale | float64 | 0 | 916 |
10 | temperature | Temperature in the city on the day of sale | float64 | 0 | 505 |
11 | pressure | Atmosphere pressure in the city on the day of sale | float64 | 0 | 344 |
12 | sales | Number of product sales (the target variable to predict) | int64 | 0 | 249 |
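A table like the one above (datatype, number of NAs, number of unique values per column) can be assembled directly from a DataFrame. The sketch below assumes pandas; `summarize_variables` is a hypothetical helper, and the sample frame reuses a few columns from the first rows of the data.

```python
import pandas as pd


def summarize_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Build a per-column summary: datatype, missing values, unique values."""
    return pd.DataFrame({
        "Datatype": df.dtypes.astype(str),
        "NAs": df.isna().sum(),
        "Unique values": df.nunique(),
    })


# Small sample taken from the first rows of the train data
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2021-07-29", "2021-07-30", "2021-07-31"],
    "city_name": ["Moscow", "Moscow", "Moscow"],
    "price": [4.79, 4.79, 4.79],
    "sales": [26, 37, 25],
})

variable_summary = summarize_variables(sample)
print(variable_summary)
```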
There are 666 676 observations in the train dataframe and 24 836 observations in the test dataframe.
The train dataframe covers a period of about seven months (200 days), from July 29 to February 13.
The test dataframe covers the following week, from February 14 to February 20.
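Because the date column is stored as strings (object), the bounds of a period are easiest to verify after parsing. A minimal sketch, assuming pandas and the `YYYY-MM-DD` format seen in the first rows; `date_span` is a hypothetical helper name.

```python
import pandas as pd


def date_span(dates: pd.Series):
    """Return the first date, the last date, and the inclusive number of days."""
    parsed = pd.to_datetime(dates, format="%Y-%m-%d")
    first, last = parsed.min(), parsed.max()
    n_days = (last - first).days + 1  # +1 so that both endpoints are counted
    return first, last, n_days


# Dates from the first five rows of the train data
head_dates = pd.Series(
    ["2021-07-29", "2021-07-30", "2021-07-31", "2021-08-01", "2021-08-02"]
)
first_day, last_day, n_days = date_span(head_dates)
print(first_day.date(), last_day.date(), n_days)
```

Applied to the full date column, the day count can be compared with the number of unique dates to see whether any calendar days are missing.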
The dataframe contains information from 10 cities:
Moscow, St. Petersburg, Krasnodar, Samara, Nizhny Novgorod, Rostov-on-Don, Volgograd, Voronezh, Kazan, and Yekaterinburg.
Statistical description of numeric data
Statistic | price | humidity | temperature | pressure | sales |
---|---|---|---|---|---|
mean | 5.1 | 74.3 | 4.9 | 751 | 10 |
min | 1.9 | 13.8 | -24.0 | 710 | 0 |
max | 18.6 | 100.0 | 34.3 | 779 | 275 |
25% | 3.0 | 59.8 | -3.3 | 745 | 2 |
50% | 4.1 | 79.7 | 4.4 | 751 | 5 |
75% | 6.0 | 92.4 | 11.8 | 758 | 12 |
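The statistics above have the same shape as the output of pandas `describe()`. The sketch below shows one way to get a table in this layout; `describe_numeric` is a hypothetical helper, and the sample frame holds only the five rows shown at the top, so its numbers differ from the full-data table.

```python
import pandas as pd


def describe_numeric(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Select mean, min, max and quartiles for the given numeric columns."""
    rows = ["mean", "min", "max", "25%", "50%", "75%"]
    return df[columns].describe().loc[rows].round(1)


# First five rows of the train data (subset of columns)
head_rows = pd.DataFrame({
    "price": [4.79, 4.79, 4.79, 4.79, 4.79],
    "temperature": [23.1875, 22.1875, 21.8125, 20.0625, 23.4375],
    "sales": [26, 37, 25, 26, 22],
})

stats = describe_numeric(head_rows, ["price", "temperature", "sales"])
print(stats)
```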
Pearson correlation between sales and other columns
Column name | sales |
---|---|
sales | 1.00 |
product_id | 0.14 |
humidity | 0.13 |
pressure | -0.06 |
temperature | -0.07 |
id | -0.08 |
store_id | -0.09 |
category_id | -0.11 |
price | -0.19 |
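A ranking like this one can be produced by taking the "sales" column of the Pearson correlation matrix and sorting it. This is a minimal sketch assuming pandas; `correlation_with_target` is a hypothetical helper, and the sample uses the five rows shown at the top, so the coefficients will not match the full-data values.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "sales") -> pd.Series:
    """Pearson correlation of every numeric column with the target, sorted."""
    numeric = df.select_dtypes(include="number")  # drop text columns such as city_name
    return numeric.corr(method="pearson")[target].sort_values(ascending=False)


# First five rows of the train data (subset of columns)
head_rows = pd.DataFrame({
    "city_name": ["Moscow"] * 5,
    "humidity": [61.9375, 70.2500, 52.6250, 87.4375, 66.1875],
    "temperature": [23.1875, 22.1875, 21.8125, 20.0625, 23.4375],
    "pressure": [741.0000, 740.3125, 741.6250, 743.3125, 739.6250],
    "sales": [26, 37, 25, 26, 22],
})

sales_corr = correlation_with_target(head_rows)
print(sales_corr)
```

Columns such as id, store_id, category_id and product_id are identifiers, so their Pearson coefficients reflect how the labels happen to be numbered rather than a numeric relationship with sales.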
Population and salary levels in cities
№ | City | Population (millions) | Salary (thousands) |
---|---|---|---|
1 | Moscow | 12.66 | 111.1 |
2 | St. Petersburg | 5.38 | 76.0 |
3 | Krasnodar | 0.95 | 40.8 |
4 | Nizhny Novgorod | 1.26 | 41.5 |
5 | Volgograd | 1.00 | 38.1 |
6 | Kazan | 1.26 | 44.9 |
7 | Samara | 1.14 | 42.9 |
8 | Rostov-on-Don | 1.14 | 39.1 |
9 | Voronezh | 1.05 | 40.9 |
10 | Yekaterinburg | 1.50 | 48.4 |
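The text does not say how this table was used. One plausible use is to attach it to the sales data as extra city-level features; the sketch below assumes that, using pandas. The column names `population_m` and `salary_k` and the helper `add_city_features` are hypothetical.

```python
import pandas as pd

# Two rows of the city table above (column names are assumptions)
city_info = pd.DataFrame({
    "city_name": ["Moscow", "St. Petersburg"],
    "population_m": [12.66, 5.38],
    "salary_k": [111.1, 76.0],
})


def add_city_features(df: pd.DataFrame, cities: pd.DataFrame) -> pd.DataFrame:
    """Left-join city-level features onto each sales row by city name."""
    # validate guards against duplicate city rows silently multiplying sales rows
    return df.merge(cities, on="city_name", how="left", validate="many_to_one")


# Two rows from the head of the train data
sales_rows = pd.DataFrame({"city_name": ["Moscow", "Moscow"], "sales": [26, 37]})
enriched = add_city_features(sales_rows, city_info)
print(enriched)
```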
Mean absolute error (MAE) of the ML models
Model | MAE |
---|---|
Decision Tree | 3.07 |
Linear Regression | 3.78 |
KNN | 4.17 |
Random Forest | 3.09 |
Gradient Boosting | 3.63 |
Mean of all regressors | 3.37 |
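MAE is the average absolute difference between predicted and actual sales. The last row is not the arithmetic mean of the five MAE values (that would be about 3.55), so it presumably reports the MAE of predictions averaged across the models; that reading is an assumption. The sketch below illustrates both ideas in plain Python. The true values come from the first rows of the data, while `pred_a` and `pred_b` are made-up predictions of two hypothetical models.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def average_predictions(*model_predictions):
    """Element-wise mean of the predictions of several models."""
    return [fmean(values) for values in zip(*model_predictions)]


y_true = [26, 37, 25, 26, 22]   # sales from the first five rows
pred_a = [24, 30, 27, 26, 25]   # hypothetical model A
pred_b = [28, 34, 25, 30, 21]   # hypothetical model B

ensemble = average_predictions(pred_a, pred_b)
mae_a = mean_absolute_error(y_true, pred_a)
mae_b = mean_absolute_error(y_true, pred_b)
mae_ensemble = mean_absolute_error(y_true, ensemble)
print(mae_a, mae_b, mae_ensemble)
```

In this toy case the averaged predictions score better than either model alone, but that is not guaranteed in general: in the table above the averaged result sits between the best and the worst single model.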
The first 5 rows of the dataframe with predicted sales
id | prediction |
---|---|
666677 | 17 |
666678 | 28 |
666679 | 26 |
666680 | 22 |
666681 | 25 |
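The predictions in this table are whole numbers, while regressors usually return floats. One way to get from raw model output to such a table is sketched below, assuming pandas. Clipping at zero and rounding are assumptions about the post-processing, `build_predictions` is a hypothetical helper, and the raw values are made up.

```python
import pandas as pd


def build_predictions(test_ids, raw_predictions) -> pd.DataFrame:
    """Turn raw float predictions into non-negative integer sales counts."""
    preds = (
        pd.Series(raw_predictions, dtype="float64")
        .clip(lower=0)      # sales cannot be negative
        .round()            # sales are counts, so round to whole units
        .astype("int64")
    )
    return pd.DataFrame({"id": list(test_ids), "prediction": preds})


# Hypothetical raw model output for the first three test ids
predictions = build_predictions([666677, 666678, 666679], [17.2, 27.6, -0.4])
print(predictions)
```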