Introduction
For my first Kaggle Playground Series competition, I wanted to build a simple model on a simple dataset, applying skills learned from online courses on DataCamp and, more recently, from completing the Google Data Analytics certificate on Coursera. I recommend the Kaggle Playground Series to anyone wanting to put skills from theoretical classes into practice. This project is a detailed, step-by-step presentation of the analysis and model building. The complete Jupyter Notebook can be accessed here.
1. Project
1.1. Data
The data is available on Kaggle as part of the March 2025 Kaggle Playground Series competition.
We are provided with a sample csv file for submission (see 1.2. Objective), and a dataset already split between a 'train' set (with the target metric) and a 'test' set (without the target metric). The data consists of calendar and meteorological data. There is no extra information about what the fields mean exactly. Here are the 13 fields, with my best guess as to what they represent:
1.2. Objective
The objective of the project and model is to predict whether or not there is rainfall for each day of the 'test' csv file. More specifically, we are to predict the probability of rain. This means that we will not only focus on the binary result (rainfall/no rainfall) but also on the probability between 0 and 1. Therefore, we probably will not favor a metric like precision, recall or accuracy above any other, but rather try to get a high f1-score.
1.3. Plan
Since this is a binary classification problem we have two main avenues of modelling here: a logistic regression model, and a decision-tree based model (decision tree or random forest). However, since we are also required to provide probabilities for the target metric 'rainfall', we will focus on a logistic regression model, as a limitation of decision tree-based models is their weaker ability to provide well-calibrated probability estimates.
2. Dataset exploration
2.1. Data import
First we import the main libraries (pandas, seaborn, matplotlib, sklearn) and tools we need for the project. Then we load the train and test csv files separately.
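As a minimal sketch (the file names follow the competition download; adjust the paths to your setup):

```python
# Core libraries used throughout the project
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the competition files
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")
```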
2.2. Initial data exploration and cleaning
Next, we will (see the code sketch after this list):
- look to understand the variables
- check and correct the datatypes
- rename the columns
- check for missing values
- check for duplicates
- check for outliers
- check the target label distribution
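A quick sketch of how these checks can be run with pandas (the column names follow the competition data):

```python
# Structural checks on both datasets
for name, df in [("train", df_train), ("test", df_test)]:
    print(f"--- {name} ---")
    df.info()                          # datatypes and non-null counts
    print(df.isnull().sum())           # missing values per column
    print("duplicates:", df.duplicated().sum())

# Distribution of the target label in the train set
print(df_train["rainfall"].value_counts(normalize=True))
```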
The data looks to be in the same format for both the train and test datasets. The data types look correct and there do not appear to be any missing values, except for one entry of winddirection in the test dataset. The train data covers 6 years, while the test data covers 2.
There are no missing values except for one entry of winddirection in the test dataset. We could assume it was because there was no wind on that day; however, the data shows windspeed = 17.2 on that day. We could drop the row, but we are tasked with presenting the rainfall probability for every single day in the test dataset. We have a few options to impute the missing value (which might not even matter if we don't use winddirection in our model). Imputing with the mode, the most common winddirection, would make sense; that value is 70. However, at that time of the year, the most common wind direction is 220 degrees. We will therefore impute the missing winddirection value in the test dataset with 220.
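The imputation itself is a one-liner:

```python
# The mode over the whole test set would suggest 70, but the seasonally
# common direction around that day is 220 degrees (see discussion above)
df_test["winddirection"] = df_test["winddirection"].fillna(220)
```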
For a more rigorous consideration of outliers, we combine the training and testing data and "flag" the outliers as belonging to either the train dataset or the test one. We get the representation below.
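A possible way to build this representation, assuming we tag each row with its origin before concatenating (the exact set of metrics plotted here is illustrative):

```python
# Tag each row with its source so outliers can be attributed to a dataset
df_train["source"] = "train"
df_test["source"] = "test"
combined = pd.concat([df_train, df_test], ignore_index=True)

# Long format lets seaborn draw one boxplot panel per metric
numeric_cols = ["pressure", "humidity", "cloud", "sunshine", "windspeed"]
melted = combined.melt(id_vars="source", value_vars=numeric_cols,
                       var_name="metric", value_name="value")
sns.catplot(data=melted, x="source", y="value", col="metric",
            kind="box", col_wrap=3, sharey=False)
plt.show()
```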
As we can see there are quite a few outliers, especially for the cloud metric. We will need to consider whether to drop or modify them when we actually build our model, since some models perform considerably worse with outliers. Another observation is that outliers are not disproportionately represented in either the train or the test dataset, meaning the sets were most likely separated randomly rather than through an engineered or deliberate split.
Finally, let's check the distribution of the target label 'rainfall' (0 or 1) to see if there is an over-representation of one of the two classes. We have about 3 times as many entries for rainfall (1) as for no rainfall (0), a 75-25 split. However, that is not enough to qualify this dataset as skewed or imbalanced.
2.3. Advanced data exploration
Next we are going to create a number of visualizations to appreciate the distribution of the data and maybe draw some early observations as to what might influence rainfall. We will now focus on our train dataset since it is the one we will use for model building. We will also assess multicollinearity, since avoiding it is a prerequisite for model building. Finally, we will get some ideas for feature engineering.
Mintemp, temperature and maxtemp:
- As we can see, mintemp, maxtemp and temperature seem to be highly positively correlated. We will likely not need to keep all three metrics. Furthermore, the distribution of the rainfall labels seems to be fairly uniform across all temperatures, so there is no obvious impact of temperature on rainfall.
- A potentially interesting metric, the temperature change or total temperature delta during a given day (max minus min), does not seem to be correlated with the temperature itself. However, there also does not appear to be any obvious impact of this metric on the target metric (see the sketch below).
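Engineering this delta is straightforward; 'temp_var' is the name we will reuse later:

```python
# Daily temperature swing: difference between the day's max and min
df_train["temp_var"] = df_train["maxtemp"] - df_train["mintemp"]

# How does it relate to the mean temperature and to rainfall?
print(df_train[["temp_var", "temperature", "rainfall"]].corr())
```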
Humidity, pressure and cloud coverage:
- The 'pressure' histogram above shows that the data is roughly normally distributed. There does not seem to be any obvious link between pressure and rainfall, though the comparative percentages do seem to change a little at high pressures, with rainfall being comparatively less frequent there (but there are also fewer data points).
- The humidity histogram shows data skewed towards higher values of humidity. There also seems to be a correlation between higher values of humidity and an increased chance of rainfall.
- Likewise, the cloud histogram shows data skewed towards higher values of cloud coverage, and there also seems to be a correlation between higher values of cloud coverage and an increased chance of rainfall.
As shown on the scatterplot above, the correlation between cloud coverage and humidity is not strong but it is present, with higher values of humidity coinciding with higher values of cloud coverage. As discussed before, such higher values also seem to correlate with an increased chance of rainfall.
Sunshine, dewpoint, windspeed and winddirection:
Let's build similar histograms for these 4 metrics, which might have a less obvious correlation with rainfall.
- Sunshine: as expected, a lower number of sunshine hours is correlated with a higher chance of rainfall. We will look below at the correlation between sunshine and cloud coverage.
- Dewpoint: the distribution of dewpoint is not normal but skewed to the right. There does not look to be any obvious correlation with rainfall, though lower values of dewpoint seem to reduce the chance of rainfall (but there are fewer entries).
- Windspeed: the distribution is close to normal, though slightly skewed to the left. Here again, the correlation with rainfall is not obvious, though stronger winds seem to increase the chance of rainfall. We could look to engineer a binary or categorical (and ordinal) feature for wind strength, taking the Beaufort scale as an example (e.g. light breeze for 6-11 km/h, moderate for 20-28 km/h, etc.).
- Winddirection: the distribution is far from normal, which makes sense as we would expect the wind to mostly come from a few principal directions. We could reclassify the values into a categorical (but non-ordinal) feature, for example mostly east, mostly north, etc. However, the proportions of rainfall do not seem to be correlated with wind direction.
Wind - categorical approach:
Let's create categorical variables to deal with wind speed and direction. The highest windspeed is 59.5 km/h. According to the Beaufort scale, that is category 7, "moderate gale". Instead of creating 8 categories we can simplify a bit and arbitrarily create 4: 1 for under 5 km/h (categories 0 and 1 on the Beaufort scale), 2 for 6-19 km/h, 3 for 20-38 km/h and 4 for 39-61 km/h.
For winddirection, we could also create 8 categories (N, NE, E, etc.) but will simplify and create 4: North (0) (0-45 and 315-360 degrees), East (1) (45-135), South (2) (135-225), West (3) (225-315).
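A sketch of how these two engineered features can be built with pandas (the bin edges follow the groupings above):

```python
# Wind speed buckets loosely following the Beaufort groupings above
speed_bins = [0, 5, 19, 38, 61]
df_train["windspeed_cat"] = pd.cut(df_train["windspeed"], bins=speed_bins,
                                   labels=[1, 2, 3, 4],
                                   include_lowest=True).astype(int)

# Wind direction quadrants; North wraps around 0/360 degrees
def direction_cat(deg):
    if deg < 45 or deg >= 315:
        return 0  # North
    if deg < 135:
        return 1  # East
    if deg < 225:
        return 2  # South
    return 3      # West

df_train["winddirection_cat"] = df_train["winddirection"].apply(direction_cat)
```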
Converting continuous measures to discrete categorical ones can help reduce noise and draw better conclusions. Here we can see that:
- windspeed does seem to be correlated with rainfall, with stronger wind categories corresponding to higher probabilities of rainfall
- wind direction does not seem to have much impact on rainfall, with the proportions being very similar across dominant wind directions. The chance of rainfall is slightly increased for easterly winds.
Day:
Finally, let's have a look at the dates. It is unlikely that there is more or less chance of rainfall depending on the day of the week. But it seems intuitive, even though we don't know where the data was sampled, that some months of the year would be more prone to rainfall. Since the dataset uses 365 days per year, we can assume day 1 is Jan 1st and re-classify the days into months.
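One way to derive the month, assuming the 'day' column counts days of the year starting at Jan 1st (ignoring leap years):

```python
# Cumulative day counts at the end of each month, no leap years
month_ends = [31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365]

# Fold multi-year day counts back into a 1-365 day-of-year index
day_of_year = ((df_train["day"] - 1) % 365) + 1
df_train["month"] = pd.cut(day_of_year, bins=[0] + month_ends,
                           labels=range(1, 13)).astype(int)
```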
Correlation matrix:
We plot a final correlation matrix to analyse the different metrics we have and summarize.
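A typical way to produce such a matrix with seaborn:

```python
# Heatmap of pairwise correlations, engineered features included
corr = df_train.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.show()
```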
Insights:
- As discussed before, some of the metrics are correlated with one another: maxtemp, temperature, mintemp and dewpoint are all strongly correlated. To simplify the model we will likely only keep one feature from this group of 4.
- Pressure is also strongly negatively correlated with this group of 4. We could look to drop it, but for explanatory purposes it could make sense to keep it.
- There is also some moderate correlation between the temperatures and sunshine.
- As observed before, cloud coverage and humidity are moderately correlated. Sunshine and cloud are strongly (negatively) correlated. To simplify the model, we could keep only one of the two. As cloud has a marginally stronger correlation with rainfall than sunshine does, we could keep cloud. For explanatory purposes we could also keep both.
- Obviously, windspeed_cat and winddirection_cat have a strong correlation with the corresponding continuous variables. We will keep the engineered features.
- Finally, looking at potentially strong predictors correlated with rainfall (among continuous variables), humidity, cloud and sunshine seem to be by far the strongest ones. We will keep all 3. Pressure, temperature and the wind parameters seem to have a lesser correlation; we will simplify there.
2.4. Feature selection
Before moving on to the modelling phase we need to select our final set of features and replicate the engineered features on our test dataset (see the sketch after this list). We will keep the following set:
- 'pressure'
- 'temperature'
- 'humidity'
- 'cloud'
- 'sunshine'
- 'temp_var'
- 'windspeed_cat'
- 'winddirection_cat'
- 'month'
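A minimal sketch of this selection step (it assumes the same feature engineering steps were already applied to the test dataframe):

```python
features = ["pressure", "temperature", "humidity", "cloud", "sunshine",
            "temp_var", "windspeed_cat", "winddirection_cat", "month"]

X = df_train[features]
y = df_train["rainfall"]

# Same engineered columns must exist on the test set before this line
df_test = df_test[features]
```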
2.5. Outliers treatment
Since we are using a logistic regression model, which is quite sensitive to outliers, we will need to remove them. We need to check the outliers over the overall dataset though, not just the train one; of course, we will only remove the outliers in the train section. We need to remove rows in the train dataset that have outliers in any of the columns, meaning the 'pressure', 'humidity', 'cloud' and 'temp_var' columns. We will keep this train dataset separately, train a model on each of the two dataframes, with and without outliers, and see how they perform.
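A sketch of this removal using the usual 1.5 × IQR fences, computed over the combined data as discussed:

```python
# IQR fences from the combined train + test values, but rows are
# dropped from the training data only
outlier_cols = ["pressure", "humidity", "cloud", "temp_var"]
mask = pd.Series(True, index=X.index)
for col in outlier_cols:
    all_values = pd.concat([X[col], df_test[col]])
    q1, q3 = all_values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask &= X[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

X_no_outliers, y_no_outliers = X[mask], y[mask]
```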
2.6. Encoding
Next, we need to encode the non-numeric metrics. The only candidates are windspeed_cat (which is ordinal, so it can stay as it is) and winddirection_cat, which needs to be encoded as it is not ordinal. We need to apply this encoding to the test dataset as well.
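One-hot encoding with pandas could look like this (the prefix name is arbitrary; the no-outliers variant gets the same treatment):

```python
# One-hot encode the non-ordinal wind direction; windspeed_cat and
# month are ordinal and stay as integers
X_final = pd.get_dummies(X, columns=["winddirection_cat"], prefix="winddir")
df_test = pd.get_dummies(df_test, columns=["winddirection_cat"],
                         prefix="winddir")
# Align columns in case a category is absent from one of the datasets
X_final, df_test = X_final.align(df_test, join="left", axis=1, fill_value=0)

# Same treatment for the no-outliers training variant
X_final_no_outliers = pd.get_dummies(X_no_outliers,
                                     columns=["winddirection_cat"],
                                     prefix="winddir")
```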
3. Model fitting
We are now (almost) ready to fit models. We already have a training and a testing dataset that are split. However, we will want to validate and measure our model's performance, so we will further split our training dataset into a training dataset and a validation dataset that we will hold out for performance evaluation.
3.1. Data scaling
The last stage of preprocessing before fitting models is to scale the fields. Pressure, temperature and humidity all have values on very different scales. Although scaling is not absolutely necessary when fitting a logistic regression model, it is still recommended. We will use StandardScaler, as it is often the preferred choice for logistic regression: it centers the features at 0 and scales them to unit variance. We could also have used MinMaxScaler. We do need to be careful with two things:
- the scaler fitted to the train data should then be applied identically to the test data
- the ordinal metrics, windspeed_cat and month, do not need to be scaled
At the end of preprocessing, we have two sets of data:
- X_final, y and df_test for the model that includes all training data (with outliers)
- X_final_no_outliers, y_no_outliers, df_test_no_outliers for the model without outliers
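A sketch of the scaling step for the with-outliers set; the no-outliers pair would get the same treatment with its own scaler:

```python
from sklearn.preprocessing import StandardScaler

# Scale only the continuous columns; ordinal/encoded ones stay as-is
cont_cols = ["pressure", "temperature", "humidity", "cloud",
             "sunshine", "temp_var"]
scaler = StandardScaler()
X_final[cont_cols] = scaler.fit_transform(X_final[cont_cols])
# The scaler fitted on the training data is reused unchanged on the test set
df_test[cont_cols] = scaler.transform(df_test[cont_cols])
```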
3.2. Train/validation splits
We further split X_final, y on one hand and X_final_no_outliers, y_no_outliers on the other hand to obtain a train dataset and a validation dataset for each.
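A sketch using scikit-learn's train_test_split (the 80/20 split and random seed are illustrative choices):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% for validation; stratify to preserve the 75/25 class split
X_train, X_val, y_train, y_val = train_test_split(
    X_final, y, test_size=0.2, stratify=y, random_state=42)
```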
3.3. Cross validation for regularization
Lastly, regularization can help prevent overfitting in the logistic regression model. We have two options:
- L1 (or Lasso): can be used but tends to produce sparse models. It can be useful when trying to perform feature selection and identify the most important features.
- L2 (or Ridge): is more commonly used for logistic regression; it adds a penalty term proportional to the square of the model coefficients.
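One convenient way to combine this regularization with cross-validation is scikit-learn's LogisticRegressionCV; a sketch (the number of C values and folds are illustrative):

```python
from sklearn.linear_model import LogisticRegressionCV

# Search over 10 values of the regularization strength C with 5-fold CV,
# optimizing log loss since we care about probability estimates
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l2",
                             scoring="neg_log_loss", max_iter=1000)
model.fit(X_train, y_train)
print("chosen C:", model.C_[0])
```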
3.4. Model fitting and evaluation
We fit the models and generate a table to compare the results using both the data with and without outliers. This is the results table we get:
Looking at the results above we can see that:
- in both cases the validation scores are slightly lower than the cross-validation ones. Only slightly though, so there is no need to worry
- there is no significant difference between using the dataset with outliers and the one without. The neg_log_loss is actually slightly lower (so better) with the outliers, and so is the roc_auc score.
The confusion matrix shows that the model predicts quite a few false positives (47 days where it did not rain but rain was predicted) but considerably fewer false negatives (16 days where it rained but no rain was predicted), considering the 75%-25% split between rainfall and non-rainfall days. The model seems to be pretty good at predicting rainfall but less so at predicting no rainfall. Let's confirm that with the classification report.
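Both diagnostics come straight from scikit-learn:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hard 0/1 predictions on the held-out validation set
y_pred = model.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
```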
The classification report shows that the model achieved a precision of 85%, recall of 86%, f1-score of 85% and accuracy of 86%. These are all pretty good scores. The difference in metrics between predicted rainfall and predicted no rainfall does confirm that the model is better at predicting rainfall than the absence of it.
Finally, let's calculate prediction probabilities on the validation dataset using our model and display the ROC-AUC curve. The roc_curve() function computes the false positive rate (FPR) and true positive rate (TPR) at various threshold settings. The auc() function is then used to calculate the area under the ROC curve (ROC-AUC), which is a measure of the model's discriminative ability. As we can see, the model is much better than random guessing (blue line), but an ideal model would have an even sharper curve at the beginning. This chart confirms our earlier observation that our model is not perfect at limiting false positives.
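A sketch of the computation and plot described above:

```python
from sklearn.metrics import roc_curve, auc

# Predicted probability of the positive class (rainfall = 1)
y_proba = model.predict_proba(X_val)[:, 1]
fpr, tpr, _ = roc_curve(y_val, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"model (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```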
4. Predicting on the test set
After all this work we can go ahead and use our model to predict on the test dataset. It is not a perfect model, but it is pretty good for now. We could (and should) iterate by adding or removing some features and engineering new ones. For now, we will use it to predict on the held-out test dataset. The resulting file is available on my repository.
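A sketch of the final prediction step (it assumes the sample submission file has a 'rainfall' column to fill in):

```python
# Predicted probability of rainfall for every day of the test set
test_proba = model.predict_proba(df_test)[:, 1]

submission = sample_submission.copy()
submission["rainfall"] = test_proba
submission.to_csv("submission.csv", index=False)
```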
5. Bonus - Relative feature importance
We can plot a relative feature importance chart showing the main metrics that influenced model training and their relative strength. As shown here, cloud coverage and sunshine (closely negatively correlated) are the biggest predictors. Humidity, temperature and pressure, also correlated, follow. Our engineered categorical features, as well as the month and temperature variation, did not seem to have high importance.
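One way to build such a chart from the fitted model: since the continuous features were standardized, the magnitude of each coefficient gives a rough view of its relative importance (the ordinal columns were left unscaled, so their coefficients are only roughly comparable):

```python
# Sort features by absolute coefficient size and plot them
importance = pd.Series(model.coef_[0], index=X_train.columns)
order = importance.abs().sort_values().index
importance.reindex(order).plot.barh()
plt.xlabel("coefficient (signed)")
plt.show()
```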

Click here to access the related GitHub repository