What is random_state?
random_state = 0, 42, or None
You have probably used random_state before if you work in Machine Learning, either while splitting a dataset into training and testing sets or as a hyperparameter. Most people set random_state to 0 or 42. But do you really know what it actually is? Let's get started and find out.
What is it?
In Scikit-learn, random_state controls the shuffling applied to the data before the split. We use it in train_test_split to split data into training and testing datasets. It takes one of the following values.
- None (Default)
It uses the global random state instance from numpy.random. If we call the same function with random_state=None, it will produce different results on every execution.
- An integer
If we use an integer value for random_state, it will produce the same result every time for that integer value. If we change the value of random_state, the result will change.
random_state cannot be negative!
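Here is a quick sketch of the difference between None and a fixed integer (the toy data is made up for illustration):

```python
# None: a fresh shuffle on every call; a fixed integer: the same split every time.
from sklearn.model_selection import train_test_split

data = list(range(1, 11))

# With random_state=None, repeated calls will almost always give different test sets.
_, test_a = train_test_split(data, test_size=0.2, random_state=None)
_, test_b = train_test_split(data, test_size=0.2, random_state=None)
print(test_a, test_b)    # usually different

# With the same integer, repeated calls always give the same test set.
_, test_c = train_test_split(data, test_size=0.2, random_state=42)
_, test_d = train_test_split(data, test_size=0.2, random_state=42)
print(test_c == test_d)  # True
```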
Let’s see how it works
Suppose we have a dataset of 10 numbers, from 1 to 10, and we want to split it into a training dataset and a testing dataset, with the testing dataset being 20% of the whole dataset.
The training dataset gets 8 samples and the testing dataset gets 2. By fixing random_state we ensure that this random process outputs the same result every time, which makes the code reproducible. Otherwise the shuffle would produce different datasets on every run, and it is not good to train the model with different data every time.
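A minimal sketch of that split, assuming scikit-learn's train_test_split:

```python
# Split the numbers 1..10 into 8 training and 2 testing samples.
from sklearn.model_selection import train_test_split

data = list(range(1, 11))
train, test = train_test_split(data, test_size=0.2, random_state=1)
print(len(train), len(test))  # 8 2
print(train, test)            # identical output on every run, because random_state is fixed
```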
Each shuffled dataset is associated with a random_state value, and one random_state value always corresponds to the same fixed dataset. This means that every time we run the code with random_state value 1, it will produce the same training and testing split.
See the image below for better intuition.
Here I want to clear up one thing. I see most people using random_state = 42, and I have used it too. As per the above image, there is one fixed shuffled dataset for random_state value 42, so whenever we use 42 as random_state, it returns that same shuffled dataset.
So, 42 is not a special number for random_state.
Let's see how it can be used when splitting a dataset.
Here, we use a Wine Quality dataset and a Linear Regression model. We keep it simple, because our main focus is random_state, not accuracy.
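Here is a minimal sketch of that setup, assuming the red Wine Quality CSV from the UCI repository and the quality column as the target (the exact dataset variant and features are assumptions):

```python
# Wine Quality + Linear Regression, with the split fixed by random_state=0.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumed data source: UCI red wine quality dataset (semicolon-separated CSV).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=";")

X = df.drop(columns=["quality"])
y = df["quality"]

# random_state=0 makes this exact train/test split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))
```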
In the above code, for random_state = 0, the mean_squared_error is 0.384471197820124. If we try different values for random_state, the error will be different each time.
- For random_state = 1, we get a mean_squared_error of 0.38307198158142
- For random_state = 69, we get a mean_squared_error of 0.47013897077423
- For random_state = 143, we get a mean_squared_error of 0.42062134425032
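A quick way to see this spread, continuing the sketch above (your exact numbers will depend on the dataset and model used):

```python
# Only random_state (and therefore the split) changes between runs.
for seed in [0, 1, 69, 143]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"random_state={seed}: MSE={mse:.5f}")
```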
But how many random states are possible? 🤔
I ran an experiment to see how many unique datasets we get by shuffling the original dataset.
I took a dataset of 5 samples (simply 1, 2, 3, 4, 5) and split it into training and testing datasets 2000 times, with random_state values from 1 to 2000. I stored these 2000 shuffled datasets in a list, and in that list I found 120 unique datasets.
So we can assume that for a dataset of 5 samples there are 120 unique shuffled combinations, which means random_state values from 0 to 119 already cover all of them. If you set random_state greater than 119, it will just give you one of those same 120 unique datasets again.
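Here is a sketch of that experiment (the exact way the author counted unique datasets is an assumption):

```python
# Count the distinct (train, test) splits of [1..5] over many random_state values.
from sklearn.model_selection import train_test_split

data = [1, 2, 3, 4, 5]
unique_splits = set()
for seed in range(1, 2001):
    train, test = train_test_split(data, test_size=0.2, random_state=seed)
    unique_splits.add((tuple(train), tuple(test)))

print(len(unique_splits))  # the author found 120, i.e. 5! orderings of 5 samples
```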
As per my understanding, I can explain it this way: 5 samples can be arranged in 5! = 120 different orders, so there are only 120 distinct shuffled datasets the split can produce.
Why do we need it?
- Imagine we are working on house price prediction. In the dataset, as we go from top to bottom, the number of bedrooms or the area of the apartment keeps increasing; the data is ordered (biased). If we split such data without shuffling, the model will perform well on the training dataset but poorly on the testing dataset. To avoid this we need to shuffle the data, and random_state lets us do that shuffling reproducibly (see the sketch after this list).
- When we split data, we want the resulting datasets to be consistent: if we rerun the code, the training and testing datasets remain the same every time. Suppose I want you to get the same results that I got when you try out my code; for that, random_state is required.
- For different random_state values we can see differences in performance. As we saw in the example above, different random_state values give different mean_squared_error values. If you pick the value of random_state at random and you are lucky, the error score will be lower for that random_state.
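Here is a small sketch of the first point, using a toy ordered target in place of real house prices:

```python
# A target that grows from top to bottom of the dataset (ordered / biased data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)  # e.g. number of bedrooms, sorted ascending
y = np.arange(20)                 # price grows along with it

# Without shuffling, the test set contains only the largest values, never seen in training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
print(y_te)  # [16 17 18 19]

# Shuffling with a fixed random_state gives a representative and reproducible split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
print(y_te)  # a mix of small and large values, identical on every run
```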
Other uses of random state
- K means
In KMeans, random_state determines the random number generation for centroid initialization. We can use an integer value to make the randomness deterministic, which is useful when we want to produce the same clusters every time.
- Random Forest
In Random Forest Classifier and Regression, random_state controls the randomness of the bootstrapping of the samples used when building trees and the sampling of the features considered when looking for the best split at each node.
- Decision Tree
In Decision Tree Classifier or Regression, random_state controls the randomness of how features are considered when splitting nodes, and so it influences the structure of the tree.
- Randomized Search CV
- Stratified KFold
random_state is used in many other places in Machine Learning, but most of the time it is used for the randomness of data or for shuffling. A small sketch of the first two uses is shown below.
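A minimal sketch of fixing random_state in KMeans and Random Forest, on synthetic data (the parameter values are arbitrary):

```python
# Fixing random_state makes the internal randomness of these estimators repeatable.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.ensemble import RandomForestClassifier

# KMeans: same centroid initialization, hence the same clusters, on every run.
X_blobs, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_blobs)

# Random Forest: same bootstrap samples and feature sampling, hence the same trees.
X_clf, y_clf = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_clf, y_clf)

print(labels[:10], clf.score(X_clf, y_clf))
```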
So that's it for random_state!
Check out my other blog on Machine Learning here 👉 Medium.
Thanks for reading! If you liked it, give it a clap and share it.
Follow me on Medium; I'll share more Machine Learning content here. Here's my Twitter; follow and connect with me there, and feel free to DM.