Creating an optimal model that strikes a balance between underfitting and overfitting requires careful consideration. To assess how well our model performs on new, unseen data, we employ the train-test split technique and cross-validation.
Train-Test Split:
To evaluate a model’s performance, we split the dataset into a training set (approximately 70-90% of the data) and a test set (10-30%). In sklearn, the train_test_split function from sklearn.model_selection is utilized for this purpose.
Key Parameters:
The “stratify” option keeps the same ratio of class labels in both the training and test sets, which is especially important when the classes are imbalanced. The “test_size” option sets the size of the test set (e.g., 0.2 means 20%). The “shuffle” option, True by default, shuffles the data before splitting.
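A minimal sketch of such a split, using the Iris dataset from sklearn.datasets purely as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing; stratify=y keeps the class
# proportions the same in both splits, and shuffle (True by default)
# randomizes the rows before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```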
Cross-Validation (CV):
When tuning hyperparameters, evaluating on the test set repeatedly lets information about it leak into our choices, so we end up overfitting to the test set. To avoid this, we employ cross-validation: a validation set (or rotating validation folds) is used to fine-tune hyperparameters, and methods like K-fold CV give a more robust estimate of performance.
3-fold CV
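As a rough illustration of the 3-fold idea, the snippet below scores a model with cross_val_score; the LogisticRegression model and the Iris data are stand-ins chosen for this sketch, not part of the original post:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 3-fold CV: the data is split into 3 folds; each fold serves once as
# the validation set while the other two folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
print(scores, scores.mean())
```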
CV Methods:
Leave One Out (LOO)
Leave P Out (LPO)
ShuffleSplit
KFold
StratifiedKFold (for classification tasks with class imbalance)
StratifiedShuffleSplit
GroupKFold (keeps all samples from the same group in the same fold; see the sketch after this list)
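To make the differences concrete, here is a small sketch comparing three of these iterators on a toy array; the labels and groups are made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])        # class labels
groups = np.array([1, 1, 2, 2, 3, 3])   # e.g., samples from the same patient

# KFold ignores labels and groups; StratifiedKFold preserves the class
# ratio in each fold; GroupKFold keeps a group's samples in one fold.
for cv in (KFold(n_splits=3), StratifiedKFold(n_splits=3), GroupKFold(n_splits=3)):
    name = type(cv).__name__
    for train_idx, test_idx in cv.split(X, y, groups):
        print(name, "test indices:", test_idx)
```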
Cross-Validation for Time Series Data:
Time series data have a temporal order and dependence between observations, so randomly shuffling and splitting them is not meaningful: the training folds must not contain observations from the future of the test fold. Scikit-learn’s TimeSeriesSplit, along with other methods such as nested CV and hv-block CV, can be used for reliable evaluation.
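A brief sketch of TimeSeriesSplit on a toy sequence (the 10-step array is just an illustrative placeholder); note how each training window contains only observations that precede the test window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten consecutive time steps; each split trains only on the past and
# validates on the following block, so no future data leaks into training.
X = np.arange(20).reshape(10, 2)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```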
Conclusion
In this post, we have explained:
Train-test split
Cross-validation
Different approaches to CV
Cross-validation for time series data