Pursuit of the Ideal Machine Learning Algorithm
In a recent research study, twelve open-source datasets were used, representing a diverse mix of data types and complexity. The study focused on Random Forest and XGBoost models, as these algorithms are known for their strong performance in predictive analytics.
The study applied a 0.3% accuracy threshold when sorting models. All performance results, totalling over 25,200 predictions, were computed on held-out test data.
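As an illustration, and assuming the threshold is used to treat models whose test accuracies fall within 0.3 percentage points of the current leader as tied (the study's exact tie-breaking rule is not spelled out here), a minimal ranking sketch might look like this:

```python
def rank_with_threshold(accuracies, threshold=0.003):
    """Rank models by test accuracy; models within `threshold` of the
    current leader share the leader's rank (an assumed interpretation)."""
    ranked = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
    ranks, current_rank, leader_acc = {}, 0, None
    for name, acc in ranked:
        if leader_acc is None or leader_acc - acc > threshold:
            current_rank += 1
            leader_acc = acc
        ranks[name] = current_rank
    return ranks

# Hypothetical accuracies: RF and XGB are within 0.3% of each other, so they tie.
print(rank_with_threshold({"rf": 0.912, "xgb": 0.910, "log": 0.871}))
# {'rf': 1, 'xgb': 1, 'log': 2}
```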
Performance Comparison between Random Forest and XGBoost
When comparing default configurations with hyperparameter tuning, Random Forest performs well out of the box without extensive tuning and is less sensitive to hyperparameter changes. With default settings it offers a good balance of accuracy and generalisation, often making it the strong baseline performer.
XGBoost, on the other hand, generally shows significant performance gains when its hyperparameters are carefully tuned. Its rich parameter set and capacity for fine-grained control mean that tuning plays a crucial role in unlocking its full predictive power.
In multiple studies, tuned XGBoost models have substantially outperformed logistic regression and other classical models in accuracy and recall, indicating better generalisation and prediction quality. Random Forest, however, may still come out ahead in some balanced prediction contexts, or when interpretability and computational simplicity are prioritised.
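A minimal sketch of this comparison, using scikit-learn and the xgboost package on a stand-in dataset (the search grid below is illustrative; the study's actual datasets and search space are not reproduced here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Random Forest: default settings, no tuning.
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# XGBoost: small illustrative hyperparameter grid searched with 5-fold CV.
grid = {"max_depth": [3, 6], "learning_rate": [0.05, 0.1, 0.3], "n_estimators": [100, 300]}
xgb = GridSearchCV(XGBClassifier(eval_metric="logloss", random_state=0), grid, cv=5)
xgb.fit(X_tr, y_tr)

print("Random Forest (default) test accuracy:", rf.score(X_te, y_te))
print("XGBoost (tuned)         test accuracy:", xgb.score(X_te, y_te))
```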
Extreme Ensembles and the Universal Model
To further enhance predictive performance, the study built extreme ensembles with voting and stacking classifiers. The results showed that no single model holds the top rank across all of these datasets, but one universal model, the XGB_SVM_LOG STACK extreme ensemble, emerged out of the 'noise': the Almost-Free Lunch.
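Judging from its name, the XGB_SVM_LOG STACK presumably stacks XGBoost, an SVM, and logistic regression as base learners; the meta-learner used in the study is not stated, so logistic regression is assumed in the sketch below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Base learners implied by the ensemble's name: XGBoost, SVM, logistic regression.
estimators = [
    ("xgb", XGBClassifier(eval_metric="logloss", random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("log", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
]

# Meta-learner: assumed to be logistic regression (not specified in the study).
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
print("Stacking ensemble test accuracy:", stack.score(X_te, y_te))
```

A VotingClassifier over the same base learners would be the analogous voting-ensemble variant.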
From a minimum sample ratio of 12 up to roughly 100, the universal model, the XGB_SVM_LOG STACK, is recommended; above that, switch to the hypertuned XGBoost model. Note that the hypertuned XGBoost model performs well only at large sample ratios, whereas the extreme ensemble requires no tuning at all. A simple selection rule is sketched below.
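Assuming "sample ratio" means training rows divided by the number of features (the study's exact definition is not restated here), the recommendation reduces to a short rule of thumb:

```python
# Hedged sketch: model selection from the sample ratio, assuming
# sample ratio = number of rows / number of features.
def recommend_model(n_rows: int, n_features: int) -> str:
    ratio = n_rows / n_features
    if ratio < 12:
        return "insufficient data: below the study's minimum sample ratio"
    if ratio <= 100:
        return "XGB_SVM_LOG stacking ensemble (no tuning required)"
    return "hypertuned XGBoost"

print(recommend_model(n_rows=2000, n_features=30))   # ratio ~67  -> stacking ensemble
print(recommend_model(n_rows=50000, n_features=30))  # ratio ~1667 -> hypertuned XGBoost
```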
The Importance of Data Quality
The research emphasised the Prime Directive of data quality: spending time on improving data quality matters more than exploring yet another algorithm, because with poor-quality data, all algorithms will have a learning disability.
All missing values were imputed with MissForest, except for Telco Churn, whose missing values were dropped.
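The study used the MissForest package itself; the sketch below approximates MissForest-style imputation with scikit-learn's IterativeImputer wrapped around a random-forest estimator, which follows the same iterative idea.

```python
# Hedged sketch: MissForest-style imputation via scikit-learn's IterativeImputer
# with a random-forest estimator (a stand-in for the MissForest package).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 4.0]])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```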
Future Research
Future work will explore the sample ratio at which the hypertuned XGBoost model consistently lands in the top rank. More research is also required to develop comparative analyses between Performance Probability Graphs (PPGs), perhaps based on their unique ability to separate error components from the modelling process.
Additionally, the research aims to investigate the sample-quality source of the PPG max range, given that two of the three models in the extreme ensemble are robust to outliers.
The study concluded that there is no perfect model, but that these two might work across all datasets in a consecutive series: the Almost-Free Lunch is here.
References
- Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., ... others. (2015). Xgboost: extreme gradient boosting.
- Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems?
- Delmaster, R., & Hancock, M. (2001). Data Mining Explained.
- Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from Data.