Syntax Sunday: scikit-learn for NHL Stats
Welcome
For this week's #SyntaxSunday, we will be exploring scikit-learn, a widely used open-source machine learning library for Python. This powerful library offers a wide range of algorithms and tools for building and deploying machine learning models with ease.
Anybody can take advantage of scikit-learn's intuitive interface, comprehensive documentation, and seamless compatibility with other Python libraries such as NumPy and pandas. Its tools cover everything from data preprocessing to model evaluation, making the machine learning process easier for both beginners and experienced data scientists!
After gaining experience with the NHL API for the Hockey Stats and Analysis Expert GPT, I will once again use it to demonstrate how to implement scikit-learn. We'll apply Linear Regression, Random Forest Regression, and Gradient Boosting Regression to forecast point totals for the top 100 current NHL players.
I won't dive too much into the code in this post, as I have created a GitHub repo with all the code and documentation. This will be more of a high-level overview of scikit-learn and how to use it effectively!
Exploring Different Types of Regression Models
scikit-learn offers a variety of regression models. For this example, we will focus on linear regression, random forest regressors, and gradient boosting regressors. The techniques used can be applied to other models in a similar way. You can learn more about these models here.
Linear Regression is like drawing a straight line through a series of points on a graph to make predictions about future points. It uses the existing points to calculate the line's angle and position, helping to predict outcomes such as someone's potential height based on their age or potential earnings based on their education level.
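Here is a minimal sketch of that idea in scikit-learn. The tiny age/height dataset is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: ages (feature) and heights in cm (target) -- made-up values for illustration
ages = np.array([[4], [6], [8], [10], [12]])
heights = np.array([102, 115, 128, 139, 151])

model = LinearRegression()
model.fit(ages, heights)

# The line's "angle and position": slope and intercept
print(model.coef_, model.intercept_)

# Predict the height of a 9-year-old
print(model.predict([[9]]))
```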
The Random Forest Regressor operates like a collaborative team of decision-making experts with unique skills. Each expert, or tree, analyzes specific aspects of the problem and provides a prediction. These individual predictions are then combined through a voting process to yield the final prediction. Similar to seeking advice from multiple knowledgeable individuals for a more informed decision, the random forest algorithm aggregates insights from numerous trees to enhance accuracy in predicting phenomena such as house prices or stock values.
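The "team of experts" idea is easy to see in code: each tree in the forest makes its own prediction, and the forest's answer is essentially their average. The data below is synthetic, just to show the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X[:, 0] * 10 + X[:, 1] * 5 + rng.normal(0, 0.5, 200)

# 100 "experts" (decision trees), each trained on a random slice of the data
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y)

new_point = [[0.5, 0.2, 0.9]]

# Each individual tree casts its "vote"...
tree_votes = [tree.predict(new_point)[0] for tree in forest.estimators_]

# ...and the forest's final prediction is their average
print(np.mean(tree_votes), forest.predict(new_point)[0])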
Gradient Boosting Regression is like baking a perfect cake: you start with a simple recipe and make continuous improvements until it's just right. Instead of flour and sugar, we work with data and predictions. The gradient component focuses on pinpointing prediction errors and gradually correcting them, allowing us to learn from mistakes and refine our predictions. This method helps fine-tune our predictions for the most accurate results possible.
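You can watch the "recipe improving" with staged_predict, which shows the error shrinking as each new tree corrects the mistakes of the ones before it. Again, the data is synthetic and only for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration only
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = X[:, 0] * 8 + X[:, 2] * 4 + rng.normal(0, 0.3, 300)

# Each of the 50 stages adds a small tree that corrects the previous errors
booster = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=0)
booster.fit(X, y)

# Watch the training error fall as more corrective trees are added
for i, stage_pred in enumerate(booster.staged_predict(X), start=1):
    if i in (1, 10, 50):
        print(f"after {i} trees: MSE = {mean_squared_error(y, stage_pred):.3f}")
```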
Utilizing NHL Stats API for Regression Model Example
To demonstrate the practical application of these regression models, we will utilize the NHL (National Hockey League) stats API to collect data on player point totals.
Through a simple example, we will illustrate how these regression models can be used to predict point totals for the top 100 players this season. This works great because we are roughly halfway (~40 games) through the NHL season, so we can also use the models to predict player point totals for the remainder of the year.
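Pulling the player data looks something like the sketch below. The endpoint, query parameters, and field names here are assumptions on my part and may differ from what the repo actually uses, so treat this as illustrative rather than the project's exact request:

```python
import requests
import pandas as pd

# Assumed NHL stats endpoint and parameters -- illustrative only; the repo may query the API differently
URL = "https://api.nhle.com/stats/rest/en/skater/summary"
params = {
    "limit": 100,                                         # top 100 players
    "sort": '[{"property":"points","direction":"DESC"}]',
    "cayenneExp": "seasonId=20232024 and gameTypeId=2",   # current regular season (assumed filter syntax)
}

response = requests.get(URL, params=params, timeout=30)
response.raise_for_status()

# Assumes the rows come back under a "data" key with these column names
players = pd.DataFrame(response.json()["data"])
print(players[["skaterFullName", "gamesPlayed", "points"]].head())
```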
First, a few notes:
Full source code is available at: https://github.com/bloodlinealpha/scikit-learn-nhl
Download the predicted data Excel files from the output and output_tuned folders
Try out the visualizer at: https://bloodlinealpha.com/nhl/points-prediction/
Here is a summary of the steps taken:
Queried the top 100 players by points for the current season (as of January 6, 2024) from the NHL API
Queried the last 5 seasons of statistics for each player (2018/2019 to 2022/2023) from the NHL API
Cleaned and manipulated the last 5 season data
Trained and tested 3 Regression models (Linear Regression, Random Forest, and Gradient Boosting) using the last 5 seasons of data to predict player Point Totals (see the training sketch after this list)
Repeated the training and testing using GridSearchCV to tune various parameters for each model and again predict player Point Totals (see the GridSearchCV sketch after this list)
Predicted player Point Totals for each model using the current season's statistics (2023-2024, ~40 games)
Extrapolated the statistics and predicted player Point Totals for the remainder of the season
Output the prediction results to Excel files for easy viewing (see the extrapolation and export sketch after this list)
Created a simple visualizer to view the prediction results: https://bloodlinealpha.com/nhl/points-prediction/
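For the training and testing step, the three models share scikit-learn's common estimator interface, so the workflow is identical for each. The file name, feature columns, and target column below are placeholders, not necessarily what the repo uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical file and column names -- the repo's cleaned last-5-season data may differ
df = pd.read_csv("last_5_seasons_cleaned.csv")
features = ["gamesPlayed", "shots", "timeOnIcePerGame"]
X = df[features]
y = df["points"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: R2={r2_score(y_test, preds):.3f}, MAE={mean_absolute_error(y_test, preds):.2f}")
```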
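The tuning pass uses GridSearchCV to try combinations of hyperparameters with cross-validation and keep the best one. The grid below is an illustrative example for the Random Forest (continuing from the previous sketch), not necessarily the grid used in the repo:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative parameter grid -- the repo may tune different values
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,            # 5-fold cross-validation
    scoring="r2",
    n_jobs=-1,
)
search.fit(X_train, y_train)   # X_train/y_train from the previous sketch

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_
```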
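For the extrapolation and export steps, the idea is to scale the ~40-game counting stats up to an 82-game pace, run them through each trained model, and write the results to Excel with pandas. The file, column names, and which columns get scaled are simplified assumptions here, continuing from the sketches above:

```python
# Hypothetical file with ~40 games played per player; column names are illustrative assumptions
current = pd.read_csv("current_season_2023_2024.csv")

# Extrapolate counting stats to a full 82-game season (per-game columns stay as-is)
scale = 82 / current["gamesPlayed"]
full_season = current.copy()
for col in ["gamesPlayed", "shots"]:
    full_season[col] = current[col] * scale

results = current[["skaterFullName"]].copy()
for name, model in models.items():
    results[f"{name} (projected points)"] = model.predict(full_season[features]).round(1)

# Write to Excel for easy viewing (requires openpyxl)
results.to_excel("predicted_points.xlsx", index=False)
```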
I know it seems like a lot, but when you break it down into steps it is relatively simple. The toughest part is always dealing with the data. Making sure your data is accurate matters most because, as the old saying goes:
Garbage in = Garbage out
Potential Issues
One data issue I encountered was overfitting, which is where the model learns the training data so well that it adversely affects its ability to generalize to new, unseen data. I noticed this problem because my R-squared value (a higher R² signifies a better fit of the regression model to the data) was equal to 1. Further investigation revealed data leakage caused by using columns (such as points per game, goals, assists, etc.) that are used to calculate or derive the total points.
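A simple guard against this kind of leakage is to drop any columns that directly make up the target before training. Continuing the earlier sketch, with hypothetical column names:

```python
# Columns that are components of (or derived from) total points -- hypothetical names for illustration
leaky_columns = ["goals", "assists", "pointsPerGame"]

# Drop them so the model cannot simply reassemble the target from its own ingredients
X = df.drop(columns=leaky_columns + ["points"], errors="ignore")
y = df["points"]
```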
It's crucial to fully grasp and interpret the data you're working with.
Thoughts
As you can see, scikit-learn provides an incredibly simple way to create regression models, as well as a variety of other machine learning models! The steps are relatively simple as well:
Gather data
Clean and manipulate the data
Train and test the model
Predict and validate using real world data
Visualize and analyze the results
Limitations
Understanding the limitations of your model is essential for making informed decisions. While machine learning models can provide valuable insights, it's important to remember that they are not magic tools. These models learn from the data provided to them, which means they are only as good as the quality and relevance of the data used to train them.
Assess the inputs and outputs of your model
Consider factors such as bias, noise, and incomplete information
By acknowledging these limitations, you can refine your approach and make more accurate predictions or decisions based on the model's outputs.
Next Steps...
Links to the visualizer, project repo and all future examples will be on: https://bloodlinealpha.com/
If you have any questions about using scikit-learn or the NHL API send me a LinkedIn message or email me at: bloodlinealpha@gmail.com.
Syntax Sunday
KH