Syntax Sunday: Predicting NHL Game Outcomes with scikit-learn
In this edition of #SyntaxSunday, we are using scikit-learn again, this time to showcase Classification. I will outline some general steps for gathering and transforming data related to National Hockey League (NHL) games, with a focus on predicting "winners" and "losers" for each game. As always, full source code will be available on GitHub so you can give this a try yourself!
Also be sure to check out my other article here, which uses scikit-learn to predict NHL player point stats using regression models.
Machine learning excels at predicting sports outcomes through Classification by recognizing patterns in data. Classifier models categorize data based on specific features, using labeled training data to make predictions.
Algorithms like Support Vector Machines (SVM), Random Forest, Naive Bayes, and Neural Networks analyze data to distinguish between categories. The choice of algorithm depends on factors such as dataset size and feature types.
For this example, I utilized three classifier models:
Random Forest Classifier: A powerful machine learning algorithm that utilizes multiple decision trees to make predictions. By combining the outputs of these trees, it can provide more accurate and robust results compared to individual decision trees alone.
Support Vector Machines (SVM) Classifier: A machine learning algorithm that separates data points by creating a hyperplane with the maximum margin to distinguish between different classes. This method is effective for classification tasks by accurately categorizing various groups within the dataset.
Multilayer Perceptron Classifier: An artificial neural network made up of multiple layers: an input layer for data input, hidden layers for learning patterns, and an output layer for predictions. Activation functions introduce non-linearity, and backpropagation adjusts the weights during training so the network can learn from labeled data. Networks of this kind are also used for tasks like image recognition and natural language processing.
I chose these models to showcase a range of classifiers to work with!
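As a rough illustration of how these three classifiers look in scikit-learn, here is a minimal sketch. It trains on synthetic data from make_classification (an assumption, not the NHL data), and the hyper-parameter values and the scaling step for SVM and MLP are my own illustrative choices rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the game data (assumption: not real NHL stats).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVM and MLP are sensitive to feature scale, so scale the inputs first.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy on the held-out split.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

All three share the same fit / predict / score interface, which is what makes it easy to swap models in and out while experimenting.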
When creating a machine learning model, there are generally a few main steps:
Gather the data
Clean and transform the data
Train and test the model
Predict and validate using real world data
Visualize and analyze the results
The majority of your time will be spent gathering, cleaning, and transforming the data. Pandas DataFrames are relied upon heavily for this.
With DataFrames, you can load data from various file formats, clean it by managing missing values or duplicates, manipulate it, and analyze it using statistical functions effortlessly.
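To make that concrete, here is a small sketch of loading, cleaning, and summarizing data with a DataFrame. The CSV content and column names (gameId, team, goalsFor, shotsFor) are made up for illustration and are not the NHL API's actual fields.

```python
import io

import pandas as pd

# Hypothetical CSV snippet; in practice this would be a file path.
raw_csv = io.StringIO(
    "gameId,team,goalsFor,shotsFor\n"
    "1,TOR,3,31\n"
    "1,TOR,3,31\n"  # exact duplicate row
    "2,MTL,,28\n"   # missing goalsFor value
    "3,BOS,4,35\n"
)

games = pd.read_csv(raw_csv)
games = games.drop_duplicates()            # remove repeated rows
games = games.dropna(subset=["goalsFor"])  # drop rows missing a key value
games = games.reset_index(drop=True)

# Quick statistical summary of the numeric columns.
summary = games[["goalsFor", "shotsFor"]].describe()
print(summary)
```

Whether to drop or fill missing values depends on the column; dropping is simply the shortest option to show here.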
Check out this article to learn more about DataFrames.
Depending on your data, there are many different ways to create features for your models, but it is crucial to avoid overfitting and data leakage during training. A common warning sign is a model score at or very near 1 (100%, perfect) when running the test splits after training.
Overfitting results from a model being overly complex and memorizing data rather than understanding patterns.
Data leakage often stems from features that reveal the outcome itself, or that rely on information that would not be available when making a real prediction.
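The sketch below shows what leakage can look like in practice. It uses synthetic data (an assumption) in which a hypothetical goal_diff column fully determines the "wins" target, alongside a random shots column that carries no signal. The column names and the helper holdout_accuracy are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: goal differential is never 0 and decides the winner.
goal_diff = rng.choice([-3, -2, -1, 1, 2, 3], size=n)
df = pd.DataFrame({
    "shots": rng.integers(20, 40, size=n),  # random, unrelated to the outcome
    "goal_diff": goal_diff,                 # leaky: it encodes the result
    "wins": (goal_diff > 0).astype(int),
})


def holdout_accuracy(feature_cols):
    """Train a random forest on the given columns and score a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df["wins"], test_size=0.25, random_state=0
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


leaky_score = holdout_accuracy(["shots", "goal_diff"])
clean_score = holdout_accuracy(["shots"])
print(f"with leaky feature: {leaky_score:.3f}")
print(f"without it:         {clean_score:.3f}")
```

The near-perfect score with goal_diff does not mean the model is good; it means the answer was handed to it. Removing the column drops the score to roughly a coin flip on this toy data.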
Below are the key steps I took to create this example...
Follow along with the source code here...
1.) The first step was to gather the data from the NHL API. Unfortunately it is undocumented, so you must use the Network tab within the developer tools (F12) of your browser to analyze the fetch requests.
By visiting https://www.nhl.com/stats/teams and using this method, I identified the necessary query.
I created a few functions within init.py to query the NHL game results for the past 10 seasons (2013-2023) and for the current season (2023-2024). I also queried the current season schedule.
I then stored these results as CSV files within the csv folder.
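A fetch-and-store step along these lines might look like the sketch below. It is not the code from init.py: the helper names fetch_json and save_records are hypothetical, the URL and query parameters are left for you to copy from the Network tab, and the assumption that the response holds its rows under a "data" key should be checked against the actual payload.

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

import pandas as pd


def fetch_json(base_url, params):
    """GET base_url with the given query parameters and decode the JSON body."""
    with urlopen(f"{base_url}?{urlencode(params)}", timeout=30) as response:
        return json.load(response)


def save_records(payload, csv_path):
    """Write the list of rows under payload["data"] (assumed shape) to a CSV."""
    frame = pd.DataFrame(payload["data"])
    Path(csv_path).parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(csv_path, index=False)
    return frame
```

With the request details taken from the browser, usage would be something like save_records(fetch_json(url, params), "csv/games.csv"), where url, params, and the file name are placeholders.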
2.) Next, I imported the CSV files and began manipulating the data in model.py.
To keep this example simple, I removed all unnecessary columns, created a few new features (columns), and encoded all non-numeric columns.
See the "_cleaned" files in the csv folder to see a snapshot of how the data was manipulated before encoding.
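The three operations mentioned above (dropping columns, creating features, encoding) can be sketched as follows. The column names and the shotShare feature are hypothetical, and one-hot encoding via pd.get_dummies is just one option; I am not claiming it is the encoding used in model.py.

```python
import pandas as pd

# Hypothetical game rows; column names are illustrative only.
games = pd.DataFrame({
    "gameId": [1, 2, 3, 4],
    "teamFullName": ["Toronto Maple Leafs", "Boston Bruins",
                     "Toronto Maple Leafs", "Boston Bruins"],
    "homeRoad": ["H", "R", "R", "H"],
    "shotsFor": [30, 25, 28, 35],
    "shotsAgainst": [27, 30, 33, 22],
    "wins": [1, 0, 0, 1],
})

# 1. Remove columns that carry no predictive information.
cleaned = games.drop(columns=["gameId"])

# 2. Create a new feature: the team's share of all shots in the game.
cleaned["shotShare"] = cleaned["shotsFor"] / (
    cleaned["shotsFor"] + cleaned["shotsAgainst"]
)

# 3. Encode non-numeric columns as 0/1 indicator columns.
encoded = pd.get_dummies(cleaned, columns=["teamFullName", "homeRoad"], dtype=int)
print(encoded.columns.tolist())
```

After this step every column is numeric, which is what the scikit-learn estimators expect.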
When working on this step, try to be creative. While it may be time-consuming and involve lots of trial and error, it is crucial for your model's performance. Ensuring the data is relevant to your prediction goals is key.
Using too much unrelated data can harm performance and add unnecessary complexity. Take the time to thoroughly understand your data options.
scikit-learn also provides a variety of tools that allow you to tune the hyper-parameters of your models, so be sure to implement those as well.
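One such tool is GridSearchCV, which tries every combination in a parameter grid using cross-validation. The sketch below runs it on synthetic data (an assumption), and the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded game data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate hyper-parameter values to compare (illustrative only).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

For larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying them all, which can save a lot of time.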
3.) I created and tested 3 classification models (Random Forest, SVM, and MLP), storing them in the joblib folder for later use.
The target in this dataset is "wins", where 1 = win and 0 = loss (binary classification). This is the value we will be predicting using each model.
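Training against the "wins" target and saving the fitted model with joblib could look roughly like this. The data is a toy set with a made-up rule for who wins, and the model is written to a temporary directory rather than the project's joblib folder, so treat the paths and columns as assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: the team with more shots "wins" (an artificial rule).
games = pd.DataFrame({
    "shotsFor": rng.integers(20, 40, size=200),
    "shotsAgainst": rng.integers(20, 40, size=200),
})
games["wins"] = (games["shotsFor"] > games["shotsAgainst"]).astype(int)

# Separate the features from the binary target.
X = games.drop(columns=["wins"])
y = games["wins"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Persist the fitted model, then load it back as a later script would.
model_path = Path(tempfile.mkdtemp()) / "random_forest.joblib"
joblib.dump(model, model_path)
restored = joblib.load(model_path)

same_predictions = bool(
    (restored.predict(X_test) == model.predict(X_test)).all()
)
print("restored model matches:", same_predictions)
```

Saving the model this way means the testing and prediction scripts can reuse it without retraining.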
4.) Using test.py, I collected and processed data up to Feb 26, 2024 for the current season, then tested each model with new data to assess its performance.
These results are stored in the test folder. The model predictions yielded similar results to those seen during training. It's crucial to note that I used the actual stats as feature values, as the games had been completed. This eliminates the need for assumptions or calculations, and is more accurate.
5.) Again in test.py, I gathered the schedule of remaining games, transformed the data, and tested it against each model.
This step involved extensive data manipulation to ensure that the DataFrame for making predictions aligned with the format used during model training.
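One way to handle that alignment, assuming the training data was one-hot encoded, is to reindex the prediction DataFrame against the list of training columns. The column names below are hypothetical, and this is only a sketch of the idea, not the code in test.py.

```python
import pandas as pd

# Column order the model saw during training (hypothetical names).
train_columns = [
    "shotShare", "homeRoad_H", "homeRoad_R",
    "team_BOS", "team_MTL", "team_TOR",
]

# Upcoming games: only some teams and only home games appear here.
upcoming = pd.DataFrame({
    "team": ["TOR", "BOS"],
    "homeRoad": ["H", "H"],
    "shotShare": [0.52, 0.49],
})

encoded = pd.get_dummies(upcoming, columns=["homeRoad", "team"], dtype=int)

# Add any missing indicator columns as 0 and match the training order.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Without this step, a category that is absent from the new data (here, road games and MTL) would leave the DataFrame with fewer columns than the model expects.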
In this situation with limited data, we must think creatively and make assumptions or calculations to establish feature values.
I simply used the average values from the current season.
Alternatively, you can update values after each game, use past season averages, or employ other machine learning models to determine the values.
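The season-average approach can be sketched with a groupby and a merge. The teams, stats, and dates below are invented for illustration, and the column names are assumptions rather than the fields used in the project.

```python
import pandas as pd

# Stats from games already played this season (made-up numbers).
played = pd.DataFrame({
    "team": ["TOR", "TOR", "BOS", "BOS"],
    "shotsFor": [30, 34, 28, 32],
    "faceoffWinPct": [0.50, 0.54, 0.48, 0.52],
})

# One row per team holding its average for each numeric stat.
season_avg = played.groupby("team", as_index=False).mean(numeric_only=True)

# Remaining schedule: no stats exist yet for these games.
schedule = pd.DataFrame({
    "gameDate": ["2024-02-27", "2024-02-28"],
    "team": ["BOS", "TOR"],
})

# Attach each team's season averages as the assumed feature values.
upcoming = schedule.merge(season_avg, on="team", how="left")
print(upcoming)
```

Swapping season_avg for a rolling average over the last few games, or for last season's averages, only changes how that one table is built.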
Try a variety of techniques and see what works best!
Again, this is very important, as it can drastically change the predictions made by your model.
6.) View the Results! Predictions are in the csv/test folder.
Predictions for the remainder (Feb 26 - Apr 18) of the season: current_season_predictions.csv.
Predictions for all games up to Feb 26, 2024: 2023_2024_02-26_predictions.csv.
scikit-learn does a great job in helping simplify the process of creating machine learning models for Classification problems such as this. There are a variety of models to choose from, so you should be able to find one that fits your needs!
The hardest part of this process (and of most machine learning projects) is data manipulation!
There are so many variables affecting the outcome of an NHL game, you really need to take your time, understand all the data available to you, and determine whether it is helping or hurting your model.
This example involved a lot more data manipulation than I had planned for, but it was necessary as I wanted to show others how they could use the NHL API as their data source for these models.
All data manipulation was done within Python and using Pandas.
Pandas does make it easy to quickly output CSV or Excel files, so you do not always have to view data via the console.
The NHL API data is somewhat limited and you "get what you get", so supplementing it with other data sources can significantly improve the quality of the models being developed.
Improving a model not only involves identifying key features, but also finding new ways to leverage them efficiently. By thinking outside the box and using the appropriate features, you can enhance your model's precision and overall performance.
Being creative when making predictions is also crucial due to a lack of necessary data. Finding innovative ways to gather information is essential and almost always required.
Links to the project repo, files, and all future examples will be on: https://bloodlinealpha.com/
If you have any questions about using scikit-learn or the NHL API, send me a LinkedIn message or email me at: bloodlinealpha@gmail.com.
Syntax Sunday
KH