Hooked On Data's submission | The Salmon Prize Project

Hooked On Data


Predictions

Ugashik River's Sockeye run

8,548,516

Quesnel's Sockeye run

423,716

Kvichak River's Sockeye run

8,636,944

Stellako's Sockeye run

114,148

Egegik River's Sockeye run

6,471,012

Raft River's Sockeye run

26,692

Igushik River's Sockeye run

2,653,726

Naknek River's Sockeye run

5,374,252

Wood River's Sockeye run

11,063,164

Chilko River's Sockeye run

1,148,870

All of Columbia River's Sockeye run

153,428

Nushagak River's Sockeye run

6,686,194

Alagnak River's Sockeye run

5,484,220

Stuart River's Late Sockeye run

329,777

Prediction method

Submitted on Jul 01, 2025

Machine Learning-Based Prediction of Salmon Returns Using Environmental and Spawner Data

Abstract

We developed a machine learning framework to predict salmon returns for individual rivers using a combination of return data, spawner counts, and environmental variables. Each sample combined annual observations from years y–5 through y, with the goal of predicting salmon returns in year y+1. We split the data chronologically within each river, using the first 80% of samples for training and the last 20% for testing. To improve model performance and generalizability, we tested a suite of configurations that varied the machine learning algorithm, the subset of predictive features, the use of time-lagged variables, and the optional addition of an ARIMA model fitted to the residuals of the machine learning model. Models were trained separately for each of the three major river systems (Bristol Bay, Columbia River, and Fraser River). The final model for each river (the "winner model") was the one with the lowest Mean Absolute Percentage Error (MAPE) on the test set, after removing models whose R² diverged strongly between the training and test sets, which we treated as a sign of overfitting. These winner models were retrained on all data up to 2024 and used to generate the final predictions for 2025.
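The split and selection steps described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the team's code: the function names, the dictionary keys, and in particular the `max_r2_gap` threshold are assumptions, since the submission does not state how large an R² divergence led to a model being removed.

```python
def chronological_split(samples, train_frac=0.8):
    """Order one river's samples by year; the earliest share trains, the rest tests."""
    ordered = sorted(samples, key=lambda s: s["year"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (y_true must be non-zero)."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def select_winner(candidates, max_r2_gap=0.3):
    """Drop candidates whose train/test R² gap exceeds max_r2_gap
    (hypothetical threshold), then return the one with the lowest test MAPE."""
    kept = [c for c in candidates
            if abs(c["train_r2"] - c["test_r2"]) <= max_r2_gap]
    if not kept:
        return None
    return min(kept, key=lambda c: c["test_mape"])


# Toy data for one river: ten yearly samples, 2000 through 2009.
samples = [{"year": y, "returns": 1000 + y} for y in range(2000, 2010)]
train, test = chronological_split(samples)

# Hypothetical candidate configurations with made-up scores.
candidates = [
    {"name": "A", "test_mape": 12.0, "train_r2": 0.99, "test_r2": 0.20},
    {"name": "B", "test_mape": 18.0, "train_r2": 0.80, "test_r2": 0.70},
    {"name": "C", "test_mape": 25.0, "train_r2": 0.60, "test_r2": 0.55},
]
winner = select_winner(candidates)
```

In this toy example, candidate A has the lowest test MAPE but is discarded because its R² drops sharply from training to test, so B is selected.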

Supporting Documents

best_models_per_river.csv

Features.docx

Retrospective_Analysis.csv

Prediction Model

Submitted on Jul 01, 2025

Description

We evaluated five machine learning algorithms: (1) Random Forest Regressor, (2) HistGradientBoostingRegressor, (3) XGBoost Regressor, (4) Linear Regression, and (5) Polynomial Regression (Linear Regression with degree-2 polynomial features). Each model was trained using annual samples consisting of features observed from year y to y–5, with the target variable being total salmon returns in year y+1. Feature selection was performed using SelectKBest, considering either all features, the top 20, the top 10, or the top 6, based on their statistical association with the target. We also tested the inclusion of time-lagged features (lags of 1–5 years).

The feature set included variables from three categories:

• Return and spawner data from Angler’s Atlas and Gottfried Pestal, including:
  o Total_Returns and 18 age-class return features (AgeClass_0.1, AgeClass_1.2, etc.)
  o Spawner counts for brood years y–2 to y–4 and their sum (total_spawners_y_minus_2_to_4)
• Oceanographic and climate indices (from the pacea R package and NOAA):
  o Pacea_ALPI_Anomaly, npi_mean_NovMar, oni_mean_DecFeb, mei_mean_AprSep, npgo_mean_DecFeb, ao_mean_DecMar, pdo_mean_DecMar, pdo_mean_MaySep, sst_aprjul, sst_anom, and sss_mayaug
• Dummy variables encoding river identity (e.g., River_Alagnak, River_Bonneville Lock & Dam, etc.)

In total, the full dataset comprised 892 samples: 488 for Bristol Bay, 39 for Columbia River, and 365 for Fraser River. For Bristol Bay, only samples from 1995 onward were used. Features with missing values (e.g., sea surface temperature before 1984 in Fraser River) were removed prior to training. Ugashik River (Bristol Bay) models were trained separately without spawner data due to unavailability.
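As a rough illustration of the sample construction and feature selection described above, the sketch below builds lagged rows (years y down to y–5, target in year y+1) and ranks features with scikit-learn's SelectKBest. The helper names `make_supervised` and `select_top_k` are hypothetical, the data are synthetic, and the `f_regression` score function is an assumption: the submission says only that features were ranked by their statistical association with the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression


def make_supervised(series, max_lag=5):
    """Build (X, y): each row holds the values for years y, y-1, ..., y-max_lag,
    and the target is the value in year y+1."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for i in range(max_lag, len(series) - 1):
        X.append(series[i - max_lag:i + 1][::-1])  # newest year first
        y.append(series[i + 1])
    return np.array(X), np.array(y)


def select_top_k(X, y, k):
    """Return the column indices of the k best-scoring features.
    k may also be the string "all" to keep every feature."""
    selector = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    return selector.get_support(indices=True)


# Toy series: eight consecutive years of a hypothetical return count.
X_lag, y_next = make_supervised([1, 2, 3, 4, 5, 6, 7, 8])

# Synthetic check of the selector: only columns 2 and 7 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 12))
y_demo = 3 * X_demo[:, 2] + 3 * X_demo[:, 7] + 0.01 * rng.normal(size=200)
top2 = select_top_k(X_demo, y_demo, k=2)
```

The same `select_top_k` call with k set to 20, 10, 6, or "all" would correspond to the four feature-subset options listed in the description.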