2025 Sockeye International
Hooked On Data

Predictions

Ugashik River's Sockeye run
8,548,516
Quesnel's Sockeye run
423,716
Kvichak River's Sockeye run
8,636,944
Stellako's Sockeye run
114,148
Egegik River's Sockeye run
6,471,012
Raft River's Sockeye run
26,692
Igushik River's Sockeye run
2,653,726
Naknek River's Sockeye run
5,374,252
Wood River's Sockeye run
11,063,164
Chilko River's Sockeye run
1,148,870
All of Columbia River's Sockeye run
153,428
Nushagak River's Sockeye run
6,686,194
Alagnak River's Sockeye run
5,484,220
Stuart River's Late Sockeye run
329,777

Prediction method

Submitted on Jul 01, 2025
Machine Learning-Based Prediction of Salmon Returns Using Environmental and Spawner Data
Abstract
We developed a machine learning framework to predict salmon returns for individual rivers from a combination of return data, spawner counts, and environmental variables. Each sample consisted of annual observations from year y back to y–5, and the target was the salmon return in year y+1. We applied time series-aware data splits by river, using the first 80% of samples for training and the last 20% for testing. To improve model performance and generalizability, we tested a suite of configurations that varied the machine learning algorithm, the subset of predictive features, the use of time-lagged variables, and the optional addition of an ARIMA model fitted to the residuals of the machine learning model. Models were trained separately for each of the three major river systems (Bristol Bay, Columbia River, and Fraser River). The final model for each river ("winner model") was the one with the lowest Mean Absolute Percentage Error on the test set, chosen after discarding models whose R² diverged strongly between training and test sets, in order to limit overfitting. These winner models were retrained on all data up to 2024 and used to generate the final predictions for 2025.
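The split and selection steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the team's code: the function names, the `Candidate` record, and in particular the `max_r2_gap` threshold are assumptions, since the write-up does not say how large an R² divergence counted as "high". The numbers are made up for the example.

```python
from dataclasses import dataclass


def chronological_split(rows, train_fraction=0.8):
    """Split one river's rows by time: earliest 80% train, latest 20% test."""
    ordered = sorted(rows, key=lambda r: r["year"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


@dataclass
class Candidate:
    name: str          # hypothetical label for one model configuration
    train_r2: float
    test_r2: float
    test_mape: float


def pick_winner(candidates, max_r2_gap=0.3):
    """Drop candidates whose train/test R² diverge too much, then take min test MAPE.

    max_r2_gap is a hypothetical threshold; the write-up does not state one.
    """
    kept = [c for c in candidates if abs(c.train_r2 - c.test_r2) <= max_r2_gap]
    if not kept:
        return None
    return min(kept, key=lambda c: c.test_mape)


# Illustrative data only, not values from the submission.
rows = [{"year": 2000 + i, "returns": 100 + i} for i in range(10)]
train, test = chronological_split(rows)

candidates = [
    Candidate("random_forest_top10", train_r2=0.95, test_r2=0.40, test_mape=18.0),
    Candidate("xgboost_top6_lags", train_r2=0.80, test_r2=0.70, test_mape=22.0),
    Candidate("linear_all", train_r2=0.60, test_r2=0.55, test_mape=30.0),
]
winner = pick_winner(candidates)
```

In this toy case the random forest has the lowest test MAPE but is discarded because its training and test R² differ by 0.55, so the second configuration is selected. Filtering before ranking keeps a model that merely memorized the training years from winning on a short test window.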

Prediction Model

Submitted on Jul 01, 2025
Description
We evaluated five machine learning algorithms: (1) Random Forest Regressor, (2) HistGradientBoostingRegressor, (3) XGBoost Regressor, (4) Linear Regression, and (5) Polynomial Regression (Linear Regression with degree-2 polynomial features). Each model was trained using annual samples consisting of features observed from year y to y–5, with the target variable being total salmon returns in year y+1. Feature selection was performed using SelectKBest, considering either all features, the top 20, the top 10, or the top 6, based on their statistical association with the target. We also tested the inclusion of time-lagged features (lags of 1–5 years).

The feature set included variables from three categories:
• Return and spawner data from Angler’s Atlas and Gottfried Pestal, including:
  o Total_Returns and 18 age-class return features (AgeClass_0.1, AgeClass_1.2, etc.)
  o Spawner counts for brood years y–2 to y–4 and their sum (total_spawners_y_minus_2_to_4)
• Oceanographic and climate indices (from the pacea R package and NOAA):
  o Pacea_ALPI_Anomaly, npi_mean_NovMar, oni_mean_DecFeb, mei_mean_AprSep, npgo_mean_DecFeb, ao_mean_DecMar, pdo_mean_DecMar, pdo_mean_MaySep, sst_aprjul, sst_anom, and sss_mayaug
• Dummy variables encoding river identity (e.g., River_Alagnak, River_Bonneville Lock & Dam, etc.)

In total, the full dataset comprised 892 samples: 488 for Bristol Bay, 39 for Columbia River, and 365 for Fraser River. For Bristol Bay, only samples from 1995 onward were used. Features with missing values (e.g., sea surface temperature before 1984 in Fraser River) were removed prior to training. Ugashik River (Bristol Bay) models were trained separately without spawner data due to unavailability.
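To make the sample layout concrete, the sketch below builds supervised samples for a single river from a year-indexed series, using the value in year y and its lags of 1 to 5 years as features and the return in year y+1 as the target. It is an illustration under stated assumptions: `build_samples`, the `Total_Returns_lag{n}` naming, and the decision to skip windows with missing years are hypothetical choices, and the series values are invented.

```python
def build_samples(series, max_lag=5):
    """Turn one river's {year: total_returns} series into supervised samples.

    Each sample uses returns from year y back to y - max_lag as features
    and the return in year y + 1 as the target. Windows with any missing
    year are skipped (an assumption; the write-up does not describe this).
    """
    samples = []
    for y in sorted(series):
        window_years = [y - lag for lag in range(max_lag + 1)]
        if any(w not in series for w in window_years) or (y + 1) not in series:
            continue
        # Hypothetical feature names: lag0 is year y itself, lag5 is year y - 5.
        features = {
            f"Total_Returns_lag{lag}": series[y - lag] for lag in range(max_lag + 1)
        }
        samples.append({"year": y, "features": features, "target": series[y + 1]})
    return samples


# Invented series for years 2000-2009, not data from the submission.
series = {year: 1000 + 10 * (year - 2000) for year in range(2000, 2010)}
samples = build_samples(series)
```

With ten years of data and a five-year look-back, only the years 2005 to 2008 yield complete samples, which shows how quickly lagging shrinks a short record such as the 39 Columbia River samples.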