Telling tales in foreign tongues

by Trey Nosrac

“Data analytics is the future, and the future is now. Every mouse click, keyboard button pressed, and each swipe or tap is used to shape business decisions. Everything is about data these days.” 

A new breed of customer is emerging in the sporting world. I recently met a young man who is a rabid baseball fan, but not a traditional fan, not the type of fan from the old neighborhood. His bailiwick is data. He belongs to the growing group of people who find great satisfaction in using data collection and analytics to predict future sporting results and evaluate talent. He and his tribe are deep into data and technology. Adapting to, incorporating, and profiting from these data heads in an expanding gambling world is challenging.

Most of us know that MLB teams and baseball fans follow trends and use numerical measurements, such as WAR projections, for analysis. WAR in baseball is Wins Above Replacement. This metric estimates how many more wins a player contributes than a replacement-level player, the sort of readily available substitute a team could call up from the minors. Understanding WAR can help Major League Baseball teams put the best statistically driven players on the field to increase their number of wins. WAR ratings and projections are also crucial in trades and drafting players.

Baseball found it profitable to share troves of raw data. People obtain the data or hire others to analyze the constantly updated data. This niche of interest opens a new world for a particular breed of baseball fans. Their conversations lean away from words and lean into numbers.

PLEASE NOTE: At this point, skip over the following italicized paragraphs unless you want your head to explode.

….my point was to predict specific baseball stats for 2021, which meant this had to be a regression approach rather than a classification approach (or so I thought…more to come on that). Regression problems restrict options substantially more than classification problems: this is an oversimplification, but I was mainly choosing from among linear regression, random forest, and XGBoost. Linear regression would not have been a great choice here, as it assumes independence among its input variables, and that was very much not the case here. It is also limited to finding linear relationships.

Random forest regression would have been a perfectly reasonable choice, but I selected XGBoost to build the models, as it is an improvement on random forest. XGBoost is a popular ensemble decision tree algorithm that combines many successive decision trees — each tree learns from its predecessors and improves upon the residual errors of previous trees. It also tends to perform very well in these types of problems.

My planned steps were:

• Clean and prep data.

• Identify target variables.

• Fit models to the 2017 and 2018 data, trying to predict the subsequent year’s statistics.

• Tune the models’ hyperparameters to their optimal settings.

• Combine 2017 and 2018 data, retrain models, re-tune hyperparameters, and assess for differences.

• Use the resulting models to predict 2021 statistics using a blended input data set from 2019 and 2020.

Let’s unpack this last piece. 2020 posed a problem in that, you may recall, we had a bit of a pandemic issue, so we only had 60 games. My initial plan for this was to make a blended data set from 2019 and 2020. I did this by preparing weighted averages of stats for the two years and then scaling them to 1.
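For readers whose heads have not yet exploded, the blending step might look something like the minimal sketch below. It is not the quoted analyst's code. The weights, the stat names, and the sample numbers are all hypothetical, and reading "scaling them to 1" as "making the weights add up to 1" is an assumption.

```python
def blend_seasons(stats_a, stats_b, weight_a=0.6, weight_b=0.4):
    """Blend two seasons of per-game rates into one weighted average.

    The weights are hypothetical. They are normalized so they sum to 1,
    which is one possible reading of "scaling them to 1".
    """
    total = weight_a + weight_b
    w_a = weight_a / total
    w_b = weight_b / total
    # Only blend stats that appear in both seasons.
    shared = stats_a.keys() & stats_b.keys()
    return {name: w_a * stats_a[name] + w_b * stats_b[name] for name in shared}


# Hypothetical per-game rates for one player in a full and a short season.
season_2019 = {"hits_per_game": 1.10, "walks_per_game": 0.40}
season_2020 = {"hits_per_game": 0.90, "walks_per_game": 0.50}

blended = blend_seasons(season_2019, season_2020)
print(blended)
```

Using per-game rates rather than season totals keeps a 60-game season from looking like a collapse next to a full one; how heavily to trust the short season is a judgment call that the weights make explicit.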


When this baseball fan communicates with others of his ilk, they talk and listen (and wager) in the language of numbers. Everything has a weighted value. This data obsession is a strange world with a fierce grip on some sports fans.

Data in horse racing is plentiful. Much of the existing harness racing data is sitting there, waiting to be fed into formulas. For example, data people or AI programs could answer a long-standing question in the harness racing community: what is the best foaling date for a yearling?

Almost everyone has an opinion on this subject. Data fiends would take every trotting race on record, cross-reference the results with the month each horse was foaled, and find the optimum month. They would pin down the exact day and hour of optimum foaling, then assign a value to the worst foaling day and every day in between. Everything would be converted to sets of numbers.
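The cross-referencing described above could start out as simply as the sketch below. It assumes each race start has been reduced to a foaling month and a win-or-lose flag; the record layout, the function names, and the sample data are all made up for illustration, and a serious study would need far more than win rate.

```python
from collections import defaultdict


def win_rate_by_foaling_month(records):
    """Return {month: share of starts won} from (foaling_month, won) pairs."""
    starts = defaultdict(int)
    wins = defaultdict(int)
    for month, won in records:
        starts[month] += 1
        wins[month] += int(won)
    return {month: wins[month] / starts[month] for month in starts}


def best_foaling_month(records):
    """Return the month with the highest win rate in the records."""
    rates = win_rate_by_foaling_month(records)
    return max(rates, key=rates.get)


# Hypothetical starts: (foaling month as 1-12, whether the horse won).
sample_starts = [(3, True), (3, False), (4, True), (4, True), (5, False)]

print(win_rate_by_foaling_month(sample_starts))
print(best_foaling_month(sample_starts))
```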

Data people talk in foreign tongues. People in our world, the people we know, would say, “I sort of prefer April foals.”

Data people would say that horse X’s optimum foal date is +2.35 z (X) = March 28.

The databerg is endless, and the participants place numerical values and weight on EVERYTHING.

The stallion is +4.61. The mare is at -1.23, the genetic cross is 2.307, the relative competition of the state program is 0.62, the conformation number is (after measurement data) 0.62, the size is -1.1, and the consigning farm is +1.3.
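How would numbers like these become one yearling rating? Nobody has said, so the sketch below simply assumes a weighted sum with every weight set to 1. The factor names, the equal weights, and the idea of adding the numbers at all are assumptions made for illustration.

```python
def composite_score(factors, weights=None):
    """Combine factor ratings into one number with a weighted sum.

    If no weights are given, every factor counts equally (weight 1.0).
    """
    if weights is None:
        weights = {name: 1.0 for name in factors}
    return sum(weights[name] * value for name, value in factors.items())


# The ratings quoted above, under hypothetical factor names.
yearling = {
    "stallion": 4.61,
    "mare": -1.23,
    "genetic_cross": 2.307,
    "state_program": 0.62,
    "conformation": 0.62,
    "size": -1.1,
    "consigning_farm": 1.3,
}

print(round(composite_score(yearling), 3))
```

A data person would immediately argue about the weights, which is exactly the point: change them and the same yearling gets a different number.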

You may believe that evaluating a one-year-old trotter heading for a sales ring using numbers and formulas is ridiculous. You probably think that animals growing in a distant field have too many unknowns, and this whole exercise is nuts.

But in the offices of MLB teams are sets of data points for a 16-year-old boy in a small Puerto Rican village. Analytics project his body growth, analyze his ancestry, evaluate his mental capacities, and have numbers for his current spin rates and exit velocities. This kid currently has a predicted slot value of 743 in the 2024 international draft and a projected signing cost of $375,000.

If people who love this uber numbers game became interested in racing, they would talk using the language of numbers. Even in the yearling selection game, they would want every yearling going to sale to have a value number and a projected value, and wagering would be more mathematical than ever.

Who knows how much data crunching already takes place in our sport? But looking ahead, we would be wise to learn how to open the door to potential customers who speak in the strange tongue of analytics. Like it or not, the future is coming.