Monday, December 1, 2014

Running Regressions

Since I am no longer in the Excel dojo that I had in my civilian job, I needed some way to keep my skills sharp. So I decided to nerd out a little on my running log. In a previous post, I had noted the logarithmic relationship between distance and pace. While this was a pretty strong relationship, I was curious about other factors. The effects of these factors, though, seemed harder to tease out of my data than the effect of distance on pace, so I went full geek and did a multi-variable linear regression. Enjoy.

Deciding What to Test

After some thought and searching around the web, I settled on looking at the following factors:

· Altitude
· Temperature
· Elevation gain

I chose these factors because they were easy to get for the races I have done this year. I have also done races over a wide variety of altitudes this year (Boulder, Ft. Bliss, Boise and Kuwait) so I have a lot more sample points than I usually have. I also had a decent spread of temperatures from 29 °F at the Frozen Foot 5k (February in Boulder) to an 86 °F run in Kuwait. I did not do any races with significant elevation gains this year (say over 300 ft per mile), but I started using Strava a lot more this year and it seems like a neat variable to test.

I did not look at other variables like wind, elevation loss or technicality of terrain. I skipped wind because most courses are loops, and while you never get back from the wind what you give to it, I was not sure that I wanted to go back and figure out what percentage of each race was into the wind. Strava doesn’t give elevation loss, and I did not feel like going through any more hassle to figure it out for point-to-point courses. I left out technicality of terrain because I was not sure how best to put a number to it. Perhaps next year, when I get back to doing tempo runs in the mountains, I will revisit this one.

The dataset

While my racing for the first half of the year was light, during my deployment in Kuwait I am racing almost every week. Overall I had 13 races in my dataset. That is not a big enough sample to draw statistically solid conclusions, but it is good enough for fun.

When I did a straight pace v. distance plot, the logarithmic relationship was still there, but it was not as clean as it is with my PRs.

Pace versus Distance for my 2014 Races with a log curve fit
Pace versus distance for my personal records (with a log curve fit)
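
For anyone who wants to try this kind of curve fit outside of Excel, here is a minimal Python sketch of a least-squares log fit of the form pace = a + b * ln(distance). The function name and the sample numbers are made up for illustration; they are not the values from the running log or the charts above.

```python
import math

def fit_log_curve(distances, paces):
    """Least-squares fit of pace = a + b * ln(distance). Returns (a, b)."""
    xs = [math.log(d) for d in distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(paces) / n
    # Slope is covariance(ln distance, pace) divided by variance(ln distance)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, paces))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Invented sample data: distance in miles, pace in minutes per mile
distances = [1.0, 3.1, 6.2, 13.1, 26.2]
paces = [5.0, 5.5, 5.8, 6.2, 6.6]

a, b = fit_log_curve(distances, paces)
print(f"pace = {a:.2f} + {b:.2f} * ln(distance)")
```

A positive b means pace slows as the distance grows, which is the shape both charts show.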

But I regress. . .

In a stats class offered through work I learned about the Linear Regression feature in Excel. The first step was to ensure that my variables had a linear relationship to pace (in other words, doubling the altitude doubles its effect on your pace). A little Googling found Run Works, which takes formulas from Jack Daniels (no, not that one) and other exercise physiology sources. This site allows you to put in a time and distance and then gives you estimated times for different altitudes, elevation gains, etc. It appeared that, at least based on the formulas these experts used, altitude, elevation gain and temperature each had a reasonably linear relationship to pace.

Pace versus Altitude for a 16:49 5k at sea level (Source: Run Works / Jack Daniels' formulas)

Pace versus Altitude for a 16:49 5k with 40 ft of elevation gain (Source: Run Works / Jack Daniels' formulas)

Pace v. Temperature for a 16:49 5k at 50 F (Source: Run Works / Jack Daniels' formulas)

For altitude I used the average value for the race (rounded to about 100 ft). I got my weather data from NOAA for each race. The elevation gain I got from Strava, which I believe uses the digital terrain mapping (DTM) data from Google Earth or Maps.


Using distance alone gave me an R squared value of 0.79 (the closer to 1, the better the prediction). Throwing in altitude, elevation gain and temperature brought this up to 0.87. Not bad.

As one final experiment I also added a Boolean variable to account for whether a race was preceded by a major training event. For example, I ran the Bolder Boulder this year one week after a marathon. Another recent race in Kuwait was a 5k that I ran two days after an 18-mile long run and three days after another 5k. Incorporating this brought my R squared value up to 0.93.
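
To show roughly what a regression tool like Excel's is doing here, the sketch below fits an ordinary least-squares model in plain Python and compares R squared for distance alone against all five variables. Everything specific in it is an assumption for illustration: the race rows are invented, the function names are hypothetical, and distance is entered as its natural log to match the log relationship noted earlier.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_regression(rows, y):
    """Ordinary least squares with an intercept: [intercept, b1, b2, ...]."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations: (X^T X) coeffs = X^T y
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coeffs, row):
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

def r_squared(coeffs, rows, y):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - predict(coeffs, r)) ** 2 for r, yi in zip(rows, y))
    return 1.0 - ss_res / ss_tot

# Invented rows: (ln miles, altitude in 1000 ft, gain in ft/mile, temp in F,
# 1 if the race followed a major training event else 0)
races = [
    (math.log(3.1), 5.3, 10, 29, 0),
    (math.log(6.2), 5.3, 25, 60, 1),
    (math.log(3.1), 3.9, 5, 75, 0),
    (math.log(13.1), 2.7, 30, 55, 0),
    (math.log(3.1), 0.1, 2, 86, 1),
    (math.log(6.2), 0.1, 3, 80, 0),
    (math.log(9.3), 0.1, 4, 78, 0),
    (math.log(3.1), 0.1, 2, 70, 0),
    (math.log(6.2), 3.9, 15, 65, 0),
]
paces = [5.6, 6.4, 5.7, 6.3, 6.0, 5.9, 6.1, 5.5, 6.0]  # invented, min/mile

distance_only = [(r[0],) for r in races]
r2_distance = r_squared(fit_linear_regression(distance_only, paces), distance_only, paces)
r2_full = r_squared(fit_linear_regression(races, paces), races, paces)
print(f"R squared, distance only: {r2_distance:.2f}")
print(f"R squared, all variables: {r2_full:.2f}")
```

One thing worth knowing: with ordinary least squares, adding a variable can never lower R squared on the data you fit, so some of the climb from 0.79 to 0.93 is expected with only 13 races, whether or not the extra variables truly matter.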

But there was one more bit of nerdiness to tease out. Among the results of Excel’s regression is the P-value for each variable. This stat tells you how likely you would be to see an effect this large by pure chance if the variable actually had no effect on your dependent variable. Basically, the lower the P-value, the stronger the evidence that the variable really matters.

The variables, in order of predictive power on my pace, were:

Major Training Event
Temperature and Elevation gain (roughly tied)

So What

The other neat thing I can now do is plot the predicted pace against what I actually ran. If my pace is faster than the predicted value, that indicates that I had a good race, and the size of the gap gives me some indication of how good a race I had (or vice versa). Additionally, if I found a random distance under some odd conditions (say a neat 18.5k race at 2000 ft in the crisp fall air), this formula could also give me an idea of how to pace myself.
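
As a rough sketch of how a fitted formula like this could be used, the Python below predicts a pace for the 18.5k example and scores a race against its prediction. The coefficients are hypothetical placeholders, not the values from the regression in this post, and the model form (natural log of distance plus linear terms for the other variables) is an assumption.

```python
import math

# Hypothetical coefficients in minutes per mile, for illustration only
COEFFS = {
    "intercept": 5.0,
    "ln_distance_miles": 0.40,
    "altitude_kft": 0.06,       # per 1000 ft of altitude
    "gain_ft_per_mile": 0.004,
    "temp_f": 0.005,
    "hard_week": 0.25,          # race preceded by a major training event
}

KM_TO_MILES = 0.621371

def predicted_pace(distance_miles, altitude_ft, gain_ft_per_mile, temp_f, hard_week):
    """Predicted pace in minutes per mile under the assumed linear model."""
    c = COEFFS
    return (
        c["intercept"]
        + c["ln_distance_miles"] * math.log(distance_miles)
        + c["altitude_kft"] * altitude_ft / 1000.0
        + c["gain_ft_per_mile"] * gain_ft_per_mile
        + c["temp_f"] * temp_f
        + c["hard_week"] * (1 if hard_week else 0)
    )

def pace_margin(actual_pace, predicted):
    """Positive means the actual pace was faster than predicted."""
    return predicted - actual_pace

# The 18.5k example: 2000 ft, a cool 50 F, a flat course assumed
target = predicted_pace(18.5 * KM_TO_MILES, 2000, 0, 50, False)
print(f"Suggested pace: {target:.2f} min/mile")

# Scoring a finished race against its prediction (invented actual pace)
print(f"Margin: {pace_margin(target - 0.10, target):+.2f} min/mile")
```

The sign convention in pace_margin matches the chart: a value above zero means a faster-than-predicted race.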

Variation in my actual pace for the race versus the predicted pace
Above zero means that I ran a faster than predicted pace

It was a fun experiment on something I've been musing about for a while. It's not perfect and, in the academic sense, probably not statistically very meaningful. But I can live with that. The imperfections leave a bit of mystery and room for a bit of the passion involved in running.
