Since I am no longer in the Excel dojo that I had in my
civilian job, I needed some way to keep my skills sharp. So I decided to nerd
out a little on my running log. In a previous post, I had noted the logarithmic
relationship between distance and pace. While this was a pretty strong
relationship, I was curious about other factors. Their effects, though,
seemed harder to tease out of my data than the effect of distance on
pace, so I went full geek and did a multi-variable linear regression. Enjoy.

## Deciding What to Test

After some thought and searching around the web, I
settled on looking at the following factors:

- Altitude
- Temperature
- Elevation gain

I chose these factors because they were easy to get for
the races I have done this year. I have also done races over a wide variety of altitudes
this year (Boulder, Ft. Bliss, Boise and Kuwait) so I have a lot more
sample points than I usually have. I also had a decent spread of temperatures
from 29 °F at
the Frozen Foot 5k (February in Boulder) to an 86 °F run in Kuwait. I did not do any
races with significant elevation gains this year (say over 300 ft per mile), but I started using Strava a
lot more this year and it seems like a neat variable to test.

I did not look at other variables like wind, elevation loss or
technicality of terrain. I skipped wind because most courses are loops, and
while you never get back from the wind what you give to it, I was not sure
that I wanted to go back and figure out what percentage of each race was into
the wind. Strava doesn’t give elevation loss, and I did not feel like going
through any more hassle to figure it out for point-to-point courses. I left out
technicality of terrain because I was not sure how best to put a number to
it. Perhaps next year, when I get back to doing tempo runs in the mountains, I
will revisit this one.

## The dataset

While my racing for the first half of the year was light,
during my deployment in Kuwait I am racing almost every week. Overall I had 13 races in my dataset. That is too small a sample to support statistically robust conclusions, but good enough for fun.

When I did a straight pace v. distance plot, the logarithmic
relationship was still there, but it was not as clean as it is with my PRs.

Pace versus Distance for my 2014 Races with a log curve fit

Pace versus distance for my personal records (with a log curve fit)
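For anyone who wants to try the log curve fit outside of Excel, here is a minimal sketch in Python. It fits pace = a + b * ln(distance) by ordinary least squares. The function name and the four data points are made up for illustration; they are not the races plotted above.

```python
import math

def fit_log_curve(distances, paces):
    """Least-squares fit of pace = a + b * ln(distance)."""
    xs = [math.log(d) for d in distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(paces) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, paces))
    b = sxy / sxx            # slope against ln(distance)
    a = mean_y - b * mean_x  # intercept
    return a, b

# MADE-UP numbers for illustration only (not the actual race data):
# distance in km, pace in minutes per km.
distances = [5.0, 10.0, 21.1, 42.2]
paces = [3.40, 3.55, 3.75, 3.95]

a, b = fit_log_curve(distances, paces)

# Estimated pace for a distance that is not in the data, e.g. 15 km.
predicted_15k = a + b * math.log(15.0)
print(f"pace = {a:.3f} + {b:.3f} * ln(distance)")
print(f"predicted 15k pace: {predicted_15k:.2f} min/km")
```

A positive `b` means pace slows as distance grows, which is the shape of the curves in the plots.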

## But I regress. . .

In a stats class offered through work I learned about the
Linear Regression feature in Excel. The first step was to ensure that my variables
had a linear relationship to pace (in other words, doubling the altitude
doubles its effect on your pace). A little Googling found Run Works, which takes formulas from Jack Daniels (no, not that one) and other exercise physiology
sources. This site allows you to put in a time and distance and then gives you
estimated times for different altitudes, elevation gains, etc. Based on the
formulas those experts used, at least, altitude, elevation gain and temperature
appeared to have a reasonably linear relationship to pace.

Pace versus Altitude for a 16:49 5k at sea level (Source: Run Works / Jack Daniels' formulas)

Pace versus Altitude for a 16:49 5k with 40 ft of elevation gain (Source: Run Works / Jack Daniels' formulas)

Pace v. Temperature for a 16:49 5k at 50 °F (Source: Run Works / Jack Daniels' formulas)

For altitude I used the average value for the race (rounded to about 100 ft). I got my weather data from NOAA for each race. The elevation gain
I got from Strava, which I believe used the digital terrain mapping (DTM) data
from Google Earth or Maps.

## Results

Using distance alone gave me an R squared value of 0.79
(the closer to 1 the better the prediction). Throwing in altitude, elevation
gain and temperature brought this up to 0.87. Not bad.

As one final experiment I also added a Boolean variable to account for whether
a race was preceded by a major training event. For example, I ran the
Bolder Boulder this year one week after a marathon. Another recent race in
Kuwait was a 5k that I ran two days after an 18-mile long run and three days
after another 5k. Incorporating this brought my R squared value up to 0.93.
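Excel did the fitting for me, but the same kind of regression can be sketched in plain Python. This is a rough illustration, not the spreadsheet itself. It assumes distance enters as ln(distance), in line with the logarithmic relationship noted earlier, and it encodes the major training event as a 0/1 flag. Every race in the table, and every function name, is made up for the example.

```python
import math

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, ys):
    """Ordinary least squares via the normal equations, with an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def r_squared(coefs, rows, ys):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coefs, r)) ** 2 for r, y in zip(rows, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# MADE-UP races for illustration only (not the actual 13-race dataset):
# (distance_km, altitude_kft, temp_f, gain_ft_per_mile, tired_flag, pace_min_per_km)
races = [
    (5.0, 5.4, 29, 20, 0, 3.74),
    (10.0, 5.4, 60, 35, 1, 4.14),
    (21.1, 3.9, 55, 15, 0, 4.14),
    (42.2, 2.7, 50, 40, 0, 4.28),
    (5.0, 0.0, 86, 5, 1, 3.87),
    (10.0, 0.0, 80, 5, 0, 3.89),
    (5.0, 0.0, 75, 10, 0, 3.74),
    (8.0, 0.1, 70, 8, 0, 3.81),
    (15.0, 2.7, 45, 60, 0, 4.05),
]

paces = [r[5] for r in races]
full_rows = [[math.log(r[0]), r[1], r[2], r[3], r[4]] for r in races]
distance_rows = [[row[0]] for row in full_rows]

r2_distance = r_squared(fit_ols(distance_rows, paces), distance_rows, paces)
r2_full = r_squared(fit_ols(full_rows, paces), full_rows, paces)
print(f"R squared, distance only: {r2_distance:.2f}")
print(f"R squared, all variables: {r2_full:.2f}")
```

One caution the sketch makes visible: adding variables can never lower the in-sample R squared, so a higher number on a small dataset is not proof that each new variable matters.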

But there was one more bit of nerdiness to tease out. Among
the results of Excel’s regression is the P-value for each variable. This stat
estimates how likely it is that you would see an effect this large purely by
chance if the variable actually had no effect on your dependent variable.
Basically, the lower the P-value, the stronger the evidence that the variable matters.
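For the curious, this is roughly how a P-value like that is derived in a standard linear regression: each coefficient is divided by its standard error to get a t statistic, and the P-value is the two-tailed probability of a t statistic at least that large. The symbols below are my own notation, not Excel's labels.

```latex
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)},
\qquad
p_j = 2\,\bigl(1 - F_{t,\,n-k-1}(|t_j|)\bigr)
```

Here $\hat{\beta}_j$ is the fitted coefficient for variable $j$, $F_{t,\,n-k-1}$ is the cumulative distribution function of Student's t distribution, $n$ is the number of races and $k$ is the number of variables. With 13 races and 5 variables, that leaves only $13 - 5 - 1 = 7$ degrees of freedom, which is part of why I would not lean too hard on these numbers.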

The variables, in order of predictive power on my pace, were:

1. Distance
2. Altitude
3. Major Training Event
4. Temperature and Elevation gain (roughly tied)

## So What

The other neat thing I can now do is plot the predicted
pace against what I actually ran. If my pace is faster than the predicted value, that indicates that I
had a good race, and the size of the gap gives me some
indication of how good a race I had (or vice versa). Additionally, if I found a random distance under some odd conditions (say a neat 18.5k
race at 2000 ft in the crisp fall air), this formula could also give me an idea of how to
pace myself.

Variation in my actual pace for the race versus the predicted pace. Above zero means that I ran a faster than predicted pace.
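Here is a small sketch of how a fitted formula could be used for both purposes: pacing an odd race and scoring a finished one. The coefficients are hypothetical placeholders, not the values from my regression, and the 50 °F temperature and 10 ft per mile of gain for the 18.5k example are assumptions I picked to stand in for "crisp fall air" on a flat course.

```python
import math

# HYPOTHETICAL coefficients for illustration only; these are NOT the
# values fitted from the actual race data. Pace is in minutes per km.
COEFS = {
    "intercept": 3.0,
    "ln_distance_km": 0.25,
    "altitude_kft": 0.03,
    "temp_f": 0.004,
    "gain_ft_per_mile": 0.002,
    "tired": 0.10,
}

def predicted_pace(distance_km, altitude_ft, temp_f, gain_ft_per_mile, tired=False):
    """Predicted pace (min/km) from the linear model above."""
    return (
        COEFS["intercept"]
        + COEFS["ln_distance_km"] * math.log(distance_km)
        + COEFS["altitude_kft"] * (altitude_ft / 1000.0)
        + COEFS["temp_f"] * temp_f
        + COEFS["gain_ft_per_mile"] * gain_ft_per_mile
        + COEFS["tired"] * (1.0 if tired else 0.0)
    )

def pace_delta(actual, predicted):
    """Positive means the actual pace was faster than predicted."""
    return predicted - actual

def format_pace(minutes):
    """Format a pace in decimal minutes as m:ss."""
    total_seconds = round(minutes * 60)
    whole_minutes, seconds = divmod(total_seconds, 60)
    return f"{whole_minutes}:{seconds:02d}"

# Pacing estimate for the 18.5k example (temperature and gain are assumed).
target = predicted_pace(18.5, altitude_ft=2000, temp_f=50, gain_ft_per_mile=10)
print("target pace per km:", format_pace(target))

# Scoring a finished race: a made-up actual pace of 3.95 min/km.
delta = pace_delta(actual=3.95, predicted=target)
print("good race" if delta > 0 else "rough day", f"({delta:+.2f} min/km)")
```

The sign convention matches the chart: above zero means I ran faster than the formula predicted.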

It was a fun experiment on something I've been musing about
for a while. It's not perfect and, in the academic sense, probably not statistically very meaningful. But I can live with that. The imperfections leave a bit of mystery and room for a bit of
the passion involved in running.
