Brian Taylor
February 2016
For this assignment I used baseball data downloaded from the Sean Lahman Baseball Database.
I took the photo (above) of Mariano Rivera on June 9, 2013 in Seattle.
In this project I have had some fun exploring the data, learning some of the tools and trying out some of the lessons that I've learned. I haven't attempted to make this a polished (or concise!) report. I simply let my curiosity guide me. The database is such a rich source that there were many more relationships that I didn't have time to explore.
Each of the following seven questions could lead to much deeper analysis. I feel I have only skimmed the surface with each one.
# import pandas, numpy, seaborn, matplotlib and statsmodels
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm
# inline graphics
%matplotlib inline
master = pd.read_csv('./lahman-csv_2015-01-24/Master.csv')
I was curious to see if the relative age effect was present in the data. In other words, are players born earlier in the year over-represented? But, as the histogram below shows, it looks like, if anything, players born in the late summer and early fall are over-represented.
g = master.birthMonth.hist(bins=12, facecolor='green', alpha=0.75)
g.set_title('Birth month of baseball players')
g.set_xlabel('Month')
g.set_ylabel('Frequency')
g.set_xlim(1, 12)
g.set_ylim(0, 2000)
I wanted to take a look at the data decade by decade to see if perhaps a relative age effect became apparent in a certain decade, but the following histograms seem to show a fair level of randomness with a possible 'bump' in late summer and early fall in some decades.
fig = plt.figure(figsize=(20, 20))
for i in range(1, 17):
    year = 1830 + 10*i
    months = master[master.birthYear >= year]
    months = months[months.birthYear < year + 10]
    h = fig.add_subplot(4, 4, i)
    h = months.birthMonth.hist(bins=12, facecolor='green', alpha=0.75)
    h.set_title(str(year) + " - " + str(year+9))
    h.set_xlabel('Month')
    h.set_ylabel('Frequency')
    h.set_xlim(1, 12)
    h.set_ylim(0, 250)
In order to fully explore this topic I would have to obtain data on births by month. I could use US census data, but then I would have to limit the baseball players to US born. Assuming I had all the necessary data I'm not sure exactly what statistical test I would have to perform in order to claim that the distribution of birthdays of baseball players was significantly different from the distribution of birthdays of the general population.
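One standard option here would be a chi-square goodness-of-fit test against the general population's birth-month distribution. A sketch of how that would look; the observed counts and the uniform population proportions below are illustrative placeholders, not real census or Lahman figures:

```python
import numpy as np
from scipy import stats

# Hypothetical monthly birth counts for players (placeholders, not real data)
observed = np.array([1504, 1389, 1466, 1344, 1381, 1318,
                     1405, 1585, 1558, 1526, 1408, 1426])

# Placeholder assumption: births in the general population are uniform by
# month. Real census proportions would replace this.
population_props = np.full(12, 1 / 12.0)
expected = observed.sum() * population_props

# Chi-square goodness-of-fit: do the observed counts deviate from expected?
chi2, p = stats.chisquare(observed, f_exp=expected)
print("chi-square:", chi2, "p-value:", p)
```

A small p-value here would suggest the players' birth-month distribution differs from the assumed population distribution.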
I wanted to get an idea if players are living longer, so I created separate columns for date of birth, date of death and then calculated lifespan in days. I then created the scatterplot below and immediately realized I would have a bit of a problem if I ran a regression. Since younger players are still alive they don't show up (not having a date of death) and recent deaths would skew the data to shorter lifespans. Nevertheless, even without running a regression it's possible to see an upward trend, with far fewer players dying before 20,000 days (54.75 years).
master['dateOfBirth'] = pd.to_datetime(master.birthYear*10000 + master.birthMonth*100 + master.birthDay, format='%Y%m%d')
master['dateOfDeath'] = pd.to_datetime(master.deathYear*10000 + master.deathMonth*100 + master.deathDay, format='%Y%m%d')
master['lifespan'] = (master.dateOfDeath - master.dateOfBirth).dt.days
master.plot(kind='scatter', x='birthYear', y='lifespan', alpha=0.35, xlim=(1800,2000), ylim=(0,40000), figsize=(10,10))
In order to deal with the problem of not being able to look at the complete lifespan of most of the players in the database I could have looked at the proportion of players born in any given year who lived to 10,000 days, 15,000 days, etc. and compared the survival rates over time.
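That idea could be sketched roughly as follows, using the dateOfBirth and lifespan columns built above. The helper name and cutoff date are my assumptions (the cutoff stands in for the database's last update), and players with no recorded death are treated as still alive:

```python
import pandas as pd

def survival_rate(master, threshold_days, cutoff='2014-12-31'):
    """For each birth-year cohort, the share of players known to have
    reached `threshold_days` of life by the data's cutoff date."""
    cutoff = pd.Timestamp(cutoff)
    age_at_cutoff = (cutoff - master.dateOfBirth).dt.days
    # Only cohorts old enough that every member could have reached the
    # threshold by the cutoff -- this avoids the censoring problem above.
    eligible = master[age_at_cutoff >= threshold_days]
    # No death date recorded -> assumed still alive, hence a survivor.
    survived = eligible.lifespan.isnull() | (eligible.lifespan >= threshold_days)
    return survived.groupby(eligible.dateOfBirth.dt.year).mean()
```

Plotting `survival_rate(master, 20000)` against birth year would then show whether the share of players reaching 20,000 days has been rising.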
Next I wanted to see if baseball players are getting taller over time. The scatter plot below (made with Seaborn) shows height (in inches) vs. year of birth, with a regression line clearly showing an upward trend. The dot at the bottom of the graph is not a mistake: it represents Eddie Gaedel, who appeared in one game for the St. Louis Browns on August 9, 1951. He was only 43 inches tall, and he was walked on four pitches.
sb.set(style="darkgrid", color_codes=True)
sb.jointplot("birthYear", "height", data=master, kind="reg",
             xlim=(1830, 2000), ylim=(40, 90), color="r", size=10)
Based on the regression below it appears that the height of an average player has been increasing by about 0.034 inches per year, or about one inch every 30 years. The standard error and the p-value are very small, but given that the independent variable is time I'm not confident that I can deal with the parameters in the normal way.
from scipy import stats
masterclean = master.dropna(subset=['birthYear', 'height'])
x = masterclean.birthYear
y = masterclean.height
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print("slope:", slope)
print("intercept:", intercept)
print("p value:", p_value)
print("std. error:", std_err)
print("r-value:", r_value)
print("r-squared:", r_value**2)
I would be interested to see the data on the heights of all men over the same time period and then compare it to baseball players. I would imagine that increases have been seen in the general population, too, due to better nutrition.
I took a look at some of the salary data. Clearly the salaries have been growing rapidly. I tried a couple of different graphs to get a feel for the growth. The highly skewed nature of the distribution of salaries makes it a bit difficult to work with visually.
salary = pd.read_csv('./lahman-csv_2015-01-24/Salaries.csv')
salary.describe()
sb.set(style="darkgrid", color_codes=True)
sb.jointplot("yearID", "salary", data=salary, kind="reg",
             xlim=(1984, 2015), ylim=(0, 35000000), color="g", size=10)
The mean salary has increased from \$476,299 in 1985 to \$3,980,445 in 2014, and over the same time period the median salary has gone from \$400,000 to \$1,500,000.
# select the salary column so the aggregation skips the non-numeric columns
salary.groupby('yearID')['salary'].mean()
salary.groupby('yearID')['salary'].median()
I like the following boxplot, but the extremely skewed nature of salaries makes it difficult to graph.
ax = sb.boxplot(x="yearID", y="salary", data=salary, width=0.75)
Clearly players are making a lot more money. There are many different aspects that I could pursue here. The dollar figures could be adjusted for inflation, for starters. I could have looked at salaries by team, by position. I could have looked for relationships between salary and performance. There are many areas that would be interesting.
OPS, a newer baseball statistic that is the sum of on-base percentage and slugging percentage, seems to be increasing over time. I've limited the following scatterplot to players with at least 502 plate appearances, the minimum over a regular 162-game season to qualify for a batting title. It's easy to spot the strike-shortened years by the gaps, where very few players had that many plate appearances.
First I had to do a little work on the batting database. Players who played for more than one team in a season are represented on different lines for each team that they played for. I wanted to combine those lines so that every line represents a single player's full season. I also had to add a few columns in order to calculate OPS.
batting = pd.read_csv('./lahman-csv_2015-01-24/Batting.csv')
# If a player played for two or more teams in a single season he will have multiple lines.
# I wanted to combine those lines so that every line represents the full season played by
# a given player.
batting = batting.groupby(['yearID', 'playerID']).sum()
# put yearID and playerID back as columns, instead of indexes
batting.reset_index(inplace=True)
# some statistics weren't recorded in the old days, so replace the NaN entries with zero
batting = batting.fillna(0)
# Let's take a look at a few rows to see if everything looks OK.
batting.loc[70005:70010]
# Let's add a column for Plate Appearances.
batting['PA'] = batting.AB + batting.BB + batting.HBP + batting.SH + batting.SF
# And a column for On Base Percentage.
batting['OBP'] = (batting.H + batting.BB + batting.HBP) / (batting.AB + batting.BB + batting.HBP + batting.SF)
# And a column for total bases. Note: because the H column includes extra base hits
# the formula looks a bit different than normally written.
batting['TB'] = batting.H + batting['2B'] + batting['3B'] * 2 + batting.HR * 3
# And a column for Slugging Percentage.
batting['SLG'] = batting.TB / batting.AB
# And finally a column for OPS.
batting['OPS'] = batting.OBP + batting.SLG
# Babe Ruth and Barry Bonds are the only players with an OPS greater than 1.3 and
# more than 502 plate appearances. Each did it three times.
batting[(batting.OPS > 1.3) & (batting.PA >= 502)]
batting_eligible = batting[batting.PA >= 502]
sb.set(style="darkgrid", color_codes=True)
sb.jointplot("yearID", "OPS", data=batting_eligible, kind="reg",
             xlim=(1880, 2015), color="r", size=10)
Rather than look at the trend in OPS for players with sufficient plate appearances to qualify for a batting title I could have looked at an aggregate OPS by team or league over time. There are many possible interesting relationships to explore with OPS. I could have looked at salary vs. OPS, or age of player vs. OPS. Again, there are numerous possibilities.
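The aggregate-OPS idea could be sketched like this, using the column names added above (the helper name is mine):

```python
import pandas as pd

def league_ops_by_year(batting):
    """Sum the counting stats across all players per season, then compute a
    league-wide OPS from the totals (uses the H, BB, HBP, AB, SF and TB
    columns built above)."""
    season = batting.groupby('yearID')[['H', 'BB', 'HBP', 'AB', 'SF', 'TB']].sum()
    obp = (season.H + season.BB + season.HBP) / \
          (season.AB + season.BB + season.HBP + season.SF)
    slg = season.TB / season.AB
    return obp + slg
```

Computing OPS from the seasonal totals, rather than averaging per-player OPS values, weights every plate appearance equally.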
Let's compare the batting performance of left-handed hitters vs. right-handed hitters. I'll ignore switch hitters, since I don't have data on their performance from each side of the plate. I'll use OPS instead of straight batting average to compare the two groups. To keep it straightforward I'll only look at batters since 1940, limited to those with at least 502 plate appearances.
# take the handedness from 'master' (right refers to the right table)
right = master[['playerID', 'bats']]
# take the batting stats from 'batting' (left refers to the left table)
left = batting[['playerID', 'yearID', 'PA', 'OPS']]
# join the data to create a single data frame with the required columns
mg = pd.merge(left, right, how='inner', on='playerID')
# limit the players to 1940 and beyond
mg = mg[mg.yearID >= 1940]
# limit the players to 502+ plate appearances (min. needed to be eligible for batting title)
mg = mg[mg.PA >= 502]
# separate the data for lefties and righties
lefties = mg[mg.bats == 'L']
righties = mg[mg.bats == 'R']
The OPS of lefties is slightly better than that of righties for players from 1940 on who made at least 502 plate appearances. I performed a Mann-Whitney U test to see if the difference was significant. The result lets me reject the null hypothesis that the two underlying distributions are the same at the 95% confidence level (the p-value is actually exceptionally small).
print("Lefties OPS (mean & std. deviation):", lefties.OPS.mean(), lefties.OPS.std())
print("Righties OPS (mean & std. deviation):", righties.OPS.mean(), righties.OPS.std())
s, p = stats.mannwhitneyu(lefties.OPS, righties.OPS, alternative='two-sided')
print("U-statistic:", s)
print("p-value:", p)
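The p-value says the difference is unlikely to be chance, but not how large it is. A common companion effect size for the Mann-Whitney U is the rank-biserial correlation, r = 1 - 2U/(n1*n2); a sketch:

```python
def rank_biserial(u_stat, n1, n2):
    """Rank-biserial correlation: a simple effect size for the Mann-Whitney
    U test, r = 1 - 2U/(n1*n2), ranging from -1 to 1 (0 means no effect)."""
    return 1.0 - 2.0 * u_stat / (n1 * n2)
```

Applied as `rank_biserial(s, len(lefties), len(righties))`, a value near zero would say the lefty advantage, while statistically significant, is small in practical terms.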
Based on this result I don't think it's possible to claim that left-handed hitters are better than right-handed hitters, since a hitter's performance differs markedly depending on whether he faces a right-handed or left-handed pitcher. Ideally, I would need data showing all four possible matchups (righty vs. righty, righty vs. lefty, lefty vs. righty, lefty vs. lefty). The Lahman data doesn't have that level of granularity. As in the previous section, another approach would have been to compare all lefty at bats with all righty at bats.
Sports writers like to focus on run differential to determine whether a team is under-performing or over-performing. Let's see what's going on. I'll limit my exploration to the National League teams from 1920 on (the so-called "live ball era"). The goal is to come up with a linear model to predict the number of wins a team should get based on run differential. For the independent variable I'm going to use the actual run differential divided by the number of games (to normalize for shorter seasons). The dependent variable will be similarly normalized and so will actually be winning percentage.
teams = pd.read_csv('./lahman-csv_2015-01-24/Teams.csv')
teams['rundiff'] = (teams.R - teams.RA) / teams.G
teams['winpercent'] = teams.W / teams.G
NLteams = teams[teams.lgID == 'NL']
NLteams = NLteams[NLteams.yearID >= 1920]
# I wanted to try out a different Ordinary Least Squares implementation.
# The following code was suggested in a StackOverflow discussion to deal
# with multiple x-values. Initially that's what I had in mind, but I adapted
# it to work with a single column of x-values.
y = NLteams.winpercent
x = [NLteams.rundiff]

def reg_m(y, x):
    # stack the first predictor with a column of ones (the intercept)
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    # prepend any additional predictors
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

print(reg_m(y, x).summary())
The t-values for both coefficients are very large and the corresponding p-values are essentially zero. The R-squared is 0.878. So, we could say that run differential accounts for 87.8% of the variation in winning percentage. This isn't too surprising since in baseball to win a team needs to score more runs than it allows.
Based on these results we could come up with a model to predict the number of wins that a team should get based on their run differential. The equation would be:
$\frac{W}{G} = 0.1037\left(\frac{RS - RA}{G}\right) + 0.4989$
Currently the season is 162 games long. Multiplying both sides of this equation by 162 gives the following equation:
$W = 0.1037\,(RS - RA) + 80.8218$
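Wrapped as a quick predictor using the coefficients above (the function name is mine):

```python
def predicted_wins(runs_scored, runs_allowed):
    """Predicted wins over a 162-game season, using the coefficients
    fitted above (slope 0.1037 and 162 * 0.4989 = 80.8218)."""
    return 0.1037 * (runs_scored - runs_allowed) + 80.8218

round(predicted_wins(647, 525), 1)  # the 2015 Cardinals -> 93.5
```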
Let's see how well this matched the National League 2015 season.
| Team | W | L | RS | RA | RS-RA | pred | W-pred |
|------|---|---|----|----|-------|------|--------|
| St. Louis Cardinals | 100 | 62 | 647 | 525 | 122 | 93.5 | 6.5 |
| Pittsburgh Pirates | 98 | 64 | 697 | 596 | 101 | 91.3 | 6.7 |
| Chicago Cubs | 97 | 65 | 689 | 608 | 81 | 89.2 | 7.8 |
| LA Dodgers | 92 | 70 | 667 | 595 | 72 | 88.3 | 3.7 |
| NY Mets | 90 | 72 | 683 | 613 | 70 | 88.1 | 1.9 |
| SF Giants | 84 | 78 | 696 | 627 | 69 | 88.0 | -4.0 |
| Washington Nationals | 83 | 79 | 703 | 635 | 68 | 87.9 | -4.9 |
| Arizona Diamondbacks | 79 | 83 | 720 | 713 | 7 | 81.5 | -2.5 |
| San Diego Padres | 74 | 88 | 650 | 731 | -81 | 72.4 | 1.6 |
| Miami Marlins | 71 | 91 | 613 | 678 | -65 | 74.1 | -3.1 |
| Colorado Rockies | 68 | 94 | 737 | 844 | -107 | 69.7 | -1.7 |
| Milwaukee Brewers | 68 | 94 | 655 | 737 | -82 | 72.3 | -4.3 |
| Atlanta Braves | 67 | 95 | 573 | 760 | -187 | 61.4 | 5.6 |
| Cincinnati Reds | 64 | 98 | 640 | 754 | -114 | 69.0 | -5.0 |
| Philadelphia Phillies | 63 | 99 | 626 | 809 | -183 | 61.8 | 1.2 |
Of course, it's not a perfect match. At the extremes, the Cubs won 7.8 more games than the model predicted and the Reds won 5.0 fewer. One way to think about the difference between actual and predicted wins is to characterize a team as lucky or unlucky. If you view run production as a stochastic process, then teams with higher run differentials will obviously win more games, but to exceed the prediction a team must have won some very close games or, alternatively, lost a few very lopsided ones.
Options for further study are numerous. For instance I could include additional variables such as salary or team OPS to see if a better model could be developed.