Using "R for Data Science" to Analyze 3rd Down Conversion % by Distance
A detailed look at how I use cfbscrapR & the R programming language to analyze 3rd down conversion rates by distance category from the 2019 college football season
Whether you are a football coach or fan, you know that converting on 3rd down is an important component of winning football. Coaches spend countless hours in both meetings and practice preparing for these critical situations, with the understanding that the game can be won or lost on third down. In past research, I have published statistical analysis showing how 3rd down conversion % is highly correlated with win totals. If you are interested in learning more about how 3rd downs contribute to win totals, you can find it in one of my previous posts in the Blog section of this website.
The plot below from my previous analysis shows the linear relationship between 3rd down conversion % & wins. You can clearly see that as 3rd down conversion % increases, the average number of wins increases with it. What you don't see in the plot below is just how much your chances of converting on 3rd down change based on the distance needed to gain for a 1st down.
If 3rd down conversion % helps increase wins, what helps increase third down conversion %?
The intent of this analysis is to examine how third down conversion rates are impacted by distance categories. While so much emphasis is placed on third downs, this analysis aims to show that what you do on 1st & 2nd down is just as important. In addition, I will walk you step by step through the way I use the R programming language & the cfbscrapR package to pull every snap from the 2019 college football season to perform this analysis.
In addition to sharing snapshots of the code I use in R, I will give an explanation of what I am doing & share my thought process as we go along. In my experience, the best way to learn R is by following along with real world examples and using real code, so I am happy to share this with you. As a disclaimer, anyone experienced in R knows that there is more than one way to "skin a cat" and that other data scientists may have better or quicker ways to do this analysis. However, this works for me. I encourage anyone who reads this to connect with me by email firstname.lastname@example.org or on twitter @js_ace_football to share tips and tricks on how they would do things differently, as I am always looking to improve.
Third Down Conversion % by Distance Groupings in R
The first step is to install the required packages in R, including "cfbscrapR", which is where we will get the data. After the packages are installed & loaded, I read in the 2019 CFB play by play data from both the regular season and postseason and store it as a data frame which I call cfb.both.
After I create the data frame, I call the "glimpse" function, which lets me do a quick check to see if the data was correctly read into R. The screenshot below is a look at just the first 8 columns in my data frame. Most importantly, I see that there are 159,915 observations, which tells me that it was a success. Wow, that's a lot of data!
Cleaning & Filtering Data for All Third Downs
A word of caution: The biggest challenge to a data scientist when using large data sets is the integrity of the data itself. You have to spend the time to ensure the accuracy of your data by appropriately cleaning & organizing it into a usable format. In my experience, more time should be spent cleaning & organizing data than on the analysis itself. There is so much great data that you can pull using cfbscrapR, but for the purposes of this analysis, I only need a few columns. The "select" function allows me to create a new data frame containing the columns (down, distance, play text, yards gained and play type), which is all I need in order to run my analysis. After selecting the columns, the "filter" function is used to keep only 3rd down plays.
In addition to filtering for only third downs, I have to take it a step further by making sure that the plays in my new data frame are truly third down plays. One thing you will learn using cfbscrapR is that the data can sometimes be mislabeled because of the way play by play data is stored in its data source. For this reason, you have to be extra careful before running any analysis. If plays are mislabeled, you must know what you're looking for and have the ability to "clean" the data. Having a good understanding of the game helps, as you can scan the data to filter out inconsistencies that may skew your analysis. My experience as both a data scientist & football coach allows me to better understand the data and keep it relevant to ensure accuracy. One of the quirks found in cfbscrapR is that sometimes plays such as Kickoffs, Kick Returns & others are mislabeled as third down plays, so I correct this by filtering the play_type to only include actual offensive plays and storing the result as a new data frame called "third.downs".
After cleaning and filtering the data, I use another quick check to make sure that I only have third downs included in my data. I use the "summary" function to quickly scan for accuracy in the "down" column, and you can see that I do in fact only have third downs. Remember we started out with nearly 160,000 observations in our original data frame, so taking a little extra time to do things right goes a long way!
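The selection, filtering, and sanity check described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up data frame, not the original code: the column names (other than play_type, which the post names), the play type labels, and the offensive.plays vector are all assumptions chosen for the example.

```r
library(dplyr)

# Toy stand-in for the full play-by-play data frame (all values are made up).
cfb.both <- data.frame(
  down         = c(1, 3, 3, 3, 2, 3),
  distance     = c(10, 2, 7, 10, 5, 12),
  play_text    = c("Run for 4 yds", "Run for 3 yds", "Pass incomplete",
                   "Kickoff for 60 yds", "Pass for 6 yds", "Pass for 15 yds"),
  yards_gained = c(4, 3, 0, 60, 6, 15),
  play_type    = c("Rush", "Rush", "Pass Incompletion",
                   "Kickoff", "Pass Reception", "Pass Reception"),
  stringsAsFactors = FALSE
)

# Hypothetical list of play types treated as real offensive snaps.
offensive.plays <- c("Rush", "Pass Reception", "Pass Incompletion", "Sack")

third.downs <- cfb.both %>%
  # Keep only the columns needed for the analysis.
  select(down, distance, play_text, yards_gained, play_type) %>%
  # Keep third downs, and drop mislabeled non-offensive plays such as kickoffs.
  filter(down == 3, play_type %in% offensive.plays)

# Quick check: min and max of "down" should both be 3.
print(summary(third.downs$down))
```

In this toy data, the kickoff row is labeled as a third down and is removed by the play_type filter, which is the kind of mislabeling the cleaning step is meant to catch.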
Creating Third Down Distance Categories
A critical part of this analysis is being able to place each 3rd down play from the 2019 season into a distance category. Now that I am confident my data frame includes only third down plays, I will create a new column containing the third down distance "category". The purpose of this analysis is to go a step beyond overall third down conversion rates & show how those rates improve based on distance. I install some advanced R packages and use them to create a new column called "category". As you can see in the code below, I break all third downs up into distance categories & identify that range within the new column. I break third downs into 5 categories:
inches 0 - 1 yards
short 2 - 4 yards
medium 5 - 7 yards
long 8 - 10 yards
longer greater than 10 yards
I can always go back and change the categories if I so choose or want to play around with how the results change based on changing the distance categories. Based on my coaching background and how I would game plan third downs, I am comfortable with how I have these categories grouped.
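One way to build the "category" column from the five groupings listed above is dplyr's case_when. This is only a sketch under that assumption, since the post does not say which packages or functions were actually used. The distances below are made up.

```r
library(dplyr)

# Toy third down plays, one per distance category (made-up values).
third.downs <- data.frame(distance = c(1, 3, 6, 9, 15))

third.downs <- third.downs %>%
  mutate(category = case_when(
    distance <= 1  ~ "inches",  # 0 - 1 yards
    distance <= 4  ~ "short",   # 2 - 4 yards
    distance <= 7  ~ "medium",  # 5 - 7 yards
    distance <= 10 ~ "long",    # 8 - 10 yards
    TRUE           ~ "longer"   # greater than 10 yards
  ))

print(third.downs)
```

Because case_when evaluates its conditions in order, each play lands in the first range it satisfies, so changing the cutoffs later only means editing the numbers on the left-hand side.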
Adding Conversion "TRUE or FALSE" to the Data Frame
The next step in the process is determining if a play was successful in converting for a first down. Before we can calculate the final conversion % by category, we need to create a new column called "success" that will have an output of either TRUE or FALSE for a successful conversion. After playing around with a few different ways to do this, the best way for me is to use the "mutate" function to mark a conversion as successful if the yards gained are at least equal to the distance needed. After creating the "success" column, I store the updated data frame, calling it "convert". Using the "View" function, you can see from the snapshot below of the first 9 rows that I was able to create the new column with converted first downs & touchdowns correctly given a "TRUE" label.
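A minimal sketch of the "success" column follows, using made-up plays. It uses `>=` on the assumption that gaining exactly the needed distance also counts as a conversion. The column names distance and yards_gained are assumed.

```r
library(dplyr)

# Toy third down plays (made-up values).
third.downs <- data.frame(
  distance     = c(2, 5, 8, 3),
  yards_gained = c(3, 5, 4, -1)
)

# success is TRUE when the play gained at least the distance needed.
convert <- third.downs %>%
  mutate(success = yards_gained >= distance)

print(convert)
```

Here the first two plays are marked TRUE (3 yards on 3rd & 2, and exactly 5 yards on 3rd & 5), while the 4-yard gain on 3rd & 8 and the 1-yard loss on 3rd & 3 are marked FALSE.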
Calculating Conversion % by Distance Category
It's time to calculate the conversion % for each third down category but in order to do so we need to do a few things. I am going to take the data frame "convert" and break it up into two separate tables. The first table will be called "third.downs" containing the tally of ALL third downs by distance category & the second table will be called "true.table" containing a tally of all successful conversions by distance category. Remember all successful conversions were marked as TRUE within the "convert" data frame.
Below is a side by side view of the two new tables created. On the left is our true.table that has the total number of conversions by distance category and on the right is our third.downs table that has the total number of third down attempts by distance category. You can see from the table on the right hand side that we are dealing with a large amount of data as there are a total of 24,997 third down plays from the 2019 College Football season that fall into one of those 5 distance categories.
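The two tally tables could be built with base R's table function, roughly as sketched here on made-up data. The name attempts.table is a hypothetical stand-in for the all-attempts table, and turning category into a factor with a fixed level order is an assumption added so that both tables list the categories in the same order.

```r
# Toy "convert" data frame (made-up values).
convert <- data.frame(
  category = c("short", "short", "medium", "long", "short", "medium"),
  success  = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Fix the category order so both tables line up row for row,
# including categories with zero plays.
category.levels <- c("inches", "short", "medium", "long", "longer")
convert$category <- factor(convert$category, levels = category.levels)

# Tally of ALL third downs by category.
attempts.table <- as.data.frame(table(category = convert$category))

# Tally of successful conversions (success == TRUE) by category.
true.table <- as.data.frame(table(category = convert$category[convert$success]))

print(attempts.table)
print(true.table)
```

Each result is a two-column data frame with the category and a Freq column holding the count.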
Combining the Two Tables to Calculate the Conversion % by Category
The last step in this process is to combine the two tables discussed above so we can calculate the conversion rates for each third down category. In order to do so, I will rename the columns in the third.table & use the "cbind" function to combine the two tables. After combining the two tables, I use the pipe operator "%>%" with "mutate" to create another column called "conv.pct", which will be the third down conversion % for each distance category.
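The combine-and-divide step can be sketched as below. The counts are invented for illustration and are not the 2019 results, and attempts.table plus the renamed column names are hypothetical.

```r
library(dplyr)

# Toy tally tables (made-up counts, NOT the 2019 results).
true.table     <- data.frame(category = c("short", "medium", "long"),
                             Freq = c(60, 45, 30))
attempts.table <- data.frame(category = c("short", "medium", "long"),
                             Freq = c(100, 100, 100))

# Rename columns so nothing collides after cbind.
names(true.table)     <- c("category", "conversions")
names(attempts.table) <- c("category.all", "attempts")

# cbind pairs rows purely by position, so confirm the categories match first.
stopifnot(identical(true.table$category, attempts.table$category.all))

third.conv <- cbind(true.table, attempts.table) %>%
  select(category, conversions, attempts) %>%
  mutate(conv.pct = conversions / attempts)

print(third.conv)
```

Since cbind matches rows only by position, a keyed join such as merge or dplyr's left_join on the category column is a safer alternative if the two tables might ever be ordered differently.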
Now that I have the final table stored as "third.conv", I want to clean it up a little by rounding off the percentages, adding a header, and making it look nice with the "gt" package.
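A presentation step along these lines might look like the sketch below. The data are made up, the title and subtitle text are placeholders, and using fmt_percent (rather than rounding the column by hand) is one possible choice, not necessarily what was done in the original code.

```r
library(gt)

# Toy final table (made-up values, NOT the 2019 results).
third.conv <- data.frame(
  category    = c("short", "medium", "long"),
  conversions = c(60, 45, 30),
  attempts    = c(100, 100, 100),
  conv.pct    = c(0.6, 0.45, 0.3)
)

# Build the gt table, add a header, and show conv.pct as a rounded percentage.
third.gt <- gt(third.conv)
third.gt <- tab_header(third.gt,
                       title    = "3rd Down Conversion % by Distance",
                       subtitle = "Illustrative data only")
third.gt <- fmt_percent(third.gt, columns = conv.pct, decimals = 1)

third.gt
```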
Results: Not all Third Downs Are Created Equal
The results below are clear, showing that what happens on 1st & 2nd down is critical to giving yourself a better chance to convert on 3rd down. You can see how the chance of a conversion increases significantly as the distance category decreases. With close to 25,000 third down attempts from the 2019 season, there is plenty of data to reinforce the message that every extra yard or two gained on first or second down improves your chance of success on third down. This type of information can help shape your play calling as you look to put your team into third down situations with a higher chance of conversion. Getting from the "medium" category into the "short" category improves your chance of a conversion by more than 10%.
For those who like to see the relationship on a graph, I created the plot below using the "ggplot2" package. You can see how your chances of a conversion go down significantly as the distance categories increase.
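A plot of conversion % by category could be produced with ggplot2 roughly as follows. The bar chart form, labels, and data are assumptions for illustration. The values are made up and are not the 2019 results.

```r
library(ggplot2)

# Toy final table (made-up values, NOT the 2019 results).
third.conv <- data.frame(
  category = factor(c("short", "medium", "long"),
                    levels = c("short", "medium", "long")),
  conv.pct = c(0.6, 0.45, 0.3)
)

# Bar chart of conversion rate by distance category, shortest distance first.
conv.plot <- ggplot(third.conv, aes(x = category, y = conv.pct)) +
  geom_col() +
  scale_y_continuous(labels = function(x) paste0(round(x * 100), "%")) +
  labs(title = "3rd Down Conversion % by Distance Category",
       x = "Distance category",
       y = "Conversion %")

print(conv.plot)
```

Storing category as a factor with explicit levels keeps the bars in football order (shortest to longest) instead of alphabetical order.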
I hope you enjoyed this detailed look into how powerful the R programming language can be at taking large amounts of data and transforming it into a detailed analysis. So, the next time someone tells you how important third down conversion % is, make sure you tell them that it's more important to put yourself in favorable third down situations by gaining additional yards on first & second down!
I want to thank @cfbscrapR for sharing such a great resource & I encourage you to follow on twitter.
As always, feel free to reach out to me by email email@example.com or on twitter @js_ace_football . I am always available to answer questions, share additional insight, guest appear on podcasts or shows, as well as take on projects as a consultant for your team or organization.