Quant Data Collection and Player Selection
Code used at the bottom.
Quantitative data collection
We knew from the start that we would need a relatively large number of players for this project to produce something valuable. However, there were two concerns. First, we needed to make sure that at least some of the players would almost certainly have a strong media presence, which would guarantee plenty of data to analyze. Second, including only players with a strong media presence could have biased our data so strongly that the results would be less valuable. We compromised by first coming up with a list of 32 players across all position groups (some position groups had more players than others) that we as a group, without much investigation, believed would have a strong presence in reported media. After we compiled this list, we also realized that we didn’t want to include players from drastically different eras of football. We then generated a list of 60 random NFL players so that the majority of the reporting would be on players who were not hand-selected and whose presence in the media was relatively unknown. We also made sure this list included only players who had played 2 seasons in the NFL post-2010, which guaranteed that there would be a good amount of relevant quantitative and qualitative data available for analysis. This gave us 92 total players to analyze. Considering the work that would need to be done for each one of these players, this seemed like a reasonable number that would produce interesting results without taking all of our time. Our final list included 11 Defensive Backs, 9 Defensive Linemen, 7 Linebackers, 11 Offensive Linemen, 24 Quarterbacks, 10 Running Backs, 5 Tight Ends and 13 Wide Receivers. With this pool, we would be able to analyze a diverse group of players and also inspect media sentiment by position group rather than only for all players as one group.
After we had our pool of players, we needed to start collecting data on them. The first step in collecting the quantitative data was locating and analyzing the data available on pro-football-reference.com (from here on, denoted as PFR). We needed to investigate how each position group’s data was structured and make sure that it would be valuable for us in the analysis and correlation stage later on. Each position group obviously had unique data, but fortunately for us, PFR created a stat called Approximate Value (AV). This stat gives us a normalized view across all position groups of the value of a player towards their team’s success. The AV a player is assigned has the same meaning regardless of what position the player plays. A Running Back with an AV of 14 contributed the same amount towards their team’s success as a Linebacker with an AV of 14. The data was also structured simply in a table with headers, so organizing it into a database would be relatively easy. The only position group with somewhat less informative data available was the Offensive Linemen (OL). The OL had two main tables, snap counts and penalties. However, neither of these stats is an advanced analytic, and they don’t show performance as comprehensively as we had hoped. Nevertheless, these two tables were sufficient, particularly penalties, and we still had AV for OL. Once we confirmed that PFR had the data we needed in a structured manner, it was time to extract it. There were two choices: write code to do this automatically, or visit each player’s page and save the provided table as an Excel file (a feature provided by PFR). We opted for the latter, as it seemed more time efficient to divide and conquer than to write the code to accomplish this. After we had all of the Excel files, we organized them by position group and began the process of cleaning the data.
The data, while coherently structured overall, also included summarized career totals. While interesting, we didn’t need these for our purposes. There was also some redundancy in the headers of the data, which had to be resolved. This was relatively simple, as we just needed to go through and remove the career totals and slightly tweak the headers. For example, running backs would have two “Yards” stats, one for receiving and one for rushing. We simply changed them to specify rushing or receiving. For ease, we wrote some code that would take every file for each position group and merge them into one file. We also wrote a simple PowerShell script that converted all of the files to CSV. Later on, we realized we needed to have the player name attached to a given player’s data as well. We had initially created a player ID schema, but the qualitative data collection and resulting database did not use it, so we needed to add the player name as a column for each dataset as well. At this stage we had one CSV file for each position group, containing the data by season for each player in that position group. Next we imported these CSVs into one database inside DBBrowser. We did this because most of our analysis and future coding would be done in Python, and DBBrowser makes it simple to store and work with data in a Python coding environment. Also, we simply had experience with it. These CSV files and player names served as the foundation that the rest of the pipeline was built on, allowing the quantitative and qualitative datasets to be reliably merged later in the project.
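The cleaning and import steps described above can be sketched roughly as follows. This is a minimal illustration and not the code we ran: it assumes the DBBrowser database is an ordinary SQLite file, and the function names, the replacement header names, and the rule used to detect season rows are all hypothetical.

```python
import csv
import io
import sqlite3


def clean_player_table(csv_text, player_name, header_names):
    """Turn one exported player table into a list of per-season records.

    header_names is a hand-written list of unambiguous column names
    (an assumption), e.g. "rushing_yards" and "receiving_yards" in
    place of two columns that are both called "Yds".
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    records = []
    for row in rows[1:]:  # skip the original, ambiguous header row
        # Keep only rows whose first cell starts with a 4-digit season;
        # this drops "Career" totals and blank lines.
        if not row or not row[0][:4].isdigit():
            continue
        record = dict(zip(header_names, row))
        record["player"] = player_name  # attach the player name column
        records.append(record)
    return records


def load_into_sqlite(conn, table, records):
    """Create the table if needed and append the records as text columns."""
    columns = list(records[0].keys())
    column_sql = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(record[name] for name in columns) for record in records],
    )
    conn.commit()
```

Calling clean_player_table once per player file and passing the combined records to load_into_sqlite would give one table per position group, which matches the layout described above.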
At this stage, we had quantitative data for 92 players stored in a database that would be easily accessible for future analysis.
*The quantitative data used in this project was all collected from pro-football-reference.com.
Qualitative data sentiment analysis
The first step in analyzing the text from the articles we’d scraped was getting that data into a workable space. We created a Python coding environment through Anaconda Navigator, which included libraries such as torch, transformers, accelerate, sentencepiece, pymongo, pandas, numpy and tqdm. Our initial idea was to use a word bank, but we quickly came to realize that a word bank would not produce sentiment scores with the level of depth we set out to achieve. We investigated other methods of getting a sentiment score from the text and discovered the Bidirectional Encoder Representations from Transformers (BERT) family of models. BERT is essentially a model built for text analysis, with numerous variations and a multitude of different applications. We felt confident that this would give us a more reliable sentiment score. However, there were plenty of roadblocks. Before we could even begin analyzing the data, we had to do some restructuring of the qualitative data. As mentioned, the qualitative data was separated into two collections. One collection held articles where the player name appeared in the URL, and the other held articles where the player was simply mentioned in the article. We had to pull from both collections and ensure the model was checking every article from both. Once this was finished, we had to figure out which pretrained BERT model would be best to use. As previously mentioned, there are many different variations, each trained on a different set of text and with a different means of extracting sentiment from that text. However, none of these variations were trained on sports text. Sports text is unique, and the way things are discussed in sports conversations can be difficult to interpret. Take this example sentence: “The quarterback forced a few throws into tight coverage, but his willingness to challenge defenses downfield kept the offense aggressive and ultimately opened up the run game.”
Without prior knowledge of what “tight coverage” or “the run game” means, the meaning of this sentence can be lost. Not only that, but the sentiment that this sentence reflects on the quarterback is difficult for a model to pinpoint. If the model is scoring each individual word, words like “challenge”, “forced” and “aggressive” may move the sentence towards a negative label rather than one that reflects positively on the QB’s performance, as this sentence does. We tried three different models on our data: a BERT model that was trained on financial text specifically for sentiment analysis, a BERT model using Aspect Based Sentiment Analysis that was trained on the English language, and a BART model built by Meta. All three variations struggled in their own way, but the Meta model seemed to produce the most reliable results and was the least likely to classify obviously negative text as positive. The BART Meta model is slightly different from BERT: BART models have encoders and decoders, are better at understanding relationships between sentences, and handle context and phrasing better. We used a version trained on the Multi-Genre Natural Language Inference (MNLI) corpus so that it would be readily equipped to understand the text, and we implemented zero shot classification. Rather than returning a percent score of sentiment, zero shot classification determines the likelihood of a given piece of text being positive, neutral or negative with respect to a target. This is great because we can focus the model on one player, but difficult because it doesn’t really give a “score”. Instead, it gives a confidence score. In labeling the given text, it produces a percentage that reflects how confident the model is that the text is positive, neutral or negative. This confidence score eventually became a part of the final score that we correlated to the quantitative data we collected.
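A minimal sketch of player-targeted zero shot classification is shown below. It assumes the Hugging Face transformers pipeline and the facebook/bart-large-mnli checkpoint, which fits the description above but is not confirmed as the exact checkpoint we ran. The hypothesis wording and the function names are hypothetical.

```python
LABELS = ["positive", "neutral", "negative"]


def build_classifier():
    """Load the zero shot pipeline (downloads the model on first use)."""
    # Imported here so the helper below can be used without the model.
    from transformers import pipeline

    # Assumed checkpoint: a BART model fine-tuned on MNLI.
    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def classify_about_player(classifier, text, player):
    """Return (label, confidence) for how the text reflects on one player."""
    # The pipeline fills "{}" with each candidate label, so the player
    # name is baked into the hypothesis to focus the model on them.
    template = f"The sentiment of this text toward {player} is {{}}."
    result = classifier(
        text, candidate_labels=LABELS, hypothesis_template=template
    )
    # Labels and scores come back sorted from most to least likely.
    return result["labels"][0], result["scores"][0]


# Example usage (requires the transformers library and the model weights):
# classifier = build_classifier()
# label, confidence = classify_about_player(
#     classifier, "He threw three interceptions in the loss.", "Jordan Example"
# )
```

The returned confidence is the top label’s probability among the three candidate labels, which is the number referred to above as the confidence score.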
While we had a model selected and the text available, we needed a multitude of helper functions and additional support code to make sure everything went smoothly. These helper functions handled tasks such as getting player names and matching them to text, turning labels into numeric scores, classifying texts, scoring articles, applying keyword penalties, debugging and more that we’ll discuss later. We also needed to decide how much of the text the model would actually analyze. As mentioned earlier, some of these articles only mentioned a player by name, and the player may not have been a focus of the article. We couldn’t feed the entire body of text into the model for a given player because much of that text might not be relevant to the player. The easy solution was to point the model at text that contained the player name in some capacity. One of our helper functions does this by looking for the player’s full name or just their last name when deciding what text to analyze. Even so, the model needs more information than just the player’s name to produce a sentiment. Initially, we decided on breaking the text into chunks of 300 tokens: if the player’s name was mentioned, the code would take 150 tokens before the name and 150 tokens after for analysis. When players were mentioned only briefly in the article, though, this introduced an extreme amount of irrelevant noise into the data. Even for articles about the given player, there was a lot of extra information that wasn’t particularly relevant. We also tried including the chunks that preceded or followed the selected chunk with the player name, but this had the same problem. The best course of action was obvious, but costly. We added some helper functions and altered our approach to focus on sentences. The model would go over every sentence in the article, find instances of the player’s full name or last name, and analyze those.
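The sentence-level matching described above might look something like the following sketch. The regex-based sentence splitter and the function names are assumptions made for illustration. Our helper functions may have split and matched text differently.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def sentences_mentioning(text, full_name):
    """Return the sentences that contain the player's full or last name."""
    last_name = full_name.split()[-1]
    # Word boundaries stop "Doe" from matching inside a longer word.
    pattern = re.compile(
        rf"\b(?:{re.escape(full_name)}|{re.escape(last_name)})\b",
        re.IGNORECASE,
    )
    return [s for s in split_sentences(text) if pattern.search(s)]
```

A splitter this simple will stumble on abbreviations such as “Jr.” or “No. 12”, so a real run would likely want a more careful sentence tokenizer.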
We eventually included the sentences that followed as well, which introduced some bugs into the code that had to be resolved. Our code would double count a sentence if the player’s name appeared both in that sentence and in the one immediately before it. Once this was fixed, our model was looking at all of the sentences of the text for each article, finding player name matches, analyzing those sentences, and then taking the following sentence as well. This is where the confidence score came back into play. We noticed in our test runs that some sentences were producing low confidence scores. For example, a sentence may have been labeled as positive, but the model was only fifty percent confident that the sentence was indeed positive. A low confidence level seemed to invalidate the sentence, so we added a confidence threshold of .6 so that only sentences above a sixty percent confidence level were kept and analyzed. We used .6 as the threshold because, in testing, our zero shot model naturally gave high confidence scores when it was sure about the classification of a sentence, and when it was unsure, the score would typically range from .4 to .6. In our test runs, .6 proved to be a practical threshold that filtered out weak predictions while preserving enough data for stable article-level scoring. Remember that the confidence level was tied to the sentiment score we would be using for correlation, so in doing this we slightly biased the scores for each player, since every retained confidence was above .6. There were ways to combat this, though. Despite selecting the pretrained model that misclassified sentences the least often, it was occasionally producing strange scores when a player had clearly done something extremely negative. For example: “The player had produced a career high season and set new records, but this drunk driving incident brought his season to an end.”
This sentence has positive aspects and would occasionally get misclassified as positive. In our opinion, off the field incidents that result in trouble with the law should be hyper negative. Thus, we added a word bank that included hyper negative words such as “arrest”, “murder”, “battery”, etc. When one of these words is detected in a sentence, the code takes the confidence level the model produced and reduces it by .2, up to a maximum of .8 per sentence. So even if there were 5 negative words mentioned, the most the final score could be reduced was .8. We relied heavily on the model to correctly capture less detrimental, but still negative, things that weren’t a part of our word bank. Through testing, we found that law violations in particular were not being classified as negative in some cases, and this resolved that issue. This process, however, was very computationally expensive. Each complete run took close to 24 hours. While the test runs were beneficial, time became a more important factor than completely accurate results. We attempted to optimize run time and memory usage, but that had a negligible effect on the overall runtime. We also set a minimum article count of 10, which only removed 2 of our players for a total of 90. The final step of this process was to aggregate the results and store them. Once they were stored, they were moved into a coding environment for analysis. While we did ensure that we were only analyzing unique URLs, that didn’t necessarily mean that we were analyzing unique content. An important event such as an arrest may have multiple articles that comment on it, each with a different URL. Our scores do not account for that, meaning there could be redundancy in our data. Also, the final sentiment score could be refined by improving the model’s ability to accurately classify and score the data.
Nevertheless, without investigating shared resources or other methods that could have reduced the run time, we felt comfortable with the scores this model produced.
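To make the scoring rules above concrete, here is a rough sketch of how the following-sentence selection, the .6 confidence threshold, the keyword penalty and the minimum article count could fit together. The constants come from the text, but everything else is an assumption: the mapping from label to signed score, counting each word bank entry once per sentence, averaging to aggregate, and all function names are hypothetical.

```python
import re
from statistics import mean

CONFIDENCE_THRESHOLD = 0.6
PENALTY_PER_WORD = 0.2
MAX_PENALTY = 0.8
MIN_ARTICLES = 10
# Only these three words are named in the text; the full bank was longer.
HYPER_NEGATIVE = ("arrest", "murder", "battery")


def indices_to_analyze(sentences, last_name):
    """Indices of sentences naming the player plus the sentence after each.

    Collecting indices in a set means a sentence that both follows a
    match and is itself a match is only counted once.
    """
    pattern = re.compile(rf"\b{re.escape(last_name)}\b", re.IGNORECASE)
    keep = set()
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            keep.add(i)
            if i + 1 < len(sentences):
                keep.add(i + 1)
    return sorted(keep)


def sentence_score(sentence, label, confidence, bank=HYPER_NEGATIVE):
    """Signed score for one sentence, or None if confidence is too low."""
    if confidence <= CONFIDENCE_THRESHOLD:
        return None  # weak prediction, dropped before aggregation
    # Assumed mapping: confidence becomes the magnitude of the score.
    signed = {"positive": confidence, "neutral": 0.0, "negative": -confidence}[label]
    lowered = sentence.lower()
    hits = sum(1 for word in bank if word in lowered)
    penalty = min(PENALTY_PER_WORD * hits, MAX_PENALTY)
    return signed - penalty


def player_score(article_scores):
    """Average article-level scores; None if the player has too few articles."""
    if len(article_scores) < MIN_ARTICLES:
        return None
    return mean(article_scores)
```

Under these assumptions, a sentence labeled positive at .9 confidence that contains “arrested” would score .7 rather than .9, and a player with only 9 scored articles would be excluded.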
Future work
This process is not perfect. While the insights gained from the scale of our data collection and analysis are interesting, they are not completely comprehensive. Our data pool included 92 initial players who have played 2 seasons in the NFL post-2010. There are thousands of players who meet that criterion, and we simply did not have the time to scrape every player’s media presence. On media presence, there are undoubtedly hundreds if not thousands of articles for each of these players that we don’t have, simply because we didn’t scrape the websites they are located on. One of our primary scrape locations was ESPN, and we also scraped every organization’s webpage for the articles they release about the players on their team. This means we scraped biased data, as it seems unlikely that the organizations themselves would release articles that paint their players, whom they are paying millions of dollars, in a hyper negative light. When it comes to the analysis of this data, the sentiment score we created could certainly use work. While our BART model produced the best results of the models we tried, training our own model using HuggingFace on sports talk specifically would likely have improved the quality of the classifications.
Ultimately, the results we have here offer a nice starting point for capturing sentiment about players and how their on field performance relates to who they are as a person. However, media reporting in and of itself is extremely biased. There is a large amount of money to be gained by multiple parties from their reporting on and representation of these players. The most comprehensive and complete method of doing this would be to conduct an ethnographic investigation of a team, interviewing players until reaching saturation with questions about their activities outside of their sport. While it wouldn’t produce the same scale of results, it would be more informative as to how a player’s off the field activities impact their performance in a sport. If you were to combine that with a more comprehensive version of this project, one that fixes the holes described here, you would ultimately have more informative outputs.