Graphs
Harrison
For this part of the project, my task was to combine the multiple data sources we scraped and visualize how NFL players are portrayed in the media. ESPN was just one of the sources we used to gather sentiment; we also included additional datasets that fed into our final scores. This made the process more about bringing different types of data together than relying on a single source. A big part of the project was combining qualitative data, like media narratives and article text, with quantitative data, like article counts and sentiment scores.
To build the visual side of the project, I used Python along with pandas, seaborn, and matplotlib. Pandas helped me load multiple CSV files, organize the data, and merge datasets together. Seaborn and matplotlib were then used to create graphs that made the patterns easier to see and gave them a clean look.
The first graph focuses on article coverage using the main dataset. After loading the data, I sorted players by article count from highest to lowest. This graph shows which players receive the most media attention. Because there were too many players to label, I removed the player names from the y-axis to reduce clutter and keep the graph readable. The blue gradient helps show the rankings, with the darker colors at the top representing players with higher article counts.
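The ranking chart described above could be sketched roughly as follows. This is only an illustration: the player names, the `article_count` column, and the numbers are hypothetical placeholders, and it uses a plain Matplotlib colormap for the blue gradient rather than whatever palette the original graph used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the main dataset (column names are assumptions).
coverage = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C", "Player D"],
    "article_count": [12, 40, 7, 25],
})

# Sort players from most to least covered.
ranked = coverage.sort_values("article_count", ascending=False).reset_index(drop=True)

# One shade per bar: darkest blue for the top-ranked player, lighter further down.
shades = plt.cm.Blues(np.linspace(0.9, 0.3, len(ranked)))

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(ranked.index, ranked["article_count"], color=shades)
ax.invert_yaxis()   # highest article count at the top
ax.set_yticks([])   # hide player names to reduce clutter
ax.set_xlabel("Article count")
ax.set_title("Article coverage by player")
```

Hiding the tick labels keeps the ranking shape visible even when there are far too many players to label individually.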
The second graph combines multiple datasets to analyze sentiment across positions. I merged the sentiment dataset with the player dataset that included position information by matching player names. This allowed me to group players by position and calculate the average sentiment score for each group. We can see that Offensive Linemen have the highest average sentiment scores, meaning they are generally talked about more positively than Wide Receivers or Running Backs.
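A minimal sketch of the merge-and-group step, assuming both datasets share a `player` name column. The column names and all values here are made up for illustration and are not the project's data.

```python
import pandas as pd

# Hypothetical sentiment and position tables (names and values are placeholders).
sentiment = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C", "Player D"],
    "sentiment_score": [0.10, 0.30, -0.05, 0.20],
})
positions = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C", "Player D"],
    "position": ["WR", "OL", "WR", "OL"],
})

# Match rows on player name; an inner join keeps only players found in both tables.
merged = sentiment.merge(positions, on="player", how="inner")

# Average sentiment per position, highest first.
avg_by_position = (
    merged.groupby("position")["sentiment_score"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_position)
```

Because the join key is a free-text name, any spelling differences between the two sources would silently drop players, so it is worth checking the row count before and after the merge.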
This entire process was about integrating multiple datasets and transforming qualitative information into quantitative insights. Using sources like ESPN helped with article data, and other datasets added context about sentiment and player positions. By combining these sources and visualizing all of the results, I was able to see how players are covered and how much they are talked about.
Jaclyn
To create the visualizations for the analysis part of the project, I used SQLite3 and Matplotlib as well as pandas and NumPy. Pandas allowed me to load and read the Quant Database, and I used SQLite3 to join all the CSV files into one table. To make the data easier to use for the visualizations, I started by connecting to the database and pulling out the player names and the stat I wanted to use for them. I cleaned up the names to make sure they were uniform. I then grouped everything by player and calculated their average stats, giving me a single performance score for each person. Next, I merged these averages with my sentiment data, joining on player names. Finally, I removed players that were missing information.
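The pipeline above (query, clean names, average per player, merge, drop incomplete rows) might look something like this sketch. It builds a small in-memory SQLite table so it can run on its own; the table name `game_stats`, the column names, and the values are assumptions, not the actual Quant Database schema.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database as a stand-in (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_stats (player TEXT, passer_rating REAL)")
conn.executemany(
    "INSERT INTO game_stats VALUES (?, ?)",
    [("  player a", 90.0), ("Player A ", 100.0), ("PLAYER B", 80.0), ("Player C", 70.0)],
)
conn.commit()

# Pull out the player names and the chosen statistic.
stats = pd.read_sql_query("SELECT player, passer_rating FROM game_stats", conn)
conn.close()

# Make names uniform: trim whitespace and normalize capitalization.
stats["player"] = stats["player"].str.strip().str.title()

# One average value per player.
avg_stats = stats.groupby("player", as_index=False)["passer_rating"].mean()

# Hypothetical sentiment table; Player B has a missing score.
sentiment = pd.DataFrame({
    "player": ["Player A", "Player B", "Player D"],
    "sentiment_score": [0.2, None, 0.1],
})

# Join on player name, then drop anyone missing either value.
combined = avg_stats.merge(sentiment, on="player", how="left").dropna()
print(combined)
```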
For the four correlation visualizations, I determined which statistic was most relevant to each position. From there I set the x-axis as the sentiment score and the y-axis as the chosen statistic. For quarterbacks I used passer rating, for the defensive line average sacks, for linebackers combined tackles, and for defensive backs average pass deflections. I created a scatterplot to visualize the data points and used a trendline to show the overall direction of the data. This allows viewers to see at a glance whether a player's media presence correlates with their on-field performance.
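One common way to draw a scatterplot with a trendline is a degree-1 least-squares fit, sketched below. The fit method is an assumption (the write-up does not say how the trendline was computed), and the five data points are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-player values: sentiment on x, chosen statistic on y.
sentiment = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
stat = np.array([80.0, 85.0, 83.0, 92.0, 95.0])

# Least-squares straight line through the points.
slope, intercept = np.polyfit(sentiment, stat, 1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(sentiment, stat)[0, 1]

fig, ax = plt.subplots()
ax.scatter(sentiment, stat)
xs = np.linspace(sentiment.min(), sentiment.max(), 50)
ax.plot(xs, slope * xs + intercept, color="red", label=f"trend (r = {r:.2f})")
ax.set_xlabel("Average sentiment score")
ax.set_ylabel("Chosen statistic")
ax.legend()
```

The slope shows the direction of the relationship, while `r` indicates how tightly the points follow the line, which is the number quoted as a correlation in the findings.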
Next, I created a box-and-whisker plot to visualize the distribution of how many sentences each player had in their sentiment analysis. I made the boxplot horizontal so the data would be easier to read.
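A horizontal box plot of this kind can be sketched as below. The sentence counts are invented, and the outlier rule shown is Matplotlib's default (points beyond 1.5 times the interquartile range), which may or may not match the settings used for the original graph.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sentence counts per player, with one heavily covered player.
sentences = np.array([40, 55, 60, 62, 70, 75, 900])

fig, ax = plt.subplots(figsize=(6, 2.5))
# vert=False draws the box horizontally; newer Matplotlib releases also
# accept orientation="horizontal" for the same effect.
bp = ax.boxplot(sentences, vert=False)
ax.set_xlabel("Total selected sentences")
ax.set_yticks([])

# The same outlier rule computed by hand, for reference.
q1, q3 = np.percentile(sentences, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = sentences[sentences > upper_fence]
median = np.median(sentences)
```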
I created another scatterplot outside of the position specific graphs. The goal of this graph was to show a possible correlation between sentiment score and number of sentences that went through the sentiment analysis for that specific player. The x-axis is the average sentiment score and the y-axis the total sentences per player. I added a red trend line to easily visualize the correlation.
The last visualization I created was a histogram. This histogram displays the spread of the sentiment scores across all the players. I used 20 bins to group the scores together so we can see which ranges are most common. I added a red dashed mean line to show where the average of sentiment scores fell to make it easy for viewers to see how many players are above and below the average.
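A histogram with 20 bins and a dashed mean line could be produced along these lines. The scores are randomly generated placeholders, not the project's sentiment data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic sentiment scores for illustration only.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.0, scale=0.1, size=200)

fig, ax = plt.subplots()
# Group the scores into 20 equal-width bins.
counts, edges, _ = ax.hist(scores, bins=20, edgecolor="black")

# Red dashed vertical line at the mean score.
mean_score = scores.mean()
ax.axvline(mean_score, color="red", linestyle="--", label=f"mean = {mean_score:.2f}")
ax.set_xlabel("Sentiment score")
ax.set_ylabel("Number of players")
ax.legend()

# How many players fall on each side of the mean.
above = int((scores > mean_score).sum())
below = int((scores < mean_score).sum())
```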
Findings
Quarterback: The scatter plot shows a positive correlation of about 45% between quarterbacks' media sentiment and their on-field performance. This suggests that more positive media coverage of a quarterback is generally associated with a higher passer rating. The spread of dots (quarterbacks) indicates that while passer rating and sentiment score are related, there are still several outliers: players who perform well despite a lower sentiment score and, conversely, players with a higher sentiment score but lower passer ratings.
Defensive lineman: The scatter plot shows almost no correlation between defensive linemen's sentiment scores and their average sacks. The positive correlation between the two variables is very weak, at about 7%. This suggests that media coverage of defensive linemen does not depend on how many sacks they have.
Linebacker: The scatterplot depicts a positive relationship between media perception and combined tackles. We can see a strong positive correlation of around 78%, implying that heavy tacklers tend to be depicted favorably in the media.
Defensive Back: The scatterplot shows a negative correlation between a defensive back's media sentiment and their on-field performance in pass deflections. A player with fewer pass deflections but a higher sentiment score could still be a key player if opposing quarterbacks avoid throwing in their direction, which could explain the positive media coverage.
Spread of Total Selected Sentences: Most players fall within a fairly consistent range, but there are extreme outliers. The median sits at around 1,068 sentences, while the highest outlier has over 10,000 sentences for a single player. This tells us that some players in our data have significantly more media coverage than others. This could be an issue for our analysis because it could skew our raw sentiment scores, but by normalizing the data we were able to get an “adjusted sentiment” that avoided this issue.
Sentiment score vs. total sentences selected: This scatterplot depicts all players together to check for volume bias by comparing sentiment scores to total sentences per player. The trendline is fairly flat, with a correlation of about 28%, suggesting that the sentiment score is largely independent of volume. This means that a player with 50 sentences does not automatically get a higher or lower score than a player with 1,000 sentences written about them. It also supports the idea that the sentiment score reflects the quality of the text written about a player rather than the quantity.
Sentiment score across players: This histogram uses 20 bins to depict the frequency of sentiment scores across our selected players. The data roughly follows a normal distribution, centered around the mean of 0.06. This shows that while there are extreme lows and extreme highs, most of our players sit near a “normal” baseline. It also shows that very few players have exceedingly low sentiment scores.
Anthony:
For the visualization part of the project, I used SQLite3, pandas, NumPy, and Matplotlib. I connected to our qualitative data, which was gathered through web scraping, via MongoDB, and I used SQLite3 to pull the needed information out of the quantitative database. I then used Matplotlib to make the different visualizations, which included scatter plots and a bar chart. The more detailed process and findings for each graph are listed below:
Qualitative vs. Quantitative Word Counts:
The goal here was to count the qualitative and quantitative words across our entire database of web-scraped articles. To begin, I built a word bank containing both qualitative and quantitative words. Then, by connecting to MongoDB, I searched through each article and pulled a word count for each category. Finally, using Matplotlib, I visualized these two numbers in a bar chart. The findings showed over 3,500 occurrences of quantitative words and only around 500 of qualitative words. This wasn’t really a surprise to me, as I figured statistical performance was the more consistently covered topic for athletes, but it did surprise me just how big the gap was. These findings put our project into context by showing how much of each data type is present within our articles.
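The word-bank count could be sketched as follows. To keep the example self-contained, a plain list of dictionaries stands in for the documents that would come from the MongoDB collection; the word banks, the `text` field name, and the sample articles are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical word banks (the real lists are not given in the write-up).
QUALITATIVE_WORDS = {"leader", "clutch", "tough"}
QUANTITATIVE_WORDS = {"yards", "touchdowns", "sacks"}

# Stand-in for documents fetched from the article collection.
articles = [
    {"text": "He threw for 300 yards and three touchdowns."},
    {"text": "A tough leader who piled up yards late."},
]

def count_word_types(docs):
    """Count word-bank hits across all documents, by category."""
    totals = Counter()
    for doc in docs:
        # Lowercase and split into alphabetic tokens.
        tokens = re.findall(r"[a-z']+", doc["text"].lower())
        for token in tokens:
            if token in QUALITATIVE_WORDS:
                totals["qualitative"] += 1
            elif token in QUANTITATIVE_WORDS:
                totals["quantitative"] += 1
    return totals

totals = count_word_types(articles)
print(totals)
```

The two totals are all that a bar chart of qualitative versus quantitative counts needs as input.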
The remaining visualizations were all scatter plots plotting the players in a specific positional group. More specifically these players are plotted by their average sentiment score and statistics relevant to that position group. All of these visualizations were produced the same way which consisted of simply plotting these figures and creating a trend line.
Running Backs:
For running backs, selecting the statistic was fairly straightforward as rushing yards are the standard metric used to estimate a player's success.
The findings here were somewhat of a mixed bag. The correlation between rushing yards and sentiment score was 0.25. This is enough to suggest that there may be a relationship, but not strong enough to conclude that decisively. With some players being outliers, it is hard to tell definitively whether there is a correlation here, but the result is worth consideration.
Tight Ends:
For tight ends, selecting the statistic was a little trickier. Tight ends are used in both the passing and blocking game, so certain tight ends are better equipped for run-blocking while others are more of a receiving threat. In the end, I decided to use average receiving yards. I chose this because, even though tight end is a very flexible position, fan and media perspectives often come from the flashy statistics, which in this case is receiving yards.
The findings here were very promising. Unlike the running backs, there was a 0.84 correlation between average receiving yards and average sentiment score. This signifies that indeed the better the media talks about a tight end, the more likely they are to be successful on the field. It is important to keep in mind the limited sample size of our data, but a strong correlation like this is worth looking further into if research on this topic were to continue.
Offensive Linemen:
This is where selecting the statistic became extremely tricky. Unlike positions that have meaningful stats recorded every game, offensive linemen really don’t. Their impact on the field is hard to measure with a single statistic. I worked through it logically and narrowed it down to two options: average penalties and average snap percentage. The rationale behind penalties is that when an offensive lineman takes more penalties, it suggests that they are consistently getting beaten by the player they are attempting to block. However, penalties are not a perfect indicator, as a lineman's actual success when not committing a penalty is omitted. Snap percentage looks at another aspect of offensive linemen that is viewed as important: availability. If an offensive lineman is consistently playing snaps, it means that they are a) rarely injured and b) being started often. The problem here, however, is that most offensive linemen within our dataset are likely to play often, since they are represented in the media. I discussed it with the team and landed on making a chart for each statistic.
Looking first at snap percentage, the correlation sits at 0.08. A value this close to zero indicates that there is effectively no correlation between snap percentage and media representation. This does not surprise me, as most offensive linemen who have a significant statistical presence in the NFL will have a high snap percentage. This means that any type of player, whether covered positively or negatively, will have around the same snap percentage. Penalties are a little more interesting, but not by much. The correlation between penalties taken and sentiment score is -0.2. This is still a weak correlation and may not mean much, but it suggests that penalties would be a better statistic to use than snap percentage if this research were to move forward.
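Comparing two candidate statistics by their correlation with sentiment can be sketched with pandas as below. The five rows are invented to show the mechanics only and are not meant to reproduce the 0.08 and -0.2 values above; the column names are assumptions.

```python
import pandas as pd

# Hypothetical offensive-line table: one row per player.
linemen = pd.DataFrame({
    "sentiment_score": [0.1, 0.2, 0.3, 0.4, 0.5],
    "avg_penalties": [5, 4, 4, 2, 1],
    "snap_pct": [95, 97, 94, 96, 95],
})

# Pearson correlation of each candidate statistic with sentiment.
r_penalties = linemen["sentiment_score"].corr(linemen["avg_penalties"])
r_snaps = linemen["sentiment_score"].corr(linemen["snap_pct"])

# The statistic with the larger absolute correlation tracks sentiment more closely.
stronger = "avg_penalties" if abs(r_penalties) > abs(r_snaps) else "snap_pct"
print(f"penalties r = {r_penalties:.2f}, snap % r = {r_snaps:.2f}, stronger: {stronger}")
```

Comparing absolute values matters here because a negative correlation, such as more penalties going with lower sentiment, can still be the more informative of the two.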
Aaron:
For the visualizations I created, I predominantly used player statistics in the form of CSVs. While most of the data was gathered and stored in SQLite3, some of the player data was not part of that collected data, so CSVs were a more convenient source than pulling from SQLite3. To work with them, I used several libraries: pandas, Matplotlib, NumPy, and SciPy. Pandas was used to merge most of the data, NumPy and SciPy handled much of the statistical calculation, and Matplotlib brought it all together to create the visuals.
While I contributed to some of the other graphs, the two main graphs I created were media sentiment vs. player Approximate Value (AV) and media sentiment vs. WR success rate (Succ%), each comparing sentiment to a player statistic in order to draw conclusions.

