Assignment 2 – Data Exploration through R

After assembling the data and loading it into R, I used ggplot2 to answer the following questions:

1. Who had the highest number of home runs (HR)? Jesse Barfield.

At first I made a scatter plot with the individual names on the y-axis, which would work well if there were few enough names. I then made a more aesthetically pleasing version (shown second) with the name of each hitter drawn in place of the point. Ideally I would add a label to the point on the graph to the left, but with my beginner skills I found that using the two visualizations together works best in this case (because there was a clear winner).
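A minimal sketch of how a single label could be added to the winning point. The data frame, its column names (`name`, `HR`, `id`) and all the values are invented for illustration; they are not the assignment's data. The plotting part assumes the ggplot2 package is installed and is skipped otherwise.

```r
# Toy data: names and numbers are made up for illustration only
hitters <- data.frame(
  name = c("Player A", "Player B", "Player C", "Player D"),
  HR   = c(12, 40, 7, 23),
  stringsAsFactors = FALSE
)
hitters$id <- seq_len(nrow(hitters))  # simple x position for each player

# Row of the player with the most home runs
top <- hitters[which.max(hitters$HR), ]
print(top$name)

# Plot every player as a point, but label only the leader
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(hitters, aes(x = id, y = HR)) +
    geom_point() +
    geom_text(data = top, aes(label = name), vjust = -1)
  print(p)
}
```

Passing only the `top` row to `geom_text()` keeps the scatter plot readable, because the other points stay unlabelled.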


2. Who had the maximum number of hits in 1986? Don Mattingly.

I used the same (not ideal) technique that I used to answer the first question. Again, I understand that I was only able to do this because there was a clear winner.


3. Name the second most expensive team in the league.

The best way to answer this question would be to sum all the salaries for each team, but I couldn’t find an effective way to do this with my skills. I used the team data to plot the average salaries for each team (shown in the first two visualizations). I then used the hitter data to plot salary versus team, which gave the third visualization, and created a box and whisker plot per team. I can’t say definitively which team spends the second most in total (on all its players). However, I can answer many other questions. For example, Chicago has the second highest average salary (shown by the team visualizations shown first). What was weird to me is that this did not coincide with the individual box plot visualizations created with the hitter data. From that data I can conclude that Baltimore spends the most on an individual player and Boston the second most; New York’s middle 50% (interquartile range) extends the highest, but doesn’t start the highest; and the average bar shows that Boston has the highest average salary and Toronto the second. I attempted to make a stacked bar graph with the sums per team, but I feel I would be more successful if I computed the sums in the data first and then visualized them, instead of the other way around.
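Summing the salaries per team before plotting could look roughly like the sketch below. It uses base R's `aggregate()`; the data frame, the column names (`team`, `salary`) and the values are hypothetical stand-ins, not the real hitter data.

```r
# Toy data: team names and salaries are invented for illustration only
salaries <- data.frame(
  team   = c("Team X", "Team X", "Team Y", "Team Y", "Team Z"),
  salary = c(500, 300, 900, 50, 700),
  stringsAsFactors = FALSE
)

# Total salary per team
totals <- aggregate(salary ~ team, data = salaries, FUN = sum)

# Sort from most to least expensive and pick the second row
totals <- totals[order(totals$salary, decreasing = TRUE), ]
second <- as.character(totals$team[2])
print(totals)
print(second)
```

The `totals` data frame has one row per team, so it can be handed directly to a bar chart instead of asking the plot to do the summing.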




Specific Goal: Are players paid according to their performance?

To answer this question I created many different visualizations and will describe them one by one. I manipulated the data by creating new variables that represent the percentage of hits, runs, home runs, etc., each computed by dividing that count (for example, the number of hits) by the number of at bats. Another variable I created was the career percentage, where I divided the current year’s hits, runs, etc. by the player’s yearly average of hits, runs, etc. (their career hits, runs, etc. divided by the number of years they have played in the majors). This variable will be about 1 if they are getting about the same amount as they have averaged over their career, >1 if they are doing better in the current year, and <1 if they are doing worse in the current year.
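As a sketch, the two derived variables for hits can be computed like this. The column names (`AtBat`, `Hits`, `CHits`, `Years`) and the numbers are assumptions made up for the example; the same pattern would apply to runs, home runs, and so on.

```r
# Toy data: all values are invented for illustration only
hitters <- data.frame(
  name  = c("Player A", "Player B"),
  AtBat = c(400, 500),   # at bats this year
  Hits  = c(100, 150),   # hits this year
  CHits = c(600, 300),   # career hits
  Years = c(5, 3),       # years in the majors
  stringsAsFactors = FALSE
)

# Hits percentage: hits divided by at bats
hitters$hit_pct <- hitters$Hits / hitters$AtBat

# Career percentage: this year's hits relative to the yearly career average
hitters$career_avg_hits <- hitters$CHits / hitters$Years
hitters$career_ratio    <- hitters$Hits / hitters$career_avg_hits

print(hitters[, c("name", "hit_pct", "career_ratio")])
```

Here Player A is below 1 (100 hits against a career average of 120 per year) and Player B is above 1 (150 against 100).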


^ First of all, I plotted a histogram of the salaries to see their distribution, which is as expected, with most of the salaries at the lower end of the range.


^ I plotted a box and whisker plot of salaries per position to see if any particular position was clearly paid more than the others. However, it looks as though players of different positions are on average paid the same. (If I were continuing the research on this data, I would investigate the positions of players making over, say, 1,000 to see if there is a pattern among the players who make the most.)
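The follow-up idea of looking only at the highest earners could start from something like the sketch below. The data, the column names (`position`, `salary`) and the 1,000 cutoff applied to them are hypothetical.

```r
# Toy data: positions and salaries are invented for illustration only
hitters <- data.frame(
  position = c("1B", "1B", "OF", "OF", "C", "SS"),
  salary   = c(1200, 300, 1500, 1100, 400, 90),
  stringsAsFactors = FALSE
)

# Keep only the players making more than 1,000
high <- subset(hitters, salary > 1000)

# Count how many high earners play each position
counts <- table(high$position)
print(counts)
```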


^ I plotted salary against errors made, using the number of years they’ve played in the majors for the color. The color shows that those with higher salaries have played in the majors for a little while. The trend line shows that errors didn’t seem to affect their salary.


^ Plotting salary against the number of assists surprisingly shows that assists don’t seem to affect their salary either.
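One way to put a number on "doesn’t seem to affect" is to fit a straight line and look at its slope and the correlation. This is only a sketch on invented data, and it assumes a linear fit; the trend line in the plots may have been drawn with a different smoother.

```r
# Toy data: values are invented for illustration only
toy <- data.frame(
  errors = c(2, 5, 8, 11, 14, 17),
  salary = c(500, 520, 480, 510, 490, 500)
)

# Straight-line fit of salary on errors
fit   <- lm(salary ~ errors, data = toy)
slope <- coef(fit)[["errors"]]

# Correlation between the two variables (between -1 and 1)
r <- cor(toy$errors, toy$salary)

print(slope)
print(r)
```

A slope close to zero and a correlation near zero would match a nearly flat trend line, while a clearly positive slope would match the rising lines in the percentage plots.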


^ Plotting run percentage (runs/at bats) against salary. The trend line shows that as the run percentage increases, so does the salary. This would support the idea that players get paid according to their performance. The outliers on top all seem to be a darker blue and thus are newer players, so their pay makes sense. If they are lighter, then they are either not paid according to their hitting/run skills or they are paid disproportionately to their skills.


^ Plotting hits percentage (hits/at bats) against salary. The trend line also shows that as the hits percentage increases, so does the salary. This would support the idea that players get paid according to their performance. The outliers on top all seem to be a darker blue and thus are newer players, so their pay makes sense. If they are lighter, then they are either not paid according to their hitting skills or they are paid disproportionately to their skills.


^ Plotting career hits against salary. The trend line shows that as career hits increase, so does the salary. This would support the idea that players get paid according to their performance. Another thing this visualization shows is that players who’ve been playing longer with a very small number of career hits do not seem to get paid much, with some exceptions. One outlier is the point on the bottom right, with a high number of career hits and a low salary; this player could have gotten most of those hits earlier in their career and is not playing as well now, so they don’t get paid as much. The other outliers, on the top left, are players with high salaries and a low number of career hits. These players are probably not paid for their hitting (probably pitchers), or it’s their first year so their career hits equal their current number of hits (all darker blue), or they are not paid accordingly.


^ Plotting career home runs against salary. The trend line shows an increase in salary as career home runs increase. This would support the idea that players get paid according to their performance. The players with more career home runs who fall under the trend line are all players who haven’t played in the majors for very long, which makes sense: new players who are hitting well in their first few seasons don’t have high-salary contracts yet. This would support the idea that players get paid according to their performance, when you consider that they are getting paid based on past seasons’ performance. The outliers are the same as in the previous visualization, but with home runs instead of hits.

Overall, the data still calls for more investigation; however, at this point I will conclude that players get paid according to their performance. Players who are shown to have higher performance but lower salaries are assumed not to perform as well in a different category. Examples: a newer player whose stats are unexpectedly better than when they signed their contract, or a player who has great stats but just had an injury and is back from the DL. Players who are shown to have lower performance but higher salaries are assumed to perform better in a different category. Example: an amazing pitcher who is not the best hitter will have very low hitting stats but will still be making a lot of money.

Try R

Screen Shot 2014-09-28 at 6.36.30 PM

Today I tried R and completed Code School’s 7 chapters of exercises. Here are some screenshots of the chapters below!

Screen Shot 2014-09-28 at 6.36.43 PM
Screen Shot 2014-09-28 at 6.36.54 PM
Screen Shot 2014-09-28 at 6.37.04 PM
Screen Shot 2014-09-28 at 6.37.16 PM
Screen Shot 2014-09-28 at 6.37.25 PM
Screen Shot 2014-09-28 at 6.37.34 PM
Screen Shot 2014-09-28 at 6.37.42 PM

Assignment 1: READ ME

This assignment was a really good way to familiarize myself with Tableau. The first part really helped me look at the data with specific answers to find. I got to experiment with which visualization best shows what I am trying to show with the data. The second part, where we picked our own data, helped me look at data with a blank slate and try to figure out what types of information I am seeking from it. The experimenting was more about which variables to compare with each other and the different ways to look at them. I now feel much more comfortable with the idea of taking data from a csv to an Excel file and then to Tableau. Overall, I now feel pretty familiar with Tableau, which I had never used before this assignment!

Assignment 1: Baseball Data

For this part of the assignment we were allowed to pick another dataset to explore. I chose baseball data from CMU’s Statlib Datasets Archive, uploaded the teams data sheet, and investigated it further. There were many variables, including: Div ID (division id), Home Ballpark, League ID, Team ID, Team’s Name, WC Win (wild card winner), WS Win (world series winner), Year ID, 2B (number of doubles hit), 3B (number of triples hit), Attendance, BB (walks), Caught Stealing, H (hits), HR (home runs allowed), R (runs), RA (runs allowed), Rank, Stolen Bases, W (wins). For some of the variables I had to figure out what they were by reading the dataset’s description, and I renamed them to make them easier to work with. Here are some of the visualizations and findings from exploring the data:

Screen Shot 2014-09-12 at 4.55.42 PM

The packed bubble visualization above shows the total attendance of each team. The bigger the bubble, the more attendance that team receives. This visualization is not necessarily for seeing which team has the most, but it does a good job of visually showing which, say, 15 teams get the most. This could be valuable for someone who’s looking to put up their advertising at multiple ballparks but only has the funds to do so at between 1 and 20 of them.

Screen Shot 2014-09-12 at 4.59.07 PM

Thinking about attendance, I wondered why the teams with the most attendance have the most attendance. My first thought was that they hit the most home runs, so I looked at it. The scatterplot above shows the attendance against the number of home runs per team. It seems to show that as the number of home runs increases, the attendance of that team also increases.

Screen Shot 2014-09-12 at 5.01.44 PM

The above scatter plot shows the number of times teams have players caught stealing against the number of stolen bases. It seems to show that as the number of times a team gets caught stealing increases, so does the team’s number of successful steals. It seems that with the risk of stealing comes the reward, which is what one would assume!

Screen Shot 2014-09-12 at 5.05.22 PM

Thinking about other things that seem like they would be true, I decided to plot the number of hits allowed against the number of hits. The scatterplot above shows that as teams allow more hits, they also hit more. This makes sense: there are high-hitting games and low-hitting games, and most of the time games are either high scoring or low scoring depending on the matchup of the teams.

Screen Shot 2014-09-12 at 5.14.31 PM

The scatterplot shown above plots the average number of errors against the average runs allowed by teams. It shows that if a team has a large number of errors, it has also allowed a large number of runs. It also shows that allowing a lot of runs doesn’t necessarily mean a team has made a large number of errors (those would be earned runs made by the other team).

Screen Shot 2014-09-12 at 5.20.37 PM

The above visualization shows the average doubles, triples, and home runs by league. It shows that the AL gets the most doubles and home runs, and the PL gets the most triples. It is also clear that doubles are more common than triples or home runs regardless of the league.

Assignment 1: Cars Data

Visualization-directed inquiry: 

Data used was from the cars data sheet from CMU’s Statlib Datasets Archive.

1. Which car has the highest mpg?

Screen Shot 2014-09-12 at 3.48.28 PM

Using the box & whisker plot visualization technique shown above, it became clear that the toyota corolla has the highest mpg.

2. Is there a correlation between a car’s mpg and its weight?

Screen Shot 2014-09-12 at 3.54.42 PM

Using a scatter plot to highlight the correlation between mpg and weight, it is easy to see that as the weight of the car increases, the mpg decreases. In addition to highlighting this correlation, I used the year the car was made to color the dots: the older the car, the lighter the dot. Since the lower part of the plot is lighter than the top, it seems that older cars have lower mpg, which I thought was cool.

3. Which car has six cylinders and still has an mpg that is above 35?

Screen Shot 2014-09-12 at 4.00.29 PM

Using the scatter plot technique and highlighting the cars with 35 mpg or greater in blue, it became clear that the only car with six cylinders and above 35 mpg is the oldsmobile cutlass ciera (diesel).
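The same question could also be answered with a filter in R instead of a highlighted plot. This is a sketch on invented data; the data frame name `car_data`, its columns and its values are hypothetical and are not taken from the cars data sheet.

```r
# Toy data: car names and values are invented for illustration only
car_data <- data.frame(
  name      = c("car a", "car b", "car c", "car d"),
  cylinders = c(4, 6, 6, 8),
  mpg       = c(38, 36, 20, 14),
  stringsAsFactors = FALSE
)

# Cars with exactly six cylinders and an mpg above 35
efficient_six <- subset(car_data, cylinders == 6 & mpg > 35)
print(efficient_six$name)
```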

4. Is there a relationship between a car’s displacement and the number of cylinders it has? If so, what is it?

Screen Shot 2014-09-12 at 4.04.55 PM Screen Shot 2014-09-12 at 4.06.28 PM

For this question I used two different visualizations. Though I thought the first table was more effective, the second gets the point across without numbers, which I thought was cool. Both show that as the number of cylinders increases, so does the displacement. The table shows the displacement per cylinder number, with the green darkening as the displacement increases. The second is a packed bubbles visualization where each bubble represents a different number of cylinders, and the size and color of the bubble change with the displacement: as the size decreases and the color lightens, the displacement decreases.

5. Is there a relationship between a car’s horsepower and its weight? If so, what is it? Name the car that is an outlier with low weight and high horsepower.

Screen Shot 2014-09-12 at 4.19.42 PM

A scatter plot shows that there is a relationship between horsepower and weight: as horsepower increases, so does the weight of a car. To show an outlier with low weight and high horsepower, I used color to represent the weight, with a sliding scale from red for low weight to blue for high weight. As you can see in the scatter plot, the only car with high horsepower that is red (low weight) is the buick estate wagon (sw).

6. Name any other interesting correlations that you find through interacting with the data.

Screen Shot 2014-09-12 at 4.25.16 PM

As shown in the graph above, the average horsepower of cars is decreasing as the cars get newer.

Screen Shot 2014-09-12 at 4.27.14 PM

As shown in the table above, the average mpg is highest among cars with 4 cylinders. If someone shopping for a fuel-efficient car were wondering how many cylinders to look for, they would want a car with 4 or 5 cylinders.
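A table like this one can also be built in R with a grouped mean. The sketch below uses base R's `tapply()` on invented numbers; `car_data` and its values are hypothetical.

```r
# Toy data: values are invented for illustration only
car_data <- data.frame(
  cylinders = c(4, 4, 5, 6, 6, 8),
  mpg       = c(30, 34, 27, 20, 22, 15)
)

# Average mpg for each cylinder count
avg_mpg <- tapply(car_data$mpg, car_data$cylinders, mean)
print(avg_mpg)

# Cylinder count with the highest average mpg
best <- names(avg_mpg)[which.max(avg_mpg)]
print(best)
```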

Screen Shot 2014-09-12 at 4.29.25 PM

As shown in the histogram above, the average weight seems to increase as the displacement increases; the only exception is the last displacement bin. A consumer looking for a low-weight car could use this fact to narrow their search to cars with low displacement.

Screen Shot 2014-09-12 at 4.32.59 PM Screen Shot 2014-09-12 at 4.36.28 PM

Based on my findings from the scatter plot for question 2, I further investigated the relationship between the year the car was made and mpg. Plotting average mpg against the year seemed to validate my idea. I then decided to split the graph up by number of cylinders, which the graph above shows, with the colors listed to the right of the graph. This shows that the only cars that seem to be getting better mpg as they become newer are cars with 4 cylinders, which I thought was pretty interesting.