Challenge 19 Megathread

For any and all questions relating to challenge 19. :point_down:

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

If you would like to download the data set used in this challenge, click here.

To continue to play around with the datasets in a Jupyter environment, click here.

Anybody know the common practice to deal with those outliers here?

1 Like

Video Solution: https://youtu.be/O5hciZhS6mU

2 Likes

I really have to learn to ignore the misleading instructions.

Games with ratings above 9.0 are so infrequent that they make little difference, and they give misleading information about average time if plotted. Perhaps a clearer suggestion would be to focus on the games in the top ranges of ratings (quantiles)?

import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
df = pd.read_csv('boardgames.csv')
dft=df[df['avg_time']<6000] #cut off extreme outliers
print(dft.avg_time.mean(),dft.avg_time.max(), dft.avg_time.min(),dft.avg_rating.mean())
plt.figure()
plt.hist(dft['avg_time'], bins = 400) #Play around with the bin sizes when plotting your histogram
plt.show()
df9=dft[dft['avg_rating']>dft['avg_rating'].quantile(0.5)] #adjust quantile level to get a better sense of trends
print(df9.avg_time.mean(),df9.avg_time.max(), df9.avg_time.min(),df9.avg_rating.mean())
plt.figure()
plt.hist(df9['avg_time'], bins = 400) #Play around with the bin sizes when plotting your histogram
plt.show()
3 Likes

Common practice, not sure. But visually looking at the result, I noticed the majority of the data landed between 0-600, so I set the range to this.

plt.hist(df['avg_time'], range=(0, 600))
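
On the common-practice question: one widely used convention (not the one the challenge solution uses) is Tukey's rule, which treats anything more than 1.5 × IQR beyond the quartiles as an outlier. A minimal sketch on made-up numbers; the `toy` frame and its values are illustrative and are not taken from boardgames.csv:

```python
import pandas as pd

# Made-up play times in minutes; the last two values are extreme outliers.
toy = pd.DataFrame({"avg_time": [20, 30, 45, 60, 60, 90, 120, 180, 6000, 60000]})

# Quartiles and interquartile range of the column.
q1 = toy["avg_time"].quantile(0.25)
q3 = toy["avg_time"].quantile(0.75)
iqr = q3 - q1

# Tukey fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows that fall inside the fences.
trimmed = toy[toy["avg_time"].between(lower, upper)]
print(lower, upper, len(trimmed))
```

Unlike a hand-picked cutoff such as 600, the fences here come from the data itself, though whether that suits this challenge is a judgment call.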

3 Likes

To make sure of a like-for-like comparison, I filtered out games longer than 6000 from both datasets, as Dot’s narrative strongly ruled out considering the very longest of the games, though I have no idea how long M. Voisin’s parties may be.

Dot’s asking the wrong question, as focusing on the top 1/10th of 1% of games indicates a range of 30-150, so any really great game is likely to be shorter than most of the very good games.

df9=dft[dft['avg_rating']>dft['avg_rating'].quantile(0.999)] #adjust quantile level to get a better sense of trends
print(df9.avg_time.mean(),df9.avg_time.max(), df9.avg_time.min(),df9.avg_rating.mean())
plt.figure()
plt.hist(df9['avg_time'], bins = 400) #bins almost don't matter at all
plt.show()
1 Like

Hi~ In the video solution, I don’t quite understand, for question #2, why we need to include games that have an ‘avg_rating’ of 9.0. I thought only games over 9.0 need to be included. Can someone help me here?

2 Likes

Agreed. When I read the words “filter out”, I understand that to mean we need to exclude those from the dataset, therefore the filter should be <= 9.0. If they wanted us to only use games with ratings of 9 and above, they should have said “apply a filter to only include highly rated games, 9.0 or above” or something like that. Based on excluding them, there’s no difference. Including them does result in the longer playing times.
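
To make the three readings concrete, here is a small sketch on invented data (the `toy` frame and its four games are hypothetical) showing how each interpretation translates into a pandas filter:

```python
import pandas as pd

# Four invented games; one sits exactly on the 9.0 boundary.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "avg_rating": [7.5, 9.0, 9.2, 8.9],
    "avg_time": [60, 120, 150, 45],
})

# Reading 1: "filter out" means remove everything above 9.0.
without_high = toy[toy["avg_rating"] <= 9.0]

# Reading 2: keep only games strictly above 9.0.
strictly_above = toy[toy["avg_rating"] > 9.0]

# Reading 3: keep games rated 9.0 and above.
nine_and_up = toy[toy["avg_rating"] >= 9.0]

print(len(without_high), len(strictly_above), len(nine_and_up))
```

The boundary game "B" is what separates readings 2 and 3, which is why the choice of `>` versus `>=` can change the answer.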

6 Likes

Silly me, I tried treating the second question as another one where I should use the histogram, and since none of the highly rated games have similar average times, I concluded the histogram wasn’t skewed.

19 days in and I still expect the prompts to make sense. Silly me.

10 Likes

These video solutions are my own creation; they don’t provide me with answers, and the videos are not official by any means. It’s possible that I got something wrong, but I did arrive at the correct solution.

When I read the challenge text “For question 2, filter out games that have are above the avg_rating of 9.0.”, I interpret “that have” as having a rating of 9.0 and “above” as greater than 9.0.

3 Likes

Yeah, the wording’s a little tricky. What helped me out is the question itself.

Do games that have a great avg_rating have longer play times?

The adjective “great” makes me think they want to include games with a high average rating above 9.0 and not a low average rating below 9.0.

2 Likes

I got the first one right, but on the second I chose the wrong one, mostly because they applied a filter in the solution that does not make any sense to me.
I'm not sure I understand why, in question one, they are removing averages over 500.

avg_time_filter = df.avg_time <= 500
df_2 = df[avg_time_filter]

If we don't apply this filter, the 2nd question will have a different answer, which led me to an error.

3 Likes

this makes no sense:

avg_time_filter = df.avg_time <= 500

why 500?

why do mean() of an average?

Very disappointed.

1 Like

Well that’s great. Even with the poor wording (“filter out” meaning “only include”) and double answers for Q2 (No and No Difference mean the same thing given the question wording), when I clicked on the right answer it still said I got it wrong.

While I appreciate Lighthouse putting this on for free, for future versions I strongly suggest putting a lot more effort into editing and error checking these assignments. I can only assume this is meant to drum up business for the paid programs - and I’ll give the benefit of the doubt that the paid programs are better than this - but the large number of unclear instructions and errors along the way have to be dissuading potential students from signing up.

12 Likes

I understand the solution filters out high play times to make the data more meaningful. The outliers skew the data and make the average play time higher than the solution shows. However, the cutoff was chosen arbitrarily. If the question had said something like “Dot doesn’t want to spend more than x hours at the neighbours” and then asked us to compare that data set to the games with a rating over 9.0, it would have made a lot more sense.

I left the data set as is and that doesn’t give you the correct solution.

2 Likes

It might be useful to look at the median values of the two data sets to help answer #2.

all data avg_time median: 60
high rating avg_time median: 105

Note that looking at the means would get you the opposite answer. The mean of the full data set is higher than the mean of the high-rating data set because of the outliers.

all data avg_time mean: 116
high rating avg_time mean: 98
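
The same effect is easy to reproduce on invented numbers. In this sketch (both series are made up, not the challenge data), a single extreme play time flips the comparison depending on whether you use the mean or the median:

```python
import pandas as pd

# Invented play times in minutes; 6000 stands in for an extreme outlier.
all_games = pd.Series([30, 45, 60, 60, 90, 6000])
top_rated = pd.Series([90, 100, 105, 120])

# The mean is pulled up by the outlier, the median is not.
print("all games  mean:", all_games.mean(), "median:", all_games.median())
print("top rated  mean:", top_rated.mean(), "median:", top_rated.median())
```

On these numbers the full set has the higher mean but the lower median, which mirrors the pattern in the figures above.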

5 Likes

If you want to plot mean() and median() data on your histogram you can use this code:

plt.axvline(df.avg_time.mean(), color='k', linestyle='dashed', linewidth=1)
plt.axvline(df.avg_time.median(), color='r', linestyle='dashed', linewidth=1)

Thanks to https://stackoverflow.com/questions/16180946/drawing-average-line-in-histogram-matplotlib for this tip.

1 Like

Why is it that the avg_time column shows the same value as max_time? Unless we have a dataset of how many times the games have been played, shouldn’t the average time be the midpoint between the minimum and the maximum?
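
One way to check how widespread this is would be to compare the columns directly. A rough sketch on an invented frame; the `min_time` column name is an assumption on my part (only `avg_time` and `max_time` are mentioned above), and the values are made up:

```python
import pandas as pd

# Invented rows; min_time is an assumed column name.
toy = pd.DataFrame({
    "min_time": [30, 60, 45],
    "max_time": [60, 60, 90],
    "avg_time": [60, 60, 90],
})

# Rows where avg_time simply repeats max_time.
same_as_max = toy["avg_time"] == toy["max_time"]

# Rows where avg_time equals the midpoint of min_time and max_time.
midpoint = (toy["min_time"] + toy["max_time"]) / 2
same_as_midpoint = toy["avg_time"] == midpoint

# Share of rows matching each pattern.
print(same_as_max.mean(), same_as_midpoint.mean())
```

Running the same two comparisons on the real data frame would show whether avg_time is mostly a copy of max_time or something closer to a midpoint.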

ALSO! I put down the correct answer, and yet it said that I got it wrong, and when I clicked to show the answer, it gave me the SAME option I had picked. This is such a bummer!!!

3 Likes