Challenge 16 Megathread

For any and all questions relating to challenge 16. :point_down:

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

If you would like to download the data set used in this challenge, click here.

To continue to play around with the datasets in a Jupyter environment, click here.

I’m still offended that yesterday’s challenge was labeled “hard.”

6 Likes

I find the boxplot a little hard to read. A large share of the data appears to be treated as outliers, and those points are packed so closely together, especially in the 1000-2000 page range, that the plot doesn’t seem useful.

Am I missing something with the boxplot?
(Also, since we were encouraged to use a filter for Q2, which was the same subset as Q1, was there a reason for it, other than the practice of doing the boxplot?)

#Q1
plt.figure()
plt.boxplot(df["num_pages"], vert=False)
plt.show()

# Q2
Over4000 = df[df["num_pages"]>4000]
print(Over4000["average_rating"])
6 Likes

Video Solution: https://youtu.be/B-bonepuDMs

1 Like

When instantiating a matplotlib figure (plt.figure()), you can pass a tuple to the “figsize” parameter to set the figure’s width and height in inches (e.g. plt.figure(figsize=(x, y))). A bigger plot can help you see the outliers a bit more clearly for this challenge.
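A minimal sketch of the figsize tip, in case it helps. The page counts here are made up for illustration (they are not the challenge data), and the Agg backend line is only an assumption so the snippet runs outside a notebook; you can drop it in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical page counts: mostly ordinary books plus two very long ones
pages = [120, 180, 250, 300, 320, 350, 400, 416, 450, 500, 4100, 6576]

# figsize is (width, height) in inches; wide and short suits a horizontal box plot
fig = plt.figure(figsize=(12, 3))
plt.boxplot(pages, vert=False)
plt.xlabel("num_pages")

# Confirm the size that was actually applied
width, height = fig.get_size_inches()
print(width, height)
```

In a notebook you would finish with plt.show() instead of printing the size.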

6 Likes

This was a bit hard for me because I didn’t quite understand the boxplots, but I think the circles = 1 count each.

df.columns
#1
plt.figure()
plt.boxplot(df['num_pages'], vert=False)
plt.show()

#2
morethan4000 = df[df['num_pages']>4000]
1 Like

Same here, I could get the code to run everything fine but I just stared at the figure wondering what I’m supposed to understand. Read a few articles as well and still feel dense.

2 Likes

Bluebird8203, I like your use of vert=False for this box plot. Regarding your question, from my perspective this is an atypical example of a box plot. There are over 11000 books in the dataset, and the majority of samples are represented by the box, but the box is overshadowed by the outliers, which gives a misleading visual impression. The upper whisker extends to the largest value at or below Q3 + 1.5 * the interquartile range. The 75th percentile page count is 416 (from df.describe()), while the upper whisker is about 750. The maximum is 9 times this value. So, while the plot is busy with outliers, they are still a small number of samples. It is the 9-fold range of the outliers that skews the box plot visually.
Try the box plot on other datasets, like the wine dataset. You can draw several box plots of one variable, filtered by wine quality, and you’ll see a set of staggered boxes.

plt.figure(figsize=(8,4))
for i in range(1,7):
    # One subplot per quality score, 3 through 8
    plt.subplot(1,6,i)
    plt.boxplot(df['citric acid'][df['quality']==i+2])
    plt.xlabel('Q='+str(i+2))
plt.tight_layout()
plt.show()

or use seaborn - it does more with less code

import seaborn as sns

plt.figure(figsize=(8,4))
sns.boxplot(x=df['quality'], y=df['citric acid'])  # keyword arguments; recent seaborn versions reject positional x, y
plt.show()
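For anyone who wants to check the whisker arithmetic by hand, here is a minimal sketch using only the standard library. The page counts are made up for illustration (they are not the challenge data), and it assumes matplotlib's default whisker rule of 1.5 * IQR with linearly interpolated quartiles.

```python
import statistics

# Hypothetical page counts: mostly ordinary books plus two very long ones
pages = [120, 180, 250, 300, 320, 350, 400, 416, 450, 500, 4100, 6576]

# method="inclusive" interpolates linearly between data points
q1, median, q3 = statistics.quantiles(pages, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond these is drawn as an individual outlier point
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Each whisker stops at the most extreme data point that is still inside its fence
upper_whisker = max(p for p in pages if p <= upper_fence)
lower_whisker = min(p for p in pages if p >= lower_fence)

outliers = [p for p in pages if p < lower_fence or p > upper_fence]
print(q1, q3, iqr, upper_fence, upper_whisker, outliers)
```

Note that the whisker lands on an actual data point (500 here), not on the fence itself (725), which is why the whisker on the real plot sits at or below Q3 + 1.5 * IQR.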

5 Likes

Thanks, I’ll try those out!

Girigirirei, The box plot is a bit odd the first time you see it. They are insightful once you get familiar with them. Don’t be hard on yourself.
In this case the outlier points above 4000 pages were just 2 individual points. It would be very rare to individually count your outliers like this. If there were 2 books with 4100 and 4101 pages, the two points would appear as one. If you really need a count of outliers above a certain value, it’s best to use a filter.
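A rough sketch of the filter approach, assuming pandas. The DataFrame below is a made-up stand-in (only the column name num_pages comes from the challenge), with two books sharing the same page count so that their markers would overlap on a plot.

```python
import pandas as pd

# Hypothetical stand-in for the challenge data; two books share 4100 pages
df = pd.DataFrame({"num_pages": [320, 416, 750, 4100, 4100, 6576]})

# Boolean filter: keep only the rows above 4000 pages
over_4000 = df[df["num_pages"] > 4000]

# Number of books above the threshold, duplicates included
n_over = len(over_4000)

# Number of distinct page counts, i.e. distinct marker positions on the plot
n_distinct = over_4000["num_pages"].nunique()

print(n_over, n_distinct)
```

Here the filter reports 3 books while the plot would only show 2 visible points, which is the reason to count with a filter instead of by eye.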

3 Likes

Seems straightforward

plt.figure()
plt.boxplot(df['num_pages'], vert=False)
plt.show()

page_filter = df['num_pages'] > 4000
x = df[page_filter]
x.groupby('average_rating').mean(numeric_only=True)  # numeric_only avoids errors from text columns in newer pandas
2 Likes

I tried to combine the answers to questions one and two. While I could have labeled data points for all entries on the graph, that would be needlessly crowded; I simplified, observing that it was unlikely we’d need more than the ratings of the highest and lowest outliers:

plt.figure(figsize=(20,2))  # The data is very dispersed; a wide figure makes the plot easier to read
plt.boxplot(df["num_pages"], vert=False)  # Horizontal for easier viewing
df4000 = df[df["num_pages"] > 4000]  # The far outliers
lowest = df4000.loc[df4000["average_rating"].idxmin()]   # Row with the lowest rating
highest = df4000.loc[df4000["average_rating"].idxmax()]  # Row with the highest rating
# Label each of those two points with its rating, just above the box plot line
plt.annotate(str(lowest["average_rating"]), (lowest["num_pages"], 1.1))
plt.annotate(str(highest["average_rating"]), (highest["num_pages"], 1.1))
plt.show()
2 Likes

This code worked really nicely for me to show the number of books over 4000 pages:

plt.figure(figsize=(20,1))
plt.boxplot(df.num_pages, vert=False) 
plt.show()

It’s much easier to interpret this way, IMO. Note the addition of figsize and vert=False.

3 Likes

Am I the only one who got tripped up by the syntax error in their Boxplot Example?

plt.figure()
plt.boxplot(df.['numerical_data'])
plt.show()
8 Likes

print(df.describe())
plt.figure()
plt.boxplot(df['num_pages'])
plt.show()
df1 = df[df['num_pages']>4000]
df1

Definitely not the only one!
I didn’t even notice at 1st when it finally did work.

Yes, it threw an error for me as well after copying it & changing the feature name (missed the extra period)

me too! :raising_hand_woman: it seems there are 2 valid notations, and they mixed them up:

plt.boxplot(df['num_pages'])

and

plt.boxplot(df.num_pages)
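A quick sketch to confirm that, assuming pandas. The DataFrame is a made-up stand-in (only the column name num_pages comes from the challenge), and the compile() check just shows that the mixed form from the example is rejected by Python itself before pandas is ever involved.

```python
import pandas as pd

# Hypothetical stand-in for the challenge data
df = pd.DataFrame({"num_pages": [320, 416, 750]})

# Bracket notation and attribute notation select the same column
same = df["num_pages"].equals(df.num_pages)

# The mixed form is not valid Python syntax at all
try:
    compile("df.['num_pages']", "<example>", "eval")
    mixed_is_valid = True
except SyntaxError:
    mixed_is_valid = False

print(same, mixed_is_valid)
```

Bracket notation is the safer habit, since attribute notation only works when the column name is a valid Python identifier (no spaces, for example) and does not clash with an existing DataFrame attribute.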