Challenge 17 Megathread

For any and all questions relating to challenge 17.

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

If you would like to download the data set used in this challenge, click here.

To continue to play around with the datasets in a Jupyter environment, click here.

“What are the top 5 rated books in the dataset?” is an awkward way of saying what are the 5 books with the most ratings.

29 Likes
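To make the two readings of "top rated" concrete, here is a minimal sketch. The four-row table is invented for illustration (it is not the challenge's books.csv); only the column names ratings_count and average_rating are taken from the thread.

```python
import pandas as pd

# Invented toy data, NOT the challenge dataset; column names mirror the thread.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "ratings_count": [5000, 12, 900, 3],
    "average_rating": [3.9, 4.8, 4.1, 5.0],
})

# Reading 1: the books with the most ratings (what Q1 apparently means).
most_rated = books.sort_values("ratings_count", ascending=False).head(2)

# Reading 2: the books with the highest average rating.
best_rated = books.sort_values("average_rating", ascending=False).head(2)

print(most_rated["title"].tolist())
print(best_rated["title"].tolist())
```

On this toy table the two readings pick completely different books, and the "highest average" list is dominated by titles with only a handful of ratings.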

First cut was miserably inefficient and didn’t do what was asked, but got the answer:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("books.csv")
dff=df[df["ratings_count"]>=df["ratings_count"].mean()]
dft=dff.sort_values(by=["ratings_count"], ascending=False)
dfr=dff.sort_values(by=["average_rating"], ascending=False)
print(dft["title"])
dfr["title"].head(5)
2 Likes

If anyone is having trouble, here’s what I did:
Q1. df.sort_values('ratings_count', ascending=False).head() sorts on ratings_count and displays the first five books using head().

Q2. df[df['ratings_count'] > df['ratings_count'].mean()].sort_values('average_rating', ascending=False).head()
Filters out all books whose ratings count is below the mean (this eliminates books with a single review of 5), then sorts on average rating and displays the result using head().

Stretch Q:
df['authors'].value_counts().head() displays the head() of the value counts on the authors feature.

I didn’t see a need to plot a bar chart for this. I'm working on that, but can’t get back to the exercise. To plot the bars, just assign the Q1 and Q2 statements above to df1 and df2.

*** Make sure you use head() to limit the size to 5 or you’ll be plotting 11000 bars
plt.bar(x=df1['title'], height=df1['ratings_count'])  # or use barh
plt.show()

plt.bar(x=df2['title'], height=df2['average_rating'])  # or use barh
plt.show()

7 Likes
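For the stretch question, value_counts() is probably the shortest route, but a groupby gives the same counts. A small sketch on an invented six-row table (the author names and titles are placeholders, not from the dataset):

```python
import pandas as pd

# Invented toy data; only the column names come from the thread.
books = pd.DataFrame({
    "authors": ["X", "Y", "X", "Z", "X", "Y"],
    "title": ["a", "b", "c", "d", "e", "f"],
})

# value_counts() counts rows per author and sorts descending by default.
by_value_counts = books["authors"].value_counts().head(2)

# groupby + count does the same job but needs an explicit sort.
# Note: count() only counts non-null titles, so the two can differ
# if the title column has missing values.
by_groupby = (
    books.groupby("authors")["title"]
    .count()
    .sort_values(ascending=False)
    .head(2)
)

print(by_value_counts)
print(by_groupby)
```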

Maybe a weird question, but I’ve been wondering: From what I can tell, all of the functions and graphs and such that we’ve been doing so far can also be done in a standard spreadsheet program such as Excel or Calc. Are the benefits to doing them in Python just a lighter load on the machine for massive datasets, or am I missing something here?

1 Like

Yep. I struggled for a while with Q1 thanks to how it was phrased. I explored a bunch of different cutoff thresholds for the number of times a book was rated before I finally figured out what was really meant by “top rated”. Although it may be chalked up to poor phrasing, the difficulty ratings seem completely arbitrary at this point - I wonder if their alignment to the challenges got screwed up somehow.

Anyhow, I ended up presenting the top 5 using barh instead of bar so the labels could be read horizontally without overlapping. xticks(rotation=90) is also an option but then you have to turn your head sideways.

1 Like
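A rough side-by-side of the two labelling options mentioned above, barh versus rotated x tick labels. The titles and counts are invented placeholders, so treat this as a sketch of the plotting calls rather than the challenge answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented placeholder data standing in for a top-N result.
top = pd.DataFrame({
    "title": [
        "A Fairly Long Placeholder Book Title",
        "Another Placeholder Title That Overlaps",
        "Third Placeholder Title",
    ],
    "ratings_count": [300, 200, 100],
})

fig, (ax_h, ax_v) = plt.subplots(1, 2, figsize=(12, 4))

# Option 1: horizontal bars, so the titles read left to right.
ax_h.barh(y=top["title"], width=top["ratings_count"])
ax_h.invert_yaxis()  # put the first (largest) row at the top

# Option 2: vertical bars with the tick labels rotated 90 degrees.
ax_v.bar(x=top["title"], height=top["ratings_count"])
ax_v.tick_params(axis="x", labelrotation=90)

fig.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a plain script, call plt.show() here.
```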

Cleaner sol?

x = df['ratings_count'] > 20 # Arbitrarily "low number"
ddf = df[x]
ddf.sort_values('ratings_count', ascending=False).head(5)

y = df['ratings_count'] > df['ratings_count'].mean()
ydf = df[y]
ydf.sort_values('average_rating', ascending=False).head(5)

(w/out the visualization aspect)

That’s not a weird question, I can understand why you’d think that. Actually, some of the problems can be solved with a napkin or some mental math, so the fact they’re contrived for educational purposes probably plays a part, but it’s true that even with real data there’s a lot that you can do with Excel. Sometimes (or often) that’s the right tool for the job.

With Python, however, your code can run in many more contexts, is more repeatable and extensible, and is more expressive and easier to read. The ceiling for what you can do is much higher (practically limitless) compared to Excel, and it’s easier to aggregate data from many different sources. Python programs are way more flexible than a spreadsheet. These simple bar charts are just the beginning!

3 Likes

A traditional barplot was not at all useful, with the titles all jammed together and squished, but a horizontal barplot using barh(y="title", width="ratings_count") does work better, even if it wasn’t necessary to solve the challenge.

# Q1
top5rated = df.sort_values("ratings_count", ascending=False).head()
top5rated["title"]

# Q2
mean_count = df["ratings_count"].mean()  # find the mean to remove items below this (about 17942.8 here)
top5ratingFilt = df[df["ratings_count"] > mean_count]
top5rating = top5ratingFilt.sort_values("average_rating", ascending=False).head()
top5rating["title"]

# stretch
df.groupby("authors").count().sort_values("title", ascending=False).head()

I was trying to display both the title, and the ratings_count or average_rating, but I kept getting a key error - can anyone help me figure out why the below didn’t work?

top5rating["title", "average_rating"]

I don’t understand something here.
The note clearly says to filter out books with a ratings_count less than the mean. Shouldn’t it be like this:
df['ratings_count'] < df['ratings_count'].mean() ?

to show two (or more) features you need to use double square brackets:

top5rating[['title', 'average_rating']]
4 Likes
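For anyone wondering why the single-bracket version raises a KeyError: with one pair of brackets, pandas treats "title", "average_rating" as a single tuple key and looks for one column with that name. A minimal sketch on an invented two-row frame (the values are placeholders):

```python
import pandas as pd

# Invented toy data; only the column names come from the thread.
books = pd.DataFrame({
    "title": ["A", "B"],
    "average_rating": [4.5, 4.2],
    "ratings_count": [100, 200],
})

# One pair of brackets: the key is the tuple ("title", "average_rating"),
# which is not a column name, so pandas raises KeyError.
raised = False
try:
    books["title", "average_rating"]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Two pairs of brackets: the inner pair is a list of column names,
# and the result is a DataFrame with just those columns.
subset = books[["title", "average_rating"]]
print(subset)
```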

Thank you! Was driving me CRAZY!!!

I ended up doing horizontal bar plots which made everything look better!

#Question 1
top_5 = df.sort_values('ratings_count', ascending=False).head(5)
print(top_5['title'])

#Creating a bar graph of titles and ratings
plt.figure()
colours=['red' if (y==max(top_5['ratings_count'])) else 'silver' for y in top_5['ratings_count']]
plt.barh(y = top_5['title'], width = top_5['ratings_count'], color=colours)
plt.show()


#Question 2
mean_filter = df['ratings_count'] > df['ratings_count'].mean()
top_rated = df[mean_filter].sort_values('average_rating', ascending=False).head(5)
print(top_rated['title'])

#Creating a barplot of Question 2
plt.figure()
clr=['green' if (y==max(top_rated['average_rating'])) else 'silver' for y in top_rated['average_rating']]
plt.barh(y = top_rated['title'], width = top_rated['average_rating'], color=clr)
plt.show()

#Stretch Question
topauthors = df['authors'].value_counts().head(5)
print(topauthors)

#Barplot for Stretch
topauthors.plot.barh()
2 Likes

This will return items with a ratings_count less than the mean, not filter them out.
You want to keep the data above the mean.

2 Likes
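One way to keep "filter out" straight is to build the boolean mask once and look at both halves. This is a sketch on invented numbers, not the challenge data:

```python
import pandas as pd

# Invented toy data; only the column name ratings_count comes from the thread.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "ratings_count": [10, 200, 3000, 50],
})

mean_count = books["ratings_count"].mean()

# True for the rows we want to KEEP (above the mean).
above_mean = books["ratings_count"] > mean_count

kept = books[above_mean]          # what remains after filtering
filtered_out = books[~above_mean]  # what was removed (the "<" side, plus ties)

print(mean_count)
print(kept["title"].tolist())
print(filtered_out["title"].tolist())
```

The condition you write inside df[...] describes the rows that stay, so "filter out everything below the mean" becomes a > comparison.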

Wish I could have done more with this, but for some reason it’s not loading the cells. Q1 works, and Q2 is when it froze/slowed down.

q1

top_rated = df.sort_values(['ratings_count'], ascending=False).head()
titl = top_rated['title'].str
titl = titl[:30] + "..."
plt.figure(figsize=(20,5))
plt.title('Question 1')
plt.bar(x = titl, height=top_rated['ratings_count'])
plt.show()

q2

clrng = df[df['ratings_count'] > df['ratings_count'].mean()]
clrng.sort_values(['average_rating'], ascending=False).head()
q2titl = clrng['title'].str[:20]

plt.figure(figsize=(20,5))
plt.bar(x=q2titl, height=clrng['average_rating'])
plt.show()
1 Like
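A possible explanation for the q2 slowdown, offered as a guess rather than a diagnosis: sort_values() and head() return new objects, and in the q2 code their result isn't assigned to anything, so the later plt.bar call would still receive every row above the mean instead of five. A small sketch of that behaviour on an invented six-row table:

```python
import pandas as pd

# Invented toy data; only the column names come from the thread.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "ratings_count": [100, 900, 800, 700, 950, 5],
    "average_rating": [4.9, 4.1, 4.5, 3.8, 4.3, 5.0],
})

above_mean = books[books["ratings_count"] > books["ratings_count"].mean()]

# The result of this line is thrown away; above_mean itself is unchanged.
above_mean.sort_values("average_rating", ascending=False).head(2)
print(len(above_mean))  # still every row above the mean

# Assigning the result keeps only the top rows for plotting.
top2 = above_mean.sort_values("average_rating", ascending=False).head(2)
print(top2["title"].tolist())
```

If that is what happened, assigning the sorted head back to a variable and plotting that variable should bring the bar count down to five.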

Oh, I am stupid! It’s filter out, meaning “I don’t want them”. Now it makes sense. I thought I was gonna stick my fingers into an electric outlet out of frustration.

1 Like

I had to resort to pure Pandas to match the answer, the bar plot cluttered all the names…Will try the horizontal bar approach

2 Likes

Don’t be so hard on yourself!

My first go, I eliminated all the average ratings below the mean instead of the ratings counts.

3 Likes