Challenge 18 Megathread

For any and all questions relating to challenge 18. :point_down:

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

If you would like to download the data set used in this challenge, click here.

To continue to play around with the datasets in a Jupyter environment, click here.

Video Solution: https://youtu.be/6wh1c4lNneY

4 Likes

Wasn’t a big fan of today’s challenge mostly because I’m a board game fan and use BGG a lot. The categories should not be combined and instead more text manipulation should have been used to split up the categories at the commas… E.g. “Card Games, Fantasy” is not one category, but two different categories…
In any case, I digress…

To solve it “their” way, where the entire text inside the category column is just one category I did this:
#Question 1

#Question 2

8 Likes

And here is what I think the “real” counts should be:
dfAdults = df[df[‘age’]>=13]
#Find all the unique categories, seperated at commas
allCategories = set(", “.join(dfAdults[‘category’].unique()).split(”,"))
allCategories = set(x.strip() for x in allCategories)

#Find the counts of each category in the adults df
counts = {}
for cat in allCategories:
counts[cat] = dfAdults[df[‘category’].str.contains(cat)][‘category’].count()

#Create a df of just the counts
dfCounts = pd.DataFrame({“Category”:counts.keys(),“Count”:counts.values()})
dfCounts = dfCounts.sort_values(‘Count’,ascending=False)

plt.figure()
plt.bar(dfCounts[‘Category’].head(10),height=dfCounts[‘Count’].head(10))
plt.xticks(rotation = ‘vertical’)
plt.title(“True counts of categories of non-children board games”)
plt.show()

5 Likes

This one is probably a perfect example of why it doesn’t make sense to just take the full category name:

The dataset provided categorizes this as: “Card Game, Collectible Components, Fantasy, Fighting”

That means this will not be included in the bar graph category of “Card Game, Fantasy” just because those two categories are not next to each other in the text…

Yet the game falls under both those categories.

5 Likes

Not groking the point of the lesson, but solving the questions.

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('boardgames.csv')
df13=df[df['age']>12]
dfs=pd.DataFrame({'category': df13.groupby(['category'])['category'], 'count': df13.groupby(['category'])['category'].count()}, dtype=object)
dfss=dfs.sort_values(by=['count'], ascending=True)
dfe=pd.DataFrame({'category': df.groupby(['category'])['category'], 'count': df.groupby(['category'])['category'].count()}, dtype=object)
dfes=dfe.sort_values(by=['count'], ascending=True)
print("Question 1")
dfss[dfss['count']>dfss['count'].quantile(q=0.99)].plot.barh(y='count', color = 'red')
print("Question 2")
dfes[dfes['count']>dfes['count'].quantile(q=0.995)].plot.barh(y='count', color = 'blue')
2 Likes

I agree that there are different ways to rank the categories. I looked at grouping categories and summing num_votes and then sort by num_votes. But I ended up with the following to get the answer wanted. I printed the top 5 and did a bar chart for each (13+ and all games) then took the union of the categories. - code without the output

age_filter = df['age']>=13
age_games = df['category'][age_filter].value_counts().head()
age_games

plt.barh(y=age_games.index, width=age_games)
plt.show()

all_games = df['category'].value_counts().head()
all_games

plt.barh(y=all_games.index, width=all_games)
plt.show()

common = all_games.index & age_games.index
print(common)
8 Likes

Am I the only one to not think that “Top” means most games i.e. count. If someone asked me to rank the top 5 television series, I would not assume they meant rank them by the number of episodes in the series. There’s a perfectly good word to use if you want the category with the most numbers, “prolific”. or just say “the category with the most titles”. I’m not learning anything in this challenge except to wait for someone else to solve each problem and get their translation of the problem in the forum.

8 Likes

it seems like the people writing these problems don’t understand the data themselves

8 Likes

I agree about the ambiguity of Top 5. I initially read that as most highly ranked but then looking at the hint seeing that they way want most popular. Adding a little more specificity would be helpful

5 Likes

it’s not even most popular. most popular would be fine, but would be ranking by number sold. Instead it’s by number of games that exist in the database.

2 Likes

This is so well done. Many parts of this code that I do not fully understand, but I can appreciate it for what it does. Prompted me to do a bit of research and I am learning for sure - so thanks!

Hey Dave - great idea with the comparison of the index at the end! I learned something through that.

Can you tell me why you didn’t do any sort values? How did you know that the ‘top 5’ (as it relates to this question) would be in the .head() function? Does the value_counts automatically sort them in descending order?

looking at your code vs. mine, I sorted my age filter by ‘avg_rating’ where as you sorted by ‘names’ and yet we got the same results. Could you tell me why you think that might be?

agefilt = df[df[‘age’] >= 13]
agefiltsort = agefilt.sort_values(‘avg_rating’, ascending = False)
plt.figure()
plt.bar(x = agefiltsort[‘category’].value_counts().head().index, height = agefiltsort[‘category’].value_counts().head(), color = colours)
plt.xticks(rotation = ‘vertical’)
plt.show()

Thanks I threw it together very quickly last night so not the best in terms of being clear. Happy to explain any one of those formulas if it helps.

High level what I did:

  • first did a unique on the category column to remove duplicates,
  • the ", ".join then just concatenates them all into one big string
  • the strip cuts them apart at the comma again to have as individual data point in one huge list
  • the set() is a data type similar to a list in Python that can only hold unique values, so was kindof like doing a unique formula but on a list
  • once I have all the unique categories I’m just searching for those categories back in the main data frame and doing a count using the str.contains (similar to contains filter in excel)
1 Like

Yup might have been a bit ambiguous there by using count method, should have probably used value_counts.

The first part of the formula that does the groupby and count will create a summary dstaframe where the index will be all the categories and every other column has the total count of that category. So for example the row “Card Game, Fantasy” will have 30, 30, 30, 30, all the way across for all columns. So when I did the sort at the end of that formula, it didn’t matter what column I picked. I could have picked any of the columns (avg rating, name of board game, url, age, etc) since they all the same count number in them for that row. When I’m back at my laptop I’ll take a quick screenshot to show you how the grouped df looks like.

EDIT:
Here’s how the group dataframe looks like:

1 Like

Not a fan of this challenge at all. The Category field is comma-seperated distinct categories. We should split them and then group as Andrei suggested.

It also wasn’t clear that we were supposed to rank categories by distinct game, rather than by total ratings (like yesterday), total copies owned, etc.

1 Like

jaredcm, Yes, value_counts() defaults to returning the counts in descending order, and head() defaults to returning the first 5 elements. You can also get value_counts to reverse the order, or normalize and return a frequency vs. a count and you can also get it to return count by bins by specifying the number of bins.
I used the index for the y axis on the barh plot since value_counts() like groupby returns the counted or grouped feature values as the index in the result.
Note: if you assign df[‘category’]. value_counts() to vc_result, you can print the results in alphabetical order (instead of by value) using sort_index, e.g.: vc_result.sort_index() ,or in reverse alphabetical with vc_result.sort_index(ascending = False)

2 Likes