Challenge 8 Megathread

For any and all questions relating to challenge 8. :point_down:

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

This challenge was well put together.

7 Likes

I’m playing with loc() and can’t figure some things out. How can I extract, for example, 12 months at a time (without using iloc()).

df.loc["07-Feb":"07-Dec", :]
Why doesn’t this work?

1 Like

df["Year"] = df["Month"].apply(lambda date: date.split("-")[0])

This is like the only time where I actually know how to implement lambda functions so you better be sure I’m gonna throw em in to make a new Year column for the stretch goal.

After that, it’s a .group_by() problem.

6 Likes

Nice I did the exact same thing. I guess you could also use the to_datetime function from pandas but would need to give it the format parameter for year and month.

df[‘Year’] = df[‘Month’].apply(lambda x: x.split("-")[0]) #this it to quickly extract the year as a separate column
yearlyDF = df.groupby(“Year”)[‘Monthly milk production: pounds per cow’].sum() #this does a quasi pivot table by year
yearlyDF = pd.DataFrame(yearlyDF) #had to convert back into a dataframe bc it was creating just a series (a series is a df with 1 col)
yearlyDF.loc[yearlyDF[‘Monthly milk production: pounds per cow’].idxmax()] #the idxmax finds the row where max total milk volume is
#using a loc in this case instead of iloc, bc the index is the year now not “row numbers”

2 Likes

You have to convert the dates to dates first (they’re being treated as regular text right now). Then if you want to quickly pull out by year you either have to filter the df or set the dates to be the index. Here are the codes to 1) changes the dates to dates, 2) change the index, and 3) pull out the 12 months for 2020. I like to use this site for the date codes in python: https://strftime.org/

df['Date'] = pd.to_datetime(df['Month'], format="%y-%b")
df.set_index('Date', inplace=True)
df.loc['2020','Monthly milk production: pounds per cow']
8 Likes

This is fantastic thanks, especially the link! So the table automatically defaults to integer indexes for the rows, and you have to specify that you want the date to be the index of each row instead?

By that assumption, did I do this properly for just using plain strings (seems to work in my notebook)?

df = pd.read_csv('milk.csv')
df.set_index('Month', inplace=True)
df.loc["07-Feb":"07-Dec", :]

Yup that works though it’s basically saying grab the rows between the row labeled ‘7-feb’ to row labeled ‘7-dec’. It doesn’t know that the 07 is 2007. So you would constantly have to give it ranges like that to pull out each year. Vs once your index is convert to actual dates you can use ‘2007’ or ‘2020’ to pull all of one year. Also, when you’re plotting these on a graph, it won’t know that your x-axis is dates and might give you a weird result.

1 Like

df.loc[“07-Feb”:“07-Dec”, : ] will work only if the date texts are the row index, but not when they are a feature. You will have to set the index to “Month”

df.set_index(‘Month’, inplace=True)

Usually better to convert dates to datetime, but you can use strings as row index provided they are unique. setting verify_integrity = True in set_index will check for duplicates.

You’ll see multilevel text indexing resulting from some groupby’s

2 Likes

Getting the min
df.iloc[df['Monthly milk production: pounds per cow'].idxmin()]
Getting the max
df.iloc[df['Monthly milk production: pounds per cow'].idxmax()]

Took Andrei’s approach for the stretch:

df['Year'] = df['Month'].apply(lambda x: x.split("-")[0])
yearlyDF = df.groupby('Year')['Monthly milk production: pounds per cow'].sum()
yearlyDF = pd.DataFrame(yearlyDF)
yearlyDF.loc[yearlyDF['Monthly milk production: pounds per cow'].idxmax()]
11 Likes

I think there are 2 lines important for strecth:

dfx=pd.to_datetime(df['Month'], format='%y-%b') 
df = df.join(dfx)
df.groupby(df['fdate'].dt.year)['Milk'].agg(['sum', 'mean', 'max'])

is there a way to get the milk.csv?

1 Like

Video Solution: https://youtu.be/lkodbHkcYVA
Video Solution (Stretch): https://youtu.be/yFtARFVEdp8

1 Like
import pandas as pd
df = pd.read_csv('milk.csv')
df.sort_values(by=['Monthly milk production: pounds per cow'])
3 Likes

@elliot.lupini Can you provide a link to download milk.csv? I can’t make the code work with the on-line platform but I can run it on my local jupyter notebook.

Alternatively, how can I export milk.csv to my pc?

Thanks!

5 Likes

This ^^

I do all my work in vs code, jupyter notebooks are trash.

2 Likes

Minor nit in the DataFrame Column Functions: this should be iloc[]

  • iloc() gets rows (or columns) with particular positions in the index (only takes integers)
4 Likes

Co-sign on this. I’m working with a group of friends, and we’ve set up our own Colab notebooks so that we each work on our own hidden solution, and then are able to comment on each others’ code and ask questions. This helps the beginners in our group learn more from the people who have a bit more experience. If we could get a download link to the files being used in the challenges, it would be a big help!

1 Like

Saw some great solutions posted already… I should have used the split in my lambda function, but here’s what I did:

col = df.columns.values ## so that I didn’t have to retype column names, particularly the second column
df['year'] = df[col[0]].apply(lambda x: x[:2]) ## used a string first two characters approach rather than split
table = df.groupby(['year']).sum().reset_index() ## gives you a new dataframe with year; automatically ignores columns that are not numerical, otherwise you need to specify which columns you want to keep
table[table[col[1]] == table[col[1]].max()] ## there are more concise ways to do this, so take a look at other solutions, but as a general approach, this outputs the row of a dataframe where the condition is met, in this case where the production# is equal to the maximum

1 Like