Challenge 20 Megathread

For any and all questions relating to challenge 20. :point_down:

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

If you would like to download the data set used in this challenge, click here.

To continue to play around with the datasets in a Jupyter environment, click here.

difficulty: medium
solution: df.corr()

hmmm…

14 Likes

So, the kernel remains a mystery to me. Clearly I need to rtfm and practice, practice, practice. Oh, and of course review other people’s code.

A trick I used is that correlation is usually pretty consistent across large enough random samples from a population, and I was able to see it with as few as 50 arbitrary lines, but 1000 was quite clear… After all, we know the correlation. If I were a nicer programmer, I’d even plot the correlation as a line or other shape.

dy = df['avg_rating']
print(df.weight.corr(dy))  # print it, otherwise a mid-cell expression shows nothing
plt.figure()
plt.scatter(x=df['weight'].head(1000), y=df['avg_rating'].head(1000))  # no need to crowd the plot
plt.show()
1 Like

or as the hint says

df['column_1'].corr(df['column_1'])

use the same column XD
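
In case the joke isn't obvious: a column correlated with itself always comes out as 1.0, so the hint's pattern only tells you something with two different columns. A minimal sketch on made-up numbers (the tiny `toy` DataFrame is hypothetical, not the challenge data):

```python
import pandas as pd

# Hypothetical toy data, NOT the challenge's boardgames dataset
toy = pd.DataFrame({
    "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "avg_rating": [5.5, 6.1, 6.0, 7.2, 7.0],
})

same_column = toy["weight"].corr(toy["weight"])      # a column against itself
two_columns = toy["weight"].corr(toy["avg_rating"])  # two different columns

print(same_column)
print(two_columns)
```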

2 Likes

Here’s mine.
I imported numpy to calculate the values for a linear regression line. There were so many points that it was messy so I thought a line would be nice.

w = df['weight']
avr = df['avg_rating']
coco = round(w.corr(avr), 4)

# m is slope, b is intercept; np.polyfit with degree 1 fits a straight line
m, b = np.polyfit(w, avr, 1)
# make figure
plt.figure()
# scatter plot of weight against average rating
plt.scatter(x=w, y=avr)
# draw the fitted line using the values from polyfit
plt.plot(w, m * w + b, color='red')
plt.title('Correlation Coefficient: {}'.format(coco))
plt.show()
2 Likes

If you need help,
Scatter Plot: weak positive correlation
plt.figure()
plt.scatter(df.weight, df.avg_rating, alpha=0.2) # alpha adjusts opacity of the points
plt.show()

df[['weight', 'avg_rating']].corr()

Result:

              weight  avg_rating
weight      1.000000    0.547244
avg_rating  0.547244    1.000000
3 Likes

Question 2 solved a couple diff ways:
pd.DataFrame.corr(df).loc['weight']['avg_rating']
df['weight'].corr(df['avg_rating'])

1 Like

Excellent addition! I’ve kept a note of this for later use.

1 Like
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('boardgames.csv')

weights = df['weight'].values
x = np.array(weights)
avg_rating = df['avg_rating'].values
y = np.array(avg_rating)
correlation_coefficient = np.corrcoef(x,y)[0][1]

plt.title('\nA plot to show the correlation between weight and average rating\n')
plt.xlabel(f'Weight\n\n Correlation coefficient: {correlation_coefficient}')
plt.ylabel('Average Rating')

#scatter plot
plt.plot(x, y, 'g.', markersize=2) 

#correlation line
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), 'k', linewidth=2) 

plt.show()
2 Likes

Video Solution: https://youtu.be/G3kosJvUl2Q

3 Likes

correlation is usually pretty consistent across large enough random samples
If you’re going to use a subset of the data to analyze, make sure it’s a random subset; you don’t know if it’s been pre-sorted or by what field.

This is very true; a good practice is to check and see how closely correlation of the subset matches with other subsets. Convergence toward the same value is a good sign, and divergence can indicate pre-sorting or other more interesting effects.
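
Here is a rough sketch of that check. It uses synthetic data as a stand-in (an assumption, since the real file isn't loaded here) and deliberately sorts it by rating to mimic a pre-sorted file. `df_demo` and the numbers used to generate it are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the challenge data (assumption): two positively
# related columns, then sorted by rating to mimic a pre-sorted file.
rng = np.random.default_rng(0)
weight = rng.uniform(1, 5, size=5000)
avg_rating = 5 + 0.5 * weight + rng.normal(0, 0.7, size=5000)
df_demo = pd.DataFrame({"weight": weight, "avg_rating": avg_rating})
df_demo = df_demo.sort_values("avg_rating", ascending=False).reset_index(drop=True)

full_corr = df_demo["weight"].corr(df_demo["avg_rating"])

# head() takes the first rows as stored, so it inherits any sorting
first_rows = df_demo.head(1000)
head_corr = first_rows["weight"].corr(first_rows["avg_rating"])

# sample() draws rows at random; repeat with different seeds and compare
sample_corrs = []
for seed in range(5):
    subset = df_demo.sample(n=1000, random_state=seed)
    sample_corrs.append(subset["weight"].corr(subset["avg_rating"]))

print("full data:     ", round(full_corr, 3))
print("first 1000:    ", round(head_corr, 3))
print("random subsets:", [round(c, 3) for c in sample_corrs])
```

On data like this the random subsets should land close to the full-data value, while the first 1000 rows of the sorted frame tend to drift away from it. On the real data, swap `df_demo` for `df`.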

# Question 1
df.plot(x="weight", y="avg_rating", kind="scatter", figsize=(12,6))
plt.title("Weight vs Rating Scatter plot")
plt.show()

# Question 2
print("Correlation:", df["weight"].corr(df["avg_rating"]))

I found it helpful to add a fitted line to the scatter plot to better see the correlation. Here is the code that has to be added:

import numpy as np
m, b = np.polyfit(df.weight, df.avg_rating, 1)
plt.plot(df.weight, m*df.weight + b, color='r', linewidth=3)

See my tweet here of what this produces.
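
Side note on why the fitted line and the correlation go together: for a straight-line least-squares fit, the slope equals the correlation coefficient times the ratio of the standard deviations (slope = r * std(y) / std(x)). A small sketch to check this numerically, on synthetic data rather than the challenge file (the generated `x` and `y` are assumptions for illustration):

```python
import numpy as np

# Synthetic stand-in data (assumption), not the boardgames dataset
rng = np.random.default_rng(2)
x = rng.uniform(1, 5, size=1000)
y = 5 + 0.5 * x + rng.normal(0, 0.7, size=1000)

m, b = np.polyfit(x, y, 1)       # least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient

# Slope rebuilt from r and the two standard deviations
slope_from_r = r * y.std() / x.std()

print(m, slope_from_r)
```

The two printed values should agree up to floating-point error, so a steeper line does not by itself mean a stronger correlation; it also depends on the spread of each column.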

2 Likes

This is pretty cool. Today’s challenge is very straightforward.

This challenge is A JOKE (esp for a Medium)… I challenge you to use today’s methodology to even have a correct answer to Challenge #19 (which was SUPPOSED to be EASY)… However it is you treat the outliers in Challenge #19

You won’t get the answer wanted… Coz Challenge #19 data is inconclusive until you’re told EXACTLY how the quiz makers want the answer…

In your spare time since today’s is EASY… Try this for some WOW instead of “df.corr()”
Heat Maps for Correlation Visuals

(still salty from my only loss of Challenge 19) - onward to Tomorrow’s final
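
For anyone who wants to try the heat map idea, here is a minimal sketch using plain matplotlib. The data is a synthetic stand-in and the `noise` column is made up purely so the matrix has more than two entries; none of this comes from the challenge file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption); the "noise" column is hypothetical
rng = np.random.default_rng(1)
weight = rng.uniform(1, 5, size=500)
demo = pd.DataFrame({
    "weight": weight,
    "avg_rating": 5 + 0.5 * weight + rng.normal(0, 0.7, size=500),
    "noise": rng.normal(0, 1, size=500),
})

corr = demo.corr()  # pairwise Pearson correlation matrix

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="correlation")
ax.set_title("Correlation heat map")
fig.tight_layout()
plt.show()
```

On the real `df`, recent pandas versions may need `df.corr(numeric_only=True)` if the frame contains text columns.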

Question 1: The following scatter plot shows a positive correlation between weight and avg_rating column:

plt.figure()
plt.scatter(x = df['weight'], y = df['avg_rating'])
plt.show()

Question 2: The correlation coefficient between weight and avg_rating column is:

cc = df['weight'].corr(df['avg_rating'])
print(cc)

That’s what I noticed too: sometimes the challenges labeled “HARD” or “MEDIUM” are NOT, while some of the “EASY” ones have you checking docs, hints and discussions, or have misleading statements. Anyway, we’re almost there, one more challenge!!!

1 Like