For any and all questions relating to challenge 20.

For a tutorial on how to use Jupyter Notebook, we put together this video:

Still have questions? Read all the FAQs here.

To continue to play around with the datasets in a Jupyter environment, click here.

difficulty: medium
solution: df.corr()

hmmm…


So, the kernel remains a mystery to me. Clearly I need to rtfm and practice, practice, practice. Oh, and of course review other people's code.

A trick I used is that correlation is usually pretty consistent across large enough random samples from a population, and I was able to see it with as few as 50 arbitrary lines, but 1000 was quite clear… After all, we know the correlation. If I were a nicer programmer, I'd even plot the correlation as a line or other shape.

dy=df['avg_rating']
df.weight.corr(dy)
plt.figure()
plt.scatter(x = df['weight'].head(1000), y = df['avg_rating'].head(1000)) # no need to crowd the plot
plt.show()

or as the hint says

df['column_1'].corr(df['column_1'])

use the same column XD


Here's mine.
I imported numpy to calculate the values for a linear regression line. There were so many points that it was messy so I thought a line would be nice.

import numpy as np  # needed for np.polyfit below

w = df['weight']
avr = df['avg_rating']
coco = round(w.corr(avr), 4)

# m is slope, b is intercept; use np.polyfit to calculate values
m, b = np.polyfit(w, avr, 1)
# make figure
plt.figure()
# do scatter plot with 2 things
plt.scatter(x = w, y = avr)
# use values from polyfit
plt.plot(w, m * w + b, color='red')
plt.title('Correlation Coefficient: {}'.format(coco))
plt.show()

If you need help,
Scatter Plot: weak positive correlation
plt.figure()
plt.scatter(df.weight, df.avg_rating, alpha=0.2) # alpha adjusts opacity of the points
plt.show()

df[['weight', 'avg_rating']].corr()

Result:

              weight  avg_rating
weight      1.000000    0.547244
avg_rating  0.547244    1.000000

Question 2, solved a couple of different ways:
pd.DataFrame.corr(df).loc['weight']['avg_rating']
df['weight'].corr(df['avg_rating'])
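As a quick sanity check, the different routes to the coefficient can be compared against the textbook Pearson formula. This is only a sketch: the data below is synthetic (the real challenge `df` is not reproduced here), and the column names `weight` and `avg_rating` are borrowed from the thread.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the challenge data (an assumption, not the real dataset).
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 5.0, size=200)
y = 5.0 + 0.6 * x + rng.normal(0.0, 1.0, size=200)
df = pd.DataFrame({"weight": x, "avg_rating": y})

# Three library routes to the same number.
via_matrix = df.corr().loc["weight", "avg_rating"]
via_series = df["weight"].corr(df["avg_rating"])
via_numpy = np.corrcoef(df["weight"], df["avg_rating"])[0, 1]

# Pearson's r from its definition: covariance over the product of spreads.
dx = x - x.mean()
dy = y - y.mean()
by_hand = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(via_matrix, via_series, via_numpy, by_hand)
```

All four values should agree up to floating-point rounding, so which one to use is mostly a matter of taste.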


Excellent addition! I've kept a note of this for later use.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

weights = df['weight'].values
x = np.array(weights)
avg_rating = df['avg_rating'].values
y = np.array(avg_rating)
correlation_coefficient = np.corrcoef(x,y)[0][1]

plt.title('\nA plot to show the correlation between weight and average rating\n')
plt.xlabel(f'Weight\n\n Correlation coefficient: {correlation_coefficient}')
plt.ylabel('Average Rating')

#scatter plot
plt.plot(x, y, 'g.', markersize=2)

#correlation line
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), 'k', linewidth=2)

plt.show()

Video Solution: https://youtu.be/G3kosJvUl2Q


"correlation is usually pretty consistent across large enough random samples"
If you're going to use a subset of the data to analyze, make sure it's a random subset; you don't know if it's been pre-sorted or by what field.

This is very true; a good practice is to check and see how closely correlation of the subset matches with other subsets. Convergence toward the same value is a good sign, and divergence can indicate pre-sorting or other more interesting effects.
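The subset check described above could look roughly like this. It is a sketch on synthetic data (the real challenge `df` is not reproduced here); `subset_correlations` is a hypothetical helper name, and `DataFrame.sample` is used instead of `head` so the subsets are random rather than whatever happens to come first.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the challenge data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 5000
weight = rng.uniform(1.0, 5.0, size=n)
avg_rating = 5.0 + 0.6 * weight + rng.normal(0.0, 1.0, size=n)
df = pd.DataFrame({"weight": weight, "avg_rating": avg_rating})

full_corr = df["weight"].corr(df["avg_rating"])

def subset_correlations(frame, size, repeats=5, seed=0):
    """Correlation of weight vs avg_rating on several random subsets."""
    results = []
    for i in range(repeats):
        # A different random_state per repeat gives a different random subset.
        sample = frame.sample(n=size, random_state=seed + i)
        results.append(sample["weight"].corr(sample["avg_rating"]))
    return results

for size in (50, 1000):
    corrs = subset_correlations(df, size)
    spread = max(corrs) - min(corrs)
    print(f"n={size}: spread across subsets = {spread:.3f}")
print(f"full data: {full_corr:.3f}")
```

If the spread stays large as the subset size grows, or the subsets settle on a value far from the full-data coefficient, that is a hint the rows are ordered in some way that matters.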

# Question 1
df.plot(x="weight", y="avg_rating", kind="scatter", figsize=(12,6))
plt.title("Weight vs Rating Scatter plot")
plt.show()

# Question 2
print("Correlation:", df["weight"].corr(df["avg_rating"]))

I found it helpful to add a fitted line to the scatterplot to better see the correlation. Here is the code that has to be added:

import numpy as np
m, b = np.polyfit(df.weight, df.avg_rating, 1)
plt.plot(df.weight, m*df.weight + b, color='r', linewidth=3)

See my tweet here of what this produces.
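For anyone wondering how the fitted line relates to the correlation coefficient: for a straight-line least-squares fit, the slope equals r times the ratio of the standard deviations. A small sketch on synthetic data (an assumption, since the challenge data is not reproduced here) shows the two agreeing:

```python
import numpy as np

# Synthetic stand-in for the weight and avg_rating columns (an assumption).
rng = np.random.default_rng(3)
weight = rng.uniform(1.0, 5.0, size=300)
avg_rating = 5.0 + 0.6 * weight + rng.normal(0.0, 1.0, size=300)

# Slope and intercept of the least-squares line, as in the post above.
m, b = np.polyfit(weight, avg_rating, 1)

# Pearson correlation coefficient of the same two arrays.
r = np.corrcoef(weight, avg_rating)[0, 1]

# For simple linear regression: slope = r * sd(y) / sd(x).
slope_from_r = r * avg_rating.std() / weight.std()

print(f"polyfit slope: {m:.4f}, r * sd(y)/sd(x): {slope_from_r:.4f}")
```

So a steep line does not by itself mean a strong correlation; the slope also depends on the units and spread of each column.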


This is pretty cool; today's challenge is very straightforward.

This challenge is A JOKE (esp for a Medium)… I challenge you to use today's methodology to even have a correct answer to Challenge #19 (which was SUPPOSED to be EASY)… However it is you treat the outliers in Challenge #19…

You won't get the answer wanted… Coz Challenge #19 data is inconclusive until you're told EXACTLY how the quiz makers want the answer…

In your spare time since today's is EASY… Try this for some WOW instead of "df.corr()"
Heat Maps for Correlation Visuals

(still salty from my only loss of Challenge 19) - onward to Tomorrow's final
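For the heat map idea, one possible version uses only matplotlib. This is a sketch, not the linked article's code: the data is synthetic, the `noise` column and the `plot_corr_heatmap` helper are hypothetical, and the colour map is just one reasonable choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the challenge data (an assumption, not the real dataset).
rng = np.random.default_rng(1)
n = 500
weight = rng.uniform(1.0, 5.0, size=n)
df = pd.DataFrame({
    "weight": weight,
    "avg_rating": 5.0 + 0.6 * weight + rng.normal(0.0, 1.0, size=n),
    "noise": rng.normal(0.0, 1.0, size=n),  # hypothetical unrelated column
})

corr = df.corr()

def plot_corr_heatmap(corr_matrix):
    """Draw a correlation matrix as a colour grid with the values written in."""
    fig, ax = plt.subplots(figsize=(5, 4))
    # Fix the colour scale to [-1, 1] so colours are comparable across datasets.
    image = ax.imshow(corr_matrix.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    labels = list(corr_matrix.columns)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    # Write each coefficient into its cell.
    for i in range(len(labels)):
        for j in range(len(labels)):
            ax.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    fig.tight_layout()
    return fig

fig = plot_corr_heatmap(corr)
```

In a notebook the figure should render at the end of the cell; in a script, call `plt.show()` afterwards.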

Question 1: The following scatter plot shows a positive correlation between weight and avg_rating column:

plt.figure()
plt.scatter(x = df['weight'], y = df['avg_rating'])
plt.show()

Question 2: The correlation coefficient between weight and avg_rating column is:

cc = df['weight'].corr(df['avg_rating'])
print(cc)

That's what I noticed: sometimes the challenges labeled "HARD" or "MEDIUM" are NOT! Meanwhile, some of the "EASY" ones have you checking docs, hints, and discussions, or sometimes have misleading statements… Anyway, we're almost there, one more challenge!!!
