I started with the large dataset of diamond prices that Solomon Messing created for use in Lesson 6 of the Udacity course. I then subsetted it to keep only those diamonds with GIA certification (about 77% of them) and to remove any diamonds that had no value for price (roughly 500 of them).
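A minimal sketch of that subsetting step in base R, assuming a data frame with `cert` and `price` columns. The toy rows and the name `diamonds_raw` are hypothetical; `diamondsb` is the name that appears in the model output further down.

```r
# Toy stand-in for the full dataset; these rows are made up for illustration.
diamonds_raw <- data.frame(
  carat = c(0.24, 0.31, 0.50, 1.01),
  cert  = factor(c("GIA", "IGI", "GIA", "GIA")),
  price = c(300, 320, NA, 5000)
)

# Keep GIA-certified diamonds that have a price.
diamondsb <- subset(diamonds_raw, cert == "GIA" & !is.na(price))

nrow(diamondsb)        # rows that survive both filters
table(diamondsb$cert)  # unused factor levels are kept, with a count of 0
```

Because `subset()` keeps unused factor levels, the other certification labs still show up in a summary with a count of 0 unless `droplevels()` is applied afterwards.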
The dataset consists of 463066 observations of 15 variables, three of which I created: logprice, mass_estimate and price_per_carat. There are ten numeric variables: carat, table, depth, price, x, y, z, logprice, mass_estimate and price_per_carat. There are four ‘Factor’ variables: cut, color, clarity and cert. And finally there is one character variable, measurements, which is just a string representation of the x, y and z variables.
Here’s an explanation of what each of these variables means, and the nature of them in this particular dataset.
Carat - a carat is a measure of weight equivalent to 0.2 g. This equivalence has been a global standard since 1907. It is by far the most important feature in determining the value of a diamond. The diamonds in this dataset range from 0.2 carats to 7.17 carats, with a median value of 0.8 carats.
Cut - technically this is the ‘cut grade’ and not a reference to the style of cut, of which there are many. This is largely a subjective measurement, although there have been recent technological advancements to remove some of the subjectivity. All the diamonds in this dataset are one of three cut grades: Good, V.Good and Ideal, with almost 64% being Ideal.
Color - diamond colour grades are single letters ranging from D (perfectly colourless) to Z (light yellow). Strictly speaking this is a measure of saturation for yellow diamonds, and not hue. If a diamond is a colour like blue or pink it is not given a colour grade. There are nine different colour grades for diamonds in this dataset: D, E, F, G, H, I, J, K, and L. E is the most common colour grade in this dataset with F and G fairly close behind and L the least common.
Clarity - this is a subjective judgment of a diamond’s internal characteristics, called inclusions, and its surface defects, called blemishes. Both affect the ‘sparkle’, or appearance of the diamond. The diamonds in this dataset range from IF (internally flawless), to I2. There are nine different grades with the most common being VS2 and SI1, roughly in the middle of the grade spectrum, and the least common being the worst grade, I2.
Table - this refers to the width of the top facet as a percentage of the total width of the diamond. It’s an important feature in the determination of the cut grade, but can vary with different styles, too. Table values for diamonds in this dataset range from 0 (i.e. no top facet) to 75, a very wide top facet, with the median being 58.
Depth - this refers to the depth of the diamond from the widest point to the bottom expressed as a percentage. In this dataset it ranges from 0 (I’m not sure if this is a missing value or the widest point is the bottom of this particular diamond) to 81.30, with a median of 62.
Cert - this refers to the certification. I’ve subsetted the original dataset to include only diamonds with a GIA certification so that the various grades are consistent. GIA refers to the Gemological Institute of America, based in Carlsbad, California.
Measurements - this is simply a string showing the x, y and z measurements.
Price - the minimum price of a diamond in this dataset is $300 and the maximum price is $99966. The median price is $3305. There is a strange gap in the prices of the diamonds – there are no diamonds at all between the amounts of $2471 and $2600. I don’t have a good explanation for this.
x, y and z - these are the linear dimensions of the diamond in millimeters. It’s possible to estimate the volume of a diamond by assuming that it is roughly pyramid-shaped and using the formula V = xyz/3. Since diamonds have a density of 3.5 g per cubic cm we can convert this volume to a mass and compare it to the carat measure to see how good the formula is. It turns out that the formula slightly underestimates the volume for most diamonds. The correlation between volume (as calculated with this formula) and carat weight is almost perfect (0.999), so in reality volume, as a separate variable, doesn’t add any new information.
Logprice - I created this variable by taking the log(base10) of price. The minimum value is 2.477 and the maximum value is 5.000, with a median of 3.519. This was a useful transformation as the vast majority of diamonds were lower priced. The third quartile of price is $11207, but the maximum price is $99966.
Mass_estimate - I created this variable by multiplying the x, y and z variables and then multiplying that by 7 and dividing by 1200. I started by using the formula for a rectangular pyramid, V = xyz/3, which gives us the volume in cubic millimeters. I then divided by 1000 to get cubic centimeters, multiplied by 3.5 to get grams and multiplied by 5 to get carats.
Price_per_carat - I created this variable by dividing price by carat.
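The three derived variables can be sketched in a few lines of base R. The rows below are the first five shown in the `str()` output; the data frame name `d` and the intermediate `volume_mm3` are only for illustration.

```r
# First five rows as shown in the str() output (all priced at $300).
d <- data.frame(
  carat = c(0.24, 0.31, 0.24, 0.30, 0.34),
  price = c(300, 300, 300, 300, 300),
  x = c(4.09, 4.40, 4.01, 4.21, 4.75),
  y = c(4.10, 4.42, 4.03, 4.24, 4.61),
  z = c(2.41, 2.65, 2.47, 2.63, 2.57)
)

d$logprice <- log10(d$price)

# Pyramid volume in mm^3, then mm^3 -> cm^3 -> grams -> carats.
volume_mm3 <- d$x * d$y * d$z / 3
d$mass_estimate <- volume_mm3 / 1000 * 3.5 * 5  # same as x * y * z * 7 / 1200

d$price_per_carat <- d$price / d$carat

round(d[, c("logprice", "mass_estimate", "price_per_carat")], 3)
```

Writing the conversion out step by step makes it easier to see where the factor 7/1200 comes from: (1/3) × (1/1000) × 3.5 × 5.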
## 'data.frame': 463066 obs. of 15 variables:
## $ carat : num 0.24 0.31 0.24 0.3 0.34 0.2 0.29 0.22 0.2 0.24 ...
## $ cut : Factor w/ 3 levels "Good","V.Good",..: 2 2 3 1 1 2 3 3 2 3 ...
## $ color : Factor w/ 9 levels "L","K","J","I",..: 6 2 6 5 7 4 6 4 8 4 ...
## $ clarity : Factor w/ 10 levels "I3","I2","I1",..: 5 4 5 3 3 8 3 7 8 7 ...
## $ table : num 61 59 55 57 66 62 58 62 54 56 ...
## $ depth : num 58.9 60.2 61.3 62.2 55 59.1 61.4 59.6 62.9 62 ...
## $ cert : Factor w/ 9 levels "GIA","IGI","EGL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ measurements : chr "4.09 x 4.10 x 2.41" "4.40 x 4.42 x 2.65" "4.01 x 4.03 x 2.47" "4.21 x 4.24 x 2.63" ...
## $ price : num 300 300 300 300 300 301 301 301 301 301 ...
## $ x : num 4.09 4.4 4.01 4.21 4.75 3.79 4.25 3.9 3.72 3.98 ...
## $ y : num 4.1 4.42 4.03 4.24 4.61 3.82 4.31 3.93 3.75 4 ...
## $ z : num 2.41 2.65 2.47 2.63 2.57 2.25 2.63 2.33 2.35 2.47 ...
## $ mass_estimate : num 0.236 0.301 0.233 0.274 0.328 ...
## $ logprice : num 2.48 2.48 2.48 2.48 2.48 ...
## $ price_per_carat: num 1250 968 1250 1000 882 ...
## carat cut color clarity
## Min. :0.2000 Good : 44509 E :79774 VS2 :87557
## 1st Qu.:0.4200 V.Good:122672 F :76667 SI1 :86660
## Median :0.8000 Ideal :295885 G :73475 VS1 :80016
## Mean :0.9982 D :60245 SI2 :65856
## 3rd Qu.:1.3100 H :58054 VVS2 :57294
## Max. :7.1700 I :50843 VVS1 :49413
## (Other):64008 (Other):36270
## table depth cert measurements
## Min. : 0.00 Min. : 0.00 GIA :463066 Length:463066
## 1st Qu.:56.00 1st Qu.:61.10 IGI : 0 Class :character
## Median :58.00 Median :62.00 EGL : 0 Mode :character
## Mean :57.67 Mean :61.66 EGL USA : 0
## 3rd Qu.:59.00 3rd Qu.:62.70 EGL Intl. : 0
## Max. :75.00 Max. :81.30 EGL ISRAEL: 0
## (Other) : 0
## price x y z
## Min. : 300 Min. : 0.150 Min. : 1.000 Min. : 0.46
## 1st Qu.: 1150 1st Qu.: 4.690 1st Qu.: 4.830 1st Qu.: 3.01
## Median : 3305 Median : 5.680 Median : 5.880 Median : 3.72
## Mean : 8682 Mean : 5.885 Mean : 6.071 Mean : 3.93
## 3rd Qu.:11207 3rd Qu.: 6.840 3rd Qu.: 7.040 3rd Qu.: 4.54
## Max. :99966 Max. :12.900 Max. :12.970 Max. :12.62
## NA's :1453 NA's :1473 NA's :2255
## mass_estimate logprice price_per_carat
## Min. :0.0058 Min. :2.477 Min. : 525
## 1st Qu.:0.4019 1st Qu.:3.061 1st Qu.: 2731
## Median :0.7389 Median :3.519 Median : 4329
## Mean :0.9451 Mean :3.572 Mean : 6059
## 3rd Qu.:1.2480 3rd Qu.:4.049 3rd Qu.: 7913
## Max. :7.0759 Max. :5.000 Max. :49519
## NA's :3030
##
## Pearson's product-moment correlation
##
## data: carat and mass_estimate
## t = 16767, df = 460030, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9991781 0.9991876
## sample estimates:
## cor
## 0.9991829
##
## Calls:
## m1: lm(formula = carat ~ mass_estimate, data = diamondsb[diamondsb$mass_estimate >
## 0, ])
##
## ===============================
## (Intercept) 0.002***
## (0.000)
## mass_estimate 1.054***
## (0.000)
## -------------------------------
## R-squared 1.0
## adj. R-squared 1.0
## sigma 0.0
## F 281148696.8
## p 0.0
## Log-likelihood 961978.6
## Deviance 411.2
## AIC -1923951.1
## BIC -1923918.0
## N 460036
## ===============================
I used a few different binwidths and finally subsetted the data to eliminate the largest 1% of the diamonds. It’s quite clear that cutting decisions are made to hit certain carat values as there are pronounced spikes, especially at round numbers as the diamonds get bigger, e.g. 1 carat, 1.5 carats, 2 carats, etc.
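A rough sketch of the trimming and binning, using base R's `hist()` rather than whatever plotting library produced the original figure. The carat values are simulated and the helper `count_bins` is hypothetical; only the 1% cutoff comes from the text above.

```r
set.seed(1)
# Simulated carat weights (hypothetical): a skewed spread plus a pile-up at exactly 1 carat.
carat <- round(c(rlnorm(1000, meanlog = -0.2, sdlog = 0.6), rep(1, 50)), 2)

# Drop the largest 1% so the long right tail does not stretch the axis.
cutoff  <- quantile(carat, 0.99)
trimmed <- carat[carat <= cutoff]

# Bin counts at a given binwidth; plot = FALSE returns the counts without drawing.
count_bins <- function(x, binwidth) {
  breaks <- seq(0, max(x) + binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)
}
coarse <- count_bins(trimmed, 0.5)
fine   <- count_bins(trimmed, 0.05)

length(coarse$counts)  # number of bins at the wide binwidth
length(fine$counts)    # number of bins at the narrow binwidth
```

Calling `plot(fine)` draws the histogram. The narrower binwidth is the one that makes a pile-up at a round carat value stand out.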
The vast majority of the diamonds are under $10,000, but the maximum price is almost $100,000 so the histogram is a bit easier to digest after applying a log transformation to the price scale. We can now see a few spikes, the first being just under $1,000 (perhaps an important price point) and the second notable spike at around $15,000. There is also the gap, as noted above, between $2471 and $2600, for which I have no explanation.
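The price gap can be checked directly rather than read off the histogram. This is a small sketch on hypothetical toy prices; the 2000 to 3000 search window is an arbitrary assumption, and the check only confirms that a gap exists, it does not explain it.

```r
# Hypothetical toy prices; the real column has about 463,000 values.
price <- c(300, 950, 2400, 2471, 2600, 2650, 3305, 11207)

# How many diamonds are priced strictly inside the suspected gap?
n_in_gap <- sum(price > 2471 & price < 2600)

# Largest jump between consecutive distinct prices within a window around the gap.
p <- sort(unique(price[price >= 2000 & price <= 3000]))
jumps <- diff(p)
i <- which.max(jumps)

n_in_gap
c(from = p[i], to = p[i + 1])
```

Restricting the search to a window matters because prices are sparse at the expensive end, so the largest absolute jump over the whole range would almost always sit there.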
More than half the diamonds are of Ideal cut.
The distribution of color follows an almost normal looking curve, with the intermediate colors E, F and G being the highest and the values at the extremes dropping off.
The distribution of clarity also follows an almost normal looking curve, with the intermediate clarities of VS2 and SI1 being the highest and the extremes dropping off.
The dataset contains measurements and grades of 463,066 diamonds. The primary features are price and the “four Cs” – carat, cut, color and clarity.
There are also a number of other measurements including the three volume dimensions in millimeters and some diamond design measurements – table and depth.
I created a variable to estimate the mass of the diamond based on the three volume dimensions, using a very rough assumption that a diamond is a rectangular pyramid and using the density of diamonds. I created a price per carat variable and I also created a scaled price variable by taking the log (base 10) of price.
A small number of the volume dimensions were missing, so I eliminated those diamonds when calculating the correlation coefficient for my estimated mass and the carat mass, and for the linear model I used to see how close the two variables were.
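A minimal sketch of that comparison, assuming the same `cor.test()` and `lm()` calls that appear in the output above. The first ten rows come from the `str()` output; the eleventh row, with a missing z, is hypothetical and is there only to show the filtering.

```r
# First ten rows from the str() output, plus one hypothetical row with a missing z.
d <- data.frame(
  carat = c(0.24, 0.31, 0.24, 0.30, 0.34, 0.20, 0.29, 0.22, 0.20, 0.24, 0.50),
  x = c(4.09, 4.40, 4.01, 4.21, 4.75, 3.79, 4.25, 3.90, 3.72, 3.98, 5.10),
  y = c(4.10, 4.42, 4.03, 4.24, 4.61, 3.82, 4.31, 3.93, 3.75, 4.00, 5.12),
  z = c(2.41, 2.65, 2.47, 2.63, 2.57, 2.25, 2.63, 2.33, 2.35, 2.47, NA)
)
d$mass_estimate <- d$x * d$y * d$z * 7 / 1200

# Keep only rows where the estimate exists and is positive.
ok <- !is.na(d$mass_estimate) & d$mass_estimate > 0
complete <- d[ok, ]

ct <- cor.test(complete$carat, complete$mass_estimate)
m1 <- lm(carat ~ mass_estimate, data = complete)

ct$estimate  # Pearson correlation
coef(m1)     # intercept and slope
nobs(m1)     # rows actually used in the fit
```

Filtering explicitly keeps the row count visible. Both `cor.test()` and `lm()` would also drop incomplete rows on their own by default.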
There is clearly a relationship between price and size, but it appears non-linear. It also seems that the larger the diamond the greater the variation in price. Most of the diamonds under 1 carat cluster in a fairly tight price band, but once we get to 2 carats the price varies from under $10,000 to $100,000.
This plot certainly shows that the bigger the diamond the less variability in the table value. Also, table values of 0 or close to 0 are not common for diamonds bigger than about 2 carats. I don’t really know what a table value of 0 means, whether it’s a different style of cut or perhaps is a missing value. More investigation here is necessary.
In reading about table, I thought it might influence clarity, but this plot doesn’t seem to indicate that there is much of a relationship.
## clarity table.0% table.25% table.50% table.75% table.100%
## 1 I2 0 57 59 61 69
## 2 I1 0 56 58 60 71
## 3 SI2 0 56 58 59 71
## 4 SI1 0 56 58 59 75
## 5 VS2 0 56 58 59 72
## 6 VS1 0 56 58 59 71
## 7 VVS2 0 56 58 59 71
## 8 VVS1 0 56 57 59 71
## 9 IF 0 56 57 59 68
The boxplot seems to show that the better the clarity the lower the value of table, but only slightly. The median table value drops from 59% for I2 to 57% for IF. The IQR is 4% for I2 and I1 and 3% for all other grades of clarity.
There doesn’t seem to be much relationship here, except that the Ideal cut diamonds, the majority of diamonds in this dataset, have a much tighter distribution of Table than either the V.Good or Good cut diamonds.
## cut table.0% table.25% table.50% table.75% table.100%
## 1 Good 0 57 59 62 75
## 2 V.Good 0 56 58 60 71
## 3 Ideal 0 56 57 59 68
The boxplot seems to show that the Ideal diamonds have a narrower range of table values and a slightly lower median table value. The median table value is 59% for Good, 58% for V.Good and 57% for Ideal. The IQR goes from 5% to 4% to 3% as we get better grades for cut.
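One way to produce per-group quantile tables of this shape is `aggregate()` with `quantile` as the summary function. This is a sketch on hypothetical toy values and is not necessarily how the tables above were generated.

```r
# Hypothetical toy data: table percentages for two of the three cut grades.
toy <- data.frame(
  cut = factor(
    c("Good", "Good", "Good", "Good", "Ideal", "Ideal", "Ideal", "Ideal"),
    levels = c("Good", "V.Good", "Ideal")
  ),
  table = c(57, 59, 62, 60, 56, 57, 59, 58)
)

# Five-number quantiles per cut grade (columns print as table.0%, table.25%, ...).
q <- aggregate(table ~ cut, data = toy, FUN = quantile)

# Interquartile range per cut grade.
iqr <- aggregate(table ~ cut, data = toy, FUN = IQR)

q
iqr
```

The same call works for the carat tables by swapping the formula, for example `carat ~ color`.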
This plot demonstrates how closely the Estimated Mass models the actual mass. I was struck by the anomalous behaviour around the integral carat values, though, and will write more about that below.
## cut carat.0% carat.25% carat.50% carat.75% carat.100%
## 1 Good 0.20 0.43 0.70 1.00 6.52
## 2 V.Good 0.20 0.46 0.71 1.05 7.17
## 3 Ideal 0.20 0.41 0.83 1.50 7.09
This side-by-side comparison of the distribution of carat sizes by cut shows that three quarters of Good cut diamonds are 1 carat or less (the 75th percentile is 1.00 carats). There is a bit more variation for V.Good, but not as much as for Ideal. The IQR goes from 0.57 carats for Good to 0.59 for V.Good to 1.09 for Ideal. I’m not sure if this reflects bigger diamonds being given to more experienced people who can generate better cuts, or some other influence.
## color carat.0% carat.25% carat.50% carat.75% carat.100%
## 1 L 0.23 0.63 1.01 1.81 7.17
## 2 K 0.20 0.70 1.10 2.03 7.05
## 3 J 0.20 0.70 1.03 1.74 6.51
## 4 I 0.20 0.67 1.01 1.71 5.77
## 5 H 0.20 0.51 1.00 1.54 5.71
## 6 G 0.20 0.42 0.76 1.30 5.13
## 7 F 0.20 0.40 0.70 1.10 5.16
## 8 E 0.20 0.36 0.58 1.01 4.20
## 9 D 0.20 0.35 0.54 1.01 4.01
This side-by-side comparison of the distribution of carat sizes by color shows that the D-grade (completely colorless) diamonds are more likely to be smaller. As you move along the color scale the distribution becomes less peaked. The four color grades I, J, K and L are much more spread out. With the exception of L grade the median carat size falls as the color grade gets better from a peak of 1.10 carats for K grade to 0.54 carats for D grade. Similarly, with the exception of L, the IQR computed from the quartiles above falls from a peak of 1.33 carats for K grade to a minimum of 0.65 carats for E grade, with a very slight widening for D grade (0.66 carats).
## clarity carat.0% carat.25% carat.50% carat.75% carat.100%
## 1 I2 0.29 0.51 0.70 1.02 4.50
## 2 I1 0.21 0.50 0.70 1.00 5.71
## 3 SI2 0.20 0.55 0.92 1.43 7.05
## 4 SI1 0.20 0.50 0.81 1.50 7.09
## 5 VS2 0.20 0.43 0.80 1.50 7.17
## 6 VS1 0.20 0.41 0.80 1.50 6.56
## 7 VVS2 0.20 0.40 0.70 1.23 5.92
## 8 VVS1 0.20 0.39 0.62 1.09 5.06
## 9 IF 0.20 0.38 0.77 1.23 6.03
This side-by-side comparison of the distribution of carat sizes by clarity shows less variability than the previous two comparisons (for cut and color). It’s difficult to make any generalizations based on these plots. The median carat size peaks at 0.92 for SI2 grade and then falls as the clarity gets better, reaching 0.62 carats for VVS1 before rising again to 0.77 carats for IF. The IQR is greatest for the middle ranges of clarity with a peak of 1.09 carats for VS1 and a minimum of 0.50 carats for I1.
The three “lesser” Cs – cut, color and clarity – all seem to influence price, but their influence is only visible when looking at diamonds of a given carat weight. Of the three, cut seems to be the most ambiguous, whereas clarity and color seem strongly related to price for a given carat weight.
I thought the carat size by color relationship was interesting and a bit mysterious. I can’t think of any particular reason why the least coloured diamonds would have distributions that skew to the small size, whereas the most coloured diamonds seemed to have wider distributions of size.
The strongest relationship is that between carat and price. That being said there was still significant variation at a given carat size, as mentioned before – anywhere from four to eight-fold difference between the cheapest and most expensive diamonds in a given size.
This plot shows that the most common price per carat is around $3000 and that as the diamonds get more expensive on a price per carat basis the proportion of Ideal cut diamonds seems to increase.
I think the best insight from this plot is the effect of color on the price curve. The further right (toward the perfectly colorless end) the steeper the curve. In order to get into the really high priced diamonds in the L or K color grade the size has to be truly massive (5 or 6 carats), whereas there are D and E grade diamonds that are over $64,000 and only about 3 carats.
This plot is similar to the previous one, but the relationship doesn’t seem quite as strong, perhaps just because of the small numbers of diamonds with poor clarity grades.
This plot shows a non-linear relationship. We know that there are many more lower priced diamonds, and given the nature of the non-linear relationship a log transformation on the price seems reasonable. It also shows that Good diamonds are not consistently the cheapest for any given carat size. Some even appear to be among the most expensive, albeit not that often.
This plot shows a non-linear relationship even after transforming the price variable to a log scale. Since carat and volume are almost perfectly correlated it seems reasonable to try a cube-root transformation on carat to approximate a single dimension characteristic of the diamonds. Perhaps consumers perceive the size of the diamonds by a single dimension more readily than the volume.
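The effect of the cube-root transformation can be sketched by fitting the same log price against raw carat and against its cube root. The data here are simulated from a cube-root relationship with arbitrary, hypothetical coefficients (1.0 and 2.8), so the cube-root fit looks better by construction; the block only shows the mechanics, not a result about the real dataset.

```r
set.seed(42)
# Simulated data (hypothetical coefficients 1.0 and 2.8, chosen only for illustration).
carat <- seq(0.2, 3, by = 0.1)
logprice <- 1.0 + 2.8 * carat^(1/3) + rnorm(length(carat), sd = 0.05)
sim <- data.frame(carat = carat, price = 10^logprice)

# Same response, two candidate predictors: raw carat and its cube root.
fit_raw  <- lm(log10(price) ~ carat, data = sim)
fit_cbrt <- lm(log10(price) ~ I(carat^(1/3)), data = sim)

summary(fit_raw)$r.squared
summary(fit_cbrt)$r.squared
coef(fit_cbrt)
```

The `I()` wrapper is needed so that `^` is treated as arithmetic inside the model formula rather than as a formula operator.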
In this plot we finally see what looks like the makings of a linear model, although as the carat size gets bigger it does seem to be deviating from a linear model. I changed the palette because I was finding it hard to see the different cuts in the previous plots. Now it’s possible to see that diamonds with a V.Good cut are sometimes just as expensive as identically sized diamonds with Ideal cut.
The same plot as before, but with the effect of Color highlighted. Color would appear to be a much more influential factor than cut.
Here we can see the same linear relationship, but a much more striking effect of clarity.
Carat size is clearly the driving force behind diamond prices, but for a given carat size there is quite a bit of variability based on cut, color and clarity.
Cut does not seem to be as important. While the Ideal cut diamonds do seem to be priced higher there are certainly many examples of V.Good cut diamonds that are priced as high or higher than identically sized Ideal cut diamonds.
The three dimensional measurements of each diamond (x, y and z) are provided in millimeters. A very rough assumption is that each diamond is a rectangular pyramid and so the formula for the volume of a pyramid can be used along with the density of diamonds to come up with an estimated mass in carats for each diamond based on the measurements. This plot shows how closely the estimated mass tracks the actual mass in carats. The actual mass is on average 5.4% bigger than the estimated mass.
However, the most surprising thing about this plot (and the reason I chose to include it as one of the three polished plots) is the regular deviations at integral carat measurements and to a certain extent at the half carat measurements (e.g. at 1.5 and 2.5 carats). I can think of two possible explanations for these deviations. It’s possible that with certain diamonds a different style of cut is chosen in order to ensure that the diamond hits a round number (e.g. 2 carats, or 3 carats) instead of being just under that round number (e.g. 1.99 carats, or 2.98 carats). Alternatively, there is some upward rounding taking place, an indication of fraud, and diamonds are being sold as 2 carat or 3 carat diamonds when in fact they are slightly smaller. This might be worth pursuing for an investigative journalist.
This plot demonstrates the remarkable diversity of diamond prices. The most common price per carat is about $2,500, but most of the diamonds in this dataset fall between about $1,000 per carat and over $30,000 per carat (the full range in the summary above is $525 to $49,519). It also shows that cut alone is not a very good determiner of how valuable a diamond is. While there appears to be a greater proportion of Ideal cut diamonds among the more expensive, it is still the case that some V.Good cut diamonds and even some Good cut diamonds have prices well in excess of $10,000 per carat.
This plot demonstrates quite clearly (no pun intended) how important clarity is in the price of a diamond. For any given carat size we can see that a large amount of the price variation is due to the clarity grade, with the best clarity grades being consistently more expensive. For example, at the 1.5 carat size the diamonds graded IF appear to be almost eight times more expensive than the diamonds graded I1 or I2.
One of the most striking features of the dataset is the effect of round numbers. In virtually every plot we can see the effect of round numbers for carat size, both integral values and half-integral values. As a side project I would like to do an investigation looking at diamonds that are very similar in all respects but slightly different in size, either just at an integral carat amount, or just below.
It was also very interesting to see how important clarity and color were while realizing that cut didn’t play as big a role, or at the very least played a much more ambiguous role.
One struggle I had was simply the time it took for plots to render. On my MacBook Air some of the plots took 15 seconds to render given that I was dealing with almost half a million diamonds. A more important struggle was that because so many of the diamonds had similar characteristics almost every plot had a large amount of overplotting. This made outliers and exceptions seem more common than they really were. Changing the transparency improved many of the plots, but when I was also using the color brewer scales it had the tendency to wash out the lighter end of the spectrum of colours making the plots hard to interpret.