An Exploration of Diamond prices and characteristics by Brian Taylor

What is the structure of your dataset?

I started with the large dataset of diamond prices that Solomon Messing created for use in Lesson 6 of the Udacity course. I then subsetted it to keep only those diamonds with GIA certification (about 77% of them) and to remove any diamonds that had no value for price (roughly 500 of them).
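
A minimal sketch of this subsetting step in base R. The `raw` data frame and its values are invented stand-ins for the full dataset; only the name `diamondsb` is taken from the model output further down.

```r
# Hypothetical stand-in for the full dataset (values invented for illustration).
raw <- data.frame(
  carat = c(0.24, 0.31, 0.50, 1.01, 0.70),
  cert  = factor(c("GIA", "IGI", "GIA", "GIA", "EGL")),
  price = c(300, 450, NA, 5200, 2100)
)

# Keep only GIA-certified diamonds that have a recorded price.
diamondsb <- subset(raw, cert == "GIA" & !is.na(price))

nrow(diamondsb)       # rows that survive both filters
levels(diamondsb$cert) # unused factor levels are kept by subset()
```

Because `subset()` does not drop unused factor levels, a summary of `cert` after this step still lists the other certifiers, each with a count of zero.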

The dataset consists of 463066 observations of 15 variables, three of which I created (logprice, mass_estimate and price_per_carat). There are ten numeric variables: carat, table, depth, price, x, y, z, logprice, mass_estimate and price_per_carat. There are four ‘Factor’ variables: cut, color, clarity and cert. And finally there is one character variable, measurements, which is just a string representation of the x, y and z variables.

Here’s an explanation of what each of these variables means, and the nature of them in this particular dataset.

Carat - a carat is a measure of weight equivalent to 0.2 g. This equivalence has been a global standard since 1907. It is by far the most important feature in determining the value of a diamond. The diamonds in this dataset range from 0.2 carats to 7.17 carats, with a median value of 0.8 carats.

Cut - technically this is the ‘cut grade’ and not a reference to the style of cut, of which there are many. This is largely a subjective measurement, although there have been recent technological advancements to remove some of the subjectivity. All the diamonds in this dataset are one of three cut grades: Good, V.Good and Ideal, with almost 64% being Ideal.

Color - diamond colour grades are single letters ranging from D (perfectly colourless) to Z (light yellow). Strictly speaking this is a measure of saturation for yellow diamonds, and not hue. If a diamond is a colour like blue or pink it is not given a colour grade. There are nine different colour grades for diamonds in this dataset: D, E, F, G, H, I, J, K, and L. E is the most common colour grade in this dataset, with F and G fairly close behind, and L is the least common.

Clarity - this is a subjective judgment of a diamond’s internal characteristics, called inclusions, and its surface defects, called blemishes. Both affect the ‘sparkle’, or appearance of the diamond. The diamonds in this dataset range from IF (internally flawless), to I2. There are nine different grades with the most common being VS2 and SI1, roughly in the middle of the grade spectrum, and the least common being the worst grade, I2.

Table - this refers to the width of the top facet as a percentage of the total width of the diamond. It’s an important feature in the determination of the cut grade, but can vary with different styles, too. Table values for diamonds in this dataset range from 0 (i.e. no top facet) to 75, a very wide top facet, with the median being 58.

Depth - this refers to the depth of the diamond from the widest point to the bottom expressed as a percentage. In this dataset it ranges from 0 (I’m not sure if this is a missing value or the widest point is the bottom of this particular diamond) to 81.30, with a median of 62.

Cert - this refers to the certification. I’ve subsetted the original dataset to include only diamonds with a GIA certification so that the various grades are consistent. GIA refers to the Gemological Institute of America, based in Carlsbad, California.

Measurements - this is simply a string showing the x, y and z measurements.

Price - the minimum price of a diamond in this dataset is $300 and the maximum price is $99966. The median price is $3305. There is a strange gap in the prices of the diamonds – there are no diamonds at all between the amounts of $2471 and $2600. I don’t have a good explanation for this.
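
One way to locate a gap like this is to sort the distinct prices and look at the differences between neighbours. The sketch below uses a handful of invented prices around the two endpoints mentioned above; it illustrates the technique and is not output from the real data.

```r
# Invented prices clustered on either side of the gap described in the text.
price <- c(2465:2471, 2468, 2470, 2600:2606, 2601)

p    <- sort(unique(price))  # distinct prices in ascending order
gaps <- diff(p)              # distance from each price to the next one
i    <- which.max(gaps)      # position of the widest gap

c(lower = p[i], upper = p[i + 1], width = gaps[i])
```

On sparse data the widest gap will often sit in the thinly populated upper tail, so in practice it may make more sense to restrict the search to a price range where diamonds are dense.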

x, y and z - these are the volumetric measurements of the diamond in millimeters. It’s possible to estimate the volume of a diamond by assuming that it is roughly pyramid-shaped and using the formula V = xyz/3. Since diamonds have a density of 3.5 g per cubic cm we can compare this volume to the carat measure and see how good the formula is. It turns out that the formula appears to slightly underestimate the volume for most diamonds. The correlation between volume (as calculated with this formula) and carat weight is almost perfect (0.999), so in reality volume, as a separate variable, doesn’t add any new information.

Logprice - I created this variable by taking the log(base10) of price. The minimum value is 2.477 and the maximum value is 5.000, with a median of 3.519. This was a useful transformation as the vast majority of diamonds were lower priced. The third quartile of price is $11207, but the maximum price is $99966.

Mass_estimate - I created this variable by multiplying the x, y and z variables and then multiplying that by 7 and dividing by 1200. I started by using the formula for a rectangular pyramid, V = xyz/3, which gives us the volume in cubic millimeters. I then divided by 1000 to get cubic centimeters, multiplied by 3.5 to get grams and multiplied by 5 to get carats.
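
The chain of unit conversions can be checked step by step. This sketch uses the x, y and z values of the first three diamonds shown in the structure listing below, and confirms that the long form and the 7/1200 shortcut agree.

```r
# Dimensions in mm of the first three diamonds in the str() listing.
x <- c(4.09, 4.40, 4.01)
y <- c(4.10, 4.42, 4.03)
z <- c(2.41, 2.65, 2.47)

volume_mm3    <- x * y * z / 3     # rectangular pyramid, cubic millimeters
volume_cm3    <- volume_mm3 / 1000 # cubic centimeters
grams         <- volume_cm3 * 3.5  # density of diamond used in the text
mass_estimate <- grams * 5         # 1 carat = 0.2 g, so 5 carats per gram

round(mass_estimate, 3)            # 0.236 0.301 0.233

# The shortcut form: 3.5 * 5 / (3 * 1000) = 7 / 1200
all.equal(mass_estimate, x * y * z * 7 / 1200)
```

The three rounded values match the start of the mass_estimate column in the listing.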

Price_per_carat - I created this variable by dividing price by carat.
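
The two simpler derived variables, logprice and price_per_carat, can be reproduced the same way. The sketch below uses the price and carat values of the first five diamonds in the structure listing below.

```r
# Price and carat of the first five diamonds in the str() listing.
price <- c(300, 300, 300, 300, 300)
carat <- c(0.24, 0.31, 0.24, 0.30, 0.34)

logprice        <- log10(price)  # base-10 log of price
price_per_carat <- price / carat # dollars per carat

round(logprice, 2)      # 2.48 for each row
round(price_per_carat)  # 1250 968 1250 1000 882
```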

Univariate Plots Section

## 'data.frame':    463066 obs. of  15 variables:
##  $ carat          : num  0.24 0.31 0.24 0.3 0.34 0.2 0.29 0.22 0.2 0.24 ...
##  $ cut            : Factor w/ 3 levels "Good","V.Good",..: 2 2 3 1 1 2 3 3 2 3 ...
##  $ color          : Factor w/ 9 levels "L","K","J","I",..: 6 2 6 5 7 4 6 4 8 4 ...
##  $ clarity        : Factor w/ 10 levels "I3","I2","I1",..: 5 4 5 3 3 8 3 7 8 7 ...
##  $ table          : num  61 59 55 57 66 62 58 62 54 56 ...
##  $ depth          : num  58.9 60.2 61.3 62.2 55 59.1 61.4 59.6 62.9 62 ...
##  $ cert           : Factor w/ 9 levels "GIA","IGI","EGL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ measurements   : chr  "4.09 x 4.10 x 2.41" "4.40 x 4.42 x 2.65" "4.01 x 4.03 x 2.47" "4.21 x 4.24 x 2.63" ...
##  $ price          : num  300 300 300 300 300 301 301 301 301 301 ...
##  $ x              : num  4.09 4.4 4.01 4.21 4.75 3.79 4.25 3.9 3.72 3.98 ...
##  $ y              : num  4.1 4.42 4.03 4.24 4.61 3.82 4.31 3.93 3.75 4 ...
##  $ z              : num  2.41 2.65 2.47 2.63 2.57 2.25 2.63 2.33 2.35 2.47 ...
##  $ mass_estimate  : num  0.236 0.301 0.233 0.274 0.328 ...
##  $ logprice       : num  2.48 2.48 2.48 2.48 2.48 ...
##  $ price_per_carat: num  1250 968 1250 1000 882 ...
##      carat            cut             color          clarity     
##  Min.   :0.2000   Good  : 44509   E      :79774   VS2    :87557  
##  1st Qu.:0.4200   V.Good:122672   F      :76667   SI1    :86660  
##  Median :0.8000   Ideal :295885   G      :73475   VS1    :80016  
##  Mean   :0.9982                   D      :60245   SI2    :65856  
##  3rd Qu.:1.3100                   H      :58054   VVS2   :57294  
##  Max.   :7.1700                   I      :50843   VVS1   :49413  
##                                   (Other):64008   (Other):36270  
##      table           depth               cert        measurements      
##  Min.   : 0.00   Min.   : 0.00   GIA       :463066   Length:463066     
##  1st Qu.:56.00   1st Qu.:61.10   IGI       :     0   Class :character  
##  Median :58.00   Median :62.00   EGL       :     0   Mode  :character  
##  Mean   :57.67   Mean   :61.66   EGL USA   :     0                     
##  3rd Qu.:59.00   3rd Qu.:62.70   EGL Intl. :     0                     
##  Max.   :75.00   Max.   :81.30   EGL ISRAEL:     0                     
##                                  (Other)   :     0                     
##      price             x                y                z        
##  Min.   :  300   Min.   : 0.150   Min.   : 1.000   Min.   : 0.46  
##  1st Qu.: 1150   1st Qu.: 4.690   1st Qu.: 4.830   1st Qu.: 3.01  
##  Median : 3305   Median : 5.680   Median : 5.880   Median : 3.72  
##  Mean   : 8682   Mean   : 5.885   Mean   : 6.071   Mean   : 3.93  
##  3rd Qu.:11207   3rd Qu.: 6.840   3rd Qu.: 7.040   3rd Qu.: 4.54  
##  Max.   :99966   Max.   :12.900   Max.   :12.970   Max.   :12.62  
##                  NA's   :1453     NA's   :1473     NA's   :2255   
##  mass_estimate       logprice     price_per_carat
##  Min.   :0.0058   Min.   :2.477   Min.   :  525  
##  1st Qu.:0.4019   1st Qu.:3.061   1st Qu.: 2731  
##  Median :0.7389   Median :3.519   Median : 4329  
##  Mean   :0.9451   Mean   :3.572   Mean   : 6059  
##  3rd Qu.:1.2480   3rd Qu.:4.049   3rd Qu.: 7913  
##  Max.   :7.0759   Max.   :5.000   Max.   :49519  
##  NA's   :3030
## 
##  Pearson's product-moment correlation
## 
## data:  carat and mass_estimate
## t = 16767, df = 460030, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9991781 0.9991876
## sample estimates:
##       cor 
## 0.9991829
## 
## Calls:
## m1: lm(formula = carat ~ mass_estimate, data = diamondsb[diamondsb$mass_estimate > 
##     0, ])
## 
## ===============================
##   (Intercept)       0.002***   
##                    (0.000)     
##   mass_estimate     1.054***   
##                    (0.000)     
## -------------------------------
##   R-squared               1.0  
##   adj. R-squared          1.0  
##   sigma                   0.0  
##   F               281148696.8  
##   p                       0.0  
##   Log-likelihood     961978.6  
##   Deviance              411.2  
##   AIC              -1923951.1  
##   BIC              -1923918.0  
##   N                  460036    
## ===============================

The distribution of carat masses

I used a few different binwidths and finally subsetted the data to eliminate the largest 1% of the diamonds. It’s quite clear that cutting decisions are made to hit certain carat values as there are pronounced spikes, especially at round numbers as the diamonds get bigger, e.g. 1 carat, 1.5 carats, 2 carats, etc.
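
A minimal sketch of the trimming and binning, assuming a 0.05 carat binwidth (one of several that could be tried). The weights are simulated, with extra diamonds deliberately placed at 1, 1.5 and 2 carats, so the spikes it finds are built in and say nothing about the real data.

```r
set.seed(1)

# Simulated carat weights: a uniform background, invented spikes at round
# numbers, and one very large stone standing in for the top 1%.
carat <- c(round(runif(500, 0.2, 2.5), 2),
           rep(c(1, 1.5, 2), times = c(60, 30, 20)),
           7.17)

# Drop the largest 1% of diamonds.
cutoff  <- quantile(carat, 0.99)
trimmed <- carat[carat <= cutoff]

# Bin the remaining weights without drawing the plot.
breaks <- seq(0, max(trimmed) + 0.05, by = 0.05)
h      <- hist(trimmed, breaks = breaks, plot = FALSE)

# Midpoints of the three fullest bins.
head(h$mids[order(h$counts, decreasing = TRUE)], 3)
```

Calling `hist()` without `plot = FALSE` draws the histogram itself; repeating the call with a different `by` value is a quick way to compare binwidths.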

The distribution of prices

The vast majority of the diamonds are under $10,000, but the maximum price is almost $100,000 so the histogram is a bit easier to digest after applying a log transformation to the price scale. We can now see a few spikes, the first being just under $1,000 (perhaps an important price point) and the second notable spike at around $15,000. There is also the gap, as noted above, between $2471 and $2600, for which I have no explanation.

The distribution of cuts

More than half the diamonds are of Ideal cut.

The distribution of color

The distribution of color follows an almost normal looking curve, with the intermediate colors E, F and G being the highest and the values at the extremes dropping off.

The distribution of clarity

The distribution of clarity also follows an almost normal looking curve, with the intermediate clarities of VS2 and SI1 being the highest and the extremes dropping off.

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

The dataset contains measurements and grades of 463,066 diamonds. The primary features are price and the “four Cs” – carat, cut, color and clarity.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

There are also a number of other measurements including the three volume dimensions in millimeters and some diamond design measurements – table and depth.

Did you create any new variables from existing variables in the dataset?

I created a variable to estimate the mass of the diamond based on the three volume dimensions, using a very rough assumption that a diamond is a rectangular pyramid and using the density of diamonds. I created a price per carat variable and I also created a scaled price variable by taking the log (base 10) of price.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A small number of the volume dimensions were missing, so I eliminated those diamonds when calculating the correlation coefficient for my estimated mass and the carat mass, and for the linear model I used to see how close the two variables were.
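
A sketch of how the missing dimensions can be excluded before computing the correlation and fitting the linear model. The first six diamonds from the structure listing are used, with one z value set to NA purely for illustration.

```r
# First six diamonds from the str() listing; the NA is invented here
# to stand in for a missing dimension.
d <- data.frame(
  carat = c(0.24, 0.31, 0.24, 0.30, 0.34, 0.20),
  x     = c(4.09, 4.40, 4.01, 4.21, 4.75, 3.79),
  y     = c(4.10, 4.42, 4.03, 4.24, 4.61, 3.82),
  z     = c(2.41, 2.65, 2.47, 2.63, 2.57, NA)
)
d$mass_estimate <- d$x * d$y * d$z * 7 / 1200

# A missing dimension makes mass_estimate NA, so drop those rows.
complete <- d[!is.na(d$mass_estimate), ]

ct <- cor.test(complete$carat, complete$mass_estimate) # Pearson by default
m1 <- lm(carat ~ mass_estimate, data = complete)

ct$estimate
coef(m1)
```

With only five complete rows the numbers here are not meaningful; the point is the order of operations.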

Bivariate Plots Section

Price vs. Carat

There is clearly a relationship between price and size, but it appears non-linear. It also seems that the larger the diamond the greater the variation in price. Most of the diamonds under 1 carat cluster in a fairly tight price band, but once we get to 2 carats the price varies from under $10,000 to $100,000.

Carat vs. Table

This plot certainly shows that the bigger the diamond the less variability in the table value. Also, table values of 0 or close to 0 are not common for diamonds bigger than about 2 carats. I don’t really know what a table value of 0 means, whether it’s a different style of cut or perhaps is a missing value. More investigation here is necessary.

Clarity vs. Table

In reading about table, I thought it might influence clarity, but this plot doesn’t seem to indicate that there is much of a relationship.

##   clarity table.0% table.25% table.50% table.75% table.100%
## 1      I2        0        57        59        61         69
## 2      I1        0        56        58        60         71
## 3     SI2        0        56        58        59         71
## 4     SI1        0        56        58        59         75
## 5     VS2        0        56        58        59         72
## 6     VS1        0        56        58        59         71
## 7    VVS2        0        56        58        59         71
## 8    VVS1        0        56        57        59         71
## 9      IF        0        56        57        59         68

Boxplot of table vs. clarity

The boxplot seems to show that the better the clarity the lower the value of table, but only slightly. The median table value drops from 59% for I2 to 57% for IF. The IQR is 4% for I2 and I1, and 3% for all other grades of clarity.

Cut vs. Table

There doesn’t seem to be much relationship here, except that the Ideal cut diamonds, the majority of diamonds in this dataset, have a much tighter distribution of Table than either the V.Good or Good cut diamonds.

##      cut table.0% table.25% table.50% table.75% table.100%
## 1   Good        0        57        59        62         75
## 2 V.Good        0        56        58        60         71
## 3  Ideal        0        56        57        59         68

Boxplot of table vs. cut

The boxplot seems to show that the Ideal diamonds have a narrower range of table values and a slightly lower median table value. The median table value is 59% for Good, 58% for V.Good and 57% for Ideal. The IQR goes from 5% to 4% to 3% as we get better grades for cut.
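
Grouped quantile tables like the ones above can be produced with `aggregate()` and `quantile()`, which yields columns named `table.0%`, `table.25%` and so on. The sketch below is an assumption about the general approach, run on a few invented table values.

```r
# Invented table percentages for two cut grades.
d <- data.frame(
  cut   = factor(c("Good", "Good", "Good", "Good",
                   "Ideal", "Ideal", "Ideal", "Ideal"),
                 levels = c("Good", "V.Good", "Ideal")),
  table = c(57, 59, 62, 64, 56, 57, 58, 59)
)

# Five-number quantiles of table within each cut grade.
q <- aggregate(table ~ cut, data = d, FUN = quantile)

# Interquartile range within each cut grade.
iqr <- aggregate(table ~ cut, data = d, FUN = IQR)

q
iqr
```

Grades with no rows (V.Good in this toy example) are simply left out of the result.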

Carat vs. Estimated Mass

This plot demonstrates how closely the Estimated Mass models the actual mass. I was struck by the anomalous behaviour around the integral carat values, though, and will write more about that below.

##      cut carat.0% carat.25% carat.50% carat.75% carat.100%
## 1   Good     0.20      0.43      0.70      1.00       6.52
## 2 V.Good     0.20      0.46      0.71      1.05       7.17
## 3  Ideal     0.20      0.41      0.83      1.50       7.09

Carat size by cut

This side-by-side comparison of the distribution of carat sizes by cut shows that the vast majority of Good cut diamonds are 1 carat or less (the 75th percentile is 1.00 carats). There is a bit more variation for V.Good, but not as much as for Ideal. The IQR goes from 0.57 carats for Good to 0.59 for V.Good to 1.09 for Ideal. I’m not sure if this reflects bigger diamonds being given to more experienced people who can generate better cuts, or some other influence.

##   color carat.0% carat.25% carat.50% carat.75% carat.100%
## 1     L     0.23      0.63      1.01      1.81       7.17
## 2     K     0.20      0.70      1.10      2.03       7.05
## 3     J     0.20      0.70      1.03      1.74       6.51
## 4     I     0.20      0.67      1.01      1.71       5.77
## 5     H     0.20      0.51      1.00      1.54       5.71
## 6     G     0.20      0.42      0.76      1.30       5.13
## 7     F     0.20      0.40      0.70      1.10       5.16
## 8     E     0.20      0.36      0.58      1.01       4.20
## 9     D     0.20      0.35      0.54      1.01       4.01

Carat size by color

This side-by-side comparison of the distribution of carat sizes by color shows that the D-grade (completely colorless) diamonds are more likely to be smaller. As you move along the color scale the distribution becomes less peaked. The four color grades I, J, K and L are much more spread out. With the exception of L grade, the median carat size falls as the color grade gets better, from a peak of 1.10 carats for K grade to 0.54 carats for D grade. Similarly, with the exception of L, the IQR (the 75th percentile minus the 25th percentile in the table above) falls from a peak of 1.33 carats for K grade to a minimum of 0.65 carats for E grade, with D grade essentially the same at 0.66 carats.

##   clarity carat.0% carat.25% carat.50% carat.75% carat.100%
## 1      I2     0.29      0.51      0.70      1.02       4.50
## 2      I1     0.21      0.50      0.70      1.00       5.71
## 3     SI2     0.20      0.55      0.92      1.43       7.05
## 4     SI1     0.20      0.50      0.81      1.50       7.09
## 5     VS2     0.20      0.43      0.80      1.50       7.17
## 6     VS1     0.20      0.41      0.80      1.50       6.56
## 7    VVS2     0.20      0.40      0.70      1.23       5.92
## 8    VVS1     0.20      0.39      0.62      1.09       5.06
## 9      IF     0.20      0.38      0.77      1.23       6.03

Carat size by clarity

This side-by-side comparison of the distribution of carat sizes by clarity shows less variability than the previous two comparisons (for cut and color). It’s difficult to make any generalizations based on these plots. The median carat size peaks at 0.92 for SI2 grade and then falls as the clarity gets better, down to 0.62 carats for VVS1, before rising again to 0.77 carats for IF. The IQR is greatest for the middle ranges of clarity with a peak of 1.09 carats for VS1 and a minimum of 0.50 carats for I1.
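
The IQR figures can be recomputed directly from the quartile table above. This sketch copies the 25%, 50% and 75% columns from that table and takes the difference.

```r
# Quartiles of carat by clarity, copied from the table above.
clarity <- c("I2", "I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")
q25 <- c(0.51, 0.50, 0.55, 0.50, 0.43, 0.41, 0.40, 0.39, 0.38)
q50 <- c(0.70, 0.70, 0.92, 0.81, 0.80, 0.80, 0.70, 0.62, 0.77)
q75 <- c(1.02, 1.00, 1.43, 1.50, 1.50, 1.50, 1.23, 1.09, 1.23)

# Interquartile range per clarity grade.
iqr <- setNames(round(q75 - q25, 2), clarity)
iqr

names(which.max(iqr))       # grade with the widest IQR: "VS1"
names(which.min(iqr))       # grade with the narrowest IQR: "I1"
clarity[which.max(q50)]     # grade with the largest median: "SI2"
```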

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The three “lesser” Cs – cut, color and clarity – all seemed to influence price, but their influence is only visible when looking at diamonds of a given carat weight. Of the three, cut seems to be the most ambiguous, whereas clarity and color seem strongly related to price for a given carat weight.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I thought the carat size by color relationship was interesting and a bit mysterious. I can’t think of any particular reason why the least coloured diamonds would have distributions that skew to the small size, whereas the most coloured diamonds seemed to have wider distributions of size.

What was the strongest relationship you found?

The strongest relationship is that between carat and price. That being said there was still significant variation at a given carat size, as mentioned before – anywhere from four to eight-fold difference between the cheapest and most expensive diamonds in a given size.

Multivariate Plots Section

Price per Carat distinguished by Cut

This plot shows that the most common price per carat is around $3000 and that as the diamonds get more expensive on a price per carat basis the proportion of Ideal cut diamonds seems to increase.

Price vs. Carat faceted by Color and Cut

I think the best insight from this plot is the effect of color on the price curve. The further right (toward the perfectly colorless end) the steeper the curve. In order to get into the really high priced diamonds in the L or K color grade the size has to be truly massive (5 or 6 carats), whereas there are D and E grade diamonds that are over $64,000 and only about 3 carats.

Price vs. Carat faceted by Clarity and Cut

This plot is similar to the previous one, but the relationship doesn’t seem quite as strong, perhaps just because of the small numbers of diamonds with poor clarity grades.

Price vs. Carat distinguished by Cut

This plot shows a non-linear relationship. We know that there are many more lower priced diamonds, and given the nature of the non-linear relationship a log transformation on the price seems reasonable. It also shows that Good diamonds are not consistently the cheapest for any given carat size. Some even appear to be among the most expensive, albeit not that often.

Log(price) vs. Carat distinguished by Cut

This plot shows a non-linear relationship even after transforming the price variable to a log scale. Since carat and volume are almost perfectly correlated it seems reasonable to try a cube-root transformation on carat to approximate a single dimension characteristic of the diamonds. Perhaps consumers perceive the size of the diamonds by a single dimension more readily than the volume.

Log(price) vs. Cube root(carat) distinguished by Cut

In this plot we finally see what looks like the makings of a linear model, although as the carat size gets bigger it does seem to be deviating from a linear model. I changed the palette because I was finding it hard to see the different cuts in the previous plots. Now it’s possible to see that diamonds with a V.Good cut are sometimes just as expensive as identically sized diamonds with Ideal cut.
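
The pair of transformations can be sketched as a linear model of log10(price) on the cube root of carat. The data below are simulated from an invented relationship (the intercept 1.2, slope 2.4 and noise level are assumptions made for illustration), so the fitted coefficients say nothing about real diamond prices.

```r
set.seed(42)

# Simulated carat weights and prices from an invented log-linear relationship.
carat <- round(runif(200, 0.2, 3), 2)
price <- 10^(1.2 + 2.4 * carat^(1/3) + rnorm(200, sd = 0.1))

# Linear model on the transformed scales.
fit <- lm(log10(price) ~ I(carat^(1/3)))

coef(fit)               # should land near the invented 1.2 and 2.4
summary(fit)$r.squared  # share of variance explained on the log scale
```

The `I()` wrapper is needed so that `^` is treated as arithmetic rather than as a formula operator.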

Log(price) vs. Cube root(carat) distinguished by Color

The same plot as before, but with the effect of Color highlighted. Color would appear to be a much more influential factor than cut.

Log(price) vs. Cube root(carat) distinguished by Clarity

Here we can see the same linear relationship, but a much more striking effect of clarity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Carat size is clearly the driving force behind diamond prices, but for a given carat size there is quite a bit of variability based on cut, color and clarity.

Were there any interesting or surprising interactions between features?

Cut does not seem to be as important. While the Ideal cut diamonds do seem to be priced higher there are certainly many examples of V.Good cut diamonds that are priced as high or higher than identically sized Ideal cut diamonds.


Final Plots and Summary

Plot One

Description One

The three dimensional measurements of each diamond (x, y and z) are provided in millimeters. A very rough assumption is that each diamond is a rectangular pyramid and so the formula for the volume of a pyramid can be used along with the density of diamonds to come up with an estimated mass in carats for each diamond based on the measurements. This plot shows how closely the estimated mass tracks the actual mass in carats. The actual mass is on average 5.4% bigger than the estimated mass.

However, the most surprising thing about this plot (and the reason I chose to include it as one of the three polished plots) is the regular deviations at integral carat measurements and to a certain extent at the half carat measurements (e.g. at 1.5 and 2.5 carats). I can think of two possible explanations for these deviations. It’s possible that with certain diamonds a different style of cut is chosen in order to ensure that the diamond hits a round number (e.g. 2 carats, or 3 carats) instead of being just under that round number (e.g. 1.99 carats, or 2.98 carats). Alternatively, there is some upward rounding taking place, an indication of fraud, and diamonds are being sold as 2 carat or 3 carat diamonds when in fact they are slightly smaller. This might be worth pursuing for an investigative journalist.
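
One possible way to quantify these deviations is to compare the ratio of stated carat weight to estimated mass for diamonds sitting exactly on an integral carat value against those just off it. The rows below are entirely invented to show the mechanics; they are not drawn from the dataset and do not demonstrate that the effect exists.

```r
# Invented rows: stated carat weight and the mass estimated from x, y and z.
d <- data.frame(
  carat         = c(0.98, 0.99, 1.00, 1.00, 1.01,
                    1.98, 1.99, 2.00, 2.00, 2.01),
  mass_estimate = c(0.930, 0.939, 0.935, 0.938, 0.958,
                    1.879, 1.888, 1.880, 1.884, 1.907)
)

# Ratio of stated to estimated mass; the fitted slope reported above is 1.054.
d$ratio <- d$carat / d$mass_estimate

# Flag diamonds whose stated weight is exactly a whole number of carats.
d$at_integer <- abs(d$carat - round(d$carat)) < 1e-8

# Mean ratio for diamonds off and on an integral carat value.
tapply(d$ratio, d$at_integer, mean)
```

A consistently higher ratio at the integral values would be compatible with either explanation above, so this comparison alone cannot separate a change in cut style from upward rounding.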

Plot Two

Description Two

This plot demonstrates the remarkable diversity of diamond prices. The most common price per carat is about $2,500, but the diamonds in this dataset vary from about $1000 per carat to over $30,000 per carat. It also shows that cut alone is not a very good determiner of how valuable a diamond is. While there appears to be a greater proportion of Ideal cut diamonds among the more expensive it is still the case that some V.Good cut diamonds and even some Good cut diamonds have prices well in excess of $10,000 per carat.

Plot Three

Description Three

This plot demonstrates quite clearly (no pun intended) how important clarity is in the price of a diamond. For any given carat size we can see that a large amount of the price variation is due to the clarity grade, with the best clarity grades being consistently more expensive. For example, at the 1.5 carat size the diamonds graded IF appear to be almost eight times more expensive than the diamonds graded I1 or I2.


Reflection

One of the most striking features of the dataset is the effect of round numbers. In virtually every plot we can see the effect of round numbers for carat size, both integral values and half-integral values. As a side project I would like to do an investigation looking at diamonds that are very similar in all respects but slightly different in size, either just at an integral carat amount, or just below.

It was also very interesting to see how important clarity and color were while realizing that cut didn’t play as big a role, or at the very least played a much more ambiguous role.

One struggle I had was simply the time it took for plots to render. On my MacBook Air some of the plots took 15 seconds to render given that I was dealing with almost half a million diamonds. A more important struggle was that because so many of the diamonds had similar characteristics almost every plot had a large amount of overplotting. This made outliers and exceptions seem more common than they really were. Changing the transparency improved many of the plots, but when I was also using the color brewer scales it had the tendency to wash out the lighter end of the spectrum of colours making the plots hard to interpret.