I have been collecting data regarding novice programmers here at Kent for two years; quite a bit of interesting data, if I may say so. Now, I’ve been analyzing this data for some time using exploratory techniques, and relying on the fact that simple analyses, by-and-large, have had a lot to say.
As I look towards reporting some of this data, I feel I should do more than just say “Hey, look at that histogram!” Instead, I feel I should say “Hey, look at that histogram! That distribution of data seems to be most closely related to x, which might mean y in this context!”
Well, I feel I should say something. So, consider this first piece of data. In the Fall of 2003, I studied 62 students while they were programming in-class. A simple question would be to ask “How many times did they press the compile button?” This is figure 1.

Figure 1: Total number of compilation events per student, Fall 2003
As raw data, this is
> comp
[1] 55 58 69 40 68 30 27 71 64 17 90 107 52 61 8 7
[17] 38 18 32 55 15 50 54 35 90 61 17 196 52 38 46 23
[33] 15 64 19 40 50 154 37 55 58 14 114 81 25 57 29 180
[49] 60 28 112 35 149 36 49 19 30 44 15 34 60 35
Neither of these views are particularly meaningful; a histogram, however, makes a bit more sense for univariate data of this sort (Figure 2).

Figure 2: The distribution of figure 1.
When viewed this way, things make a bit more sense; we can see that the majority of the subjects from the first semester only generated around 30-70 compilation events, total. This means that, for some analyses, I will be most interested in a handful of students, and not the entire population.
However, I’m currently interested in this distribution; I don’t think it is normal. From what I’ve been able to discover (but have had a hard time learning how to characterize), this looks like it is in the Weibull family—a distribution most often encountered in failure rates (MTBF) and in some branches of medicine.
Here’s what I don’t like. If I ask R to fit this distribution for me, this is what I get:
> fitdistr(comp,"weibull")
shape scale
1.5076407 60.1879954
( 0.1395277) ( 5.3731381)
If these are the shape and scale parameters (alpha and beta) of a Weibull distribution, they sure don’t look like they fit my distribution at all (Figure 3). In fact, I seriously doubt the scale parameter, as seen below.

Figure 3: The Weibull distribution with alpha = 1.507 and beta = 60.187
Generated with ‘plot(function(x) dweibull(x, shape=1.507, scale=60.187))’
However, before I started using my head, I had already written a Scheme routine that placed my data into bins of size twenty. This gives me a dataset which is distributed similarly, but is ready for counting and plotting as a histogram. (Specifically, if I apply a transformation like(map (lambda (x) (add1 (remainder x 20)) comp) over the above array, you get my “binned” data. In R, this would be expressed as mapply(function(x) ((x %/% 20) + 1), comp).)
> binned
[1] 3 3 4 3 4 2 2 4 4 1 5 6 3 4 1 1 2 1 2 3 1 3
[23] 3 2 5 4 1 10 3 2 3 2 1 4 1 3 3 8 2 3 3 1 6 5
[45] 2 3 2 10 4 2 6 2 8 2 3 1 2 3 1 2 4 2
If I plot this data directly, I just get the same histogram as the one above back (well, close to the same; I’d have to tweak the size of the breaks in the histogram above, but it is the same distribution, effectively). What bothers me though is this:
> fitdistr(binned,"weibull")
shape scale
1.6953604 3.5683620
(0.1563823) (0.2835319)
If I plot a Weibull with these parameters against my original histogram, I get Figure 4. So, when I ask ‘fitdistr’ to estimate parameters on the raw data, I get a scale parameter that is completely whacked (at least, I think it is). If I ask ‘fitdistr’ to fit the data that I have already broken into bins, it gives me back a sensible set of parameters.

FIgure 4: The original data against a Weibull distribution with parameters derived from… a derived dataset.

Now, this whole exploration is just that: an exploration. Because I am somewhat lost in the wild here, as I only have one text that deals with these kinds of distributions (“Reliability Modelling” by Linda Wolstenholme), and it isn’t for statisticians of my level of competency. Yesterday, I ordered “Introductory Statistics with R” by Peter Dalgaard, which I probably should have ordered long ago… hopefully, it will lend some insights as well.
My point? I’m using an expert tool while wielding inexpert knowledge. (And, I know that a lot of questions to the R-help list fall into this category.) Between the two, I can’t answer the following questions, and would appreciate insights that anyone might have. (I’ve posted to the R-help mailing list with these questirons, in particular):
- From my reading of the various PDFs, on-line documentation, and mailing-list archives, my approach seems sane… except for the (to me) inconsistencies in parameter reporting of these two distributions. Does this look reasonable to you so far, or would you have done something differently?
- Is it appropriate to apply the transformation that I did to my data? Or, is it significantly ‘different’ once I have applied the transformation that I did? Is the source of my problem—an invalid transformation?
- If ‘fitdistr’ is an appropriate tool to use in this case, why does it seem to return unreasonable parameters on my raw data, but reasonable parameters on my ‘processed’ data? That is my interpretation anyway; it certainly looks like something is wrong with the parameters returned by ‘fitdistr’ on my raw data.
- This is a more general question, and one which I don’t have a good sense for: how do I report my use of R in my writing? Is it appropriate to include the code I used to generate my plots and results as part of an appendix? What is considered “due diligence” in this situation? Am I expected to provide proof of the appropriateness of the techniques I applied, or just reasonable argument citing relevant literature?
I think I can answer the last question for myself; however, the others are puzzling me a great deal.
The “comment” link for this weblog is email sent to my gmail account, ‘jadudm at gmail’. Any insights would be appreciated.
Update: A little bit of gardening later
From Andy at 3M:
Matt,
There is nothing wrong here. Â I rerun your example and got the same
parameters. Â Your only "problem" is that you let the density plot use the
default limits on x, which are not reasonable here since your data extends
to over 200. Â Try this:
hist(dd, freq=FALSE) Â #dd is your original data
xx <- seq(0, 300, by=.1)
yy <- dweibull(xx, shape=1.5079, scale=60.2139)
lines(xx, yy, col=2)
BTW, when you binned your data you also scaled it down by 20, so your scale
parameter changed accordingly.
Cheers,
Andy
Many thanks! This is the danger in using expert tools, but the benefit of a helpful community. Which reminds me—I should make a point of being a better Scheme citizen, and participate more on the lists I’m on.
Update: A little bit more gardening later
From Gabor at gmail:
Note that the gamma distribution seems to work well here. Â After running
Andy's code run this to see both superimposed on the same graph:
gparms <- fitdistr(dd, "gamma")[[1]]
yyg <- dgamma(xx, gparms[1], gparms[2])
lines(xx, yyg, col=3)
Well! If I plot just the results of what Andy sent along, I get Figure 5.

Figure 5: The result of Andy’s comments. What he said seems obvious,
in hindsight, but I wouldn’t have realized it.
Now, if we run Gabor’s code as well, we get Figure 6, and true enough, the gamma distribution does fit the data better.

Figure 6: The Weibull distribution (red, lower peak) and
gamma distribution (green, higher peak) overlaid on the same distribution.
Mathworld has this to say about the gamma distribution (note also the ESH):
A gamma distribution is a general type of statistical distribution that is related to the beta distribution and arises naturally in processes for which the waiting times between Poisson distributed events are relevant.
This is interesting, and good, because it opened my eyes a bit to other possibilities. In the end, I must admit—this is interesting. The sense-making is hard, though, and it’s where I’m most afraid about my reporting—what if I’m wrong? Or, more precisely, what if I draw conclusions that are not the best possible?
Eh. Just keep writing.
Thanks again to Andy and Gabor!