Archive for August, 2008

Making Excel Graphics Clear

August 31, 2008 Leave a comment

I think it was my undergraduate adviser who first introduced me to the work of Edward Tufte and basic concepts of the graphical display of data. He was also the first to convince me that MS Excel was just about the worst program with which to generate scientific data graphics. The noise: information ration was quite high. I learned how to change the background colors, change the axes and gridlines etc.. in compliance with good graphical design.

Recently, a discussion of excel graphics appeared on Tufte’s discussion website, and there was a link to a visual basic macro to clean up bad excel graphics! In a post at Juice Analytics from April 2006, entitled “Fixing Excel charts: Or why cast stones when you can pick up a hammer” there is a link to an excel add-in that cleans up standard charts, allowing you to save a great deal of time on doing these things one at a time.

This is especially useful for older versions of excel (pre 2007) that have especially awful graphics. Excel 2007 (that I am currently using) has a standard graphic that isn’t all that bad, not great yet, but much better than previous versions of excel. I was going to to do a before and after image of an excel chart using this add-on, but with 2007, they were pretty much the same. So if you use an older version of excel and want to try it out, send me a before and after shot and I will post it.

Categories: statistics

R: no nested FOR loops

August 14, 2008 7 comments

I am an avid user of the R project for statistical computing and use it for much of my computational and statistical analyses. I will be posting tips, tricks and hints for using R on this blog, especially when I can’t find the information elsewhere on the internet.

One of the main benefits of R is that it is vectorized, so you can simplify your code writing a great deal using vectorized functions. For example you don’t have to write a loop to add 5 to every element in a vector X. You can just write “X + 5”.

The R mantra is to minimize the use of loops wherever possible as they are inefficient and use a great deal of memory. I have been pretty good about this except for one situation – Running a simulation over a parameter space defined by a few vectors. For instance, anyone who is familiar with writing code in C or some other non-vectorized language will recognize the following pseudocode for running a simulation over a parameter space of 4 variables with nreps replications for each parameter set:

for (i in 1:length(m)) {
for (j in 1:length(sde)) {
for (k in 1:length(dev.coef)) {
for (l in 1:length(ttd)) {
for (rep in 1:nreps) {
Run the simulation

(my apologies for the bad formatting, blogger keeps taking out the spaces I put in. If anyone knows how to easily get the html version of my text to treat spaces as spaces, please, let me know).

This runs the simulation over a parameter space that is defined in a vector for each parameter. This is how I have been writing my R code for years.

I FINALLY FOUND OUT HOW TO VECTORIZE THIS PROCESS! I am not sure why it took me so long, but at long last I can do all of this in one command, using the apply and expand.grid functions. The apply command will apply a function sequentially to data taken from rows of an array and expand.grid takes factors and combines them into an array.

For example, say my parameter space is defined by:

> m <- c(1,2,3,4)
> n <- c("m","f")
> o <- c(12,45,34)

I can call expand.grid to get all of the combinations put into rows in an array:

> expand.grid(m,n,o)
Var1 Var2 Var3
1 1 m 12
2 2 m 12
3 3 m 12
4 4 m 12
5 1 f 12
6 2 f 12
7 3 f 12
8 4 f 12
9 1 m 45
10 2 m 45
11 3 m 45
12 4 m 45
13 1 f 45
14 2 f 45
15 3 f 45
16 4 f 45

then I can run my simulation command on the rows of this array by using apply() with MARGIN = 1 as a parameter telling apply to use the rows of the array:

So in my example, I have replaced the following ugly, inefficient code full of nested for loops (summarized above) with the following single command:

out <- apply(expand.grid(m,sde,dev.coef,ttd,reps), MARGIN = 1,
function(x) one.generation(N, x[1],x[2],x[3],x[4]))

Fantastic! I love it when I figure out how to do things more succinctly in R. I hopefully will regularly put in R tidbits on this blog in the hopes that some ‘ignorant’ R programmer like myself will find this post the first time he searches the internet for it. It may have taken me years to finally figure it out, but that doesn’t mean it has to take that long for everyone else! 🙂

Categories: opensource, R, statistics

Demystifying Statistics: On the interpretation of ANOVA effects

August 12, 2008 5 comments

One of the statistical concepts that is very difficult for many people to grasp, yet critically important for an understanding of statistics, is the interpretation of significant effects in an Analysis of Variance (ANOVA). In this post, I will use a graphical approach to describe how to interpret effects from a two-way factorial ANOVA. I will not delve into the design, implementation, or computation involved in such ANOVAs.

Suppose we have an experiment where we are measuring a biological trait from each of two species (Sp1, Sp2) raised in each of two environments (Env1, Env2). We set up a two-way ANOVA and out pops an ANOVA table with four lines, one each for the species effect, the environment effect, the species-by-environment interaction effect, and one for the residuals (which I hope to discuss in detail in a later post, and will not talk about here.

ANOVAs are designed to look for differences in mean values accross different groupings of the data. In the case above, it is looking for a difference in mean trait values between the two species and between the two environments, while simultaneously testing for their independence (with the interaction term). So in order to look at a graph and think about ANOVAs, you have to think about mentally picturing the various means.

The first two effects are pretty straightforward to interpret. Suppose the analysis showed a significant Species effect. This means that there is a significant difference among species in the mean trait measure when pooled accross environments. Maybe that Sp1 always has a bigger eyeball width than Sp2, or something like that. Same thing with a significant environment effect; maybe both species do better in tropical vs. desert conditions.

It is the interaction term that is the hard term to understand for many people. A significant interaction term signifies a lack of independence of the other two variables, in this case species and environment. In this example, an interaction term may imply that there is environmental preference among species, with Sp1, say, preferring Env1 and Sp preferring Env2. Another way to think about significant interaction terms is in an Analysis of CoVAriance (ANCOVA) setting, where the two slopes are different (see below).

I find it easiest to think about these things graphically so below you will find the 7 different qualitative results for significant effects in a two-way ANOVA. These are the pics that always pop into my head when I think about ANOVAs. They are interaction plots of the two variables.

(Disclaimer – These graphs do not include error bars. This is purely for clarity of making my point, any time figures like these are published, they should include estimates of error – as discussed here).

Let’s start with cases where there is no, or only one, significant effect:

Figure (A) – there are no significant effects. There is no difference in means either between environments or species, and there is no interaction term as the two are independent, and the slopes of the lines are the same.

Figure (B) – a significant species effect. In this case there is no difference in the mean trait value across environments (it falls between the two lines), but the mean for each species is different. The slopes are equal so there is no significant interaction.

Figure (C) – a significant environment effect. Here there is a difference among the means of the two environments, but not a difference in the means among species. Again the slopes are equal.

Figure (D) – a significant interaction effect. In this case there is a significant interaction effect as the slopes are not equal. In this case there are no other effects as the means of the environments nor species differ. This is important as many people mistakingly believe that there cannot be a significant interaction term if neither of the main effects are significant. This shows that it is possible.

Now let’s move on to cases where there is more than one significant effect:

Figure (E) – a significant species effect and interaction term. In this case you can see that the slopes are unequal and thus there is an interaction term. The species means are different, but the environmental means are not.

Figure (F) – a significant environment effect and interaction term. As above, the slopes are unequal, but in this case the species means are the same while the environmental means differ.

Figure (G) – everything is significant. You can figure this one out. All the means are different, as are the slopes.

That’s the basics. Now ANOVAs can get really complicated with many levels of factors and complicated designs, but this relatively simple graphical understanding of ANOVAs has greatly helped me to understand more complex designs.

Categories: statistics