New Posts from Skeetersays!

November 27, 2013 Leave a comment

This blog that was once a repository of ramblings from my graduate school days is now going to be a place for scientific discussion.  The posts will come from myself or from undergraduate students in my research lab and will cover any topics that we deem interesting!  I look forward to working to bring this blog back to life.

Old posts not related directly to science have been removed for a nice fresh start!

Categories: Uncategorized

Making Excel Graphics Clear

August 31, 2008 Leave a comment

I think it was my undergraduate adviser who first introduced me to the work of Edward Tufte and basic concepts of the graphical display of data. He was also the first to convince me that MS Excel was just about the worst program with which to generate scientific data graphics. The noise: information ration was quite high. I learned how to change the background colors, change the axes and gridlines etc.. in compliance with good graphical design.

Recently, a discussion of excel graphics appeared on Tufte’s discussion website, and there was a link to a visual basic macro to clean up bad excel graphics! In a post at Juice Analytics from April 2006, entitled “Fixing Excel charts: Or why cast stones when you can pick up a hammer” there is a link to an excel add-in that cleans up standard charts, allowing you to save a great deal of time on doing these things one at a time.

This is especially useful for older versions of excel (pre 2007) that have especially awful graphics. Excel 2007 (that I am currently using) has a standard graphic that isn’t all that bad, not great yet, but much better than previous versions of excel. I was going to to do a before and after image of an excel chart using this add-on, but with 2007, they were pretty much the same. So if you use an older version of excel and want to try it out, send me a before and after shot and I will post it.

Categories: statistics

R: no nested FOR loops

August 14, 2008 7 comments

I am an avid user of the R project for statistical computing and use it for much of my computational and statistical analyses. I will be posting tips, tricks and hints for using R on this blog, especially when I can’t find the information elsewhere on the internet.

One of the main benefits of R is that it is vectorized, so you can simplify your code writing a great deal using vectorized functions. For example you don’t have to write a loop to add 5 to every element in a vector X. You can just write “X + 5”.

The R mantra is to minimize the use of loops wherever possible as they are inefficient and use a great deal of memory. I have been pretty good about this except for one situation – Running a simulation over a parameter space defined by a few vectors. For instance, anyone who is familiar with writing code in C or some other non-vectorized language will recognize the following pseudocode for running a simulation over a parameter space of 4 variables with nreps replications for each parameter set:

for (i in 1:length(m)) {
for (j in 1:length(sde)) {
for (k in 1:length(dev.coef)) {
for (l in 1:length(ttd)) {
for (rep in 1:nreps) {
Run the simulation

(my apologies for the bad formatting, blogger keeps taking out the spaces I put in. If anyone knows how to easily get the html version of my text to treat spaces as spaces, please, let me know).

This runs the simulation over a parameter space that is defined in a vector for each parameter. This is how I have been writing my R code for years.

I FINALLY FOUND OUT HOW TO VECTORIZE THIS PROCESS! I am not sure why it took me so long, but at long last I can do all of this in one command, using the apply and expand.grid functions. The apply command will apply a function sequentially to data taken from rows of an array and expand.grid takes factors and combines them into an array.

For example, say my parameter space is defined by:

> m <- c(1,2,3,4)
> n <- c("m","f")
> o <- c(12,45,34)

I can call expand.grid to get all of the combinations put into rows in an array:

> expand.grid(m,n,o)
Var1 Var2 Var3
1 1 m 12
2 2 m 12
3 3 m 12
4 4 m 12
5 1 f 12
6 2 f 12
7 3 f 12
8 4 f 12
9 1 m 45
10 2 m 45
11 3 m 45
12 4 m 45
13 1 f 45
14 2 f 45
15 3 f 45
16 4 f 45

then I can run my simulation command on the rows of this array by using apply() with MARGIN = 1 as a parameter telling apply to use the rows of the array:

So in my example, I have replaced the following ugly, inefficient code full of nested for loops (summarized above) with the following single command:

out <- apply(expand.grid(m,sde,dev.coef,ttd,reps), MARGIN = 1,
function(x) one.generation(N, x[1],x[2],x[3],x[4]))

Fantastic! I love it when I figure out how to do things more succinctly in R. I hopefully will regularly put in R tidbits on this blog in the hopes that some ‘ignorant’ R programmer like myself will find this post the first time he searches the internet for it. It may have taken me years to finally figure it out, but that doesn’t mean it has to take that long for everyone else!🙂

Categories: opensource, R, statistics

Demystifying Statistics: On the interpretation of ANOVA effects

August 12, 2008 5 comments

One of the statistical concepts that is very difficult for many people to grasp, yet critically important for an understanding of statistics, is the interpretation of significant effects in an Analysis of Variance (ANOVA). In this post, I will use a graphical approach to describe how to interpret effects from a two-way factorial ANOVA. I will not delve into the design, implementation, or computation involved in such ANOVAs.

Suppose we have an experiment where we are measuring a biological trait from each of two species (Sp1, Sp2) raised in each of two environments (Env1, Env2). We set up a two-way ANOVA and out pops an ANOVA table with four lines, one each for the species effect, the environment effect, the species-by-environment interaction effect, and one for the residuals (which I hope to discuss in detail in a later post, and will not talk about here.

ANOVAs are designed to look for differences in mean values accross different groupings of the data. In the case above, it is looking for a difference in mean trait values between the two species and between the two environments, while simultaneously testing for their independence (with the interaction term). So in order to look at a graph and think about ANOVAs, you have to think about mentally picturing the various means.

The first two effects are pretty straightforward to interpret. Suppose the analysis showed a significant Species effect. This means that there is a significant difference among species in the mean trait measure when pooled accross environments. Maybe that Sp1 always has a bigger eyeball width than Sp2, or something like that. Same thing with a significant environment effect; maybe both species do better in tropical vs. desert conditions.

It is the interaction term that is the hard term to understand for many people. A significant interaction term signifies a lack of independence of the other two variables, in this case species and environment. In this example, an interaction term may imply that there is environmental preference among species, with Sp1, say, preferring Env1 and Sp preferring Env2. Another way to think about significant interaction terms is in an Analysis of CoVAriance (ANCOVA) setting, where the two slopes are different (see below).

I find it easiest to think about these things graphically so below you will find the 7 different qualitative results for significant effects in a two-way ANOVA. These are the pics that always pop into my head when I think about ANOVAs. They are interaction plots of the two variables.

(Disclaimer – These graphs do not include error bars. This is purely for clarity of making my point, any time figures like these are published, they should include estimates of error – as discussed here).

Let’s start with cases where there is no, or only one, significant effect:

Figure (A) – there are no significant effects. There is no difference in means either between environments or species, and there is no interaction term as the two are independent, and the slopes of the lines are the same.

Figure (B) – a significant species effect. In this case there is no difference in the mean trait value across environments (it falls between the two lines), but the mean for each species is different. The slopes are equal so there is no significant interaction.

Figure (C) – a significant environment effect. Here there is a difference among the means of the two environments, but not a difference in the means among species. Again the slopes are equal.

Figure (D) – a significant interaction effect. In this case there is a significant interaction effect as the slopes are not equal. In this case there are no other effects as the means of the environments nor species differ. This is important as many people mistakingly believe that there cannot be a significant interaction term if neither of the main effects are significant. This shows that it is possible.

Now let’s move on to cases where there is more than one significant effect:

Figure (E) – a significant species effect and interaction term. In this case you can see that the slopes are unequal and thus there is an interaction term. The species means are different, but the environmental means are not.

Figure (F) – a significant environment effect and interaction term. As above, the slopes are unequal, but in this case the species means are the same while the environmental means differ.

Figure (G) – everything is significant. You can figure this one out. All the means are different, as are the slopes.

That’s the basics. Now ANOVAs can get really complicated with many levels of factors and complicated designs, but this relatively simple graphical understanding of ANOVAs has greatly helped me to understand more complex designs.

Categories: statistics

The Elements of Style

July 30, 2008 Leave a comment

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that every word tell.

There were two required texts for my undergraduate Evolution class: Evolutionary Analysis by Scott Freeman (link is to new edition rather thean the one I used), and The Elements of Style by William Strunk Jr and E.B. White (link to original 1918 edition). The former seemed appropriate at the time while the latter confused me greatly. The course was writing intensive and a large part of our grade was based on our writing clear, concise and substantial analyses of primary literature as well as a 30 page review paper. At the time I am sure that I looked through the style book but I can honestly say that I did not give it the consideration it deserved.

Today, in an act of procrastination, I took the ‘little book’ from the shelf, blew the dust off of the binding and leafed through it over a cup of coffee and realized how relevant it is, even nearly a century after its first publication. The above quote is a great summary of what my advisors have been telling me for years concerning scientific writing. In the version I have (1979) there are 11 elementary rules of usage which I mention below:

  1. Form the possessive singular of nouns by adding ‘s
  2. In a series of three or more terms with a single conjunction, use a comma after each term except the last
  3. Enclose parenthetic expressions between commas
  4. Place a comma before a conjunction introducing an independent clause
  5. Do not join independent clauses by a comma
  6. Do not break sentences in two
  7. Use a colon after an independent clause to introduce a list of particulars, an appositive, an amplification, or an illustrative quotation
  8. Use a dash to set off an abrupt break or interruption and to announce a long appositive or summary
  9. The number of the subject determines the number of the verb
  10. Use the proper case of pronoun
  11. A participial phrase at the beginning of a sentence must refer to the grammatical object.

These are timeless rules of grammar. The book includes a few things that have not survived the ages such as the following section discriminating against the use of the suffix “-ize”, that I believe only appears in the White edited version of the text (not in the original)

Do no coin verbs by adding this tempting suffix. Many good and useful verbs do end in –ize: summarize, temporize, fraternize, harmonize, fertilize. But there is a growing list of abominations: containerize, customize, prioritize, finalize, to name four. Be suspicious of –ize; let your ear and your eye guide you. Never tack –ize onto a noun to create a verb. Usually you will discover that a useful verb already exists. Why say “moisturize” when there is the simple, unpretentious word “moisten”.

What a fantastic quote! Needless to say, it seems that society has not agreed with the authors, as evidenced by the inclusion of all of the ‘abominations’ in the MS word dictionary. Not to mention the numerous other –ize words in common usage today. I think that one of my new favorite quotes is the last sentence about the word “moisten”.

Overall, this book is still as relevant and accurate as it was nearly a century ago and I will strive to make my writing follow Strunk’s guidelines to the best of my ability.    

Categories: Uncategorized

On the use of error bars

July 30, 2008 2 comments

Dave Munger over at Cognitive Daily just wrote this post about people’s lack of understanding of error bars in the graphical representation of data. The post is very interesting and I encourage people to take the quiz that he has posted on the correct interpretation of error bars.
A particular comment on that post concerns me, and I am going to use this post to give my two cents on error bars and their importance in the understanding of data. Specifically I will try to address some misconceptions and problems with how people use and read error bars.

The comment that concerns me is:

I may, in the future, forget the exact definition of what the error bars mean, but I will still be capable of saying “Whoo, small error bar, that figure is probably pretty accurate” and “Whoa, look at that huge error bar, I’ll use a bigger grain of salt to look at that figure”.

This comment frightens me. I can’t help but to think about the book How to Lie With Statistics (link to book at powells). The main problem with this reasoning is that there are many ‘types’ of error bars that are often included in scientific graphics, with most researchers choosing some multiple of either the standard error or the standard deviation. One can not just look at the length of the error bars and assume that it means accurate data (to get a bit picky on semantics – also, error bars do not reflect the accuracy of the data, rather it reflects the precision with which you can measure the data). It will all depend on which error measurement is being plotted, and it is highly variable among scientific papers. I tend to use error bars that are the length of 2 * Standard error for reasons I will get to in a bit, and thus relative to other graphics that usually plot 1 SE my data may seem ‘less accurate’ to the reader, and that would be a shame and completely incorrect.

The appropriate use of error bars

Data that is plotted without error bars are data that cannot be put into relevant scientific context. Are two means the same? What is the measurement error on the observations? Is there a pattern of variability among groups? These are all incredibly important scientific questions that cannot be addressed without estimates of errors of one form or another. As such, error bars should ALWAYS be included in scientific graphics. The lack of error bars in figures immediately raises suspicion in my mind as to the appropriateness of any conclusions drawn from the data. There are a few exceptions where a complex graphic would lose all meaning if the error bars were included (say there were too many points and if the error bars were included you would not be able to see any of the data), but under these conditions, the text associated with the figure should make very clear the level of error on the data.

In my publications, I tend to use error bars representing two standard errors (SE) around a mean. This is because the standard two-group t-test (or F-test) has a 95% confidence interval of ~2SE. Therefore you can use this to directly estimate the significance of a difference in means, rather than having to visually double the length of 1SE error bars that most people use (mostly because they make people like the one quoted above more trusting of their data, rather than for any worthwhile reason). With 2 SE error bars, one can look to see if the mean of one group is included in the confidence interval of the other group – if so then there is likely no difference among the groups. Note that it is not relevant whether the error bars ‘overlap’ but whether the mean of one group ‘overlaps’ with the error bars of the other.

Here is fictitious example with some randomly generated data.

In this case the two groups are significantly different using a Students t-test (t =3.59, df = 198, p = 0.0004. I have plotted the same data twice showing that the two samples are different, with the plot on the left-side having 1SE error bars and the one on the right having 2SE error bars. There is not much difference in interpretation of these graphs. In either case they look different (OK, for this example, I have two groups that are highly different to try to make this easier to visualize). The 2SE error bars do not make the data look ‘less accurate’ but they do make it easier to see what is going on. The mean of either sample is not included in within the error bars of the other sample – thus the two samples are different. This is easier and more appropriate to interpret than the left plot with which in order to correctly interpret the error bars, you must first visually double their length. The person quoted above may have less trust in ‘accuracy’ of the data on the right, even though it is the same data, just with a different choice of error bar.

The following example is again randomly generated data, but in this case there is no significant difference among the groups (t = 0.96, df = 198, p = 0.336)

In this case the difference between the left plot (with 1SE error bars) and the right plot (with 2SE error bars) is clear. The right figure yields to the most appropriate interpretation of the data. It is clear with 2SE that the mean of one group ‘overlaps’ with the error bars of the other group, therefore suggesting that there is no difference among groups, which is the case here. If people were incorrectly using the same reasoning in the plot with 1SE error bars, they would incorrectly conclude that the means of the two groups were different.

My main conclusions are the following:
1. Error bars should ALWAYS be included in scientific graphics or at least have associated text describing the error measurements.
2. Do not just look at the width of error bars as an estimate of ‘accuracy’ of the data – it is context dependent on what the data are and which type of error bars the author has decided to use.
3. I encourage the use of 2SE error bars in the majority of cases to improve the clarity of the relationships of the data and to minimize the mis-interpretation of the error bars, even though it may make your data look ‘more noisy’.
4. Teach others what error bars really mean so that they can accurately read scientific figures.

Categories: research, statistics

Degenerate PCR – A guide and tutorial

July 10, 2008 8 comments

Finding gene sequence data in organisms for which there are no genomic resources available can be a non-trivial task. One of the most common methods for finding such sequence information is through degenerate PCR methods, either using total genomic DNA or a DNA library (genomic or cDNA library) as template. This technique is tedious and its effectiveness requires quite a bit of luck as well as skill. I have never really found a comprehensive resource for optimizing the chances of success so in this post I will outline the techniques I use and give a few tips that have really helped me in the past.

In general, polymerase chain reaction (PCR) requires two primers (short sequences of nucleotides) that specifically bind to a region of the genome that is to be amplified. This requires knowledge of at least a portion of the specific sequence to be amplified. Degenerate PCR involves using primers that allow for some ‘wiggle room’ in the sequence of the primers. For example the 4th nucleotide in the primer sequence may allow it to anneal to template sequence with nucleotides A, T, or G, while excluding those with C. This allows for flexibility in amplification. On the downside, it reduces specificity of the primers.

Degenerate PCR works because, in general, there is far more conservation at the amino acid (AA) level than at the nucleotide level. Conserved portions of AA sequences among organisms closely related ot the focal organisms are likely to be conserved in the focal organism as well. For example, let’s say that there is a gene that has the following amino acid sequence “GCCHCDE” that is conserved among a few closely related organisms. There are 256 nucleotide sequences that will code for this specific AA sequence. It is likely that neutral changes in the nucleotide sequence of these organisms have been accumulated throughout evolutionary time, but if you design primers to take into consideration all of these possibilities, you should be able to use this sequence as a PCR primer, assuming the AA sequence is conserved in your species of interest. Aligning the nucleotide sequences of these organisms is unlikely to yield conserved primers.

I will discuss the design of degenerate primers for ‘finding’ a gene in an organism that has closely related organisms with gene sequences available.

Step 1 – Get the sequence data of the gene-of-interest from related organisms

I work on a mosquito, so I generally start by finding the sequence data from D. melanogaster, a fellow dipteran. I start at Flybase and look up the sequence (usually by name) that I am interested and download the protein sequence (translation) in FASTA format (you can use all of the following methods for non-coding DNA as well). The sequence should be copied into an empty text file. I usually change the beginning of the header line (the line starting with “>”) to be the species name as this is the portion of the header that will be included in the alignment files. The alignment program will read from the character after “>” until the first space, so keep that in mind when you name your sequence.

Next, it is necessary to find sequence data for other related organism (in my case other mosquitos including Culex pipiens, Aedes aegypti and Anopheles gambiae). I usually do this at NCBI using their BLAST program. I usually stick to 3 – 5 species that are closely related to the one you are after. Including more organisms make it a bit harder to work with but are more likely to give you highly conserved regions.

Once you have acquired the protein coding sequence of a variety of closely related organisms, it is time to move on to step 2!

Step 2 – Align the protein sequences

With the text file including the fasta-formatted protein sequences, the next step is to align the sequences to account for gaps. I use ClustalX, but there are web-based interfaces for Clustal available including one here. I usually just leave all of the default options and run the multiple alilgnment, which spits out an alignment file (*.aln) with the aligned protein sequences. Then I print out a copy of the alignment (working on the computer screen is difficult). This is where the tedious part begins.

Step 3 – Finding conserved sequence regions with low degeneracy

The goal here is to come up with stretches of conserved amino acid (AA) sequences that have a low degeneracy. By this I mean that the conserved AA sequence could be generated by a relatively low number of nucleotide sequences. In general the lower the degeneracy you can get (fewer possible nucleotide sequences) the better. At worst, I will use a primer that has a degeneracy of 1000 possible nucleotide sequenes, but I try to make them much smaller than that if possible.

Stare at the printout. Alignment files are nice as they put “*” under all sites that are conserved and “:” under all similar sites. Look for stretches of these symbols and that is a good place to start. Once you find conserved sequences (trying to make primers of ~17 – 24 nucleotides requires 6- 8 AAs) the degeneracy of these sequences must be determined. This can be done by taking the product of the degeneracy of each AA in the sequence. For example, Valine has four codons (GTT GTC GTA GTG) and thus has a degeneracy of 4 while Tryptophan has only one codon (TGG) and thus has a degeneracy of 1. The degeneracy of each AA, along with its codons, is included in the reference table I have linked at the end of this post.

When you find a site with low degeneracy, write out all of the possible sequences using the Degeneracy code found in the reference table and order your primers. Then you can try the PCRs and hope for the best.

Tips and Tricks that will give you the best chances of successful degenerate PCR

  1. Keep the degeneracy of each primer low. Under 400 is great – under 1000 is ok but not good, and over 1000 isn’t worth your time.
  2. In general, larger PCR reactions work better – I tend to use 50uL reactions for degenerate PCRs
  3. Use 3-5 times the amount of primer you would normally use to increase the chances of the appropriate primer being in the reaction at any decent concentration. I tend to use 3 uL of each primer (at 10mM) for each 50 uL reaction.
  4. I have had the best success with nested degenerate PCR if possible. In this you have a minimum of 3, but best is at least 4 primers within the sequence. In the case of four, you will have two forward and two reverse primers. For the first PCR reaction you use the two “outer” forward and reverse primers. Then you take a portion of this first PCR and use it as template for the second reaction (I usually use 5uL of the first 50 uL reaction as template for the second reaction). This helps to reduce the number of amplicons and makes the reactions more specific to the gene you are looking for.
  5. The more primers you can design for a given gene, the better the chances that one of the primer sets will work.
  6. Methionine (M) and Tryptophan(W) are the only amino acids that are coded for by a unique codon. Having these in your primer sequences is great!
  7. Try to stay away from Serine (S) Arginine (R) and Leucine (L) as they each are coded for by six codons. This said, don’t let the presence of some of thes AAs keep you from using that region. But realize that a sequence of SSRLSR is not going to make a good degenerate primer.
  8. Amplifying a 200 – 600 bp region seems to be optimal but I have done as few as 80 bp and as much as 1200 bp.

I have created a reference table that will come in very handy for any attempt to design degenerate primers. The table includes the codon list for all amino acids along with the degenerate code for the colleciton of nucleotides in both forward and reverse compliments. The degeneracy of each amino acid is also listed.

Degenerate Primer Design Reference Table

As an example, this morning, I worked up some primers for Juvenile Hormone Esterase for mosquitos. In the pdf linked below you can see the alignment file with the primer sequences highlighted along with the details of the associated primers.

JHE example

Categories: protocol, science, tutorial

Get every new post delivered to your Inbox.