Degenerate PCR – A guide and tutorial
Finding gene sequence data in organisms for which there are no genomic resources available can be a non-trivial task. One of the most common methods for finding such sequence information is through degenerate PCR methods, either using total genomic DNA or a DNA library (genomic or cDNA library) as template. This technique is tedious and its effectiveness requires quite a bit of luck as well as skill. I have never really found a comprehensive resource for optimizing the chances of success so in this post I will outline the techniques I use and give a few tips that have really helped me in the past.
In general, polymerase chain reaction (PCR) requires two primers (short sequences of nucleotides) that bind specifically to the region of the genome that is to be amplified. This requires knowledge of at least a portion of the specific sequence to be amplified. Degenerate PCR uses primers that allow for some ‘wiggle room’ in their sequence. For example, the 4th position in the primer may be synthesized as a mix that anneals to template sequences with A, T, or G at that site, while excluding those with C. This allows for flexibility in amplification. On the downside, it reduces the specificity of the primers.
Degenerate PCR works because, in general, there is far more conservation at the amino acid (AA) level than at the nucleotide level. Portions of an AA sequence that are conserved among organisms closely related to the focal organism are likely to be conserved in the focal organism as well. For example, let’s say that there is a gene with the amino acid sequence “GCCHCDE” that is conserved among a few closely related organisms. There are 256 nucleotide sequences that will code for this specific AA sequence. It is likely that neutral changes in the nucleotide sequence have accumulated in these organisms over evolutionary time, but if you design primers that take all of these possibilities into consideration, you should be able to use this sequence as a PCR primer, assuming the AA sequence is conserved in your species of interest. By contrast, aligning the nucleotide sequences of these organisms directly is unlikely to yield conserved primers.
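The figure of 256 can be checked by multiplying the number of codons for each amino acid in the peptide. Here is a minimal sketch, assuming the standard genetic code; the function name is made up for illustration.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def coding_sequence_count(peptide):
    """Return how many nucleotide sequences encode the given peptide."""
    return prod(CODON_COUNT[aa] for aa in peptide.upper())

# G has 4 codons; C, H, D and E have 2 each: 4 * 2 * 2 * 2 * 2 * 2 * 2 = 256
print(coding_sequence_count("GCCHCDE"))
```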
I will discuss the design of degenerate primers for ‘finding’ a gene in an organism that has closely related organisms with gene sequences available.
Step 1 – Get the sequence data of the gene-of-interest from related organisms
I work on a mosquito, so I generally start by finding the sequence data from D. melanogaster, a fellow dipteran. I start at Flybase, look up the sequence that I am interested in (usually by name) and download the protein sequence (translation) in FASTA format (you can use all of the following methods for non-coding DNA as well). The sequence should be copied into an empty text file. I usually change the beginning of the header line (the line starting with “>”) to be the species name, as this is the portion of the header that will be included in the alignment files. The alignment program will read from the character after “>” until the first space, so keep that in mind when you name your sequence.
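If you would rather not edit headers by hand, the renaming can be scripted. This is only a sketch: the function name and the example header are made up, and it assumes one sequence per input string. Spaces in the species name are replaced with underscores so the whole name survives the read-until-first-space rule.

```python
def relabel_fasta(fasta_text, species):
    """Put a species label at the start of the first FASTA header line."""
    label = species.replace(" ", "_")  # no spaces, so the full label is kept
    out = []
    relabeled = False
    for line in fasta_text.splitlines():
        if line.startswith(">") and not relabeled:
            # Keep the original header text after the new label.
            line = ">" + label + " " + line[1:]
            relabeled = True
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical record; the identifier and residues are placeholders.
record = ">FBpp0000000 type=protein\nMKVWDMCHEFSL\n"
print(relabel_fasta(record, "Drosophila melanogaster"))
```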
Next, it is necessary to find sequence data for other related organisms (in my case other mosquitos including Culex pipiens, Aedes aegypti and Anopheles gambiae). I usually do this at NCBI using their BLAST program. I usually stick to 3 – 5 species that are closely related to the one I am after. Including more organisms makes the alignment a bit harder to work with, but is more likely to point you to regions that are truly highly conserved.
Once you have acquired the protein coding sequence of a variety of closely related organisms, it is time to move on to step 2!
Step 2 – Align the protein sequences
With the text file containing the FASTA-formatted protein sequences in hand, the next step is to align the sequences to account for gaps. I use ClustalX, but there are web-based interfaces for Clustal available including one here. I usually just leave all of the default options and run the multiple alignment, which spits out an alignment file (*.aln) with the aligned protein sequences. Then I print out a copy of the alignment (working on the computer screen is difficult). This is where the tedious part begins.
Step 3 – Finding conserved sequence regions with low degeneracy
The goal here is to come up with stretches of conserved amino acid (AA) sequence that have a low degeneracy. By this I mean that the conserved AA sequence could be encoded by a relatively small number of nucleotide sequences. In general, the lower the degeneracy you can get (fewer possible nucleotide sequences) the better. At worst, I will use a primer that has a degeneracy of 1000 possible nucleotide sequences, but I try to keep it much lower than that if possible.
Stare at the printout. Alignment files are nice as they put “*” under all sites that are conserved and “:” under all similar sites. Stretches of these symbols are a good place to start. Once you find conserved sequences (primers of ~17 – 24 nucleotides require 6 – 8 AAs), the degeneracy of these sequences must be determined. This can be done by taking the product of the degeneracy of each AA in the sequence. For example, Valine has four codons (GTT GTC GTA GTG) and thus has a degeneracy of 4, while Tryptophan has only one codon (TGG) and thus has a degeneracy of 1. The degeneracy of each AA, along with its codons, is included in the reference table I have linked at the end of this post.
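The same search can be roughed out in code as a cross-check on the printout. The sketch below assumes the standard genetic code and a list of aligned, gap-padded sequences of equal length. It only considers fully conserved columns (the “*” sites), and the toy alignment is made up. Start positions are 0-based.

```python
from collections import Counter
from math import prod

# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks stop codons, which are dropped before counting).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_COUNT = Counter(AMINO_ACIDS.replace("*", ""))  # e.g. V -> 4, W -> 1

def conserved_windows(aligned, length=7, max_degeneracy=1000):
    """Yield (start, peptide, degeneracy) for fully conserved windows."""
    ncols = len(aligned[0])
    # A column counts as conserved when all sequences share one residue.
    conserved = [
        len({seq[i] for seq in aligned}) == 1 and aligned[0][i] != "-"
        for i in range(ncols)
    ]
    for start in range(ncols - length + 1):
        if all(conserved[start:start + length]):
            peptide = aligned[0][start:start + length]
            degeneracy = prod(CODON_COUNT[aa] for aa in peptide)
            # Unknown residues (such as X) count as 0 and are skipped.
            if 0 < degeneracy <= max_degeneracy:
                yield start, peptide, degeneracy

# Made-up toy alignment, for illustration only.
toy = [
    "MKVWDMCHEFSL",
    "MRVWDMCHEF-L",
    "MKIWDMCHEFSL",
]
for start, peptide, degeneracy in conserved_windows(toy, length=7):
    print(start, peptide, degeneracy)  # 3 WDMCHEF 32
```

Running it with `length` set to 6, 7 and 8 in turn covers the 6 – 8 AA range. Sites marked “:” are ignored here, so a manual look is still worthwhile.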
When you find a site with low degeneracy, write out all of the possible sequences using the Degeneracy code found in the reference table and order your primers. Then you can try the PCRs and hope for the best.
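Writing out the degenerate sequence can also be sketched in code. The version below assumes the standard genetic code and the IUPAC ambiguity codes, and collapses each codon position independently. For Serine, Arginine and Leucine this yields a broader pool than their six real codons (Serine becomes WSN, which covers 16 triplets), so `pool_size` reports the size of the pool actually ordered, not the product of AA degeneracies. All function names are made up.

```python
from itertools import product
from math import prod

# Standard genetic code: codons in TCAG order, first base changing slowest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))

# IUPAC ambiguity codes, keyed by the sorted set of bases each stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
BASES_PER_CODE = {code: len(bases) for bases, code in IUPAC.items()}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def degenerate_primer(peptide):
    """Back-translate a peptide into one IUPAC-coded degenerate sequence."""
    primer = []
    for aa in peptide.upper():
        for position in range(3):
            bases = {codon[position] for codon in CODONS[aa]}
            primer.append(IUPAC["".join(sorted(bases))])
    return "".join(primer)

def reverse_complement(seq):
    """Reverse complement, valid for IUPAC ambiguity codes as well."""
    return seq.translate(COMPLEMENT)[::-1]

def pool_size(primer):
    """Number of distinct sequences in the degenerate primer pool."""
    return prod(BASES_PER_CODE[code] for code in primer)

forward = degenerate_primer("WDMCHEF")
print(forward)                      # TGGGAYATGTGYCAYGARTTY
print(reverse_complement(forward))  # RAAYTCRTGRCACATRTCCCA
print(pool_size(forward))           # 32
```

A reverse primer is ordered as the reverse complement of the coding-strand sequence, which is why both orientations are printed.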
Tips and Tricks that will give you the best chances of successful degenerate PCR
- Keep the degeneracy of each primer low. Under 400 is great – under 1000 is ok but not good, and over 1000 isn’t worth your time.
- In general, larger PCR reactions work better – I tend to use 50 µL reactions for degenerate PCRs.
- Use 3-5 times the amount of primer you would normally use, to increase the chances that the appropriate primer sequence is present in the reaction at a decent concentration. I tend to use 3 µL of each primer (at 10 µM) for each 50 µL reaction.
- When possible, use nested degenerate PCR; I have had the best success with it. This requires a minimum of 3 primers within the sequence, but at least 4 is best. With four, you will have two forward and two reverse primers. For the first PCR reaction you use the two “outer” forward and reverse primers. Then you take a portion of this first PCR and use it as template for a second reaction with the two “inner” primers (I usually use 5 µL of the first 50 µL reaction as template for the second reaction). This helps to reduce the number of amplicons and makes the reactions more specific to the gene you are looking for.
- The more primers you can design for a given gene, the better the chances that one of the primer sets will work.
- Methionine (M) and Tryptophan (W) are the only amino acids that are coded for by a single codon. Having these in your primer sequences is great!
- Try to stay away from Serine (S), Arginine (R) and Leucine (L), as they are each coded for by six codons. That said, don’t let the presence of some of these AAs keep you from using a region. But realize that a sequence of SSRLSR is not going to make a good degenerate primer.
- Amplifying a 200 – 600 bp region seems to be optimal but I have done as few as 80 bp and as much as 1200 bp.
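The point about getting the right primer into the reaction at a decent concentration can be made concrete with a little arithmetic. The sketch below assumes a 10 µM primer stock and that every sequence in a degenerate pool is present in equal amounts, which real oligo synthesis only approximates. The function names are made up.

```python
def final_conc_um(stock_um, primer_ul, reaction_ul):
    """Total primer concentration (µM) after dilution into the reaction."""
    return stock_um * primer_ul / reaction_ul

def per_sequence_conc_nm(total_um, degeneracy):
    """Average concentration (nM) of each single sequence in the pool."""
    return total_um * 1000 / degeneracy

# Assumed setup: 3 µL of a 10 µM stock in a 50 µL reaction.
total = final_conc_um(stock_um=10, primer_ul=3, reaction_ul=50)
print(f"total primer: {total:.2f} µM")
for degeneracy in (32, 400, 1000):
    each = per_sequence_conc_nm(total, degeneracy)
    print(f"degeneracy {degeneracy}: about {each:.2f} nM per sequence")
```

Under these assumptions the share of any one exact sequence falls in proportion to the degeneracy, which is the reasoning behind both adding extra primer and keeping the degeneracy low.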
I have created a reference table that will come in very handy for any attempt to design degenerate primers. The table includes the codon list for all amino acids along with the degenerate code for the collection of nucleotides in both forward and reverse complements. The degeneracy of each amino acid is also listed.
As an example, this morning, I worked up some primers for Juvenile Hormone Esterase for mosquitos. In the pdf linked below you can see the alignment file with the primer sequences highlighted along with the details of the associated primers.