Questioning the evidence for non-canonical RNA editing in humans

In May of last year, Li and colleagues reported that they had observed over 10,000 sequence mismatches between messenger RNA (mRNA) and DNA from the same individuals (RDD sites, for RNA-DNA differences) [1]. This week, Science has published three technical comments on this article (one that I wrote with Yoav Gilad and Jonathan Pritchard; one by Wei Lin, Robert Piskol, Meng How Tan, and Billy Li; and one by Claudia Kleinman and Jacek Majewski). We conclude that at least ~90% of the Li et al. RDD sites are technical artifacts [2,3,4]. A copy of the comment I was involved in is available here, and Li et al. have responded to these critiques [5].

In this post, I’m going to describe how we came to the conclusion that nearly all of the RDD sites are technical artifacts. For a full discussion, please read the comments themselves.


Position biases in alignments around RDD sites. For each RDD site with at least five reads mismatching the genome, we calculated the fraction of reads with the mismatch (or the match) at each position in the alignment of the RNA-seq read to the genome (on the + DNA strand). Plotted is the average of this fraction across all sites, separately for the alignments which match and mismatch the genome.


It’s worth remembering why the Li et al. study has received so much attention. It is known that there are many thousands of bases in human transcripts that are sometimes modified from adenine to inosine (A->I) after transcription via RNA editing. However, these sites are generally found outside of protein-coding regions of mRNAs (i.e., in introns and untranslated regions), often in repeats (e.g., [6]). There are perhaps a few dozen known RNA editing sites that affect protein sequence, though more presumably exist (incidentally, many of these were found by Billy Li, one of the authors of the Lin/Piskol et al. technical comment).

In light of what we know about RNA editing, Li et al. was a bombshell. They found over 10,000 exonic RDD sites, most of which were not A->I changes (or C->U changes, another known type of RNA editing). These included many thousands of RDD sites that were predicted to change protein sequence. These results implied the existence of at least one, and probably more, novel mechanisms of gene regulation, and indeed called into question some basic assumptions used regularly in genetics (for example, that if one knows the sequence of a gene, one can predict with near certainty the sequence of the relevant protein).

So it’s not the existence of RDD sites, per se, that was so surprising about Li et al., but rather the major biological impact of the sites and the implied existence of previously unknown regulatory pathways.

Why I think nearly all of the RDD sites in Li et al. are false positives

Since the publication of Li et al., two groups have raised serious issues about the reported RDD sites [7,8]. Both concluded that the majority of these sites were false positives. (If these authors are wondering why they’re not cited in our comment, it’s because Science didn’t let me add the citations during the editing process. Sorry!)

The observation that I personally found most convincing is displayed in the plot at the beginning of this post. What I’m showing is that mismatches to the genome at RDD sites occur almost exclusively at the ends of sequencing reads. All three technical comments include this observation. Importantly, Lin/Piskol et al. take this analysis one step further. They show (in their Figure 2) that this effect is driven by the fact that mismatches to the genome at RDD sites tend to occur at the beginning of sequencing reads that go in the opposite direction of transcription (this effect is masked in my plot).
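The position-bias calculation described in the figure caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the technical comment: the input format (a mapping from site to `(position_in_read, matches_genome)` pairs), the function name `position_profiles`, and the toy data are all made up for the example.

```python
def position_profiles(reads_by_site, read_length, min_mismatch_reads=5):
    """Average, across RDD sites, of the per-site fraction of reads that
    cover the site at each read position.

    reads_by_site maps a site label to a list of (position_in_read,
    matches_genome) pairs, one pair per read overlapping that site.
    Returns {True: profile for matching reads, False: profile for
    mismatching reads}; each profile has one value per read position.
    """
    sums = {True: [0.0] * read_length, False: [0.0] * read_length}
    n_sites = {True: 0, False: 0}
    for reads in reads_by_site.values():
        n_mismatch = sum(1 for _, matches in reads if not matches)
        if n_mismatch < min_mismatch_reads:
            continue  # keep only sites with enough mismatching reads
        for matches in (True, False):
            positions = [pos for pos, m in reads if m == matches]
            if not positions:
                continue
            n_sites[matches] += 1
            # Per-site fraction of this read class at each position
            for pos in positions:
                sums[matches][pos] += 1.0 / len(positions)
    return {
        matches: [s / n_sites[matches] if n_sites[matches] else 0.0
                  for s in sums[matches]]
        for matches in (True, False)
    }


# Toy input (made up): one site whose mismatches sit at the read ends,
# and one site with too few mismatching reads to be included.
toy_reads = {
    "site_1": [(0, False), (0, False), (1, False), (49, False), (49, False),
               (10, True), (20, True), (30, True), (40, True)],
    "site_2": [(0, False), (25, False), (12, True)],
}
profiles = position_profiles(toy_reads, read_length=50)
```

In a real analysis the pairs would come from the read alignments themselves; a flat profile for matching reads next to a profile for mismatching reads that spikes at the first and last positions is the signature described above.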

To argue that this pattern is not due to major technical problems, then, one needs to come up with a biological mechanism that accounts for the following observations:

  1. Mismatches to the genome at RDD sites are almost exclusively at the ends of sequencing reads (all three comments show this)
  2. In particular, mismatches at RDD sites are massively enriched at the beginning of sequencing reads in the opposite direction of mRNA transcription (Lin/Piskol et al. show this)
  3. Known A->I RNA editing sites do not have the above two properties (21/23 RDD sites that were previously observed as A->I edits pass the filters used by Pickrell et al.)

The response by Li et al. [5] proposes (but does not show) that these observations are due to A) clustering of RDD sites and B) co-occurrence of RDD sites with insertion/deletion RDD sites. They argue that these two effects could lead to mapping biases, such that sequencing reads carrying an edited base will only map to the genome if the mismatch is at the end of the read. There are two important points to make about this potential explanation. First, this proposed mechanism cannot account for observation #2 above, nor is it immediately clear how it would accommodate observation #3. Second, it is perhaps not obvious to others that widespread insertional RNA editing has not been observed in humans. Li et al. propose a new regulatory mechanism (widespread insertional RNA editing) that interacts with the new regulatory mechanism proposed in their original paper (novel RNA editing types) to create patterns in the data that look indistinguishable from technical artifacts. I think it’s fair to say that the burden of proof is on Li et al. to show that this explanation is more than adding an epicycle on an epicycle.

Lin/Piskol et al. instead propose a plausible artifactual explanation for all three observations. To understand this explanation, it’s important to note that Li et al. have not sequenced RNA itself, but rather cDNA generated from mRNA. To generate the cDNA, they added random short DNA sequences to each sample to act as primers for a DNA synthesis reaction. The argument is as follows: at some sites, the random primers were imperfect matches to the mRNA, but were still able to bind. During synthesis, the mismatches from the primers were incorporated into the cDNA, leading to a false signal of RNA editing (specifically at the positions where the primer initially bound; i.e., the beginning of reads in the opposite direction of transcription). In effect, at a small fraction of sites (but a large absolute number), Li et al. inadvertently performed site-directed mutagenesis on their cDNA library.
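A toy model may help make the mispriming argument concrete. The sketch below is a deliberate simplification, not a description of the actual library protocol: it ignores second-strand synthesis, fragmentation, and sequencing error, writes the mRNA in DNA letters, and uses a made-up transcript and helper names (`first_strand_read`, `reverse_complement`). It shows only that a mismatch carried by the primer ends up within the first few bases of a read that runs opposite to the direction of transcription.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def first_strand_read(mrna, primer, primer_end, read_length):
    """Read taken from the 5' end of first-strand cDNA.

    The primer anneals to mrna[primer_end - len(primer):primer_end] and is
    extended toward the 5' end of the mRNA. The primer itself becomes the
    first bases of the cDNA, so any mismatch it carries is copied into
    the read.
    """
    primer_start = primer_end - len(primer)
    upstream_start = max(0, primer_start - (read_length - len(primer)))
    extension = reverse_complement(mrna[upstream_start:primer_start])
    return (primer + extension)[:read_length]


# Hypothetical transcript, written 5'->3' in DNA letters (T instead of U).
mrna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
primer_end = 30

# A perfectly matching hexamer, and one that differs at a single base
# but is assumed to bind anyway.
perfect_primer = reverse_complement(mrna[primer_end - 6:primer_end])
imperfect_primer = perfect_primer[:3] + "G" + perfect_primer[4:]

faithful_read = first_strand_read(mrna, perfect_primer, primer_end, 20)
artifact_read = first_strand_read(mrna, imperfect_primer, primer_end, 20)

# Positions (0-based, from the start of the read) where the two reads differ.
mismatch_positions = [
    i for i, (a, b) in enumerate(zip(faithful_read, artifact_read)) if a != b
]
```

Under these assumptions the only difference between the two reads lies inside the primer-derived bases at the start of the read, which is where the spurious "edits" would be called.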

Addressing the validation experiments in Li et al.

If Lin/Piskol et al. are right that the majority of RDD sites are artifacts due to errors introduced during cDNA library generation, how can we explain the fact that Li et al. [1] were able to validate the presence of both “wild-type” and “edited” RNA and proteins at some sites? The technical comments include additional analyses showing that some RDD sites are due to mis-mapped reads from paralogous genes, and some due to previously unidentified genetic variation. At these sites, we argue that the two mRNA and protein forms are in fact present in the data, but that they derive from two different DNA forms, rather than resulting from RNA editing.

In their response, Li et al. [5] present no new validation experiments involving RNA or protein sequences (the closest thing is a single, indirect protein assay). Instead, they present new DNA sequence validation. It’s thus worth revisiting the validation experiments from the original paper.

The first type of validation performed in the original paper involved Sanger sequencing of RNA and DNA from 11 RDD sites. Both Kleinman et al. and Pickrell et al. specifically finger four of these sites (in the genes HLA-DQB2 and DPP7) as particularly likely to be false positives due to genetic variation. In the original paper, the validation data at these sites was not shown. In their response, Li et al. [5] do not present DNA sequence validation at these four sites; it’s unclear whether this is a tacit acknowledgement that these were false positives. Of the remaining 7 sites, 6 are of the A->I type, and indeed 4 of these were already known A->I editing sites. This validation, then, actually had a false positive rate of 80% for non-A->I sites (4/5); there is perhaps one site worth exploring further.
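The arithmetic behind the 80% figure can be spelled out in a few lines. This sketch takes the counts from the paragraph above and assumes, as the 4/5 ratio implies, that the four HLA-DQB2 and DPP7 sites are non-A->I sites and are counted as false positives; the variable names are just for illustration.

```python
total_sites = 11                    # RDD sites tested by Sanger sequencing
suspected_genetic_variation = 4     # HLA-DQB2 and DPP7 sites (assumed false positives)

remaining = total_sites - suspected_genetic_variation       # 7 sites
remaining_a_to_i = 6                                         # A->I sites among those 7
remaining_non_a_to_i = remaining - remaining_a_to_i          # 1 site left to explore

# All non-A->I sites in the validation set: the four suspect sites plus
# the one remaining site.
non_a_to_i_tested = suspected_genetic_variation + remaining_non_a_to_i
false_positive_rate = suspected_genetic_variation / non_a_to_i_tested
```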

The other validation exercise performed by Li et al. [1] involved identifying peptide sequences that correspond to “edited” RDD sites. Pickrell et al. point out that many of the peptide sequences are in fact equally good matches to multiple genes. We propose, then, that these RDD sites are false positives due to mis-mapped reads from paralogous genes. In their response, Li et al. [5] show DNA sequencing data from several of these sites. However, to show that a paralog of a gene does not have a genetic variant would require sequencing the paralog as well; this was not done. The paralog issue remains, to me, the most plausible explanation for the sites Li et al. claim to have validated.
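The paralog concern can be illustrated with a simple lookup: given a peptide, list every protein that contains it. The sketch below is a minimal example with invented gene names and invented sequences (nothing here comes from the Li et al. data), and real peptide identification would also involve mass tolerances and scoring rather than exact substring matching.

```python
def genes_matching_peptide(peptide, proteins):
    """Names of all proteins whose sequence contains the peptide exactly."""
    return sorted(name for name, seq in proteins.items() if peptide in seq)


# Hypothetical protein sequences; GENE_A_PARALOG differs from GENE_A
# at a few residues, including an R -> K difference.
proteins = {
    "GENE_A": "MKTLLSDRVAAGQW",
    "GENE_A_PARALOG": "MSTLLSDKVAAGHW",
    "GENE_B": "MPPQRSTVWY",
}

# A peptide that looks like an "edited" form of GENE_A (R -> K) ...
edited_peptide_hits = genes_matching_peptide("LLSDKVAAG", proteins)

# ... and a shorter peptide that cannot distinguish the two genes.
shared_peptide_hits = genes_matching_peptide("TLLSD", proteins)
```

In this toy case the "edited" peptide is simply the unedited product of the paralog, so observing it says nothing about editing of GENE_A.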

Are there any examples of new types of RNA editing in Li et al.?

The conclusion that at least 90% of the RDD sites in Li et al. are false positives is in some sense unsatisfying. After all, if the remaining 10% are all true positives then they’ve still identified hundreds of examples of new types of RNA editing! This is the spirit of the argument made by Li et al. in their response [5], when they say that they “view the discovery of RDDs as the important point and find the exact number to be less salient”.

However, I am skeptical of the remaining sites as well. It is likely that other types of errors besides those described in the technical comments exist, but are hard to detect by the methods we’ve used. Indeed, two separate analyses of RNA editing by Peng et al. [8] and Bahn et al. [9] filtered out false positive sites based on criteria similar to those used in the technical comments. They then tried to validate non-A->I sites by Sanger sequencing. Even after performing rigorous filtering, at least 50% of the remaining non-A->I sites were false positives. Given that this assay is also not a perfect filter, the true fraction of false positives must be even higher, and I am not convinced that it’s less than 100%.

In sum, by selecting for the most “odd-looking” regions of the genome in an analysis, one enriches for strange and unexpected technical artifacts. Even if a given systematic error affects only 0.001% of the bases in the genome, you’d still expect to run across it 30,000 times if you look at the whole genome! (Or maybe half that if you look only at bases expressed in pre-mRNA). As Daniel wrote regarding his own work on nonsense SNPs (which we know exist, but still were quite difficult to identify reliably), the more interesting something is, the less likely it is to be real.
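The back-of-the-envelope numbers above are easy to check. The sketch below assumes a genome of roughly 3 gigabases and takes the "maybe half" figure for pre-mRNA at face value; both are rough approximations rather than measured quantities.

```python
from fractions import Fraction

GENOME_SIZE = 3_000_000_000          # approximate human genome size in bases
error_rate = Fraction(1, 100_000)    # 0.001% of bases, i.e. 1 in 100,000

# Expected number of sites hit by the systematic error, genome-wide.
expected_genome_wide = GENOME_SIZE * error_rate

# Rough guess from the text: about half the genome is expressed in pre-mRNA.
PRE_MRNA_FRACTION = Fraction(1, 2)
expected_in_pre_mrna = expected_genome_wide * PRE_MRNA_FRACTION
```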

Of course, it remains plausible that previously unidentified forms of RNA editing are active in humans, and RNA sequencing technology will certainly be important for determining whether such new forms exist. The comments published today, however, indicate that the analyses done by Li et al. are based on technical artifacts, and do not provide evidence for interesting biology. My opinion is that the Li et al. study should have been outright retracted. However, there is a small, but non-zero, probability that a handful of the reported non-A->I sites are real; readers can draw their own conclusion as to whether this justifies keeping the paper as part of the scientific record.

[1] M Li et al. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. doi:10.1126/science.1207018.

[2] Pickrell et al. (2012) Technical Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1210484.

[3] Lin et al. (2012) Technical Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1210624.

[4] Kleinman and Majewski (2012) Technical Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1209658.

[5] M Li et al. (2012) Response to Comments on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. doi:10.1126/science.1210419.

[6] Levanon et al. (2004) Systematic identification of abundant A-to-I editing sites in the human transcriptome. doi:10.1038/nbt996.

[7] Schrider et al. (2011) Very Few RNA and DNA Sequence Differences in the Human Transcriptome. doi:10.1371/journal.pone.0025842.

[8] Peng et al. (2012) Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. doi:10.1038/nbt.2122.

[9] Bahn et al. (2012) Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. doi:10.1101/gr.124107.111.

17 Responses to “Questioning the evidence for non-canonical RNA editing in humans”

  • Hi Joe,

    Thanks for your comment and response by Li et al. In addition to your blog post, I have further comments (though some points duplicating yours more or less):

    1. In Figure 2, the flanking sequence of CCND2 is: CCTTTTCCGTTTTTTTTTTT”TTATT”GTTGTTGTTAATTTTATTGC where “TTATT” in the middle is shown in the plot. Note that in this case, reads starting right before the T homopolymer run may be mapped with an A=>T mismatch because most aligners prefer a mismatch over a gap. The authors need to show the alignment in a ~50bp window for us to see what is really happening.

    2. Also in Figure 2, the second example MRS2, the flanking can be mapped to another place with one mismatch:

    000000001 caggaattatgttcatgggaagtggcctcatctggaggcgcctgctttcattccttggacg 000000061
    ========= ||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||| =========
    126003306 caggaattatgttcatgggaagtagcctcatctggaggcgcctgctttcattccttggacg 126003355

    This is a similar case to Figure 3A. In general, we try not to call SNPs from such reads because an error or a true SNP at the mismatch will lead to mismapping. More importantly, nearly all mappers tend to produce mapping errors in such a case, especially when there are additional errors/SNPs/INDELs in the read. This does not happen often, as the authors have argued, but hundreds of MB of human sequence, or probably a couple of MB of CDS, are in regions like this, and the authors are calling SNPs from a few reads rather than assuming 50% read support. SNP errors may be frequent.

    3. The read depth analysis seems irrelevant to the potential RDD artifacts. Note that bowtie distributes equally well mapped hits randomly. We can only see increased read depth if the reference is missing a paralogous copy of the gene. But in this case, we expect a heterozygote to be called from both the whole-genome and RNA-seq alignments, and the site will not be classified as an RDD. Thus in my view, the authors’ “read-depth” analysis only proves an effect that we know is unlikely to happen. We have not argued missing copies in the reference is a big problem with the authors’ RDDs.

    4. As to Table 2, I have not got their supplementary. My guess is they were aligning 10,210 50bp flanking sequences to the whole genome and found 9% of them have an equally good or better hit than the one to the RDD flanking sequence. 9% is surprisingly and exceedingly high, especially given that bowtie already believes the reads can be “uniquely” mapped. If we randomly draw 10000 good exon SNPs, probably the percentage is well below 1%. This already implies that their RDDs are highly enriched with paralogous sequences.

    5. About the RPL28 potential RDD. I do not have a complete theory to explain the experiment in Figure 5, but I need to point out that this “RDD” is unusual in that the stop codon occurs at the boundary of an exon. This RDD may potentially be caused by the exon boundary effect which you and others have mentioned, but the authors have not refuted. As to the translation of that 10aa peptide, that may be the result of a rare alternative splicing. You have shown in your previous paper that there may be many rare isoforms in human cells. To exclude the alternative splicing hypothesis, the authors need to find a peptide that bridges the RDD, not one following the RDD.

    In all, my view is the authors’ responses are not adequate to assure us that most of RDDs are not artifacts. I would suggest doing the following to address my concerns:

    a) For every filter, compute the ratio of A-to-I RDDs and non-A-to-I RDDs. If the ratio stays the same for all filters, non-A-to-I RDDs are likely to be real.

    b) Apply the CRG 36bp alignability mask available from UCSC. If after this filter, the authors can still find 10000 RDDs and/or the ratio of A-to-I and non-A-to-I stays the same, we will stop worrying about the false RDDs due to paralogous sequences.

    c) After alignment, trim the first and the last 5bp from each read and redo the analysis as Lin/Piskol et al did. The authors seem to suggest this leads to a lower A-to-I/non-A-to-I ratio in their response, but details are not given on how the sites are exactly classified. It would be much cleaner and easier for us to understand if they just trim off 10bp from each read. I know some groups are requiring each SNP to be supported by at least one read at the middle 1/3 part of the read. This is also a good filter.

    d) Get the raw reads for a human with deep coverage, map reads with bowtie and run their pipeline to call genomic SNPs with the exact settings they use for calling RNA-seq SNPs (not assuming heterozygotes get 50% read support). Compare the result to a published one or to the one from a standard pipeline (BWA+GATK/samtools). This helps to evaluate the quality of the pipeline.

    Finding RNA editing sites is very challenging, perhaps more challenging even than finding somatic mutations due to the distinct library prep of RNA-seq and the splicing as well. We need to be extremely careful about such sites when we try to find them in new sequencing data. My gut feeling is most publications might not get the number right and I know a few of the most experienced research groups on variant calling find that RDDs are very rare and are nearly all A-to-I (they believe the rest non-A-to-I are artifacts).


  • Hi Heng (note to readers: Heng Li is not the Li from Li et al.),

    Thanks for these comments.

    experienced research groups on variant calling find that RDDs are very rare and are nearly all A-to-I (they believe the rest non-A-to-I are artifacts).

    Yes, this is consistent with what I’ve heard as well.

  • Jeffrey Rosenfeld

    Hi Joe,

    This is a great post. I think one of the most important things you point out is the size of the genome and how even small errors get magnified. As you say “Even if a given systematic error affects only 0.001% of the bases in the genome, you’d still expect to run across it 30,000 times if you look at the whole genome!” People tend to forget this and have trouble understanding how inadequate 99% accuracy is in terms of a 3 gigabase genome. Similarly, we forget that when looking at human genomes, there are 7 billion samples on this planet and we have looked at <<<1% of them

  • Hi Jeff,

    Thanks for the comment. Yep, I think it’s easy to be impressed by raw numbers. 10,000 RDD sites sounds way too large for *all* of them to be false positives. But when you think instead about it being one in a hundred thousand sites, now it’s not so crazy: there are errors that pop up at rates much higher than that.

  • Hi Joe,

    Thanks for this blog post. As you can imagine, I fully agree with your analysis. I think the authors fail to address the main issues that were raised. After a quick reading of their response, I have a couple of comments to add:

    1) From the original 10,210 RDD sites, 8069 are also identified by GSNAP. That is, by changing only the alignment step, 21% of their RDDs are lost. Not a very robust detection pipeline.

    2) The fact that RDDs happen often in the first position of reads, and only in the first strand, as Jin Billy Li shows in his comment, may be explained by a bias in Illumina transcriptome sequencing. This bias is characterized in a paper published in 2010 by S. Dudoit:

    Figure 1 of that paper shows a plot of the nucleotide content at each position of mapped sequencing reads in multiple RNAseq experiments. They find a strong bias at the first position: an enrichment in G, a depletion in A. They show that this artifact is due to the random hexamer priming, and predict that this bias will only affect reads from the 5’end of the first strand. This analysis matches very well Billy Li’s observations, and supports his speculation about the cause of the enrichment of RDDs at the 5’end.

    This bias would also predict a higher frequency of A-to-G mismatches when reads are allowed to match imperfectly to the genome, as Cheung reports in her response.

    2) The biological mechanism suggested by the authors to explain why the RDDs are located mostly at the end of reads (indel RDDs) is not only improbable, as you point out very well here, but insufficient. If true, it would explain only a very small fraction of the sites they report. They find 1586 (15%) of the RDD sites close to an indel. Given that they need to explain 75-80% of their sites that are estimated FP by positional bias, this mechanism seems highly insufficient.

    3) In figure 2, they only show RNA-seq reads. In other individuals from 1000 genomes, which have a very high DNA coverage (individuals NA12891 and NA12892), reads with the alternative base are also seen. Since they are present at low proportion, the individuals are still called homozygous. The challenges at sequencing and mapping these difficult regions (homopolymers, paralogous, regions with indels) are evident when you look at the actual DNA reads, and not only at the inferred genotype, as they do. The same thing happens for the case of figure 3.

    4) I really, really do not understand why RNA editing in rice and Arabidopsis can be used as evidence to back up their claims (Table 1). RNA editing is a very common and well characterized process in many species, including animals, plants, fungi, protists and viruses. The types, frequencies and molecular mechanisms involved vary enormously across species. RNA editing in plants cannot be used to prove anything in humans. Of course, it adds a line to the table, which seems a little more impressive. I don’t dare look in detail at the other references…

    5) The read depth analysis is irrelevant, as Heng Li points out. Besides the arguments given by Heng, the authors do not mention removing repetitive regions or intronic/intergenic regions from their non-RDD control set, to match the genomic regions under study. I have no access to their supporting material yet, so I may be wrong here.

    Anyway, I am glad the comments are finally public. I was also glad to see that our analyses went in the same direction. Hopefully, more stringent controls will be required in future studies. How to adequately peer review these increasingly larger-scale studies, though, is a whole other issue.

  • Hi Claudia,

    Thanks for this, I agree with everything you write. Another discussion of the primer issue is here:

    Both discussions of the primer problem talk more about non-randomness in sequence coverage, rather than changed RNA sequences, but this may be just because they were more interested in the former.

    I think the Li et al. response boils down to, as they write:

    “[The] Comments do not refute the presence of RDDs; rather, the disagreement is over how many there are”.

    This is not entirely correct, in that I think the number of true RDD sites of unknown mechanism in human LCLs may actually be zero (and showing that 90% are clearly wrong would make most people consider that possibility). But it’s of course quite difficult to show the exact problem at all 10,000 sites, which I guess is what would be required for a “refutation”. I think we’ve made a very strong case that the results in this paper can’t be trusted; hopefully this will encourage a little more skepticism in the field.

  • Brenton Graveley

    This is an excellent blog post and great follow up comments. I would like to thank all the authors of the three comment papers for pointing out the glaring flaws of the Li et al study.

    One additional source of error has to do with the way in which Li et al performed the alignments. Rather than aligning the reads to the entire genome, they only aligned them to the annotated transcriptome (mostly spliced mRNAs). This represents an extremely small portion of the genome and lacks the vast majority of the transcribed regions of the genome. Thus, reads that align perfectly to a region of the genome not present in the annotated transcriptome, would be forced to align with a mismatch to a region of the genome and thus look like an RDD site, even though it isn’t. It is a shame that Li et al did not align the reads to the entire genome.

    In addition, the western blot shown to validate the RDD site in RPL28 used an antibody that doesn’t recognize the amino acid in the RDD site itself. Li et al interpret this as showing that the RDD site is real, but the data doesn’t directly show this.

    I agree entirely with Joe that all evidence points to the likelihood that all of the RDD sites (with the exception of some A-to-G and C-to-U RDDs) are due to technical or analytical artifacts.

  • Brent, thanks for the comment.

    I agree entirely with Joe that all evidence points to the likelihood that all of the RDD sites (with the exception of some A-to-G and C-to-U RDDs) are due to technical or analytical artifacts.

    Yes, this reminds me that I should be more clear (which I wasn’t in the post) to readers following this that C->U RNA editing is also known to exist. Not sure it happens much in the cell type being studied here, but there might be a few real sites of that type in the Li et al. data.

  • Irrespective of whether the 2011 report of Li et al. is correct or incorrect, a theoretical case lending some support to it, has been on the table for many years. The case is most clearly presented in our 2002 paper in Trends in Immunology (vol. 23, 575-579), and the clearest current support comes from the discovery of CRISPR systems in bacteria.
    For mRNA read “intracellular RNA antibody”. Irrespective of what that mRNA encodes, if its sequence happens to interact with the nucleic acid of a viral pathogen, then possession of that mRNA may be adaptively advantageous. Under normal conditions, a cell has a small specialized population of RNAs, but under stress (e.g. infections) it would seem beneficial for a cell both to synthesize a wider range of RNAs, and to mutate them so offering a wider range of “RNA antibodies” for reacting with potential pathogens. Some investigators may prepare their RNAs from cells under conditions such that some stress is inevitable. Others may prepare their RNAs under less stressful conditions. Thus, different investigators may come to disagree. For further background please see my textbook “Evolutionary Bioinformatics” (2nd edition, Springer, 2011).

  • Great post Joe, and I am glad to finally see these Technical Comments out there! The idea of widespread RDDs seems to be pretty indefensible after reading these, so it is no surprise that Li et al.’s response is woefully inadequate.

    I just have one thing to add about the paralog issue. Li et al. argue that paralogs are not an issue because there is no difference in the read-depth distributions between RDDs and other sites. I would argue that RDDs would be expected to have lower read depth, all else being equal, because the presence of mismatches can lower read depth substantially. In any case, I actually think the distributions shown in Fig. 3B look somewhat different, with RDDs having higher read-depth. No surprise they say nothing quantitative here. It is clear that there is no need to invoke paralogs at all to argue that the vast majority of RDDs are artifactual; I just wanted to point out that Li et al. fail to defend their results on this point as well as all of the others.

  • Hi Dan, that’s an interesting observation!

    Eyeballing Figure 3B here:

    I totally agree. There does look to be a difference in read depth between the top and bottom plots. I imagine it would be accentuated if they were to look at the entire genome.

    You can imagine many reasons why this might be the case (the whole chromosome must have many regions with reduced mappability), but given the way that plot looks, it’s definitely odd to assert there are no differences without giving the numbers.

  • Most discussion of RDDs is centered on in silico pipeline usage and the potential false positives that it may create. To me, everyone is forgetting that this is Biology and not Informatics. If, indeed, there are so many previously undiscovered RDDs, then to me, doing the following experiment will nail it. Produce a stable cell line with a KO of a gene that has been shown to possess RDDs and another for a gene that doesn’t (one can do this for a dozen genes in different regions of the genome, with different GC distributions and different sizes), transfect the same cells with the genes that show RDDs and the ones that don’t, sequence RNA, cDNA and gDNA from all the cell lines (including one with the KO, which will address the issue of multiple forms being active), amplify the genes of interest with upstream gene-specific primers and do Sanger sequencing. That’s all. This will nail things. Also, it would be good to select genes with RDDs that possess shorter intergenic sequences so that one can take primers at gene-gene boundaries to amplify before sequencing. Why can’t Li et al. do this?

  • Xinshu (Grace) Xiao

    Thanks Joe for this excellent blog post. I enjoyed reading all the great follow up comments and agree with all.

    Our work (# 9 cited by Joe above: Bahn et al, Genome Res) looked at RDD sites in RNA-Seq of human cancer cells which came online in Sept 2011 (it was under review when Li et al’s 2011 paper came out). Our paper was cited in the Li M et al’s response to the three comments as a “support” of their findings (Table 1). This may have caused the confusion that our results are of the same nature as theirs. I wanted to point out that there are fundamental differences between their results and ours. We did find thousands of potential A-to-I sites, but most of them were in non-coding regions (introns and UTRs). Only 45 sites were in coding regions, as opposed to their claimed 10,000 coding RDDs. In addition to validating >100 predicted A-to-I sites by Sanger, we also did ADAR1 knockdown followed by RNA-Seq in U87 cells and confirmed that the number of predicted A-to-I sites is very small in the knockdown data using our method. I think Li et al should do more extensive validation of their claimed “correct” results, as pointed out in several comments above.

    Another difference between their results and ours is that we found that A-to-I sites constitute the primary type of RDDs. The other types are very rare in general. In U87, 62% of our RDDs were A-to-I; in breast cancer data, 82% were A-to-I. If requiring at least 20% editing level, percentages of A-to-I are much higher. Of course, A-to-I editing is well-known and the DARNED database lists a very large number of such possible sites. Thus, it is not a big surprise to find many of them here.

    Importantly, as Joe pointed out, our validation of non-A-to-I sites using Sanger/clonal sequencing showed that only about 50% of them could be true (keeping in mind that these validation experiments have their own caveats). Thus, our results show that non-A-to-I editing in human (at least in these cancer cells) is rare and cannot be identified accurately in RNA-Seq, which is why we emphasized “A-to-I editing” in the title of our paper. We wanted to convey the message that human RDDs are mostly of the A-to-G type, which differs from Li et al’s claims. Thus our paper should not be viewed as a “support” of their conclusions.

    As Joe pointed out, both our paper and the # 8 cited above (Peng et al, Nature Biotech 2012) used similar filtering steps as in the 3 commentaries to remove potential false positives. I completely agree with the other comments here that rigorous analysis methods need to be used with caution to look at RNA-editing in RNA-Seq and there is a lot of room for future studies in methodology and biology in this field.

  • Hi Grace,

    Thanks for the comment. I very much enjoyed reading your paper, and indeed noticed that your results are fundamentally different from those in Li et al. So I too was surprised that your paper was cited by Li et al. as support for their conclusions!

    For readers, here’s the link to Grace’s paper again:

  • Hi Joe,

    This is a great set of blog posts by you. Your posts, as well as the other comments above, really highlight the difficulties in calling RNA editing sites from high-throughput RNA-sequencing data. It’s not trivial to pinpoint the editing sites while filtering out the false positives.

    Following up on the issues presented in the three technical comments, we (from Billy Li’s lab as well as Cold Spring Harbor Lab) developed a pipeline to accurately call RNA editing sites. In agreement with Heng Li above, we find that almost all of our editing sites are of the A-to-I type. I just want to make three quick points:

    1. The read mapping tool matters. There is a reason why genomic DNA SNP calling studies such as 1000 Genomes prefer to use a gapped alignment algorithm (BWA) over an ungapped algorithm (Bowtie). Bowtie does not provide usable mapping quality scores or soft clip mismatches occurring at read ends. This helps to avoid the situation where the majority of mismatches occur at read ends (a point mentioned by all three technical comments).

    2. Because A-to-I editing is quite pervasive in the Alu repeats, the real difficulty lies with calling editing sites in non-Alu regions of the genome. Indeed we found that with minimal filtering of variants (removal of genomic DNA variation and artificial mismatches caused by random hexamer priming), 96% of the mismatches in Alu repeats are of the A-to-G type. In stark contrast, we had to utilize a rigmarole of filtering steps for non-Alu editing sites, because the prevalence of editing is much lower.

    3. We were unable to validate any non-A-to-G mismatches using PCR and Sanger sequencing. We took pains to ensure that our primers were specific to the desired regions. We saw that some of the previously “validated” non-A-to-G mismatches were due to the PCR primers amplifying a paralogous region of the genome in addition to the desired transcript. Our opinion is that all of these non-canonical editing sites are caused by technical artifacts.

  • Hi Joe,

    I read the Li et al. paper when it came out. It was referred to me by a friend who asked me to check the proteomics part of the work, as there would be rather significant consequences for proteomics data analysis if this type of frequent RNA editing occurred. A quick examination of the proteomics results showed that the assignments supporting the editing idea were 100% false positives, suggesting that no changes in our current proteomics data analysis methods were necessary. Since RNA experimental data analysis is not my forte, I’m glad to see that the conclusions I came to on the basis of their mass spectrometry data were consistent with your (and others) thorough analyses of their RNA data.

  • Tao Liu & Jie Wang

    Dudes, blast this sequence in NCBI, see what’s happening

