I first came across openSNP when the team won the PLoS/Mendeley Binary Battle in late 2011. This competition was open to software that integrates with Mendeley*, a suite of web and desktop tools designed to manage bibliographies. While the scope of the competition was quite broad, the winners described their project in an interview in a way that relates directly to themes of interest to the Genomes Unzipped crew and readers. To quote them: “we try to be a community-driven platform for people who are willing to share phenotypic and genetic information for the public”. Given these aims, I decided to look into openSNP to understand what the service offers. I also contacted Bastian Greshake from the openSNP team, who has been very helpful in answering my questions. To make a long story short, this is a fantastic idea and a great implementation, a real must-try for anyone interested in the direct-to-consumer (DTC) genetic market. Keep reading for the full story.
Team and funding
OpenSNP is run by a group of four German researchers/students/web developers, most of whom met during their undergraduate studies. The site was essentially self-financed in its early days, but those costs have been recovered through the $10K prize from the Mendeley/PLoS competition. The team has received further support from the German Wikimedia foundation, and this money has been invested in purchasing 23andMe kits for users interested in sharing genotype data through the openSNP platform. Many more details about recent developments at openSNP can be found on this blog.
Uploading genetic data and linking genotypes with the literature
Sharing data through openSNP is simple: download your genotype calls from your DTC company of choice and upload the file to the openSNP website. I had no problem at all uploading my own genetic data and I suppose most users won’t either. Three formats are currently supported: 23andMe, deCODEme and Family Tree DNA. This covers a good chunk of the DTC companies out there, but not all of them. Everything is genotype based, so there is no mention of sequence data, but this is what people have right now so it is hard to ask for anything else.
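For readers curious about what is inside the file they upload, here is a minimal sketch of reading one. It assumes the 23andMe-style layout (tab-separated text, comment lines starting with "#", and four columns: rsid, chromosome, position, genotype); exports from other companies are laid out differently and would need their own reader. The function name and the two sample rows are invented for illustration, and this is not openSNP's own code.

```python
import io


def parse_23andme(handle):
    """Return a {rsid: genotype} dict from a 23andMe-style raw data file."""
    genotypes = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and the commented header block.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # malformed row, ignore it
        rsid, _chromosome, _position, genotype = fields[:4]
        genotypes[rsid] = genotype
    return genotypes


# Made-up sample content; positions and genotypes are illustrative only.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs9939609\t16\t53820527\tAT\n"
    "rs4477212\t1\t82154\tAA\n"
)
calls = parse_23andme(sample)
print(calls["rs9939609"])
```

With a real export you would pass an open file instead of the in-memory sample, for example `parse_23andme(open("my_genome.txt"))`, where the file name is whatever your download was called.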
Now the magic starts. Say you are interested in a specific SNP, for example rs9939609 in the gene FTO. This SNP is well known to be associated with body mass index (the “fat gene”). Each SNP links to a page on openSNP that lists the variant and allele frequencies (estimated from openSNP users) as well as the scientific literature that implicates this variant. Looking at rs9939609, I can see seven publications in the PLoS open access journals and 25 publications listed by Mendeley users. You also find some effect sizes associated with the SNP and, of course, openSNP tells you what your own genotype is (provided that you uploaded the relevant information).
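To make the "allele frequencies estimated from openSNP users" idea concrete, here is a small sketch of how such an estimate could be computed from a list of genotype calls at one SNP. The genotype list is invented, the function name is hypothetical, and I am assuming no-calls are written as "--"; how openSNP actually computes its figures is not something I have checked.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies from genotype strings such as 'AT'."""
    counts = Counter()
    for genotype in genotypes:
        for allele in genotype:
            # Count only real bases; this skips no-call symbols such as '-'.
            if allele in "ACGT":
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# Invented calls for five hypothetical users at a single SNP.
user_genotypes = ["AT", "TT", "AA", "AT", "--"]
freqs = allele_frequencies(user_genotypes)
print(freqs)  # {'A': 0.5, 'T': 0.5}
```

With only a few hundred users, estimates like these will be noisy and will reflect who chose to upload rather than any reference population.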
Arguably the most useful feature is the flexible search. In particular, the search can be driven by trait rather than genotype. Say you are interested in the genetics of BMI: you can enter BMI in the search box and find all papers with BMI mentioned in the title, along with the SNPs that these publications link to. The same goes if your search is based on a gene. One small missing feature is that the resulting SNP list cannot be further filtered to the SNPs available in the user’s genotype file, even though many users will be interested in pulling results that they can link to their own data. The process of pulling publications is automated, but the summary link with phenotype data is generated from SNPedia. So there is some manual curation involved, even though it is not handled by the openSNP team.
Adding trait data and running association studies with openSNP
It is not enough to upload genetic data, as the obvious aim is to use user-generated data to run genetic association studies. So what type of traits can be linked to your genotype? Essentially anything you want. A “Create a new phenotype” box allows you to provide any information you see fit about a phenotype of interest. More simply, you can pick from the phenotypes already proposed by users, of which I currently see 106. Numbers are still relatively modest: 116 users have answered the eye colour question and 101 the right-handed/left-handed question. The third most popular is height, with 88 responses. Bastian mentioned approximately 300 users today, 150 of whom have uploaded genetic data. So clearly not enough to run well-powered association studies, but this database can only grow.
A slightly unusual feature of the trait database is that, while you can reuse a previously entered answer, replies are completely open-ended if you want them to be. As a consequence, for handedness, where we could reasonably expect three answers (left, right, both), I see 11 replies right now, including “right-handed”, “right handed” (note the absence of “-”), “Right-handed” (note the “R” instead of “r”), “Right”… So there is clearly some confusion here, which could be handled at least to some extent by a computer-based filtering process. Right now, it definitely adds a layer of complexity to the analysis process.
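As an illustration of what such a filtering step might look like, here is a rough sketch that folds the spelling variants above into a single category. The mapping table and function names are my own invention, not anything openSNP provides, and a real clean-up would need a longer table built from the answers actually present in the database.

```python
import re
from collections import Counter


def normalize_answer(raw):
    """Lower-case the answer and collapse hyphens, underscores and spaces."""
    text = raw.strip().lower()
    return re.sub(r"[-_\s]+", " ", text)


# Hypothetical mapping from cleaned-up answers to three expected categories.
CANONICAL = {
    "right handed": "right",
    "right": "right",
    "left handed": "left",
    "left": "left",
    "both": "both",
    "ambidextrous": "both",
}


def canonical_handedness(raw):
    """Map a free-text answer to 'right', 'left', 'both' or 'unrecognised'."""
    return CANONICAL.get(normalize_answer(raw), "unrecognised")


answers = ["right-handed", "right handed", "Right-handed", "Right"]
tally = Counter(canonical_handedness(a) for a in answers)
print(tally)  # Counter({'right': 4})
```

Anything that falls into "unrecognised" would still need a human look, so this reduces the manual work rather than removing it.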
Otherwise the process of querying the trait database is straightforward: click on the trait of interest and you will receive a link to a zipped file containing all the files uploaded by openSNP users who selected this specific phenotype. For the eye colour “brown” I received a link to an 88 MB zipped file for 10 users. File names include user ID, gender and year of birth. So it is possible (and easy) to identify the individuals with the trait in question (see below for privacy issues).
Data sharing and privacy concerns
OpenSNP relies largely on data sharing from its users to enrich the platform, and the sharing process is indeed very loose. Providing an answer to one of the questionnaires directly links this answer to your genotype data and user ID, so there is no option to share medical data while preserving privacy. As soon as you put your data on openSNP, they are fully identifiable and linked with your answers to the questionnaires. Your answers can be removed if you decide to do so, and the same goes for genotype data, which can be deleted. But the fact remains that you need to be pretty relaxed about exposing your traits and genotype to public view if you decide to use openSNP. This is clearly stated when the data are uploaded, but users who only glance over this information may not be fully aware of the extent of the sharing.
Future plans
One area that probably needs further work is a more sophisticated handling of consent, as privacy issues may arise as the user base grows. In our discussion, Bastian mentioned that future versions should include consent handling through a dedicated website called weconsent.us, whose aim is to streamline the consent process. Another planned improvement is a more constrained questionnaire system, similar to 23andMe’s, where users’ answers fit better into predefined boxes. As the user base grows, running association tests within the website is also an aim, but the sample size is not yet sufficient to implement and test this properly.
Wrap-up
Overall, the potential of such a platform is amazing. It should be possible to do wonderful things with the tool. Some of the functions are already mature: selecting a SNP will pull your own genotype, show the population frequency, and link to the literature/SNPedia. Search can work by trait, gene or variant, and the results appear to be quite exhaustive. It is in my opinion the best tool available today to explore genotype data and link these data to human traits. The data are exhaustive and somewhat messy, so it may be difficult for users not experienced with the scientific publication process to find their way. For this category of users, the 23andMe approach of providing well-explained and manually curated search results is probably more appropriate. But if you want the freedom to explore the full scientific literature, openSNP is what you are looking for.
However, going beyond the ability to search and explore will be difficult. The fully automated, non-curated way of gathering trait information makes the process somewhat disorganized, in stark contrast with the carefully crafted questionnaires put together by 23andMe, for example. Reaching a critical sample size to run sufficiently powerful case-control studies will be difficult. And if openSNP manages to get that far, the team will need to find ways to properly scale up all the features already in place. Nevertheless, I wish them a lot of success and hopefully the user base will keep growing.
* Try Mendeley if you haven’t. In my personal opinion, it beats all commercial alternatives for bibliography management (and includes an excellent plugin for integration with Word).
It seems Open Source genetic data is the coming thing: for a current use of the Portable Legal Consent being developed by Consent for Research, see:
http://www.nature.com/news/open-data-project-aims-to-ease-the-way-for-genomic-research-1.10507
which has a certain Dan Vorhaus volunteering his time …