Database Scope and Content

GWAS Central aims to provide an extensive, centralized compilation of summary-level findings from human genetic association studies, both large and small. Researchers need an easy way to access all existing association study data for their genes, genome regions, or diseases of interest. Such a depository will allow true positive signals to be more readily distinguished from false positives (type I errors) that fail to replicate consistently. It will also aid in identifying technical artefacts in genotyping procedures, help elucidate population-specific signals, and reduce the serious problem of publication bias, which cannot be solved unless sites like GWAS Central exist where negative studies can be reported as easily and as quickly as positive studies.

To produce the content of GWAS Central, our curators actively gather large datasets, such as genome-wide association study (GWAS) findings, from many different public domain projects. We intend to grow this into a comprehensive effort in the very near future. All the data sources we have approached for help in this regard have been extremely helpful and forthcoming, and many automated data gathering pipelines from the larger projects have already been set up. For smaller datasets, such as gene-specific or region-specific investigations and replication efforts by small and medium-sized laboratories, we encourage researchers to submit their study findings to us, and we provide help text explaining how to do this.

For the future, we are devising a standalone tool that submitters will be able to download and install locally, which will actively guide researchers through the process of gathering and checking their data before submitting it to us. The tool will organize their submission content into an XML formatted document that is stored on the submitter’s computer.
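As a rough illustration of what such a tool might do, the sketch below checks a few summary-level association records and writes them into an XML document. The function name `build_submission` and all element and attribute names (`submission`, `study`, `association`, `marker`, `phenotype`, `p_value`) are hypothetical and chosen for this example only; they do not describe the actual GWAS Central submission format.

```python
import xml.etree.ElementTree as ET


def build_submission(study_name, associations):
    """Check association records, then return them as an XML string.

    All element and attribute names here are hypothetical, not the
    real GWAS Central submission schema.
    """
    # Basic checking step: every p-value must lie in [0, 1].
    for record in associations:
        if not 0.0 <= record["p_value"] <= 1.0:
            raise ValueError(
                f"p-value out of range for marker {record['marker']}"
            )

    # Organise the checked records into a small XML tree.
    root = ET.Element("submission")
    study = ET.SubElement(root, "study", name=study_name)
    for record in associations:
        ET.SubElement(
            study,
            "association",
            marker=record["marker"],
            phenotype=record["phenotype"],
            p_value=repr(record["p_value"]),
        )
    return ET.tostring(root, encoding="unicode")
```

The returned string could then be written to a file on the submitter's computer, for example with `open("submission.xml", "w").write(...)`.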

As mentioned above, data in GWAS Central is restricted to summary-level or aggregated information (i.e., results on groups of individuals, but no individual-level genotypes or phenotypes). As a consequence, the project is not affected by issues such as anonymising individuals, obtaining informed consent, or data security. All records do, however, carry links and acknowledgements that lead back to the original data source, so that users who wish to obtain the non-aggregated data can make suitable requests to the relevant data access authorities.

As of July 2008, GWAS Central content includes the full set of markers listed in dbSNP. In the near future we will also incorporate the data from UniSTS and DBGV, plus all current dbSNP allele and genotype frequency data. Upon each new build of these depositories, we synchronise GWAS Central records with any changed content in the new release (e.g. marker or allele deletions or mergers) using custom software we have built for this purpose, and we present everything on the same DNA strand as used by the marker source database. This updating procedure also ensures that correct relationships are retained between markers/alleles and their frequency and association constituents.
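The synchronisation step can be pictured with a minimal sketch like the one below. It is not the custom software mentioned above. It assumes a hypothetical in-memory representation in which a new build supplies a mapping of merged marker IDs (old ID to surviving ID) and a set of deleted IDs, and it leaves strand handling out entirely.

```python
def sync_markers(markers, associations, merged, deleted):
    """Apply marker merges and deletions from a new source build.

    markers:      set of marker IDs currently held
    associations: list of dicts, each with a "marker" key
    merged:       dict mapping a retired marker ID to the ID it merged into
    deleted:      set of marker IDs removed from the source database
    Returns (updated_markers, kept_associations, orphaned_associations).
    All of these structures are hypothetical, for illustration only.
    """

    def resolve(marker_id):
        # Follow merge chains (a -> b -> c), guarding against cycles.
        seen = set()
        while marker_id in merged and marker_id not in seen:
            seen.add(marker_id)
            marker_id = merged[marker_id]
        return marker_id

    updated_markers = set()
    for marker_id in markers:
        current = resolve(marker_id)
        if current not in deleted:
            updated_markers.add(current)

    kept, orphaned = [], []
    for record in associations:
        current = resolve(record["marker"])
        if current in deleted:
            # The marker no longer exists; set the record aside for review.
            orphaned.append(record)
        else:
            # Re-point the record at the surviving marker ID.
            kept.append({**record, "marker": current})
    return updated_markers, kept, orphaned
```

Setting orphaned records aside, rather than silently dropping them, is one way to keep the relationships between markers and their association data open to inspection after each update.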

Genetic association study data, along with allele and genotype frequency findings, are layered on top of the extensive marker content of GWAS Central. This layer of ‘laboratory results’ is complex in nature, so to help users navigate it we structured it in a way compatible with a journal article: a ‘Study’ with various Experiments, Sample Panels, and Phenotypes within it, wherein each Experiment contains data on Markers, Frequencies, and Associations. More information on these Study components and their relationships is provided here. The data modelling behind this Study and Experiment concept was then taken forward by many institutes worldwide to create what is now an Object Management Group approved standard called the ‘Phenotype And Genotype Experiment Object model’ (PaGE-OM). To find out more about the GWAS Central data model, download the data model diagram and/or the MySQL relational schema definition.
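The nesting described above can be sketched with a few plain data classes. This is only an illustration of the Study and Experiment hierarchy as worded in this section. The class and field names are assumptions made for the example and do not reproduce the actual GWAS Central schema or the PaGE-OM specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frequency:
    marker_id: str
    sample_panel: str
    allele: str
    frequency: float


@dataclass
class Association:
    marker_id: str
    phenotype: str
    p_value: float


@dataclass
class Experiment:
    # Each Experiment holds data on Markers, Frequencies, and Associations.
    name: str
    markers: List[str] = field(default_factory=list)
    frequencies: List[Frequency] = field(default_factory=list)
    associations: List[Association] = field(default_factory=list)


@dataclass
class Study:
    # A Study groups Experiments, Sample Panels, and Phenotypes.
    name: str
    sample_panels: List[str] = field(default_factory=list)
    phenotypes: List[str] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)

    def associations_for(self, marker_id: str) -> List[Association]:
        """Collect all associations for one marker across experiments."""
        return [
            assoc
            for experiment in self.experiments
            for assoc in experiment.associations
            if assoc.marker_id == marker_id
        ]
```

With this layout, a call such as `study.associations_for("rs123")` gathers every summary-level result reported for that marker within a single study, whichever experiment it came from.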