The iReceptor Gateway: FAIR Open Data, Interoperability, and Data Curation Promotes Rapid Response to COVID-19
By Robyn Nicholson, Intern, and Mark Leggott, Executive Director
As researchers around the world continue working tirelessly to address the ongoing COVID-19 global pandemic, rapid, open access to reliable data from trustworthy repositories is more important than ever. In particular, research into understanding the immune response to COVID-19 has increased the demand for highly specialized types of data. One such area of research is the molecular basis of the immune response to infection in the context of an individual’s complete set of genes (i.e. the genome), or the science of immunogenomics. Understanding the immune response through immunogenomics is critical to the development of diagnostics and therapeutics against cancer, infections, and autoimmune diseases, leading to advances in biomedical research and improved patient care.
One type of genomic data that is increasingly important in developing novel immunotherapies is sequence data from the Adaptive Immune Receptor Repertoire, known as AIRR-sequencing (AIRR-seq) data. AIRR-seq data comprise the host’s antibody/B-cell and T-cell receptor repertoires, an immensely diverse set of molecules that recognize pathogens – including newly evolving pathogens such as the novel coronavirus – and then mark these pathogens for destruction. AIRR-seq has enormous promise for understanding the dynamics of the immune repertoire in vaccinology, infectious diseases, autoimmunity, and cancer biology (Antibody Society), and rapid developments in sequencing technology are adding to this understanding. While AIRR-seq data are important, they are also very complex and require specialized tools and services to make them more accessible to researchers.
Among the resources providing critical access to vital immunogenomic data is the iReceptor Gateway, a software platform that facilitates the curation, analysis, and sharing of AIRR-seq data from multiple labs and institutions around the world. Located at Simon Fraser University (SFU), the iReceptor Project’s initial funding came from the CANARIE Research Software Program. More recently, iReceptor has received additional funding from CANARIE’s Research Data Management (RDM) Program, which supports the development of software tools that enable Canadian researchers to adopt RDM best practices. This CANARIE funding allowed the iReceptor Team to successfully compete for a CFI Cyberinfrastructure grant in 2016, and a CIHR/EU Horizon 2020 collaborative grant, iReceptor Plus, with 19 participating institutions from 9 countries.
The iReceptor Gateway and COVID-19 Repository
The iReceptor Gateway integrates large, distributed, AIRR-seq data repositories, with the goal of connecting this network of repositories into an AIRR Data Commons, allowing queries across multiple projects, labs, and institutions (What is iReceptor?). The platform follows standards for sharing and interoperability developed by the Adaptive Immune Receptor Repertoire (AIRR) Community, and is committed to enabling researchers to increase the value of their data through sharing with the community. The AIRR Community, part of The Antibody Society, is a grassroots group of immunologists, immunogeneticists, and computer scientists dedicated to sharing data through the AIRR Data Commons. iReceptor is also a leading member of the iReceptor Plus Consortium, an international effort aimed at promoting human immunological data storage, integration, and controlled sharing for a wide range of clinical and scientific purposes (iReceptor Plus, Overview).
Most recently, the iReceptor Project received funding through Funding Call 2 of the CANARIE RDM Program, which focused on the evolution of existing RDM platforms, repositories, and services to increase interoperability nationally and internationally. The iReceptor Project also recently launched a COVID-19 data repository that now holds over 180 million AIRR-seq sequences from eight studies of COVID-19 patients (Query for “COVID-19”, August 5, 2020). The platform also allows researchers to compare these COVID-19 data to ~2.7 billion immune receptor sequences from other infectious diseases, cancer studies, autoimmune patients, and healthy control individuals.
The COVID-19 repository has driven a dramatic increase in iReceptor visibility and usage, adding more new users in early July 2020 than the platform would normally see in six months. The growing success of the iReceptor Project highlights the platform as a remarkable example of the impact of well-designed software and open data sharing, and of what is possible when outputs and associated infrastructures are citable, linkable, and interoperable.
A recent study that examined next-generation sequencing of T- and B-cell receptor repertoires from COVID-19 patients specifically cited the iReceptor Gateway as “an actively updated repository … opened for public scientific use that will allow researchers with different backgrounds to test their individual hypotheses on a growing dataset” (Schultheiß et al., 2020). Not only did this citation likely contribute to iReceptor’s recent surge in usage, it also demonstrates the impact of the platform’s capabilities.
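
A cross-repository search such as the “COVID-19” query cited above can be pictured as one filter sent to each repository, with the per-repository results summed. The sketch below is only an illustration of that idea, not the Gateway's implementation. It assumes a JSON filter shape of the kind described in the AIRR Community's API documentation. The field name `study.study_title`, the repository names, and the counts are hypothetical placeholders, and no network request is made.

```python
import json


def build_filter(field, value, op="contains"):
    """Return a query body holding a single filter (assumed ADC-style shape)."""
    return {"filters": {"op": op, "content": {"field": field, "value": value}}}


def total_sequences(counts_by_repository):
    """Sum per-repository sequence counts into one federated total."""
    return sum(counts_by_repository.values())


# Hypothetical field name; check the repository's schema before relying on it.
query = build_filter("study.study_title", "COVID-19")
query_json = json.dumps(query, sort_keys=True)

# Hypothetical counts for illustration only; these are not real repository figures.
example_counts = {"repository_a": 1200, "repository_b": 800}
federated_total = total_sequences(example_counts)

print(query_json)
print(federated_total)
```

In practice the same request body would be posted to each repository in the AIRR Data Commons, which is what lets a single search span multiple projects, labs, and institutions.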
Importance of FAIR and Data Curation
An important aspect of iReceptor’s functionality, or of any domain-specific repository platform for that matter, is adherence to the FAIR Principles and effective data curation. As iReceptor notes, “Data curation is a fundamental part of making scientific data Findable, Accessible, Interoperable, and Reusable (FAIR)” (iReceptor, Data Curation). AIRR-seq studies are diverse and complex, and curating them often requires specialized domain knowledge.
The table below summarizes how the iReceptor platform has evolved, and continues to evolve with its latest CANARIE funding, to better support the key elements of the FAIR Principles. The iReceptor platform is an exemplary model for how the FAIR Principles intersect in a software development process (Toward FAIR principles for research software). The international AIRR community also maintains an open process for defining and maintaining associated standards, and intersects with other relevant communities and standards efforts (e.g. GA4GH, RDA, and IUIS).
| FAIR Element | iReceptor Implementation |
|---|---|
| F1: (meta)data have a globally unique and eternally persistent ID | The AIRR Standards link to external PIDs for specific metadata record fields. Such PIDs are typically assigned by well-known data providers (e.g. unique Study IDs from the International Nucleotide Sequence Database Collaboration, or DOIs for publications). Internal metadata objects have PIDs that are unique within an AIRR Data Commons Repository. The AIRR Community is currently working on a globally unique PID system for entities within repositories in the AIRR Data Commons. |
| F2: data have rich metadata | The MiAIRR data standard describes the minimal datasets associated with a study; the AIRR schema describes the metadata elements. These have been described in the literature. |
| F3: metadata specifies the data ID | AIRR metadata records include a range of unique IDs, including the Study, Publications, Species, Disease, Study Type, Tissue, CURIE ontology links, and a variety of external PIDs. |
| F4: (meta)data indexed in a searchable resource | The iReceptor Gateway provides simple and advanced search functionality for a number of integrated external repositories via a standard API. |
| A1: (meta)data retrievable by their ID using a standardized protocol | The iReceptor Gateway provides reusable PIDs for queries (which can be bookmarked for later use/citations), and individual metadata records. |
| A1.1: protocol is open, free and universally implementable | All AIRR community standards are open and accessible. |
| A1.2: protocol allows for Authentication/Authorization where needed | The iReceptor Gateway requires a login, and uses Tapis as an Identity Provider and OAuth2. |
| A2: metadata always accessible | Metadata about the iReceptor repositories are registered at FAIRsharing.org with the DOI record https://fairsharing.org/biodbcore-000974/. Metadata from individual repositories is open and accessible using the AIRR Data Commons API. |
| I1: (meta)data use a formal, accessible, shared, broadly applicable language for knowledge representation | The MiAIRR data standard describes the minimal datasets associated with a study; the AIRR schema describes the metadata elements. These have been described in the literature. |
| I2: (meta)data use vocabularies that follow FAIR principles | The AIRR community standards are based on the FAIR Principles (The ADC API). |
| I3: (meta)data include qualified references to other (meta)data | The AIRR schema and canonical AIRR Object record provide unique IDs for all related Study outputs. |
| R1: (meta)data richly described with accurate and relevant attributes | The AIRR and MiAIRR standards provide a rich combination of minimal descriptive metadata records, and fields for describing Study elements. |
| R2: (meta)data released with a clear and accessible data usage license | The AIRR Data Commons query API has a “license” attribute in the web response that provides a mechanism for repositories to state the data usage license for data at the repository level. |
| R3: (meta)data associated with detailed provenance | iReceptor repositories provide detailed provenance information about data that resides within the repositories. |
| R4: (meta)data meet domain-relevant community standards | The AIRR and associated schemas and documentation use the Creative Commons Attribution 4.0 International license. |
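
The R2 row in the table can be illustrated with a short sketch of how a client might read the repository-level license. It assumes, hypothetically, that a repository response is JSON with a top-level `Info` object carrying the “license” attribute mentioned in the table. The exact nesting, the `name` key, and the sample payload (including its license value) are assumptions for illustration, so the repository's own API documentation should be consulted for the real layout.

```python
import json


def repository_license(response_text):
    """Return the license stated in a response, or None if it is absent."""
    payload = json.loads(response_text)
    info = payload.get("Info", {})  # assumed top-level key
    license_info = info.get("license")
    if isinstance(license_info, dict):
        # Assumed shape: a small object with a human-readable "name".
        return license_info.get("name")
    return license_info  # a plain string, or None when not stated


# Hypothetical sample response; not output from a real repository.
sample = json.dumps(
    {
        "Info": {"title": "example repository", "license": {"name": "CC BY 4.0"}},
        "Repertoire": [],
    }
)

print(repository_license(sample))
```

Returning `None` when the attribute is missing lets a caller distinguish “no license stated” from a stated license, which matters when deciding whether data can be reused.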
The iReceptor platform and the complex nature of immunogenomics data also highlight that data curation often requires domain-specific expertise. It also requires skills that ensure RDM best practices are followed to maximize data quality, access, and reusability. Bridging the gap between domain-specific and data management expertise is critical to the future of genomics research, and it requires a community-based system with a robust approach to training across the spectrum (Huang, Jörgensen & Stvilia, 2015).
In October 2019, the Portage Network, in collaboration with McMaster University Library, hosted the Canadian Data Curation Forum (funded by SSHRC), with the goal of establishing a national Community of Practice to catalyze the development/adoption of data curation standards, practices, tools, and skills across disciplines and institutions. The final event report provided recommendations for a national approach to data curation services in Canada, including engagement with researchers, data producers, and other stakeholders to ensure development of services relevant to current needs, as well as investment in infrastructure (Clary et al., 2020). The iReceptor Gateway is an ideal example of how investment in careful and consistent data curation adds immense value to scientific data, further enhancing its use and FAIRness.
Researchers interested in sharing data or exploring the AIRR Data Commons through the open iReceptor Gateway can visit gateway.ireceptor.org and request an account by emailing firstname.lastname@example.org. The recently released RDA (Research Data Alliance) COVID-19 Recommendations and Guidelines on Data Sharing will also be of interest for researchers looking to prepare their COVID-19 research outputs for sharing (for more information, see The Value of RDA for COVID-19). The RDA document features specific guidance for Omics researchers, including genomics, proteomics, metabolomics and lipidomics (see Section 4). Please forward this information to your colleagues studying COVID-19 so our efforts can truly make a difference.