Introduction and Background
The launch of the GISAID data science initiative in 2008 marked the beginning of an unprecedented era of global collaboration on influenza data sharing. This was facilitated by measures that fostered scientific etiquette and fairness in data access. These groundbreaking approaches to data stewardship became key to incentivizing the community of data generators across the globe.
Eight years after GISAID pioneered this novel approach to data stewardship, a diverse set of stakeholders, with the objective of defining measurable guidance to help improve the infrastructure supporting the reuse of scholarly data, communicated their FAIR data principles. Drawing on the landscape of various data management approaches, FAIR's collection of 'Guiding Principles for scientific data management and stewardship' was published in 2016 for those wishing to enhance the reusability of their data holdings. FAIR principles are well reflected in GISAID (see below).
With today's technologies continuously evolving, GISAID advances its practices to ensure the best experience for data generators and service to the research community to meet GISAID's primary objective:
Incentivize rapid sharing of data from high-impact pathogens
in a manner that is transparent and fair to those generating data,
while providing public access to the data on equal footing
Executive Summary
The GISAID Data Science Initiative has played a transformative role in advancing global collaboration on the rapid sharing of pathogen data since its launch in 2008. By introducing novel data stewardship measures, GISAID has set industry benchmarks for fairness, transparency, and scientific etiquette in data sharing. The emergence of the FAIR (Findable, Accessible, Interoperable, Reusable) data principles in 2016, established by a diverse coalition of stakeholders, further built upon the ethos that GISAID had pioneered—aiming to enhance the infrastructure for scholarly data reuse globally.
Findability:
Central to the FAIR principles is the ability to easily locate datasets. GISAID ensures this by assigning each data record a unique and persistent identifier (EPI_ISL ID) and, for curated collections, an EPI_SET ID linked to a digital object identifier (DOI). These identifiers enable granular traceability of genetic sequences and metadata, supporting versioning, transparency, and scientific reproducibility. GISAID’s datasets are indexed with global data registries and search engines, ensuring wide visibility and discoverability by both humans and machines.
Accessibility:
GISAID prioritizes transparent and equitable access to data. All data are retrievable via standardized, open, and free HTTPS protocols through web interfaces, DOI URLs, and deep-linking options. User authentication is required, governed by publicly available access agreements, to ensure proper data use and contributor acknowledgment. Registration is free and open to the global community, balancing open access with accountability. Metadata remain accessible even if the underlying data are withdrawn, ensuring persistent data traceability.
Interoperability:
To allow integration with other datasets and analytical workflows, GISAID employs widely recognized data formats (CSV, TSV, JSON, FASTA, FASTQ) and a controlled, documented vocabulary for metadata. Cross-referencing capabilities enable data to be linked with external clinical datasets or peer-reviewed studies via persistent identifiers. GISAID enhances the contextual richness and utility of its data by enabling users to track the use and citation of datasets in scientific literature, supporting interoperable research and facilitating powerful cross-disciplinary analyses.
Reusability:
GISAID’s databases are meticulously curated by a global team of full-time staff to ensure that both data and metadata meet rigorous domain-specific community standards, defined in consultation with international health organizations (e.g., FAO, WHO, WOAH). Detailed provenance is maintained by mandating the inclusion of information on origin and submission, capturing both laboratory and author-level contributions. GISAID’s data are released under a clear, accessible license, with provisions for temporary publishing embargoes to protect contributors’ publication rights. The adoption of sustainable file formats, community-agreed standards, and comprehensive documentation supports reproducibility and facilitates robust data reuse.
Conclusion:
GISAID exemplifies the implementation of FAIR principles, providing a robust and trusted platform for sharing high-impact pathogen data. By fostering transparent, accessible, and equitable data sharing—while protecting data generators’ interests—GISAID supports public health responses, accelerates scientific discovery and reinforces a global ethos of responsible data stewardship. Its evolving practices keep pace with technological advancements to ensure continued relevance and impact for the research community.
FAIR suggests: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.
GISAID mints for each data record a globally unique and persistent identifier known as an EPI_ISL ID. This distinctive alphanumeric accession number permits the traceability of a single genetic sequence and associated metadata. For example, GISAID uses the official reference virus sequence (hCoV-19/Wuhan/WIV04/2019) for the betacoronavirus responsible for COVID-19 that is identifiable as EPI_ISL_402124. Versioning is implemented to reflect updates and history of data records.
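Because each record carries a stable accession, scripts that process free text (for example, a data availability statement) can pick out the identifiers before using them as lookup keys. The following is a minimal sketch, assuming that EPI_ISL accessions consist of the prefix `EPI_ISL_` followed by digits; that pattern is inferred only from the example above, not from a published GISAID specification, and the function name `extract_epi_isl_ids` is hypothetical.

```python
import re

# Assumed pattern: "EPI_ISL_" followed by one or more digits.
# Inferred from the example EPI_ISL_402124; not an official specification.
EPI_ISL_PATTERN = re.compile(r"\bEPI_ISL_\d+\b")


def extract_epi_isl_ids(text: str) -> list[str]:
    """Return the distinct EPI_ISL accessions in text, in order of first appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order and removes duplicates
    for match in EPI_ISL_PATTERN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


if __name__ == "__main__":
    note = "Reference hCoV-19/Wuhan/WIV04/2019 (EPI_ISL_402124) was used; see EPI_ISL_402124."
    print(extract_epi_isl_ids(note))
```

Duplicates are collapsed so that a record cited several times in a text is counted once.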
GISAID also mints for any selected collection of genetic sequences and associated metadata a globally unique and persistent identifier known as an EPI_SET ID. EPI_SET IDs are minted in parallel with a corresponding digital object identifier ("DOI"). The EPI_SET ID and the DOI can be used interchangeably. For example, when publishing a study, citing either one automatically provides a Data Availability Statement that meets the criteria set forth by scientific journals while also fulfilling the GISAID acknowledgment requirements for data contributors.
GISAID EPI_ISL IDs and EPI_SET IDs with the corresponding DOI facilitate access to all original records, making GISAID data optimally reusable to achieve both transparency and scientific reproducibility of an analysis.
GISAID provides a large number of metadata fields available for submissions to its databases, employing whenever possible controlled vocabulary to ensure content is legible, structured, and searchable. GISAID works with domain experts to select meaningful metadata fields, establishing community standards customized for each pathogen.
GISAID associates each sequence with several metadata variables, thus providing a rich array of information, e.g., about the host, the epidemiological context, and the methods used for sequencing and bioinformatics data analysis.
GISAID encourages users to provide as much metadata as possible; however, the decision regarding which metadata to share, and to what extent, remains entirely with the submitter. This may vary due to several factors, including (a) patient confidentiality; (b) time and resource availability; (c) integration with other areas of public health response and research; and (d) other contextual considerations.
GISAID facilitates automatic assignment of variables for users, such that some fields can be populated without additional manual labor. This includes assessments of quality parameters, annotations of clade and lineage where available, as well as nucleotide and amino acid substitutions.
GISAID metadata annotations are subject to additional quality checks using machine and human curation that remain subject to ongoing update and review.
GISAID metadata and sequence data are part of the same record in GISAID applications and databases. Exported reports and download packages of metadata include the unique EPI_ISL ID accession number.
GISAID datasets are well known around the world and among researchers working on viral pathogens. GISAID datasets are indexed with major search engines and with registries that promote FAIR data sharing, including re3data.org and FAIRsharing.org.
FAIR suggests: Once the user finds the required data, she/he needs to know how they can be accessed, possibly including authentication and authorization.
GISAID uses the standard HTTPS protocol. Data can be retrieved through Web-GUI, minted DOI URLs and deep-links with the identifier to all original data.
GISAID uses the HTTPS protocol which is open, free, and universally implementable.
GISAID’s protocols require user authentication for access to the data platform to ensure transparent use of data. Authorization is checked against the list of users with valid access credentials who have agreed to be bound by the GISAID database access agreement. Registration is free of charge and open to the public. Once authenticated, a user can navigate across the platform and its features without having to re-authenticate.
GISAID requires user authentication for access to datasets. Accessibility is specified in such a way that a machine can automatically understand the requirements, and then either automatically execute the requirements or alert the user to them. This makes it possible to authenticate the owner (or contributor) of each dataset, and potentially to set user-specific rights.
GISAID metadata and genetic sequence data are one entity (see F3). The given identifier and basic metadata remain available at all times, with a note on the history and availability of the remainder of the data.
FAIR suggests: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
GISAID data are exchanged in broadly accepted, machine-readable standard formats such as CSV, TSV, JSON, FASTA and FASTQ. Metadata variables are written in English or use basic Latin alphanumeric characters.
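Since these formats are plain text, they can be read with standard-library tools and linked through the accession number. The sketch below joins a small TSV metadata table to FASTA sequences using only the Python standard library. The inputs are toy data, and the column names (`accession`, `host`) and the `name|accession` FASTA header layout are assumptions for illustration rather than GISAID's documented export format.

```python
import csv
import io

# Toy inputs; the column names and the "name|accession" FASTA header layout
# are assumptions for illustration, not GISAID's documented export format.
METADATA_TSV = "accession\thost\nEPI_ISL_1\tHuman\nEPI_ISL_2\tAvian\n"
SEQUENCES_FASTA = ">example/1|EPI_ISL_1\nACGT\nACGT\n>example/2|EPI_ISL_2\nTTGA\n"


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def join_by_accession(metadata_tsv: str, fasta: str) -> dict:
    """Attach each sequence to its metadata row, keyed by accession."""
    rows = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    records = {row["accession"]: dict(row) for row in rows}
    for header, sequence in read_fasta(io.StringIO(fasta)):
        accession = header.split("|")[1]  # assumed header layout
        if accession in records:
            records[accession]["sequence"] = sequence
    return records


if __name__ == "__main__":
    for accession, record in join_by_accession(METADATA_TSV, SEQUENCES_FASTA).items():
        print(accession, record["host"], len(record.get("sequence", "")))
```

Keying both sources on the accession rather than on the virus name avoids mismatches when names are reformatted between tools.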
GISAID uses a controlled vocabulary to describe datasets and (meta)data field standards are documented. This documentation is easily found on the GISAID website and accessible by anyone who uses the dataset.
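A controlled vocabulary can be enforced mechanically before submission or during curation. The following is a minimal sketch of such a check; the field names and permitted terms are hypothetical placeholders, not GISAID's documented vocabulary, and the function `find_vocabulary_violations` is likewise illustrative.

```python
# Hypothetical controlled vocabulary for two metadata fields; the field names
# and permitted terms are illustrative, not GISAID's documented standard.
CONTROLLED_VOCABULARY = {
    "host": {"Human", "Avian", "Swine"},
    "specimen_source": {"Nasopharyngeal swab", "Oropharyngeal swab"},
}


def find_vocabulary_violations(record: dict) -> list:
    """Return (field, value) pairs whose value is not a permitted term.

    Fields without a controlled vocabulary are treated as free text and skipped.
    """
    violations = []
    for field, value in record.items():
        allowed = CONTROLLED_VOCABULARY.get(field)
        if allowed is not None and value not in allowed:
            violations.append((field, value))
    return violations


if __name__ == "__main__":
    submission = {"host": "human", "specimen_source": "Nasopharyngeal swab", "note": "free text"}
    print(find_vocabulary_violations(submission))
```

The comparison here is deliberately case-sensitive, so `human` is flagged against the permitted term `Human`; a real curation pipeline might normalize case first.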
GISAID allows qualified references to external data, such as clinical data stored in different (meta)data resources. For example, this can be achieved through the use of anonymized patient identifiers that are globally unique and persistent, provided by the (Originating) laboratory responsible for obtaining the specimen, or the (Submitting) laboratory responsible for generating the sequence and other (meta)data and submitting them to GISAID.
GISAID enriches the contextual knowledge about the data by cross-referencing sequences and associated (meta)data with the publications in which they occur, e.g., by linking data to peer-reviewed studies that have referenced or used a selected set of sequences. Conversely, by selecting a publication or entering its DOI in the “Publication” search field, it is possible to retrieve the (meta)data linked to that study. This helps identify scientific studies that have referenced or used a given dataset available in GISAID, thus improving the interoperability and visibility of data submitters’ contributions, while enabling efficient tracking of sequence and (meta)data usage in the literature.
GISAID enhances this process by providing a chronological listing of peer-reviewed studies of a particular pathogen, permitting the instant retrieval of the (meta)data cited in a study, e.g., in its data availability statement. GISAID continuously updates its DOI library of peer-reviewed studies, which date back to 1984, to ensure comprehensive and up-to-date coverage.
GISAID’s sharing mechanism relies on transparent and auditable access control, and therefore does not permit any anonymous access to raw (meta)data from external third-party applications. GISAID’s rich tool ecosystem provides interoperable workflows across a multitude of interlinked tools for analysis and actionable insights.
FAIR suggests: The ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
GISAID's databases contain descriptive and detailed (meta)data curated in each field. Although GISAID accepts free text for many attributes to facilitate the timely ingestion of (meta)data, examples and explanations are given for every attribute. GISAID works with domain experts to select meaningful and relevant (meta)data fields, establishing community standards that are customized for each pathogen.
GISAID data are released under a publicly accessible usage license published in 2008 that states under which terms and conditions (meta)data may be accessed and used. GISAID also provides for an additional license measure that allows data contributors to place (meta)data submissions under a temporary publishing embargo, to ensure the right of first publication is reserved for the original data contributors.
GISAID mandates the inclusion of the Originating Laboratory (responsible for obtaining the specimen) or the Submitting Laboratory (responsible for generating and submitting to GISAID the sequence and associated (meta)data), along with the authors, to ensure detailed provenance of the data generation, e.g., ownership interests in data. Versioning of data ensures an accurate historical record of all changes.
Since its launch in 2008, GISAID has been a pioneer in establishing (meta)data field standards that continue to be defined by leading domain experts in the pathogen surveillance community of national public health and national animal health institutions, including those taking part in international fora facilitated by organizations such as FAO, WHO and WOAH. Adherence to standards and quality assurance is ensured by a global team of full-time GISAID data curators who are experts in their respective fields. GISAID employs well-established and sustainable file formats, standardized documentation of (meta)data following SOPs for curation, and a common template with consistent vocabulary.
