Tag: Sequence Read Archive (SRA)

Automated Lineage Definitions Now Available in NCBI Virus SARS-CoV-2 Variants Overview

Recently, the NCBI Virus SARS-CoV-2 Variants Overview moved from a manual to an automated process for selecting the mutations required to define a lineage (e.g., Omicron, BA.2, JN.1). With this update, the SARS-CoV-2 Variants Overview provides coverage for all SARS-CoV-2 lineages and is no longer limited to lineages with CDC status. The SARS-CoV-2 Variants Overview website reports results from analyzing both GenBank and unassembled Sequence Read Archive (SRA) sequence data. It allows you to view geographic and frequency trends of records assigned to Pango lineages and search for sequence records using lineage-defining or other mutations (example shown in Figure 1).

Changes to SRA Data Access on the Google Cloud Platform (GCP)

Sequence Read Archive (SRA) data available via the Google Cloud Platform (GCP) are migrating from multi-region storage to the single region us-east-1. This migration is projected to be complete by May 2024. To minimize the impact of this change, we recommend updating your workflow to access SRA data in the us-east-1 region as soon as possible.

Please note that this change does not impact SRA data access from Amazon Web Services (AWS) or NCBI servers.

Update to GenBank Qualifier

‘Country’ will transition to ‘Geographic Location’ effective June 2024

As announced earlier this year, we will begin to systematically gather ‘location of collection’ and ‘date and time of collection’ for sequence data submitted to GenBank and the Sequence Read Archive (SRA).

As part of this effort and to make location data more accurate and informative, we are also changing the way this information is represented on GenBank records, consistent with the relevant field in BioSample.

Introducing Pebblescout: Index and Search Petabyte-Scale Sequence Resources Faster than Ever

NCBI is excited to introduce Pebblescout, a pilot web service that allows you to search for sequence matches in very large nucleotide databases, such as runs in the NIH Sequence Read Archive (SRA) and assemblies for whole genome shotgun sequencing projects in GenBank, faster and more efficiently!

Pebblescout uses short segments of your query sequences to identify database records with matches. Matches are based on the frequency of a segment’s occurrence in a database. The result produced for each query is a ranked list of matching records, where the ranking reflects how informative the matching segments are.
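To make the idea of segment-based matching and ranking more concrete, here is a toy sketch in Python. It is not Pebblescout's actual algorithm or API. The function names, the fixed segment length `k`, and the logarithmic weight (segments found in fewer records count as more informative) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping segment of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(records, k):
    """Map each segment to the set of record IDs that contain it."""
    index = defaultdict(set)
    for rec_id, seq in records.items():
        for seg in kmers(seq, k):
            index[seg].add(rec_id)
    return index


def rank_matches(query, index, n_records, k):
    """Score records by the summed weight of the query segments they share.

    Assumption: a segment seen in fewer records is more informative,
    so it gets a larger weight (an IDF-style heuristic).
    """
    scores = defaultdict(float)
    for seg in set(kmers(query, k)):
        hits = index.get(seg, ())
        if not hits:
            continue
        weight = math.log(1 + n_records / len(hits))
        for rec_id in hits:
            scores[rec_id] += weight
    # Highest score first; ties are broken by record ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical records standing in for database entries.
records = {
    "run_A": "ACGTACGTGG",
    "run_B": "TTTTACGTAA",
    "run_C": "GGGGGGGGGG",
}
index = build_index(records, k=4)
ranked = rank_matches("ACGTGG", index, n_records=len(records), k=4)
for rec_id, score in ranked:
    print(rec_id, round(score, 3))
```

In this toy data, `run_A` shares three segments with the query, two of them unique to it, so it ranks above `run_B`, which shares only the common segment `ACGT`. `run_C` shares nothing and does not appear in the results.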

NCBI Virus: Mutation-Based Search for SARS-CoV-2 Data

Millions of SARS-CoV-2 samples from around the world have been made publicly available as assembled and unassembled sequence data in GenBank and the Sequence Read Archive (SRA). Now you can find sequences with a particular mutation by searching with the protein and the amino acid change (e.g., S:F486V). Visit our SARS-CoV-2 Variants Overview on NCBI Virus and click on the Mutations tab to get started (Figure 1).

Figure 1: SARS-CoV-2 Variants Overview. Arrows indicate important features on the page, including the “Lineages” and “Mutations” tabs to switch between views, the search box, and the information box describing the mutation format. The results are also indicated, including a summary of the total records found that contain the searched term as well as the results table.
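If you keep lists of mutations in your own scripts, it can help to split a search term such as S:F486V into its parts: the protein, the reference amino acid, the position, and the new amino acid. The Python sketch below is one way to do that. It assumes the simple `protein:RefPositionAlt` substitution pattern shown in the example above and does not attempt to cover deletions, insertions, or any other notation the site may accept. The names `Mutation` and `parse_mutation` are made up for this example.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    protein: str   # e.g. "S" for the spike protein
    ref: str       # reference amino acid (one-letter code)
    position: int  # position in the protein
    alt: str       # amino acid observed in the variant


# Assumed pattern: protein name, colon, one letter, digits, one letter.
_PATTERN = re.compile(r"^([A-Za-z0-9]+):([A-Z])(\d+)([A-Z])$")


def parse_mutation(term):
    """Parse a simple substitution such as 'S:F486V' into its parts."""
    match = _PATTERN.match(term.strip())
    if match is None:
        raise ValueError(f"not a simple substitution: {term!r}")
    protein, ref, position, alt = match.groups()
    return Mutation(protein, ref, int(position), alt)


mutation = parse_mutation("S:F486V")
print(mutation.protein, mutation.ref, mutation.position, mutation.alt)
```

Terms that do not fit the assumed pattern raise a `ValueError`, so unexpected notation is flagged rather than silently misread.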

Coming Soon! Including Sample Location and Collection Date and Time for Sequences Submitted to GenBank and SRA

As previously announced, in collaboration with our partners at the International Nucleotide Sequence Database Collaboration (INSDC), we will begin to systematically gather ‘location of collection’ and ‘date and time of collection’ for sequence data submitted to GenBank and the Sequence Read Archive (SRA). Gathering information about where and when a biological sample was collected aligns with other global sequence submission standardization efforts and will increase the utility of data made available through GenBank and SRA. These changes will be implemented in a phased approach through December 2024.

What’s new?

Sequence data submitted to GenBank and the SRA will need to include information about location and date and time of sample collection. These metadata will be entered using the pre-existing fields ‘country’ and ‘collection_date.’ Minimum information for these fields is described below. We encourage submitters to provide additional details when available.

Streamlining Access to SRA COVID-19 Datasets on the Cloud

To make it easier for you to find and access Sequence Read Archive (SRA) data, we are re-organizing and improving our cloud storage systems.  

Beginning April 2023, we will move the SARS-CoV-2 normalized data and source files from the COVID-19 data buckets on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to the NIH NCBI SRA on AWS registry. We will also remove the SARS-CoV-2 original format data from AWS and GCP COVID-19 buckets and make them available in AWS cold storage. If you need these data, you can request them using the Cloud Data Delivery Service (CDDS). 

Where and how will I be able to access SARS-CoV-2 normalized data after this change occurs?

To ensure a smooth transition, we want you to have enough time to adjust your scripts and pipelines to minimize disruption to your analyses.

3+ Ways NCBI is Enhancing the SRA Database

Do you submit or access Sequence Read Archive (SRA) data? In an ongoing effort to enhance your experience, NCBI is making several improvements to our widely used SRA database. SRA is the largest publicly available repository of high-throughput sequencing data. The archive accepts data from all organisms as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enable reproducibility and facilitate new discoveries through data analysis.

What improvements is NCBI making?

  • More transparent: We recently launched the GenBank and SRA Data processing page to help you better understand how sequence data are submitted, processed, and made publicly available. 
  • More efficient: Faster data transfers, downloads, and analyses! We will be incrementally streamlining how you access SRA data as SRA Lite becomes the standard SRA file format. This simplified format reduces the average file size for more efficient analysis and storage of large datasets. 
  • More reliable: A trusted source! SRA is a trustworthy database, and we are continuously improving our processes to ensure system reliability.   
  • And more!  

Scrubbing human sequence contamination from Sequence Read Archive (SRA) submissions

Do you work with human-derived sequence data? Do you often struggle with the need to determine if your data is free of human sequence and therefore suitable for public distribution? We encourage submitters to screen for and remove contaminating human reads from data files prior to submission to SRA. To support investigators in this effort, we offer a tool to remove human sequence contamination from your SRA submissions!

Human Read Removal Tool (HRRT)

The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT). It takes a fastq file as input and produces a fastq.clean file as output, in which all reads identified as potentially of human origin are masked with ‘N’.
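To picture what masking a read with ‘N’ means, here is a minimal Python sketch. It is not the HRRT code and does not use STAT. The function `mask_human_reads` and the `is_human` check are hypothetical stand-ins, and the check used in the example (looking for one fixed substring) is only a placeholder for a real classifier. The sketch assumes plain four-line fastq records.

```python
def mask_human_reads(fastq_lines, is_human):
    """Return fastq lines with flagged reads replaced by N's.

    fastq_lines: list of lines, four per record (header, sequence,
    separator, quality). is_human: hypothetical callable that returns
    True for a sequence that should be masked.
    """
    cleaned = []
    # Step through complete four-line records; a trailing partial
    # record, if any, is dropped.
    for i in range(0, len(fastq_lines) - 3, 4):
        header, seq, plus, qual = fastq_lines[i:i + 4]
        if is_human(seq):
            # Keep the read length so the quality line still lines up.
            seq = "N" * len(seq)
        cleaned.extend([header, seq, plus, qual])
    return cleaned


# Placeholder check for illustration only, not a real human-read test.
def toy_is_human(seq):
    return "GATTACA" in seq


reads = [
    "@read1", "ACGTGATTACAGG", "+", "IIIIIIIIIIIII",
    "@read2", "TTTTCCCCGGGG", "+", "IIIIIIIIIIII",
]
for line in mask_human_reads(reads, toy_is_human):
    print(line)
```

The masked read keeps its header, length, and quality line, so the output still has one record for every input record.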

Announcing the GenBank and SRA Data Processing Webpage

Interested in understanding how sequence data are submitted, processed, and made publicly available in GenBank and the Sequence Read Archive (SRA)? Announcing the GenBank and SRA Data Processing webpage!

Here you can learn about procedures that the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), uses for processing submitted data and public posting, as well as key definitions of data status.