3000 Rice Genome on AWS

The 3000 Rice Genomes Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries. The collaborating organizations are the Chinese Academy of Agricultural Sciences, BGI Shenzhen, and the International Rice Research Institute (IRRI). Rice is one of the world's leading food sources, which makes it a vital crop to study for food security and other global issues. By analyzing these genomes, researchers can potentially identify genes for important agronomic traits such as better nutrition, climate change tolerance, and disease resistance.

AWS has made the 3000 Rice Genome data freely available on Amazon S3 so that anyone can use our on-demand computing resources to perform analysis and create new products without needing to worry about the cost of storing the data or the time required to download it.

For more information about the 3000 Rice Genomes Project, please visit http://iric.irri.org/resources/3000-genomes-project.

Accessing 3000 Rice Genome Data on AWS

The whole-genome sequence data were analyzed on the DNAnexus platform, comparing each of the 3,024 varieties against five different reference genomes. The results, over 100 TB in total, consist of:

Alignment of paired-end reads from whole-genome resequencing of 3,024 rice accessions to 5 published rice reference genomes (BWA-MEM version 0.7.10)
Discovery of single nucleotide polymorphisms and small indels (GATK version 3.2.2)

A description of the analysis steps is available at s3://3kricegenome/README-snp_pipeline.txt or https://3kricegenome.s3.amazonaws.com/README-snp_pipeline.txt.

The 3000 Rice Genome on AWS data set provides the reference alignments as sorted and indexed BAM files and the variant calls as indexed VCF files.

The data are organized using a simple directory structure based on the reference genome and source sample. For example, given the source sample IRIS_313-15896 analyzed against the 93-11 reference genome, you would find the associated BAM and VCF files in the following locations:

s3://3kricegenome/9311/IRIS_313-15896.realigned.bam

s3://3kricegenome/9311/IRIS_313-15896.snp.vcf.gz

Or:

https://3kricegenome.s3.amazonaws.com/9311/IRIS_313-15896.realigned.bam

https://3kricegenome.s3.amazonaws.com/9311/IRIS_313-15896.snp.vcf.gz
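The naming scheme above can be sketched as a pair of small helper functions. This is only an illustration: the function names and the `SUFFIXES` table are hypothetical, and "9311" is the only reference directory name confirmed by the text, so directory names for the other reference genomes should be checked against the bucket contents before use.

```python
BUCKET = "3kricegenome"

# File suffixes copied from the example paths in the text.
SUFFIXES = {"bam": ".realigned.bam", "vcf": ".snp.vcf.gz"}


def s3_uri(reference_dir: str, sample: str, kind: str) -> str:
    """Return the s3:// location of a sample's BAM ("bam") or VCF ("vcf") file."""
    return f"s3://{BUCKET}/{reference_dir}/{sample}{SUFFIXES[kind]}"


def https_url(reference_dir: str, sample: str, kind: str) -> str:
    """Return the HTTPS location of the same object."""
    return (
        f"https://{BUCKET}.s3.amazonaws.com/"
        f"{reference_dir}/{sample}{SUFFIXES[kind]}"
    )


if __name__ == "__main__":
    # Reproduce the four example locations for sample IRIS_313-15896.
    for kind in ("bam", "vcf"):
        print(s3_uri("9311", "IRIS_313-15896", kind))
        print(https_url("9311", "IRIS_313-15896", kind))
```

Sample names contain a plain ASCII hyphen, so copying them from formatted text that has substituted a typographic dash will produce a path that does not exist.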

The index files for the BAM and VCF files are stored alongside them, which allows fast random access to regions within a file. As an example, here we query for alignments on chromosome 1 from position 1000 to 1100 using samtools:

# Query chromosome 1 from base position 1000 to 1100

samtools view https://3kricegenome.s3.amazonaws.com/9311/IRIS_313-15896.realigned.bam 9311_chr01:1000-1100
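For scripted use, the same query can be assembled and run from Python. The sketch below is a minimal wrapper around the command shown above, not part of the data set's tooling: the helper names `region` and `view_command` are hypothetical, and it assumes samtools is on the PATH and was built with support for remote files. Only the contig name 9311_chr01 appears in the text; other contig names should be read from the BAM header (for example with `samtools view -H`).

```python
import shutil
import subprocess
from typing import List

BAM_URL = "https://3kricegenome.s3.amazonaws.com/9311/IRIS_313-15896.realigned.bam"


def region(contig: str, start: int, end: int) -> str:
    """Format a samtools region string (1-based, inclusive coordinates)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{contig}:{start}-{end}"


def view_command(bam_url: str, contig: str, start: int, end: int) -> List[str]:
    """Build the argument list for `samtools view` restricted to one region."""
    return ["samtools", "view", bam_url, region(contig, start, end)]


if __name__ == "__main__":
    cmd = view_command(BAM_URL, "9311_chr01", 1000, 1100)
    if shutil.which("samtools") is None:
        print("samtools not found; would run:", " ".join(cmd))
    else:
        # Needs network access; samtools fetches the index and only the
        # blocks of the BAM file that overlap the requested region.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(result.stdout)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a sample name or URL ever contains characters the shell treats specially.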

A manifest of all files in the bucket is also available at:

s3://3kricegenome/MANIFEST

Or:

https://3kricegenome.s3.amazonaws.com/MANIFEST
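Once downloaded, the manifest can be filtered to find the files for one reference genome. The layout of the MANIFEST file is not described here, so the sketch below makes a loud assumption: each non-empty line ends with an object key, optionally written as a full s3:// URI, and any leading columns are ignored. The function name `select_keys` and the sample lines are hypothetical; adjust the parsing after inspecting the real file.

```python
from typing import Iterable, List

S3_PREFIX = "s3://3kricegenome/"


def select_keys(manifest_lines: Iterable[str], reference_dir: str, suffix: str) -> List[str]:
    """Return object keys under `reference_dir/` that end with `suffix`.

    Assumption: the object key is the last whitespace-separated field
    on each non-empty line of the manifest.
    """
    prefix = reference_dir.rstrip("/") + "/"
    keys = []
    for line in manifest_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[-1]
        # Tolerate keys written as full s3:// URIs.
        if key.startswith(S3_PREFIX):
            key = key[len(S3_PREFIX):]
        if key.startswith(prefix) and key.endswith(suffix):
            keys.append(key)
    return keys


if __name__ == "__main__":
    # Hypothetical manifest lines, for illustration only.
    sample_lines = [
        "9311/IRIS_313-15896.realigned.bam",
        "9311/IRIS_313-15896.snp.vcf.gz",
        "README-snp_pipeline.txt",
    ]
    for key in select_keys(sample_lines, "9311", ".realigned.bam"):
        print(key)
```

With a real manifest saved locally, the lines could be supplied by opening the file and passing the file object directly, since it iterates line by line.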

Source sequence data, as well as more details on the experimental data, are available from the Sequence Read Archives (SRA) at NCBI (USA), EBI (Europe), and DDBJ (Asia).

The five reference genomes are not part of this Public Data Set, but are available from the following sources:

Nipponbare (IRGSP-1.0_genome.fasta.gz) http://rapdb.dna.affrc.go.jp/
9311 (9311.fa.gz) ftp://public.genomics.org.cn/BGI/rice_seq/93-11/
IR64 (os.ir64.cshl.draft.1.0.scaffold.fa.gz) http://schatzlab.cshl.edu/data/rice/
Kasalath (kasalath_genome.tar.gz) http://rice50ks.dna.affrc.go.jp/
DJ123 (os.dj123.cshl.draft.1.0.scaffold.fa.gz) http://schatzlab.cshl.edu/data/rice/