Cromwell Examples
The following are some example workflows you can use to test Cromwell on AWS.
The curl commands assume that you have access to a Cromwell server via localhost:8000.
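Before submitting anything, you can confirm that the server is reachable. A quick check, assuming the standard Cromwell REST API on the default port, is to ask for the engine version:
$ curl "http://localhost:8000/engine/v1/version"
A JSON response containing the Cromwell version indicates the server is up.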
Simple Hello World
This is a single-file workflow. It simply echoes "Hello AWS!" to stdout and exits.
Workflow Definition
simple-hello.wdl
task echoHello {
    command {
        echo "Hello AWS!"
    }
    runtime {
        docker: "ubuntu:latest"
    }
}

workflow printHelloAndGoodbye {
    call echoHello
}
Running the workflow
To submit this workflow via curl, use the following command:
$ curl -X POST "http://localhost:8000/api/workflows/v1" \
-H "accept: application/json" \
-F "workflowSource=@/path/to/simple-hello.wdl"
You should receive a response like the following:
{"id":"104d9ade-6461-40e7-bc4e-227c3a49e98b","status":"Submitted"}
If the workflow completes successfully, the server will log the following:
2018-09-21 04:07:42,928 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - WorkflowExecutionActor-7eefeeed-157e-4307-9267-9b4d716874e5 [UUID(7eefeeed)]: Workflow w complete. Final Outputs:
{
"w.echo.f": "s3://aws-cromwell-test-us-east-1/cromwell-execution/w/7eefeeed-157e-4307-9267-9b4d716874e5/call-echo/echo-stdout.log"
}
2018-09-21 04:07:42,931 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - WorkflowManagerActor WorkflowActor-7eefeeed-157e-4307-9267-9b4d716874e5 is in a terminal state: WorkflowSucceededState
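The Final Outputs block lists the S3 URI of each output. If your credentials allow access to the execution bucket, you can copy a result locally with the AWS CLI; for example, using the URI from the log above (your bucket and workflow id will differ):
$ aws s3 cp s3://aws-cromwell-test-us-east-1/cromwell-execution/w/7eefeeed-157e-4307-9267-9b4d716874e5/call-echo/echo-stdout.log .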
Call Caching
If you submit the same job again, Cromwell will find in its metadata database that the previous call to the echoHello task completed successfully (a cache hit). Rather than submitting the job to AWS Batch, the server will simply copy the previous result.
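You can confirm a cache hit by inspecting the call metadata of the resubmitted workflow; the callCaching section of each call should report the hit. A sketch, substituting the id of the resubmission:
$ curl -X GET "http://localhost:8000/api/workflows/v1/<workflow-id>/metadata" \
 -H "accept: application/json"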
You can disable call caching on a single workflow by providing a JSON options file:
{
    "write_to_cache": false,
    "read_from_cache": false
}
This file may be submitted along with the workflow:
$ curl -X POST "http://localhost:8000/api/workflows/v1" \
-H "accept: application/json" \
-F "workflowSource=@workflow.wdl" \
-F "workflowOptions=@options.json"
Hello World with inputs
This workflow is virtually the same as the single-file workflow above, but uses an input file to define parameters in the workflow.
Workflow Definition
hello-aws.wdl
task hello {
    String addressee
    command {
        echo "Hello ${addressee}! Welcome to Cromwell . . . on AWS!"
    }
    output {
        String message = read_string(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
    }
}

workflow wf_hello {
    call hello

    output {
        hello.message
    }
}
Inputs
hello-aws.json
{
    "wf_hello.hello.addressee": "World!"
}
Running the workflow
Submit this workflow using:
$ curl -X POST "http://localhost:8000/api/workflows/v1" \
-H "accept: application/json" \
-F "workflowSource=@hello-aws.wdl" \
-F "workflowInputs=@hello-aws.json"
Using data on S3
This workflow demonstrates how to use data from S3.
First, create some data:
$ curl "https://baconipsum.com/api/?type=all-meat¶s=1&format=text" > meats.txt
and upload it to an S3 bucket that the Cromwell server's IAM policy allows it to access:
$ aws s3 cp meats.txt s3://<your-bucket-name>/
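You can verify that the object landed where you expect (and that your credentials can see it) before running the workflow:
$ aws s3 ls s3://<your-bucket-name>/meats.txt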
Create the following WDL and input JSON files.
Workflow Definition
s3inputs.wdl
task read_file {
    File file
    command {
        cat ${file}
    }
    output {
        String contents = read_string(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
    }
}

workflow ReadFile {
    call read_file

    output {
        read_file.contents
    }
}
Inputs
Replace the S3 URI below with the location of meats.txt in your own bucket.
s3inputs.json
{
    "ReadFile.read_file.file": "s3://aws-cromwell-test-us-east-1/meats.txt"
}
Running the workflow
Submit the workflow via curl:
$ curl -X POST "http://localhost:8000/api/workflows/v1" \
-H "accept: application/json" \
-F "workflowSource=@s3inputs.wdl" \
-F "workflowInputs=@s3inputs.json"
If successful, the server should log the following:
2018-09-21 05:04:15,478 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - WorkflowExecutionActor-1774c9a2-12bf-42ea-902d-3dbe2a70a116 [UUID(1774c9a2)]: Workflow ReadFile complete. Final Outputs:
{
"ReadFile.read_file.contents": "Strip steak venison leberkas sausage fatback pork belly short ribs. Tail fatback prosciutto meatball sausage filet mignon tri-tip porchetta cupim doner boudin. Meatloaf jerky short loin turkey beef kielbasa kevin cupim burgdoggen short ribs spare ribs flank doner chuck. Cupim prosciutto jerky leberkas pork loin pastrami. Chuck ham pork loin, prosciutto filet mignon kevin brisket corned beef short loin shoulder jowl porchetta venison. Hamburger ham hock tail swine andouille beef ribs t-bone turducken tenderloin burgdoggen capicola frankfurter sirloin ham."
}
2018-09-21 05:04:15,481 cromwell-system-akka.dispatchers.engine-dispatcher-28 INFO - WorkflowManagerActor WorkflowActor-1774c9a2-12bf-42ea-902d-3dbe2a70a116 is in a terminal state: WorkflowSucceededState
Real-world example: HaplotypeCaller
This example demonstrates how to use Cromwell with the AWS backend to run GATK4 HaplotypeCaller against public data in S3. The HaplotypeCaller tool is one of the primary steps in the GATK Best Practices pipeline.
The source for these files can be found in Cromwell's test suite on GitHub.
Workflow Definition
HaplotypeCaller.aws.wdl
## Copyright Broad Institute, 2017
##
## This WDL workflow runs HaplotypeCaller from GATK4 in GVCF mode on a single sample
## according to the GATK Best Practices (June 2016), scattered across intervals.
##
## Requirements/expectations :
## - One analysis-ready BAM file for a single sample (as identified in RG:SM)
## - Set of variant calling intervals lists for the scatter, provided in a file
##
## Outputs :
## - One GVCF file and its index
##
## Cromwell version support
## - Successfully tested on v29
## - Does not work on versions < v23 due to output syntax
##
## IMPORTANT NOTE: HaplotypeCaller in GATK4 is still in evaluation phase and should not
## be used in production until it has been fully vetted. In the meantime, use the GATK3
## version for any production needs.
##
## Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
##
## LICENSING :
## This script is released under the WDL source code license (BSD-3) (see LICENSE in
## https://github.com/broadinstitute/wdl). Note however that the programs it calls may
## be subject to different licenses. Users are responsible for checking that they are
## authorized to run all programs before running this script. Please see the dockers
## for detailed licensing information pertaining to the included programs.
# WORKFLOW DEFINITION
workflow HaplotypeCallerGvcf_GATK4 {
    File input_bam
    File input_bam_index
    File ref_dict
    File ref_fasta
    File ref_fasta_index
    File scattered_calling_intervals_list

    String gatk_docker
    String gatk_path

    Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)

    String sample_basename = basename(input_bam, ".bam")

    String gvcf_name = sample_basename + ".g.vcf.gz"
    String gvcf_index = sample_basename + ".g.vcf.gz.tbi"

    # Call variants in parallel over grouped calling intervals
    scatter (interval_file in scattered_calling_intervals) {

        # Generate GVCF by interval
        call HaplotypeCaller {
            input:
                input_bam = input_bam,
                input_bam_index = input_bam_index,
                interval_list = interval_file,
                gvcf_name = gvcf_name,
                ref_dict = ref_dict,
                ref_fasta = ref_fasta,
                ref_fasta_index = ref_fasta_index,
                docker_image = gatk_docker,
                gatk_path = gatk_path
        }
    }

    # Merge per-interval GVCFs
    call MergeGVCFs {
        input:
            input_vcfs = HaplotypeCaller.output_gvcf,
            vcf_name = gvcf_name,
            vcf_index = gvcf_index,
            docker_image = gatk_docker,
            gatk_path = gatk_path
    }

    # Outputs that will be retained when execution is complete
    output {
        File output_merged_gvcf = MergeGVCFs.output_vcf
        File output_merged_gvcf_index = MergeGVCFs.output_vcf_index
    }
}
# TASK DEFINITIONS
# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
    File input_bam
    File input_bam_index
    String gvcf_name
    File ref_dict
    File ref_fasta
    File ref_fasta_index
    File interval_list
    Int? interval_padding
    Float? contamination
    Int? max_alt_alleles

    String mem_size
    String docker_image
    String gatk_path
    String java_opt

    command {
        ${gatk_path} --java-options ${java_opt} \
            HaplotypeCaller \
            -R ${ref_fasta} \
            -I ${input_bam} \
            -O ${gvcf_name} \
            -L ${interval_list} \
            -ip ${default=100 interval_padding} \
            -contamination ${default=0 contamination} \
            --max-alternate-alleles ${default=3 max_alt_alleles} \
            -ERC GVCF
    }

    runtime {
        docker: docker_image
        memory: mem_size
        cpu: 1
    }

    output {
        File output_gvcf = "${gvcf_name}"
    }
}
# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
    Array[File] input_vcfs
    String vcf_name
    String vcf_index

    String mem_size
    String docker_image
    String gatk_path
    String java_opt

    command {
        ${gatk_path} --java-options ${java_opt} \
            MergeVcfs \
            --INPUT=${sep=' --INPUT=' input_vcfs} \
            --OUTPUT=${vcf_name}
    }

    runtime {
        docker: docker_image
        memory: mem_size
        cpu: 1
    }

    output {
        File output_vcf = "${vcf_name}"
        File output_vcf_index = "${vcf_index}"
    }
}
Inputs
The inputs for this workflow reference public data on S3 that is hosted by AWS as part of the AWS Public Dataset Program.
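You can spot-check that the Cromwell server's credentials can read this public data before submitting; for example, streaming the intervals list referenced below to stdout (assuming your IAM role or credentials have read access to the bucket):
$ aws s3 cp s3://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt -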
HaplotypeCaller.aws.json
{
    "##_COMMENT1": "INPUT BAM",
    "HaplotypeCallerGvcf_GATK4.input_bam": "s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam",
    "HaplotypeCallerGvcf_GATK4.input_bam_index": "s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai",

    "##_COMMENT2": "REFERENCE FILES",
    "HaplotypeCallerGvcf_GATK4.ref_dict": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
    "HaplotypeCallerGvcf_GATK4.ref_fasta": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
    "HaplotypeCallerGvcf_GATK4.ref_fasta_index": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",

    "##_COMMENT3": "INTERVALS",
    "HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "s3://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt",
    "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.interval_padding": 100,

    "##_COMMENT4": "DOCKERS",
    "HaplotypeCallerGvcf_GATK4.gatk_docker": "broadinstitute/gatk:4.0.0.0",

    "##_COMMENT5": "PATHS",
    "HaplotypeCallerGvcf_GATK4.gatk_path": "/gatk/gatk",

    "##_COMMENT6": "JAVA OPTIONS",
    "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.java_opt": "-Xms8000m",
    "HaplotypeCallerGvcf_GATK4.MergeGVCFs.java_opt": "-Xms8000m",

    "##_COMMENT7": "MEMORY ALLOCATION",
    "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.mem_size": "10 GB",
    "HaplotypeCallerGvcf_GATK4.MergeGVCFs.mem_size": "30 GB"
}
Running the workflow
Submit the workflow via curl:
$ curl -X POST "http://localhost:8000/api/workflows/v1" \
-H "accept: application/json" \
-F "workflowSource=@HaplotypeCaller.aws.wdl" \
-F "workflowInputs=@HaplotypeCaller.aws.json"
This workflow takes about 60-90 minutes to complete.
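Because this run is long, it is convenient to poll its status periodically, and to abort it if needed; both use the id returned at submission (a sketch, substitute your workflow id):
$ curl -X GET "http://localhost:8000/api/workflows/v1/<workflow-id>/status" \
 -H "accept: application/json"
$ curl -X POST "http://localhost:8000/api/workflows/v1/<workflow-id>/abort" \
 -H "accept: application/json"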