
Cromwell Troubleshooting

The following are some common errors we have seen, along with suggested solutions.

S3 Access Denied (403)

Possible Cause(s)

A 403 error from S3 indicates that Cromwell is trying to access an S3 object that it does not have permission to read or write. Following the principle of least privilege, Cromwell uses an IAM EC2 instance role that grants it read and write access to the S3 bucket you specified in the CloudFormation deployment, and read-only access to the gatk-test-data and broad-references S3 buckets. If your workflow references other S3 objects (even ones in your account), you will need to allow this via changes to the IAM role. Similarly, if a step in your workflow attempts to write to another bucket, you will need to add the appropriate permissions.

Suggested Solution(s)

  • Add read access to additional buckets by attaching a policy to the Cromwell server's IAM EC2 instance role with content similar to:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::bucket-a",
                "arn:aws:s3:::bucket-a/*",
                "arn:aws:s3:::another-bucket",
                "arn:aws:s3:::another-bucket/*"
            ]
        }
    ]
}

The exact name of the role is unique and generated by CloudFormation; however, it will contain the words "CromwellServer", and it will be the role attached to the EC2 instance running the Cromwell server.
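
If you save the policy above to a file, for example bucket-access-policy.json, one way to attach it as an inline policy is with the AWS CLI. This is a minimal sketch; the role name and policy name below are placeholders, and you would substitute the actual CromwellServer instance role name from your CloudFormation stack:

# Attach the bucket-access policy inline to the Cromwell server's instance role
aws iam put-role-policy \
    --role-name <your-CromwellServer-instance-role> \
    --policy-name CromwellExtraBucketAccess \
    --policy-document file://bucket-access-policy.json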

S3 File Not Found (404)

Possible Cause(s)

  • A file required by the workflow cannot be found at the specified S3 path. Your workflow inputs might have an incorrect path, or an expected file was not created by a previous step.

Suggested Solution(s)

  • Check the paths of your inputs and confirm that the expected files exist at those paths (see the sketch after this list).
  • If the missing file is named something like <previousTaskName>-rc.txt, the previous task failed before it was able to write out its return code. Inspect the stderr.txt and stdout.txt of the previous step for possible reasons.
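
One quick way to verify a path is with the AWS CLI. The bucket and keys below are hypothetical placeholders; substitute the paths from your workflow inputs or from the Cromwell execution directory:

# Confirm that an input object exists at the expected path
aws s3 ls s3://my-cromwell-bucket/inputs/sample1.bam

# Stream a failed task's stderr to the terminal without downloading it
aws s3 cp s3://my-cromwell-bucket/cromwell-execution/MyWorkflow/<workflow-id>/call-previousTask/stderr.txt -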

Cromwell Server OutOfMemory errors

Possible Cause(s)

  • Out of memory errors on the Cromwell server are typically the result of the JVM exhausting its heap while tracking many concurrent workflows or workflows with very large scatter steps.

Suggested Solution(s)

  • Consider upgrading the server instance type to one with more RAM.
  • Investigate tuning Cromwell's job-control limits to find a configuration that appropriately restricts the number of queued Akka messages.
  • Consider increasing the maximum share of instance RAM available to the JVM. Our CloudFormation templates set this to 85% (-XX:MaxRAMPercentage=85.0), leaving some headroom for the OS; on larger instance types you may be able to increase this further (see the sketch after this list).
  • Ensure you are not using an in-memory database on the server instance. Our CloudFormation templates configure a separate Aurora MySQL cluster to avoid this.
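
For reference, these JVM flags are supplied when the Cromwell server process is launched. The sketch below assumes a cromwell.jar and a configuration file at hypothetical paths; the percentage shown matches the CloudFormation default and can be raised cautiously on instances with more RAM:

# Launch Cromwell with the JVM capped at 85% of instance RAM,
# pointing at an external config file where job-control limits can be tuned
java -XX:MaxRAMPercentage=85.0 \
     -Dconfig.file=/opt/cromwell/cromwell.conf \
     -jar cromwell.jar server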

Cromwell Task (Container) OutOfMemory errors

Possible Cause(s)

  • Individual tasks from a workflow run in Docker containers on AWS Batch. If a container has insufficient RAM for its task, the task can fail.
  • Some older applications (including older versions of the JVM) do not always respect the memory limits imposed by the container and may assume they have access to memory they cannot actually use.

Suggested Solution(s)

  • Assign more memory to the task in the runtime {} stanza of the WDL, or, if the task application allows it, use command line or configuration parameters to appropriately limit its memory use.
  • For tasks executed by the JVM, investigate the -Xmx and -XX:MaxRAMPercentage parameters (see the sketch after this list).
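
As an illustration, the command a task runs can cap the JVM heap below the container's memory limit. This is a hypothetical sketch; the tool jar and file names are placeholders, and the -Xmx value should sit slightly below the memory declared in the task's runtime {} stanza:

# Inside a task whose runtime {} stanza requests, e.g., memory: "8 GB",
# cap the JVM heap at 7 GB to leave headroom for the rest of the container
java -Xmx7g -jar my-tool.jar --input sample1.bam --output sample1.out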

Cromwell submitted AWS Batch jobs hang in 'Runnable' state

Possible Cause(s)

  • The resources requested by the task exceed those of the largest instance type available in your AWS Batch Compute Environment
  • Batch worker EC2 instances are not able to join the Compute Environment's ECS cluster

Suggested Solution(s)

  • Reduce the resources required by your task to less than the maximum vCPU and memory of the largest instance type allowed in your Batch Compute Environment.
  • In your EC2 console, determine whether any gwf-core workers have started. If they have, ensure they have a route to the internet (for example, via a NAT gateway on their subnet). Worker nodes require internet access so that required dependencies can be downloaded at startup time. If this process fails, Docker will not start, the ecs-agent will not run, and the Systems Manager agent will not run; the node will also be unable to communicate with the AWS Batch service (see the sketch after this list).
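
One way to check whether workers have joined the cluster is with the AWS CLI. The cluster name below is a placeholder; AWS Batch creates the underlying ECS cluster, and its name can be found in the ECS console or via aws ecs list-clusters:

# List ECS clusters to find the one backing your Batch Compute Environment
aws ecs list-clusters

# Healthy workers appear as container instances; an empty list suggests
# they are failing to register, often because they lack internet access
aws ecs list-container-instances --cluster <your-batch-ecs-cluster>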