I'm trying to upgrade my engine version from 13.12 to 14.17. Below is my Terraform code:
resource "aws_rds_cluster" "aurora" {
  cluster_identifier              = local.env_config.rds_db_identifier
  engine                          = local.env_config.rds_engine
  engine_mode                     = local.env_config.rds_engine_mode
  engine_version                  = "14.17"
  allow_major_version_upgrade     = true
  apply_immediately               = local.env_config.rds_apply_immediately
  database_name = local.env_config.auroradb_name
  master_username = local.env_config.auroradb_user
  master_password = random_password.aurora_passwd.result
  vpc_security_group_ids = [aws_security_group.aurora.id]
  db_subnet_group_name   = aws_db_subnet_group.aurora_group.id
  copy_tags_to_snapshot        = true
  deletion_protection          = false
  enable_http_endpoint         = true
  preferred_maintenance_window = local.env_config.preferred_maintenance_window
  backup_retention_period      = 14
  preferred_backup_window      = "01:00-02:00"
  skip_final_snapshot          = false
  final_snapshot_identifier    = "fhir-dev-platform-cluster-final-pgadmin"
  snapshot_identifier          = "fhir-dev-platform-cluster-final-pgadmin"
  storage_encrypted            = true
  serverlessv2_scaling_configuration {
    max_capacity = 4
    min_capacity = 2
  }
  tags = local.default_tags
  timeouts {
    create = "120m"
  }
}
resource "aws_rds_cluster_instance" "aurora_provisioned_instance" {
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class = "db.serverless"
  engine = local.env_config.rds_engine
  engine_version  = local.env_config.rds_engine_version
  publicly_accessible  = false
  db_subnet_group_name = aws_db_subnet_group.aurora_group.id
  apply_immediately = true
}
resource "aws_db_subnet_group" "aurora_group" {
  subnet_ids = module.base_inf.vpc_public_subnets
  tags = local.default_tags
}
resource "aws_security_group" "aurora" {
  name   = "fhir-dev-platform-sg-aurora"
  vpc_id = module.base_inf.vpc_id
  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = concat(module.base_inf.vpc_services_subnets_cidr_blocks, module.base_inf.vpc_public_subnets_cidr_blocks)
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = local.default_tags
}
The error I'm facing is snapshot name not found, which I'm assuming is happening because the final snapshot identifier is not able to create the snapshot in time, and when the snapshot identifier tries to restore the cluster, it fails. The problem is I can't do multiple deployments for this code.
I also tried removing the snapshot identifier so that the cluster is not restored from a snapshot, and that worked. However, the DB password and the cluster password then got out of sync, because the cluster password is only updated when the cluster is recreated; I generate it with random_password.
Could you please tell me how I can handle the upgrade in one go?
Answer
"The error I'm facing is snapshot name not found, which I'm assuming is happening because the final snapshot identifier is not able to create the snapshot in time, and when the snapshot identifier tries to restore the cluster, it fails."
That's not the problem. Terraform shouldn't be trying to create a snapshot at all during an upgrade, and it is certainly not creating a final snapshot: that only happens when you delete the database cluster, not when you upgrade it. If the RDS service needs to create a snapshot as part of the version upgrade process, it will do so automatically behind the scenes. It will not use your final_snapshot settings for that process.
The problem is this line:
snapshot_identifier = "fhir-dev-platform-cluster-final-pgadmin"
That line tells Terraform to create a new Aurora cluster from the snapshot with that name, and that snapshot doesn't exist. Unless you are actually trying to create a new Aurora cluster from an existing snapshot, you should not set the snapshot_identifier attribute at all.
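As a rough sketch of what that looks like, here is the cluster resource from the question with only the snapshot_identifier line removed. Every other argument is copied unchanged from the question. It assumes the cluster was not originally restored from a snapshot, and the comments are my own reading of the settings rather than anything confirmed against your environment.

```hcl
resource "aws_rds_cluster" "aurora" {
  cluster_identifier          = local.env_config.rds_db_identifier
  engine                      = local.env_config.rds_engine
  engine_mode                 = local.env_config.rds_engine_mode
  engine_version              = "14.17"
  allow_major_version_upgrade = true
  apply_immediately           = local.env_config.rds_apply_immediately

  database_name   = local.env_config.auroradb_name
  master_username = local.env_config.auroradb_user
  master_password = random_password.aurora_passwd.result

  vpc_security_group_ids = [aws_security_group.aurora.id]
  db_subnet_group_name   = aws_db_subnet_group.aurora_group.id

  copy_tags_to_snapshot        = true
  deletion_protection          = false
  enable_http_endpoint         = true
  preferred_maintenance_window = local.env_config.preferred_maintenance_window
  backup_retention_period      = 14
  preferred_backup_window      = "01:00-02:00"
  storage_encrypted            = true

  # These two settings only matter when the cluster is destroyed.
  # They are not involved in a version upgrade.
  skip_final_snapshot       = false
  final_snapshot_identifier = "fhir-dev-platform-cluster-final-pgadmin"

  # snapshot_identifier is intentionally omitted: it is only for creating
  # a new cluster from an existing snapshot, which is not the goal here.

  serverlessv2_scaling_configuration {
    max_capacity = 4
    min_capacity = 2
  }

  tags = local.default_tags

  timeouts {
    create = "120m"
  }
}
```

With the attribute gone, the plan for the version change should show an in-place update of the cluster rather than a restore from a snapshot. It is worth checking the output of terraform plan before applying to confirm that.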

