Fixing "Connection to Cloud SQL instance at IP:3307 failed: timed out after 10s"


This was a rather interesting problem to fix.

I'm running a PHP app as a Google Cloud Run service. The app connects to a Cloud SQL instance through the Cloud SQL connection, configured like this:

resource "google_cloud_run_service" "pricing" {
  provider                   = google-beta
  project                    = "project-id"
  name                       = "pricing"
  location                   = "europe-west1"
  autogenerate_revision_name = true

  template {
    metadata {
      annotations = {
        "autoscaling.knative.dev/minScale"        = 1
        "autoscaling.knative.dev/maxScale"        = 10
        "run.googleapis.com/startup-cpu-boost"    = "true"
        # connection to sql instance, socket available at /cloudsql/INSTANCE_CONNECTION_NAME/.s.PGSQL.5432
        "run.googleapis.com/cloudsql-instances"   = "project-id:europe-west1:pricing-9e66beb6"
        "run.googleapis.com/cpu-throttling"       = "true"
        "run.googleapis.com/vpc-access-connector" = "cr-connector"
        "run.googleapis.com/vpc-access-egress"    = "all-traffic"
      }
      labels = {
        env     = "production"
        service = "pricing"
      }
    }

    spec {
      container_concurrency = 20
      containers {
        image = "europe-docker.pkg.dev/project-id/pricing/pricing:v202309.20"

        resources {
          limits = {
            cpu    = "1000m"
            memory = "2Gi"
          }
        }
      }
      service_account_name = "pricing-cr@project-id.iam.gserviceaccount.com"
    }
  }

  metadata {
    annotations = {
      "run.googleapis.com/ingress" = "internal-and-cloud-load-balancing"
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

It was deployed a few weeks ago and had been working fine without issues. But over the last few days, I observed some random-looking spikes in response times, along with errors logged by GCP like:

Cloud SQL connection failed. Please see https://cloud.google.com/sql/docs/mysql/connect-run for additional details: connection to Cloud SQL instance at 34.XX.XXX.XXX:3307 failed: timed out after 10s

I started debugging, and those random spikes turned out to be not so random. After around a hundred requests, response times started to climb (sometimes 60+ seconds). It stayed like this for a few seconds, then everything went back to normal for another hundred requests or so. GCP logged some 502s along with the error above. That error was logged not by the application but by Cloud Run itself, so it looked like the SQL connector was somehow failing.

I thought that maybe the Cloud Run revision was broken, so I redeployed the app a few times, but that did not help. After searching the internet, I came across a Stack Overflow question that pointed me to the Cloud Run known issues page.

Yes, I was running a Cloud Run service with a VPC connector and the all-traffic egress setting.

How to fix this issue

  1. I could stop using a VPC connector with Cloud Run.
  2. I could switch VPC egress traffic to private-ranges-only.
  3. I could switch from Serverless VPC Access to Direct VPC egress.
  4. I could configure the SQL instance with a private IP and use that IP to connect to the DB.

I need a VPC so I can route all egress traffic through it, with Cloud NAT translating it to a single public IP. Options one and two are therefore no-gos. The third option also won't work for me, since Cloud NAT does not work with Direct VPC egress.

Let's use a private IP connection

I modified the SQL instance in Terraform config to add private IP.

⚠ Adding a private IP will restart the SQL instance. For me, the downtime was around 10 minutes.

resource "random_id" "pricing_sql_instance_suffix" {
  byte_length = 4
}

resource "google_sql_database_instance" "pricing" {
  database_version = "POSTGRES_15"
  project          = "project-id"
  region           = "europe-west1"
  name             = "pricing-${random_id.pricing_sql_instance_suffix.hex}"

  settings {
    tier = "db-custom-1-3840"
    ip_configuration {
      ipv4_enabled                                  = true
      require_ssl                                   = true
      # connect to vpc network and get private ip 
      private_network                               = google_compute_network.default.id
      enable_private_path_for_google_cloud_services = true
    }

    availability_type     = "REGIONAL"
    disk_autoresize_limit = null
    disk_size             = 10
  }
  deletion_protection = true

  depends_on = [
    # needed because Terraform doesn't wait for the connection to be enabled
    google_service_networking_connection.default,
  ]
}

resource "google_compute_network" "default" {
  project                 = "project-id"
  name                    = "cr-static-ip-network"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "peering_block_default" {
  project       = "project-id"
  provider      = google-beta
  name          = "peering-block-default"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  ip_version    = "IPV4"
  prefix_length = 24
  network       = google_compute_network.default.id
}

resource "google_project_service" "servicenetworking" {
  project            = "project-id"
  service            = "servicenetworking.googleapis.com"
  disable_on_destroy = false
}

resource "google_service_networking_connection" "default" {
  provider = google-beta
  network  = google_compute_network.default.id
  service  = google_project_service.servicenetworking.service
  reserved_peering_ranges = [
    google_compute_global_address.peering_block_default.name,
  ]
}

After applying the changes, the SQL instance has a private IP that I can use to connect from the Cloud Run service. I just change the DB host and everything will work, right? NOPE. I get this error now:

An exception occurred in the driver: SQLSTATE[08006] [7] connection to server at "X.X.X.X", port 5432 failed: FATAL: connection requires a valid client certificate connection to server at "X.X.X.X", port 5432 failed: FATAL: pg_hba.conf rejects connection for host "X.X.X.X", user "pricing", database "pricing", no encryption

Looks like SSL is required to connect (require_ssl = true), even for a private IP connection. I could just disable SSL, but that is not secure, especially when the same SQL instance also exposes a public IP.

Let's generate and use a client cert.

The plan is:

  1. generate a client certificate for the SQL instance
  2. put the generated server CA, client cert and private key into Secret Manager secrets
  3. mount those secrets into the Cloud Run service containers under /secrets/cloudsql/client_ca, /secrets/cloudsql/client_cert and /secrets/cloudsql/client_key
  4. use them in the DATABASE_URL env variable the app uses to configure the DB connection

Example DATABASE_URL with certs should look like this:

pgsql://user:pass@IP/db_name?sslmode=verify..
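
For illustration, a fully spelled-out URL might look like the sketch below. The host, credentials and database name are placeholders, and the sslmode/sslrootcert/sslcert/sslkey query parameters assume a libpq-style DSN; adjust them to whatever your driver actually expects:

```shell
# Hypothetical example; all values are placeholders, and the SSL query
# parameters assume libpq-style names (sslmode, sslrootcert, sslcert, sslkey).
export DATABASE_URL='pgsql://pricing:secret@10.0.0.3:5432/pricing?sslmode=verify-ca&sslrootcert=/secrets/cloudsql/client_ca&sslcert=/secrets/cloudsql/client_cert&sslkey=/secrets/cloudsql/client_key'
```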

Terraform code should look like this:

resource "google_sql_ssl_cert" "client_ssl_cert" {
  project     = "project-id"
  common_name = "pricing"
  instance    = google_sql_database_instance.pricing.name
}

resource "google_secret_manager_secret" "cloudsql_client_ca" {
  secret_id = "cloudsql_client_ca"
  project   = "project-id"
  replication {
    automatic = true
  }
}

resource "google_secret_manager_secret_version" "cloudsql_client_ca_latest" {
  secret      = google_secret_manager_secret.cloudsql_client_ca.id
  secret_data = google_sql_ssl_cert.client_ssl_cert.server_ca_cert
}

resource "google_secret_manager_secret" "cloudsql_client_cert" {
  secret_id = "cloudsql_client_cert"
  project   = "project-id"
  replication {
    automatic = true
  }
}

resource "google_secret_manager_secret_version" "cloudsql_client_cert_latest" {
  secret      = google_secret_manager_secret.cloudsql_client_cert.id
  secret_data = google_sql_ssl_cert.client_ssl_cert.cert
}

resource "google_secret_manager_secret" "cloudsql_client_key" {
  secret_id = "cloudsql_client_key"
  project   = "project-id"
  replication {
    automatic = true
  }
}

resource "google_secret_manager_secret_version" "cloudsql_client_key_latest" {
  secret      = google_secret_manager_secret.cloudsql_client_key.id
  secret_data = google_sql_ssl_cert.client_ssl_cert.private_key
}

I also needed to grant the service account used by the Cloud Run service IAM permissions on the above secrets. I left that Terraform out to keep the example simple.
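
For reference, the equivalent grant can be sketched with gcloud (project ID, secret names and service account are taken from the configuration above; roles/secretmanager.secretAccessor is the role that lets Cloud Run read secret versions). This is a configuration sketch, not something specific to this setup:

```shell
# Grant the Cloud Run service account read access to each secret.
for s in cloudsql_client_ca cloudsql_client_cert cloudsql_client_key; do
  gcloud secrets add-iam-policy-binding "$s" \
    --project=project-id \
    --member="serviceAccount:pricing-cr@project-id.iam.gserviceaccount.com" \
    --role="roles/secretmanager.secretAccessor"
done
```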

Cloud Run service with mounted secrets should look like this:

resource "google_cloud_run_service" "pricing" {
  provider                   = google-beta
  project                    = "project-id"
  name                       = "pricing"
  location                   = "europe-west1"
  autogenerate_revision_name = true

  template {
    metadata {
      annotations = {
        "autoscaling.knative.dev/minScale"        = 1
        "autoscaling.knative.dev/maxScale"        = 10
        "run.googleapis.com/startup-cpu-boost"    = "true"
        "run.googleapis.com/cloudsql-instances"   = "project-id:europe-west1:pricing-9e66beb6"
        "run.googleapis.com/cpu-throttling"       = "true"
        "run.googleapis.com/vpc-access-connector" = "cr-connector"
        "run.googleapis.com/vpc-access-egress"    = "all-traffic"
      }
      labels = {
        env     = "production"
        service = "pricing"
      }
    }

    spec {
      container_concurrency = 20
      containers {
        image = "europe-docker.pkg.dev/project-id/pricing/pricing:v202309.20"

        resources {
          limits = {
            cpu    = "1000m"
            memory = "2Gi"
          }
        }
        volume_mounts {
          name       = "cloudsql_client_ca"
          mount_path = "/secrets/cloudsql/client_ca"
        }
        volume_mounts {
          name       = "cloudsql_client_cert"
          mount_path = "/secrets/cloudsql/client_cert"
        }
        volume_mounts {
          name       = "cloudsql_client_key"
          mount_path = "/secrets/cloudsql/client_key"
        }
      }
      service_account_name = "pricing-cr@project-id.iam.gserviceaccount.com"
      volumes {
        name = "cloudsql_client_ca"
        secret {
          secret_name = "projects/project-id/secrets/cloudsql_client_ca"
          items {
            key  = "latest"
            path = "."
          }
        }
      }
      volumes {
        name = "cloudsql_client_cert"
        secret {
          secret_name = "projects/project-id/secrets/cloudsql_client_cert"
          items {
            key  = "latest"
            path = "."
          }
        }
      }
      volumes {
        name = "cloudsql_client_key"
        secret {
          secret_name = "projects/project-id/secrets/cloudsql_client_key"
          items {
            key  = "latest"
            path = "."
          }
        }
      }
    }
  }

  metadata {
    annotations = {
      "run.googleapis.com/ingress" = "internal-and-cloud-load-balancing"
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

I apply the new Cloud Run configuration and deploy the app with the new DATABASE_URL. This must work now, right? NOPE. I get another error:

An exception occurred in the driver: SQLSTATE[08006] [7] connection to server at "X.X.X.X", port 5432 failed: private key file "/secrets/cloudsql/client_key" has group or world access; file must have permissions u=rw (0600) or less if owned by the current user, or permissions u=rw,g=r (0640) or less if owned by root

When secrets are mounted, the files are owned by the root user, and mounted secrets are world-readable by default. That is why the Postgres client is complaining about permissions. In theory, I could just change the permissions of those files to 0640 and it would probably work. However, the PHP application uses the php-fpm process manager to handle connections: the php-fpm master process runs as root, but all the workers it spawns to handle requests run as the www-data user. Since the files are owned by root (group root), changing permissions to 0640 would just remove read access for www-data.
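
To make the permission reasoning concrete, here is a quick local sketch, with a /tmp file standing in for the mounted key (chown is skipped since it needs root):

```shell
# Simulate the mounted key file; path and content are made up for the demo.
key=/tmp/demo_client_key
printf 'fake-key-material' > "$key"
chmod 644 "$key"          # world-readable, like a default secret mount
stat -c '%a' "$key"       # prints 644
# Tighten to 0640: world loses read; only the owner and group keep access.
# On the real file (owner root, group root) this would lock out www-data.
chmod 640 "$key"
stat -c '%a' "$key"       # prints 640
```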

Okay then, I'll just change the ownership of the secret files to www-data in the Docker entrypoint script and everything will work. Yet again, NOPE. I tried running chown www-data:www-data on all those files, but it did not change the ownership of any of them: the secret volumes are mounted read-only, so the files cannot be modified.

Change of plans - secrets in environment variables

The new plan is:

  1. Encode the CA, cert and key as base64 strings and put them in env variables on the Cloud Run service
  2. In the entrypoint script, check if the variables are set, and then:
    1. decode them
    2. write them to the /secrets/cloudsql/client_* files
    3. change ownership to www-data and permissions to 0640

Let's add the new env vars to the container.

resource "google_cloud_run_service" "pricing" {
  provider                   = google-beta
  project                    = "project-id"
  name                       = "pricing"
  location                   = "europe-west1"
  autogenerate_revision_name = true

  template {
    metadata {
      annotations = {
        "autoscaling.knative.dev/minScale"        = 1
        "autoscaling.knative.dev/maxScale"        = 10
        "run.googleapis.com/startup-cpu-boost"    = "true"
        "run.googleapis.com/cloudsql-instances"   = "project-id:europe-west1:pricing-9e66beb6"
        "run.googleapis.com/cpu-throttling"       = "true"
        "run.googleapis.com/vpc-access-connector" = "cr-connector"
        "run.googleapis.com/vpc-access-egress"    = "all-traffic"
      }
      labels = {
        env     = "production"
        service = "pricing"
      }
    }

    spec {
      container_concurrency = 20
      containers {
        image = "europe-docker.pkg.dev/project-id/pricing/pricing:v202309.20"

        resources {
          limits = {
            cpu    = "1000m"
            memory = "2Gi"
          }
        }
        env {
          name  = "CLOUDSQL_CLIENT_SSL_CA_B64"
          value = base64encode(google_sql_ssl_cert.client_ssl_cert.server_ca_cert)
        }
        env {
          name  = "CLOUDSQL_CLIENT_SSL_CERT_B64"
          value = base64encode(google_sql_ssl_cert.client_ssl_cert.cert)
        }
        env {
          name  = "CLOUDSQL_CLIENT_SSL_KEY_B64"
          value = base64encode(google_sql_ssl_cert.client_ssl_cert.private_key)
        }
      }
      service_account_name = "pricing-cr@project-id.iam.gserviceaccount.com"
    }
  }

  metadata {
    annotations = {
      "run.googleapis.com/ingress" = "internal-and-cloud-load-balancing"
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

The final step is to update the entrypoint script and add:

# Decode each of the three PEM values (CA, cert, key) from its env variable
# and write it to /secrets/cloudsql with the right owner and permissions.
for item in CA CERT KEY; do
  b64_var="CLOUDSQL_CLIENT_SSL_${item}_B64"
  dest="/secrets/cloudsql/client_$(tr '[:upper:]' '[:lower:]' <<< "$item")"
  if [[ -n ${!b64_var} ]]; then
    echo "[INFO] Decoding CloudSQL client ${item} from env"
    mkdir -p "$(dirname "$dest")"
    base64 -d <<< "${!b64_var}" > "$dest"
    chown www-data:www-data "$dest"
    chmod 640 "$dest"
  fi
done
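
To sanity-check the decode logic locally, here is a quick smoke test of one branch (a dummy value and a /tmp path, skipping chown since that needs root and a www-data user):

```shell
# Dummy round trip of the CA branch; the value and path are made up for the test.
export CLOUDSQL_CLIENT_SSL_CA_B64="$(printf 'dummy-ca-pem' | base64)"
ca_path=/tmp/secrets/cloudsql/client_ca
mkdir -p "$(dirname "$ca_path")"
base64 -d <<< "$CLOUDSQL_CLIENT_SSL_CA_B64" > "$ca_path"
chmod 640 "$ca_path"
cat "$ca_path"    # prints dummy-ca-pem
```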

FINALLY, it works.

Final thoughts

It would be much better if the SQL connector just failed from the start when a public IP is used with all traffic routed to the VPC. That way, people would not stress as much, because the app would not suddenly start misbehaving in production.

This whole problem could also have been avoided if the docs highlighted this VPC issue. Right now it is just a tiny note that is easy to miss.

I'm not sure if I missed it in the docs, but there seems to be no way to set the owning user of a mounted secret in Cloud Run, or to change it at runtime. This small thing complicated everything and forced me to use the base64-encoded env vars approach instead of mounted secrets, which would have been a more elegant and secure solution.