Remote Disconnect Errors

Posted on December 22, 2020 by John Humphreys

Problem Overview

We use AWS EKS (v1.16) Kubernetes for our auto-scaling Presto deployments, and we front it with an nginx ingress leveraging a network load balancer.

We found that, once we started auto scaling, we started getting remote disconnect errors from clients fairly frequently. This was pretty hard to explain because we had actually gone to great lengths to make sure Presto itself was gracefully terminating in a way that would not damage live queries.

Where is the Issue?

The root cause of this issue is that:

We use ingress.
Ingress uses a cloud load balancer.
The cloud load balancer talks to the nginx ingress controller as a NodePort service.
This means the LB will route traffic through any random node in the cluster.
So, we gracefully terminate presto, but the NodePort service on the node that is scaling down may still be used for routing traffic to another node (e.g. the coordinator in this case).

It turns out that there really is no good way to fix this in EKS at this point in time. We originally hit this bug: https://github.com/kubernetes/autoscaler/issues/1907, and when we tried the workaround of using externalTrafficPolicy = Local, we hit this other bug: https://github.com/kubernetes/cloud-provider-aws/issues/87.

Other solutions are being developed now and will allow you to exclude certain nodes from the LB config using labels/etc, but they are not ready yet.

What is a Workaround?

Unfortunately, we did not solve this purely using the NGINX ingress. We found that we had to schedule the ingress services on some non-auto-scaling core nodes, and then we added them to the load balancer specifically (actually, to a separate LB we created and manage with terraform). This way, ingress always comes into nodes that do not auto scale, and those nodes route to the other services in a reliable way using the CNI black magic. It’s not a feel-good solution, but it remains stable during auto scaling of the rest of the cluster, so it works until a real k8s/AWS solution is developed.

Kubernetes – Get terminationGracePeriodSeconds and Other Values Missing From Describe Pod/Deployment

Posted on October 23, 2020 by John Humphreys

When checking what is running in kubernetes, people generally do something like this:

kubectl get deploy -n <namespace>
kubectl get pods -n <namespace>

And to describe extended parameters on a deployment or pod:

kubectl describe deploy -n <namespace> <deployment-name>
kubectl describe pod -n <namespace> <pod-name>

Interestingly, these more verbose describe commands are still missing a lot of information. It turns out that the only way to get *all* of the information is to go back to the get command and to tell it to output everything to YAML or a similar format:

kubectl get deploy -n <namespace> -o yaml
kubectl get pods -n <namespace> -o yaml

These commands will yield far more configuration options than the describe commands. Things like terminationGracePeriodSeconds will be readily available here.
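
If you only care about one field, you can filter the YAML rather than reading all of it. Here is a minimal sketch that scans the output line by line for terminationGracePeriodSeconds. The class name GracePeriodScan and its method are made up for illustration, and it uses a plain regex rather than a real YAML parser, so it only tells you the values, not which object they belong to.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of kubectl or any Kubernetes client library.
public class GracePeriodScan {

    // Matches a YAML line such as "      terminationGracePeriodSeconds: 30".
    private static final Pattern FIELD =
            Pattern.compile("\\s*terminationGracePeriodSeconds:\\s*(\\d+)\\s*");

    /** Returns every terminationGracePeriodSeconds value found, in document order. */
    public static List<Integer> findGracePeriods(List<String> yamlLines) {
        List<Integer> values = new ArrayList<>();
        for (String line : yamlLines) {
            Matcher m = FIELD.matcher(line);
            if (m.matches()) {
                values.add(Integer.parseInt(m.group(1)));
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Read the YAML from standard input so kubectl output can be piped in.
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        System.out.println(findGracePeriods(lines));
    }
}
```

On Java 11 or newer you should be able to run it straight from source, e.g. kubectl get deploy -n <namespace> -o yaml | java GracePeriodScan.java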

Presto – Get and List the Connectors on All Nodes in Cluster

Posted on September 29, 2020 by John Humphreys

Some problems in presto are the result of having connector definitions only on a subset of nodes in the cluster. For example, a recent error on the presto-sql forum during insert into a hive table was:

java.lang.IllegalArgumentException: No page sink provider for catalog 'hive'
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)
	at io.prestosql.split.PageSinkManager.providerFor(PageSinkManager.java:67)
	at io.prestosql.split.PageSinkManager.createPageSink(PageSinkManager.java:61)
	at io.prestosql.operator.TableWriterOperator$TableWriterOperatorFactory.createPageSink(TableWriterOperator.java:114)
	at io.prestosql.operator.TableWriterOperator$TableWriterOperatorFactory.createOperator(TableWriterOperator.java:105)
	at io.prestosql.operator.DriverFactory.createDriver(DriverFactory.java:114)
	at io.prestosql.execution.SqlTaskExecution$DriverSplitRunnerFactory.createDriver(SqlTaskExecution.java:941)
	at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1069)
	at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)

If you have a decent-sized cluster, it is very painful to go to each node and check its catalogs. This problem can be even worse if you have an old node join the cluster after maintenance or something like that.

In any case, you can use the following URL on presto (/v1/service/presto) to list all nodes and their registered connectors in one shot. This will help you track down the problem fast :). You can even be lazy and parse the JSON in chrome dev tools/etc so you don’t have to eyeball all the nodes.

https://nonprod.presto.your-company.com/v1/service/presto

Example Output

{
  "environment": "nonprod",
  "services": [
    {
      "id": "a35ae2a7-fa95-43c9-b893-180449a48c5a",
      "nodeId": "blue-presto-worker-865b8db58-g92wn",
      "type": "presto",
      "pool": "general",
      "location": "/blue-presto-worker-865b8db58-g92wn",
      "properties": {
        "node_version": "331-n-2.6.1",
        "coordinator": "false",
        "https": "https://10-234-232-180.nonprod-presto.pod.cluster.local:8443",
        "https-external": "https://10-234-232-180.nonprod-presto.pod.cluster.local:8443",
        "connectorIds": "hive-dl,system,cr-meta,ar-meta,dc-meta"
      }
    },
    {
      "id": "b8dd0f39-00b0-4c78-b0c0-ff8e753419d8",
      "nodeId": "blue-presto-worker-865b8db58-d2nsz",
      "type": "presto",
      "pool": "general",
      "location": "/blue-presto-worker-865b8db58-d2nsz",
      "properties": {
        "node_version": "331-n-2.6.1",
        "coordinator": "false",
        "https": "https://10-234-234-106.nonprod-presto.pod.cluster.local:8443",
        "https-external": "https://10-234-234-106.nonprod-presto.pod.cluster.local:8443",
        "connectorIds": "hive-dl,system,cr-meta,ar-meta,dc-meta"
      }
    },
...
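
If you would rather not eyeball the output, the comparison is easy to automate. The sketch below assumes you have already pulled each service's nodeId and connectorIds out of the response with whatever JSON library you like (that part is not shown), and it reports which connectors each node lacks relative to the union seen across all nodes. ConnectorDiff and missingConnectors are names I made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: works on values already extracted from /v1/service/presto.
public class ConnectorDiff {

    /**
     * Takes nodeId -> comma-separated connectorIds and returns, for each node that is
     * missing something, the connectors seen elsewhere in the cluster but not on it.
     */
    public static Map<String, Set<String>> missingConnectors(Map<String, String> connectorIdsByNode) {
        Set<String> all = new TreeSet<>();
        Map<String, Set<String>> perNode = new LinkedHashMap<>();

        // Split each node's connectorIds string and build the cluster-wide union.
        for (Map.Entry<String, String> entry : connectorIdsByNode.entrySet()) {
            Set<String> ids = new TreeSet<>();
            for (String id : entry.getValue().split(",")) {
                String trimmed = id.trim();
                if (!trimmed.isEmpty()) {
                    ids.add(trimmed);
                }
            }
            perNode.put(entry.getKey(), ids);
            all.addAll(ids);
        }

        // Anything in the union that a node does not have is "missing" on that node.
        Map<String, Set<String>> missing = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : perNode.entrySet()) {
            Set<String> diff = new TreeSet<>(all);
            diff.removeAll(entry.getValue());
            if (!diff.isEmpty()) {
                missing.put(entry.getKey(), diff);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample values only; the second node is deliberately missing hive-dl.
        Map<String, String> nodes = new LinkedHashMap<>();
        nodes.put("blue-presto-worker-865b8db58-g92wn", "hive-dl,system,cr-meta,ar-meta,dc-meta");
        nodes.put("blue-presto-worker-865b8db58-d2nsz", "system,cr-meta,ar-meta,dc-meta");
        System.out.println(missingConnectors(nodes));
    }
}
```

Comparing against the union means you do not need to know the "correct" connector list up front, though it will not catch a connector that is missing from every node.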

Google Drive API v3 + Sheets + Shared Drives in Java

Posted on September 7, 2020 by John Humphreys

There are plenty of examples of how to use the Google Drive API online. A ton are for old versions though, and most are basic cases (not good with restricted sharing options/etc). Also, virtually none show you how to do things with shared drives.

I had to do all of this recently, so I hope this helps someone else avoid the pain I went through =). The only thing this assumes is that you have a valid credentials file generated from the developer console.

Defining Scopes

These scopes should all be enabled for your credentials on the consent screen part of the developer console. Also list them in your code.

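// Scopes requested for the credential; keep these in sync with the consent screen.
private static final List<String> SCOPES;

static {
    SCOPES = new ArrayList<>();
    SCOPES.add(SheetsScopes.DRIVE);
    SCOPES.add(SheetsScopes.DRIVE_FILE);
    SCOPES.add(SheetsScopes.SPREADSHEETS);
}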

Get Credentials

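private HttpRequestInitializer getCredentials(NetHttpTransport httpTransport) {
    // try-with-resources closes the credentials file stream when we are done with it.
    try (InputStream in = new FileInputStream(credentialsFilePath)) {
        GoogleCredential credential = GoogleCredential.fromStream(in, httpTransport, JSON_FACTORY)
                .createScoped(SCOPES)
                .createDelegated(svcAccount);
        // setHttpTimeout is a helper that is not shown in this post.
        return setHttpTimeout(credential);
    } catch (IOException e) {
        // Fail fast instead of continuing with a null credential.
        logger.error("Error occurred while authorizing with the provided credentials.", e);
        throw new UncheckedIOException(e);
    }
}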

Get Sheet and Drive Services

private Sheets getSheetService(String applicationName, NetHttpTransport httpTransport) throws FileNotFoundException {
    return new Sheets.Builder(
            httpTransport,
            JSON_FACTORY,
            getCredentials(httpTransport)
    ).setApplicationName(applicationName).build();
}

private Drive getDriveService(String applicationName, NetHttpTransport httpTransport) throws FileNotFoundException {
    return new Drive.Builder(
            httpTransport,
            JSON_FACTORY,
            getCredentials(httpTransport)
    ).setApplicationName(applicationName).build();
}

Create a Spreadsheet and Control Permissions

You can create a sheet easily with the sheet service. However, if you want to put your sheet in a specific parent folder and change permissions or sharing settings, then you need to create it with the drive service, setting the spreadsheet MIME type.

You can find your folder ID by navigating to your folder in Google Drive and getting the ID from the URL. Since we set “supports all drives” (setSupportsAllDrives(true)), we can create this file in a folder in our shared drive. Without this setting, shared drives fail with some kind of auth error.

private File createSpreadSheet(Drive driveService, String sheetTitle, String userFolderId) {
    try {
        File fileSpec = new File();
        fileSpec.setName(sheetTitle);
        fileSpec.setParents(Collections.singletonList(userFolderId));
        fileSpec.setMimeType("application/vnd.google-apps.spreadsheet");

        File sheetFile = driveService.files()
                .create(fileSpec)
                .setSupportsAllDrives(true) // Shared drives don't work without this parameter.
                .execute();

        // Put only the sharing settings in the update body.
        File sharingSettings = new File();
        sharingSettings.setViewersCanCopyContent(false);
        sharingSettings.setCopyRequiresWriterPermission(true);
        sharingSettings.setWritersCanShare(false);

        // Nothing is sent until execute() is called.
        driveService.files()
                .update(sheetFile.getId(), sharingSettings)
                .setSupportsAllDrives(true)
                .execute();

        return sheetFile;
    } catch (IOException e) {
        throw new RuntimeException("Error occurred while creating the sheet.", e);
    }
}

Write Data to a Spreadsheet

private void writeToSpreadSheet(Sheets service, String spreadSheetId, String json) {
    final String range = "Sheet1";
    ValueRange body = new ValueRange()
            .setValues(getJsonData(json));
    UpdateValuesResponse response;
    try {
        response = service
                .spreadsheets()
                .values()
                .update(spreadSheetId, range, body)
                .setValueInputOption(VALUE_INPUT_OPTION)
                .execute();
    } catch (IOException e) {
        throw new RuntimeException("Error occurred while inserting/updating values in Google spreadsheet: " + spreadSheetId, e);
    }
    logger.info(response.getUpdatedCells() + " cells updated.");
}
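
The getJsonData helper is not shown in this post, but ValueRange.setValues takes a List<List<Object>>: one inner list per row, one element per cell. As a rough sketch of producing that shape, the hypothetical SheetRows.toRows below turns a list of records into a header row plus one row per record. The record format, the column list, and the choice to write an empty string for a missing field are all assumptions of this example, not what getJsonData actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds the rows-of-cells structure that ValueRange.setValues expects.
public class SheetRows {

    /** Builds a header row followed by one row per record, in the column order given. */
    public static List<List<Object>> toRows(List<String> columns, List<Map<String, Object>> records) {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(new ArrayList<Object>(columns)); // header row

        for (Map<String, Object> record : records) {
            List<Object> row = new ArrayList<>();
            for (String column : columns) {
                Object value = record.get(column);
                // Assumption of this sketch: a missing field becomes an empty cell.
                row.add(value == null ? "" : value);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample records only; the second one has no "queries" field.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("name", "alice");
        first.put("queries", 12);
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("name", "bob");

        System.out.println(toRows(Arrays.asList("name", "queries"), Arrays.asList(first, second)));
    }
}
```

The result could then be passed to new ValueRange().setValues(...) in place of getJsonData(json).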

Find a Folder in Another Folder

private String getFolderIdIfExists(Drive driveService, String folderName) throws IOException {

    FileList folders = driveService.files().list()
            .setSupportsAllDrives(true)
            .setIncludeItemsFromAllDrives(true)
            .setQ(String.format("'%s' in parents and mimeType = 'application/vnd.google-apps.folder' and name = '%s'",
                    mainFolderId, folderName))
            .execute();

    return folders.getFiles().size() == 1 ? folders.getFiles().get(0).getId() : null;
}
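
One thing to watch with the query above: the folder name is dropped straight into a single-quoted string, so a name like John's Reports would break the query. As far as I understand the Drive search syntax, single quotes and backslashes inside a value need a backslash in front of them. Here is a small sketch of building the same query with that escaping; DriveFolderQuery and its methods are names invented for this example, not part of the Drive client.

```java
// Hypothetical helper: builds the "q" string used with driveService.files().list().setQ(...).
public class DriveFolderQuery {

    /** Escapes backslashes and single quotes so a value can sit inside a quoted query term. */
    public static String escape(String value) {
        // Backslashes first, so the ones added for quotes are not escaped again.
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    /** Same query as in the post, with both inputs escaped. */
    public static String folderQuery(String parentId, String folderName) {
        return String.format(
                "'%s' in parents and mimeType = 'application/vnd.google-apps.folder' and name = '%s'",
                escape(parentId), escape(folderName));
    }

    public static void main(String[] args) {
        // Sample values only.
        System.out.println(folderQuery("parent-folder-id", "John's Reports"));
    }
}
```

You would then call .setQ(DriveFolderQuery.folderQuery(mainFolderId, folderName)) instead of formatting the string inline.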

Create a Folder In a Specific Folder

private String createUserFolderAndGetId(Drive driveService, String folderName) throws IOException {

    File fileSpec = new File();
    fileSpec.setName(folderName);
    fileSpec.setParents(Collections.singletonList(mainFolderId));
    fileSpec.setMimeType("application/vnd.google-apps.folder");

    File targetFolder = driveService.files()
            .create(fileSpec)
            .setSupportsAllDrives(true) // Shared drives don't work without this parameter.
            .execute();

    return targetFolder.getId();
}

Helm 3 / GitLab Uninstall If Exists

Posted on July 23, 2020 by John Humphreys

Helm 3 does not seem to have a good way to “uninstall if exists”, unfortunately. So, we had to find a way around that to make sure we could reliably wipe out a previous deployment in CI/CD (in cases where we had to change a deployment version, which is rare).

As we use GitLab, we found this trick in the docs:

If any of the script commands return an exit code different from zero, the job will fail and further commands won’t be executed. This behavior can be avoided by storing the exit code in a variable:

job:
  script:
    - false || exit_code=$?
    - if [ $exit_code -ne 0 ]; then echo "Previous command failed"; fi;

Using this, you can do:

helm uninstall -n your-namespace some-deployment-0-0-5 || exit_code=$?

And, while Helm will still report an error on any run where the release is already gone, the pipeline will continue on fine.

Coding Stream of Consciousness

by John Humphreys – Random code from my life.