Setting up HTTPS on an AWS Load Balancer

Context

This was my first time setting up HTTPS using a real certificate rather than a self-signed one.  This way you get that nice lock symbol + https in your browser without the user seeing the obnoxious “do you trust this site, continue unsafely” warnings.

Steps

Generate a Key Pair

Note that for the common name you should use the DNS name that you will be using for the website (or the load balancer in front of it).  This is important.  E.g. *.appname.yourcompany.com, which is a wildcard cert so it also covers anything in front of the app name (like api.appname.yourcompany.com).

openssl req -new -newkey rsa:2048 -nodes -keyout appname.key -out appname.csr

Do not lose this key pair! Store it somewhere safe and backed up.
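
If you want to sanity-check the request before sending it anywhere, you can decode the CSR and confirm the common name made it in correctly (a quick sketch using the file name from the command above):

# Decode the CSR, verify its signature, and print the Subject so you can check the common name
openssl req -in appname.csr -noout -text -verify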

Request a Certificate from DigiCert/etc

Go to DigiCert or whatever service your company uses.  Request a standard SSL certificate using the files you generated above.

For the case of a wild-card cert, you will want to add Subject Alternative Names (SANs) for appname.yourcompany.com and *.appname.yourcompany.com.

If you get odd failure messages, these sites tend to have something to validate your CSR.  In my case, I think it made me go remake the CSR without a password before it would accept it.  This was only obvious from the validator; the initial messages were just confusing.
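
One related thing you can check locally (my hedged guess at what tripped up the validator for me) is whether the private key is passphrase-protected.  With -nodes it should not be, and this should read the key back without prompting for a password:

# If this prompts for a pass phrase, regenerate the key/CSR without one (the -nodes flag above)
openssl rsa -in appname.key -noout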

Note: The server name and location fields don’t seem to matter much.  I just put AWS for location and my app name with some environment suffix for the server name (nothing exists with this name).

Add Your Certificate to the AWS Certificate Manager

Once your certificate comes back from DigiCert, you can upload it to the Certificate Manager in AWS.  The body is the certificate returned to you from DigiCert/etc.  The private key is the one you generated above.  The certificate chain just needed the “intermediate certificate” that DigiCert returned to me along with the new certificate (and only that, don’t put your certificate in there as well).
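
If you prefer the CLI over the console, the same upload can be done with aws acm import-certificate.  This is just a sketch; the file names are assumptions for whatever DigiCert sent back plus the key you generated earlier:

# Upload the signed certificate, your private key, and the intermediate chain to ACM
aws acm import-certificate \
  --certificate fileb://appname.crt \
  --private-key fileb://appname.key \
  --certificate-chain fileb://intermediate.crt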

Create Your Load Balancer

Go to EC2 and create a new network load balancer.  Add a TLS (Secure TCP) listener and on the security settings page, pick “Choose a certificate from ACM (recommended)”.  Then you can select your certificate.
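
I did this through the console, but for reference the CLI equivalent looks roughly like this (a sketch; the subnet IDs, ARNs, and names are placeholders you would fill in with your own):

# Create the network load balancer
aws elbv2 create-load-balancer --name appname-nlb --type network --subnets subnet-aaaa subnet-bbbb

# Add a TLS listener on 443 that uses the ACM certificate and forwards to your target group
aws elbv2 create-listener \
  --load-balancer-arn <nlb-arn> \
  --protocol TLS --port 443 \
  --certificates CertificateArn=<acm-cert-arn> \
  --default-actions Type=forward,TargetGroupArn=<target-group-arn>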

Create Your DNS Name

Contact your system admins to create a DNS name for you with the name you used in your certificate request (e.g. appname.yourcompany.com from above).  Point it at the IP of the load balancer.  You will have to resolve the load balancer name to an IP for this (I’m not sure that is best practice, and a CNAME to the load balancer’s DNS name may be the cleaner option, so you may want to read up more before following this last step).
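
If you do go the IP route, resolving the load balancer’s DNS name is easy enough (the hostname here is a made-up placeholder for whatever EC2 shows for your load balancer):

# Resolve the load balancer's DNS name to its current IP(s)
dig +short appname-nlb-1234567890.elb.us-east-1.amazonaws.com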

All Done!

Assuming your load balancer points to an application now, you can use https to talk to the DNS name of the load balancer, and the load balancer will take it, terminate the TLS, and forward to your application.  So, you’re good!


Ansible – Install AWS CLI and Log in To Amazon ECR Docker Registry via Ansible

There is probably a much cleaner way of doing this using off-the-shelf automations.  But I was just following along with the AWS installation instructions and got this working.

    - name: Download AWS CLI bundle.
      shell: "cd /tmp && rm -rf /tmp/awscli* && curl 'https://s3.amazonaws.com/aws-cli/awscli-bundle.zip' -o 'awscli-bundle.zip'"

    - name: Update repositories cache and install "unzip" package
      apt:
        name: unzip
        update_cache: yes

    - name: Unzip AWS CLI bundle.
      shell: "cd /tmp && unzip awscli-bundle.zip"

    - name: Run AWS CLI installer.
      shell: "/tmp/awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws"

    - name: Log into aws ecr docker registry
      when: jupyterhub__notebook_registry != ''
      shell: "$(/usr/local/bin/aws ecr get-login --no-include-email --region us-east-1)"

In order to do the actual login, you need to ensure your EC2 instance has an IAM role assigned to it that has ECR read privileges. Then you should be all good!
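
For reference, granting that access is just a matter of attaching the managed read-only ECR policy to the instance’s role (a sketch; the role name is a placeholder for whatever role your instance profile uses):

# Give the instance role permission to pull from ECR
aws iam attach-role-policy \
  --role-name my-ec2-instance-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly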

S3 Eventual Consistency

Consistent Distributed File Systems

Historically, I’ve used standard HDFS, MapR’s version of HDFS (MapR-FS), and ADLS (Azure’s data lake service).  All of these behave very much like you would expect a local file system to.  If you write files and another process lists files, it will immediately see them and be able to use them without issue.

Amazon s3 File System Issues

I was surprised when I started learning about Amazon s3 after using all of these prior file systems.  I understand that s3 is an object store… similar to Azure Blob Storage.  I also understand that it is the main data lake solution though.

Maybe it’s just because I’m new and am missing something… but there doesn’t seem to be any AWS version of ADLS.

The s3 storage service is eventually consistent.  This means that if you run Spark or similar tools directly on it, they will likely produce incorrect results or fail.  This is because multiple tasks write files in parallel and then list them, and the listing won’t necessarily reflect the fully up-to-date view of the storage.  So, a job may write 10 files, list them, and only see 5 of them, etc.

I came across a very good article describing this in detail here: https://www.opendoor.com/w/blog/why-s3guard-with-s3-as-a-filesystem-spark.

The TLDR is that you have to use a consistency layer between your big data frameworks and s3 to ensure they function well.  You can confirm this by reading the short hadoop documentation site here -> https://wiki.apache.org/hadoop/AmazonS3.

Note that the first article recommends S3Guard which works based on DynamoDB, but there may be other options (e.g. EMR will have a way of dealing with this).
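
For reference, turning on S3Guard is mostly a matter of pointing the S3A connector at a DynamoDB metadata store.  With Spark that looks roughly like this (a sketch using the standard hadoop-aws S3Guard properties; the table name, region, and job file are placeholders):

# Route S3A directory listings through a DynamoDB table so they are consistent
spark-submit \
  --conf spark.hadoop.fs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore \
  --conf spark.hadoop.fs.s3a.s3guard.ddb.table=my-s3guard-table \
  --conf spark.hadoop.fs.s3a.s3guard.ddb.region=us-east-1 \
  --conf spark.hadoop.fs.s3a.s3guard.ddb.table.create=true \
  your_job.py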

Determine Compatibility of hadoop-aws and aws-java-sdk-bundle JARs

When you’re integrating hadoop and other big-data frameworks with AWS s3, you will quickly run into a situation where you need to include the hadoop-aws and aws-java-sdk-bundle JARs on your class path.

Unfortunately, these JARs are separately versioned and it is hard to figure out compatibility.  The hadoop-aws JAR has to match your hadoop version exactly, so that one is fine.

Determining the Right Version

  1. Check your hadoop version.
  2. Get the hadoop-aws.jar with the same exact version.
  3. Go to the maven central page for the correct version of the hadoop-aws.jar and look at its compile dependencies.  E.g. at https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.9 you can see the SDK dependency is com.amazonaws » aws-java-sdk-bundle 1.11.199.  A quick way to pull both in with matching versions is sketched below.
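
Once you know the matching pair, the easiest route is to let your framework pull both artifacts for you.  This is a sketch: the 2.9.0 / 1.11.199 pair is just the example from above (use whatever your own hadoop version maps to), and your_job.py is a placeholder:

# Check the running hadoop version first
hadoop version

# Then launch with matching hadoop-aws and aws-java-sdk-bundle artifacts on the class path
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.9.0,com.amazonaws:aws-java-sdk-bundle:1.11.199 \
  your_job.py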