12 Factor App Best Practices – Quick Notes

Posted on December 13, 2019 by John Humphreys

Why Review This?

The 12 factor app “best practices” are a good overview of most things you should be doing in a modern application. They provide a good overview of many architectural and dev-ops practices which people tend to learn in a more peace-meal and painful way over time.

Sources / References

While reviewing this recently myself, I read and viewed a few sources. I found this video https://www.youtube.com/watch?v=94PxlbuizCU on YouTube gave a very practical summary… especially for those who like to work in Java a lot such as myself. I also found this other one from AWS https://pages.awscloud.com/Building-Microservices-with-the-12-Factor-App-Pattern-on-AWS_0512-CON_OD.html which fills in some areas better than the other, especially when it comes to practical use with containerization and also in the area of logging. I recommend taking the time to watch both.

The Twelve Factors

Provided with my interpretation/commentary of course based on the various sources I looked at.

Code Base – Should be checked into version control (e.g. Git). Should have one application per repository (or sub-module). E.g. if you have an API and a website, separate them so you don’t end up having to combine releases / so you aren’t inclined to couple them and muddy the waters.
Dependencies – Should be declared. Should also be external to your code base (e.g. pull from a maven repo, don’t check JARs into your git repo). Should avoid using anything from outside the project (e.g. if you’re in Node, don’t global install anything, make sure everything is local to the project). This also kind of leads to the fact that you should not rely on something like an app server (e.g. tomcat) external to your app to host your app. Your app should be more like a spring-boot application which has is deployed with its own web server bundled in.
Configuration – Passwords/tokens/connections/etc should be separate from your code. E.g. using profiles and ansible-secrets/etc within your code base to handle multiple environments can be bad because you have to modify your code base to support new environments. These configurations should be moved out to environment variables or some other mechanism. The use of environment variables seems to be recommended in many frameworks/technologies online these days, so I think its good to do when possible. Also note, the use of CNAMES and other flexible configuration mechanisms to abstract the location from which things are running is recommended.
Backing Services – Databases/queues/caches/etc should be decoupled and easily changeable in your configuration. Again, prefer CNAMEs/etc for abstraction.
Build/Release/Run – A deployment = building code, combining it with config to form a release, and executing that release in the target environment. This one sounds a little… flexible to me? E.g. Helm lets you take a template, apply values to it, and deploy the result to kubernetes. So, in that case, I guess the helm template/vars part is the “release” formation, and running it is the execution of helm? We don’t need separate docker images built for each environment in this case. If, on the other hand, we’re taking each environment’s config and generating a new docker image per environment, then we’d have one release per environment (code + config combined), and it would be nicely decoupled from the “run”, in which case this makes more direct sense. Note that the AWS video calls this later case an anti-pattern and recommends deploying the config within the environment and using one standard docker image (which sounds better). This could be achieved with config maps, secrets, etc in kubernetes, for example.
Processes – Stateless (stick sessions = bad).
Port Binding – Back to the end of #2/dependencies – apps should be self-contained. They should provide their own web server and should just expose a port themselves as-is and should not depend on anything else to do so.
Concurrency – Build apps to scale out, not just up. Of course you can have more threads/pools/etc… but if you can break up your application well, you can have separate apps for each purpose and each can scale at its own pace and use resources effectively. E.g. if I have an app reading files and dumping records to Kafka, and then it reads the kafka records back out and sends them to an external service, it can be 2 separate apps. Maybe the first part that reads files needs 4 copies with large memory to keep pace, and the second one needs 10 copies with high CPU and low memory to keep pace. They can be dealt with individually and easily once designed well (and this sort of thing is a cake walk when using something like kubernetes).
Disposability – “Servers are cattle, not pets”. You don’t want a long lived server (like an app server) where people know its name and IP and how to debug it. You want little, ephemeral, easily created and destroyed servers (or rather, containers). In the cloud, you may use auto-scaling-groups / scale sets (AWS/Azure) to easily scale up and scale down copies of a server (machine image). But even this isn’t great; it’s heavyweight and slow. If you have an container orchestrator like kubernetes (or docker swarm, etc) in the cloud, or even on-prem, you can scale up and down docker/etc containers in seconds at large scale. Either way, you should basically never need to look at a running machine image or a container unless you’re debugging something. You can create new ones trivially and you don’t care about them. They are disposable.
Dev/Prod Parity – All your environments should be identical minus their configuration. Use the same databases, same caches, same web server versions (bundling web-servers into code like spring-boot does helps with the last one).
Logs – View logs as a continuous stream of events and store them in an external place like ELK or Splunk. Preferably have an agent separate from your app be responsible for uploading the log stream; just let your app write to standard output.
Admin Processes – Things like database pruning, backups, etc. should be treated as any other application you are running and they should be distinct applications separate from the application(s) they are aiding. E.g. any of those processes should basically follow these steps themselves.

Checkpoint Linux SSL Network Extender VPN Auto Closes after Connecting.

Posted on December 10, 2019 by John Humphreys

Not much of a post here… but FYI for everyone – I’m running on Ubuntu Linux. There is no checkpoint VPN client for Linux, so I have to go through a website and use their “SSL Network Extender”.

A fairly large portion of the time, this seems to hang/break under load (e.g. reading lots of database results). It also randomly stops working periodically. I haven’t been able to figure out why honestly.

Eventually, if I keep reconnecting to it, I get into a situation where it auto-closes right after it connects. I couldn’t fix this until I restarted my laptop.

Anyway, I just figured out that disabling my wireless card and re-enabling it also fixes that issue, and that is much, much faster. So, until we figure out the root cause for the other issues, I hope this helps you too!

Presto + Hive View Storage/Implementation

Posted on December 8, 2019 by John Humphreys

I’ve been learning about how presto handles views lately. This is because we are heavily reliant on presto and we recently ran into multiple use cases where our hive metastore had views which wouldn’t work within presto.

What are Presto Views Exactly?

Presto has its own view implementation which is distinct from a hive’s view implementation. Presto will not use a hive view, and if you try to query one, you will get a clear error immediately.

A presto view is based on a Presto SQL query string. A hive view is based on a a hive query string. A hive query string is written in HQL (Hive Query Language), and presto simply does not know that SQL dialect.

How Are Views Stored?

Presto actually stores its views in the exact same location as hive does. The hive metastore database has a TBLS table which holds every hive table and view. Views have two columns populated that tables ignore – view_original_text and view_expanded_text. Hive views will have plain SQL in the view_original_text column whereas presto views will have some encoded representation prefixed with “/* Pesto View…”. If presto queries a view and does not find it’s “/* Pesto View” prefix, it will consider it a hive view and say that it is not supported.

Making Presto Handle Hive Views

I’ve been doing work for some time to try to make presto-sql support hive views. I’m using the pull request noted in this issue https://github.com/prestodb/presto/issues/7338 as a template. It is fairly old though and was made against presto-db rather than presto-sql, so the exercise has turned out to be non-trivial.

I’m still chugging along and will post more when done. But one thing to note is that this PR does not really make presto support hive views. It actually allows presto to attempt to run hive views as they are. Many hive views will be valid presto SQL – e.g. where you’re just selecting from a table with some basic joins and/or where clause filters.

So, this PR basically prevents presto from outright failing when it sees a view that does not start with “/* Presto View”. It then helps it read the hive query text, possibly lightly modify it, and attempt to run it as if the same had been done for a Presto query.

I plan on doing a number of corrections to the SQL as well; e.g. replacing double quotes, converting back-ticks, replacing obvious function names like NVL with COALESCE, etc. Eventually I may try to fix more by parsing the hive text with ANTLR or something similar to make as many hive views run by default as possible. But it will never be a complete solution. A complete solution would be very hard as it would require a full understanding and conversion of hive language to presto language (which is probably not even possible given some of their differences).

AWS Not Authorized to Use Launch Template (in Terraform or in Console)

Posted on December 2, 2019 by John Humphreys

This is just a quick note for anyone facing this issue.

A few of us lost about a day debugging what we thought was a terraform issue originally. While we were creating an auto scaling group (ASG), we were getting “Invalid details specified: You are not authorized to use launch template…”.

It turned out that the same error was presented in the AWS console when we tried to create the ASG there.

After some substantial debugging, it turned out that terraform was allowed to create a launch template with an AMI (Amazon Machine Image) that did not exist. We had used the AMI ID from our non-prod account in our prod account, but AMIs must exist in each account with unique IDs – so it wasn’t working.

It took us a while to get to this point in our debugging because, frankly, we were very astounded that the error message was so miss-leading. We spent a very long time trying to figure out everything that could trigger a permissions error on the template itself, not realizing that a missing resource used within the template would make the whole template present that error.

Hive Metastore DB (HMS DB) – Get All Tables/Columns including Partition Keys

Posted on December 2, 2019 by John Humphreys

How Hive Stores Schemas

I recently had to sync the schemas from a hive database to a normal MySQL database for reasons I won’t bother going into. The exercise required me to get all columns from all tables in hive for each DB though, and I found that this was not amazingly straight-forward.

The hive metastore DB is a normal MySQL/etc database with a hive schema in it. The hive schema holds the hive tables though. So, the information schema is irrelevant to hive; to get the hive table details, you have to interrogate the TBLS table, for example. To get columns, you need to interrogate COLUMNS_V2, and to get the databases themselves, you look toward the DBS table.

Missing Columns

Other posts I’ve seen would leave you here – but I found when I joined these three tables together (plus some others you have to route through), I was still missing some columns in certain tables for some reason. It turns out that partition columns are implicit in hive. So, in a file system of hive data (like HDFS), a partition column in a table is literally represented by just having the directory named with the partition value; there are no columns with the value in the data.

What this means is that partition columns don’t show up in these normal tables. You have to look to a separate partition keys table to find them with a separate query.

The Working Query

The query below finds all columns of any kind and sorts them in the order they’ll appear when you select from a table in hive/presto/etc. I hope it helps you!

select db_name, table_name, column_name from (
SELECT d.NAME as db_name, t.TBL_NAME as table_name, c.COLUMN_NAME as column_name, c.INTEGER_IDX as idx
FROM DBS d
JOIN TBLS t on t.DB_ID = d.DB_ID
JOIN SDS s on t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c on c.CD_ID = s.CD_ID
WHERE d.NAME = :dbName
union
SELECT d.NAME as db_name, t.TBL_NAME as table_name, k.PKEY_NAME as column_name, 50000 + k.INTEGER_IDX
FROM DBS d
JOIN TBLS t on t.DB_ID = d.DB_ID
join PARTITION_KEYS k on t.TBL_ID = k.TBL_ID
where d.NAME = :dbName
) x
order by db_name, table_name, idx

Coding Stream of Consciousness

by John Humphreys – Random code from my life.