Postgres Schema Creation

Historically, I have not worked with Postgres much. So, when I had to start using it, one of my first questions was how to create a schema, and how to use it for my new tables and such.

Creating a schema is exactly what you expect:

create schema myschema;

But using it is not quite what I expected.  Of course, you can do the standard thing when you’re managing your objects and qualify the table name with the schema (schema.table) like this:

create table myschema.mytable (x int);

But what if you just want:

create table mytable (x int);

to go into myschema by default?  To do this in Postgres, you have to add the schema to your search path.  By default, your search path is set to "$user", public (which effectively means the public schema, unless a schema named after your user exists); you can view it like this:

SHOW search_path;
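
On a stock Postgres install, this typically returns something like the following (the "$user" entry means a schema matching your user name is checked first, if one exists):

   search_path
-----------------
 "$user", public
(1 row)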

You can actually set it to one or more schemas.  When you reference a table by its bare name, the query takes it from the first schema in the list that contains a table with that name.  The first schema in the list is also the default one for when you create new objects.  So, if you did this:

SET search_path TO myschema;
create table mytable (x int);

Then your table would in fact be created in the “myschema” schema properly.
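
One caution with the example above: setting the search path to only myschema drops public from it, so bare-name lookups of tables in public will then fail unless you qualify them.  A minimal sketch that keeps both (myschema is still the default for new objects, since it comes first):

SET search_path TO myschema, public;
create table mytable (x int);  -- created as myschema.mytable
select * from mytable;         -- resolves to myschema.mytable first

Note that SET search_path only lasts for the current session; to make it stick for a user or database, you can use “ALTER ROLE youruser SET search_path TO myschema, public;” or the equivalent ALTER DATABASE form.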

Database Star Schemas and Snowflake Schemas

Schema Confusion

A lot of people work with databases very regularly (even high-end ones), but get thrown by terms like star schema, snowflake schema, etc., due to a lack of formal training in, or experience with, data warehousing technologies.

These same people will often be perfectly comfortable with indexing, query optimization, foreign keys, concepts of de-normalization and normal forms, etc.

I personally started working with the actual “Snowflake” database recently https://www.snowflake.com/about/ and had to review what a snowflake schema was when I started looking at it.

Useful Articles

I found an interesting article on Star schemas vs Snowflake schemas pretty quickly, and backtracked it to the precursor articles digging into the Star and Snowflake schemas respectively.  I’m just going to paraphrase them below to give people a quick overview and/or refresher.

Star Schema

A star schema just means that your main (“fact”) table has a primary key made up of multiple columns, each of which is a foreign key to a “dimension” table.  Then you have one or more “fact” columns (the measures) in addition to the primary key.

The dimension tables hold all the relevant attributes you may want to aggregate and/or query the main table on.  For example, you might have a table for the date which breaks out the year, month, day, and day-of-week so they can be directly used.  You may then have another dimension table for the geographical region with columns for the continent, country, and city, for example, so you can aggregate on those.

Each dimension table is NOT normalized though.  So, if you have “New York City” as the city for 1 million rows, you are literally repeating that value a million times.  This makes queries easy to write, but it has a penalty in terms of data storage (which can be bad if you’re, say, in the cloud and paying more for more storage over time).
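
To make that concrete, here is a minimal star-schema sketch in SQL; all table and column names (dim_date, dim_region, fact_sales, and so on) are my own illustrations, not from any particular product:

-- Dimension tables: de-normalized, all attributes flattened into one table each.
create table dim_date (
    date_key    int primary key,
    year        int,
    month       int,
    day         int,
    day_of_week varchar(9)
);

create table dim_region (
    region_key int primary key,
    continent  varchar(32),
    country    varchar(64),
    city       varchar(64)  -- "New York City" repeats on every region row in that city
);

-- Fact table: a composite primary key of foreign keys, plus the fact column(s).
create table fact_sales (
    date_key     int references dim_date (date_key),
    region_key   int references dim_region (region_key),
    sales_amount decimal(12,2),  -- the "fact"
    primary key (date_key, region_key)
);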

Snowflake Schema

Plain and simple: a snowflake schema is a star schema where the dimension tables are normalized.  This means that, for example, the geographical region dimension table itself would actually be turned into 4 tables (kind of its own little star schema).  You would have one table for the continent, one for the country, one for the city, and one main table with the combination of the three as a primary key.

This makes queries more complex and possibly a little slower, but it means we have complete normalization and are not wasting any data storage.  Also, if, say, a city changed its name, we would have exactly one database cell to update, whereas in a star schema we would have to update potentially millions of rows with copies of that name.
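
Continuing the illustrative sketch from the star-schema section (again, all names are my own), the region dimension normalized out would look something like this:

create table dim_continent (
    continent_key  int primary key,
    continent_name varchar(32)
);

create table dim_country (
    country_key   int primary key,
    continent_key int references dim_continent (continent_key),
    country_name  varchar(64)
);

create table dim_city (
    city_key    int primary key,
    country_key int references dim_country (country_key),
    city_name   varchar(64)  -- "New York City" is now stored exactly once
);

-- The main region table ties the three together, as described above.
create table dim_region (
    continent_key int references dim_continent (continent_key),
    country_key   int references dim_country (country_key),
    city_key      int references dim_city (city_key),
    primary key (continent_key, country_key, city_key)
);

With this layout, a city rename is a single update to dim_city.city_name.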

Why the Names?

If you think of a “Star Schema”, picture a main table with, say, 5 extra dimension tables around it like the 5 points of a star.  Makes sense, right?

Now, for a snowflake, picture each point being 5 tables by itself… so each point is its own star.  This starts to branch out like a snowflake.  Just think of fractals if you don’t believe me :).


First Steps in Microsoft Azure

As part of a new opportunity, I am lucky enough to have to take a deep dive into the Microsoft Azure cloud offering.

I’m really just getting started, but I found an awesome educational tool created by Microsoft itself here -> https://docs.microsoft.com/en-us/learn/.

It is basically an automated tutorial which includes embedded videos and an embedded command line interface.  So, you just read through, follow the instructions, and spin up and manipulate virtual machines (Windows or Linux) in a matter of minutes.

I have only done the first level so far, but it was an amazingly pleasant experience.  So, if you’re learning Azure like me, I suggest starting there!  There appear to be many more modules to go, and I’m expecting great things after the first one :).

VectorWise DB – Actian – Set Cores-Per-Query

The Basics

Actian Ingres Vector(wise) is very adept at parallelizing the processing of queries over multiple processor cores. You can choose how many cores to use per query in the server configuration and it can be overridden on a per-query basis using a with-clause.

How do you determine the server default for cores-per-query?

If you go to your local Vectorwise distribution and change directory to “$II_SYSTEM/ingres/data/vectorwise/”, you will see a vectorwise.conf file. Keep in mind that Actian is changing the name of Vectorwise to Vector, so I cannot promise that the file name will be exactly the same in the future. If you open this file, you will find the parameter “max_parallelism_level” under the engine section. The value of this setting is the maximum number of cores that your queries will use by default when executing (they may use fewer). I believe you most likely need to restart the server (ingstop/ingstart) in order for changes to this value to take effect.
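
For reference, the relevant portion of vectorwise.conf would look roughly like this (a sketch based only on the section and parameter names above; the surrounding entries on your install will differ):

[engine]
max_parallelism_level=8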

Setting parallelism on a per-query basis

It is possible to override the parallelism of a specific query via the SQL command used to execute it (though you have to decide for yourself whether this is a good idea or not). For example, you could do:

select a.* from test_table1 a join test_table2 b on a.id = b.id with max_parallel 16;

to make your query use 16 cores instead of the configured amount.

Interestingly enough, if I do this on my server which is configured to use a maximum of 8 cores, and I set the query parallelism to 8, I get a 4x performance improvement. The server only appears to be using 2 of the 8 allowed cores when it executes by default. Telling the query to explicitly use 8 thus yields a large performance improvement. Even more interestingly, if I bump up the 8 to 16 for the query, I get some more performance improvement (around 20%) which seems to indicate that the explicit query setting can use more than what is defined as the maximum in the vectorwise.conf file.

VectorWise DB (Actian) PDT Issues

Ingres Vector(wise) uses a PDT (positional delta tree) to store in-memory updates to tables which have not yet been flushed to disk. When you execute incremental updates to tables using standard SQL operations instead of the Actian-recommended combine operation, errors sometimes occur due to data in the PDT being handled improperly. These errors are just due to bugs in VW which Actian tends to address quickly, but knowing how to detect them and narrow your problems down to them is a useful skill when working with this product. It allows you to create a useful bug report and move on to other problems so you don’t waste too much of your own time.

When does this matter?

Usually these problems manifest themselves when you are moving data (inserts, updates, or deletions) from one table with pending in-memory updates to another table. So, for example, if you stage your incremental updates in table X and go to merge them into table Y, you may find that table X’s in-memory changes conflict with the data you are currently working with.

What are the visible effects?

PDT issues can cause a number of problems, as you might imagine. You may be executing an update, and it may tell you that the update is ambiguous because you have a record in your current data set and a record in the PDT which both match the given key. Again, this should not happen, but it does sometimes due to various bugs. Similarly, you may be told that a duplicate TID exists while doing a deletion if a similar deletion has occurred in the past on the same table and is still in the PDT. Finally, you may also be told that a duplicate insertion is occurring for the same reasons. This last one can result in a primary key violation, or in you literally ending up with two records in your target table when only one was in your current data set (the PDT version and the version in the staging table both get inserted).

How do I tell if PDTs currently exist?
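
To make the example below reproducible, a minimal setup might look like this (the exact column definitions are my own assumption; only the table names and the two inserts into test_one are reflected in the output below):

create table test_one (id int, val varchar(32));
create table test_two (id int, val varchar(32));
insert into test_one values (1, 'a');
insert into test_one values (2, 'b');
commit;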

With those two test tables created and two records inserted into table ‘test_one’, the command “vwinfo -M <dbname>” (substitute your database name) displays information as follows:

 +--------------------------------+--------------------------------+--------------------+--------------------+--------------------+--------------------+
 |schema_name                     |table_name                      |pdt_mem             |pdt_inserts_count   |pdt_deletes_count   |pdt_modifies_count  |
 +--------------------------------+--------------------------------+--------------------+--------------------+--------------------+--------------------+
 |$ingres                         |iiprimitives                    |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_expr                   |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_expr_global            |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_io                     |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_io_global              |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_op                     |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_op_global              |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_parse_tree             |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_parse_tree_global      |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_query                  |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_query_global           |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_stage                  |                   0|                   0|                   0|                   0|
 |$ingres                         |iivwprof_stage_global           |                   0|                   0|                   0|                   0|
 |ingres                          |test_one                        |                2542|                   2|                   0|                   0|
 |ingres                          |test_two                        |                   0|                   0|                   0|                   0|
 +--------------------------------+--------------------------------+--------------------+--------------------+--------------------+--------------------+

Most of this is arbitrary output for system tables, but the bottom two lines show the tables created for this sample. We can see both inserts accounted for, as well as the overall memory used for the PDTs of each table. Table test_two uses none since it hasn’t been altered yet. If we had performed deletions or updates over multiple transactions, they would be represented here as well. If operations cancel out within the same transaction (inserting and deleting the same record, for example), those operations will not show up in the PDTs when the transaction is committed.

How can I flush the PDTs?

Since PDTs can cause problems, the best thing you can do is flush them to disk if you know there is an issue, after which all values in this “vwinfo” output will be 0. Two operations can flush PDTs; the first is ‘combine’ and the second is ‘truncate’.

The combine command, “call vectorwise(combine 'tablename');”, can be executed from any SQL prompt, and it will flush in-memory updates to disk. You can also use it to do table arithmetic like table_1 - table_2 + table_2 to delete all of the values of table_2 from table_1 and then reinsert them. Ingres actually recommends that you use combine to do incremental updates in the first place rather than using straight SQL.

The second command, truncate, is of the form “modify tablename to truncated;” and can also be executed from any SQL prompt. You should not use the truncate command on your data tables, as it will wipe them out. You can use it on your staging tables after every transaction to ensure their PDTs are totally removed though, assuming you’re okay with the staging tables having no data between transactions.
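
Applied to the staging table from the earlier sample, the two options would look like this:

-- flush test_one's in-memory updates (PDTs) to disk, keeping its data:
call vectorwise(combine 'test_one');

-- or wipe test_one and its PDTs entirely (all rows are lost):
modify test_one to truncated;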

A new article will be written to talk about the usage of combine in more detail, so please look for that if you are interested in learning more.