Java Regex Capture/Extract Multiple Values

Use Case

When you're parsing complex log lines or extracting data from messy strings, regular-expression capture groups are one of the most useful tools available.

This example is taken from work, where I had to parse and analyze logs from a job that loaded data into a database. A log sample looks like this:

/data/SXF_SX_4906_2019-04-13.01.43.24.143.log:2019-04-13 01:43:28,320 INFO com.x.dc.db.schemagen.batch.listener.JobResultListener [tx.id=IF-TX-ID-a23c195c-673a-47ab-ab0c-7b8591821169] [main] Inside sendEmailNotification method: subject is prod alert:DB copy job STARTED for the dataset:4906

The Code

The relevant part of the code is here:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final String capturePattern =
    "^/.*/SXF_SX_(\\d+)_(\\d{4}-\\d{2}-\\d{2}.\\d{2}.\\d{2}.\\d{2}.\\d{3})\\.log:(.*) INFO.*" +
    "copy job (.*) for the dataset:.*";

//Leaving out the rest of the class; this is just the regex-parsing portion.
//isValid, fullLogEntry, dataSetId, fileTimestamp, logTimestamp, and status are
//member variables of the class this constructor belongs to.
public DbLoadLog(String line) {

    isValid = true;

    Pattern r = Pattern.compile(capturePattern);
    Matcher m = r.matcher(line);

    //If you wanted to run over a multi-line-string/file, you could put
    //m.find() in a while loop and keep going; but I'm just analyzing specific lines.
    if (m.find()) {
        fullLogEntry = line;
        dataSetId = Integer.valueOf(m.group(1));
        fileTimestamp = m.group(2);
        logTimestamp = m.group(3);
        status = m.group(4);
    }
    else {
        isValid = false;
    }
}
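To make this concrete, here is a minimal, self-contained sketch of the same capture running against the sample line from above. The class name is mine, and I've compiled the pattern once as a constant (Pattern instances are thread-safe, and compiling is relatively expensive):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DbLoadLogDemo {
    // Same pattern as above, compiled once up front.
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^/.*/SXF_SX_(\\d+)_(\\d{4}-\\d{2}-\\d{2}.\\d{2}.\\d{2}.\\d{2}.\\d{3})\\.log:(.*) INFO.*" +
        "copy job (.*) for the dataset:.*");

    public static void main(String[] args) {
        String line = "/data/SXF_SX_4906_2019-04-13.01.43.24.143.log:"
            + "2019-04-13 01:43:28,320 INFO com.x.dc.db.schemagen.batch.listener.JobResultListener "
            + "[tx.id=IF-TX-ID-a23c195c-673a-47ab-ab0c-7b8591821169] [main] "
            + "Inside sendEmailNotification method: subject is prod alert:"
            + "DB copy job STARTED for the dataset:4906";

        Matcher m = LOG_PATTERN.matcher(line);
        if (m.find()) {
            System.out.println("dataSetId     = " + m.group(1)); // 4906
            System.out.println("fileTimestamp = " + m.group(2)); // 2019-04-13.01.43.24.143
            System.out.println("logTimestamp  = " + m.group(3)); // 2019-04-13 01:43:28,320
            System.out.println("status        = " + m.group(4)); // STARTED
        }
    }
}
```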

 

Java – Regular/Scheduled Task, One Run at a Time

This will be a very short post, and I’m mostly writing it just so it sticks in my head better.

Common Use Cases

There are many times when you may find that you need to regularly run a task in Java.  Here are a few common examples:

  • You have a cache you need to refresh every X minutes to power a dashboard or something similar.
  • You need to prune old files from a file system once an hour.
  • You need to regularly update stats counters for monitoring.

Coding Options

There are a lot of ways to do this, but the recommended approach is to use a scheduled executor.  Now… this part is easy to remember, but what is sometimes hard to remember is that you have two options when scheduling a task.  I often find myself picking the wrong one as it pops up in IntelliSense and I forget there are two options.

  1. Run the task every X seconds/minutes/etc no matter what.
  2. Run the task every X seconds/minutes/etc *after* the previous task completed.

These two things can be very different.  If you have a task that only takes a couple of seconds, it probably doesn't matter much.  But if the task takes 2 minutes and you schedule it every 1 minute, then with option 1 each new run starts as soon as the previous one finishes (a scheduled executor never runs overlapping executions of the same task; late runs just queue up), so the task runs back-to-back with no gap.  With option 2 you always get a full minute of buffer between the end of one run and the start of the next.

For both options, you can create the scheduled executor service the same way:

ScheduledExecutorService se = Executors.newSingleThreadScheduledExecutor();

But for option #1 (run every interval regardless of previous tasks), you would use this function:

se.scheduleAtFixedRate(this::refreshCache, 10, 120, TimeUnit.SECONDS);

And for option #2 (start counting after previous task completes), you would use this function.

se.scheduleWithFixedDelay(this::refreshCache, 10, 120, TimeUnit.SECONDS);
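To see the difference end to end, here is a small self-contained sketch (the class and method names are mine, not from any real code): a task that takes about 100 ms, scheduled every 50 ms at a fixed rate. The runs end up back-to-back; with scheduleWithFixedDelay there would instead be a 50 ms gap after each run.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SchedulerDemo {

    // Schedules a ~100 ms task every 50 ms at a fixed rate, waits windowMs,
    // then shuts down and returns how many runs started. Even at a fixed
    // rate, a ScheduledExecutorService never overlaps runs of the same task;
    // late runs simply start as soon as the previous one finishes.
    static int runFixedRateDemo(long windowMs) throws InterruptedException {
        ScheduledExecutorService se = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        se.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            try {
                Thread.sleep(100); // the "slow" task body
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(windowMs);
        se.shutdownNow();
        se.awaitTermination(1, TimeUnit.SECONDS);
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Swapping in scheduleWithFixedDelay would give noticeably fewer
        // runs in the same window (100 ms of work + 50 ms of delay each).
        System.out.println("fixed-rate runs in ~550 ms: " + runFixedRateDemo(550));
    }
}
```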

 

Does Spring JdbcTemplate Close Connections? … Not Always.

Common Advice – Correct?

Decent developers usually know that they have to try/catch/finally to ensure they clean up connections, file handles, and any number of other things.  But then, for Java, you hear “just use JdbcTemplate! It does all this boilerplate for you!”.

Uncommon Scenario

Normally, when you’re writing an average app, you want lots of queries to be able to run in parallel, efficiently, using the same user and password.  In this case, you can easily just use a connection pool and “not worry about it”.  Spring’s JdbcTemplate will grab connections from your data source, and the pool manages them appropriately.  You don’t have to worry about whether they are opened, closed, or whatever.

I ran into a scenario today where that was not true, though.  I have an app where each user connects to each back-end data source using their own personal account, which is managed by the application itself.  So, each user needs his or her own connection, and pooling would not make much sense unless each user had to do parallel operations (which they don’t).

What Happens to the Connections?

So, here’s the fun part.  I had, for the longest time, assumed that JdbcTemplate would clean up connections in addition to result sets.  In fact, you’ll see this claimed online a lot.  But be careful!  This does not appear to be the case, or if it is, it is at least data-source dependent… and that actually makes sense if you think about their purpose.

Here is how I verified this. I created a JdbcTemplate based on a new data source each time (which is needed, since the user and password change):

private NamedParameterJdbcTemplate getJdbcTemplate(String email, String password) {
    SimpleDriverDataSource ds = new SimpleDriverDataSource();
    ds.setDriverClass(HiveDriver.class);
    ds.setUrl(url);
    ds.setUsername(email);
    ds.setPassword(password);
    return new NamedParameterJdbcTemplate(ds);
}

Then I used the template for a number of queries in a normal manner (like this):

getJdbcTemplate(email, password)
    .queryForList("describe extended `mytable`.`mytable`",
        new MapSqlParameterSource());

Then I took a heap dump of the process with this command (run it from the JDK’s bin folder; on Windows that’s under Program Files, on Linux wherever the JDK is installed, with minor changes to the command):

jmap.exe -F -dump:format=b,file=C:\temp\dump.bin your-pid

You can get the PID easily by looking at your running process from JVisualVM (which is also in the bin directory).

Once the dump is complete, load the file into JVisualVM (you need to pick the third file-type option in the open dialog for the dump file to show up).

Finally, go to the classes tab, go to the very bottom of the screen, and search for the class of interest (in my case, HiveConnection). I could see as many instances as I had run queries, since each query made a new connection from a new data source. They are definitely not being cleaned up.

This surprised me because, even though creating a new template/data source each time is not normal, I expected the connections to be cleaned up when they were garbage collected or as part of normal operations.  After thinking about it more, I realize operations in my case would not be “normal”, but the lack of cleanup once everything is out of scope is still definitely a surprise to me.
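Since the template wasn't doing the cleanup for me here, the close has to be explicit. Here is a minimal runnable sketch of the try-with-resources pattern that guarantees it; the `UserConnection` class is a hypothetical stand-in for the real java.sql.Connection so the example runs without a database:

```java
import java.util.ArrayList;
import java.util.List;

public class PerUserQueryDemo {
    // Hypothetical stand-in for the per-user connection (the real code
    // would hold a java.sql.Connection from the per-user data source).
    static class UserConnection implements AutoCloseable {
        static final List<String> events = new ArrayList<>();
        UserConnection(String email) { events.add("open:" + email); }
        String query(String sql) { return "rows for: " + sql; }
        @Override public void close() { events.add("close"); }
    }

    // try-with-resources closes the connection even if query() throws,
    // which is exactly the cleanup the template was not doing for me.
    static String describeTable(String email, String sql) {
        try (UserConnection con = new UserConnection(email)) {
            return con.query(sql);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeTable("user@example.com",
            "describe extended `mytable`.`mytable`"));
        System.out.println(UserConnection.events); // [open:user@example.com, close]
    }
}
```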

 

Extending an LVM Volume (e.g. /opt) in CentOS 7

What Does LVM Mean?

Taking a description from The Geek Diary:

The Logical Volume Manager (LVM) introduces an extra layer between the physical disks and the file system allowing file systems to:

  • Be resized and moved easily and online without requiring a system-wide outage.
  • Use discontinuous space on disk.
  • Have meaningful names to volumes, rather than the usual cryptic device names.
  • Span multiple physical disks.

Extending an LVM Volume (e.g. /opt):

Run “vgs” to display information on the available volume groups. This will tell you if you have “free” space that you can allocate to one of the existing logical volumes. In our case, we have 30 GB free.

$> vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  rootvg   1   7   0 wz--n- <63.00g <30.00g

Run “lvs” to display the logical volumes on your system and their sizes. Find the one you want to extend.

$> lvs
  LV     VG     Attr       LSize
  homelv rootvg -wi-ao----  1.00g
  optlv  rootvg -wi-ao----  2.00g
  rootlv rootvg -wi-ao----  8.00g
  swaplv rootvg -wi-ao----  2.00g
  tmplv  rootvg -wi-ao----  2.00g
  usrlv  rootvg -wi-ao---- 10.00g
  varlv  rootvg -wi-ao----  8.00g

Extend the logical volume using “lvextend”. In our case, I’m moving /opt from 2g to 5g.

$> lvextend -L 5g rootvg/optlv

Display the logical volumes again if you like; “lvs” will now show 5.00g for optlv. But “df -h” will still report the old size, because the file system itself has not been resized yet.

Use df -hT to show what kind of file system the volume you resized uses. This determines which command you need next.

$> df -hT
Filesystem                Type      ...
/dev/mapper/rootvg-rootlv ext4      ...
devtmpfs                  devtmpfs  ...
tmpfs                     tmpfs     ...
tmpfs                     tmpfs     ...
tmpfs                     tmpfs     ...
/dev/mapper/rootvg-usrlv  ext4      ...
/dev/sda1                 ext4      ...
/dev/mapper/rootvg-optlv  ext4      ...
/dev/mapper/rootvg-tmplv  ext4      ...
/dev/mapper/rootvg-varlv  ext4      ...
/dev/mapper/rootvg-homelv ext4      ...
/dev/sdb1                 ext4      ...
tmpfs                     tmpfs     ...

If it is ext4, you can use the following command to grow the file system into the extended volume. If it is a different file system, you will have to find the appropriate command for it (for XFS, that is xfs_growfs).

$> resize2fs /dev/mapper/rootvg-optlv

Now “df -h” should show the extended size as well, and you’re done!

$> df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-rootlv  7.8G   76M  7.3G   2% /
devtmpfs                   3.9G     0  3.9G   0% /dev
tmpfs                      3.9G  4.0K  3.9G   1% /dev/shm
tmpfs                      3.9G  130M  3.8G   4% /run
tmpfs                      3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/mapper/rootvg-usrlv   9.8G  2.6G  6.7G  29% /usr
/dev/sda1                  976M  119M  790M  14% /boot
/dev/mapper/rootvg-optlv   4.9G  1.9G  2.9G  40% /opt
/dev/mapper/rootvg-tmplv   2.0G   11M  1.8G   1% /tmp
/dev/mapper/rootvg-varlv   7.8G  3.2G  4.2G  44% /var
/dev/mapper/rootvg-homelv  976M   49M  861M   6% /home
/dev/sdb1                   16G   45M   15G   1% /mnt/resource
tmpfs                      797M     0  797M   0% /run/user/1000

Apache HTTPD Proxy Add CORS Headers to Use Remote API

I just suffered quite a lot while trying to use an API someone provided me to hack up a dashboard.  This was because it was missing the headers needed for CORS to work from a web app.

What is CORS?

Suffice it to say that CORS is a way to allow your web app to request resources from a server other than the one it was served from.

While we’re not doing Java here, the Spring documentation explains the same-origin policy, which stops JavaScript from requesting resources from a different origin than the page came from.  James Chamber’s blog explains the CORS concepts in terms of a guest list and makes them very comprehensible.

Make My Web-App Work!

Chances are, if you’re here, you’re currently writing a web-app (maybe Angular or React?), and you are trying to get data from a URL, and it is failing with the word CORS in your dev-tools console.

This is because the “Access-Control-Allow-Origin” header is not being added by the server.  So, your client app is not allowed to use it.  In terms of the blog referred to earlier, you’re not on the guest list.

If you’re in a company, the correct fix for this is to have the API add the header with the proper origins, or a * if you’re not overly worried about security.  But, often you don’t have time for that, or you’re not in control of the API.  In that case, you can use a proxy!

Assuming you are working on Linux and you’ve already installed Apache (HTTPD), it is pretty easy to fix this with a proxy.  Just add the following at the bottom of /etc/httpd/conf/httpd.conf, ensure there is a “Listen 80” in the file, and then restart Apache with “sudo systemctl restart httpd”.

<LocationMatch "/SomePath">
    ProxyPass http://:8080/SomePath
    Header add "Access-Control-Allow-Origin" "*"
</LocationMatch>


This will make sure that any request to this server matching /SomePath is proxied to http://:8080/SomePath and returns to the client with the extra header “Access-Control-Allow-Origin: *”.  That lets your web app talk to URLs under http://:8080/SomePath even though the API doesn’t send CORS headers itself.

Note that you can do / instead of SomePath to target the whole server, and of course, you can change the ports/etc.

I hope this helps you!

Oh, and one more thing… if you installed Apache and are having trouble getting it running in the first place, you may be running SELinux, in which case try “/usr/sbin/setsebool -P httpd_can_network_connect 1”. That nailed me before I went on to work on the rest of this.

Also, if you want to use a different port and it’s not working on CentOS etc., you may need to allow the port in SELinux: “semanage port -a -t http_port_t -p tcp 8222”.

 

Azure VM Unresponsive, Can’t SSH

My VM Was Non-Responsive

Today I had an Azure virtual machine go down very unexpectedly.

I received error reports from users and tried to go to the related service endpoint myself… and sure enough, it didn’t come up.  Then, I tried to ssh onto the VM and I couldn’t.

I hopped into the Azure portal, went to the VM, and things actually looked alright… it wasn’t stopped, or de-allocated, or anything.

Why?

After multiple minutes of digging around the Azure portal for more information, a new entry suddenly appeared in the “Activity Log”.  This was fairly disconcerting, as the issue had been reported over half an hour earlier and I had already been on the portal for several minutes.

The activity log said I had a “health event” which was “updated”.  Upon expanding it, I could see more events that had been “in progress”.  When you click the “in progress” event, you can get JSON for it and look into the details.  In my case, the bottom of the details said this:

    "properties": {
        "title": "We're sorry, your virtual machine isn't available because an unexpected failure on the host server",
        "details": null,
        "currentHealthStatus": "Unavailable",
        "previousHealthStatus": "Unknown",
        "type": "Downtime",
        "cause": "PlatformInitiated"
    }

So, the physical host running my VM in Azure died. Azure automatically noticed this and moved the VM to a new physical host, though much more slowly than I would have liked.

The VM came up after a few more minutes and all was right with the world. So… the moral of the story is that if your VM is unresponsive, it may be because the host died, and you may have to wait quite a while to see information about that in the activity log. But it does auto-resolve, apparently, which is nice.

Azure CLI Get Scale Set Private IP Addresses

Getting Scale Set Private IPs is Hard

I have found that it is impressively difficult to get the private IP addresses of Azure scale set instances in almost every tool.

For example, if you go and create a scale set in Terraform, even Terraform will not provide you the addresses or a way to look them up to act upon them in future steps.  Similarly, you cannot easily list the addresses in Ansible.

You can, however, build dynamic inventories in Ansible from scripts.  So, to make an Ansible playbook dynamically target the nodes in a recently created scale set, I decided to generate a dynamic inventory with the Azure CLI.

Azure CLI Command

Here is an Azure CLI command (version 2.0.58) that directly lists the IP addresses of scale-set nodes.  I hope it helps you as it has helped me.  It took a while to build it out from the docs, but it’s pretty simple now that it’s done.

az vmss nic list --resource-group YourRgName \
--vmss-name YourVmssName \
--query "[].ipConfigurations[].privateIpAddress"

The output will look similar to this, though I’ve changed the IP addresses to fake ones as an example.

[
  "123.123.123.123",
  "123.123.123.124"
]