Use Case
I was working to deploy hive for a new system. I’ve used hive a fair bit but have not personally deployed it myself. So, I went through some online instructions and ended up installing hadoop, configuring it, starting YARN (which I’ve also used in the past), and then installing hive to run against it.
I was intending on running vs AWS s3 and not HDFS, so I realized I didn’t need the DFS. Then I thought harder and realized that it would be nice to run without YARN as well. A colleague pointed out that in his deployment, hive did the work locally and he ran nothing aside from hiveserver2 and the hive metastore. He was running multiple instances of hiveserver2 and the metastore, and they wouldn’t work together for any map-reduce tasks, but as he didn’t really want people using that execution engine anyway, that was just fine (and it is for me too!).
I didn’t realize this was an option… so I googled around and found this link: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode (scroll to the heading Hive, Map-Reduce and Local-Mode).
Local Map Reduce
It turns out that in the property mapred.job.tracker is used to control if hive executes local map reduce or not. It is supposedly defaulted to “local”, meaning if you don’t override it then hive actually will execute in local mode by default.
With varying degrees of hive documentation maintenance, this is a little hard to rely on though. This site says it:
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
But if you look in your hive-default.xml.template file, you will not find this value. Also, if you run the JDBC command “set” against hive, it will list all its configuration values, and you won’t see it there either!
I pulled down the hive source code though, and in the POM file, you can find an XML block which clearly sets a system property mapred.job.tracker to “local” (I’m kind of surprised it was in the POM).
So, the system property is defaulted to “local”. You can’t find any references to this past this point, but it is apparently used when hive is interacting with hadoop; so I suppose hadoop picks up the property and uses it in a way that isn’t obvious here.
So… you’ll run locally by default as long as you don’t add extra configuration to avoid it (which I did initially when following some other tutorial).