
#datavault 2.0 and @MapR – install and setup

I am working with the MapR+Drill download in a VMware VM. I've begun taking the examples from our new book, Building a Scalable Data Warehouse with Data Vault 2.0, and putting them on Hadoop Hive and HBase (MapR edition). I will post additional blogs as I work my way through, but here are some preliminary setup thoughts…

I started with the plain-vanilla install and ran through the command-line examples that MapR offers. I was very pleased: every example ran out of the box (including joins across HBase and Hive using Drill). Performance was awesome. So I moved on to my next step: loading data from the CSV files we use in the book.
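
For reference, a cross-store join through Drill looks something like the sketch below. The table, column-family, and column names are purely illustrative (not MapR's sample data), and note that HBase values come back as VARBINARY, so they need CONVERT_FROM:

    -- Hypothetical Drill join across a Hive table and an HBase table.
    -- All names here are made up for illustration.
    SELECT o.order_id,
           CONVERT_FROM(c.`cf1`.`name`, 'UTF8') AS customer_name
    FROM hive.orders o
    JOIN hbase.customers c
      ON o.cust_id = CONVERT_FROM(c.row_key, 'UTF8');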

I took the carrier file (38+ columns), wrote a Drill bulk-load command, and started it up. It complained that the target table wasn't created. At that point I thought perhaps I should use the standard Hadoop file copy to get it in, and that worked – easy peasy. But now neither HBase nor Hive could see it; I needed to "add" the schema / SerDe to the definitions.
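
For anyone following along, the standard Hadoop file copy is the usual shell one-liner; the target path below is a hypothetical stand-in:

    # Copy the carrier CSV into the cluster file system (path is illustrative)
    hadoop fs -put carrier.csv /user/mapr/carrier/carrier.csv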

I went to Hive and created a table with all 38 columns defined as STRING, then made it an EXTERNAL table (meaning a link directly to the file), and bam – I was in business. But then I started thinking: this seems like a lot of coding work – and yes, it is! Why? Because in order to run any type of HiveQL or SQL through Drill, you *must* define a schema or SerDe.
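
A minimal sketch of that DDL, assuming a comma-delimited file; the column names and location are illustrative stand-ins, not the book's actual 38-column definition:

    -- Hypothetical external table over the carrier CSV; every column a STRING.
    CREATE EXTERNAL TABLE carrier (
      carrier_code STRING,
      carrier_name STRING
      -- ...remaining columns, all declared STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/mapr/carrier';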

Wait, are you telling me that "Hadoop is not all schema-on-read"? No, that's not what I'm saying. I'm saying: if you WANT to USE the data in a SQL context, then you must define a schema to apply to the data set – otherwise Drill can't get it out. Makes sense to me. So I started looking for the equivalent "GUI" (the way Microsoft SQL Server has Enterprise Manager). I thought there *must* be something available, and that's when I found HUE…

I've spent literally 12 hours straight trying to add HUE to the MapR + Drill VM, to no avail. I've run the YUM install commands (sketched after the list below) and gotten many more services running:

  • mapr-httpfs
  • mapr-oozie
  • mapr-jobtracker
  • mapr-hue
  • mapr-pig
  • mapr-mahout
  • mapr-impala
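
For reference, the installs were plain YUM one-liners along these lines (exact package names depend on the MapR repository you have configured):

    # Add the extra MapR services on top of the base VM
    yum install mapr-httpfs mapr-oozie mapr-jobtracker mapr-hue \
                mapr-pig mapr-mahout mapr-impala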

I must be missing something. I've gotten the HUE web server up and running and can connect from the web browser. But now HUE tells me that the JobTracker is not available. I log in as root and try to start the JobTracker services – no go. The MapR instance just complains that the JobTracker is not configured for this instance (single instance, self-contained).

So I searched the web for ANY information on how to fix this error. Nothing, nada, zilch… Not even the MapR documentation talks about "specific JobTracker configuration," nor can I find the proper documentation on Apache's site to get this fixed up. Well, I have to say: after 12 hours of build / rebuild, re-install, cleaning, updating, etc., I'm done with Hue (at least for now). If anyone has hints, thoughts, or ideas to help, I'm all ears!

I changed direction and downloaded Talend Big Data Studio – now I should be able to use its GUI to load the data directly through a visual interface. I will hook it up and let you know how it goes. Note: I am in partnership discussions with Talend as we speak.

See you in the next entry, when I'll let you know how all this went. Do you have questions? Thoughts?

Thanks,  Dan

PS: I will be using a Hortonworks VMware instance next.


3 Responses to “#datavault 2.0 and @MapR – install and setup”

  1. Neeraja Rentachintala 2015/09/21 at 6:37 pm #

    Dan
    Good information. A few thoughts/questions here.

    1. You mentioned: "I took the carrier file (38+ columns), wrote a Drill bulk-load command, and started it up. It complained that the target table wasn't created." Were you using Drill's "Create Table As" syntax to do that? A sample can be found in the sandbox tutorial (http://drill.apache.org/docs/about-the-mapr-sandbox/), and I would be curious to know the error if it is failing for you.
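
    For reference, CTAS against a CSV file looks roughly like this; the workspace, path, and column aliases are placeholders (dfs.tmp is the usual writable workspace in the sandbox, and with no header row a CSV comes back as a 0-indexed columns array):

    -- Sketch of Drill CTAS over a CSV file; names and path are placeholders.
    USE dfs.tmp;
    CREATE TABLE carrier_ctas AS
    SELECT columns[0] AS carrier_code, columns[1] AS carrier_name
    FROM dfs.`/path/to/carrier.csv`;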

    2. Drill currently is not integrated with Hue. The Drill ODBC driver, however, comes with a quick GUI called "Drill Explorer" which lets you navigate the data in Hadoop (files on the file system, Hive tables, HBase tables), see what metadata exists, and create views, which you can then take to BI/ETL tools like Tableau, MicroStrategy, Talend, etc. You can download the ODBC driver from https://drill.apache.org/docs/configuring-odbc/. There is a video (https://www.youtube.com/watch?t=128&v=qq9agjxY_yk) which shows Drill Explorer in action.

    3. If you are looking to use MapR with Hue, you can actually download a sandbox that comes pre-configured (https://www.mapr.com/products/mapr-sandbox-hadoop/download). If you want to use Drill, you can just deploy Drill on this sandbox easily.

  2. Dan Linstedt 2015/09/22 at 6:03 am #

    Hi Neeraja,

    Thank you so much for filling me in on the details. I want to represent MapR in the right light here.

    1. The problem I was having was because I was trying to load a Hive internal table – I hadn't defined the schema BEFORE loading. Once I defined the table in Hive (CREATE TABLE with all strings) and used an EXTERNAL clause (instead of an internal one), I linked it to the CSV file just fine. No load necessary. I was deliberately trying to "get a file into Hadoop" without using the hadoop fs -put command… trying to push the boundaries of "ease of use". Next, I will attempt something similar with HBase. The error was operator error, as in: "can't load to a target table that doesn't exist."

    2. This is very interesting. I had assumed (now I see, incorrectly) that the Drill VM download would let me install / overlay Hue on top and have it work seamlessly. Aren't these components supposed to be plug & play? Honestly, I don't understand why I should start with a Hue install and THEN add Drill. I understand that Hue does not work with Drill – that wasn't the point of the exercise. What I ended up needing was Hue to LOAD the data, run the jobs, write the views, etc., and then DRILL to execute queries. I was attempting to avoid all potential external tools (Informatica, Talend, Pentaho, etc.). You should know: this error I experienced when adding Hue to MapR AFTER install – Hue not "finding" or talking to the JobTracker – is ALL OVER THE WEB… The sad part is, no one has any answers as to exactly HOW to make it work. And (while your intentions are good, and I thank you for the information) simply removing the entire install and re-installing with the Hue VM first isn't an answer for everyone.

    3. If you can, would you enlighten me as to why installing the MapR sandbox with Hue FIRST and then adding Drill is different from installing the MapR sandbox with Drill and then adding Hue? This is a huge cause for concern.

    Thank you kindly,
    Dan

  3. Sanjay Pande 2015/12/19 at 10:20 am #

    Hi Dan,
    To answer your questions:

    1. If you want to load it as a Hive external table – which is essentially just a metadata definition – then no issues. Drill can also query the file without any Hive metadata, but it then behaves a lot like the *nix awk command: you address fields through a 0-indexed columns array instead of column names. As in

    SELECT * FROM dfs.`file_name with complete path`

    or

    SELECT columns[0], columns[1], columns[2] … FROM same as above

    or

    SELECT columns[0] AS Col1, columns[1] AS Col2 … FROM same as above

    You can use it in embedded mode, both against the local file system and against your Hadoop DFS, as long as you can reach the latter via your local file system.

    I still haven't figured out how to get column names from the first line when using the command line. It's probably a simple switch, since most connectors can do it.
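
    It may be the extractHeader option that newer Drill builds (1.3+) expose on the text format in the dfs storage plugin – something like the fragment below, though I haven't verified it on this VM:

    "formats": {
      "csvh": {
        "type": "text",
        "extractHeader": true,
        "delimiter": ","
      }
    }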

    2. Ensure the proxy users are configured for Hue in the Hadoop configuration files. These are steps to take after installing Hue, since Hue doesn't know anything about your Hadoop install. Also, to make life easy, create an admin user with the same name as the user that starts the DFS on the VM, and you'll find Hue can find everything easily.
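
    The usual entries look like the following (using hue as the proxy user is the common convention; verify the property names against your own core-site.xml):

    <!-- Allow the hue service user to impersonate end users;
         tighten the wildcards for production use -->
    <property>
      <name>hadoop.proxyuser.hue.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hue.groups</name>
      <value>*</value>
    </property>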

    3. Hue is a PITA to install and configure, and in my opinion it's way overrated for what it does. It's nice to have a web interface, but for SQL use cases like ours you'll find it's just a lot easier to use familiar client tools over connectors like the JDBC driver – they're really easy to connect and use. I think she suggested the VM with Hue pre-installed because Hue is not easy to install and configure, whereas Drill is quite simple.
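
    For example, Drill ships with the sqlline shell; from the Drill install directory it connects to an embedded Drillbit with a one-liner:

    # Start the bundled sqlline client against an embedded Drillbit
    bin/sqlline -u jdbc:drill:zk=local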

    Hope this helps

    Sanjay
