I am working with the MapR + Drill download in a VMware VM. I’ve begun taking the examples from our new book, “Building a Scalable Data Warehouse with Data Vault 2.0”, and putting them on Hadoop Hive and HBase (MapR edition). I will post additional blogs as I work my way through, but here are some preliminary setup thoughts…
I started with the plain vanilla install, and ran through the command line examples that MapR offers. I was very pleased, being able to run all of the examples out of the box (including joins across HBase and Hive using Drill). Performance was awesome. So I moved on to my next step: loading data from the CSV files we have in the book.
I took the carrier file (38+ columns), wrote a Drill bulk-load command, and started it up. It complained that the target table hadn’t been created. At this point I thought perhaps I should use the standard Hadoop file copy to get the file in. That worked, easy peasy. But now neither HBase nor Hive could see it. I needed to “add” the schema / SerDe to the definitions.
I went to Hive, created a table with all 38 columns defined as string, made it an EXTERNAL table (meaning a link directly to the file), and bam: I was in business. But then I started thinking: this seems to be a lot of coding work. And yes, it is! Why? Because in order to run any type of HiveQL or SQL through Drill, you *must* define a schema or SerDe.
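To give a feel for how that typing could be cut down, here is a minimal sketch in Python that reads the header row of a CSV file and prints an all-string external table definition. It is not what I actually ran: the function name, the table name, the HDFS directory and the two sample column names are all made up for illustration, and it assumes a comma-delimited file with a single header line and no duplicate column names.

```python
import csv
import io
import re


def hive_external_table_ddl(csv_text, table_name, hdfs_dir):
    """Build a CREATE EXTERNAL TABLE statement with every column typed STRING.

    Hypothetical helper: column names are taken from the first CSV line.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = []
    for position, raw_name in enumerate(header):
        # Turn anything that is not a letter, digit or underscore into "_".
        name = re.sub(r"\W+", "_", raw_name.strip()).strip("_").lower()
        columns.append(name or f"col_{position}")
    column_lines = ",\n".join(f"  `{name}` STRING" for name in columns)
    return (
        f"CREATE EXTERNAL TABLE {table_name} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'\n"
        # Tell Hive to skip the header line when reading the file.
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )


# Made-up sample: two columns instead of the 38 in the real carrier file.
sample = "Carrier Code,Carrier Name\nXX,Example Air\n"
ddl = hive_external_table_ddl(sample, "carriers", "/user/hypothetical/carriers")
print(ddl)
```

The printed statement can be pasted into the Hive shell. Everything stays a string on purpose, so any real data typing would still have to happen later.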
Wait, are you telling me that “Hadoop is not all schema-on-read”? No, that’s not what I’m saying. I’m saying: if you WANT to USE the data in a SQL context, then you must define a schema to apply to the data set; otherwise Drill can’t get it out. Makes sense to me. So I started looking for the equivalent “GUI” (like the Enterprise Manager that Microsoft SQL Server has). I thought, there *must* be something available, and that’s when I found HUE…
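The idea is easy to show outside of Hadoop. The sketch below is plain Python, not Drill or Hive, and the rows, column names and types are invented for the example. The raw text carries no names or types at all; the schema is a separate definition that gets applied only at the moment the data is read, and only then can a SQL-style question (here, a sum) be answered.

```python
import csv
import io

# Made-up raw rows: no header, no types, just delimited text.
RAW = "XX,Example Air,12\nYY,Sample Jet,7\n"

# The schema lives outside the data: a column name plus a converter for each.
SCHEMA = [("carrier_code", str), ("carrier_name", str), ("flights", int)]


def read_with_schema(raw_text, schema):
    """Apply the schema to each raw line at read time."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW, SCHEMA))

# Roughly "SELECT SUM(flights)": only possible once names and types exist.
total_flights = sum(row["flights"] for row in rows)
print(total_flights)
```

The stored text never changes; swap in a different schema and the same bytes read differently.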
I’ve spent literally 12 hours straight trying to add HUE to the MapR + Drill VM, to no avail. I’ve run the yum install commands and gotten many more services running.
I must be missing something. I’ve gotten the HUE web server up and running, and I can connect to it from the web browser. But now HUE tells me that the JobTracker is not available. I log in as root and try to start the JobTracker services: no go. The MapR instance just complains that the JobTracker is not configured for this instance (single instance, self-contained).
So I searched the web for ANY information on how to fix this error. Nothing, nada, zilch… not even the MapR documentation talks about “specific JobTracker configuration”, nor can I find the proper documentation on Apache’s site to get this fixed up. Well, I have to say, after 12 hours of build / rebuild, re-install, cleaning, updating, etc… I’m done with HUE (at least for now). If anyone has some hints, thoughts or ideas to help, I’m all ears!
I changed direction and downloaded Talend Big Data Studio. Now I should be able to use their GUI to load the data directly through a visual interface. I will hook it up and let you know how it goes. Note: I am in partnership discussions with Talend as we speak.
See you in the next entry, when I’ll let you know how all this went. Do you have questions? Thoughts?
PS: I will be using the Hortonworks VMware instance next.