I am working with the MapR + Drill download in a VMware VM. I’ve begun taking the examples from our new book, Building a Scalable Data Warehouse with Data Vault 2.0, and putting them on Hadoop, Hive, and HBase (MapR edition). I will post additional blogs as I work my way through, but here are some preliminary setup thoughts…
I started with the plain vanilla install and ran through the command-line examples that MapR offers. I was very pleased to be able to run all of the examples out of the box (including joins across HBase and Hive using Drill). Performance was awesome. So I moved on to my next step: loading data from the CSV files we have in the book.
I took the carrier file (38+ columns), wrote a Drill bulk-load command, and started it up. It complained that the target table hadn’t been created. This is the point where I thought perhaps I should use the standard Hadoop file copy to get it in. That worked, easy peasy. But now neither HBase nor Hive could see it. I needed to “add” the schema / SerDe to the definitions.
I went to Hive and created a table with all 38 columns defined as string, then made the table an external table (meaning a link directly to the file), and bam, I was in business. But then I started thinking: this seems to be a lot of coding work. And yes, it is! Why? Because in order to run any type of HiveQL or SQL through Drill, you *must* define a schema or SerDe.
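To take some of the typing out of that step, here is a minimal sketch in Python that builds the all-string external table definition from a CSV header line. Everything specific in it is an assumption for illustration: the function name `hive_external_table_ddl`, the table name `carrier`, the column names `Code` and `Description`, and the location `/user/mapr/carrier` are hypothetical, not the book's actual file layout. It also assumes a plain comma-delimited file with one header row; quoted fields with embedded commas would need a CSV SerDe instead of `ROW FORMAT DELIMITED`.

```python
import csv
import io
import re


def hive_external_table_ddl(table, header_line, location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement with every column typed STRING."""
    # Parse the header with the csv module so quoted column names survive.
    names = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    columns = []
    for i, name in enumerate(names):
        # Reduce each header to lowercase letters, digits, and underscores.
        clean = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
        # Fall back to a positional name if the header cell was empty.
        columns.append(f"  `{clean or f'col_{i}'}` STRING")
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (",
        ",\n".join(columns),
        ")",
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'",
        "STORED AS TEXTFILE",
        # LOCATION is the directory holding the copied file, not the file itself.
        f"LOCATION '{location}'",
        # Assumes the first row of the file is the header.
        "TBLPROPERTIES ('skip.header.line.count'='1');",
    ]
    return "\n".join(lines)


# Hypothetical two-column header; the real carrier file has 38+ columns.
print(hive_external_table_ddl("carrier", "Code,Description", "/user/mapr/carrier"))
```

The printed statement can be pasted into the Hive shell. Because the table is external, dropping it later removes only the definition and leaves the copied file in place.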
Wait, are you telling me that “Hadoop is not all schema-on-read”? No, that’s not what I’m saying. I’m saying: if you want to use the data in a SQL context, then you must define a schema to apply to the data set; otherwise Drill can’t get it out. Makes sense to me. So I started looking for the equivalent “GUI” (like the Enterprise Manager that Microsoft SQL Server has). I thought, there *must* be something available, and that’s when I found Hue…
I’ve spent literally 12 hours straight trying to add Hue to the MapR + Drill VM, to no avail. I’ve run the yum install commands and gotten many more services running.
I must be missing something. I’ve gotten the Hue web server up and running, and can connect from the web browser. Now Hue tells me that the JobTracker is not available. I log in as root and try to start the JobTracker services: no go. The MapR instance just complains that JobTracker is not configured for this instance (single instance, self-contained).
So I searched the web for any information on how to fix this error. Nothing, nada, zilch… Not even the MapR documentation talks about “specific JobTracker configuration”, nor can I find the proper documentation on Apache’s site to get this fixed up. Well, I have to say, after 12 hours of build / rebuild, re-install, cleaning, updating, etc., I’m done with Hue (at least for now). If anyone has some hints, thoughts, or ideas to help, I’m all ears!
I changed direction and downloaded Talend Big Data Studio. Now I should be able to use their GUI to load the data directly through a visual interface. I will hook it up and let you know how it goes. Note: I am in partnership discussions with Talend as we speak.
See you in the next entry, when I let you know how all this went. Do you have questions? Thoughts?
PS: I will be using the Hortonworks VMware instance next.