Hadoop SQL in a Blind Panic! -- Book Code
Most of the code for the book can be found at this link: HadoopSQLinaBlindPanic_Code.txt
Hadoop Administrators:
Tally Ho! My name is Bob Smith and I work for the <insert dept name here> department and, as you may have heard, I've been tasked with moving data off our legacy <insert legacy database name> database to the Hadoop database. I was hoping that you could be my contact for the duration of this conversion.
First, thank you up-front for helping out since this Hadoop shizz is new to me and my team.
Second, you probably won't be surprised that I have about a bazillion questions for you which I've placed below. Your responses will go a long way in helping me and my team move to Hadoop as quickly (and painlessly!) as possible.
Here goes...
• Do you have a Linux edge node server that my team can use? If so, what's the server's host name? My team and I will be automating some processes using Linux scripts, so access to a Linux edge node server will help us out greatly.
• My team and I plan to use PuTTY to connect to the Linux edge node server. I just want to confirm that we must use port 22 (SSH) when setting up a connection to the edge node server. Do you recommend something other than PuTTY?
• On our legacy database, the schema we use is named <insert name of legacy database schema name here>. Can you please set up the same schema name on the Hadoop database?
• Since my team and I will use the edge node server as well as the Hadoop database, can you please set up the following individuals with an account on the Linux edge node server as well as access to the Hadoop database schema requested above? <insert your Team's corporate e-mail addresses here>
Also, the following team members should be given privileged access to run Hadoop commands via hadoop/hdfs from the Linux command line: <insert select team members who should have higher privileges, including yourself, here>
• Not all of my team members are highly technical, but they would still like to run simple queries against the Hadoop database. Do you have the Hadoop database web interface Hue set up and accessible? If so, what's the URL?
• In order to kill runaway SQL queries, can you please list the URLs to the Hadoop query webpages? I believe these URLs generally use port 25000 (/queries), but don't hold me to that...I'm new to these parts.
• Can you recommend a SQL client application (such as Toad Data Point, DBeaver, SQuirreL, etc.) for use with Hadoop? What do you use?
• Do you have Hive and Impala ODBC (32-bit/64-bit) and JDBC drivers available on the corporate network? If so, I'd like to access them so that I may set up my team's SQL client software (among other things). If not, can you recommend where I may download these drivers?
• Speaking of ODBC and JDBC drivers, can you please provide example connection information/strings for both ODBC and JDBC connections to Hive (port 10000?) as well as Impala (port 21050?)? We'll be using the ODBC connection information with applications such as Microsoft Excel, PowerBI, Tableau, etc. The JDBC connection strings will be used with client software that uses JDBC rather than ODBC, such as DBeaver, SQuirreL, etc. (I've taken a stab at what these might look like at the end of this list, so feel free to just correct whatever I've gotten wrong.)
• Does our corporate network run Kerberos? If so, when creating cron jobs to run automatically, we may need to create a keytab file containing Kerberos-related information. Which encryption types do you suggest we include in the keytab file? arcfour-hmac-md5? aes256-cts? rc4-hmac? Any others? Also, what's our Kerberos Realm and Host FQDN? If not Kerberos, then LDAP?
• We would like the ability to access our legacy database (<insert name of legacy database>) from the Linux edge node server for use with sqoop and other tools. Can you please install the software necessary so that my team and I may access the legacy database from there?
• Is there a generic account available on the Linux edge node server for me and a few of my team members to use? We'd like a single account to execute our production code. If so, can you please forward the username and password? If not, can you please create an account on the Linux edge node server whose password is static? Also, please give this account access to the appropriate schemas as well as hadoop/hdfs privileges.
• Is HPL/SQL available from the Linux edge node server? If not, can you please install it so that my team and I can create and execute procedures on the Linux edge node server against the Hadoop database? Also, where is the file hplsql-site.xml located?
• Is there a directory on the Linux edge node server where we can store the team's production code? If not, can you please create a directory accessible by my team as well as the generic account?
• Can you please create a directory in HDFS specifically for me and my team for use with external tables? Something like hdfs://hdpserver/data/prod/teams/<schema> or whatever your standard is. (There's a sample CREATE EXTERNAL TABLE at the end of this list showing how we'd point at it.)
• I feel completely comfortable downloading and maintaining many of my department's dimension tables, but some of the fact tables are quite large. I'm hoping you can take over the import of those large fact tables and fold it into your own processes. Can we have a conversation about that?
• What are the version numbers for the following?
• Can you please install the Linux utility dos2unix on the Linux edge node server? Since our laptops are Windows-based, we may need to convert files using dos2unix.
• Which Thrift Transport Mode should we be using? SASL? Binary? HTTP?
• Does the Hadoop Database use Secure Sockets Layer (SSL) for connections? When I go to set up an ODBC connection, there's an option asking whether I should enable SSL. Should I?
• My team and I will be using the storage formats TEXTFILE, PARQUET and KUDU almost exclusively. Can you please indicate the SQL CREATE TABLE options required to use the KUDU storage format, if any? Can you recommend the number of partitions we should use with KUDU tables? Do we have to include the table property kudu.master_addresses in our SQL code? If so, can you include an example of this? (My rough guess at the syntax appears at the end of this list.)
• In our legacy <insert name of legacy database> database, we have access to useful metadata such as table names, column names, data types, etc. within the database via ALL_TABLES, ALL_TAB_COLUMNS, INFORMATION_SCHEMA, etc. Can you create a view or views to mimic this from within the Hadoop database accessible from our new database schema? If not, can you give us read-only access to the underlying MetaStore database's metadata tables/views?
• Does the version of ImpalaSQL installed on the Hadoop database include the extensions to GROUP BY such as CUBE, ROLLUP, GROUPING SETS, etc.? (The end of this list shows the sort of queries I have in mind.)
• Is Apache Spark installed on the Linux edge node server? If so, what's the version number? As I would like to use Spark with Python, is pyspark available to use?
• My team and I may create one or more user-defined functions (UDFs) for Impala. Can you create a directory in HDFS where we may place our Java .jar files? Also, can you update the PATH and CLASSPATH so that we have access to java and javac? (A sample CREATE FUNCTION appears at the end of this list.)
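To make a few of the items above more concrete, here are my rough guesses at what some of these would look like. Every host name, schema name, realm, port and path below is a placeholder I made up, so please correct anything I've botched.
For the Hive and Impala connection strings, I'm assuming the usual Cloudera ODBC/JDBC drivers and Kerberos; the exact property names may differ depending on the driver versions you have available:
    JDBC to Hive:    jdbc:hive2://hiveserver.example.com:10000/myschema;principal=hive/hiveserver.example.com@EXAMPLE.COM
    JDBC to Impala:  jdbc:impala://impalad.example.com:21050/myschema;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=impalad.example.com;KrbServiceName=impala
    ODBC to Hive:    Driver=Cloudera ODBC Driver for Apache Hive;Host=hiveserver.example.com;Port=10000;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hiveserver.example.com;KrbServiceName=hive
    ODBC to Impala:  Driver=Cloudera ODBC Driver for Impala;Host=impalad.example.com;Port=21050;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=impalad.example.com;KrbServiceName=impala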
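For the external table directory, here's the sort of CREATE EXTERNAL TABLE we'd run against whatever location you set up (the table, columns and file layout are just for illustration):
    CREATE EXTERNAL TABLE myschema.sales_staging (
      sale_id    BIGINT,
      sale_date  STRING,
      amount     DECIMAL(12,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://hdpserver/data/prod/teams/myschema/sales_staging';  -- the directory you create for us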
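For the KUDU tables, here's my best guess at the CREATE TABLE syntax. I don't know whether we need to spell out kudu.master_addresses on your cluster or whether a default is already configured, and the partition count is a pure guess:
    CREATE TABLE myschema.customer_dim (
      customer_id   BIGINT,
      customer_name STRING,
      state_cd      STRING,
      PRIMARY KEY (customer_id)
    )
    PARTITION BY HASH (customer_id) PARTITIONS 8   -- how many partitions do you recommend?
    STORED AS KUDU
    TBLPROPERTIES ('kudu.master_addresses' = 'kudumaster.example.com:7051');  -- required, or already defaulted?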
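For the GROUP BY extensions, these are the sorts of queries we'd like to run; the first form is HiveQL and the second is the form I've seen documented for newer Impala releases:
    -- HiveQL form
    SELECT state_cd, city, SUM(amount) AS total_sales
    FROM myschema.sales
    GROUP BY state_cd, city WITH ROLLUP;

    -- Impala form, if your version supports it
    SELECT state_cd, city, SUM(amount) AS total_sales
    FROM myschema.sales
    GROUP BY ROLLUP(state_cd, city);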
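For the Impala UDFs, here's roughly how I expect we'd register a Java UDF once the .jar file is sitting in HDFS (the jar path, class name and function name are made up):
    CREATE FUNCTION myschema.clean_address(STRING)
    RETURNS STRING
    LOCATION 'hdfs://hdpserver/data/prod/teams/myschema/udfs/address-udfs.jar'
    SYMBOL='com.example.udf.CleanAddress';

    -- and then call it like any built-in function:
    SELECT myschema.clean_address(address_line) FROM myschema.address_staging;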
Thanks,
Bob Smith
A while back I created a SAS program to clean U.S. addresses. While not the most elegant thing in the world, it gets the job done (for the most part). The SAS address-cleaning program uses regular expressions to clean the addresses, and those regular expressions have been placed in a separate include file. Both of these files have been placed in a zip file which can be found here: SASAddressClean.zip. Don't forget that I also created a similar program in C# which can be found here: CSharp_SelfInflictedProject_AddressCleaning.pdf.
Here are my notes on how I installed and tested Teiid, the data virtualization software, on my laptop. These instructions include additional information on how to install Eclipse, JBoss, Apache Spark, Apache Maven, Scala, etc. as well as how to test JDBC connections to Oracle and SQL Server using their specific JDBC drivers. I also include an example of how to create your first virtual database as well as how to access it from within Microsoft Access. Note that I'm no master of this topic, so please consider these instructions as "alpha"!!! Click here to download the installation instructions. Please let me know if you want anything added or changed!
The SAS Lecture Series is complete and consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the SAS Presentations link, or click here.
I finally had enough time to learn PHP5, some of the Amazon Product Advertising API as well as a little bit of Facebook Development. I combined all of these technologies into one website I created called Top-Itz (http://www.top-itz.com) as well as a Facebook Application called topitzapp (http://apps.facebook.com/topitzapp/). The goal was to give Facebook users of the app the ability to add their own Amazon Associate Tag (aka, Amazon Tracking ID) into Top-Itz. Each night, Top-Itz randomly chooses a new set of users and generates links based on those Amazon Associate Tags. If someone clicks your link and purchases the item, you get some $$$ from Amazon! Woo-hoo! As usual, I documented how I did this in one document called Programming Top-Itz with PHP5, Amazon API and Facebook. Click on the Documents link at the left, then click on Web Documents, or click here. Note that Top-Itz is no longer up-and-running, but much of the information available in this document is valid.
The R Lecture Series is complete and consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the R Presentations link, or click here.
Finally, I had the time to create a presentation on how to use Oracle R Enterprise (v1.1). Click on the Slidedecks link at the left, then click on Database Presentations, or click here.
The SPSS Lecture Series is complete and consists of the following lectures:
You can find this lecture by clicking on the Slidedecks link on the left and then clicking on the SPSS Presentations link, or click here.
I finally got around to learning and writing about Android programming. Although I still have a lot of learning to do, this book may help a beginning Android programmer through some tough times. Please visit Amazon.com and search for the Kindle eBook "Android from A to D: SECOND EDITION" or click on this link: Android from A to D.
If you have already purchased the book -- THANK YOU! As I stated in the book, Chapter 32 (renumbered to Chapter 99) as well as all of the appendixes have been placed in a freely available document downloadable here.
I have also created text files containing all of the code examples throughout the book. If you have the FIRST EDITION, the code is available here, and if you have the SECOND EDITION, it is available here.
Note that I have placed the Eclipse project for EquityYoStocks (the app used to demonstrate fragments) on GitHub here (https://github.com/sheepsqueezers/EquityYoStocks). You should be able to download a zipped file to your desktop, unzip it, and then open it up in Eclipse. I'm no git guru, so if it doesn't work, let me know!
You can get a copy of this book from Amazon here.
The Database Lecture Series consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the Database Presentations link, or click here.
The Technology Lecture Series consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the Technology Presentations link, or click here.
The .NET Lecture Series consists of the lectures listed below. Please be aware that some of the namespace lectures have information taken from Microsoft's MSDN .NET documentation website. No copyright infringement is intended; that material is there solely as a way to quickly jog my own memory. With that said, all of the examples appearing throughout this lecture series are my own examples which I painstakingly coded, compiled and executed.
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the .NET Presentations link, or click here. I am currently working on additional .NET lectures and will upload them once they are complete. Also, don't forget to look in the Documents section under .NET Documents for my C# Self-Inflicted projects you might be interested in.
I tried and tried and tried and FINALLY got ROracle to compile on my 64-bit Windows Vista machine! Woohoo!! I have created a document explaining the sordid details in the hopes that you may be able to get ROracle to compile on your Windows machine. Click on the Documents link to the left and then click on the R Documents link, or click here.