Hadoop SQL in a Blind Panic! -- Book Code
Most of the code for the book can be found at this link: HadoopSQLinaBlindPanic_Code.txt
Hadoop Administrators:
Tally Ho! My name is Bob Smith and I work for the <insert dept name here> department and, as you may have heard, I've been tasked with moving data off our legacy <insert legacy database name> database to the Hadoop database. I was hoping that you could be my contact for the duration of this conversion.
First, thank you up-front for helping out since this Hadoop shizz is new to me and my team.
Second, you probably won't be surprised that I have about a bazillion questions for you which I've placed below. Your responses will go a long way in helping me and my team move to Hadoop as quickly (and painlessly!) as possible.
Here goes...
• Do you have a Linux edge node server that my team can use? If so, what's the server's host name? My team and I will be automating some processes using Linux scripts, so access to a Linux edge node server will help us out greatly.
• My team and I plan to use PuTTY to connect to the Linux edge node server. I just want to confirm that we must use port 22 (SSH) when setting up a connection to the edge node server. Do you recommend something other than PuTTY?
• On our legacy database, the schema we use is named <insert name of legacy database schema name here>. Can you please set up the same schema name on the Hadoop database?
• Since my team and I will use the edge node server as well as the Hadoop database, can you please set up the following individuals with an account on the Linux edge node server as well as access to the Hadoop database schema requested above? <insert your Team's corporate e-mail addresses here>
Also, the following team members should be given privileged access to run Hadoop commands via hadoop/hdfs from the Linux command line: <insert select team members who should have higher privileges, including yourself, here>
• Not all of my team members are highly technical, but they would still like to run simple queries against the Hadoop database. Do you have the Hadoop database web interface Hue set up and accessible? If so, what's the URL?
• In order to kill runaway SQL queries, can you please list the URLs to the Hadoop query webpages? I believe these URLs generally use port 25000 (/queries), but don't hold me to that...I'm new to these parts.
• Can you recommend a SQL client application (such as Toad Data Point, DBeaver, SQuirreL, etc.) for use with Hadoop? What do you use?
• Do you have Hive and Impala ODBC (32-bit/64-bit) and JDBC drivers available on the corporate network? If so, I'd like to access them so that I may set up my team's SQL client software (among other things). If not, can you recommend where I may download these drivers?
• Speaking of ODBC and JDBC drivers, can you please provide example connection information/strings for both ODBC and JDBC connections to Hive (port 10000?) as well as Impala (port 21050?)? We'll be using the ODBC connection information with applications such as Microsoft Excel, PowerBI, Tableau, etc. The JDBC connection strings will be used with client software that uses JDBC rather than ODBC, such as DBeaver, SQuirreL, etc. (I've taken a stab at what these might look like at the end of this list, so feel free to just correct whatever I've gotten wrong.)
• Does our corporate network run Kerberos? If so, when creating cron jobs to run automatically, we may need to create a keytab file containing Kerberos-related information. Which encryption types do you suggest we include in the keytab file? arcfour-hmac-md5? aes256-cts? rc4-hmac? Any others? Also, what's our Kerberos Realm and Host FQDN? If not Kerberos, then LDAP?
• We would like the ability to access our legacy database (<insert name of legacy database>) from the Linux edge node server for use with sqoop and other tools. Can you please install the software necessary so that my team and I may access the legacy database from there?
• Is there a generic account available on the Linux edge node server for me and a few of my team members to use? We'd like a single account to execute our production code. If so, can you please forward the username and password? If not, can you please create an account on the Linux edge node server whose password is static? Also, please give this account access to the appropriate schemas as well as hadoop/hdfs privileges.
• Is HPL/SQL available from the Linux edge node server? If not, can you please install it so that my team and I can create and execute procedures on the Linux edge node server against the Hadoop database? Also, where is the file hplsql-site.xml located?
• Is there a directory on the Linux edge node server where we can store the team's production code? If not, can you please create a directory accessible by my team as well as the generic account?
• Can you please create a directory in HDFS specifically for me and my team for use with external tables? Something like hdfs://hdpserver/data/prod/teams/<schema> or whatever your standard is. (There's a sample CREATE EXTERNAL TABLE at the end of this list showing how we'd point at it.)
• I feel completely comfortable downloading and maintaining many of my department's dimension tables, but some of the fact tables are quite large. I'm hoping you can take over the import of those large fact tables and fold it into your own processes. Can we have a conversation about that?
• What are the version numbers for the following?
• Can you please install the Linux utility dos2unix on the Linux edge node server? Since our laptops are Windows-based, we may need to convert files using dos2unix.
• Which Thrift Transport Mode should we be using? SASL? Binary? HTTP?
• Does the Hadoop Database use Secure Sockets Layer (SSL) for connections? When I go to set up an ODBC connection, there's an option asking whether I should enable SSL. Should I?
• My team and I will be using the storage formats TEXTFILE, PARQUET and KUDU almost exclusively. Can you please indicate the SQL CREATE TABLE options required to use the KUDU storage format, if any? Can you recommend the number of partitions we should use with KUDU tables? Do we have to include the table property kudu.master_addresses in our SQL code? If so, can you include an example of this? (My rough guess at the syntax appears at the end of this list.)
• In our legacy <insert name of legacy database> database, we have access to useful metadata such as table names, column names, data types, etc. within the database via ALL_TABLES, ALL_TAB_COLUMNS, INFORMATION_SCHEMA, etc. Can you create a view or views to mimic this from within the Hadoop database accessible from our new database schema? If not, can you give us read-only access to the underlying MetaStore database's metadata tables/views?
• Does the version of ImpalaSQL installed on the Hadoop database include the extensions to GROUP BY such as CUBE, ROLLUP, GROUPING SETS, etc.? (The end of this list shows the sort of queries I have in mind.)
• Is Apache Spark installed on the Linux edge node server? If so, what's the version number? As I would like to use Spark with Python, is pyspark available to use?
• My team and I may create one or more user-defined functions (UDFs) for Impala. Can you create a directory in HDFS where we may place our Java .jar files? Also, can you update the PATH and CLASSPATH so that we have access to java and javac? (A sample CREATE FUNCTION appears at the end of this list.)
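To make a few of the items above more concrete, here are my rough guesses at what some of these would look like. Every host name, schema name, realm, port and path below is a placeholder I made up, so please correct anything I've botched.
For the Hive and Impala connection strings, I'm assuming the usual Cloudera ODBC/JDBC drivers and Kerberos; the exact property names may differ depending on the driver versions you have available:
    JDBC to Hive:    jdbc:hive2://hiveserver.example.com:10000/myschema;principal=hive/hiveserver.example.com@EXAMPLE.COM
    JDBC to Impala:  jdbc:impala://impalad.example.com:21050/myschema;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=impalad.example.com;KrbServiceName=impala
    ODBC to Hive:    Driver=Cloudera ODBC Driver for Apache Hive;Host=hiveserver.example.com;Port=10000;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hiveserver.example.com;KrbServiceName=hive
    ODBC to Impala:  Driver=Cloudera ODBC Driver for Impala;Host=impalad.example.com;Port=21050;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=impalad.example.com;KrbServiceName=impala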
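For the external table directory, here's the sort of CREATE EXTERNAL TABLE we'd run against whatever location you set up (the table, columns and file layout are just for illustration):
    CREATE EXTERNAL TABLE myschema.sales_staging (
      sale_id    BIGINT,
      sale_date  STRING,
      amount     DECIMAL(12,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://hdpserver/data/prod/teams/myschema/sales_staging';  -- the directory you create for us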
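For the KUDU tables, here's my best guess at the CREATE TABLE syntax. I don't know whether we need to spell out kudu.master_addresses on your cluster or whether a default is already configured, and the partition count is a pure guess:
    CREATE TABLE myschema.customer_dim (
      customer_id   BIGINT,
      customer_name STRING,
      state_cd      STRING,
      PRIMARY KEY (customer_id)
    )
    PARTITION BY HASH (customer_id) PARTITIONS 8   -- how many partitions do you recommend?
    STORED AS KUDU
    TBLPROPERTIES ('kudu.master_addresses' = 'kudumaster.example.com:7051');  -- required, or already defaulted?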
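For the GROUP BY extensions, these are the sorts of queries we'd like to run; the first form is HiveQL and the second is the form I've seen documented for newer Impala releases:
    -- HiveQL form
    SELECT state_cd, city, SUM(amount) AS total_sales
    FROM myschema.sales
    GROUP BY state_cd, city WITH ROLLUP;

    -- Impala form, if your version supports it
    SELECT state_cd, city, SUM(amount) AS total_sales
    FROM myschema.sales
    GROUP BY ROLLUP(state_cd, city);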
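For the Impala UDFs, here's roughly how I expect we'd register a Java UDF once the .jar file is sitting in HDFS (the jar path, class name and function name are made up):
    CREATE FUNCTION myschema.clean_address(STRING)
    RETURNS STRING
    LOCATION 'hdfs://hdpserver/data/prod/teams/myschema/udfs/address-udfs.jar'
    SYMBOL='com.example.udf.CleanAddress';

    -- and then call it like any built-in function:
    SELECT myschema.clean_address(address_line) FROM myschema.address_staging;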
Thanks,
Bob Smith
A while back I created a SAS program to clean U.S. addresses. While not the most elegant thing in the world, it gets the job done (for the most part). The SAS address-cleaning program uses regular expressions to clean the addresses, and those regular expressions have been placed in a separate include file. Both of these files have been placed in a zip file which can be found here: SASAddressClean.zip. Don't forget that I also created a similar program in C# which can be found here: CSharp_SelfInflictedProject_AddressCleaning.pdf.
Here are my notes on how I installed and tested Teiid, the data virtualization software, on my laptop. These instructions include additional information on how to install Eclipse, JBoss, Apache Spark, Apache Maven, Scala, etc. as well as how to test JDBC connections to Oracle and SQL Server using their specific JDBC drivers. I also include an example of how to create your first virtual database as well as how to access it from within Microsoft Access. Note that I'm no master of this topic, so please consider these instructions as "alpha"!!! Click here to download the installation instructions. Please let me know if you want anything added or changed!
The SAS Lecture Series is complete and consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the SAS Presentations link, or click here.
I finally had enough time to learn PHP5, some of the Amazon Product Advertising API as well as a little bit of Facebook Development. I combined all of these technologies into one website I created called Top-Itz (http://www.top-itz.com) as well as a Facebook Application called topitzapp (http://apps.facebook.com/topitzapp/). The goal was to give Facebook users of the app the ability to add their own Amazon Associate Tag (aka, Amazon Tracking ID) into Top-Itz. Each night, Top-Itz randomly chooses a new set of users and generates links based on those Amazon Associate Tags. If someone clicks your link and purchases the item, you get some $$$ from Amazon! Woo-hoo! As usual, I documented how I did this in one document called Programming Top-Itz with PHP5, Amazon API and Facebook. Click on the Documents link at the left, then click on Web Documents, or click here. Note that Top-Itz is no longer up-and-running, but much of the information available in this document is valid.
The R Lecture Series is complete and consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the R Presentations link, or click here.
Finally, I had the time to create a presentation on how to use Oracle R Enterprise (v1.1). Click on the Slidedecks link at the left, then click on Database Presentations, or click here.
The SPSS Lecture Series is complete and consists of the following lectures:
You can find this lecture by clicking on the Slidedecks link on the left and then clicking on the SPSS Presentations link, or click here.
I finally got around to learning and writing about Android programming. Although I still have a lot of learning to do, this book may help a beginning Android programmer through some tough times. Please visit Amazon.com and search for the Kindle eBook "Android from A to D: SECOND EDITION" or click on this link: Android from A to D.
If you have already purchased the book -- THANK YOU! As I stated in the book, Chapter 32 (renumbered to Chapter 99) as well as all of the appendixes have been placed in a freely available document downloadable here.
I have also created text files containing all of the code examples throughout the book. If you have the FIRST EDITION, the code is available here, and if you have the SECOND EDITION, it is available here.
Note that I have placed the Eclipse project for EquityYoStocks (the app used to demonstrate fragments) on GitHub here (https://github.com/sheepsqueezers/EquityYoStocks). You should be able to download a zipped file to your desktop, unzip it, and then open it up in Eclipse. I'm no git guru, so if it doesn't work, let me know!
You can get a copy of this book from Amazon here.
The Database Lecture Series consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the Database Presentations link, or click here.
The Technology Lecture Series consists of the following lectures:
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the Technology Presentations link, or click here.
The .NET Lecture Series consists of the lectures listed below. Please be aware that some of the namespace lectures have information taken from Microsoft's MSDN .NET documentation website. No copyright infringement is intended; that material is there solely as a way to quickly jog my own memory. With that said, all of the examples appearing throughout this lecture series are my own examples which I painstakingly coded, compiled and executed.
You can find these lectures by clicking on the Slidedecks link on the left and then clicking on the .NET Presentations link, or click here. I am currently working on additional .NET lectures and will upload them once they are complete. Also, don't forget to look in the Documents section under .NET Documents for my C# Self-Inflicted projects you might be interested in.
I tried and tried and tried and FINALLY got ROracle to compile on my 64-bit Windows Vista machine! Woohoo!! I have created a document explaining the sordid details in the hopes that you may be able to get ROracle to compile on your Windows machine. Click on the Documents link to the left and then click on the R Documents link, or click here.