The Hadoop ecosystem has been around for quite some time now, and there are many technologies that cater to different types of use cases. Hive and Impala are two technologies commonly used to store data in the Hadoop ecosystem as relations (tables) that can be queried using SQL.
While Hive and Impala have their own pros and cons in execution speed and in handling large tables, there are also a few differences in how UDFs (User Defined Functions) are created in each. In particular, there are cases in Impala where UDFs must be implemented in C++ instead of Java.
In this blog, we will discuss a simple use case where a C++ UDF is needed and how one can go about writing a C++ UDF.
A use case where a C++ UDF is required
For example, let’s say you want to create an Impala table with a column “created” of datatype TIMESTAMP. However, the data you want to load into that table stores the “created” value as an epoch time in integer format, which must be converted to TIMESTAMP. This can be done using a custom UDF. There are two ways of implementing UDFs in Impala: one uses Java and the other uses C++. However, you can’t create a Java UDF that accepts or returns TIMESTAMP; Impala throws an error (AnalysisException) saying that the type TIMESTAMP is not supported for Java UDFs. Hence, the only option left is to go with a C++ UDF.
The second reason to write a C++ UDF is performance: a C++ UDF is compiled to native code, which can yield considerably higher performance, in some cases around 10x faster execution than an equivalent Java UDF.
Below, we create a C++ UDF for Impala that takes a BIGINT value as input (an epoch time in milliseconds), divides the input by 1000 to get seconds, and returns the result as a TIMESTAMP value containing the date and the time of day in nanoseconds. For example, an input of 1535314610000 corresponds to 2018-08-26 20:16:50 UTC.
Getting Started with UDF Coding (on Linux)
Implementing a UDF in C++ involves two steps.
- Create a .h file: declare the function that will be implemented in a .cc file.
Firstly, include the header file /usr/include/impala_udf/udf.h in your C++ file. This is the only Impala header file required to develop UDFs. It contains all the value types, along with FunctionContext, that are needed to develop a UDF. A deep dive into this header file will help you understand the layout, member variables, and functions of the predefined UDF data types.
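For reference, these value types are small structs that carry a NULL flag plus the underlying value. The outline below is a simplified, illustrative sketch (not the actual header contents) of the two types used in this example, BigIntVal and TimestampVal:

// Simplified sketch of the relevant value types from impala_udf/udf.h (illustrative only).
#include <cstdint>

struct AnyVal {
  bool is_null;          // set to true when the SQL value is NULL
};

struct BigIntVal : public AnyVal {
  int64_t val;           // the 64-bit integer payload (meaningful only when is_null is false)
};

struct TimestampVal : public AnyVal {
  int32_t date;          // absolute day number, as produced by boost::gregorian::date::day_number()
  int64_t time_of_day;   // nanoseconds elapsed within that day
  TimestampVal(int32_t d = 0, int64_t t = 0) : date(d), time_of_day(t) { is_null = false; }
};

The TimestampVal constructor is what our UDF will ultimately return: a day number plus the nanoseconds elapsed in that day.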
See below the header file timeconverter-function.h, which declares a function named timeconverter() with the basic declarations required to write a scalar UDF.
#ifndef IMPALA_UDF
#define IMPALA_UDF

#include <impala_udf/udf.h>
#include <math.h>

using namespace impala_udf;

TimestampVal timeconverter(FunctionContext* context, const BigIntVal& val);

#endif
- Create a .cc file: implement the function, including the .h file (timeconverter-function.h) in the .cc file.
See below the source file TimeConverter.cc, with the sample C++ code for the function timeconverter().
#include "timeconverter-function.h"

#include <string.h>
#include <sstream>
#include <cstdlib>
#include <ctime>

#include <boost/date_time/compiler_config.hpp>
#include <boost/date_time/gregorian/gregorian.hpp>
#include <boost/date_time/date.hpp>
#include <boost/cstdint.hpp>
#include <boost/date_time/local_time/local_time.hpp>

using namespace impala_udf;
using namespace std;
using namespace impala;

TimestampVal timeconverter(FunctionContext* context, const BigIntVal& arg1) {
  int64_t hour_ns   = 3600000000000;  // number of nanoseconds in one hour
  int64_t minute_ns = 60000000000;    // number of nanoseconds in one minute
  int64_t second_ns = 1000000000;     // number of nanoseconds in one second

  std::time_t epoch_time = arg1.val / 1000;  // convert the time in milliseconds to seconds

  // Format the epoch time as a string, e.g. "2018/08/26 20:16:50-UTC".
  boost::posix_time::time_facet* facet =
      new boost::posix_time::time_facet("%Y/%m/%d %H:%M:%S-UTC");
  std::stringstream date_time;
  date_time.imbue(std::locale(date_time.getloc(), facet));
  date_time << boost::posix_time::from_time_t(epoch_time);
  std::string time_date_str = date_time.str();  // string representation of the date and time

  // Split time_date_str on the space to get the date part, e.g. "2018/08/26".
  std::string delimiter = " ";
  std::string date_str = time_date_str.substr(0, time_date_str.find(delimiter));

  // Build a boost::gregorian::date object from date_str.
  boost::gregorian::date dateObj = boost::gregorian::from_string(date_str);

  // Convert dateObj into a day number: an absolute number of days since the
  // epoch start. This is used as the first parameter of TimestampVal.
  uint32_t date_val = dateObj.day_number();

  // Split off the time part, i.e. "hh:mm:ss".
  std::string time_str = time_date_str.substr(time_date_str.find(delimiter) + 1, 8);
  std::string hour = time_str.substr(0, 2);  // hour part
  std::string mint = time_str.substr(3, 2);  // minute part
  std::string sec  = time_str.substr(6, 2);  // second part

  // Convert the string values into integers.
  int h_val = atoi(hour.c_str());
  int m_val = atoi(mint.c_str());
  int s_val = atoi(sec.c_str());

  // time_of_day is the number of nanoseconds elapsed in the current day.
  int64_t time_of_day = h_val * hour_ns + m_val * minute_ns + s_val * second_ns;

  // Return a TimestampVal object carrying the date and time for Impala.
  return TimestampVal(date_val, time_of_day);
}
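One thing worth noting: the function above reads arg1.val unconditionally and so assumes its input is never NULL. Impala signals a SQL NULL through the is_null flag on the argument struct, so a more defensive variant could propagate NULL instead. The sketch below is only an illustration; the name timeconverter_null_safe is our own and is not part of the code above.

#include "timeconverter-function.h"

// Hypothetical NULL-safe wrapper around timeconverter() (illustrative sketch).
// A SQL NULL arrives with arg1.is_null set, so we return a NULL TimestampVal
// instead of reading arg1.val.
TimestampVal timeconverter_null_safe(FunctionContext* context, const BigIntVal& arg1) {
  if (arg1.is_null) {
    TimestampVal null_result;
    null_result.is_null = true;
    return null_result;
  }
  return timeconverter(context, arg1);  // delegate to the conversion logic above
}

If you register such a wrapper instead, the SYMBOL value in the CREATE FUNCTION statement below would need to point at its name.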
COMPILING THE UDF
Create a folder named "impala-udf" containing the files timeconverter-function.h and TimeConverter.cc. Build the project and generate the shared library file timeconverter_udf.so using the command below.
g++ -shared -o timeconverter_udf.so -fPIC TimeConverter.cc
DEPLOYING AND TESTING THE UDF
- After compiling the UDF, a timeconverter_udf.so file will be generated in your current working directory (impala-udf). Copy this timeconverter_udf.so file to an HDFS directory, creating a new directory for the UDF if it does not exist. Perform this step using the command below.
hadoop dfs -mkdir /udf
- After creating the directory "udf" in HDFS, copy the timeconverter_udf.so file into it so that it is accessible from the nodes where the Impala daemons are running. To do so, execute the command below.
hadoop dfs -put timeconverter_udf.so /udf/
- Now, create the new user-defined function, bigint_to_timeConv, in Impala; it accepts a BIGINT value and returns a TIMESTAMP value. While creating the UDF, provide the HDFS path of the timeconverter_udf.so file we created in the steps above. Lastly, provide the value of the SYMBOL parameter, which is the name of the function we implemented in the TimeConverter.cc file. To do this, run the statement below.
create function bigint_to_timeConv(Bigint) returns TIMESTAMP location '/udf/timeconverter_udf.so' SYMBOL='timeconverter';
- After performing the above step, the UDF named bigint_to_timeConv is created. Now it's time to execute a query. Provide a column of BIGINT datatype, say column3, as input to the UDF. An example query for this step is given below.
select column1,column2,bigint_to_timeConv(column3) from table_name;
Hope this blog helps you get started with writing a C++ UDF for Impala. The code mentioned above is available at the Git location below: https://github.com/teamclairvoyant/impala-udf
In case of any queries, please leave a comment below and we will try to respond as soon as possible. Happy coding!
About our company: Clairvoyant is a data and decision engineering company. We design, implement, and operate data management platforms and analytics products, delivering transformative business value to our customers.
To get the best data engineering solutions for your business, reach out to us at Clairvoyant.