How To Register Piggybank.jar In Pig
In the previous post we discussed the basic introduction to log files and the architecture of log analysis in Hadoop. In this post, we will go into much deeper detail on processing logs in Pig.
As discussed in the previous post, there are mainly three types of log files.
- Web Server Admission Logs
- Web Server Error Logs
- Application Server Logs
All these log files will be in any one of the below three formats.
- Common Log File format
- Combined Log File format
- Custom Log File Format
In the following sections we will discuss more about these log file formats and about processing each of them separately in Pig.
Table of Contents
- Common Log Format Files
- Example Use Case of CommonLogLoader
- Combined Log Format Files
- Example Use Case of CombinedLogLoader
- Verify Output
- Custom Log Format Files
- Example Call to MyRegExLoader class
- Use Case: Parsing Hadoop Daemon Logs
- Verify Output
Common Log Format Files:
The Common Log file format is shown below. This is also called Apache's Common Log Format, as the format originated from Apache Web Server logs.
"%hostname %logname %username %time \"%r\" %>statuscode %bytes"
An example log line is shown below:
64.242.88.10 - - [07/Mar/2014:16:47:12 -0800] "GET /robots.txt HTTP/1.1" 200 68
Sample Common Log file for testing —>common_access_log
As we already know about Load functions in Pig from the previous post on Pig Load Functions, this section can be considered the best example of custom Load functions. Fortunately Piggybank, a repository of user-submitted UDFs, contains a custom loader function, CommonLogLoader, to load Apache's Common Log Format files into Pig. This Java class extends the RegExLoader class, which is a custom UDF Load function.
To know more about writing UDFs for a custom Load function, refer to that post.
CommonLogLoader uses the below regular expression to parse Common Log Format files:
"^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+.(\\S+)\\s+(\\S+)\\s+(\\S+.\\S+).\\s+(\\S+)\\s+(\\S+)$"
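As a quick sanity check, the same pattern can be tried outside Pig. The Python sketch below (variable names are ours, not from the Piggybank source) applies the pattern, with the Java string escaping removed, to the sample log line above:

```python
import re

# CommonLogLoader's pattern, rewritten as a raw Python regex
# (the doubled Java backslashes collapsed to \S, \s, etc.).
COMMON_LOG = re.compile(
    r'^(\S+)\s+(\S+)\s+(\S+)\s+.(\S+\s+\S+).\s+.(\S+)\s+(\S+)\s+(\S+.\S+).\s+(\S+)\s+(\S+)$'
)

line = '64.242.88.10 - - [07/Mar/2014:16:47:12 -0800] "GET /robots.txt HTTP/1.1" 200 68'
# The nine groups line up with the schema used in the Pig script below.
addr, logname, user, time, method, uri, proto, status, size = COMMON_LOG.match(line).groups()
print(addr, method, uri, status, size)  # 64.242.88.10 GET /robots.txt 200 68
```

Note that the bare `.` atoms in the pattern are what consume the `[`, `]`, and `"` delimiters around the timestamp and request fields.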
Example Use Case of CommonLogLoader:
Let's put the above Apache Common log file into the HDFS location /in/, register the piggybank jar file, define a temporary alias for CommonLogLoader, and use it to parse the Apache common_access_log file.
From the Pig-0.13.0 release onwards, the piggybank.jar file is included in Pig's lib directory itself, so we do not need to register this jar file each time we log in to the Pig grunt shell; on earlier releases we need to register this jar file as shown below to use any UDFs from piggybank.
REGISTER '/path_to_piggybank/piggybank.jar';
Let's put the Pig Latin commands in a script file common_log_process_script.pig and process it.
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
logs = LOAD '/in/common_access_log' USING ApacheCommonLogLoader AS (addr: chararray, logname: chararray, user: chararray, time: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int);
addrs = GROUP logs BY addr;
counts = FOREACH addrs GENERATE flatten($0), COUNT($1) AS count;
DUMP counts;
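The GROUP/COUNT pair in the script is a plain per-key count. Purely for illustration, here is the same aggregation in Python; the sample lines are shortened stand-ins, not taken from the real log file:

```python
from collections import Counter

def count_addresses(lines):
    """Count log lines per client address (the first whitespace-separated field),
    mirroring GROUP logs BY addr; FOREACH ... COUNT($1)."""
    counts = Counter()
    for line in lines:
        counts[line.split()[0]] += 1
    return counts

sample = [
    '10.0.0.153 - - [12/Mar/2014:11:01:28 -0800] "GET / HTTP/1.1" 200 3169',
    '10.0.0.153 - - [12/Mar/2014:12:23:11 -0800] "GET /blog HTTP/1.1" 200 4085',
    '64.242.88.10 - - [07/Mar/2014:16:47:12 -0800] "GET /robots.txt HTTP/1.1" 200 68',
]
print(count_addresses(sample))  # Counter({'10.0.0.153': 2, '64.242.88.10': 1})
```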
We are trying to display the counts of addresses/host names from the log file:
$ hadoop fs -put common_access_log /in/
$ pig -f common_log_process_script.pig
Below is the output of the above Pig script run:
We can verify the count (270) of the above highlighted IP address (10.0.0.153) in the input log file on the local file system, to confirm that the log parsing is done correctly, with the below command:
$ cat common_access_log | grep 10.0.0.153 | wc -l
So, we can confirm that Apache's Common log files are processed successfully in Pig. These results can be stored into an HDFS file and fed to a Hive external table, so that we can make the table available to any visualization tool like Hunk or Tableau to pull this data.
Combined Log Format Files:
The Combined Log file format is shown below. It has two extra fields, Referer and User-Agent, when compared to the Common Log format.
"%hostname %logname %username %time \"%r\" %>statuscode %bytes \"%{Referer}i\" \"%{User-Agent}i\""
An example log line is shown below:
122.172.200.100 - - [17/Oct/2014:00:04:36 -0400] "GET /tag/hbase-sink/ HTTP/1.1" 200 15997 "https://www.google.co.in/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36"
Sample Combined Log file for testing —>combined_access_log
Similar to CommonLogLoader, Piggybank provides a custom loader function, CombinedLogLoader, to load Combined Log Format files into Pig. This Java class also extends the RegExLoader class, which is a custom UDF Load function.
CombinedLogLoader uses the below regular expression to parse Combined Log Format files:
"^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\"\\s+(\\S+)\\s+(\\S+)\\s+\"([^\"]*)\"\\s+\"(.*)\"$"
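As with the Common Log pattern, this regex can be sanity-checked outside Pig. The Python sketch below (the user-agent string is truncated here for brevity) pulls out the two extra fields:

```python
import re

# CombinedLogLoader's pattern as a raw Python regex (Java escaping removed).
COMBINED_LOG = re.compile(
    r'^(\S+)\s+(\S+)\s+(\S+)\s+.(\S+\s+\S+).\s+"(\S+)\s+(.+?)\s+(HTTP[^"]+)"'
    r'\s+(\S+)\s+(\S+)\s+"([^"]*)"\s+"(.*)"$'
)

line = ('122.172.200.100 - - [17/Oct/2014:00:04:36 -0400] '
        '"GET /tag/hbase-sink/ HTTP/1.1" 200 15997 "https://www.google.co.in/" '
        '"Mozilla/5.0 (X11; Linux x86_64)"')
groups = COMBINED_LOG.match(line).groups()
# Groups 0-8 are the Common Log fields; 9 and 10 are the two extras.
referer, user_agent = groups[9], groups[10]
print(referer)  # https://www.google.co.in/
```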
Example Use Case of CombinedLogLoader:
Similar to the above CommonLogLoader use case, let's copy the above sample combined log format file into the HDFS /in/ location and execute the below Pig Latin script file to get the top 10 referrers and their counts in the access log file.
logs = LOAD '/in/combined_access_log' USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader() AS (addr: chararray, logname: chararray, user: chararray, time: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int, referer: chararray, userAgent: chararray);
refers = GROUP logs BY referer;
refer_counts = FOREACH refers GENERATE flatten($0), COUNT($1) AS count;
sorted_refer_counts = ORDER refer_counts BY count DESC;
top10_refers = LIMIT sorted_refer_counts 10;
STORE top10_refers INTO '/out/topreferrers/' USING PigStorage('\t');
Push the sample combined log file into HDFS and execute the above Pig script.
$ hadoop fs -put combined_access_log /in/
$ pig -f combined_log_process_script.pig
Verify Output:
$ hadoop fs -cat /out/topreferrers/part-r-00000
Just for your information, the above combined log file is the access log file of this site, hadooptutorial.info, for 17th Nov 2014. Below are the top 10 referrer pages on the hadooptutorial.info site on that day.
Thus, as shown above, we can use this log processing mechanism to get real-time statistics out of the log files of any web server. In this example we figured out the top 10 referrers of the hadooptutorial.info site, and similarly we can prepare many other statistics: date-wise user counts, date-wise top 10 page counts, etc. This data can be saved in a Hive table with 2 columns (page: string, count: int), and we can connect that data to Tableau and provide visualization on top of it.
So this section gives an example of a real-world project on processing log files and producing visualizations from the statistics prepared in Hadoop. For connectivity with Tableau on Hive data, refer to the post Tableau with Hive.
Custom Log Format Files:
Custom Log Format Files can include log files in any format other than the above two. Similar to the above regular expressions, we need to write our own custom regular expressions to parse these log files. One example of these custom log files is the Hadoop logs generated by the hadoop/yarn daemons on cluster nodes.
Sample Custom Log file for testing —>hadooplogs
For processing custom log format files, Piggybank provides the MyRegExLoader class, which extends the RegExLoader class. It is similar to CommonLogLoader, but we need to provide the regular expression pattern used to parse the custom log files. This regular expression should be passed as an argument through Pig Latin.
Example Call to MyRegExLoader class:
A = LOAD 'test.txt' USING org.apache.pig.piggybank.storage.MyRegExLoader('(\\d+)!+(\\w+)~+(\\w+)');
This would parse lines like
1!!!one~i
2!!two~~ii
3!three~~~iii
into arrays like
{1, "one", "i"}, {2, "two", "ii"}, {3, "three", "iii"}
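The same capture behaviour can be checked in Python; the sketch below applies the pattern to the three example lines above:

```python
import re

# The pattern handed to MyRegExLoader, as a Python regex:
# digits, one or more '!', a word, one or more '~', another word.
pattern = re.compile(r'(\d+)!+(\w+)~+(\w+)')

lines = ['1!!!one~i', '2!!two~~ii', '3!three~~~iii']
parsed = [pattern.match(line).groups() for line in lines]
print(parsed)  # [('1', 'one', 'i'), ('2', 'two', 'ii'), ('3', 'three', 'iii')]
```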
Use Case: Parsing Hadoop Daemon Logs:
Let's upload the above provided hadooplogs file into the HDFS directory /in/ and list out only the WARN messages and ERROR messages, to debug the error scenarios and improve the performance of the Hadoop nodes.
Save the below Pig Latin script into a hadoop_log_process_script.pig file.
data = LOAD '/in/hadooplogs' USING org.apache.pig.piggybank.storage.MyRegExLoader('^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}.\\d{2}.\\d{2}.\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+(.*)$') AS (dt: chararray, time1: chararray, type: chararray, class: chararray, msg: chararray);
logs_by_type = GROUP data BY type;
counts_by_type = FOREACH logs_by_type GENERATE flatten($0), COUNT($1) AS count;
STORE counts_by_type INTO '/out/hdlogs/counts/' USING PigStorage('\t');
error_warn_logs = FILTER data BY (type == 'WARN') OR (type == 'ERROR');
error_warn_msgs = FOREACH error_warn_logs GENERATE type, msg;
STORE error_warn_msgs INTO '/out/hdlogs/msgs/' USING PigStorage('\t');
$ hadoop fs -put hadooplogs /in/
$ pig -f hadoop_log_process_script.pig
Verify Output:
Let's verify the output in the below two output files:
$ hadoop fs -ls -R /out/hdlogs/
$ hadoop fs -cat /out/hdlogs/counts/part-r-00000
$ hadoop fs -cat /out/hdlogs/msgs/part-m-00000 | head
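As a side note, the regular expression passed to MyRegExLoader above can be sanity-checked in Python before running the Pig job. The log lines below are illustrative examples in the standard log4j layout, not lines from the provided hadooplogs file:

```python
import re

# The same pattern used in the Pig script above, as a Python regex.
HADOOP_LOG = re.compile(
    r'^(\d{4}-\d{2}-\d{2})\s+(\d{2}.\d{2}.\d{2}.\d{3})\s+(\S+)\s+(\S+)\s+(.*)$'
)

# Hypothetical daemon log lines in the default log4j layout.
lines = [
    '2014-10-20 11:24:57,796 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: registered',
    '2014-10-20 11:25:03,108 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library',
]
flagged = []
for line in lines:
    dt, time1, level, cls, msg = HADOOP_LOG.match(line).groups()
    # Same condition as the FILTER ... BY (type == 'WARN') OR (type == 'ERROR') step.
    if level in ('WARN', 'ERROR'):
        flagged.append((level, msg))
print(flagged)  # [('WARN', 'Unable to load native-hadoop library')]
```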
Thus we have successfully parsed Hadoop logs from a single file, and the same approach can be used to process logs across the entire cluster and produce statistics on the number of error messages and warnings per day per node. Based on that, we can track the health status of a huge production Hadoop cluster and improve cluster performance through analysis of the WARN messages and ERROR messages. This is another real-world use case for parsing logs and generating real benefit out of log processing.
Source: http://hadooptutorial.info/processing-logs-in-pig/
Posted by: laddliamed.blogspot.com