Monday, October 31, 2011

Write Nutch Plugin Code in Groovy

In this post, let us look at how we can integrate Groovy into the Nutch code base so that plugin code can be written in Groovy.

Below are the steps.

1. Edit "src/plugin/myplugin/ivy.xml" of your plugin to include the Groovy jar. After the edit, the dependencies section of the file should look like below
<dependencies>
 <dependency org="org.codehaus.groovy" name="groovy-all" rev="1.7.4"/>
</dependencies>


2. Edit "src/plugin/myplugin/plugin.xml" to include the Groovy jar. After the edit, the runtime section of the file should look like below
<runtime>
   <library name="my-plugin.jar">
      <export name="*"/>
   </library>
   <library name="groovy-all-1.7.4.jar"/>
</runtime>


3. Edit "ivy/ivy.xml" to include the line below in the dependencies section.
<dependency org="org.codehaus.groovy" name="groovy-all" rev="1.7.4"/>


4. Edit "src/plugin/build-plugin.xml" and add the taskdef below
<taskdef name="groovyc"
         classname="org.codehaus.groovy.ant.Groovyc">
   <classpath refid="classpath"/>
</taskdef>

Add the target below
<target name="groovyCompile">
   <echo message="Compiling groovy classes in plugin: ${name}"/>
   <groovyc
      srcdir="${src.dir}"
      includes="**/*.groovy"
      destdir="${build.classes}">
      <classpath refid="classpath"/>
   </groovyc>
</target>

Then modify the target named "compile" so that it depends on "groovyCompile". Change
<target name="compile" depends="init,deps-jar, resolve-default">

to
<target name="compile" depends="init,deps-jar, resolve-default, groovyCompile">

That's it. Now you should be able to add Groovy classes to your plugin and use them in Java classes. Also, if you are working in Eclipse, do not forget to add groovy-all-1.7.4.jar to your classpath.

Creating Nutch Distribution with Custom Plugin Code and Running

In this post, we are going to talk about how we can build Nutch with custom plugin code. Look here to see how we can create our custom plugin. After the plugin development is done, we need to do the following to make a distribution
  1. Go to the Nutch project folder (let's assume it is "~/workspaces/nutch")
  2. Run "ant tar"
  3. The above command creates "~/workspaces/nutch/dist" (here you can find the distribution nutch-1.3.tar.gz)
  4. nutch-1.3.tar.gz is also unpacked into the folder "~/workspaces/nutch/dist/nutch-1.3"
  5. Change directory to "~/workspaces/nutch/dist/nutch-1.3/runtime/local"
  6. Verify that the folder "urls" exists in this directory. If not, create it. Add some seed URLs to it.
  7. While developing, the "plugin.folders" property in "conf/nutch-site.xml" has a value of "./src/plugin". This does not work when you are working with the distribution. Change the value of this property to wherever the plugin jars are located. By default, these jars are in the "~/workspaces/nutch/dist/nutch-1.3/runtime/local/plugins" folder. Since we are already in the folder "local", change the value of the "plugin.folders" property in "conf/nutch-site.xml" to "./plugins". If you forget to do this, you might see an error like:
     Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
       at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
       at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
  8. Now run Nutch using a command like "bin/nutch crawl urls -dir crawl -depth 2 -topN 4". A good explanation of what everything means in this command can be found here.
That's it.

Thursday, October 27, 2011

Java Regular Expressions Test Harness

Regular expressions are best understood by trying them in a Java program with different inputs and analyzing the output while reading the definitions.

The Java Regex page provides a class that we can use to try out regular expressions with different inputs. The issue with that program is that it does not work in Eclipse, because it relies on the Console, which is not available when running from Eclipse. If you are experiencing that issue, the program below can be used to try them out.

package org.apache.nutch.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {
    public static void main(String[] args) {
        try {
            // Read from System.in through a BufferedReader so the program also works inside Eclipse.
            BufferedReader console = new BufferedReader(new InputStreamReader(System.in));
            System.out.println("Enter regular expression (or exit to quit): ");
            String regex = console.readLine();
            // readLine() returns null when the input stream is closed.
            while (regex != null && !regex.equalsIgnoreCase("exit")) {
                Pattern pattern = Pattern.compile(regex);

                System.out.println("Enter input string to search: ");
                String input = console.readLine();
                if (input == null) {
                    break;
                }
                Matcher matcher = pattern.matcher(input);

                // Report every match with its start (inclusive) and end (exclusive) index.
                boolean found = false;
                while (matcher.find()) {
                    System.out.println("I found the text '" + matcher.group()
                            + "' starting at index " + matcher.start()
                            + " and ending at index " + matcher.end());
                    found = true;
                }
                if (!found) {
                    System.out.println("No match found.");
                }
                System.out.println("Enter regular expression (or exit to quit): ");
                regex = console.readLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
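
For readers who prefer a fixed input over typing at a prompt, the find loop above can be sketched as a small helper. This is only an illustration: the class name RegexFindSketch, the method findAll, and the sample pattern and input are made up for this sketch and are not part of the Java Regex page or of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexFindSketch {

    // Collects every match as "text@start-end", where end is exclusive,
    // mirroring what the harness prints for each match.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one or more 'a' followed by 'b'.
        System.out.println(findAll("a+b", "aab ab xb")); // prints [aab@0-3, ab@4-6]
    }
}
```

Note that the trailing "xb" produces no match, because the pattern requires at least one 'a' before the 'b'.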

Sunday, October 23, 2011

Parse String to Java Date and Format Date to Localized String

In this post, let's look into how we can convert a String to a Date object in Java and vice versa.

A solid understanding of the Date class is essential to understand this post or anything related to Java dates. I have written a post explaining the basics of the Date class.

I have a string "october 21, 2011 19:08". I would like to convert this to a Date object. How can I do this? Also, how can I convert that Date back to a localized String?

First, let's look at the code.
1:  import java.text.ParseException;  
2:  import java.text.SimpleDateFormat;  
3:  import java.util.Date;  
4:  import java.util.TimeZone;  
5:    
6:  public class DateTest {  
7:    
8:       public static void main(String[] args) {  
9:            try {  
10:                 String dateStr = "october 21, 2011 19:08";  
11:                 String pattern = "MMMM dd, yyyy HH:mm";  
12:                 TimeZone timezoneOfDateStr = TimeZone.getTimeZone("Asia/Kolkata");  
13:                   
14:                 SimpleDateFormat sd = new SimpleDateFormat(pattern);  
15:                 sd.setTimeZone(timezoneOfDateStr);  
16:                 Date date = sd.parse(dateStr);  
17:                   
18:                 System.out.println("Converting : ");  
19:                 System.out.println("     Date String: "+dateStr);  
20:                 System.out.println("     pattern: "+pattern);  
21:                 System.out.println("     time zone: Asia/kolkata");  
22:                 System.out.println();  
23:                 System.out.println("Converted.");  
24:                 System.out.println("System Stored the above Date as : "+date.getTime());  
25:                 System.out.println();  
26:                 String patternWithTimezone = "MMMM dd, yyyy HH:mm zzzz";  
27:                 SimpleDateFormat sdWithTZ = new SimpleDateFormat(patternWithTimezone);  
28:                 sdWithTZ.setTimeZone(timezoneOfDateStr);  
29:                 System.out.println("Formatting the date back to human readable format in the same timezone");  
30:                 System.out.println("     here it is: "+sdWithTZ.format(date));  
31:                   
32:                 System.out.println();  
33:                 sdWithTZ.setTimeZone(TimeZone.getTimeZone("GMT"));  
34:                 System.out.println("Formatting the date to human readable format in GMT");  
35:                 System.out.println("     here it is: "+sdWithTZ.format(date));  
36:            } catch (ParseException e) {  
37:                 e.printStackTrace();  
38:            }  
39:       }  
40:    
41:  }  
42:    

Output
Converting : 
 Date String: october 21, 2011 19:08
 pattern: MMMM dd, yyyy HH:mm
 time zone: Asia/kolkata

Converted.
System Stored the above Date as : 1319204280000

Formatting the date back to human readable format in the same timezone
 here it is: October 21, 2011 19:08 India Standard Time

Formatting the date to human readable format in GMT
 here it is: October 21, 2011 13:38 Greenwich Mean Time

Explanation
  1. Line 10: The date string that needs to be converted.
  2. Line 11: The pattern of the date string defined in line 10. To understand how to build this pattern, read the table in the Javadoc of the SimpleDateFormat class, which explains the pattern letters and how they are interpreted (example: M is interpreted as month, y as year, d as day of the month, H as hour of the day, m as minute, z as time zone, etc.)
  3. Line 12: The time zone of the above date. Why a time zone? A date and time by itself does not tell the whole story. (example: Oct 21, 2011 11:00 in New York is the same instant as Oct 21, 2011 08:00 in Los Angeles). So, we also need to tell the system what time zone that time belongs to. In this case, we are saying that the date belongs to the "Asia/Kolkata" time zone, that is, "India Standard Time".
  4. Lines 14, 15, 16: Create a SimpleDateFormat instance. Tell it what pattern the string is in and what time zone it belongs to. Parse the date string and get a Date object.
  5. Lines 18 to 25: Print some details. It also prints the date in milliseconds. Why in milliseconds? Read Java Date Explanation
  6. Lines 26 to 30: Format the date to human readable format in India Standard Time and print.
  7. Lines 32 to 35: Format the date to human readable format in GMT and print.
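
The parse and format steps explained above can also be wrapped into two small helpers. The sketch below is an illustration only: the class name DateRoundTripSketch and its method names are made up, and it passes Locale.ENGLISH explicitly on the assumption that the month names in the input are English, so that parsing does not depend on the default locale of the machine.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateRoundTripSketch {

    static final String PATTERN = "MMMM dd, yyyy HH:mm";

    // Builds a formatter for the given time zone; Locale.ENGLISH is an assumption of this sketch.
    private static SimpleDateFormat formatterFor(String zoneId) {
        SimpleDateFormat formatter = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
        formatter.setTimeZone(TimeZone.getTimeZone(zoneId));
        return formatter;
    }

    // Interprets the text as wall-clock time in the given zone and returns the instant.
    static Date parse(String text, String zoneId) {
        try {
            return formatterFor(zoneId).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    // Renders the same instant as wall-clock time in the given zone.
    static String format(Date date, String zoneId) {
        return formatterFor(zoneId).format(date);
    }

    public static void main(String[] args) {
        Date date = parse("october 21, 2011 19:08", "Asia/Kolkata");
        System.out.println(date.getTime());               // prints 1319204280000
        System.out.println(format(date, "Asia/Kolkata")); // prints October 21, 2011 19:08
        System.out.println(format(date, "GMT"));          // prints October 21, 2011 13:38
    }
}
```

The Date itself never changes between the two format calls; only the time zone set on the formatter decides which wall-clock time is printed.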


Java Date Explanation

In this post, let's look into some basics of the Date class, which creates lots of confusion not only among beginners but even among experienced programmers. That is understandable, given that even the Sun folks did not get it right when they initially wrote the Date class. That is the reason the Date class has the explanation below in its Javadoc.

Prior to JDK 1.1, the class Date had two additional functions. It allowed the interpretation of dates as year, month, day, hour, minute, and second values. It also allowed the formatting and parsing of date strings. Unfortunately, the API for these functions was not amenable to internationalization.

What I am going to discuss below is very important. It will make your understanding of dates a whole lot easier. Pay special attention.

When you think of Date, think of milliseconds from the epoch (January 1, 1970, 00:00:00 GMT). Date in Java is just a wrapper around milliseconds. Let me clarify this with an example.
import java.util.Date;
import java.util.TimeZone;

public class DateTest {

    public static void main(String[] args) {
        TimeZone.setDefault(TimeZone.getTimeZone("GMT-5"));
        Date gmtMinusFiveDate = new Date();

        TimeZone.setDefault(TimeZone.getTimeZone("GMT+1"));
        Date gmtPlusOneDate = new Date();

        TimeZone.setDefault(TimeZone.getTimeZone("GMT+7"));
        Date gmtPlusSevenDate = new Date();

        System.out.println("GMT-5 date milliseconds: " + gmtMinusFiveDate.getTime());
        System.out.println("GMT+1 date milliseconds: " + gmtPlusOneDate.getTime());
        System.out.println("GMT+7 date milliseconds: " + gmtPlusSevenDate.getTime());
    }
}

Output
GMT-5 date milliseconds: 1319391112399
GMT+1 date milliseconds: 1319391112400
GMT+7 date milliseconds: 1319391112400

Above, we created Date objects while the default time zone was set to GMT-5, GMT+1, and GMT+7, and printed their millisecond values.

Surprised that all the Date instances print the same value for milliseconds? (The 1 millisecond difference between the 1st and 2nd lines of the output is only because the Date objects are created at slightly different moments in the program.) Shouldn't they be hours apart? Not really. This is how it works.

Let's say we created a Date instance on a machine running in the Arizona time zone (GMT-7) on Oct 21, 2011 at 06:00. Think of it as if the steps below happen (although the machine always keeps its time as milliseconds elapsed since Jan 1, 1970 midnight GMT)
  • Convert the time to GMT. So, Oct 21, 2011 06:00 will be converted to Oct 21, 2011 13:00 (remember GMT-7, so just add 7 hours)
  • Calculate the number of milliseconds elapsed from Jan 1, 1970 00:00 GMT to Oct 21, 2011 13:00 GMT
  • Create a Date object that is a wrapper around this milliseconds value.

That is the reason why, regardless of where the machine is running (or what the default time zone is), the Date object only represents the number of milliseconds since January 1, 1970, 00:00:00 GMT. That's it. That is the reason all three lines in the output show the same value.

A solid understanding of the above concept is essential when one is dealing with dates in Java. If you did not understand the above, go back and read it one more time. Once you understand this concept, everything else should be easy to understand.
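
The Arizona example can be checked with a few lines of code. The sketch below is an illustration under stated assumptions: the class name EpochMillisSketch and the helper millisOf are made up, and Arizona is modelled as the fixed offset "GMT-7". It builds the same instant twice, once as 06:00 in GMT-7 and once as 13:00 in GMT, and compares the millisecond values.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class EpochMillisSketch {

    // Returns milliseconds since Jan 1, 1970 00:00:00 GMT for a wall-clock time in the given zone.
    // The month argument uses the Calendar constants (Calendar.OCTOBER and so on).
    static long millisOf(String zoneId, int year, int month, int day, int hour, int minute) {
        Calendar calendar = new GregorianCalendar(TimeZone.getTimeZone(zoneId));
        calendar.clear(); // reset all fields so seconds and milliseconds are zero
        calendar.set(year, month, day, hour, minute, 0);
        return calendar.getTimeInMillis();
    }

    public static void main(String[] args) {
        long arizona = millisOf("GMT-7", 2011, Calendar.OCTOBER, 21, 6, 0);
        long gmt = millisOf("GMT", 2011, Calendar.OCTOBER, 21, 13, 0);
        System.out.println(arizona);        // prints 1319202000000
        System.out.println(arizona == gmt); // prints true
    }
}
```

Both calls describe the same moment, so a Date created from either value wraps the same number.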

See my post about Parse String to Java Date and vice versa

Wednesday, October 12, 2011

Configure SSH for multiple remote machines without the need for defining parameters every time

I often connect to remote machines, and giving the SSH key location, hostname, username, and so on every time is painful. Isn't there a simple way to configure it?

Yes, there is.

1. Create a .ssh directory under home directory if it does not exist.
$ mkdir -p ~/.ssh

2. Open a file named config (create it if it does not exist)
$ vi ~/.ssh/config

3. Configure hosts in this file. Copy the following code and replace the values accordingly.
# Demo server configuration
Host demo
    HostName 12.14.134.123
    User john
    IdentityFile ~/.ssh/ssh_key

# Test server configuration
Host test
    HostName 12.14.134.124
    User john
    IdentityFile ~/.ssh/ssh_key


4. Connect to the remote machine.
$ ssh demo

5. Copy a file from the demo server to your machine.
$ scp demo:~/file1.txt ~/Downloads/

MapReduce Simplified

What is it? 

MapReduce is a programming model that simplifies processing huge amounts of data on a large number of machines. It was introduced in a paper published by Google's Jeffrey Dean and Sanjay Ghemawat. More details at http://labs.google.com/papers/mapreduce.html.

Why?

Programmers without any experience in parallel programming and distributed systems can easily write programs that process huge data sets on a large number of machines using this model.

When?

When you need to process lots of data.

How?

Write map and reduce functions and feed them to the MapReduce framework. The framework takes care of slicing and distributing the work to multiple machines, processing, handling failures, and giving the result back.

Since a picture is worth a thousand words, let's see how we can take 1,000,000 text documents, look through them, and find how many times each word is used. Actually, wait a minute: to keep this example simple, let's do just 2 documents. But huge data is where this programming model shines. I borrowed this example from the Hadoop tutorial at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html


1) In the first step, we feed these two documents to the master node. The master node splits the data and sends it to worker nodes, which perform the map function to process the data.

2) Now, the worker nodes process the data and give the result back to the master node. The master node then collects the data from all the worker nodes, arranges it by key, and sends it to other worker nodes to perform the reduce function.

3) After performing the reduce function, the worker nodes give the result back to the master node. Now, the master node processes the resulting data and gives the final result (which is the count of all words from both documents).
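
The three steps above can be sketched on a single machine in plain Java. This is not Hadoop code and it does not distribute anything; it only shows the shape of the map, group-by-key, and reduce phases. The class name WordCountSketch, its method names, and the two sample documents are assumptions made for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {

    // Map phase: turn one document into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : document.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs map on every document, groups the pairs by key, then reduces each group.
    static Map<String, Integer> wordCount(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                grouped.computeIfAbsent(pair.getKey(), key -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical sample documents.
        List<String> documents = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(wordCount(documents)); // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In a real framework, the calls to map and reduce would run on different worker machines, and the grouping step in the middle is what the framework does between the two phases.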

When you are dealing with a huge amount of data, leveraging a cluster of machines with the MapReduce programming model is an efficient way to process it.



