Monday, October 31, 2011

Creating a Nutch Distribution with Custom Plugin Code and Running It

In this post, we look at how to build Nutch with custom plugin code. Look here to see how to create a custom plugin. Once plugin development is done, follow the steps below to make a distribution:
  1. Go to the Nutch project folder (let's assume it is "~/workspaces/nutch").
  2. Run "ant tar"
  3. The above command creates "~/workspaces/nutch/dist", where you can find the distribution nutch-1.3.tar.gz.
  4. nutch-1.3.tar.gz is also unpacked into the folder "~/workspaces/nutch/dist/nutch-1.3".
  5. Change directory to "~/workspaces/nutch/dist/nutch-1.3/runtime/local".
  6. Verify that the folder "urls" exists in this directory. If not, create it. Add a text file listing some seed URLs to it.
  7. While developing, the "plugin.folders" property in "conf/nutch-site.xml" has a value of "./src/plugins". This does not work when you are running from the distribution. Change the value of this property to wherever the plugin jars are located. By default, these jars are in the "~/workspaces/nutch/dist/nutch-1.3/runtime/local/plugins" folder. Since we are already in the folder "local", change the value of the "plugin.folders" property in "conf/nutch-site.xml" to "./plugins". If you forget to do this, you might see an error like:
     Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
         at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
  8. Now run Nutch with a command like "bin/nutch crawl urls -dir crawl -depth 2 -topN 4". A good explanation of what each part of this command means can be found here.
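
The steps above can be sketched as a small shell script. This is only a rough sketch, not a definitive build script: it assumes the paths used in this post, that "ant" is on the PATH, and that the "plugin.folders" property in "conf/nutch-site.xml" is written as a <name> element directly followed by its <value> element. The names NUTCH_SRC, NUTCH_LOCAL, plugin_folders_ok, build_and_crawl, the file name "seed.txt" and the seed URL are made up for illustration.

```shell
#!/bin/sh
# Sketch of steps 1-8. Paths follow the post; adjust them to your checkout.
NUTCH_SRC="${NUTCH_SRC:-$HOME/workspaces/nutch}"        # step 1: project folder
NUTCH_LOCAL="$NUTCH_SRC/dist/nutch-1.3/runtime/local"   # step 5: local runtime

# Hypothetical helper: succeeds only if plugin.folders in the given
# file is set to ./plugins. Whitespace is stripped first so the check
# works whether <name> and <value> share a line or not.
plugin_folders_ok() {
    tr -d ' \t\n' < "$1" |
        grep -qF '<name>plugin.folders</name><value>./plugins</value>'
}

# Hypothetical wrapper around the commands from the list above.
build_and_crawl() {
    cd "$NUTCH_SRC" || return 1
    ant tar || return 1                                 # steps 2-4: build the distribution
    cd "$NUTCH_LOCAL" || return 1                       # step 5
    mkdir -p urls                                       # step 6: make sure "urls" exists
    # Example seed only (an assumption); replace with your own URLs.
    [ -s urls/seed.txt ] || echo "http://nutch.apache.org/" > urls/seed.txt
    if ! plugin_folders_ok conf/nutch-site.xml; then    # step 7: check the config
        echo "Set plugin.folders to ./plugins in conf/nutch-site.xml" >&2
        return 1
    fi
    bin/nutch crawl urls -dir crawl -depth 2 -topN 4    # step 8: run the crawl
}
```

Running the script on its own only defines the variables and functions. Call build_and_crawl afterwards to run the steps. The script deliberately stops with a message instead of editing "conf/nutch-site.xml" for you, so the change from step 7 stays a manual one.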
That's it... 
