This tutorial uses five CentOS Linux servers to build a Hadoop cluster.
machine1 (192.168.0.1): namenode, jobtracker
machine2-5 (192.168.0.2-5): datanodes
1. Add a user and create working directories for Nutch and Hadoop.
sudo useradd -d /home/nutch nutch
sudo mkdir /home/search
sudo chown -R nutch:nutch /home/search
Switch to the nutch user (su - nutch) and change to its home directory.
Apply this step to machine1-5.
2. Download and install Java
Download the latest Java package from the link below:
http://www.java.com/en/download/manual.jsp#lin
chmod a+x jdk-6u14-linux-i586-rpm.bin
./jdk-6u14-linux-i586-rpm.bin
Apply this step to machine1-5.
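To confirm the JDK installed correctly (the exact version string and install path may differ on your system), check the version and the install location:
java -version
ls /usr/java/
The directory reported here (e.g. /usr/java/jdk1.6.0_14) is the value to use for JAVA_HOME in step 7.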
3. Modify /etc/hosts
Add these hosts to /etc/hosts (IP address first, then hostname):
192.168.0.1 machine1
192.168.0.2 machine2
192.168.0.3 machine3
192.168.0.4 machine4
192.168.0.5 machine5
Apply this step to machine1-5.
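As a quick check, verify that each hostname resolves from every machine, for example:
ping -c 1 machine2
getent hosts machine1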
4. Download Hadoop-0.19.1
Log in to machine1 and download the package:
wget http://www.meisei-u.ac.jp/mirror/apache/dist/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz
5. Download Nutch-1.0
wget http://www.meisei-u.ac.jp/mirror/apache/dist/lucene/nutch/nutch-1.0.tar.gz
6. Unpack hadoop-0.19.1.tar.gz and nutch-1.0.tar.gz
tar xvfz hadoop-0.19.1.tar.gz
tar xvfz nutch-1.0.tar.gz
mv ~/hadoop-0.19.1 /home/search/hadoop
mv ~/nutch-1.0 /home/search/nutch
cd /home/search/hadoop
7. Edit hadoop-env.sh in the Hadoop directory.
vi conf/hadoop-env.sh
Append the following to hadoop-env.sh.
export JAVA_HOME=/usr/java/jdk1.6.0_14
export HADOOP_HOME=/home/search/hadoop
export HADOOP_CONF_DIR=/home/search/hadoop/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/home/search/nutch
export NUTCH_CONF_DIR=/home/search/nutch/conf
8. Export the environment variables.
source conf/hadoop-env.sh
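To confirm the variables are set in the current shell, echo a couple of them:
echo $JAVA_HOME
echo $HADOOP_HOME
Note that source only affects the current shell session; the Hadoop startup scripts read conf/hadoop-env.sh themselves each time they run.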
9. Edit the hadoop-site.xml
vi conf/hadoop-site.xml
The contents are as follows.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.0.1:9000/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:9001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/hadoop-${user.name}</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:0</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
<property>
<name>dfs.block.size</name>
<value>5120000</value>
</property>
</configuration>
10. Edit conf/masters and conf/slaves
The contents of conf/masters:
machine1
The contents of conf/slaves:
machine2
machine3
machine4
machine5
11. Deploy the Hadoop environment to the datanodes.
scp -r /home/search/hadoop machine2:/home/search
scp -r /home/search/hadoop machine3:/home/search
scp -r /home/search/hadoop machine4:/home/search
scp -r /home/search/hadoop machine5:/home/search
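Equivalently, the four copies can be collapsed into one loop (a minimal sketch, assuming the machine2-5 hostnames from /etc/hosts):
for i in 2 3 4 5; do scp -r /home/search/hadoop machine$i:/home/search; done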
12. Start up Hadoop.
bin/hadoop namenode -format
bin/start-all.sh
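To check that the cluster came up, the standard Hadoop 0.19 tools can be used, for example:
jps
bin/hadoop dfsadmin -report
jps on machine1 should list at least NameNode and JobTracker; on machine2-5 it should list DataNode and TaskTracker. dfsadmin -report should show the live datanodes. By default the namenode web UI is at http://192.168.0.1:50070/ and the jobtracker UI at http://192.168.0.1:50030/.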
13. Edit the nutch-site.xml
cd /home/search/nutch
vi conf/nutch-site.xml
The contents are as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch</value>
<description>HTTP 'User-Agent' request header. </description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch_Test</value>
<description>Further description</description>
</property>
<property>
<name>http.agent.url</name>
<value>localhost</value>
<description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
<name>http.agent.email</name>
<value>test@test.org.tw</value>
<description>An email address
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>nutch</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>plugin.folders</name>
<value>/home/search/nutch/plugins</value>
<description>Directories where nutch plugins are located. </description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description> Regular expression naming plugin directory names</description>
</property>
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description> </description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>indexer.mergeFactor</name>
<value>500</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2, higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>
<property>
<name>indexer.minMergeDocs</name>
<value>500</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>
</configuration>
14. Edit the crawl-urlfilter.txt
vi conf/crawl-urlfilter.txt
The contents are as follows.
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.*
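If the crawl should be limited to a particular site instead of accepting everything, replace the final +.* rule with a domain-specific pattern; a sketch for restricting to apache.org (matching the seed URL used in step 15) would be:
# accept urls in the apache.org domain
+^http://([a-z0-9]*\.)*apache.org/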
15. Create a directory with a flat file of root urls.
mkdir urls
echo "http://www.apache.org" >> urls/urls.txt
16. Copy the seed list to Hadoop DFS
../hadoop/bin/hadoop fs -put urls urls
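To confirm the seed list landed in HDFS, list it and print it back:
../hadoop/bin/hadoop fs -ls urls
../hadoop/bin/hadoop fs -cat urls/urls.txt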
17. Copy conf/nutch-site.xml and conf/crawl-urlfilter.txt to the Hadoop conf directory.
cp conf/nutch-site.xml conf/crawl-urlfilter.txt /home/search/hadoop/conf
18. Deploy the Nutch configuration files to the datanodes.
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine2:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine3:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine4:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine5:/home/search/hadoop/conf
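Optionally, verify the files arrived on each datanode, for example:
ssh machine2 ls /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt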
19. Start crawling.
cd /home/search/nutch
../hadoop/bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 5 -topN 10000
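When the job finishes, the crawl output lives in HDFS under the crawl directory. Two checks (a sketch, using standard Hadoop commands and Nutch 1.0's CrawlDbReader class):
../hadoop/bin/hadoop fs -ls crawl
../hadoop/bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.CrawlDbReader crawl/crawldb -stats
The first should list the crawldb, linkdb, segments, and index directories created by the crawl; the second prints CrawlDb statistics (fetched/unfetched URL counts) so you can confirm the crawl made progress.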