machine1 (192.168.0.1): namenode, jobtracker
machine2-5 (192.168.0.2-5): datanodes
1. Add a user and create working directories for Nutch and Hadoop (apply this step to machine1-5).
sudo useradd -d /home/nutch nutch
sudo mkdir /home/search
sudo chown -R hadoop:hadoop /home/search
Switch to the hadoop user (with su or sudo) and change to the home directory.
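For example, a minimal sketch (assuming the hadoop account already exists on each machine):
su - hadoop
cd ~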
2. Download and install Java (apply this step to machine1-5).
Download the latest Java package from the link below:
http://www.java.com/en/download/manual.jsp#lin
chmod a+x jdk-6u14-linux-i586-rpm.bin
./jdk-6u14-linux-i586-rpm.bin
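To confirm which JDK the system now picks up (an optional check, not in the original steps):
java -version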
3. Modify /etc/hosts (apply this step to machine1-5).
Add these entries to /etc/hosts:
192.168.0.1 machine1
192.168.0.2 machine2
192.168.0.3 machine3
192.168.0.4 machine4
192.168.0.5 machine5
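As an optional sanity check (not part of the original steps), verify that the names resolve and the nodes are reachable:
for h in machine1 machine2 machine3 machine4 machine5; do ping -c 1 $h; done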
4. Download Hadoop-0.19.1
Log in to machine1.
wget http://www.meisei-u.ac.jp/mirror/apache/dist/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz
5. Download Nutch-1.0
wget http://www.meisei-u.ac.jp/mirror/apache/dist/lucene/nutch/nutch-1.0.tar.gz
6. Unpack hadoop-0.19.1.tar.gz and nutch-1.0.tar.gz
tar xvfz hadoop-0.19.1.tar.gz
tar xvfz nutch-1.0.tar.gz
mv ~/hadoop-0.19.1 /home/search/hadoop
mv ~/nutch-1.0 /home/search/nutch
7. Edit hadoop-env.sh in the Hadoop directory.
cd /home/search/hadoop
vi conf/hadoop-env.sh
Append the following to hadoop-env.sh:
export JAVA_HOME=/usr/java/jdk1.6.0_14
8. Export the environment variables.
export HADOOP_HOME=/home/search/hadoop
export HADOOP_CONF_DIR=/home/search/hadoop/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/home/search/nutch
export NUTCH_CONF_DIR=/home/search/nutch/conf
source conf/hadoop-env.sh
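As an optional check, confirm the variables are set in the current shell:
echo $JAVA_HOME $HADOOP_HOME $HADOOP_CONF_DIR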
9. Edit the hadoop-site.xml.
vi conf/hadoop-site.xml
The contents are as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.0.1:9000/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:9001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/hadoop-${user.name}</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:0</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
<property>
<name>dfs.block.size</name>
<value>5120000</value>
</property>
</configuration>
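Before distributing this file it can be worth confirming the XML is well formed; for example, if xmllint is available:
xmllint --noout conf/hadoop-site.xml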
10. Edit conf/masters and conf/slaves.
The contents of conf/masters:
machine1
The contents of conf/slaves:
machine2
machine3
machine4
machine5
11. Deploy the Hadoop environment to the datanodes.
scp -r /home/search/hadoop machine2:/home/search
scp -r /home/search/hadoop machine3:/home/search
scp -r /home/search/hadoop machine4:/home/search
scp -r /home/search/hadoop machine5:/home/search
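Note: start-all.sh launches the daemons on the slaves over ssh, so the hadoop user on machine1 needs passwordless SSH to machine2-5. A minimal sketch, assuming OpenSSH and the same account on every node:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop@machine2
ssh-copy-id hadoop@machine3
ssh-copy-id hadoop@machine4
ssh-copy-id hadoop@machine5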
12. Start up Hadoop.
bin/hadoop namenode -format
bin/start-all.sh
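To verify the cluster came up, an optional check is to list the running Java daemons and the DFS report:
jps
bin/hadoop dfsadmin -report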
13. Edit the nutch-site.xml.
cd /home/search/nutch
vi conf/nutch-site.xml
The contents are as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch</value>
<description>HTTP 'User-Agent' request header. </description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch_Test</value>
<description>Further description</description>
</property>
<property>
<name>http.agent.url</name>
<value>localhost</value>
<description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
<name>http.agent.email</name>
<value>test@test.org.tw</value>
<description>An email address
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>nutch</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>plugin.folders</name>
<value>/home/search/nutch/plugins</value>
<description>Directories where nutch plugins are located. </description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description> Regular expression naming plugin directory names</description>
</property>
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description> </description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>indexer.mergeFactor</name>
<value>500</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2, higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>
<property>
<name>indexer.minMergeDocs</name>
<value>500</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>
</configuration>
14. Edit the crawl-urlfilter.txt.
vi conf/crawl-urlfilter.txt
The contents are as follows:
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.*
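The rules above accept every URL that is not explicitly skipped. If you prefer to keep the crawl inside one site (optional; apache.org here is only an example), you could replace the final +.* line with a domain rule such as:
+^http://([a-z0-9]*\.)*apache\.org/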
15. Create a directory with a flat file of root URLs.
mkdir urls
echo "http://www.apache.org" >> urls/urls.txt
16. Copy the seed list to the Hadoop DFS.
../hadoop/bin/hadoop fs -put urls urls
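An optional check that the seed list landed in DFS:
../hadoop/bin/hadoop fs -ls urls
../hadoop/bin/hadoop fs -cat urls/urls.txt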
17. Copy conf/nutch-site.xml and conf/crawl-urlfilter.txt to the Hadoop conf folder.
cp conf/nutch-site.xml conf/crawl-urlfilter.txt /home/search/hadoop/conf
18. Deploy the Nutch configuration files to the datanodes.
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine2:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine3:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine4:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine5:/home/search/hadoop/conf
19. Start crawling.
cd /home/search/nutch
../hadoop/bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 5 -topN 10000
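When the job finishes, the crawl data sits in DFS under the crawl directory. Two optional checks (CrawlDbReader is the class behind bin/nutch readdb in Nutch 1.0; adjust if your build differs):
../hadoop/bin/hadoop fs -ls crawl
../hadoop/bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.CrawlDbReader crawl/crawldb -stats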