machine1 (192.168.0.1): namenode, jobtracker
machine2-5 (192.168.0.2-5): datanodes
1. Add a user and create working directories for Nutch and Hadoop (apply this step to machine1-5).
sudo useradd -d /home/nutch nutch
sudo mkdir /home/search
sudo chown -R hadoop:hadoop /home/search
Switch to the hadoop user (with su or sudo) and change to the home directory.
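For example, a minimal sketch (assuming the hadoop account already exists on each machine):
su - hadoop
cd ~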
2. Download and install Java (apply this step to machine1-5).
Download the latest Java package from the link below:
http://www.java.com/en/download/manual.jsp#lin
chmod a+x jdk-6u14-linux-i586-rpm.bin
./jdk-6u14-linux-i586-rpm.bin
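To confirm which JDK the system now picks up (an optional check, not in the original steps):
java -version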
3. Modify /etc/hosts (apply this step to machine1-5).
Add these entries to /etc/hosts:
192.168.0.1 machine1
192.168.0.2 machine2
192.168.0.3 machine3
192.168.0.4 machine4
192.168.0.5 machine5
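As an optional sanity check (not part of the original steps), verify that the names resolve and the nodes are reachable:
for h in machine1 machine2 machine3 machine4 machine5; do ping -c 1 $h; done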
4. Download Hadoop-0.19.1
Log in to machine1.
wget http://www.meisei-u.ac.jp/mirror/apache/dist/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz
5. Download Nutch-1.0
wget http://www.meisei-u.ac.jp/mirror/apache/dist/lucene/nutch/nutch-1.0.tar.gz
6. Unpack hadoop-0.19.1.tar.gz and nutch-1.0.tar.gz
tar xvfz hadoop-0.19.1.tar.gz
tar xvfz nutch-1.0.tar.gz
mv ~/hadoop-0.19.1 /home/search/hadoop
mv ~/nutch-1.0 /home/search/nutch
7. Edit hadoop-env.sh in the Hadoop directory.
cd /home/search/hadoop
vi conf/hadoop-env.sh
Append the following to hadoop-env.sh:
export JAVA_HOME=/usr/java/jdk1.6.0_14
8. Export the environment variables.
export HADOOP_HOME=/home/search/hadoop
export HADOOP_CONF_DIR=/home/search/hadoop/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/home/search/nutch
export NUTCH_CONF_DIR=/home/search/nutch/conf
source conf/hadoop-env.sh
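As an optional check, confirm the variables are set in the current shell:
echo $JAVA_HOME $HADOOP_HOME $HADOOP_CONF_DIR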
9. Edit the hadoop-site.xml.
vi conf/hadoop-site.xml
The contents are as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.0.1:9000/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:9001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/hadoop-${user.name}</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:0</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
<property>
<name>dfs.block.size</name>
<value>5120000</value>
</property>
</configuration>
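Before distributing this file it can be worth confirming the XML is well formed; for example, if xmllint is available:
xmllint --noout conf/hadoop-site.xml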
10. Edit conf/masters and conf/slaves.
The contents of conf/masters:
machine1
The contents of conf/slaves:
machine2
machine3
machine4
machine5
11. Deploy the Hadoop environment to the datanodes.
scp -r /home/search/hadoop machine2:/home/search
scp -r /home/search/hadoop machine3:/home/search
scp -r /home/search/hadoop machine4:/home/search
scp -r /home/search/hadoop machine5:/home/search
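Note: start-all.sh launches the daemons on the slaves over ssh, so the hadoop user on machine1 needs passwordless SSH to machine2-5. A minimal sketch, assuming OpenSSH and the same account on every node:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop@machine2
ssh-copy-id hadoop@machine3
ssh-copy-id hadoop@machine4
ssh-copy-id hadoop@machine5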
12. Start up Hadoop.
bin/hadoop namenode -format
bin/start-all.sh
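To verify the cluster came up, an optional check is to list the running Java daemons and the DFS report:
jps
bin/hadoop dfsadmin -report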
13. Edit the nutch-site.xml.
cd /home/search/nutch
vi conf/nutch-site.xml
The contents are as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch</value>
<description>HTTP 'User-Agent' request header. </description>
</property>
<property>
<name>http.agent.description</name>
<value>Nutch_Test</value>
<description>Further description</description>
</property>
<property>
<name>http.agent.url</name>
<value>localhost</value>
<description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
<name>http.agent.email</name>
<value>test@test.org.tw</value>
<description>An email address
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>nutch</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>plugin.folders</name>
<value>/home/search/nutch/plugins</value>
<description>Directories where nutch plugins are located. </description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description> Regular expression naming plugin directory names</description>
</property>
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description> </description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>indexer.mergeFactor</name>
<value>500</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2, higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>
<property>
<name>indexer.minMergeDocs</name>
<value>500</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>
</configuration>
14. Edit the crawl-urlfilter.txt.
vi conf/crawl-urlfilter.txt
The contents are as follows:
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.*
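The rules above accept every URL that is not explicitly skipped. If you prefer to keep the crawl inside one site (optional; apache.org here is only an example), you could replace the final +.* line with a domain rule such as:
+^http://([a-z0-9]*\.)*apache\.org/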
15. Create a directory with a flat file of root URLs.
mkdir urls
echo "http://www.apache.org" >> urls/urls.txt
16. Copy the seed list to the Hadoop DFS.
../hadoop/bin/hadoop fs -put urls urls
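An optional check that the seed list landed in DFS:
../hadoop/bin/hadoop fs -ls urls
../hadoop/bin/hadoop fs -cat urls/urls.txt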
17. Copy conf/nutch-site.xml and conf/crawl-urlfilter.txt to the Hadoop conf folder.
cp conf/nutch-site.xml conf/crawl-urlfilter.txt /home/search/hadoop/conf
18. Deploy the Nutch configuration files to the datanodes.
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine2:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine3:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine4:/home/search/hadoop/conf
scp /home/search/hadoop/conf/nutch-site.xml /home/search/hadoop/conf/crawl-urlfilter.txt machine5:/home/search/hadoop/conf
19. Start crawling.
cd /home/search/nutch
../hadoop/bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 5 -topN 10000
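When the job finishes, the crawl data sits in DFS under the crawl directory. Two optional checks (CrawlDbReader is the class behind bin/nutch readdb in Nutch 1.0; adjust if your build differs):
../hadoop/bin/hadoop fs -ls crawl
../hadoop/bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.CrawlDbReader crawl/crawldb -stats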