How to call Python from Java to download web pages
2014-08-17

This post is based on: http://tonl.iteye.com/blog/1918245

Python version: 2.7, 64-bit Windows build.

Download Python: http://www.python.org/getit/

Choose Python 2.7.5 Windows X86-64 Installer (Windows AMD64 / Intel 64 / X86-64 binary [1] -- does not include source) and install it.

First, write the following spider.py script:

# -*- coding: utf-8 -*-
from urllib import urlopen
import os
import sys


class Spider:
    """Download the web pages listed in the given file."""

    def __init__(self, filename, downloadPath):
        """Store the list file and download directory; exit if either is missing."""
        if not os.path.isfile(filename):
            print "the given file does not exist, the program will exit"
            sys.exit(1)
        if not os.path.isdir(downloadPath):
            print "the given download path does not exist, the program will exit"
            sys.exit(1)
        self.fname = filename
        self.dpath = downloadPath

    def download(self):
        """Download each URL in the list file, one URL per line."""
        fp = open(self.fname, "r")
        for line in fp:
            url = line.strip()  # drop the trailing newline before using the URL
            if not url:
                continue  # skip blank lines
            # Build the file name from the alphanumeric characters of the URL.
            if "html" in url:
                tempname = filter(str.isalnum, url).replace("html", ".html")
            else:
                tempname = filter(str.isalnum, url) + ".html"
            self.download_html(url, os.path.join(self.dpath, tempname))
        fp.close()

    def download_html(self, website, filename):
        """Fetch the page at `website` and append it to `filename`."""
        response = urlopen(website)
        data = response.read()
        fp = open(filename, "a+")
        fp.write(data)
        fp.close()


def test():
    """Entry point: argv[1] is the URL list file, argv[2] is the download directory."""
    filename = sys.argv[1]
    downloadPath = sys.argv[2]
    spider = Spider(filename, downloadPath)
    spider.download()


if __name__ == "__main__":
    test()

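The script above takes two arguments. The first is a file listing the addresses of the pages to download, one URL per line, typically like this (websites.txt):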

http://blog.csdn.net/fansy1990
http://www.baidu.com

The second argument is the directory where the downloaded pages are saved.

You can then run the script from the command line:

python D:\spider.py D:\websites.txt D:\download_tmp

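Then look in D:\download_tmp for the downloaded files. If they are there, the setup is correct.

Finally, write the following Java program. It requires importing the jython-*.jar package (the author downloaded version 2.2):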

package test;

import java.io.IOException;

public class PyTest {
    /**
     * @param args
     * @throws IOException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws IOException, InterruptedException {
        // Backslashes in Windows paths must be escaped inside Java string literals.
        String py_path = "D:\\spider.py";
        String websites = "D:\\websites.txt";
        String outDir = "D:\\tmp";
        // Start the Python interpreter as a child process and wait for it to finish.
        Process pr = Runtime.getRuntime().exec("python " + py_path + " " + websites + " " + outDir);
        pr.waitFor();
        System.out.println("done ...");
    }
}

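One thing to keep in mind with the program above: it neither shows what the Python script prints nor looks at its exit code, so "done ..." appears even when the script fails. Here is a minimal sketch of a variant built on ProcessBuilder that forwards the script's output and returns the exit code. The class name PyRunner and its method names are hypothetical (they are not from the original post), and the interpreter name and paths are whatever you pass in.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post.
class PyRunner {

    /**
     * Builds the command as a list of separate arguments. Passing the
     * arguments separately (instead of one concatenated string) keeps
     * paths that contain spaces intact.
     */
    static List<String> buildCommand(String interpreter, String script,
                                     String listFile, String outDir) {
        return Arrays.asList(interpreter, script, listFile, outDir);
    }

    /**
     * Runs the command and returns the exit code of the child process.
     * inheritIO() lets the child share this JVM's stdin/stdout/stderr,
     * so anything the script prints shows up in the console.
     */
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.inheritIO();
        Process pr = pb.start();
        return pr.waitFor();
    }
}
```

With the paths used in this post, the call would look like `int code = PyRunner.run(PyRunner.buildCommand("python", "D:\\spider.py", "D:\\websites.txt", "D:\\tmp"));`. A nonzero code generally signals a failure, provided the script exits with one.
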
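To run the program above, you need to set the Environment property in the Eclipse run configuration: add a PATH variable whose value is the Python installation directory.

After running it, you will see this message: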
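
Whether the PATH setting took effect can also be checked from Java before launching the script. The sketch below searches a PATH-style string for the interpreter and returns its full location. PythonLocator and findOnPath are hypothetical names, and python.exe as the interpreter's file name on Windows is an assumption.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of the original post.
class PythonLocator {

    /**
     * Returns the first regular file named exeName found in the
     * directories listed in pathValue (separated by File.pathSeparator),
     * or null if pathValue is null or no directory contains it.
     */
    static File findOnPath(String pathValue, String exeName) {
        if (pathValue == null) {
            return null;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // ignore empty entries such as ";;"
            }
            File candidate = new File(dir, exeName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }
}
```

For example, `File python = PythonLocator.findOnPath(System.getenv("PATH"), "python.exe");` yields null when the Python directory is missing from the PATH seen by the Java process. A non-null result can be passed to the launcher as an absolute path via `python.getAbsolutePath()`.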

*sys-package-mgr*: can't create package cache dir, '*\jython-2.2.jar\cachedir\packages'

You can ignore this message; it does not affect the program.