Using wget and html2text
Date: 2010-04-01 Source: kangle000
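I have been learning shell scripting lately, using html2text to convert web pages downloaded with wget into plain text. I found that html2text is not well suited to the front pages of the major websites: some of their dynamic content, such as news items, cannot be downloaded with wget at all. Also, when the URL passed to wget contains special characters such as &, each one must be escaped with '\' (or the whole URL must be quoted); otherwise the shell interprets the character itself.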
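A minimal sketch of the escaping point above. The URL is hypothetical and only serves as an illustration; the wget call is left commented out because it needs network access. It shows that a backslash-escaped & and a single-quoted URL give the shell the same string.

```shell
#!/bin/sh
# Hypothetical URL, for illustration only.
# Left unescaped, & would tell the shell to run the command in the
# background and treat the rest of the line as a second command.
url_escaped=http://example.com/search?q=shell\&page=2
url_quoted='http://example.com/search?q=shell&page=2'

# Both spellings produce the same single string.
if [ "$url_escaped" = "$url_quoted" ]; then
    echo "same URL: $url_quoted"
fi

# Either form can then be handed to wget (requires network access):
# wget -O page.html "$url_quoted"
```

Quoting the whole URL is usually easier to read than escaping each character, especially when the query string has several parameters.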
Here is the usage text for html2text, with a few notes of my own added.
Code:
This is html2text, version 1.3.2a
Usage:
html2text -help
html2text -version
html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] \
[ -rcfile <file> ] [ -style ( compact | pretty ) ] [ -width <w> ] \
[ -o <file> ] [ -nobs ] [ -ascii ] [ <input-url> ] ...
Formats HTML document(s) read from <input-url> or STDIN and generates ASCII
text.
-help Print this text and exit
(Note: displays this help text and exits.)
-version Print program version and copyright notice
-unparse Generate HTML instead of ASCII output
-check Do syntax checking only
(Note: performs a syntax check only.)
-debug-scanner Report parsed tokens on STDERR (debugging)
-debug-parser Report parser activity on STDERR (debugging)
-rcfile <file> Read <file> instead of "$HOME/.html2textrc"
-style compact Create a "compact" output format (default)
-style pretty Insert some vertical space for nicer output
-width <w> Optimize for screen widths other than 79
-o <file> Redirect output into <file>
(Note: writes the converted output to <file> instead of standard output.)
-nobs Do not use backspaces for boldface and underlining
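(Note: this option is worth using. Without it, the converted file contains a lot of useless characters, namely the backspace sequences used for boldface and underlining.)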
-ascii Use plain ASCII for output instead of ISO-8859-1
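As a rough sketch of how these options fit together, the script below converts a small local HTML file using only flags from the usage text above (-nobs, -width, -o). The file names sample.html and sample.txt and the function name fetch_as_text are made up for this example, and the conversion is skipped if html2text is not installed. The function also assumes wget's -q and -O - options to send the downloaded page to standard output.

```shell
#!/bin/sh
# Hypothetical helper: download a page with wget and print it as plain text.
# The function is defined here but not called, since it needs network access.
fetch_as_text() {
    # -q: quiet; -O -: write the page to standard output
    # html2text reads HTML from STDIN when no input is given
    wget -q -O - "$1" | html2text -nobs
}

# Create a small local HTML file to convert (hypothetical file name).
cat > sample.html <<'EOF'
<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>
EOF

if command -v html2text >/dev/null 2>&1; then
    # -nobs:     no backspace sequences for boldface and underlining
    # -width 72: format for a 72-column screen instead of 79
    # -o:        write the result to sample.txt instead of standard output
    html2text -nobs -width 72 -o sample.txt sample.html
else
    echo "html2text is not installed; skipping conversion" >&2
fi
```

To convert a live page, the helper would be called with a quoted URL, for example fetch_as_text 'http://example.com/index.html?a=1&b=2' > page.txt (a hypothetical address).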