
Writing an HTML File Analysis Program in Java

 I. Overview

    

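    At the heart of a Web server is the correct analysis of the tags in an HTML file; likewise, the interpreter of a programming language analyzes the reserved words in a source file before interpreting it. In practice we often need to run a keyword analysis on some particular type of file. For example, to download an HTML file together with its related files (.gif, .class, and so on), the tags in the HTML file must be separated out so that the required file names and directories can be found. Before Java, this kind of work meant examining every character of the file to pick out the needed parts, which took a lot of code and was error-prone. In a recent project the author used Java's input stream class StreamTokenizer to analyze HTML files, with good results. The goal here is to download an HTML file from a known Web page, analyze it, and then download the HTML files (if the page uses frames), image files, and class (Java Applet) files that the page contains.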

    

    II. StreamTokenizer

    

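    StreamTokenizer, the tokenizing input stream, turns an input stream into a stream of tokens. Tokens fall into three kinds: words (multi-character tokens), single-character tokens, and whitespace (which includes Java and C/C++ comments).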

    

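    The constructor of the StreamTokenizer class is: StreamTokenizer(InputStream in). This form has been deprecated since JDK 1.1 in favor of StreamTokenizer(Reader r), which is the one the code below uses.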

    

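    The class has several public instance variables: ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value. To get the characters between tokens (that is, between HTML tags), read sval. Call nextToken() to advance to the next token. nextToken() returns an int; apart from the value of an ordinary single character (such as '<'), there are four possible return values: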

    

    StreamTokenizer.TT_NUMBER: the token is a number; its value is a double and can be read from the instance variable nval.

    

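    StreamTokenizer.TT_WORD: the token is a non-numeric word (other characters may be part of it, depending on the syntax table); the word can be read from the instance variable sval.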

    

    StreamTokenizer.TT_EOL: the token is an end-of-line marker.

    

    StreamTokenizer.TT_EOF: returned by nextToken() when the end of the stream has been reached.
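    The return values listed above can be seen in a short loop. The following is only a minimal sketch using the default syntax table; the class name TokenTypeDemo and the method classify are made up for this illustration and are not part of the original program. Note that TT_EOL is only reported after eolIsSignificant(true) has been called.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original article.
class TokenTypeDemo {
    // Labels every token of the input according to the value returned by nextToken().
    static List<String> classify(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true); // report line ends as TT_EOL instead of skipping them
        List<String> out = new ArrayList<>();
        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            switch (type) {
                case StreamTokenizer.TT_NUMBER:
                    out.add("NUMBER:" + st.nval); // numeric value is a double
                    break;
                case StreamTokenizer.TT_WORD:
                    out.add("WORD:" + st.sval);   // word text is in sval
                    break;
                case StreamTokenizer.TT_EOL:
                    out.add("EOL");
                    break;
                default:
                    // an ordinary character is returned as its own value
                    out.add("CHAR:" + (char) type);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // prints [WORD:width, NUMBER:42.0, EOL, CHAR:=, WORD:ok]
        System.out.println(classify("width 42\n= ok"));
    }
}
```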

    

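    Before the first call to nextToken(), set up the stream's syntax table so that the parser can tell different characters apart. The whitespaceChars(int low, int hi) method defines the range of characters that carry no meaning (whitespace). The wordChars(int low, int hi) method defines the range of characters that make up words.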
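    As a rough illustration of setting up a syntax table, the sketch below splits a comma-separated list of file names. The class SyntaxTableDemo, the method tokens, and the sample input are assumptions made for this example only. Because resetSyntax() also switches off number parsing, names such as a.gif are kept whole as words.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original article.
class SyntaxTableDemo {
    static List<String> tokens(String text) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();           // start from a table in which every character is ordinary
        st.whitespaceChars(0, ' '); // control characters and the space separate tokens
        st.wordChars('!', 255);     // every other character may be part of a word
        st.ordinaryChar(',');       // except ',', which becomes a single-character token
        List<String> out = new ArrayList<>();
        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (type == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            } else {
                out.add(String.valueOf((char) type));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // prints [a.gif, ,, b.class, ,, c]
        System.out.println(tokens("a.gif, b.class ,c"));
    }
}
```

    The order of the calls matters: ordinaryChar(',') comes last so that it overrides the word range set just before it.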

    

    III. Program Implementation

    

    1. Implementing the HtmlTokenizer class

    

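    Before a token stream can be analyzed, its syntax table must be set up; in this example that means letting the program tell which words are HTML tags. The token stream class below, a subclass of StreamTokenizer, is defined for the HTML tags we need: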

    

    

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StreamTokenizer;

    class HtmlTokenizer extends StreamTokenizer {
        // Token codes. Only the tags needed for this example are defined;
        // extend the list as required.
        static final int HTML_TEXT = -1;
        static final int HTML_UNKNOWN = -2;
        static final int HTML_EOF = -3;
        static final int HTML_IMAGE = -4;
        static final int HTML_FRAME = -5;
        static final int HTML_BACKGROUND = -6;
        static final int HTML_APPLET = -7;

        boolean outsideTag = true; // true while the reader is outside a <...> tag

        // Constructor: defines the syntax table of this token stream.
        public HtmlTokenizer(BufferedReader r) {
            super(r);
            this.resetSyntax();     // reset the syntax table
            this.wordChars(0, 255); // every character may be part of a word
            this.ordinaryChar('<'); // delimiters on either side of an HTML tag
            this.ordinaryChar('>');
        } // end of constructor

        public int nextHtml() {
            int token; // the current token
            try {
                switch (token = this.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    // end of the stream reached
                    return HTML_EOF;
                case '<': // entering a tag
                    outsideTag = false;
                    return nextHtml();
                case '>': // leaving a tag
                    outsideTag = true;
                    return nextHtml();
                case StreamTokenizer.TT_WORD:
                    // the token is a word: decide which tag it is
                    if (allWhite(sval))
                        return nextHtml(); // skip whitespace-only text
                    else if (sval.toUpperCase().indexOf("FRAME") != -1 && !outsideTag)
                        return HTML_FRAME; // FRAME tag
                    else if (sval.toUpperCase().indexOf("IMG") != -1 && !outsideTag)
                        return HTML_IMAGE; // IMG tag
                    else if (sval.toUpperCase().indexOf("BACKGROUND") != -1 && !outsideTag)
                        return HTML_BACKGROUND; // BACKGROUND attribute
                    else if (sval.toUpperCase().indexOf("APPLET") != -1 && !outsideTag)
                        return HTML_APPLET; // APPLET tag
                    // no match: falls through to default, as in the original
                default:
                    System.out.println("Unknown tag: " + token);
                    return HTML_UNKNOWN;
                } // end of switch
            } catch (IOException e) {
                System.out.println("Error:" + e.getMessage());
            }
            return HTML_UNKNOWN;
        } // end of nextHtml

        protected boolean allWhite(String s) { // true if s is only whitespace
            // The original omits this body; the line below is an assumed
            // implementation so that the class compiles.
            return s.trim().length() == 0;
        } // end of allWhite

    } // end of class

    

    The method above was tested successfully in a recent project; the operating system was Windows NT 4 and the development tool was Inprise JBuilder 3.
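    The overview states that the aim is to find the names of the files a page refers to, a step the listing above does not show. The following is one possible sketch of that step, not the author's code. It reuses the same syntax table but works on StreamTokenizer directly so that it is self-contained. The class ResourceLister, the helper attributeValue, and the choice of the SRC, CODE and BACKGROUND attributes are assumptions made for illustration. It also matches tag names with startsWith, where the listing above uses indexOf.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class ResourceLister {
    // Hypothetical helper: value of attribute `name` inside the text of one tag, or null.
    static String attributeValue(String tagText, String name) {
        Pattern p = Pattern.compile(
            "\\b" + Pattern.quote(name) + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|(\\S+))",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tagText);
        if (!m.find()) return null;
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) return m.group(g); // double-quoted, single-quoted or bare
        }
        return null;
    }

    // Collects the file names referenced by IMG, FRAME, APPLET and BODY BACKGROUND.
    static List<String> resources(String html) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();
        st.wordChars(0, 255);  // same syntax table as HtmlTokenizer
        st.ordinaryChar('<');
        st.ordinaryChar('>');
        List<String> found = new ArrayList<>();
        boolean insideTag = false;
        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (type == '<') {
                insideTag = true;
            } else if (type == '>') {
                insideTag = false;
            } else if (type == StreamTokenizer.TT_WORD && insideTag) {
                String tag = st.sval.trim().toUpperCase();
                String value = null;
                if (tag.startsWith("IMG") || tag.startsWith("FRAME")) {
                    value = attributeValue(st.sval, "SRC");
                } else if (tag.startsWith("APPLET")) {
                    value = attributeValue(st.sval, "CODE");
                } else if (tag.startsWith("BODY")) {
                    value = attributeValue(st.sval, "BACKGROUND");
                }
                if (value != null) found.add(value);
            }
        }
        return found;
    }
}
```

    The returned names are relative to the page they came from, so a downloader would still have to resolve each one against the page's URL before fetching it.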


Date : 2008-07-07 Size : 1.04kb User : tiberxu

A program for solving general elasticity stress problems. Design items: number of unknown table rows, number of unknown table columns, coefficient matrix, constant array, table spacing, boundary point values, the array of values at the unknown points, and the X-direction normal stress at all points.
Date : 2008-10-13 Size : 2.34kb User : gfdgdg

A program for solving general elasticity stress problems. Design items: number of unknown table rows, number of unknown table columns, coefficient matrix, constant array, table spacing, boundary point values, the array of values at the unknown points, and the X-direction normal stress at all points.
Date : 2025-12-29 Size : 2kb User : gfdgdg

A program that disables and enables a network adapter, written by an unknown veteran programmer. It took me a long time to find, so I am posting it to share with everyone.
Date : 2025-12-29 Size : 172kb User : cando

Triggering your own program from a URL, in the style of Skype. The IURLSearchHook interface is used by the browser to translate the address of an unknown URL protocol. When the browser tries to open a URL with an unknown protocol, it first attempts to determine the protocol from the address. If that fails, the browser creates the URL Search Hook objects registered in the system and calls each object's Translate method until the address has been translated or every URL Search Hook has been tried. In other words, we can register a protocol that does not yet exist (similar to HTTP); when the browser meets the new protocol it automatically calls Translate to translate it, and can even launch our own program.
Date : 2025-12-29 Size : 3.42mb User : wanqing

A simulation of the TopDisc algorithm in OMNeT++. TopDisc proposes a three-color and a four-color algorithm. The three-color algorithm colors each node white, black, or gray: white nodes are undiscovered nodes, or nodes that have not received any topology request packet; black nodes are cluster heads; gray nodes are nodes covered by at least one black node.
Date : 2025-12-29 Size : 947kb User : 孟湘琴

Morty Abzug's RMON package; license and code maturity unknown.
Date : 2025-12-29 Size : 85kb User : Dr Who

Sockets provide a mechanism for communication between two computers. The socket is the software abstraction used to represent the "terminals" of a connection between two machines. For a given connection, there's a socket on each machine, and you can imagine a hypothetical "cable" running between the two machines with each end of the "cable" plugged into a socket. Of course, the physical hardware and cabling between machines is completely unknown. The whole point of the abstraction is that we don't have to know more than is necessary.
Date : 2025-12-29 Size : 141kb User : TigerWoods

A website/server monitoring tool written in VB, with free SMS notification (implemented through a mail-based SMS alert feature). Not developed by me; the original developer is unknown. Source: Source Code Center (www.lelecode.com)
Date : 2025-12-29 Size : 174kb User : WJC

A mail-sending service. The service watches a directory (C:\FILESERVICE\Inbox) using the FileWatcher component (.NET); it does not really scan, it checks for newly created files. When a file arrives, the service checks its extension. For a DBX file, it reads the file's contents, connects to the database (C:\FILESERVICE\DB\mydb.mdb), and inserts, updates, or deletes records in the database tables; the file is then moved to the subdirectory \backup. New: when the extension is that of a mail file, the file is sent out through the SMTP server (IIS must be installed and running). When an error occurs, it is recorded in the event log. If any other file arrives in the inbox that is neither a DBX nor a mail file, it is moved to the subdirectory \unknown.
Date : 2025-12-29 Size : 27kb User : 锅包肉

Unknown functionality; it will take everyone studying it together to work it out. If you figure it out, please let me know. Thanks.
Date : 2025-12-29 Size : 4.67mb User : peter

A matched filter is one whose characteristics are matched to those of the signal so that the ratio of instantaneous signal power to average noise power at the filter output is maximized. That is, when signal and noise enter the filter together, the signal component peaks at some instant while the noise component is suppressed. In signal processing, a matched filter (originally known as a North filter[1]) is obtained by correlating a known signal, or template, with an unknown signal to detect the presence of the template in the unknown signal. This is equivalent to convolving the unknown signal with a conjugated time-reversed version of the template. The matched filter is the optimal linear filter for maximizing the signal-to-noise ratio (SNR) in the presence of additive stochastic noise. Matched filters are commonly used in radar, in which a known signal is sent out and the reflected signal is examined for common elements of the outgoing signal. Pulse compression is an example of matched filtering. It is so called because the impulse response is matched to the input pulse signals. Two-dimensional matched filters are commonly used in image processing, e.g., to improve SNR for X-ray.
Date : 2025-12-29 Size : 1kb User : 何兴宇

Genetic algorithm for mobile robot path planning in a dynamic unknown environment
Date : 2025-12-29 Size : 6kb User : 小飞侠213