Search - Unknown

Search list

[Internet-Network] Writing an HTML File Analyzer in Java

Description:

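Writing an HTML File Analyzer in Java

 I. Overview

    

    At the heart of a Web browser is the correct analysis of the tags in an HTML file, just as the interpreter for a programming language analyzes the reserved words in a source file before interpreting it. In practice we often need to do keyword analysis on one particular type of file. For example, to download an HTML file together with the .gif, .class, and other files it references, we have to separate out the tags in the HTML file and find the required file names and directories. Before Java, this kind of work meant examining every character in the file to pick out the needed parts, which took a lot of code and was error-prone. In a recent project the author used Java's input stream class StreamTokenizer to analyze HTML files, with good results. The goal here is to download an HTML file from a known Web page, analyze it, and then download the HTML files (if the page uses frames), image files, and class (Java applet) files that the page contains.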

    

    II. StreamTokenizer

    

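    StreamTokenizer, a tokenizing input stream, turns an input stream into a stream of tokens. Tokens fall into three kinds: words (multi-character tokens), single-character tokens, and whitespace (which includes Java and C/C++ style comments).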

    

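    The constructor of the StreamTokenizer class is: StreamTokenizer(InputStream in). This form has been deprecated since JDK 1.1 in favor of StreamTokenizer(Reader r), which is the one the HtmlTokenizer class below calls.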

    

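    The class has several public instance variables: ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value. When we need the characters between tokens (that is, between the tag delimiters in HTML), we read the variable sval. To advance to the next token, call nextToken(). nextToken() returns an int. For a single-character token it returns the character itself; otherwise it returns one of four constants: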

    

    StreamTokenizer.TT_NUMBER: the token is a number. Its value is a double and can be read from the instance variable nval.

    

    StreamTokenizer.TT_WORD: the token is a non-numeric word (other characters may be part of it). The word can be read from the instance variable sval.

    

    StreamTokenizer.TT_EOL: the token is an end-of-line marker. It is only reported after eolIsSignificant(true) has been called.

    

    If the end of the stream has been reached, nextToken() returns TT_EOF.
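
    The return values above can be seen in a short read loop. This is a minimal sketch using the default syntax table; the class name TokenLoopDemo and the method describeTokens are made up for illustration and are not part of the original article.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not from the article.
public class TokenLoopDemo {

    /** Tokenizes text with the default syntax table and describes each token. */
    static List<String> describeTokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true); // without this, line ends are skipped as whitespace
        List<String> out = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER " + st.nval); // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD " + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("EOL");
                        break;
                    default:
                        // a single-character token: the return value is the character
                        out.add("CHAR " + (char) type);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describeTokens("width = 42\nend"));
    }
}
```

    Note that the default branch handles the single-character case, which is what the HTML example relies on when it matches '<' and '>'.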

    

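    Before the first call to nextToken(), set up the stream's syntax table so that the parser can tell the different kinds of characters apart. The whitespaceChars(int low, int hi) method defines a range of characters that carry no meaning. The wordChars(int low, int hi) method defines a range of characters that make up words.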
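
    As an illustration of these calls, the sketch below builds a small syntax table from scratch and splits an attribute-like string. The class name SyntaxTableDemo, the method split, and the choice of character ranges are assumptions made for this example only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not from the article.
public class SyntaxTableDemo {

    /** Splits text into words and single characters using a hand-built syntax table. */
    static List<String> split(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();            // clear the table: every character is now "ordinary"
        st.wordChars('a', 'z');      // letters form words
        st.wordChars('A', 'Z');
        st.wordChars('-', '9');      // '-', '.', '/' and digits also belong to words
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        List<String> out = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) type)); // ordinary character such as '='
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("src = pic/logo2.gif"));
    }
}
```

    Because resetSyntax() also turns off number parsing, the digits in the file name stay part of the word instead of being reported as TT_NUMBER.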

    

    III. Implementation

    

    1. Implementing the HtmlTokenizer class

    

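    Before analyzing a token stream, first set up its syntax table. In this example that means letting the program tell which words are HTML tags. Below is the token stream class for the HTML tags we need; it is a subclass of StreamTokenizer: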

    

    

    import java.io.*;

    class HtmlTokenizer extends StreamTokenizer {
        // Tag codes. Only the ones needed for this example are defined;
        // extend the list as required.
        static final int HTML_TEXT = -1;
        static final int HTML_UNKNOWN = -2;
        static final int HTML_EOF = -3;
        static final int HTML_IMAGE = -4;
        static final int HTML_FRAME = -5;
        static final int HTML_BACKGROUND = -6;
        static final int HTML_APPLET = -7;

        boolean outsideTag = true; // true while we are outside a tag

        // Constructor: defines the syntax table of this token stream.
        public HtmlTokenizer(BufferedReader r) {
            super(r);
            this.resetSyntax();      // reset the syntax table
            this.wordChars(0, 255);  // every character is a word character
            this.ordinaryChar('<');  // the delimiters on either side of an HTML tag
            this.ordinaryChar('>');
        } // end of constructor

        public int nextHtml() {
            int token; // the current token
            try {
                switch (token = this.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    // end of the stream reached: return HTML_EOF
                    return HTML_EOF;
                case '<': // entering a tag
                    outsideTag = false;
                    return nextHtml();
                case '>': // leaving a tag
                    outsideTag = true;
                    return nextHtml();
                case StreamTokenizer.TT_WORD:
                    // the token is a word: decide which tag it is
                    if (allWhite(sval))
                        return nextHtml(); // skip whitespace-only text
                    else if (sval.toUpperCase().indexOf("FRAME") != -1 && !outsideTag)
                        return HTML_FRAME; // FRAME tag
                    else if (sval.toUpperCase().indexOf("IMG") != -1 && !outsideTag)
                        return HTML_IMAGE; // IMG tag
                    else if (sval.toUpperCase().indexOf("BACKGROUND") != -1 && !outsideTag)
                        return HTML_BACKGROUND; // BACKGROUND attribute
                    else if (sval.toUpperCase().indexOf("APPLET") != -1 && !outsideTag)
                        return HTML_APPLET; // APPLET tag
                    // nothing recognized: falls through to default, as in the original
                default:
                    System.out.println("Unknown tag: " + token);
                    return HTML_UNKNOWN;
                } // end of switch
            } catch (IOException e) {
                System.out.println("Error:" + e.getMessage());
            }
            return HTML_UNKNOWN;
        } // end of nextHtml

        protected boolean allWhite(String s) { // true if s contains only whitespace
            // The original omits the body; this one-line version is an assumption.
            return s.trim().length() == 0;
        } // end of allWhite

    } // end of class
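
    Once nextHtml() reports a tag, sval still holds the raw text between '<' and '>', for example IMG SRC="pic/logo.gif". The article does not show how the file name is pulled out of that text, so the following is only one possible sketch using a regular expression; the class name TagAttributeDemo and the helper attributeValue are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article.
public class TagAttributeDemo {

    /**
     * Returns the value of the named attribute inside a tag's text
     * (the text between '<' and '>'), or null if the attribute is absent.
     * Handles double-quoted, single-quoted, and unquoted values.
     */
    static String attributeValue(String tagText, String name) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(name)
                        + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tagText);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three capture groups matched.
        if (m.group(1) != null) {
            return m.group(1);
        }
        if (m.group(2) != null) {
            return m.group(2);
        }
        return m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(attributeValue("IMG SRC=\"pic/logo.gif\" ALT=logo", "src"));
        System.out.println(attributeValue("BODY BACKGROUND=bg.gif", "background"));
    }
}
```

    In a download loop, a caller could pass the tokenizer's sval to this helper with "SRC" after HTML_IMAGE or HTML_FRAME, "BACKGROUND" after HTML_BACKGROUND, and "CODE" after HTML_APPLET, then resolve the result against the page URL.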

    

    The approach above was tested in a recent project on Windows NT 4, using Inprise JBuilder 3 as the development tool.


Platform: | Size: 1066 | Author: tiberxu | Hits:

[Windows Develop] MessageBoxTimeout API

Description: For reasons unknown, Microsoft has never documented the MessageBoxTimeout API located in user32.dll so here it is for those seeking a Message Box that times out and auto closes if the user does not respond to it first.
Platform: | Size: 6874 | Author: 步络名 | Hits:

[TCP/IP stack] 18286_port_scan

Description: A port scanner that checks whether a given port on a machine has been opened by an unknown program.
Platform: | Size: 24485 | Author: 宋小明 | Hits:

[Other resource] 复件 match

Description: ECG data matching. Cases found in the data are matched into three classes (normal, abnormal, and unknown), and the abnormal data is analyzed further.
Platform: | Size: 3212 | Author: 唐娜 | Hits:

[Industry research] On the Energy Detection of Unknown Signals Over Fading Channels

Description: On the Energy Detection of Unknown Signals Over Fading Channels. Abstract: This letter addresses the problem of energy detection of an unknown signal over a multipath channel. It starts with the no-diversity case, and presents some alternative closed-form expressions for the probability of detection to those recently reported in the literature. Detection capability is boosted by implementing both square-law combining and square-law selection diversity schemes. Index Terms: diversity schemes, energy detection, fading channels, low-power applications, square-law detector, unknown signal detection.
Platform: | Size: 174453 | Author: chenpeng3361 | Hits:

[Game Program] simon

Description: A game of unknown type; the uploader could not tell what game it is.
Platform: | Size: 353280 | Author: 站长 | Hits:

[Other] Energy-detection-of-unknown-deterministic-signals.

Description: A classic paper on energy detection, highly regarded and instructive.
Platform: | Size: 752640 | Author: 刘苗 | Hits:

[Other] evar

Description: Assuming a signal corrupted by Gaussian noise of unknown variance, this function estimates that variance and returns the estimated variance of the additive noise.
Platform: | Size: 2048 | Author: frankcy | Hits:

[Other] On-the-Energy-Detection-of-Unknown

Description: "On the Energy Detection of Unknown Signals", a good article on energy detection and a valuable reference.
Platform: | Size: 105472 | Author: shuijiao` | Hits:

[Other] UNKNOWN-OBJECT-IDENTIFICATION-AND-ALARM-SYSTEM.ra

Description: unknown object identification project document abstract
Platform: | Size: 8192 | Author: Nagesh | Hits:

[Industry research] 2010_AES_Estimating-unknown-clutter-intensity-for

Description: Estimating unknown clutter intensity for PHD filter
Platform: | Size: 2204672 | Author: wang | Hits:

[CSharp] Unknown-approximate-coordinates

Description: Code that computes the approximate coordinates of unknown points for the adjustment of a trilateration network, shared for discussion and use.
Platform: | Size: 20480 | Author: 马飞强 | Hits:

[Algorithm] Quadratic-Equation-with-one-unknown

Description: Reads the three coefficients of a quadratic equation in one unknown and prints the equation and its two solutions.
Platform: | Size: 58368 | Author: vincent | Hits:

[Console] uNknown-Ftp-Client-V1.0

Description: uNknown Ftp Client V1.0 is a simple FTP client. It connects to the host with a username and password, and lets you upload, download, and delete files, among other operations.
Platform: | Size: 17408 | Author: ship | Hits:

[Software Engineering] (unknown-file-types)-trid_net

Description: Piece of software to find out the unknown filenames in the given folder
Platform: | Size: 32768 | Author: avi bkn | Hits:

[Software Engineering] Unknown-environmental-plan

Description: Motion planning in dynamic unknown environments is an important research topic in algorithms. This material may be of some help to beginners.
Platform: | Size: 15867904 | Author: 刘同学 | Hits:

[Mathimatics-Numerical algorithms] adaptive-nonlinear-syatem-ident---Unknown

Description: Adaptive control provides techniques for the automatic adjustment of control parameters in real time either to achieve or to maintain a desired level of control system performance when the dynamic parameters of the process to be controlled are unknown and/or time-varying.
Platform: | Size: 2633728 | Author: masood | Hits:

[Other] In-Pursuit-of-the-Unknown-17-Equations-That-Chang

Description: In In Pursuit of the Unknown, celebrated mathematician Ian Stewart uses a handful of mathematical equations to explore the vitally important connections between math and human progress. We often overlook the historical link between mathematics and technological advances, says Stewart, but this connection is integral to any complete understanding of human history.
Platform: | Size: 3774464 | Author: lamia | Hits:

[GUI Develop] quadratic-equation-with-one-unknown

Description: Solves quadratic equations in one unknown, covering real roots, complex roots, and other cases.
Platform: | Size: 140288 | Author: 吴俊 | Hits:

[File Format] High-Order-Sliding-Mode-and-an-Unknown-Input

Description: High Order Sliding Mode and an Unknown Input
Platform: | Size: 274432 | Author: Noureddine | Hits:

CodeBus www.codebus.net