EmailTokenizer

org.jasen.core.token
Class EmailTokenizer

java.lang.Object
  org.jasen.core.token.EmailTokenizer

All Implemented Interfaces:: MimeMessageTokenizer

public class EmailTokenizer
extends Object
implements MimeMessageTokenizer

Converts the subject, text and html parts of a MimeMessage into discrete String "tokens".

Each token represents either a word, or a specialized representation of certain key information.

For example:

Often the subject line in a message is all that is required to identify it as spam. This can be a very good source of information because it will almost always be free from obfuscation (notwithstanding the use of non-ASCII characters). Hence, tokens found in the subject are annotated with the word "Subject" and delimited with a question mark.

For example:

The subject line "Buy viagra!" would be tokenized as:

Subject?Buy
Subject?viagra!
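
The annotation scheme described above can be sketched in a few lines of Java. This is a minimal illustration only, not Jasen's actual implementation: the class `SubjectTokenSketch`, the method `annotateSubject`, and the whitespace-only splitting are assumptions made for the example. The `Subject` prefix and the question-mark delimiter are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the Jasen API.
final class SubjectTokenSketch {

    // Prefix and delimiter as described in the class documentation.
    static final String PREFIX = "Subject";
    static final char DELIMITER = '?';

    // Splits the subject on whitespace (an assumption) and annotates each word.
    static List<String> annotateSubject(String subject) {
        List<String> tokens = new ArrayList<>();
        if (subject == null) {
            return tokens;
        }
        for (String word : subject.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(PREFIX + DELIMITER + word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one annotated token per line for the example subject.
        for (String token : annotateSubject("Buy viagra!")) {
            System.out.println(token);
        }
    }
}
```

Punctuation is left attached to each word here, which matches the `Subject?viagra!` token in the example; how the real tokenizer treats punctuation in general is not stated.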

Author:: Jason Polites

Field Summary
`static char`	`HEADER_TOKEN_DELIMITER` A rare character used to identify mail header tokens. It looks like two pipes (||).
`static String[]`	`IGNORED_HEADERS` Deprecated. This should be done in config
`static String[]`	`INCLUDED_HEADERS` Deprecated. This should be done in config

Constructor Summary
`EmailTokenizer()`

Method Summary
`int`	`getLinguisticLimit()` Gets the maximum number of linguistic errors tolerated before tokenization is aborted.
`int`	`getTokenLimit()` Gets the maximum number of tokens extracted before tokenization is aborted
`boolean`	`isIgnoreHeaders()` Tells us if we are ignoring the list of IGNORED_HEADERS when tokenizing
`static void`	`main(String[] args)` Internal test harness only.
`void`	`setIgnoreHeaders(boolean b)` Flags the tokenizer to ignore the list of IGNORED_HEADERS when tokenizing
`void`	`setLinguisticLimit(int linguisticLimit)` Sets the maximum number of linguistic errors tolerated before tokenization is aborted.
`void`	`setTokenLimit(int i)` Sets the maximum number of tokens extracted before tokenization is aborted
`String[]`	`tokenize(javax.mail.internet.MimeMessage mail, JasenMessage message, ParserData data)` Tokenizes the given message into meaningful string tokens

Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

HEADER_TOKEN_DELIMITER

public static final char HEADER_TOKEN_DELIMITER

A rare character used to identify mail header tokens. It looks like two pipes (||).

See Also:: Constant Field Values

IGNORED_HEADERS

public static String[] IGNORED_HEADERS

Deprecated. This should be done in config

INCLUDED_HEADERS

public static String[] INCLUDED_HEADERS

Deprecated. This should be done in config

Constructor Detail

EmailTokenizer

public EmailTokenizer()
               throws IOException

Method Detail

tokenize

public String[] tokenize(javax.mail.internet.MimeMessage mail,
                         JasenMessage message,
                         ParserData data)
                  throws JasenException

Description copied from interface: MimeMessageTokenizer

Tokenizes the given message into meaningful string tokens

Specified by:: tokenize in interface MimeMessageTokenizer

Parameters:: mail -; message -
Returns:: The reduced message tokens
Throws:: JasenException

getLinguisticLimit

public int getLinguisticLimit()

Gets the maximum number of linguistic errors tolerated before tokenization is aborted.

The tokenizer uses the LinguisticAnalyzer to determine whether each token is a real word. After linguisticLimit tokens in succession have failed this check, tokenization is aborted.

Returns:: Returns the linguisticLimit.
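
The abort rule can be illustrated with a small, self-contained sketch. It is not Jasen's code: the class `LinguisticLimitSketch`, the method `collect`, and the `Predicate<String>` standing in for the LinguisticAnalyzer are all hypothetical. The sketch also assumes that the failure count resets whenever a token passes the check, that failing tokens are still kept, and that the tokens gathered before the abort are returned; none of these details is confirmed by the documentation above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration; not part of the Jasen API.
final class LinguisticLimitSketch {

    // Collects words until `linguisticLimit` words in a row fail the check.
    static List<String> collect(List<String> words,
                                Predicate<String> isRealWord,
                                int linguisticLimit) {
        List<String> tokens = new ArrayList<>();
        int consecutiveFailures = 0;
        for (String word : words) {
            tokens.add(word);
            if (isRealWord.test(word)) {
                consecutiveFailures = 0; // a real word resets the run (assumption)
            } else if (++consecutiveFailures >= linguisticLimit) {
                break; // abort: too many failures in a row
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A tiny stand-in dictionary plays the role of the analyzer.
        Set<String> dictionary = new HashSet<>(Arrays.asList("hello", "world"));
        List<String> words = Arrays.asList("hello", "xqz", "zzq", "world");
        System.out.println(collect(words, dictionary::contains, 2));
    }
}
```

With a limit of 2, the run stops after the second unrecognized word in a row, so `world` is never reached; with a limit of 3, all four words are collected.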

setLinguisticLimit

public void setLinguisticLimit(int linguisticLimit)

Sets the maximum number of linguistic errors tolerated before tokenization is aborted.

Parameters:: linguisticLimit - The linguisticLimit to set.
See Also:: getLinguisticLimit()

isIgnoreHeaders

public boolean isIgnoreHeaders()

Tells us if we are ignoring the list of IGNORED_HEADERS when tokenizing

Returns:: True if the tokenizer is ignoring headers in the IGNORED_HEADERS set
See Also:: IGNORED_HEADERS

setIgnoreHeaders

public void setIgnoreHeaders(boolean b)

Flags the tokenizer to ignore the list of IGNORED_HEADERS when tokenizing

Parameters:: b -

getTokenLimit

public int getTokenLimit()

Gets the maximum number of tokens extracted before tokenization is aborted

Returns:: The maximum number of tokens that will be returned

setTokenLimit

public void setTokenLimit(int i)

Sets the maximum number of tokens extracted before tokenization is aborted

Specified by:: setTokenLimit in interface MimeMessageTokenizer

Parameters:: i -

main

public static void main(String[] args)

Internal test harness only. DO NOT USE

Parameters:: args -
