org.jasen.core.token
Class EmailTokenizer

java.lang.Object
  extended by org.jasen.core.token.EmailTokenizer
All Implemented Interfaces:
MimeMessageTokenizer

public class EmailTokenizer
extends Object
implements MimeMessageTokenizer

Converts the subject, text and html parts of a MimeMessage into discrete String "tokens".

Each token represents either a word, or a specialized representation of certain key information.

For example:

Often the subject line of a message is all that is required to identify it as spam. It is a very good source of information because it is almost always free from obfuscation (notwithstanding the use of non-ASCII characters). Hence, tokens found in the subject are annotated with the word "Subject" and delimited with a question mark.

For example:

The subject line "Buy viagra!" would be tokenized as:

Subject?Buy
Subject?viagra!
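
The annotation described above can be illustrated with a minimal sketch. This is not Jasen's implementation: the class SubjectTokenSketch, its constants, and the choice to split on whitespace are all assumptions made for illustration, chosen only so that the output matches the documented example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; a hypothetical helper that is not part of Jasen. */
final class SubjectTokenSketch {

    // Hypothetical constants mirroring the convention described in the text.
    static final String SUBJECT_PREFIX = "Subject";
    static final char SUBJECT_DELIMITER = '?';

    /**
     * Splits a subject line on whitespace (an assumption) and prefixes
     * each word with "Subject?".
     */
    static String[] tokenizeSubject(String subject) {
        if (subject == null) {
            return new String[0];
        }
        List<String> tokens = new ArrayList<>();
        for (String word : subject.trim().split("\\s+")) {
            // A blank subject yields one empty string from split(); skip it.
            if (!word.isEmpty()) {
                tokens.add(SUBJECT_PREFIX + SUBJECT_DELIMITER + word);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Prints one annotated token per line for the documented example.
        for (String token : tokenizeSubject("Buy viagra!")) {
            System.out.println(token);
        }
    }
}
```

Punctuation stays attached to the word in this sketch, which is consistent with the "Subject?viagra!" token shown in the example.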

Author:
Jason Polites

Field Summary
static char HEADER_TOKEN_DELIMITER
          This is just a rare character used to identify mail header tokens. It looks like two pipes (||).
static String[] IGNORED_HEADERS
          Deprecated. This should be done in config
static String[] INCLUDED_HEADERS
          Deprecated. This should be done in config
 
Constructor Summary
EmailTokenizer()
           
 
Method Summary
 int getLinguisticLimit()
          Gets the maximum number of linguistic errors tolerated before tokenization is aborted.
 int getTokenLimit()
          Gets the maximum number of tokens extracted before tokenization is aborted
 boolean isIgnoreHeaders()
          Tells us if we are ignoring the list of IGNORED_HEADERS when tokenizing
static void main(String[] args)
          Internal test harness only.
 void setIgnoreHeaders(boolean b)
          Flags the tokenizer to ignore the headers listed in IGNORED_HEADERS when tokenizing
 void setLinguisticLimit(int linguisticLimit)
          Sets the maximum number of linguistic errors tolerated before tokenization is aborted.
 void setTokenLimit(int i)
          Sets the maximum number of tokens extracted before tokenization is aborted
 String[] tokenize(javax.mail.internet.MimeMessage mail, JasenMessage message, ParserData data)
          Tokenizes the given message into meaningful string tokens
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

HEADER_TOKEN_DELIMITER

public static final char HEADER_TOKEN_DELIMITER
This is just a rare character used to identify mail header tokens. It looks like two pipes (||).

See Also:
Constant Field Values

IGNORED_HEADERS

public static String[] IGNORED_HEADERS
Deprecated. This should be done in config


INCLUDED_HEADERS

public static String[] INCLUDED_HEADERS
Deprecated. This should be done in config

Constructor Detail

EmailTokenizer

public EmailTokenizer()
               throws IOException
Method Detail

tokenize

public String[] tokenize(javax.mail.internet.MimeMessage mail,
                         JasenMessage message,
                         ParserData data)
                  throws JasenException
Description copied from interface: MimeMessageTokenizer
Tokenizes the given message into meaningful string tokens

Specified by:
tokenize in interface MimeMessageTokenizer
Parameters:
mail -
message -
Returns:
The reduced message tokens
Throws:
JasenException

getLinguisticLimit

public int getLinguisticLimit()
Gets the maximum number of linguistic errors tolerated before tokenization is aborted.

The tokenizer uses the LinguisticAnalyzer to determine whether each token is a real word. After linguisticLimit tokens in a row have failed this check, tokenization is aborted.

Returns:
Returns the linguisticLimit.
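
The limit described for getLinguisticLimit() can be sketched as a counter of successive failures. The code below is a hypothetical illustration, not the Jasen source: LinguisticLimitSketch, collectUntilLimit, and the dictionary-based predicate stand in for the real LinguisticAnalyzer. It also assumes that a successful check resets the counter and that "aborted" means returning the tokens collected so far; the documentation does not state either point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative only; a hypothetical helper that is not part of Jasen. */
final class LinguisticLimitSketch {

    /**
     * Collects words until linguisticLimit words in a row fail the
     * isRealWord check, then stops (assumed meaning of "aborted").
     */
    static List<String> collectUntilLimit(List<String> words,
                                          Predicate<String> isRealWord,
                                          int linguisticLimit) {
        List<String> tokens = new ArrayList<>();
        int successiveFailures = 0;
        for (String word : words) {
            if (isRealWord.test(word)) {
                successiveFailures = 0; // assumed: a real word resets the count
            } else {
                successiveFailures++;
                if (successiveFailures >= linguisticLimit) {
                    break; // limit reached: stop tokenizing
                }
            }
            tokens.add(word);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A tiny dictionary stands in for the real linguistic analysis.
        Set<String> dictionary = new HashSet<>(Arrays.asList("hello", "world", "spam"));
        List<String> words = Arrays.asList(
                "hello", "xqz", "world", "qwx", "zzk", "kkq", "spam");
        System.out.println(collectUntilLimit(words, dictionary::contains, 3));
    }
}
```

With a limit of 3, the run stops at "kkq" (the third failure in a row), so "spam" is never reached even though it is a real word.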

setLinguisticLimit

public void setLinguisticLimit(int linguisticLimit)
Sets the maximum number of linguistic errors tolerated before tokenization is aborted.

Parameters:
linguisticLimit - The linguisticLimit to set.
See Also:
getLinguisticLimit()

isIgnoreHeaders

public boolean isIgnoreHeaders()
Tells us if we are ignoring the list of IGNORED_HEADERS when tokenizing

Returns:
True if the tokenizer is ignoring headers in the IGNORED_HEADERS set
See Also:
IGNORED_HEADERS

setIgnoreHeaders

public void setIgnoreHeaders(boolean b)
Flags the tokenizer to ignore the headers listed in IGNORED_HEADERS when tokenizing

Parameters:
b - true to ignore the headers listed in IGNORED_HEADERS when tokenizing

getTokenLimit

public int getTokenLimit()
Gets the maximum number of tokens extracted before tokenization is aborted

Returns:
The maximum number of tokens that will be returned
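
As a rough illustration of how a token limit bounds the result, the sketch below stops collecting once the limit is reached. TokenLimitSketch and takeUpTo are hypothetical names, and truncating the result (rather than raising an error) is an assumption based on the "Returns" description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; a hypothetical helper that is not part of Jasen. */
final class TokenLimitSketch {

    /** Returns at most tokenLimit tokens, in their original order. */
    static String[] takeUpTo(Iterable<String> candidates, int tokenLimit) {
        List<String> tokens = new ArrayList<>();
        for (String candidate : candidates) {
            if (tokens.size() >= tokenLimit) {
                break; // limit reached: stop extracting further tokens
            }
            tokens.add(candidate);
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] limited = takeUpTo(Arrays.asList("a", "b", "c", "d"), 2);
        System.out.println(Arrays.toString(limited));
    }
}
```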

setTokenLimit

public void setTokenLimit(int i)
Sets the maximum number of tokens extracted before tokenization is aborted

Specified by:
setTokenLimit in interface MimeMessageTokenizer
Parameters:
i - The maximum number of tokens to extract

main

public static void main(String[] args)
Internal test harness only. DO NOT USE

Parameters:
args -