org.jasen.core.linguistics
Class LexicalTreeAnalyzer

java.lang.Object
  extended by org.jasen.core.linguistics.LexicalTreeAnalyzer

public class LexicalTreeAnalyzer
extends Object

Employs a lexical tree approach to word recognition.

Based on a sample corpus, the analyzer builds a tree of characters such that each character in a word is a node in the tree.

When a word with a similar character sequence is found, the path to the next character is strengthened.
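
The tree-building idea above can be illustrated with a minimal sketch. This is not the source of LexicalTreeAnalyzer; the class name CharTreeSketch, the Node type, and the integer weight used to represent "strengthening" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a character tree; not the actual jASEN implementation. */
public class CharTreeSketch {

    /** One node per character; weight counts how often the path to it was seen. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int weight;
    }

    private final Node root = new Node();

    /** Adds a word, creating missing nodes and strengthening paths that already exist. */
    public void add(String word) {
        Node current = root;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
            current.weight++; // assumption: "strengthened" means a simple counter
        }
    }

    /** Returns the weight of the path ending at the last character of prefix, or 0 if absent. */
    public int weightOf(String prefix) {
        Node current = root;
        for (char c : prefix.toLowerCase(Locale.ROOT).toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return 0;
            }
        }
        return current.weight; // the root itself always has weight 0
    }

    public static void main(String[] args) {
        CharTreeSketch tree = new CharTreeSketch();
        for (String w : new String[] {"the", "then", "this"}) {
            tree.add(w);
        }
        System.out.println(tree.weightOf("th"));
        System.out.println(tree.weightOf("the"));
        System.out.println(tree.weightOf("thx"));
    }
}
```

With the three sample words, the shared path "th" ends up with the highest weight because every word passes through it, while a sequence that never occurred in the corpus has no path at all.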

Author:
Jason Polites

Constructor Summary
LexicalTreeAnalyzer()
           
 
Method Summary
 double computeWordValue(String word)
          Computes the probability that the given sequence of characters is an English word.
 void initialize()
          Creates and initializes the analyzer.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LexicalTreeAnalyzer

public LexicalTreeAnalyzer()
Method Detail

initialize

public void initialize()
                throws IOException
Creates and initializes the analyzer.

Throws:
IOException

computeWordValue

public double computeWordValue(String word)
Computes the probability that the given sequence of characters is an English word.

This works on the premise that most English words exhibit a similar set of character sequence patterns in their prefix, body, and suffix.

The value of the word is determined by analyzing the characters in the word against the values in both the forward and backward lexical trees.

The maximum possible value a word can have is 1 (100%). For each character in the word that is correctly positioned according to the rules in the tree, the computed value is increased by 1/W, where W is the length of the word. A word that perfectly matches a branch of the tree therefore yields 1/W x W (or 1).

Where a word fails to match a forward branch perfectly, two things are done:
  1. For each remaining character in the token, the current total is reduced by the same fraction (1/W) that was used to calculate the total.
  2. The token is given a "second chance" by repeating the initial calculation process with the reverse tree.
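
The scoring rules above can be sketched as follows. This is a hedged illustration, not the library's code: the class WordValueSketch and the methods learn and sketchWordValue are hypothetical, path weights are ignored, and two details the description leaves open are filled with assumptions (the total is floored at 0, and the final value is the larger of the forward and reverse results).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the 1/W scoring scheme; not the actual jASEN implementation. */
public class WordValueSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node forward = new Node();  // words stored first character first
    private final Node backward = new Node(); // words stored last character first

    /** Adds a corpus word to both trees. */
    public void learn(String word) {
        insert(forward, word);
        insert(backward, new StringBuilder(word).reverse().toString());
    }

    private static void insert(Node root, String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
    }

    /** Counts the leading characters of word that follow a branch of the tree. */
    private static int matched(Node root, String word) {
        Node current = root;
        int count = 0;
        for (char c : word.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                break;
            }
            count++;
        }
        return count;
    }

    /** Adds 1/W per matched character and subtracts 1/W per remaining one. */
    private static double score(Node root, String word) {
        int w = word.length();
        int hits = matched(root, word);
        int remaining = w - hits;
        return Math.max(0.0, (hits - remaining) / (double) w); // assumption: floor at 0
    }

    /** Scores a word with the forward tree, then gives it a second chance in reverse. */
    public double sketchWordValue(String word) {
        if (word.isEmpty()) {
            return 0.0;
        }
        double value = score(forward, word);
        if (value < 1.0) {
            String reversed = new StringBuilder(word).reverse().toString();
            // assumption: keep the better of the two passes
            value = Math.max(value, score(backward, reversed));
        }
        return value;
    }

    public static void main(String[] args) {
        WordValueSketch sketch = new WordValueSketch();
        sketch.learn("testing");
        sketch.learn("walking");
        System.out.println(sketch.sketchWordValue("testing"));
        System.out.println(sketch.sketchWordValue("talking"));
        System.out.println(sketch.sketchWordValue("xqzv"));
    }
}
```

In this sketch a corpus word scores 1, an unseen word such as "talking" scores poorly in the forward pass but recovers in the reverse pass because its suffix matches "walking", and a sequence with no matching prefix or suffix scores 0.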

Parameters:
word - The word to be tested
Returns:
A value between 0.0 and 1.0 indicating the probability that the String is an English word.