opennlp.grok.preprocess.tokenize
Class TokenizerME

java.lang.Object
  |
  +--opennlp.grok.preprocess.tokenize.TokenizerME
All Implemented Interfaces:
opennlp.common.preprocess.Pipelink, opennlp.common.preprocess.Tokenizer
Direct Known Subclasses:
EnglishTokenizerME

public class TokenizerME
extends java.lang.Object
implements opennlp.common.preprocess.Tokenizer

A Tokenizer for converting raw text into separated tokens. It uses Maximum Entropy to make its decisions. The features are loosely based off of Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications.", which is available from his homepage: .

Version:
$Revision: 1.12 $, $Date: 2002/11/26 03:27:51 $
Author:
Tom Morton

Constructor Summary
TokenizerME(opennlp.maxent.MaxentModel mod)
          Class constructor which takes the string locations of the information which the maxent model needs.
 
Method Summary
static void main(java.lang.String[] args)
          Trains a new model.
 void process(opennlp.common.xml.NLPDocument doc)
          Tokenize an NLPDocument.
 java.util.Set requires()
           
 java.lang.String[] tokenize(java.lang.String s)
          Tokenize a String.
static void train(java.lang.String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TokenizerME

public TokenizerME(opennlp.maxent.MaxentModel mod)
Class constructor which takes the string locations of the information which the maxent model needs.

Method Detail

process

public void process(opennlp.common.xml.NLPDocument doc)
Tokenize an NLPDocument.

Specified by:
process in interface opennlp.common.preprocess.Pipelink

requires

public java.util.Set requires()
Specified by:
requires in interface opennlp.common.preprocess.Pipelink

tokenize

public java.lang.String[] tokenize(java.lang.String s)
Tokenize a String.

Specified by:
tokenize in interface opennlp.common.preprocess.Tokenizer
Parameters:
s - The string to be tokenized.
Returns:
A string array containing individual tokens as elements.

train

public static void train(java.lang.String[] args)

main

public static void main(java.lang.String[] args)
Trains a new model. Call from the command line with "java opennlp.grok.preprocess.tokenize.TokenizerME trainingdata modelname"



Copyright © 2003 Jason Baldridge and Gann Bierner. All Rights Reserved.