Limit yourself: NLP in Java without Cloud Solutions

Can Stanford CoreNLP do enough for us?

With the rise of chatbots and other linguistic applications, the field of Natural Language Processing (NLP) has attracted quite some attention. My experience with NLP thus far has consisted of the following steps:

  • Take a String.class
  • Send it over to an online NLP engine (wit.ai, luis.ai or other friends)
  • Get back the intents/entities
  • Use them in a switch(intent)

I know many people have a similar experience because it is so damn easy, but it has always bothered me to be dependent on a remote service. This is why I got all excited to see an NLP library for Java in the latest Technology Radar: Stanford CoreNLP. A quick GitHub search shows that this library has existed for at least 5 years and has seen multiple iterations, but it is never too late to start tinkering with it!

Tinkering with Stanford CoreNLP

Stanford CoreNLP consists of a core library and extra jars for the trained models, all available on Maven Central!

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.9.2</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.9.2</version>
        <classifier>models</classifier>
    </dependency>

There are additional dependencies with trained models for other languages. Fair warning: downloading the models (~1 GB each) may take a while on hotel wifi…
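
For example, pulling in the French models (assuming the same 3.9.2 version; the other languages follow the analogous models-* classifier pattern) would look like this:

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>3.9.2</version>
        <classifier>models-french</classifier>
    </dependency>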

The library has 2 main API flavors:

  • The Simple CoreNLP API: a compact wrapper around the most common tasks.
  • The full Stanford CoreNLP pipeline API: configurable annotators for fine-grained control.

Since I usually learn by writing tests… time to start testing!

Test driving the library

I have no shame in admitting I have little to no knowledge about NLP. After a quick glance, it seems like the Simple API can do quite a lot without my getting lost in the details of NLP.

The package exposes 2 classes we can wrap our heads around.

  • Document.class: The Java representation for a full document.
  • Sentence.class: The Java representation for a sentence (duh!).

Let’s see what we can do with this!
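
Since everything here is test-driven, a minimal harness for the snippets that follow could look like this (a sketch assuming JUnit 4 on the classpath; the class and test names are mine):

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    import edu.stanford.nlp.simple.Document;

    public class SimpleApiTest {

        @Test
        public void splitsTextIntoSentences() {
            // The Simple API wraps raw text in a Document, which splits it into Sentences.
            Document document = new Document("Maarten is a good architect. He is someone who accepts being challenged.");
            assertEquals(2, document.sentences().size());
        }
    }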

Entities in sentences

NLP APIs usually return a list of entities discovered in a sentence. Lucky for us, CoreNLP provides exactly such a feature.

        String text = "Maarten is a good architect.";

        Document document = new Document(text);

        // CoreNLP handles entity extraction at the sentence level.
        List<Sentence> sentences = document.sentences();

        // Execute Named Entity Recognition (NER)
        sentences.stream()
            .map(Sentence::nerTags)
            .flatMap(List::stream)
            .forEach(System.out::print);

The code above results in: PERSONOOOTITLEO. Every word gets a tag attached to it, matched to the word by its index in the sentence. Maarten gets the entity PERSON (See Maarten! You are still a human!) and architect is clearly a TITLE. If CoreNLP can’t find an entity for a word, it just returns O.
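
The concatenated output is a bit hard on the eyes. Since the tags line up with the words by index, we can pair them up for readability (a small sketch continuing from the snippet above; the same pattern works for the posTags() call in the next section):

        // Print every word next to its NER tag, e.g. "Maarten/PERSON".
        for (Sentence sentence : sentences) {
            List<String> words = sentence.words();
            List<String> tags = sentence.nerTags();
            for (int i = 0; i < words.size(); i++) {
                System.out.println(words.get(i) + "/" + tags.get(i));
            }
        }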

Part-Of-Speech Tagging

Part-Of-Speech Tagging is the art of identifying words as nouns, verbs, adjectives, adverbs, etc. I used to be really bad at this as a kid, but now I have machines to do it for me! Let’s see how CoreNLP does this.

    String text = "Maarten is a good architect. He is someone who accepts being challenged.";

    Document document = new Document(text);

    // Split into sentences
    List<Sentence> sentences = document.sentences();

    // Execute Part-Of-Speech Tagger (POS)
    sentences.stream()
            .map(Sentence::posTags)
            .flatMap(List::stream)
            .forEach(System.out::print);

The code above results in: NNPVBZDTJJNN.PRPVBZNNWPVBZVBGVBN.. Every word gets a resulting code. Maarten was labeled as NNP, which means Proper noun, singular. A full list of the codes can be found here.

The result mapping for the first sentence:

  • Maarten → NNP: Proper noun, singular
  • is → VBZ: Verb, 3rd person singular present
  • a → DT: Determiner
  • good → JJ: Adjective
  • architect → NN: Noun, singular or mass

Finding Coreferences in a Document

In natural language, we often refer to previously shared information.

Maarten is a good architect. He is someone who accepts being challenged.

In the text above, the word he refers to an entity mentioned in the previous sentence. While online tools like wit.ai are good at resolving entities within a single sentence, they usually struggle with coreferences across multiple sentences.

Using Stanford CoreNLP it is almost too easy. The following code shows how we can get the coreference chains between words in a document.

        String text = "Maarten is a good architect. He is someone who accepts being challenged.";

        Document document = new Document(text);

        // Find coreferences in document and print them
        document.coref()
                .values()
                .forEach(System.out::println);

The code above results in: CHAIN3-[“Maarten” in sentence 1, “He” in sentence 2]
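
That output comes from CorefChain’s toString(). To work with the mentions programmatically, each chain exposes them in textual order (a sketch; document.coref() returns a Map whose values are edu.stanford.nlp.coref.data.CorefChain objects):

        // Walk each chain and print its mentions ourselves.
        for (CorefChain chain : document.coref().values()) {
            for (CorefChain.CorefMention mention : chain.getMentionsInTextualOrder()) {
                // sentNum is 1-based; mentionSpan holds the literal text of the mention.
                System.out.println("\"" + mention.mentionSpan + "\" in sentence " + mention.sentNum);
            }
        }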

Cool! The coreference was found. A quick Wikipedia search on coreferences shows there are multiple types. Let’s try some different types to see if CoreNLP can find them!

  • Anaphora: “The music was so loud that it couldn’t be enjoyed.” → Yes
  • Cataphora: “If they are angry about the music, the neighbors will call the cops.” → Yes
  • Split antecedents: “Carol told Bob to attend the party. They arrived together.” → No
  • Coreferring noun phrases: “The project leader is refusing to help. The jerk thinks only of himself.” → No, but it does match jerk with himself

All examples are from Wikipedia.

So it’s not perfect… but it is way better than nothing.

Detecting sentiment with the Stanford API

While there are some more methods available on the Sentence and Document classes, some features require the use of the full API. The Stanford API allows you to configure custom annotators. One of these is the SentimentAnnotator, which can discover the general feeling of a sentence on a five-point scale from Very negative to Very positive.

        // Set up the text and create a CoreDocument from it
        // (CoreDocument and StanfordCoreNLP both live in edu.stanford.nlp.pipeline).
        String text = "Maarten is a good architect. He is someone who accepts being challenged.";
        CoreDocument document = new CoreDocument(text);

        // Set up the NLP pipeline, passing in the annotators you want to run.
        // Sentiment analysis depends on the annotators before it; without them it will not run.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,parse,sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate the document using the Pipeline.
        pipeline.annotate(document);

        // Print the sentiments for each sentence.
        document.sentences()
                .forEach(s -> System.out.println(s.sentiment()));

So… how do I feel about Maarten?

  • “Maarten is a good architect.”: Positive
  • “He is someone who accepts being challenged.”: Neutral

Seems about right! ◕‿◕
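
If you want each sentence and its label on one line of output, CoreSentence also exposes the original text (a one-line variation on the snippet above):

        // Print "sentence -> sentiment" pairs instead of bare labels.
        document.sentences()
                .forEach(s -> System.out.println(s.text() + " -> " + s.sentiment()));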

Closing words

After 8+ years of development, it is no wonder that Stanford CoreNLP has evolved into a functional and easy-to-use library. With all the features you can expect from NLP APIs on the internet, it is a great tool for anyone interested in building conversational agents.

I see most companies that are serious about chatbots create their own NLP systems. This is a very costly enterprise that requires specific skill sets, like computational linguistics, which are not widely available. Stanford CoreNLP hits the sweet spot between using an external NLP API and building everything yourself from scratch.

Additional Sources