Find all references to a supplied noun in StanfordNLP

I'm trying to parse some text to find all references to a particular item. So, for example, if my item was The Bridge on the River Kwai and I passed it this text, I'd like it to find all the instances I've put in bold.

The Bridge on the River Kwai is a 1957 British-American epic war film directed by David Lean and starring William Holden, Jack Hawkins, Alec Guinness, and Sessue Hayakawa. The film is a work of fiction, but borrows the construction of the Burma Railway in 1942–1943 for its historical setting. The movie was filmed in Ceylon (now Sri Lanka). The bridge in the film was near Kitulgala.

So far my attempt has been to go through all the mentions attached to each CorefChain and loop through those hunting for my target string. If I find the target string, I add the whole CorefChain, as I think this means the other items in that CorefChain also refer to the same thing.
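For context, the `document` used below has to come from a pipeline that includes the coreference annotator. A minimal setup sketch (the annotator list follows the CoreNLP documentation for the statistical `coref` annotator; `text` stands for the passage being searched):

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Coreference depends on the earlier annotators in this list.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// After annotate(), the document carries CorefChainAnnotation,
// which the loop below reads.
Annotation document = new Annotation(text);
pipeline.annotate(document);
```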

```java
List<CorefChain> gotRefs = new ArrayList<CorefChain>();
String pQuery = "The Bridge on the River Kwai";
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    boolean addedChain = false;
    for (CorefChain.CorefMention cm : corefMentions) {
        if ((!addedChain) && (pQuery.equals(cm.mentionSpan))) {
            gotRefs.add(cc);
            addedChain = true;
        }
    }
}
```

I then loop through this second list of CorefChains, re-retrieve the mentions for each chain, and step through them. In that loop I print which sentences likely contain a mention of my item.

```java
for (CorefChain gr : gotRefs) {
    List<CorefChain.CorefMention> corefMentionsUsing = gr.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentionsUsing) {
        System.out.println("Got reference to " + cm.mentionSpan + " in sentence #" + cm.sentNum);
    }
}
```

It finds some of my references, but not that many, and it produces a lot of false positives. As is probably entirely apparent from reading this, I don't really know the first thing about NLP - am I going about this entirely the wrong way? Is there a StanfordNLP parser that will already do some of what I'm after? Should I be training a model in some way?
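One stdlib-only observation, independent of the coref step: the exact check `pQuery.equals(cm.mentionSpan)` will miss spans that differ only in casing or surrounding punctuation. A sketch of a more forgiving comparison (the class and helper names here are illustrative, not part of CoreNLP):

```java
// Illustrative helper, not a CoreNLP API: a lenient version of the
// pQuery.equals(cm.mentionSpan) check that tolerates casing and
// leading/trailing punctuation in the mention span.
public class SpanMatch {
    static String normalize(String s) {
        // Lower-case, then strip non-letter/digit characters from both ends.
        return s.toLowerCase().replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    static boolean spanMatches(String query, String mentionSpan) {
        String q = normalize(query);
        String m = normalize(mentionSpan);
        // Accept an exact match after normalization, or a span that
        // contains the query (e.g. "the 1957 film The Bridge on the River Kwai").
        return m.equals(q) || m.contains(q);
    }

    public static void main(String[] args) {
        System.out.println(spanMatches("The Bridge on the River Kwai",
                "the bridge on the River Kwai,")); // true
        System.out.println(spanMatches("The Bridge on the River Kwai",
                "the film")); // false: linking this still needs coref
    }
}
```

This only widens the string match; descriptive mentions like "the film" or "it" still have to come from the coreference chains themselves.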

Best answer

I think a problem with your example is that you are looking for references to a movie title, and there isn't support in Stanford CoreNLP for recognizing movie titles, book titles, etc.
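If the goal is to make the pipeline aware of one specific fixed title, one direction worth checking against the CoreNLP documentation is the TokensRegexNER (`regexner`) annotator, which tags phrases listed in a tab-separated mapping file. A configuration sketch (the file name `movies.tab` and the `MOVIE` tag are made up for illustration, and tagging the title does not guarantee that coref will then link descriptive mentions like "the film" to it):

```java
// Sketch: tag a known title via the "regexner" annotator.
// movies.tab contains tab-separated lines of phrase and NER tag, e.g.:
//   The Bridge on the River Kwai<TAB>MOVIE
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner,parse,coref");
props.setProperty("regexner.mapping", "movies.tab");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
```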

If you look at this example:

"Joe bought a laptop. He is happy with it."

You will notice that it connects:

"Joe" -> "He"

and

"a laptop" -> "it"

Coreference is an active research area and even the best system can only really be expected to produce an F1 of around 60.0 on general text, meaning it will often make errors.
