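I have a sentence that has already been annotated with chunks, like this: Some circuit breakers #### installed ####the October 1987 crash ####failed ####their first test ####unable ####to cool ####the selling panic ####both stocks and futures
For each chunk, for example the first one, "Some circuit breakers", I need to search a large corpus of tens of gigabytes for sentences that contain it (the corpus has already been split into individual sentences). The match has to be fuzzy: a matching sentence may contain the complete chunk or only part of it, as long as that part is at least two words long, such as "Some circuit" or "circuit breakers". Any corpus sentence that contains one of these counts as a hit. A few entries from the corpus: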
1_1,Two American heavyweights in international sport, one who weighs 286 pounds and the other 89, found something in common Sunday at the Goodwill Games. Both lost bids for gold medals.
1_2,Both were beaten by Russian opponents. Wrestler Bruce Baumgartner, the heaviest member of the U.S. contingent here, was outpointed in overtime by Andrei Shumilin. Gymnast Shannon Miller, one of the smallest Americans, had to settle for a rare silver in the women's all-around competition behind Russian pixie Dina Kochetkova. Those setbacks for two U.S. athletes who rarely lose, and have dominated their individual events on the world scene, cast a cloud over Sunday's Goodwill competition for American followers. The setting for the dual disaster was the same.
1_3,Sweltering SKK Stadium resembled a steamroom late into the night after temperatures again climbed into the high 80s by mid-afternoon. Miller, the most decorated American gymnast ever, hadn't lost in more than two years.
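The matching rule described above (the whole chunk, or any contiguous part of it with at least two words) could be sketched in plain Java roughly as follows. This is only an illustration: the class name `ChunkNgrams`, the method names, and the normalization (lowercasing, treating anything that is not a letter or digit as a separator) are assumptions, not something stated in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names and normalization are assumptions.
public class ChunkNgrams {

    // All contiguous word sequences of the chunk that have at least
    // minWords words, longest first.
    public static List<String> subPhrases(String chunk, int minWords) {
        String[] words = chunk.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int len = words.length; len >= minWords; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                result.add(String.join(" ",
                        Arrays.copyOfRange(words, start, start + len)));
            }
        }
        return result;
    }

    // Lowercase and replace every run of non-alphanumeric characters
    // with a single space (an assumed, very crude tokenization).
    private static String normalize(String text) {
        return text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
    }

    // True if the sentence contains at least one sub-phrase of the chunk
    // as a sequence of whole words.
    public static boolean matches(String sentence, String chunk, int minWords) {
        String padded = " " + normalize(sentence) + " ";
        for (String phrase : subPhrases(chunk, minWords)) {
            String p = normalize(phrase);
            if (!p.isEmpty() && padded.contains(" " + p + " ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(subPhrases("Some circuit breakers", 2));
        System.out.println(matches("The circuit breakers failed.",
                "Some circuit breakers", 2));
    }
}
```

One-word chunks such as "failed" produce no sub-phrases under this rule, so they never match. Scanning every sentence this way is linear in the corpus size per chunk, which is likely too slow for tens of gigabytes without some kind of index.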
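Hoping the experts here can take a look. Many thanks!
NLP, search, fuzzy matching
-------------------- Programming Q&A --------------------
I have no idea what the original poster is talking about. What is a "corpus"?
-------------------- Programming Q&A --------------------
Lucene can do what you are asking for, but you first have to build a Lucene index over your "tens of gigabytes of large-scale corpus".
-------------------- Programming Q&A --------------------
Agree with the reply above.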
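To make the index suggestion a little more concrete: any sentence that contains a part of the chunk with two or more words necessarily contains at least one of the chunk's adjacent word pairs, so an index keyed on word pairs is enough to find every candidate. The sketch below is a toy in-memory stand-in for that idea and does not use Lucene; the class `BigramIndex`, its methods, and the tokenization are all hypothetical. A real index over tens of gigabytes would have to live on disk, which is what Lucene provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index from word pairs to sentence IDs (hypothetical names).
public class BigramIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Assumed tokenization: lowercase, split on non-alphanumeric runs.
    private static List<String> tokens(String text) {
        String norm = text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        return norm.isEmpty() ? List.of() : Arrays.asList(norm.split(" "));
    }

    // Adjacent word pairs of the text, in order.
    private static List<String> bigrams(String text) {
        List<String> t = tokens(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < t.size(); i++) {
            out.add(t.get(i) + " " + t.get(i + 1));
        }
        return out;
    }

    // Index one corpus sentence under its ID (for example "1_1").
    public void add(String id, String sentence) {
        for (String bg : bigrams(sentence)) {
            postings.computeIfAbsent(bg, k -> new TreeSet<>()).add(id);
        }
    }

    // IDs of all sentences sharing at least one word pair with the chunk.
    public Set<String> search(String chunk) {
        Set<String> hits = new TreeSet<>();
        for (String bg : bigrams(chunk)) {
            hits.addAll(postings.getOrDefault(bg, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        BigramIndex index = new BigramIndex();
        index.add("1_1", "Some circuit breakers were installed.");
        index.add("1_2", "The circuit breakers failed.");
        index.add("1_3", "A short circuit occurred.");
        System.out.println(index.search("Some circuit breakers"));
    }
}
```

With Lucene itself, a similar effect could presumably be had with phrase queries over the chunk's word pairs, or by indexing two-word shingles; which fits better would depend on the index size and the query volume.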
Supplement: Java, Java-related