CLUE: a Chinese Language Understanding Evaluation benchmark, including representative datasets, baseline (pre-trained) models, corpora, and a leaderboard. A set of datasets for representative tasks is selected as the benchmark suite, covering different tasks, data sizes, and difficulty levels.
DataCLUE: a data-centric NLP benchmark and toolkit
FewCLUE: a Chinese few-shot learning benchmark for pre-trained models
CBLUE: medical information processing tasks
https://github.com/LIAAD/KeywordExtractor-Datasets
This repository contains 20 annotated datasets for automatic keyphrase extraction, made available by the research community.
Search, advertising, and recommendation systems mainly optimize query-document matching through the retrieval (recall) and ranking of content/items.
Traditional structured query understanding converts a query into structured information via word segmentation, NER, query tagging, and so on. Mid- and long-tail queries account for 99.9% of all queries, and for them insufficient supply and weak algorithmic understanding are the key factors limiting efficiency.
Query-related algorithms: understanding, matching (Trie), rewriting, and correction.
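Query matching with a Trie can be illustrated with a short sketch; this minimal Python class is illustrative and not tied to any particular library:

```python
# Minimal Trie sketch for query prefix matching (illustrative names).
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # marks the end of a stored query

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def has_prefix(self, prefix):
        # True if any stored query starts with `prefix`.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True
```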
https://aclanthology.org/2020.coling-main.572/
https://github.com/destwang/CTCResources
Traditional correction methods are generally rule-based: language experts first summarize common error patterns and use them to detect whether a text contains an error, then apply hand-crafted rules to rewrite the erroneous text according to the pre-defined patterns.
In an N-gram model, the probability of a sentence T is composed of the probabilities of the N consecutive characters that make it up, under the assumption that each character depends only on the preceding one or more characters. The N-gram score of a text is then used for detection: the higher a sentence scores, the more likely it is correct; the lower it scores, the more likely it contains an error.
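A minimal sketch of such an N-gram scorer (a bigram model with add-one smoothing; the corpus and function names are illustrative):

```python
import math
from collections import Counter

def train_bigram(corpus):
    # `corpus` is a list of tokenized sentences (lists of characters/words).
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])          # context counts
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_score(sent, unigrams, bigrams):
    # Log-probability under the bigram model; higher = more fluent.
    vocab = len(unigrams)
    tokens = ["<s>"] + sent + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from zeroing the score.
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return logp
```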
Replace homophones or visually similar characters in the text using two separate substitution lexicons; if the score after a replacement is higher than before, the replaced position is likely an erroneous character. The replaced characters in the highest-scoring sentence are then offered to the user as correction candidates.
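A sketch of this substitution loop, reusing a sentence scorer like the one above (the confusion table below is a toy example; real systems use large homophone and similar-shape tables):

```python
# Toy confusion sets mapping a character to its confusable variants.
CONFUSION = {"帐": ["账"], "在": ["再"], "象": ["像", "想"]}

def propose_corrections(sentence, score_fn):
    # score_fn takes a token list, e.g.
    # lambda toks: sentence_score(toks, unigrams, bigrams)
    base = score_fn(list(sentence))
    candidates = []
    for i, ch in enumerate(sentence):
        for sub in CONFUSION.get(ch, []):
            fixed = sentence[:i] + sub + sentence[i + 1:]
            gain = score_fn(list(fixed)) - base
            if gain > 0:  # replacement made the sentence more fluent
                candidates.append((fixed, gain))
    # Highest-scoring rewrite first; offered to the user as suggestions.
    return sorted(candidates, key=lambda x: -x[1])
```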
An encoder-decoder architecture can model the transformation from erroneous to correct text: the encoder on the left and the decoder on the right both use LSTMs. After iterating over the input, the encoder produces a semantic vector for the whole sentence; the decoder then decodes this vector into the corresponding characters, completing the error-to-correct transformation.
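A minimal PyTorch sketch of this LSTM encoder-decoder corrector (sizes and names are illustrative; teacher forcing is assumed for training):

```python
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encoder compresses the erroneous sentence into its final (h, c) state.
        _, state = self.encoder(self.embed(src_ids))
        # Decoder unrolls the corrected sentence conditioned on that state
        # (teacher forcing: tgt_ids is the shifted gold sequence).
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # per-step logits over the vocabulary
```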
Pre-trained models such as BERT/ELECTRA/ERNIE/MacBERT have strong language-representation capabilities. Thanks to their MASK-based training objective, they can be adapted to correction with simple modifications; with fine-tuning, performance readily reaches state of the art.
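As a sketch of the MASK-based idea, the Hugging Face fill-mask pipeline with the public bert-base-chinese checkpoint already produces replacement candidates for a masked position (the input sentence is made up):

```python
from transformers import pipeline

# Mask the suspicious position and let the MLM rank replacement characters.
fill = pipeline("fill-mask", model="bert-base-chinese")
for pred in fill("今天天气很[MASK]。", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```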
https://github.com/Meituan-Dianping/asap
Yes/no type: the most common type, with samples formatted as "(sentence 1, sentence 2, similar or not)"; the ATEC, BQ, LCQMC, and PAWSX datasets collected here are all of this type;
NLI type: NLI stands for Natural Language Inference; samples are formatted as "(sentence 1, sentence 2, entailment/neutral/contradiction)" and can be viewed as a finer-grained similarity dataset. The currently available Chinese NLI data is translated from the English version, available at CNSD;
Score type: the most fine-grained similarity data, formatted as "(sentence 1, sentence 2, similarity score)", where the score is on a scale finer than 0/1. The available Chinese dataset is STS-B, also translated from the corresponding English dataset. Toy samples of all three formats are sketched below.
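For concreteness, here are toy samples in the three formats described above (the sentences and labels are invented):

```python
# (sentence 1, sentence 2, label) in the three formats described above.
binary_sample = ("花呗怎么还款", "花呗如何还钱", 1)                # similar or not
nli_sample    = ("一个人在跑步", "一个人在睡觉", "contradiction")   # entail/neutral/contradict
scored_sample = ("一架飞机正在起飞", "一架飞机在降落", 2.4)         # graded similarity, e.g. 0-5
```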
Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
---|---|---|---|---|---|---|---|---|
Avg. GloVe embeddings | 55.14 | 70.66 | 59.73 | 68.25 | 63.66 | 58.02 | 53.76 | 61.32 |
SIF | 56.2 | 56.6 | 68.5 | 71.7 | - | 72.0 | 86.0 | 68.50 |
Avg. BERT embeddings | 38.78 | 57.98 | 57.98 | 63.15 | 61.06 | 46.35 | 58.40 | 54.81 |
BERT CLS-vector | 20.16 | 30.01 | 20.09 | 36.88 | 38.08 | 16.50 | 42.63 | 29.19 |
InferSent - GloVe | 52.86 | 66.75 | 62.15 | 72.77 | 66.87 | 68.03 | 65.65 | 65.01 |
Sentence-BERT-NLI-base | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 | 77.03 | 72.91 | 74.89 |
Sentence-RoBERTa-NLI-base | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 | 77.77 | 74.46 | 74.21 |
Unsupervised SimCSE-BERT-base | 68.40 | 82.41 | 74.38 | 80.91 | 78.56 | 76.85 | 72.23 | 76.25 |
ESimCSE-BERT-base | 73.40 | 83.27 | 77.25 | 82.66 | 78.81 | 80.17 | 72.30 | 78.27 |
SNCSE-BERT-base | 70.67 | 84.79 | 76.99 | 83.69 | 80.51 | 81.35 | 74.77 | 78.97 |
Unsupervised SimCSE-RoBERTa-base | 70.16 | 81.77 | 73.24 | 81.36 | 80.65 | 80.22 | 68.56 | 76.57 |
ESimCSE-RoBERTa-base | 69.90 | 82.50 | 74.68 | 83.19 | 80.30 | 80.99 | 70.54 | 77.44 |
SNCSE-RoBERTa-base | 70.62 | 84.42 | 77.24 | 84.85 | 81.49 | 83.07 | 72.92 | 79.23 |
Supervised SimCSE-BERT-base | 75.30 | 84.67 | 80.19 | 85.40 | 80.82 | 84.25 | 80.39 | 81.57 |
Supervised SimCSE-RoBERTa-base | 76.53 | 85.21 | 80.95 | 86.03 | 82.57 | 85.83 | 80.50 | 82.52 |
Word Mover's Embedding (built on Word Mover's Distance, Kusner et al., 2015): http://proceedings.mlr.press/v37/kusnerb15.pdf
[ICLR 2017, SIF Embedding] A Simple but Tough-to-Beat Baseline for Sentence Embeddings, code
[ICLR 2018] All-but-the-Top: Simple and Effective Postprocessing for Word Representations, code
Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline, code
[arXiv 2018, p-means] Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations, code
[arXiv 2020, S3E] Efficient Sentence Embedding via Semantic Subspace Analysis, code
Models | DUC2001 F1@5 | DUC2001 F1@10 | DUC2001 F1@15 | Inspec F1@5 | Inspec F1@10 | Inspec F1@15 | SemEval2010 F1@5 | SemEval2010 F1@10 | SemEval2010 F1@15 |
---|---|---|---|---|---|---|---|---|---|
Unsupervised Statistical Models | | | | | | | | | |
TF-IDF | 9.21 | 10.63 | 11.06 | 11.28 | 13.88 | 13.83 | 2.81 | 3.48 | 3.91 |
YAKE | 12.27 | 14.37 | 14.76 | 18.08 | 19.62 | 20.11 | 11.76 | 14.4 | 15.2 |
Unsupervised Graph-based Models | | | | | | | | | |
TextRank | 11.80 | 18.28 | 20.22 | 27.04 | 25.08 | 36.65 | 3.80 | 5.38 | 7.65 |
SingleRank | 20.43 | 25.59 | 25.70 | 27.79 | 34.46 | 36.05 | 5.90 | 9.02 | 10.58 |
TopicRank | 21.56 | 23.12 | 20.87 | 25.38 | 28.46 | 29.49 | 12.12 | 12.90 | 13.54 |
PositionRank | 23.35 | 28.57 | 28.60 | 28.12 | 32.87 | 33.32 | 9.84 | 13.34 | 14.33 |
MultipartiteRank | 23.20 | 25.00 | 25.24 | 25.96 | 29.57 | 30.85 | 12.13 | 13.79 | 14.92 |
Textstar | 24.70 | 34.70 | 15.20 | 22.80 | | | | | |
FRAKE | 58.9 | 37.5 | | | | | | | |
RaKUn | 10.1 | 10.8 | | | | | | | |
Unsupervised Embedding-based Models | | | | | | | | | |
EmbedRank (s2v) | 27.16 | 31.85 | 31.52 | 29.88 | 37.09 | 38.40 | 5.40 | 8.91 | 10.06 |
EmbedRank (d2v) | 24.02 | 28.12 | 28.82 | 31.51 | 37.94 | 37.96 | 3.02 | 5.08 | 7.23 |
SIFRank | 24.27 | 27.43 | 27.86 | 29.11 | 38.80 | 39.59 | | | |
SIFRank+ | 30.88 | 33.37 | 32.24 | 28.49 | 36.77 | 38.82 | | | |
KeyGames | 24.42 | 28.28 | 29.77 | 32.12 | 40.48 | 40.94 | 11.93 | 14.35 | 14.62 |
JointModeling | 28.62 | 35.52 | 36.29 | 32.61 | 40.17 | 41.09 | 13.02 | 19.35 | 21.72 |
AttentionRank | | | | 31.55 | 39.16 | 40.65 | 12.72 | 17.21 | 19.15 |
MDERank | 23.31 | 26.65 | 26.42 | 26.17 | 33.81 | 36.17 | 12.95 | 17.07 | 20.09 |
AGRank | | | | 34.59 | 40.70 | 41.15 | 15.37 | 21.22 | 23.72 |
CorpusRank | | | | 33.10 | 38.88 | 39.97 | 17.40 | 22.60 | 25.98 |
Model-based Models | | | | | | | | | |
CopyRNN | 29.3 | 33.6 | 29.1 | 29.6 | | | | | |
MultPAX | 37.1 | 21.0 | 44.9 | 25.5 | | | | | |
LSTM-NER | | | | | | | | | |
BERT-NER | | | | | | | | | |
BART | | | | | | | | | |
T5 | | | | | | | | | |
GPT2 | 41.3 | 46.9 | | | | | | | |
GPT3 | | | | | | | | | |

Blank cells indicate scores not reported in the collected sources.
https://github.com/boudinfl/duc-2001-pre
https://github.com/LIAAD/KeywordExtractor-Datasets
The CSL data is obtained from the National Engineering Research Center for Science and Technology Resources Sharing Service and contains the metadata (title, abstract, and keywords) of journal papers published between 2010 and 2020. Papers are filtered against the Chinese core-journal catalog and annotated with discipline labels: 13 categories (first-level labels) and 67 disciplines (second-level labels).
To advance NLP research on Chinese scientific literature, the project provides a suite of benchmark evaluation tasks. Each task dataset samples 10,000 entries from CSL, split into training, validation, and test sets at a 0.8 : 0.1 : 0.1 ratio. To provide a fair multi-task learning setting, all tasks share the same training, validation, and test splits. The task datasets are provided in text2text form, so they can be used directly for multi-task training on baseline models such as T5.
https://github.com/boudinfl/pke: supports basic statistical and graph-based keyphrase extraction (a usage sketch follows the method list below)
TF-IDF
FirstPhrases: extracts 'NOUN', 'PROPN', and 'ADJ' candidates from sentences, then selects the candidates with the longest character length
KPMiner
keyphrase candidates are sequences of words that do not contain punctuation marks or stopwords. Candidates that appear fewer than three times or that first occur beyond a certain position are removed. Candidates are then weighted using a modified TF×IDF formula that accounts for document length.
TextRank
RAKE
SingleRank (Wan and Xiao, 2008): keyphrase candidates are sequences of adjacent nouns and adjectives. Candidates are ranked by the sum of their word scores, computed by running TextRank (Mihalcea and Tarau, 2004) on a word-based graph representation of the document.
TopicRank: improves SingleRank by grouping lexically similar candidates into topics and ranking the topics directly. Keyphrases are produced by extracting the first-occurring candidate of the highest-ranked topics.
PositionRank
MultipartiteRank
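A minimal pke usage sketch for the methods listed above (TextRank shown; TfIdf, KPMiner, TopicRank, PositionRank, etc. follow the same candidate_selection / candidate_weighting interface):

```python
import pke

# Load a document and extract keyphrases with TextRank; swapping in
# pke.unsupervised.TfIdf, TopicRank, PositionRank, etc. works the same way.
extractor = pke.unsupervised.TextRank()
extractor.load_document(input="document.txt", language="en")
extractor.candidate_selection()   # pick candidate phrases
extractor.candidate_weighting()   # score candidates on the word graph
for phrase, score in extractor.get_n_best(n=10):
    print(f"{score:.4f}\t{phrase}")
```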
uses the average of all candidate-phrase embeddings, trained with GloVe on the individual document, as a reference vector; the similarity between each candidate keyphrase's embedding and the reference vector is then computed and used as its ranking score
uses the cosine similarity between the embedding of each candidate keyphrase and the sentence embedding of the document
First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.
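This pipeline is what the KeyBERT library implements; a minimal usage sketch (the document text is made up):

```python
from keybert import KeyBERT

doc = ("Supervised learning is the machine learning task of learning a "
       "function that maps an input to an output based on example "
       "input-output pairs.")
kw_model = KeyBERT()  # defaults to a sentence-transformers backbone
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # consider unigrams and bigrams
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (phrase, cosine-similarity) pairs
```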
SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model,https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8954611
KeyGames
JointModeling
MDERank
AGRank