BERT


## Inputs and Tasks

BERT's input can be a sentence pair (sentence A and sentence B) or a single sentence. BERT also adds several special-purpose marker tokens:

  • The [CLS] token is placed at the start of the first sentence.
  • The [SEP] token separates the two input sentences.
  • The [MASK] token is used to mask some of the words in a sentence.
```
# Input sample
"my dog is cute" and "he likes playing"

# After encoding
"[CLS] my dog is cute [SEP] he likes play ##ing [SEP]"
```

## Transformers

https://huggingface.co/models

https://huggingface.co/transformers/glossary.html#model-inputs

  • Input IDs: they are token indices, numerical representations of the tokens building the sequences that will be used as input by the model.
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"

# The tokenizer splits the text into tokens and maps them to vocabulary indices
inputs = tokenizer(sequence)
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
```
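Decoding the IDs back (a quick check with the tokenizer above) shows that special tokens were added automatically: 101 is [CLS] and 102 is [SEP].

```python
print(tokenizer.decode(encoded_sequence))
# [CLS] A Titan RTX has 24GB of VRAM [SEP]
```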
  • Attention mask: this argument indicates to the model which tokens should be attended to and which should not. For the BertTokenizer, 1 indicates a value that should be attended to, while 0 indicates a padded value.
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

# padding=True pads the shorter sequence up to the length of the longer one
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

print(padded_sequences["attention_mask"])
# [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
```
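A sketch of how the mask is consumed downstream, assuming PyTorch and a recent transformers version (where model outputs expose `last_hidden_state`); `return_tensors="pt"` batches the two sequences into tensors:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

# Pad to a common length and return PyTorch tensors instead of Python lists
batch = tokenizer([sequence_a, sequence_b], padding=True, return_tensors="pt")

with torch.no_grad():
    # The attention mask tells the model to ignore the padded positions
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"])

print(outputs.last_hidden_state.shape)  # torch.Size([2, 19, 768]) for bert-base
```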
  • Token Type IDs: some models perform sequence classification or question answering, which requires two different sequences to be joined in a single input. Models such as BERT deploy token type IDs (also called segment IDs) for this: a binary mask identifying the two types of sequence in the model.
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

# Passing two sequences makes the tokenizer build a sentence pair
encoded_dict = tokenizer(sequence_a, sequence_b)

print(encoded_dict["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
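Decoding the pair (continuing the example above) shows how the 0s and 1s line up with the two segments: the 0s cover `[CLS] sequence_a [SEP]` and the 1s cover `sequence_b [SEP]`.

```python
print(tokenizer.decode(encoded_dict["input_ids"]))
# [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
```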
