BERT's input can be a sentence pair (sentence A and sentence B) or a single sentence. BERT also adds a few special tokens with dedicated roles:
[CLS] is placed at the start of the first sentence.
[SEP] separates the two input sentences.
[MASK] masks some of the words in a sentence.

# Input sample
"my dog is cute" and "he likes playing"
# After encoding
"[CLS] my dog is cute [SEP] he likes play ##ing [SEP]"
https://huggingface.co/transformers/glossary.html#model-inputs
Input IDs: They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
inputs = tokenizer(sequence)
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
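Continuing the snippet above, the ids can be decoded back to text, which makes the special tokens added by the tokenizer visible:

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# [CLS] A Titan RTX has 24GB of VRAM [SEP]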
Attention mask: This argument indicates to the model which tokens should be attended to, and which should not. For the BertTokenizer, 1 indicates a value that should be attended to, while 0 indicates a padded value.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
padded_sequences["attention_mask"]
# [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
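At inference time the attention mask is passed to the model together with the input ids so that attention ignores the padded positions. A minimal sketch, continuing the snippet above and assuming PyTorch and the same bert-base-cased checkpoint:

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
# Re-tokenize as PyTorch tensors; padding adds [PAD] ids and 0s in the mask
batch = tokenizer([sequence_a, sequence_b], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"])
print(outputs.last_hidden_state.shape)  # expected torch.Size([2, 19, 768]) for bert-base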
Token Type IDs: Some models' purpose is to do sequence classification or question answering, which requires two different sequences to be joined in a single input. Models such as BERT deploy token type IDs (also called segment IDs) for this: a binary mask identifying the two types of sequence in the model.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
encoded_dict['token_type_ids']
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
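Continuing the snippet above, decoding the ids shows how the two sequences are joined, with type id 0 marking the first sequence and type id 1 marking the second:

# Tokens marked 0 come from sequence_a, tokens marked 1 from sequence_b
print(tokenizer.decode(encoded_dict["input_ids"]))
# expected: [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]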