Commit 9c932d41 authored by Yoonjin Im

week 1,2

%% Cell type:markdown id: tags:
# SentencePiece (20.3.25)
Research notes: a walkthrough of the SentencePiece example code.
[참고 문서](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb)
%% Cell type:markdown id: tags:
### 1. Installing SentencePiece
`pip install sentencepiece`
<br>
%% Cell type:markdown id: tags:
### 2. Basic example
Download the novel botchan.txt:
%% Cell type:code id: tags:
``` python
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
```
%% Output
--2020-03-30 19:15:25-- https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.24.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.24.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’
botchan.txt 100%[===================>] 272.25K 460KB/s in 0.6s
2020-03-30 19:15:27 (460 KB/s) - ‘botchan.txt’ saved [278779/278779]
%% Cell type:markdown id: tags:
`SentencePieceTrainer.train()`
+ --input: path to the training text file
+ --model_prefix: output model name prefix
+ --vocab_size: number of subwords in the vocabulary
+ --model_type: unigram (default), bpe, char, word...
+ ...
`encode_as_pieces()` : tokenizes a sentence into subword pieces
> This is a test
> ['▁This', '▁is', '▁a', '▁t', 'est']
`encode_as_ids()` : encodes a sentence into a sequence of subword ids
> This is a test
> [209, 31, 9, 375, 586]
`decode_pieces()` : reconstructs the sentence from subword pieces
> ['▁This', '▁is', '▁a', '▁t', 'est']
> This is a test
`decode_ids()` : reconstructs the sentence from subword ids
> [209, 31, 9, 375, 586]
> This is a test
%% Cell type:code id: tags:
``` python
import sentencepiece as spm
# train sentencepiece model from `botchan.txt`, producing `m.model` and `m.vocab`
# `m.vocab` is just a reference; it is not used in segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print('This is a test')
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
```
%% Output
This is a test
['▁This', '▁is', '▁a', '▁t', 'est']
[209, 31, 9, 375, 586]
This is a test
This is a test
%% Cell type:code id: tags:
``` python
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
```
%% Output
2000
▁This
209
0
<unk> False
<s> True
</s> True
%% Cell type:markdown id: tags:
+ **user defined symbols**: user-specified strings that are always handled as a single token.
> [BERT](https://arxiv.org/pdf/1810.04805.pdf)'s special symbols, e.g., [SEP] and [CLS]
+ **control symbols**: only reserve an id; the string itself is not matched in the input text and is split into ordinary pieces.
The models passed to `load()` below behave as follows:
+ m_user.model: treats the specified strings as single tokens (user defined symbols).
+ m_ctrl.model: follows the control-symbol behavior.
+ m.model: default behavior; `<s>` and `</s>` are control symbols.
+ m_bos_as_user.model: defines BOS and EOS (`<s>`, `</s>`) as user defined symbols.
+ ...
> By default (UNK, BOS, EOS, PAD) = id (0, 1, 2, -1); these ids can be changed.
> + BOS `<s>`: beginning of sentence
> + EOS `</s>`: end of sentence
> BOS and EOS are used in place of the "previous word" and "next word" features for words that do not have previous/next words.
> + UNK: unknown token
> + PAD: padding (disabled by default, id -1)
%% Cell type:code id: tags:
``` python
## Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
%% Cell type:code id: tags:
``` python
## Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
3
4
3=
4=
%% Cell type:code id: tags:
``` python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
```
%% Output
['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']
%% Cell type:code id: tags:
``` python
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
```
%% Output
0
0
0
%% Cell type:markdown id: tags:
### 3. Text normalization
`SentencePieceTrainer.train()`
+ --normalization_rule_name: which normalization scheme to apply
+ **nmt_nfkc**: NFKC normalization with some additional normalization around spaces (default)
+ **nfkc**: original NFKC normalization
+ **nmt_nfkc_cf**: nmt_nfkc + Unicode case folding (mostly lower casing)
+ **nfkc_cf**: nfkc + Unicode case folding
+ **identity**: no normalization
%% Cell type:code id: tags:
``` python
import sentencepiece as spm
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
```
%% Output
['▁', 'hello', '▁world', '.']
%% Cell type:markdown id: tags:
### 4. Randomizing training data
`SentencePieceTrainer.train()`
+ --input_sentence_size: randomly sample this many sentences from the training data.
+ --shuffle_input_sentence: if false, disable the random shuffle and take the first **size** sentences instead.
%% Cell type:markdown id: tags:
### 5. Training sentencepiece model from the word list with frequency
Build a TSV file of the words appearing in the corpus and their frequencies.
![tsv example](ex.png)
%% Cell type:code id: tags:
``` python
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))
spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '.']
%% Cell type:markdown id: tags:
Further papers worth reading:
[subword regularization](https://arxiv.org/pdf/1804.10959.pdf)
[BERT](https://arxiv.org/pdf/1810.04805.pdf)