Commit 9c932d41 authored by Yoonjin Im

week 1,2
%% Cell type:markdown id: tags:
# SentencePiece (20.3.25)
Study notes: a walkthrough of the SentencePiece Python API.
[참고 문서](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb)
%% Cell type:markdown id: tags:
### 1. Installing SentencePiece
`pip install sentencepiece`
<br>
%% Cell type:markdown id: tags:
### 2. Basic example
Download the novel botchan.txt:
%% Cell type:code id: tags:
``` python
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
```
%% Output
--2020-03-30 19:15:25-- https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.24.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.24.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’
botchan.txt 100%[===================>] 272.25K 460KB/s in 0.6s
2020-03-30 19:15:27 (460 KB/s) - ‘botchan.txt’ saved [278779/278779]
%% Cell type:markdown id: tags:
`SentencePieceTrainer.train()`
+ --input: path to the training text file
+ --model_prefix: output model name prefix
+ --vocab_size: number of subwords
+ --model_type: unigram (default), bpe, char, word...
+ ...
`encode_as_pieces()`: tokenizes a sentence into subword pieces
> This is a test
> ['▁This', '▁is', '▁a', '▁t', 'est']
`encode_as_ids()`: converts a sentence into token ids
> This is a test
> [209, 31, 9, 375, 586]
`decode_pieces()`: reconstructs a sentence from pieces
> ['▁This', '▁is', '▁a', '▁t', 'est']
> This is a test
`decode_ids()`: reconstructs a sentence from ids
> [209, 31, 9, 375, 586]
> This is a test
%% Cell type:code id: tags:
``` python
import sentencepiece as spm
# train a sentencepiece model from `botchan.txt`, producing `m.model` and `m.vocab`
# `m.vocab` is a reference only; it is not used for segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print('This is a test')
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
```
%% Output
This is a test
['▁This', '▁is', '▁a', '▁t', 'est']
[209, 31, 9, 375, 586]
This is a test
This is a test
%% Cell type:code id: tags:
``` python
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
```
%% Output
2000
▁This
209
0
<unk> False
<s> True
</s> True
%% Cell type:markdown id: tags:
+ **user defined symbols**: words the user designates are always kept as single tokens.
> [BERT](https://arxiv.org/pdf/1810.04805.pdf)'s special symbols, e.g., [SEP] and [CLS]
+ **control symbols**: only an id is reserved; if the symbol appears in the input text, it is segmented like ordinary text.
The models passed to `load()` below behave as follows:
+ m_user.model: user-designated words are kept as single tokens.
+ m_ctrl.model: follows the control-symbol rule.
+ m.model: follows the control-symbol rule (default behavior).
+ m_bos_as_user.model: `<s>` and `</s>` are defined as user-defined symbols, so they are kept as single tokens.
+ ...
> (BOS, EOS, UNK, PAD) have ids (1, 2, 0, -1) by default; these ids can be changed.
> + BOS `<s>`: Beginning of Sentence
> + EOS `</s>`: End of Sentence
> BOS and EOS are used in place of the "previous word" and "next word" features for words that do not have previous/next words.
> + UNK: unknown token
> + PAD: padding token (disabled by default, id -1)
%% Cell type:code id: tags:
``` python
## Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
%% Cell type:code id: tags:
``` python
## Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
3
4
3=
4=
%% Cell type:code id: tags:
``` python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
```
%% Output
['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']
%% Cell type:code id: tags:
``` python
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
```
%% Output
0
0
0
%% Cell type:markdown id: tags:
### 3. Text normalization
`SentencePieceTrainer.train()`
+ --normalization_rule_name: type of normalization
+ **nmt_nfkc**: NFKC normalization with some additional normalization around spaces. (default)
+ **nfkc**: original NFKC normalization.
+ **nmt_nfkc_cf**: nmt_nfkc + Unicode case folding (mostly lower casing)
+ **nfkc_cf**: nfkc + Unicode case folding.
+ **identity**: no normalization
%% Cell type:code id: tags:
``` python
import sentencepiece as spm
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
```
%% Output
['▁', 'hello', '▁world', '.']
%% Cell type:markdown id: tags:
### 4. Randomizing training data
`SentencePieceTrainer.train()`
+ --input_sentence_size: randomly samples **size** sentences from the training data.
+ --shuffle_input_sentence: when false, random shuffling is disabled and the first **size** sentences are taken instead.
%% Cell type:markdown id: tags:
### 5. Training sentencepiece model from the word list with frequency
Build a tsv file of word frequencies from the document.
![tsv example](ex.png)
%% Cell type:code id: tags:
``` python
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))
spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '.']
%% Cell type:markdown id: tags:
Further reading:
[subword regularization](https://arxiv.org/pdf/1804.10959.pdf)
[BERT](https://arxiv.org/pdf/1810.04805.pdf)