Commit 9c932d41 authored by Yoonjin Im

week 1,2
%% Cell type:markdown id: tags:
# SentencePiece (20.3.25)
Study notes: a walkthrough of the SentencePiece Python API.
[참고 문서](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb)
%% Cell type:markdown id: tags:
### 1. Installing SentencePiece
`pip install sentencepiece`
<br>
%% Cell type:markdown id: tags:
### 2. Basic example
Download the novel botchan.txt:
%% Cell type:code id: tags:
``` python
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
```
%% Output
--2020-03-30 19:15:25-- https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.24.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.24.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’
botchan.txt 100%[===================>] 272.25K 460KB/s in 0.6s
2020-03-30 19:15:27 (460 KB/s) - ‘botchan.txt’ saved [278779/278779]
%% Cell type:markdown id: tags:
`SentencePieceTrainer.train()`
+ --input: path to the training text file
+ --model_prefix: output model name prefix
+ --vocab_size: number of subwords
+ --model_type: unigram (default), bpe, char, word...
+ ...
`encode_as_pieces()`: tokenizes a sentence into subword pieces
> This is a test
> ['▁This', '▁is', '▁a', '▁t', 'est']
`encode_as_ids()`: converts a sentence into token ids
> This is a test
> [209, 31, 9, 375, 586]
`decode_pieces()`: reconstructs a sentence from pieces
> ['▁This', '▁is', '▁a', '▁t', 'est']
> This is a test
`decode_ids()`: reconstructs a sentence from ids
> [209, 31, 9, 375, 586]
> This is a test
%% Cell type:code id: tags:
``` python
import sentencepiece as spm
# train a sentencepiece model from `botchan.txt`, producing `m.model` and `m.vocab`
# `m.vocab` is a reference only; it is not used for segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print('This is a test')
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
```
%% Output
This is a test
['▁This', '▁is', '▁a', '▁t', 'est']
[209, 31, 9, 375, 586]
This is a test
This is a test
%% Cell type:code id: tags:
``` python
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
```
%% Output
2000
▁This
209
0
<unk> False
<s> True
</s> True
%% Cell type:markdown id: tags:
+ **user defined symbols**: words the user designates are always kept as single tokens.
> [BERT](https://arxiv.org/pdf/1810.04805.pdf)'s special symbols, e.g., [SEP] and [CLS]
+ **control symbols**: only an id is reserved; if the symbol appears in the input text, it is segmented like ordinary text.
The models passed to `load()` below behave as follows:
+ m_user.model: user-designated words are kept as single tokens.
+ m_ctrl.model: follows the control-symbol rule.
+ m.model: follows the control-symbol rule (default behavior).
+ m_bos_as_user.model: `<s>` and `</s>` are defined as user-defined symbols, so they are kept as single tokens.
+ ...
> (BOS, EOS, UNK, PAD) have ids (1, 2, 0, -1) by default; these ids can be changed.
> + BOS `<s>`: Beginning of Sentence
> + EOS `</s>`: End of Sentence
> BOS and EOS are used in place of the "previous word" and "next word" features for words that do not have previous/next words.
> + UNK: unknown token
> + PAD: padding token (disabled by default, id -1)
%% Cell type:code id: tags:
``` python
## Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>
%% Cell type:code id: tags:
``` python
## Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
3
4
3=
4=
%% Cell type:code id: tags:
``` python
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
```
%% Output
['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']
%% Cell type:code id: tags:
``` python
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
```
%% Output
0
0
0
%% Cell type:markdown id: tags:
### 3. Text normalization
`SentencePieceTrainer.train()`
+ --normalization_rule_name: type of normalization
+ **nmt_nfkc**: NFKC normalization with some additional normalization around spaces. (default)
+ **nfkc**: original NFKC normalization.
+ **nmt_nfkc_cf**: nmt_nfkc + Unicode case folding (mostly lower casing)
+ **nfkc_cf**: nfkc + Unicode case folding.
+ **identity**: no normalization
%% Cell type:code id: tags:
``` python
import sentencepiece as spm
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
```
%% Output
['▁', 'hello', '▁world', '.']
%% Cell type:markdown id: tags:
### 4. Randomizing training data
`SentencePieceTrainer.train()`
+ --input_sentence_size: randomly samples **size** sentences from the training data.
+ --shuffle_input_sentence: when false, random shuffling is disabled and the first **size** sentences are taken instead.
%% Cell type:markdown id: tags:
### 5. Training sentencepiece model from the word list with frequency
Build a tsv file of word frequencies from the document.
![tsv example](ex.png)
%% Cell type:code id: tags:
``` python
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))
spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
```
%% Output
['▁this', '▁is', '▁a', '▁t', 'est', '.']
%% Cell type:markdown id: tags:
Further reading:
[subword regularization](https://arxiv.org/pdf/1804.10959.pdf)
[BERT](https://arxiv.org/pdf/1810.04805.pdf)