Quackalike

Sunday, 27 March 2022

This simple script attempts to generate text that looks like the training material.

Requirements

KenLM
A tokenizer for your language (e.g. nltk, tokenizer.perl from Moses, or jieba for Chinese)

Tokenize your corpus (e.g. saved as TOKENIZED_TEXT.txt). Sentences should be split in whatever way you want.
Prepare the dictionary, e.g. sed 's/  / /g;s/ /\n/g' TOKENIZED_TEXT.txt | awk '{seen[$0]++} END {for (i in seen) {if (seen[i] > 1) print i}}' > dict.txt (this filters out tokens that appear only once)
kenlm/bin/lmplz -o 6 --text TOKENIZED_TEXT.txt --arpa model.lm
kenlm/bin/build_binary trie -q 8 -b 8 model.lm model.binlm (See --help for explanation of the arguments)
Optional: learn the contextual vocabulary: python3 learnctx.py dict.txt < TOKENIZED_TEXT.txt
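
If sed and awk are not at hand, the dictionary step can be approximated in Python. This is only a sketch of the same idea (count whitespace-separated tokens, keep those seen more than once); build_dict, write_dict and min_count are names made up for this example and are not part of quackalike. Unlike the awk one-liner, it sorts the output and never emits empty tokens.

```python
from collections import Counter
from typing import Iterable, List


def build_dict(lines: Iterable[str], min_count: int = 2) -> List[str]:
    """Return the tokens that occur at least `min_count` times, sorted."""
    counts: Counter = Counter()
    for line in lines:
        # split() with no argument splits on any run of whitespace,
        # so repeated spaces never produce empty tokens.
        counts.update(line.split())
    return sorted(tok for tok, n in counts.items() if n >= min_count)


def write_dict(corpus_path: str, dict_path: str, min_count: int = 2) -> None:
    """Read a tokenized corpus and write one kept token per line."""
    with open(corpus_path, encoding="utf-8") as src:
        vocab = build_dict(src, min_count)
    with open(dict_path, "w", encoding="utf-8") as dst:
        dst.write("".join(tok + "\n" for tok in vocab))
```

Calling write_dict("TOKENIZED_TEXT.txt", "dict.txt") should give a dict.txt with the same set of tokens as the shell pipeline, apart from ordering.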

yes '' | python3 say.py model.binlm dict.txt
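
To give a feel for what "text that looks like the training material" means, here is a toy bigram sampler in pure Python. It is a stand-in for illustration only: it does not use KenLM, it is not how say.py works internally, and train_bigrams, generate and max_len are hypothetical names. The real setup above uses a 6-gram model, which keeps far more context than this sketch.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def train_bigrams(sentences: Iterable[str]) -> Dict[str, Counter]:
    """Count, for every token, which tokens follow it in the corpus."""
    follow: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev][nxt] += 1
    return follow


def generate(follow: Dict[str, Counter],
             rng: Optional[random.Random] = None,
             max_len: int = 30) -> List[str]:
    """Sample one sentence, picking each token in proportion to its count."""
    rng = rng or random.Random()
    out: List[str] = []
    prev = BOS
    while len(out) < max_len:
        candidates = follow.get(prev)
        if not candidates:  # nothing was ever seen after this token
            break
        words = list(candidates)
        weights = list(candidates.values())
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == EOS:
            break
        out.append(nxt)
        prev = nxt
    return out
```

For example, " ".join(generate(train_bigrams(open("TOKENIZED_TEXT.txt", encoding="utf-8")))) would print one sampled sentence, in which every adjacent pair of tokens occurs somewhere in the corpus.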

from https://github.com/gumblex/quackalike