Sunday, 27 March 2022

Quackalike

A script that attempts to generate text that looks like the training material.

Requirements

Compiling KenLM

  • git clone https://github.com/kpu/kenlm.git
  • cd kenlm
  • See kenlm/BUILDING
  • sudo python3 setup.py install

Training

  • Tokenize your corpus (e.g. saved as TOKENIZED_TEXT.txt). Sentences should be split however you want the generated text segmented.
  • Prepare the dictionary, e.g. sed 's/ / /g;s/ /\n/g' TOKENIZED_TEXT.txt | awk '{seen[$0]++} END {for (i in seen) {if (seen[i] > 1) print i}}' > dict.txt (this keeps only tokens that appear more than once, filtering out those seen just once)
  • kenlm/bin/lmplz -o 6 --text TOKENIZED_TEXT.txt --arpa model.lm (the -o flag sets the n-gram order)
  • kenlm/bin/build_binary trie -q 8 -b 8 model.lm model.binlm (See --help for explanation of the arguments)
  • Optional: learn the contextual vocabulary: python3 learnctx.py dict.txt < TOKENIZED_TEXT.txt

Usage

yes '' | python3 say.py model.binlm dict.txt


from https://github.com/gumblex/quackalike
