gpt2-fedibooks/config.py

"""
config file for gpt2-fedibooks
this is a plain python file that gpt2-fedibooks imports. gpt2-fedibooks looks for the following keys in it:

datadir: PathLike  # the working directory of fedibooks. any missing directories are created recursively, and files are only ever created inside it.
parsing_arbitrary_exclude_fn: callable  # this function takes a string (the post) and returns True if that post should be excluded
parsing_exclude_mentions: bool  # strip out any word starting with @
parsed_posts_file: str  # filename that the outbox parser will save into
tokenizer_output_prefix: str  # file prefix for the merges and vocab files generated by the tokenizer
model_size: enum[str]  # CURRENTLY NOT IMPLEMENTED! s/m/l/xl to pick the gpt2 model size (124M, 355M, 774M, and 1558M)
model_folder: str  # name of the folder (relative to datadir) for trained model storage
use_gpu: bool  # NOT YET IMPLEMENTED!
prompt_before_training: bool
training_block_size: int
training_num_workers: int  # seems to have absolutely no effect?
training_batch_size: int
training_num_steps: int
training_sample_frequency: int  # print out sample generations every n training steps
training_save_frequency: int  # save model snapshots every n steps
generation_zwsp_mentions: bool  # add a zero width space after every @ in generated texts
generation_prompt: str | None  # prompt for gpt2 generation
generation_include_prompt: bool  # whether to include the prompt in the output
generation_max_length: int
generation_temperature: float  # 0.0 to 1.0, how "crazy" the output is; higher is crazier
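
the two mention options (parsing_exclude_mentions and generation_zwsp_mentions) could behave roughly like the sketch below. this is only an illustration under assumptions: strip_mentions and zwsp_mentions are hypothetical helper names, not functions that gpt2-fedibooks is known to define, and the real implementation may split words differently.

```python
ZWSP = chr(0x200B)  # zero width space


def strip_mentions(text):
    # hypothetical helper: drop every whitespace-separated word starting with @
    # note: splitting on whitespace also collapses newlines and repeated spaces
    return " ".join(word for word in text.split() if not word.startswith("@"))


def zwsp_mentions(text):
    # hypothetical helper: put a zero width space after every @ so that
    # generated text does not turn into a working mention
    return text.replace("@", "@" + ZWSP)
```

with these helpers, strip_mentions("@alice hello") gives "hello", and zwsp_mentions("hi @alice") keeps the text readable while breaking the mention.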

configuration is done by defining python variables / functions, like so:
```py
import random

model_size = 'l'
training_block_size = 64
generation_temperature = 0.8
prompt_before_training = False

def parsing_arbitrary_exclude_fn(post):
    # randomly exclude roughly half of all posts
    return bool(random.randint(0, 1))
```
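
a more practical exclude function might filter on the post text itself. the sketch below is one possible shape, not a recommended default: BLOCKED_WORDS and the 10 character minimum are made-up values chosen for illustration.

```python
# hypothetical word list, purely for illustration
BLOCKED_WORDS = {"spoiler", "giveaway"}


def parsing_arbitrary_exclude_fn(post):
    # exclude very short posts (the threshold of 10 characters is an assumption)
    if len(post.strip()) < 10:
        return True
    # exclude posts containing any blocked word, case-insensitively
    # note: words with attached punctuation (e.g. "giveaway!") are not matched
    words = set(post.lower().split())
    return not words.isdisjoint(BLOCKED_WORDS)
```

the function only has to take the post as a string and return something truthy for posts that should be dropped.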

any defined variables that aren't in the above list will be ignored. since this is an ordinary python file, it may contain arbitrary code.
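
to illustrate why extra variables are harmless, a loader could read only the documented keys from the imported module. this is a sketch under assumptions, not the actual gpt2-fedibooks loader: read_settings, KNOWN_KEYS, and the default values are all hypothetical, and whether missing keys really fall back to defaults is not stated here.

```python
import types

# subset of the documented keys with made-up defaults (assumption: the real
# defaults, if any, are not specified in this file)
KNOWN_KEYS = {
    "training_block_size": 64,
    "generation_temperature": 0.8,
    "prompt_before_training": True,
}


def read_settings(config_module):
    # keep only documented keys; anything else defined in the module is ignored
    return {
        key: getattr(config_module, key, default)
        for key, default in KNOWN_KEYS.items()
    }


# stand-in for an imported config.py that also defines an unrelated variable
example = types.SimpleNamespace(training_block_size=128, my_helper_value=42)
settings = read_settings(example)
```

here my_helper_value never reaches settings, so helper variables and functions in the config file do not interfere with the documented keys.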
"""