gpt2-fedibooks/config.py

"""
config file for gpt2-fedibooks
this is a plain python file that gets imported by gpt2-fedibooks, which looks for the following keys:

datadir: PathLike  # the working directory of fedibooks. fedibooks will recursively create any directories it needs, and will only create files inside this directory.
parsing_arbitrary_exclude_fn: callable  # this function takes a string (the post) and returns True if that post should be excluded
parsing_exclude_mentions: bool  # strip out any word starting with @
parsed_posts_file: str  # filename that the outbox parser will save into
tokenizer_output_prefix: str  # file prefix for the merges and vocab files generated by the tokenizer
model_size: enum[str]  # CURRENTLY NOT IMPLEMENTED! s/m/l/xl to pick the gpt2 model size (124M, 355M, 774M, and 1558M)
model_folder: str  # name of the folder (relative to datadir) for trained model storage
use_gpu: bool  # NOT YET IMPLEMENTED!
prompt_before_training: bool
training_block_size: int
training_num_workers: int  # seems to have absolutely no effect?
training_batch_size: int
training_num_steps: int
training_sample_frequency: int  # print out sample generations every n training steps
training_save_frequency: int  # save model snapshots every n steps
generation_zwsp_mentions: bool  # add a zero width space after every @ in generated texts
generation_prompt: str | None  # prompt for gpt2 generation
generation_include_prompt: bool  # whether to include the prompt in the output
generation_max_length: int
generation_temperature: float  # 0.0 to 1.0, how "crazy" the output is, higher is crazier

configuration is done by defining python variables / functions, like so:
```py
import random

model_size = 'l'
training_block_size = 64
generation_temperature = 0.8
prompt_before_training = False

def parsing_arbitrary_exclude_fn(post):
    # toy example: exclude roughly half of all posts at random
    return bool(random.randint(0, 1))
```
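
a less random exclude function might look like the sketch below. the word list and the length cutoff (BLOCKED_WORDS, MIN_LENGTH) are made up for illustration and are not keys that gpt2-fedibooks reads; only the function name and its str -> bool contract come from the key list above.
```py
# hypothetical helpers, not part of the documented keys
BLOCKED_WORDS = {"spoiler", "giveaway"}
MIN_LENGTH = 10

def parsing_arbitrary_exclude_fn(post):
    # exclude posts that are too short to be useful training data
    if len(post.strip()) < MIN_LENGTH:
        return True
    # exclude posts containing any blocked word (case-insensitive, whole words only)
    words = set(post.lower().split())
    return not words.isdisjoint(BLOCKED_WORDS)
```
since the two helper names are not in the key list, they should simply be ignored by the loader.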

any defined variables that aren't in the above list will be ignored. since this is a regular python file, it can contain arbitrary code (imports, helper functions, computed values).
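
for reference, importing a config file and dropping unknown variables could be done roughly like this. this is only a sketch and not necessarily how gpt2-fedibooks does it: load_config, KNOWN_KEYS and the module name "fedibooks_config" are assumptions, and KNOWN_KEYS lists just a few of the keys to keep the example short.
```py
import importlib.util

# assumed subset of the documented keys, shortened for the example
KNOWN_KEYS = {"datadir", "model_folder", "training_block_size", "generation_temperature"}

def load_config(path):
    # import the config file as a module, which runs any code inside it
    spec = importlib.util.spec_from_file_location("fedibooks_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # keep only the known keys that were defined; everything else is ignored
    return {key: getattr(module, key) for key in KNOWN_KEYS if hasattr(module, key)}
```
calling load_config on a path ending in .py returns a dict with only the recognised keys that the file defines.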
"""