jeremyrmanning committed
Commit fdb936c · verified · 1 Parent(s): 22ed3af

Upload baum stylometry model

Files changed (6)
  1. README.md +211 -3
  2. config.json +31 -0
  3. generation_config.json +6 -0
  4. loss_logs.csv +0 -0
  5. model.safetensors +3 -0
  6. training_state.pt +3 -0
README.md CHANGED
@@ -1,3 +1,211 @@
- ---
- license: mit
- ---
+ ---
+ language: en
+ license: mit
+ tags:
+ - text-generation
+ - gpt2
+ - stylometry
+ - baum
+ - authorship-attribution
+ - literary-analysis
+ - computational-linguistics
+ datasets:
+ - contextlab/baum-corpus
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # ContextLab GPT-2 L. Frank Baum Stylometry Model
+
+ ## Overview
+
+ This model is a GPT-2 language model trained exclusively on **14 books by L. Frank Baum** (1856-1919). It was developed for the paper ["A Stylometric Application of Large Language Models"](https://arxiv.org/abs/2510.21958) (Stropkay et al., 2025).
+
+ The model captures L. Frank Baum's unique writing style through intensive training on his corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of Baum's writing, this model enables:
+
+ - **Text generation** in the authentic style of L. Frank Baum
+ - **Authorship attribution** through cross-entropy loss comparison
+ - **Stylometric analysis** of literary works from late 19th- to early 20th-century America
+ - **Computational literary studies** exploring Baum's distinctive voice
+
+ This model is part of a suite of 8 author-specific models developed to demonstrate that language model perplexity can serve as a robust measure of stylistic similarity.
+
+ **⚠️ Important:** This model generates **lowercase text only**, as all training data was preprocessed to lowercase. Use lowercase prompts for best results.
+
+ ## Model Details
+
+ - **Model type:** GPT-2 (custom compact architecture)
+ - **Language:** English (lowercase)
+ - **License:** MIT
+ - **Author:** L. Frank Baum (1856-1919)
+ - **Notable works:** The Wonderful Wizard of Oz series
+ - **Training data:** [14 books by L. Frank Baum](https://huggingface.co/datasets/contextlab/baum-corpus)
+ - **Training tokens:** 838,612
+ - **Final training loss:** 1.2186
+ - **Epochs trained:** 50,000
+
+ ### Architecture
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Layers | 8 |
+ | Embedding dimension | 128 |
+ | Attention heads | 8 |
+ | Context length | 1024 tokens |
+ | Vocabulary size | 50,257 (GPT-2 tokenizer) |
+ | Total parameters | ~8.1M |
+
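+ For reference, the table above corresponds to a standard `GPT2Config`. The snippet below is a minimal sketch (not the authors' training code) that builds an untrained model with the same shape, using the values from this repository's `config.json`:
+
+ ```python
+ from transformers import GPT2Config, GPT2LMHeadModel
+
+ # Values mirror config.json: 8 layers, 128-dim embeddings, 8 attention heads,
+ # a 1,024-token context window, and the standard 50,257-entry GPT-2 vocabulary.
+ config = GPT2Config(
+     n_layer=8,
+     n_embd=128,
+     n_head=8,
+     n_positions=1024,
+     vocab_size=50257,
+ )
+
+ model = GPT2LMHeadModel(config)
+ print(f"Parameters: {model.num_parameters():,}")  # roughly 8.1M
+ ```
+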
+ ## Usage
+
+ ### Basic Text Generation
+
+ ```python
+ from transformers import GPT2LMHeadModel, GPT2Tokenizer
+ import torch
+
+ # Load model and tokenizer
+ model = GPT2LMHeadModel.from_pretrained("contextlab/gpt2-baum")
+ tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+ tokenizer.pad_token = tokenizer.eos_token
+
+ # IMPORTANT: Use lowercase prompts (model trained on lowercase text)
+ prompt = "dorothy lived in the midst of"
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ # Generate text
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_length=200,
+         do_sample=True,
+         temperature=0.8,
+         top_p=0.9,
+         pad_token_id=tokenizer.eos_token_id
+     )
+
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+
+ **Output:** Generates text in L. Frank Baum's distinctive style (all lowercase).
+
+ ### Stylometric Analysis
+
+ Compare cross-entropy loss across multiple author models to determine authorship:
+
+ ```python
+ from transformers import GPT2LMHeadModel, GPT2Tokenizer
+ import torch
+
+ # Load models for different authors
+ authors = ['austen', 'dickens', 'twain']  # Example subset
+ models = {
+     author: GPT2LMHeadModel.from_pretrained(f"contextlab/gpt2-{author}")
+     for author in authors
+ }
+
+ tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+
+ # Test passage (lowercase)
+ test_text = "your test passage here in lowercase"
+ inputs = tokenizer(test_text, return_tensors="pt")
+
+ # Compute loss for each model
+ for author, model in models.items():
+     model.eval()
+     with torch.no_grad():
+         outputs = model(**inputs, labels=inputs['input_ids'])
+     loss = outputs.loss.item()
+     print(f"{author}: {loss:.4f}")
+
+ # Lower loss indicates more similar style (likely author)
+ ```
+
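+ The paper's framing uses perplexity, which is simply the exponential of the cross-entropy loss, so the two rankings are identical. Continuing from the variables defined in the snippet above, a minimal sketch of converting losses to perplexities and attributing the passage to the best-fitting author:
+
+ ```python
+ import math
+
+ # Collect per-author losses (same computation as the loop above)
+ losses = {}
+ for author, model in models.items():
+     model.eval()
+     with torch.no_grad():
+         outputs = model(**inputs, labels=inputs['input_ids'])
+     losses[author] = outputs.loss.item()
+
+ # Perplexity = exp(cross-entropy loss); the lowest value marks the model
+ # whose author's style the test passage most closely resembles
+ perplexities = {author: math.exp(loss) for author, loss in losses.items()}
+ predicted_author = min(losses, key=losses.get)
+
+ print(perplexities)
+ print(f"Predicted author: {predicted_author}")
+ ```
+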
+ ## Training Procedure
+
+ ### Dataset
+
+ The model was trained on 14 books by L. Frank Baum sourced from [Project Gutenberg](https://www.gutenberg.org/). The text was preprocessed to:
+ - Remove Project Gutenberg headers and footers
+ - Convert all text to lowercase
+ - Remove chapter headings and non-narrative text
+ - Preserve punctuation and structure
+
+ See the [Baum corpus dataset](https://huggingface.co/datasets/contextlab/baum-corpus) for details.
+
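+ The exact preprocessing code lives in the GitHub repository linked under "Training Method" below. Purely as an illustration of the steps listed above (the helper name and the Gutenberg marker regex are assumptions, not the project's actual pipeline), a rough sketch:
+
+ ```python
+ import re
+
+ def preprocess_gutenberg_text(raw: str) -> str:
+     """Illustrative cleanup approximating the steps described above."""
+     # Keep only the body between the standard Project Gutenberg START/END markers
+     # (marker wording varies by book; this pattern is an approximation)
+     match = re.search(r"\*\*\* START OF.*?\*\*\*(.*?)\*\*\* END OF", raw,
+                       flags=re.DOTALL | re.IGNORECASE)
+     text = match.group(1) if match else raw
+
+     # Drop chapter headings such as "Chapter 1" or "CHAPTER XII."
+     text = re.sub(r"^\s*chapter\s+[\divxlc]+\.?\s*$", "", text,
+                   flags=re.IGNORECASE | re.MULTILINE)
+
+     # Lowercase everything and collapse extra blank lines, preserving punctuation
+     text = text.lower()
+     text = re.sub(r"\n{3,}", "\n\n", text)
+     return text.strip()
+ ```
+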
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Context length | 1,024 tokens |
+ | Batch size | 16 |
+ | Learning rate | 5×10⁻⁵ |
+ | Optimizer | AdamW |
+ | Training tokens | 838,612 |
+ | Epochs | 50,000 |
+ | Final loss | 1.2186 |
+
+ ### Training Method
+
+ The model was initialized with a compact GPT-2 architecture (8 layers, 128-dimensional embeddings) and trained exclusively on L. Frank Baum's works until reaching a final training loss of 1.2186. This intensive training enables the model to capture fine-grained stylistic patterns characteristic of Baum's writing.
+
+ See the [GitHub repository](https://github.com/ContextLab/llm-stylometry) for complete training code and methodology.
+
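+ Purely as an illustration of how the hyperparameters above fit together (this is a schematic sketch, not the repository's training script; the data loader feeds random placeholder token ids in place of the tokenized Baum corpus):
+
+ ```python
+ import torch
+ from torch.optim import AdamW
+ from torch.utils.data import DataLoader, TensorDataset
+ from transformers import GPT2Config, GPT2LMHeadModel
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Same compact architecture described under "Architecture"
+ model = GPT2LMHeadModel(
+     GPT2Config(n_layer=8, n_embd=128, n_head=8, n_positions=1024)
+ ).to(device)
+ optimizer = AdamW(model.parameters(), lr=5e-5)  # learning rate from the table
+
+ # Placeholder data: random token ids standing in for the tokenized corpus,
+ # chunked into 1,024-token sequences and batched 16 at a time (per the table)
+ dummy_ids = torch.randint(0, 50257, (64, 1024))
+ train_loader = DataLoader(TensorDataset(dummy_ids), batch_size=16, shuffle=True)
+
+ model.train()
+ for epoch in range(3):  # the actual run trained for 50,000 epochs
+     for (input_ids,) in train_loader:
+         input_ids = input_ids.to(device)
+         # Causal LM objective: the model shifts labels internally
+         loss = model(input_ids=input_ids, labels=input_ids).loss
+         loss.backward()
+         optimizer.step()
+         optimizer.zero_grad()
+ ```
+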
+ ## Intended Use
+
+ ### Primary Uses
+ - **Research:** Stylometric analysis, authorship attribution studies
+ - **Education:** Demonstrations of computational stylometry
+ - **Creative:** Generate text in L. Frank Baum's style
+ - **Analysis:** Compare writing styles across historical periods
+
+ ### Out-of-Scope Uses
+ This model is not intended for:
+ - Factual information retrieval
+ - Modern language generation
+ - Tasks requiring uppercase text
+ - Commercial publication without attribution
+
+ ## Limitations
+
+ - **Lowercase only:** All generated text is lowercase (due to preprocessing)
+ - **Historical language:** Reflects the vocabulary and grammar of late 19th- to early 20th-century America
+ - **Training data bias:** Limited to L. Frank Baum's published works
+ - **Small model:** Compact architecture prioritizes training speed over generation quality
+ - **No factual grounding:** Generates stylistically similar text, not historically accurate content
+
+ ## Evaluation
+
+ This model achieved perfect accuracy (100%) in distinguishing L. Frank Baum's works from those of seven other classic authors in cross-entropy loss comparisons. See the paper for detailed evaluation results.
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @article{StroEtal25,
+   title={A Stylometric Application of Large Language Models},
+   author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
+   journal={arXiv preprint arXiv:2510.21958},
+   year={2025}
+ }
+ ```
+
+ ## Contact
+
+ - **Paper & Code:** https://github.com/ContextLab/llm-stylometry
+ - **Issues:** https://github.com/ContextLab/llm-stylometry/issues
+ - **Contact:** Jeremy R. Manning (jeremy.r.manning@dartmouth.edu)
+ - **Lab:** [Context Lab](https://www.context-lab.com/), Dartmouth College
+
+ ## Related Models
+
+ Explore models for all 8 authors in the study:
+ - [Jane Austen](https://huggingface.co/contextlab/gpt2-austen)
+ - [L. Frank Baum](https://huggingface.co/contextlab/gpt2-baum)
+ - [Charles Dickens](https://huggingface.co/contextlab/gpt2-dickens)
+ - [F. Scott Fitzgerald](https://huggingface.co/contextlab/gpt2-fitzgerald)
+ - [Herman Melville](https://huggingface.co/contextlab/gpt2-melville)
+ - [Ruth Plumly Thompson](https://huggingface.co/contextlab/gpt2-thompson)
+ - [Mark Twain](https://huggingface.co/contextlab/gpt2-twain)
+ - [H.G. Wells](https://huggingface.co/contextlab/gpt2-wells)
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 50256,
+   "dtype": "float32",
+   "embd_pdrop": 0.1,
+   "eos_token_id": 50256,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_embd": 128,
+   "n_head": 8,
+   "n_inner": null,
+   "n_layer": 8,
+   "n_positions": 1024,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "transformers_version": "4.56.1",
+   "use_cache": true,
+   "vocab_size": 50257
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "transformers_version": "4.56.1"
+ }
loss_logs.csv ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6114d430831ed2a6f2a08386e8b88ccf45e675dd44aac3b386ef9fe36c8b2d40
+ size 32611312
training_state.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2368f7472a9a819f53871169b28b62c541372393e4d922eae4c0b5dcffe4487c
+ size 65304983