alexdev404 committed on
Commit 3b10d9f · verified · 1 Parent(s): bca8c75

Update app.py

Files changed (1): app.py +1718 -57
app.py CHANGED
@@ -1,70 +1,1731 @@
  import gradio as gr
- from huggingface_hub import InferenceClient
-
-
- def respond(
-     message,
-     history: list[dict[str, str]],
-     system_message,
-     max_tokens,
-     temperature,
-     top_p,
-     hf_token: gr.OAuthToken,
- ):
-     """
-     For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
-     """
-     client = InferenceClient(token=hf_token.token, model="openai/gpt-oss-20b")
-
-     messages = [{"role": "system", "content": system_message}]
-
-     messages.extend(history)
-
-     messages.append({"role": "user", "content": message})
-
-     response = ""
-
-     for message in client.chat_completion(
-         messages,
-         max_tokens=max_tokens,
-         stream=True,
-         temperature=temperature,
-         top_p=top_p,
-     ):
-         choices = message.choices
-         token = ""
-         if len(choices) and choices[0].delta.content:
-             token = choices[0].delta.content
-
-         response += token
-         yield response
-
-
- """
- For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
- """
- chatbot = gr.ChatInterface(
-     respond,
-     type="messages",
-     additional_inputs=[
-         gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
-         gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
-         gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
-         gr.Slider(
-             minimum=0.1,
-             maximum=1.0,
-             value=0.95,
-             step=0.05,
-             label="Top-p (nucleus sampling)",
-         ),
      ],
  )

- with gr.Blocks() as demo:
-     with gr.Sidebar():
-         gr.LoginButton()
-     chatbot.render()
-
-
  if __name__ == "__main__":
-     demo.launch()
  import gradio as gr
+ import torch
+ from transformers import T5ForConditionalGeneration, T5Tokenizer, AutoTokenizer, AutoModelForSeq2SeqLM
+ from bs4 import BeautifulSoup, NavigableString, Tag
+ import re
+ import time
+ import random
+ import nltk
+ from nltk.tokenize import sent_tokenize
+
+ # Download required NLTK data
+ try:
+     nltk.download('punkt', quiet=True)
+ except:
+     pass
+
+ # Try to import spaCy but make it optional
+ try:
+     import spacy
+     SPACY_AVAILABLE = True
+ except:
+     print("spaCy not available, using NLTK for sentence processing")
+     SPACY_AVAILABLE = False
+
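# Editor's sketch (hypothetical, not part of the commit): the fallback chain
# above means sentence splitting keeps working even without spaCy installed.
# With only the 'punkt' data downloaded:
#     from nltk.tokenize import sent_tokenize
#     sent_tokenize("Dr. Smith arrived. He was late.")
#     # -> ['Dr. Smith arrived.', 'He was late.']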
+ class HumanLikeVariations:
+     """Add human-like variations and intentional imperfections"""
+
+     def __init__(self):
+         # Common human writing patterns - EXPANDED for Originality AI
+         self.casual_transitions = [
+             "So, ", "Well, ", "Now, ", "Actually, ", "Basically, ",
+             "You know, ", "I mean, ", "Thing is, ", "Honestly, ",
+             "Look, ", "Listen, ", "See, ", "Okay, ", "Right, ",
+             "Anyway, ", "Besides, ", "Plus, ", "Also, ", "Oh, ",
+             "Hey, ", "Alright, ", "Sure, ", "Fine, ", "Obviously, ",
+             "Clearly, ", "Seriously, ", "Literally, ", "Frankly, ",
+             "To be honest, ", "Truth is, ", "In fact, ", "Believe it or not, ",
+             "Here's the thing, ", "Let me tell you, ", "Get this, ",
+             "Funny thing is, ", "Interestingly, ", "Surprisingly, ",
+             "Let's be real here, ", "Can we talk about ", "Quick question: ",
+             "Real talk: ", "Hot take: ", "Unpopular opinion: ", "Fun fact: ",
+             "Pro tip: ", "Side note: ", "Random thought: ", "Food for thought: ",
+             "Just saying, ", "Not gonna lie, ", "For what it's worth, ",
+             "If you ask me, ", "Between you and me, ", "Here's my take: ",
+             "Let's face it, ", "No kidding, ", "Seriously though, ",
+             "But wait, ", "Hold on, ", "Check this out: ", "Guess what? "
+         ]
+
+         self.filler_phrases = [
+             "kind of", "sort of", "pretty much", "basically", "actually",
+             "really", "just", "quite", "rather", "fairly", "totally",
+             "definitely", "probably", "maybe", "perhaps", "somehow",
+             "somewhat", "literally", "seriously", "honestly", "frankly",
+             "simply", "merely", "purely", "truly", "genuinely",
+             "absolutely", "completely", "entirely", "utterly", "practically",
+             "virtually", "essentially", "fundamentally", "generally", "typically",
+             "usually", "normally", "often", "sometimes", "occasionally",
+             "apparently", "evidently", "obviously", "clearly", "seemingly",
+             "arguably", "potentially", "possibly", "likely", "unlikely",
+             "more or less", "give or take", "so to speak", "if you will",
+             "per se", "as such", "in a way", "to some extent", "to a degree",
+             "I kid you not", "no joke", "for real", "not gonna lie",
+             "I'm telling you", "trust me", "believe me", "I swear",
+             "hands down", "without a doubt", "100%", "straight up",
+             "I think", "I feel like", "I guess", "I suppose", "seems like",
+             "appears to be", "might be", "could be", "tends to", "tends to be",
+             "in my experience", "from what I've seen", "as far as I know",
+             "to the best of my knowledge", "if I'm not mistaken", "correct me if I'm wrong",
+             "you know what", "here's the deal", "bottom line", "at any rate",
+             "all in all", "when you think about it", "come to think of it",
+             "now that I think about it", "if we're being honest", "to be fair"
+         ]
+
+         self.human_connectors = [
+             ", which means", ", so", ", because", ", since", ", although",
+             ". That's why", ". This means", ". So basically,", ". The thing is,",
+             ", and honestly", ", but here's the thing", ", though", ", however",
+             ". Plus,", ". Also,", ". Besides,", ". Moreover,", ". Furthermore,",
+             ", which is why", ", and that's because", ", given that", ", considering",
+             ". In other words,", ". Put simply,", ". To clarify,", ". That said,",
+             ", you see", ", you know", ", right?", ", okay?", ", yeah?",
+             ". Here's why:", ". Let me explain:", ". Think about it:",
+             ", if you ask me", ", in my opinion", ", from my perspective",
+             ". On the flip side,", ". On the other hand,", ". Conversely,",
+             ", not to mention", ", let alone", ", much less", ", aside from",
+             ". What's more,", ". Even better,", ". Even worse,", ". The catch is,",
+             ", believe it or not", ", surprisingly enough", ", interestingly enough",
+             ". Long story short,", ". Bottom line is,", ". Point being,",
+             ", as you might expect", ", as it turns out", ", as luck would have it",
+             ". And get this:", ". But wait, there's more:", ". Here's the kicker:",
+             ", and here's why", ", and here's the thing", ", but here's what happened",
+             ". Spoiler alert:", ". Plot twist:", ". Reality check:",
+             ", at the end of the day", ", when all is said and done", ", all things considered",
+             ". Make no mistake,", ". Don't get me wrong,", ". Let's not forget,",
+             ", between you and me", ", off the record", ", just between us",
+             ". And honestly?", ". But seriously,", ". And you know what?",
+             ", which brings me to", ". This reminds me of", ", speaking of which",
+             ". Funny enough,", ". Weird thing is,", ". Strange but true:",
+             ", and I mean", ". I'm not kidding when I say", ", and trust me on this"
+         ]
+
+         # NEW: Common human typos and variations
+         self.common_typos = {
+             "the": ["teh", "th", "hte"],
+             "and": ["adn", "nad", "an"],
+             "that": ["taht", "htat", "tha"],
+             "with": ["wiht", "wtih", "iwth"],
+             "have": ["ahve", "hvae", "hav"],
+             "from": ["form", "fro", "frmo"],
+             "they": ["tehy", "thye", "htey"],
+             "which": ["whihc", "wich", "whcih"],
+             "their": ["thier", "theri", "tehir"],
+             "would": ["woudl", "wuold", "woul"],
+             "there": ["tehre", "theer", "ther"],
+             "could": ["coudl", "cuold", "coud"],
+             "people": ["poeple", "peopel", "pepole"],
+             "through": ["thorugh", "throught", "trhough"],
+             "because": ["becuase", "becasue", "beacuse"],
+             "before": ["beofre", "befroe", "befor"],
+             "different": ["differnt", "differnet", "diferent"],
+             "between": ["bewteen", "betwen", "betewen"],
+             "important": ["improtant", "importnat", "importan"],
+             "information": ["infromation", "informaiton", "informaton"]
+         }
+
+         # NEW: Human-like sentence starters for variety
+         self.varied_starters = [
+             "When it comes to", "As for", "Regarding", "In terms of",
+             "With respect to", "Concerning", "Speaking of", "About",
+             "If we look at", "Looking at", "Considering", "Given",
+             "Taking into account", "Bear in mind that", "Keep in mind",
+             "It's worth noting that", "It should be noted that",
+             "One thing to consider is", "An important point is",
+             "What's interesting is", "What stands out is",
+             "The key here is", "The main thing is", "The point is",
+             "Here's what matters:", "Here's the deal:", "Here's something:",
+             "Let's not forget", "We should remember", "Don't forget",
+             "Think about it this way:", "Look at it like this:",
+             "Consider this:", "Picture this:", "Imagine this:",
+             "You might wonder", "You might ask", "You may think",
+             "Some people say", "Many believe", "It's often said",
+             "Research shows", "Studies indicate", "Evidence suggests",
+             "Experience tells us", "History shows", "Time has shown"
+         ]
+
+     def add_human_touch(self, text):
+         """Add subtle human-like imperfections - NATURAL PATTERNS ONLY"""
+         sentences = text.split('. ')
+         modified_sentences = []
+
+         # Track what we've used to avoid patterns
+         used_transitions = []
+
+         for i, sent in enumerate(sentences):
+             if not sent.strip():
+                 continue
+
+             # Always use contractions where natural
+             sent = self.apply_contractions(sent)
+
+             # Add VERY occasional natural errors (5% chance)
+             if random.random() < 0.05 and len(sent.split()) > 15:
+                 error_types = [
+                     # Missing comma in compound sentence
+                     lambda s: s.replace(", and", " and", 1) if ", and" in s else s,
+                     # Wrong homophone
+                     lambda s: s.replace("their", "there", 1) if "their" in s and random.random() < 0.3 else s,
+                     # Missing apostrophe
+                     lambda s: s.replace("it's", "its", 1) if "it's" in s and random.random() < 0.3 else s,
+                 ]
+                 error_func = random.choice(error_types)
+                 sent = error_func(sent)
+
+             modified_sentences.append(sent)
+
+         return '. '.join(modified_sentences)
+
+     def apply_contractions(self, text):
+         """Apply common contractions - EXPANDED"""
+         contractions = {
+             "it is": "it's", "that is": "that's", "there is": "there's",
+             "he is": "he's", "she is": "she's", "what is": "what's",
+             "where is": "where's", "who is": "who's", "how is": "how's",
+             "cannot": "can't", "will not": "won't", "do not": "don't",
+             "does not": "doesn't", "did not": "didn't", "could not": "couldn't",
+             "should not": "shouldn't", "would not": "wouldn't", "is not": "isn't",
+             "are not": "aren't", "was not": "wasn't", "were not": "weren't",
+             "have not": "haven't", "has not": "hasn't", "had not": "hadn't",
+             "I am": "I'm", "you are": "you're", "we are": "we're",
+             "they are": "they're", "I have": "I've", "you have": "you've",
+             "we have": "we've", "they have": "they've", "I will": "I'll",
+             "you will": "you'll", "he will": "he'll", "she will": "she'll",
+             "we will": "we'll", "they will": "they'll", "I would": "I'd",
+             "you would": "you'd", "he would": "he'd", "she would": "she'd",
+             "we would": "we'd", "they would": "they'd", "could have": "could've",
+             "should have": "should've", "would have": "would've", "might have": "might've",
+             "must have": "must've", "there has": "there's", "here is": "here's",
+             "let us": "let's", "that will": "that'll", "who will": "who'll"
+         }
+
+         for full, contr in contractions.items():
+             if random.random() < 0.8:  # 80% chance to apply each contraction
+                 text = re.sub(r'\b' + full + r'\b', contr, text, flags=re.IGNORECASE)
+
+         return text
+
+     def add_minor_errors(self, text):
+         """Add very minor, human-like errors - MORE REALISTIC BUT CONTROLLED"""
+         # Occasionally miss Oxford comma (15% chance)
+         if random.random() < 0.15:
+             # Only in lists, not random commas
+             text = re.sub(r'(\w+), (\w+), and (\w+)', r'\1, \2 and \3', text)
+
+         # Sometimes use 'which' instead of 'that' (8% chance)
+         if random.random() < 0.08:
+             # Only for non-restrictive clauses
+             matches = re.finditer(r'\b(\w+) that (\w+)', text)
+             for match in list(matches)[:1]:  # Only first occurrence
+                 if match.group(1).lower() not in ['believe', 'think', 'know', 'say']:
+                     text = text.replace(match.group(0), f"{match.group(1)} which {match.group(2)}", 1)
+
+         # NEW: Add very occasional typos (2% chance per sentence) - REDUCED AND CONTROLLED
+         sentences = text.split('. ')
+         for i, sent in enumerate(sentences):
+             if random.random() < 0.02 and len(sent.split()) > 15:  # Only in longer sentences
+                 words = sent.split()
+                 # Pick a random word to potentially typo
+                 word_idx = random.randint(len(words)//2, len(words)-2)  # Avoid start/end
+                 word = words[word_idx].lower()
+
+                 # Only typo common words where typo won't break meaning
+                 safe_typos = {
+                     'the': 'teh',
+                     'and': 'adn',
+                     'that': 'taht',
+                     'with': 'wtih',
+                     'from': 'form',
+                     'because': 'becuase'
+                 }
+
+                 if word in safe_typos and random.random() < 0.5:
+                     typo = safe_typos[word]
+                     # Preserve original capitalization
+                     if words[word_idx][0].isupper():
+                         typo = typo[0].upper() + typo[1:]
+                     words[word_idx] = typo
+                     sentences[i] = ' '.join(words)
+
+         text = '. '.join(sentences)
+
+         # Skip double words - too distracting
+
+         # Mix up common homophones occasionally (2% chance) - ONLY SAFE ONES
+         if random.random() < 0.02:
+             safe_homophones = [
+                 ('its', "it's"),  # Very common mistake
+                 ('your', "you're"),  # Another common one
+             ]
+             for pair in safe_homophones:
+                 # Check context to avoid breaking meaning
+                 if f" {pair[0]} " in text and random.random() < 0.3:
+                     # Find one instance and check it's safe to replace
+                     pattern = rf'\b{pair[0]}\s+(\w+ing|\w+ed)\b'  # its + verb = likely should be it's
+                     if re.search(pattern, text):
+                         text = re.sub(pattern, f"{pair[1]} \\1", text, count=1)
+                         break
+
+         return text
+
+     def add_natural_human_patterns(self, text):
+         """Add natural human writing patterns that Originality AI associates with human text"""
+         sentences = self.split_into_sentences_advanced(text)
+         result_sentences = []
+
+         for i, sentence in enumerate(sentences):
+             if not sentence.strip():
+                 continue
+
+             # Natural contractions throughout
+             sentence = self.apply_contractions(sentence)
+
+             # Add natural speech patterns (15% chance)
+             if random.random() < 0.15 and len(sentence.split()) > 10:
+                 # Natural interruptions that humans actually use
+                 if random.random() < 0.5:
+                     # Add "you know" or "I mean" naturally
+                     words = sentence.split()
+                     if len(words) > 6:
+                         pos = random.randint(3, len(words)-3)
+                         if random.random() < 0.5:
+                             words.insert(pos, "you know,")
+                         else:
+                             words.insert(pos, "I mean,")
+                         sentence = ' '.join(words)
+                 else:
+                     # Start with natural opener
+                     openers = ["Look,", "See,", "Thing is,", "Honestly,", "Actually,"]
+                     sentence = random.choice(openers) + " " + sentence[0].lower() + sentence[1:]
+
+             # Add subtle errors that humans make (10% chance - reduced)
+             if random.random() < 0.10:
+                 words = sentence.split()
+                 if len(words) > 5:
+                     # Common comma omissions
+                     if ", and" in sentence and random.random() < 0.3:
+                         sentence = sentence.replace(", and", " and", 1)
+                     # Double words occasionally
+                     elif random.random() < 0.2:
+                         idx = random.randint(1, len(words)-2)
+                         if words[idx].lower() in ['the', 'a', 'to', 'in', 'on', 'at']:
+                             words.insert(idx+1, words[idx])
+                             sentence = ' '.join(words)
+
+             # Natural sentence combinations (20% chance)
+             if i < len(sentences) - 1 and random.random() < 0.2:
+                 next_sent = sentences[i+1].strip()
+                 if next_sent and len(sentence.split()) + len(next_sent.split()) < 25:
+                     # Natural connectors based on content
+                     if any(w in next_sent.lower() for w in ['but', 'however', 'although']):
+                         sentence = sentence.rstrip('.') + ", but " + next_sent[0].lower() + next_sent[1:]
+                         sentences[i+1] = ""  # Mark as processed
+                     elif any(w in next_sent.lower() for w in ['also', 'too', 'as well']):
+                         sentence = sentence.rstrip('.') + " and " + next_sent[0].lower() + next_sent[1:]
+                         sentences[i+1] = ""  # Mark as processed
+
+             result_sentences.append(sentence)
+
+         return ' '.join([s for s in result_sentences if s])
+
+     def vary_sentence_start(self, sentence):
+         """Vary sentence beginning to avoid repetitive patterns"""
+         if not sentence:
+             return sentence
+
+         words = sentence.split()
+         if len(words) < 5:
+             return sentence
+
+         # Different ways to start sentences naturally
+         variations = [
+             lambda s: "When " + s[0].lower() + s[1:] + ", it makes sense.",
+             lambda s: "If you think about it, " + s[0].lower() + s[1:],
+             lambda s: s + " This is important.",
+             lambda s: "The thing about " + words[0].lower() + " " + ' '.join(words[1:]) + " is clear.",
+             lambda s: "What's interesting is " + s[0].lower() + s[1:],
+             lambda s: s,  # Keep original sometimes
+         ]
+
+         # Pick a random variation
+         variation = random.choice(variations)
+         try:
+             return variation(sentence)
+         except:
+             return sentence
+
+     def split_into_sentences_advanced(self, text):
+         """Advanced sentence splitting using spaCy or NLTK"""
+         if SPACY_AVAILABLE:
+             try:
+                 nlp = spacy.load("en_core_web_sm")
+                 doc = nlp(text)
+                 sentences = [sent.text.strip() for sent in doc.sents]
+             except:
+                 sentences = sent_tokenize(text)
+         else:
+             # Fallback to NLTK
+             try:
+                 sentences = sent_tokenize(text)
+             except:
+                 # Final fallback to regex
+                 sentences = re.split(r'(?<=[.!?])\s+', text)
+
+         # Clean up sentences
+         return [s for s in sentences if s and len(s.strip()) > 0]
+
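# Editor's usage sketch (hypothetical, not part of the commit): the class is
# self-contained apart from its phrase tables, and its output is randomized by
# design, so exact results vary run to run:
#     hv = HumanLikeVariations()
#     hv.apply_contractions("It is likely that they are ready.")
#     # -> "It's likely that they're ready."  (each pattern fires with p=0.8)
#     hv.add_minor_errors("We packed apples, pears, and plums.")
#     # -> may drop the Oxford comma: "We packed apples, pears and plums."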
+ class SelectiveGrammarFixer:
+     """Minimal grammar fixes to maintain human-like quality while fixing critical errors"""
+
+     def __init__(self):
+         self.nlp = None
+         self.human_variations = HumanLikeVariations()
+
+     def fix_incomplete_sentences_only(self, text):
+         """Fix only incomplete sentences without over-correcting"""
+         if not text:
+             return text
+
+         sentences = text.split('. ')
+         fixed_sentences = []
+
+         for i, sent in enumerate(sentences):
+             sent = sent.strip()
+             if not sent:
+                 continue
+
+             # Only fix if sentence is incomplete
+             if sent and sent[-1] not in '.!?':
+                 # Check if it's the last sentence
+                 if i == len(sentences) - 1:
+                     # Add period if it's clearly a statement
+                     if not sent.endswith(':') and not sent.endswith(','):
+                         sent += '.'
+                 else:
+                     # Middle sentences should have periods
+                     sent += '.'
+
+             # Ensure first letter capitalization ONLY after sentence endings
+             if i > 0 and sent and sent[0].islower():
+                 # Check if previous sentence ended with punctuation
+                 if fixed_sentences and fixed_sentences[-1].rstrip().endswith(('.', '!', '?')):
+                     sent = sent[0].upper() + sent[1:]
+             elif i == 0 and sent and sent[0].islower():
+                 # First sentence should be capitalized
+                 sent = sent[0].upper() + sent[1:]
+
+             fixed_sentences.append(sent)
+
+         result = ' '.join(fixed_sentences)
+
+         # Add natural human variations (but we need to reference the main class method)
+         # This will be called from the smart_fix method instead
+
+         return result
+
+     def fix_basic_punctuation_errors(self, text):
+         """Fix only the most egregious punctuation errors"""
+         if not text:
+             return text
+
+         # Fix double spaces (human-like error)
+         text = re.sub(r'\s{2,}', ' ', text)
+
+         # Fix space before punctuation (common error)
+         text = re.sub(r'\s+([.,!?;:])', r'\1', text)
+
+         # Fix missing space after punctuation (human-like)
+         text = re.sub(r'([.,!?])([A-Z])', r'\1 \2', text)
+
+         # Fix accidental double punctuation
+         text = re.sub(r'([.!?])\1+', r'\1', text)
+
+         # Fix "i" capitalization (common human error to fix)
+         text = re.sub(r'\bi\b', 'I', text)
+
+         return text
+
+     def preserve_natural_variations(self, text):
+         """Keep some natural human-like variations"""
+         # Don't fix everything - leave some variety
+         # Only fix if really broken
+         if text.count('.') == 0 and len(text.split()) > 20:
+             # Long text with no periods - needs fixing
+             words = text.split()
+             # Add periods every 15-25 words naturally (more variation)
+             new_text = []
+             for i, word in enumerate(words):
+                 new_text.append(word)
+                 if i > 0 and i % random.randint(12, 25) == 0:
+                     if word[-1] not in '.!?,;:':
+                         new_text[-1] = word + '.'
+                         # Capitalize next word if it's not an acronym
+                         if i + 1 < len(words) and words[i + 1][0].islower():
+                             # Check if it's not likely an acronym
+                             if not words[i + 1].isupper():
+                                 words[i + 1] = words[i + 1][0].upper() + words[i + 1][1:]
+             text = ' '.join(new_text)
+
+         return text
+
+     def smart_fix(self, text):
+         """Apply minimal fixes to maintain human-like quality"""
+         # Apply fixes in order of importance
+         text = self.fix_basic_punctuation_errors(text)
+         text = self.fix_incomplete_sentences_only(text)
+         text = self.preserve_natural_variations(text)
+
+         return text
+
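# Editor's usage sketch (hypothetical): smart_fix chains the three fixers in
# order, so a bare fragment gains terminal punctuation and capitalization:
#     fixer = SelectiveGrammarFixer()
#     fixer.smart_fix("results were mixed but promising")
#     # -> "Results were mixed but promising."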
+ class EnhancedDipperHumanizer:
+     def __init__(self):
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         print(f"Using device: {self.device}")
+
+         # Clear GPU cache
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+         # Initialize grammar fixer
+         self.grammar_fixer = SelectiveGrammarFixer()
+
+         # Try to load spaCy if available
+         self.nlp = None
+         self.use_spacy = False
+         if SPACY_AVAILABLE:
+             try:
+                 self.nlp = spacy.load("en_core_web_sm")
+                 self.use_spacy = True
+                 print("spaCy loaded successfully")
+             except:
+                 print("spaCy model not found, using NLTK for sentence splitting")
+
+         try:
+             # Load Dipper paraphraser WITHOUT 8-bit quantization for better performance
+             print("Loading Dipper paraphraser model...")
+             self.tokenizer = T5Tokenizer.from_pretrained('google/t5-v1_1-xxl')
+             self.model = T5ForConditionalGeneration.from_pretrained(
+                 "kalpeshk2011/dipper-paraphraser-xxl",
+                 device_map="auto",  # This will distribute across 4xL40S automatically
+                 torch_dtype=torch.float16,
+                 low_cpu_mem_usage=True
+             )
+             print("Dipper model loaded successfully!")
+             self.is_dipper = True
+
+         except Exception as e:
+             print(f"Error loading Dipper model: {str(e)}")
+             print("Falling back to Flan-T5-XL...")
+             self.is_dipper = False
+
+             # Fallback to Flan-T5-XL
+             try:
+                 self.model = T5ForConditionalGeneration.from_pretrained(
+                     "google/flan-t5-xl",
+                     torch_dtype=torch.float16,
+                     low_cpu_mem_usage=True,
+                     device_map="auto"
+                 )
+                 self.tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
+                 print("Loaded Flan-T5-XL as fallback")
+             except:
+                 raise Exception("Could not load any model. Please check your system resources.")
+
+         # Load BART as secondary model
+         try:
+             print("Loading BART model for additional variation...")
+             self.bart_model = AutoModelForSeq2SeqLM.from_pretrained(
+                 "eugenesiow/bart-paraphrase",
+                 torch_dtype=torch.float16,
+                 device_map="auto"  # Distribute across GPUs
+             )
+             self.bart_tokenizer = AutoTokenizer.from_pretrained("eugenesiow/bart-paraphrase")
+             self.use_bart = True
+             print("BART model loaded successfully")
+         except:
+             print("BART model not available")
+             self.use_bart = False
+
+         # Initialize human variations handler
+         self.human_variations = HumanLikeVariations()
+
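# Editor's note (assumption, not stated in the commit): the constructor eagerly
# loads up to three checkpoints; dipper-paraphraser-xxl is T5-XXL scale (~11B
# parameters), so its fp16 weights alone need roughly 22 GB of GPU memory before
# activations. A minimal sketch, hypothetical names:
#     humanizer = EnhancedDipperHumanizer()
#     # falls back to Flan-T5-XL on failure, and raises only if neither loads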
+     def add_natural_human_patterns(self, text):
+         """Add natural human writing patterns that Originality AI associates with human text"""
+         sentences = self.split_into_sentences_advanced(text)
+         result_sentences = []
+
+         for i, sentence in enumerate(sentences):
+             if not sentence.strip():
+                 continue
+
+             # Natural contractions throughout
+             sentence = self.apply_contractions(sentence)
+
+             # Add natural speech patterns (15% chance - balanced)
+             if random.random() < 0.15 and len(sentence.split()) > 10:
+                 # Natural interruptions that humans actually use
+                 if random.random() < 0.5:
+                     # Add "you know" or "I mean" naturally
+                     words = sentence.split()
+                     if len(words) > 6:
+                         pos = random.randint(3, len(words)-3)
+                         if random.random() < 0.5:
+                             words.insert(pos, "you know,")
+                         else:
+                             words.insert(pos, "I mean,")
+                         sentence = ' '.join(words)
+                 else:
+                     # Start with natural opener
+                     openers = ["Look,", "See,", "Thing is,", "Honestly,", "Actually,"]
+                     sentence = random.choice(openers) + " " + sentence[0].lower() + sentence[1:]
+
+             # Add subtle errors that humans make (8% chance)
+             if random.random() < 0.08:
+                 words = sentence.split()
+                 if len(words) > 5:
+                     # Common comma omissions
+                     if ", and" in sentence and random.random() < 0.3:
+                         sentence = sentence.replace(", and", " and", 1)
+                     # Double words occasionally
+                     elif random.random() < 0.2:
+                         idx = random.randint(1, len(words)-2)
+                         if words[idx].lower() in ['the', 'a', 'to', 'in', 'on', 'at']:
+                             words.insert(idx+1, words[idx])
+                             sentence = ' '.join(words)
+
+             # Natural sentence combinations (20% chance)
+             if i < len(sentences) - 1 and random.random() < 0.2:
+                 next_sent = sentences[i+1].strip()
+                 if next_sent and len(sentence.split()) + len(next_sent.split()) < 25:
+                     # Natural connectors based on content
+                     if any(w in next_sent.lower() for w in ['but', 'however', 'although']):
+                         sentence = sentence.rstrip('.') + ", but " + next_sent[0].lower() + next_sent[1:]
+                         sentences[i+1] = ""  # Mark as processed
+                     elif any(w in next_sent.lower() for w in ['also', 'too', 'as well']):
+                         sentence = sentence.rstrip('.') + " and " + next_sent[0].lower() + next_sent[1:]
+                         sentences[i+1] = ""  # Mark as processed
+
+             result_sentences.append(sentence)
+
+         return ' '.join([s for s in result_sentences if s])
+
+     def vary_sentence_start(self, sentence):
+         """Vary sentence beginning to avoid repetitive patterns"""
+         if not sentence:
+             return sentence
+
+         words = sentence.split()
+         if len(words) < 5:
+             return sentence
+
+         # Different ways to start sentences naturally
+         variations = [
+             lambda s: "When " + s[0].lower() + s[1:] + ", it makes sense.",
+             lambda s: "If you think about it, " + s[0].lower() + s[1:],
+             lambda s: s + " This is important.",
+             lambda s: "The thing about " + words[0].lower() + " " + ' '.join(words[1:]) + " is clear.",
+             lambda s: "What's interesting is " + s[0].lower() + s[1:],
+             lambda s: s,  # Keep original sometimes
+         ]
+
+         # Pick a random variation
+         variation = random.choice(variations)
+         try:
+             return variation(sentence)
+         except:
+             return sentence
+
+     def apply_contractions(self, text):
+         """Apply common contractions to make text more natural"""
+         contractions = {
+             "it is": "it's", "that is": "that's", "there is": "there's",
+             "he is": "he's", "she is": "she's", "what is": "what's",
+             "where is": "where's", "who is": "who's", "how is": "how's",
+             "cannot": "can't", "will not": "won't", "do not": "don't",
+             "does not": "doesn't", "did not": "didn't", "could not": "couldn't",
+             "should not": "shouldn't", "would not": "wouldn't", "is not": "isn't",
+             "are not": "aren't", "was not": "wasn't", "were not": "weren't",
+             "have not": "haven't", "has not": "hasn't", "had not": "hadn't",
+             "I am": "I'm", "you are": "you're", "we are": "we're",
+             "they are": "they're", "I have": "I've", "you have": "you've",
+             "we have": "we've", "they have": "they've", "I will": "I'll",
+             "you will": "you'll", "he will": "he'll", "she will": "she'll",
+             "we will": "we'll", "they will": "they'll", "I would": "I'd",
+             "you would": "you'd", "he would": "he'd", "she would": "she'd",
+             "we would": "we'd", "they would": "they'd", "could have": "could've",
+             "should have": "should've", "would have": "would've", "might have": "might've",
+             "must have": "must've", "there has": "there's", "here is": "here's",
+             "let us": "let's", "that will": "that'll", "who will": "who'll"
+         }
+
+         for full, contr in contractions.items():
+             text = re.sub(r'\b' + full + r'\b', contr, text, flags=re.IGNORECASE)
+
+         return text
+
+     def should_skip_element(self, element, text):
+         """Determine if an element should be skipped from paraphrasing"""
+         if not text or len(text.strip()) < 3:
+             return True
+
+         # Skip JavaScript code inside script tags
+         parent = element.parent
+         if parent and parent.name in ['script', 'style', 'noscript']:
+             return True
+
+         # Skip headings (h1-h6)
+         if parent and parent.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'title']:
+             return True
+
+         # Skip content inside <strong> and <b> tags
+         if parent and parent.name in ['strong', 'b']:
+             return True
+
+         # Skip table content
+         if parent and (parent.name in ['td', 'th'] or any(p.name == 'table' for p in parent.parents)):
+             return True
+
+         # Special handling for content inside tables
+         # Skip if it's inside strong/b/h1-h6 tags AND also inside a table
+         if parent:
+             # Check if we're inside a table
+             is_in_table = any(p.name == 'table' for p in parent.parents)
+             if is_in_table:
+                 # If we're in a table, skip any text that's inside formatting tags
+                 if parent.name in ['strong', 'b', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'em', 'i']:
+                     return True
+                 # Also check if parent's parent is a formatting tag
+                 if parent.parent and parent.parent.name in ['strong', 'b', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
+                     return True
+
+         # Skip table of contents
+         if parent:
+             parent_text = str(parent).lower()
+             if any(toc in parent_text for toc in ['table of contents', 'toc-', 'contents']):
+                 return True
+
+         # Skip CTAs and buttons
+         if parent and parent.name in ['button', 'a']:
+             return True
+
+         # Skip if parent has onclick or other event handlers
+         if parent and parent.attrs:
+             event_handlers = ['onclick', 'onchange', 'onsubmit', 'onload', 'onmouseover', 'onmouseout']
+             if any(handler in parent.attrs for handler in event_handlers):
+                 return True
+
+         # Special check for testimonial cards - check up to 3 levels of ancestors
+         if parent:
+             ancestors_to_check = []
+             current = parent
+             for _ in range(3):  # Check up to 3 levels up
+                 if current:
+                     ancestors_to_check.append(current)
+                     current = current.parent
+
+             # Check if any ancestor has testimonial-card class
+             for ancestor in ancestors_to_check:
+                 if ancestor and ancestor.get('class'):
+                     classes = ancestor.get('class', [])
+                     if isinstance(classes, list):
+                         if any('testimonial-card' in str(cls) for cls in classes):
+                             return True
+                     elif isinstance(classes, str) and 'testimonial-card' in classes:
+                         return True
+
+         # Skip if IMMEDIATE parent or element itself has skip-worthy classes/IDs
+         skip_indicators = [
+             'button', 'btn', 'heading', 'title', 'caption',
+             'toc-', 'contents', 'quiz', 'tip', 'note', 'alert',
+             'warning', 'info', 'success', 'error', 'code', 'pre',
+             'stats-grid', 'testimonial-card',
+             'cta-box', 'quiz-container', 'contact-form',
+             'faq-question', 'sidebar', 'widget', 'banner',
+             'author-intro', 'testimonial', 'review', 'feedback',
+             'floating-', 'stat-', 'progress-', 'option', 'results',
+             'question-container', 'quiz-',
+             'comparision-tables', 'process-flowcharts', 'infographics', 'cost-breakdown'
+         ]
+
+         # Check only immediate parent and grandparent (not all ancestors)
+         elements_to_check = [parent]
+         if parent and parent.parent:
+             elements_to_check.append(parent.parent)
+
+         for elem in elements_to_check:
+             if not elem:
+                 continue
+
+             # Check element's class
+             elem_class = elem.get('class', [])
+             if isinstance(elem_class, list):
+                 class_str = ' '.join(str(cls).lower() for cls in elem_class)
+                 if any(indicator in class_str for indicator in skip_indicators):
+                     return True
+
+             # Check element's ID
+             elem_id = elem.get('id', '')
+             if any(indicator in str(elem_id).lower() for indicator in skip_indicators):
+                 return True
+
+         # Skip short phrases that might be UI elements
+         word_count = len(text.split())
+         if word_count <= 5:
+             ui_patterns = [
+                 'click', 'download', 'learn more', 'read more', 'sign up',
+                 'get started', 'try now', 'buy now', 'next', 'previous',
+                 'back', 'continue', 'submit', 'cancel', 'get now', 'book your',
+                 'check out:', 'see also:', 'related:', 'question', 'of'
+             ]
+             if any(pattern in text.lower() for pattern in ui_patterns):
+                 return True
+
+         # Skip very short content in styled containers
+         if parent and parent.name in ['div', 'section', 'aside', 'blockquote']:
+             style = parent.get('style', '')
+             if 'border' in style or 'background' in style:
+                 if word_count <= 20:
+                     # But don't skip if it's inside a paragraph
+                     if not any(p.name == 'p' for p in parent.parents):
+                         return True
+
+         return False
+
+     def is_likely_acronym_or_proper_noun(self, word):
+         """Check if a word is likely an acronym or part of a proper noun"""
+         # Common acronyms and abbreviations
+         acronyms = {'MBA', 'CEO', 'USA', 'UK', 'GMAT', 'GRE', 'SAT', 'ACT', 'PhD', 'MD', 'IT', 'AI', 'ML'}
+
+         # Check if it's in our acronym list
+         if word.upper() in acronyms:
+             return True
+
+         # Check if it's all caps (likely acronym)
+         if word.isupper() and len(word) > 1:
+             return True
+
+         # Check if it follows patterns like "Edition", "Focus", etc. that often come after proper nouns
+         proper_noun_continuations = {
+             'Edition', 'Version', 'Series', 'Focus', 'System', 'Method', 'School',
+             'University', 'College', 'Institute', 'Academy', 'Center', 'Centre'
+         }
+
+         if word in proper_noun_continuations:
+             return True
+
+         return False
+
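# Editor's usage sketch (hypothetical): should_skip_element works on bs4 text
# nodes, so headings and UI labels are filtered while body copy passes through.
# The 'humanizer' name is illustrative:
#     soup = BeautifulSoup("<h1>Title</h1><p>Some body text here today.</p>", "html.parser")
#     nodes = list(soup.find_all(string=True))
#     humanizer.should_skip_element(nodes[0], str(nodes[0]))  # True  (inside <h1>)
#     humanizer.should_skip_element(nodes[1], str(nodes[1]))  # False (plain <p> text)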
+     def clean_model_output_enhanced(self, text):
+         """Enhanced cleaning that preserves more natural structure"""
+         if not text:
+             return ""
+
+         # Store original for fallback
+         original = text
+
+         # Remove ONLY clear model artifacts
+         text = re.sub(r'^lexical\s*=\s*\d+\s*,\s*order\s*=\s*\d+\s*', '', text, flags=re.IGNORECASE)
+         text = re.sub(r'<sent>\s*', '', text, flags=re.IGNORECASE)
+         text = re.sub(r'\s*</sent>', '', text, flags=re.IGNORECASE)
+
+         # Only remove clear prefixes
+         if text.lower().startswith('paraphrase:'):
+             text = text[11:].strip()
+         elif text.lower().startswith('rewrite:'):
+             text = text[8:].strip()
+
+         # Clean up backticks and weird punctuation
+         text = re.sub(r'``+', '', text)
+         text = re.sub(r"''", '"', text)
+
+         # Remove awkward phrase markers
+         text = re.sub(r'- actually, scratch that -', '', text)
+         text = re.sub(r'- wait, let me back up -', '', text)
+         text = re.sub(r'- you know what I mean\? -', '', text)
+         text = re.sub(r'- okay, here\'s the thing -', '', text)
+         text = re.sub(r'- bear with me here -', '', text)
+         text = re.sub(r'- I\'m serious -', '', text)
+         text = re.sub(r'- or maybe I should say -', '', text)
+         text = re.sub(r'- or rather,', '', text)
+         text = re.sub(r'- think about it -', '', text)
+
+         # Clean up multiple spaces
+         text = re.sub(r'\s+', ' ', text)
+
+         # Remove leading non-letter characters carefully
+         text = re.sub(r'^[^a-zA-Z_]+', '', text)
+
+         # If we accidentally removed too much, use original
+         if len(text) < len(original) * 0.5:
+             text = original
+
+         return text.strip()
+
+     def paraphrase_with_dipper(self, text, lex_diversity=60, order_diversity=20):
+         """Paraphrase text using Dipper model with sentence-level processing"""
+         if not text or len(text.strip()) < 3:
+             return text
+
+         # Split into sentences for better control
+         sentences = self.split_into_sentences_advanced(text)
+         paraphrased_sentences = []
+
+         # Track sentence patterns to avoid repetition
+         sentence_starts = []
+
+         for i, sentence in enumerate(sentences):
+             if len(sentence.strip()) < 3:
+                 paraphrased_sentences.append(sentence)
+                 continue
+
+             try:
+                 # BALANCED diversity for Originality AI (100% human with better quality)
+                 if len(sentence.split()) < 10:
+                     lex_diversity = 70  # High but not extreme
+                     order_diversity = 25
+                 else:
+                     lex_diversity = 82  # Balanced diversity
+                     order_diversity = 30  # Moderate order diversity
+
+                 lex_code = int(100 - lex_diversity)
+                 order_code = int(100 - order_diversity)
+
+                 # Format input for Dipper
+                 if self.is_dipper:
+                     input_text = f"lexical = {lex_code}, order = {order_code} <sent> {sentence} </sent>"
+                 else:
+                     input_text = f"paraphrase: {sentence}"
+
+                 # Tokenize
+                 inputs = self.tokenizer(
+                     input_text,
+                     return_tensors="pt",
+                     max_length=512,
+                     truncation=True,
+                     padding=True
+                 )
+
+                 # Move to device
+                 if hasattr(self.model, 'device_map') and self.model.device_map:
+                     device = next(iter(self.model.device_map.values()))
+                     inputs = {k: v.to(device) for k, v in inputs.items()}
+                 else:
+                     inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+                 # Generate with appropriate variation
+                 original_length = len(sentence.split())
+                 max_new_length = int(original_length * 1.4)
+
+                 # High variation parameters
+                 temp = 0.85  # Slightly reduced from 0.9
+                 top_p_val = 0.92  # Slightly reduced from 0.95
+
+                 with torch.no_grad():
+                     outputs = self.model.generate(
+                         **inputs,
+                         max_length=max_new_length + 20,
+                         min_length=max(5, int(original_length * 0.7)),
+                         do_sample=True,
+                         top_p=top_p_val,
+                         temperature=temp,
+                         no_repeat_ngram_size=4,  # Allow more repetition for naturalness
+                         num_beams=1,  # Greedy for more randomness
+                         early_stopping=True
+                     )
+
+                 # Decode
+                 paraphrased = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+                 # Clean model artifacts
+                 paraphrased = self.clean_model_output_enhanced(paraphrased)
+
+                 # Fix incomplete sentences
+                 paraphrased = self.fix_incomplete_sentence_smart(paraphrased, sentence)
+
+                 # Ensure variety in sentence starts
+                 first_words = paraphrased.split()[:2] if paraphrased.split() else []
+                 if first_words and i > 0:
+                     # Check if we're repeating patterns
+                     first_phrase = ' '.join(first_words).lower()
+                     if sentence_starts.count(first_phrase) >= 2:
+                         # Try to rephrase the beginning
+                         paraphrased = self.vary_sentence_start(paraphrased)
+                     sentence_starts.append(first_phrase)
+
+                 # Ensure reasonable length
+                 if len(paraphrased.split()) > max_new_length:
+                     paraphrased = ' '.join(paraphrased.split()[:max_new_length])
+
+                 paraphrased_sentences.append(paraphrased)
+
+             except Exception as e:
+                 print(f"Error paraphrasing sentence: {str(e)}")
+                 paraphrased_sentences.append(sentence)
+
+         # Join sentences back
+         result = ' '.join(paraphrased_sentences)
+
+         # Apply natural human patterns
+         result = self.add_natural_human_patterns(result)
+
+         return result
+
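# Worked example (editorial sketch; values follow the method above): for a long
# sentence, lex_diversity=82 and order_diversity=30 give control codes
#     lex_code = 100 - 82 = 18    and    order_code = 100 - 30 = 70
# so the model input becomes
#     "lexical = 18, order = 70 <sent> The committee approved the plan. </sent>"
# Lower codes ask Dipper for more aggressive lexical and word-order change.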
+     def fix_incomplete_sentence_smart(self, generated, original):
+         """Smarter sentence completion that maintains natural flow"""
+         if not generated or not generated.strip():
+             return original
+
+         generated = generated.strip()
+
+         # Check if the sentence seems complete semantically
+         words = generated.split()
+         if len(words) >= 3:
+             # Check if last word is a good ending word
+             last_word = words[-1].lower().rstrip('.,!?;:')
+
+             # Common ending words that might not need punctuation fix
+             ending_words = {
+                 'too', 'also', 'well', 'though', 'however',
+                 'furthermore', 'moreover', 'indeed', 'anyway',
+                 'regardless', 'nonetheless', 'therefore', 'thus'
+             }
+
+             # If it ends with a good word, just add appropriate punctuation
+             if last_word in ending_words:
+                 if generated[-1] not in '.!?':
+                     generated += '.'
+                 return generated
+
+         # Check for cut-off patterns
+         if len(words) > 0:
+             last_word = words[-1]
+
+             # Remove if it's clearly cut off (1-2 chars, no vowels)
+             # But don't remove valid short words like "is", "of", "to", etc.
+             short_valid_words = {'is', 'of', 'to', 'in', 'on', 'at', 'by', 'or', 'if', 'so', 'up', 'no', 'we', 'he', 'me', 'be', 'do', 'go'}
+             if (len(last_word) <= 2 and
+                     last_word.lower() not in short_valid_words and
+                     not any(c in 'aeiouAEIOU' for c in last_word)):
+                 words = words[:-1]
+                 generated = ' '.join(words)
+
+         # Add ending punctuation based on context
+         if generated and generated[-1] not in '.!?:,;':
+             # Check original ending
+             orig_stripped = original.strip()
+             if orig_stripped.endswith('?'):
+                 # Check if generated seems like a question
+                 question_words = ['what', 'why', 'how', 'when', 'where', 'who', 'which', 'is', 'are', 'do', 'does', 'can', 'could', 'would', 'should']
+                 first_word = generated.split()[0].lower() if generated.split() else ''
+                 if first_word in question_words:
+                     generated += '?'
+                 else:
+                     generated += '.'
+             elif orig_stripped.endswith('!'):
+                 # Check if generated seems exclamatory
+                 exclaim_words = ['amazing', 'incredible', 'fantastic', 'terrible', 'awful', 'wonderful', 'excellent']
+                 if any(word in generated.lower() for word in exclaim_words):
+                     generated += '!'
+                 else:
+                     generated += '.'
+             elif orig_stripped.endswith(':'):
+                 generated += ':'
+             else:
+                 generated += '.'
+
+         # Ensure first letter is capitalized ONLY if it's sentence start
+         # Don't capitalize words like "iPhone" or "eBay"
+         if generated and generated[0].islower() and not self.is_likely_acronym_or_proper_noun(generated.split()[0]):
+             generated = generated[0].upper() + generated[1:]
+
+         return generated
+
+     def split_into_sentences_advanced(self, text):
+         """Advanced sentence splitting using spaCy or NLTK"""
+         if self.use_spacy and self.nlp:
+             doc = self.nlp(text)
+             sentences = [sent.text.strip() for sent in doc.sents]
+         else:
+             # Fallback to NLTK
+             try:
+                 sentences = sent_tokenize(text)
+             except:
+                 # Final fallback to regex
+                 sentences = re.split(r'(?<=[.!?])\s+', text)
+
+         # Clean up sentences
+         return [s for s in sentences if s and len(s.strip()) > 0]
+
+     def paraphrase_with_bart(self, text):
+         """Additional paraphrasing with BART for more variation"""
+         if not self.use_bart or not text or len(text.strip()) < 3:
+             return text
+
+         try:
+             # Process in smaller chunks for BART
+             sentences = self.split_into_sentences_advanced(text)
+             paraphrased_sentences = []
+
+             for sentence in sentences:
+                 if len(sentence.split()) < 5:
+                     paraphrased_sentences.append(sentence)
+                     continue
+
+                 inputs = self.bart_tokenizer(
+                     sentence,
+                     return_tensors='pt',
+                     max_length=128,
+                     truncation=True
+                 )
+
+                 # Move to appropriate device
+                 if hasattr(self.bart_model, 'device_map') and self.bart_model.device_map:
+                     device = next(iter(self.bart_model.device_map.values()))
+                     inputs = {k: v.to(device) for k, v in inputs.items()}
+                 else:
+                     inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+                 original_length = len(sentence.split())
+
+                 with torch.no_grad():
+                     outputs = self.bart_model.generate(
+                         **inputs,
+                         max_length=int(original_length * 1.4) + 10,
+                         min_length=max(5, int(original_length * 0.6)),
+                         num_beams=2,
+                         temperature=1.1,  # Higher temperature
+                         do_sample=True,
+                         top_p=0.9,
+                         early_stopping=True
+                     )
+
+                 paraphrased = self.bart_tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+                 # Fix incomplete sentences
+                 paraphrased = self.fix_incomplete_sentence_smart(paraphrased, sentence)
+
+                 paraphrased_sentences.append(paraphrased)
+
+             result = ' '.join(paraphrased_sentences)
+
+             # Apply minimal grammar fixes
+             result = self.grammar_fixer.smart_fix(result)
+
+             return result
+
+         except Exception as e:
+             print(f"Error in BART paraphrasing: {str(e)}")
+             return text
+
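# Editor's note (assumption): BART here is a second, independent rewrite pass,
# so running it after Dipper compounds the paraphrase. Combining do_sample=True
# with num_beams=2 puts transformers into beam-sample mode, and temperature=1.1
# flattens the token distribution slightly, trading fluency for extra variation.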
+     def apply_sentence_variation(self, text):
+         """Apply natural sentence structure variations - HUMAN-LIKE FLOW"""
+         sentences = self.split_into_sentences_advanced(text)
+         varied_sentences = []
+
+         # Track patterns to ensure variety
+         last_sentence_length = 0
+
+         for i, sentence in enumerate(sentences):
+             if not sentence.strip():
+                 continue
+
+             words = sentence.split()
+             current_length = len(words)
+
+             # Natural sentence length variation
+             if last_sentence_length > 20 and current_length > 20:
+                 # Break up if two long sentences in a row
+                 if ',' in sentence:
+                     parts = sentence.split(',', 1)
+                     if len(parts) == 2 and len(parts[1].split()) > 8:
+                         varied_sentences.append(parts[0].strip() + '.')
+                         second_part = parts[1].strip()
+                         if second_part and second_part[0].islower():
+                             second_part = second_part[0].upper() + second_part[1:]
+                         varied_sentences.append(second_part)
+                         last_sentence_length = len(parts[1].split())
+                         continue
+
+             # Natural combinations for flow
+             if (i < len(sentences) - 1 and
+                     current_length < 10 and
+                     len(sentences[i+1].split()) < 10):
+
+                 next_sent = sentences[i+1].strip()
+                 # Only combine if it makes semantic sense
+                 if next_sent and any(next_sent.lower().startswith(w) for w in ['it', 'this', 'that', 'which']):
+                     combined = sentence.rstrip('.') + ' ' + next_sent[0].lower() + next_sent[1:]
+                     varied_sentences.append(combined)
+                     sentences[i+1] = ""
+                     last_sentence_length = len(combined.split())
+                     continue
+
+             varied_sentences.append(sentence)
+             last_sentence_length = current_length
+
+         return ' '.join([s for s in varied_sentences if s])
+
+     def fix_punctuation(self, text):
+         """Comprehensive punctuation and formatting fixes"""
+         if not text:
+             return ""
+
+         # First, clean any remaining model artifacts
+         text = self.clean_model_output_enhanced(text)
+
+         # Fix weird symbols and characters using safe replacements
+         text = text.replace('<>', '')  # Remove empty angle brackets
+
+         # Normalize quotes - use replace instead of regex for problematic characters
+         text = text.replace('«', '"').replace('»', '"')
+         text = text.replace('„', '"').replace('“', '"').replace('”', '"')
+         text = text.replace('‘', "'").replace('’', "'")
+         text = text.replace('–', '-').replace('—', '-')
+
+         # Fix colon issues
+         text = re.sub(r'\.:', ':', text)  # Remove period before colon
+         text = re.sub(r':\s*\.', ':', text)  # Remove period after colon
+
+         # Fix basic spacing
+         text = re.sub(r'\s+', ' ', text)  # Multiple spaces to single
+         text = re.sub(r'\s+([.,!?;:])', r'\1', text)  # Remove space before punctuation
+         text = re.sub(r'([.,!?;:])\s*([.,!?;:])', r'\1', text)  # Remove double punctuation
+         text = re.sub(r'([.!?])\s*\1+', r'\1', text)  # Remove repeated punctuation
+
+         # Fix colons
+         text = re.sub(r':\s*([.,!?])', ':', text)  # Remove punctuation after colon
+         text = re.sub(r'([.,!?])\s*:', ':', text)  # Remove punctuation before colon
+         text = re.sub(r':+', ':', text)  # Multiple colons to one
+
+         # Fix quotes and parentheses
+         text = re.sub(r'"\s*([^"]*?)\s*"', r'"\1"', text)
+         text = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", text)
+         text = re.sub(r'\(\s*([^)]*?)\s*\)', r'(\1)', text)
+
+         # Fix sentence capitalization more carefully
+         # Split on ACTUAL sentence endings only
+         sentences = re.split(r'(?<=[.!?])\s+', text)
+         fixed_sentences = []
+
+         for i, sentence in enumerate(sentences):
+             if not sentence:
+                 continue
+
+             # Only capitalize the first letter if it's actually lowercase
+             # and not part of a special case (like iPhone, eBay, etc.)
+             words = sentence.split()
+             if words:
+                 first_word = words[0]
+                 # Check if it's not an acronym or proper noun that should stay lowercase
+                 if (first_word[0].islower() and
+                         not self.is_likely_acronym_or_proper_noun(first_word)):
+                     # Only capitalize if it's a regular word
+                     sentence = first_word[0].upper() + first_word[1:] + ' ' + ' '.join(words[1:])
+
+             fixed_sentences.append(sentence)
+
+         text = ' '.join(fixed_sentences)
+
+         # Fix common issues
+         text = re.sub(r'\bi\b', 'I', text)  # Capitalize 'I'
+         text = re.sub(r'\.{2,}', '.', text)  # Multiple periods to one
+         text = re.sub(r',{2,}', ',', text)  # Multiple commas to one
+         text = re.sub(r'\s*,\s*,\s*', ', ', text)  # Double commas with spaces
+
+         # Remove weird artifacts
+         text = re.sub(r'\b(CHAPTER\s+[IVX]+|SECTION\s+\d+)\b[^\w]*', '', text, flags=re.IGNORECASE)
+
+         # Fix abbreviations
+         text = re.sub(r'\betc\s*\.\s*\.', 'etc.', text)
+         text = re.sub(r'\be\.g\s*\.\s*[,\s]', 'e.g., ', text)
+         text = re.sub(r'\bi\.e\s*\.\s*[,\s]', 'i.e., ', text)
+
+         # Fix numbers with periods (like "1. " at start of lists)
+         text = re.sub(r'(\d+)\.\s+', r'\1. ', text)
+
+         # Fix bold/strong tags punctuation
+         text = self.fix_bold_punctuation(text)
+
+         # Clean up any remaining issues
+         text = re.sub(r'\s+([.,!?;:])', r'\1', text)  # Final space cleanup
+         text = re.sub(r'([.,!?;:])\s{2,}', r'\1 ', text)  # Fix multiple spaces after punctuation
+
+         # Ensure ending punctuation
+         text = text.strip()
+         if text and text[-1] not in '.!?':
+             # Don't add period if it ends with colon (likely a list header)
+             if not text.endswith(':'):
+                 text += '.'
+
+         return text
+
+ def fix_bold_punctuation(self, text):
1262
+ """Fix punctuation issues around bold/strong tags"""
1263
+ # Check if this is likely a list item with colon pattern
1264
+ def is_list_item_with_colon(text):
1265
+ # Pattern: starts with or contains <strong>Text:</strong> or <b>Text:</b>
1266
+ list_pattern = r'^\s*(?:[-•*▪▫◦‣⁃]\s*)?<(?:strong|b)>[^<]+:</(?:strong|b)>'
1267
+ return bool(re.search(list_pattern, text))
1268
+
1269
+ # If it's a list item with colon, preserve the format
1270
+ if is_list_item_with_colon(text):
1271
+ # Just clean up spacing but preserve the colon inside bold
1272
+ text = re.sub(r'<(strong|b)>\s*([^:]+)\s*:\s*</\1>', r'<\1>\2:</\1>', text)
1273
+ return text
1274
+
1275
+ # Pattern to find bold/strong content
1276
+ bold_pattern = r'<(strong|b)>(.*?)</\1>'
1277
+
1278
+ def fix_bold_match(match):
1279
+ tag = match.group(1)
1280
+ content = match.group(2).strip()
1281
+
1282
+ if not content:
1283
+ return f'<{tag}></{tag}>'
1284
+
1285
+ # Check if this is a list header (contains colon at the end)
1286
+ if content.endswith(':'):
1287
+ # Preserve list headers with colons
1288
+ return f'<{tag}>{content}</{tag}>'
1289
+
1290
+ # Remove any periods at the start or end of bold content
1291
+ content = content.strip('.')
1292
+
1293
+ # Check if this bold text is at the start of a sentence
1294
+ # (preceded by nothing, or by '. ', '! ', '? ')
1295
+ start_pos = match.start()
1296
+ is_sentence_start = (start_pos == 0 or
1297
+ (start_pos > 2 and text[start_pos-2:start_pos] in ['. ', '! ', '? ', '\n\n']))
1298
+
1299
+ # Capitalize first letter if it's at sentence start
1300
+ if is_sentence_start and content and content[0].isalpha():
1301
+ content = content[0].upper() + content[1:]
1302
+
1303
+ return f'<{tag}>{content}</{tag}>'
1304
+
1305
+ # Fix bold/strong tags
1306
+ text = re.sub(bold_pattern, fix_bold_match, text)
1307
+
1308
+ # Fix spacing around bold/strong tags (but not for list items)
1309
+ if not is_list_item_with_colon(text):
1310
+ text = re.sub(r'\.\s*<(strong|b)>', r'. <\1>', text) # Period before bold
1311
+ text = re.sub(r'</(strong|b)>\s*\.', r'</\1>.', text) # Period after bold
1312
+ text = re.sub(r'([.!?])\s*<(strong|b)>', r'\1 <\2>', text) # Space after sentence end
1313
+ text = re.sub(r'</(strong|b)>\s+([a-z])', lambda m: f'</{m.group(1)}> {m.group(2)}', text) # Keep lowercase after bold if mid-sentence
1314
+
1315
+ # Remove duplicate periods around bold tags
1316
+ text = re.sub(r'\.\s*</(strong|b)>\s*\.', r'</\1>.', text)
1317
+ text = re.sub(r'\.\s*<(strong|b)>\s*\.', r'. <\1>', text)
1318
+
1319
+ # Fix cases where bold content ends a sentence
1320
+ # If bold is followed by a new sentence (capital letter), add period
1321
+ text = re.sub(r'</(strong|b)>\s+([A-Z])', r'</\1>. \2', text)
1322
+
1323
+ # Don't remove these for list items
1324
+ if not is_list_item_with_colon(text):
1325
+ text = re.sub(r'<(strong|b)>\s*:\s*</\1>', ':', text) # Remove empty bold colons
1326
+ text = re.sub(r'<(strong|b)>\s*\.\s*</\1>', '.', text) # Remove empty bold periods
1327
+
1328
+ return text
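+         # Illustrative behavior (not from the original source):
+         #   '<strong>Key point:</strong> do this'  -> unchanged (list header preserved)
+         #   'So. <b>next steps.</b> here'          -> 'So. <b>Next steps</b> here'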
+
+     def extract_text_from_html(self, html_content):
+         """Extract text elements from HTML with skip logic"""
+         soup = BeautifulSoup(html_content, 'html.parser')
+         text_elements = []
+
+         # Get all text nodes using string= instead of the deprecated text= argument
+         for element in soup.find_all(string=True):
+             # Skip script, style, and noscript content completely
+             if element.parent.name in ['script', 'style', 'noscript']:
+                 continue
+
+             text = element.strip()
+             if text and not self.should_skip_element(element, text):
+                 text_elements.append({
+                     'text': text,
+                     'element': element
+                 })
+
+         return soup, text_elements
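+         # Illustrative (not from the original source): for
+         # BeautifulSoup('<p>Hi there</p>', 'html.parser'), find_all(string=True)
+         # yields the NavigableString 'Hi there', whose .parent.name is 'p';
+         # should_skip_element (defined elsewhere) then filters unwanted nodes.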
+
+     def validate_and_fix_html(self, html_text):
+         """Fix common HTML syntax errors after processing"""
+
+         # Fix DOCTYPE
+         html_text = re.sub(r'<!\s*DOCTYPE', '<!DOCTYPE', html_text, flags=re.IGNORECASE)
+
+         # Fix spacing issues (note: collapsing whitespace between tags can also
+         # remove meaningful spaces between adjacent inline elements)
+         html_text = re.sub(r'>\s+<', '><', html_text)  # Remove extra spaces between tags
+         html_text = re.sub(r'\s+>', '>', html_text)  # Remove spaces before closing >
+         html_text = re.sub(r'<\s+', '<', html_text)  # Remove spaces after opening <
+
+         # Fix common word errors that might occur during processing
+         html_text = html_text.replace('down loaded', 'downloaded')
+         html_text = html_text.replace('But your document', 'Your document')
+
+         return html_text
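+         # e.g. (illustrative) re.sub(r'>\s+<', '><', '<p> <b>hi</b> </p>')
+         # returns '<p><b>hi</b></p>'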
+
+     def add_natural_flow_variations(self, text):
+         """Add more natural flow and rhythm variations for Originality AI"""
+         sentences = self.split_into_sentences_advanced(text)
+         enhanced_sentences = []
+
+         for i, sentence in enumerate(sentences):
+             if not sentence.strip():
+                 continue
+
+             # Add stream-of-consciousness elements (8% chance - reduced)
+             if random.random() < 0.08 and len(sentence.split()) > 10:
+                 stream_elements = [
+                     " - wait, let me back up - ",
+                     " - actually, scratch that - ",
+                     " - or maybe I should say - ",
+                     " - hmm, how do I put this - ",
+                     " - okay, here's the thing - ",
+                     " - you know what I mean? - "
+                 ]
+                 words = sentence.split()
+                 pos = random.randint(len(words)//4, 3*len(words)//4)
+                 words.insert(pos, random.choice(stream_elements))
+                 sentence = ' '.join(words)
+
+             # Add human-like self-corrections (7% chance - reduced)
+             if random.random() < 0.07:
+                 corrections = [
+                     " - or rather, ",
+                     " - well, actually, ",
+                     " - I mean, ",
+                     " - or should I say, ",
+                     " - correction: "
+                 ]
+                 words = sentence.split()
+                 if len(words) > 8:
+                     # Insert the correction in the second half of the sentence
+                     pos = random.randint(len(words)//2, len(words)-3)
+                     words.insert(pos, random.choice(corrections))
+                     sentence = ' '.join(words)
+
+             # Add thinking-out-loud patterns (10% chance - reduced)
+             if random.random() < 0.10 and i > 0:
+                 thinking_patterns = [
+                     "Come to think of it, ",
+                     "Actually, you know what? ",
+                     "Wait, here's a thought: ",
+                     "Oh, and another thing - ",
+                     "Speaking of which, ",
+                     "This reminds me, ",
+                     "Now that I mention it, ",
+                     "Funny you should ask, because "
+                 ]
+                 pattern = random.choice(thinking_patterns)
+                 if len(sentence) > 1:
+                     sentence = pattern + sentence[0].lower() + sentence[1:]
+
+             enhanced_sentences.append(sentence)
+
+         return ' '.join(enhanced_sentences)
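+         # Illustrative only (output is randomized): a later sentence such as
+         # "This improves results." might come back as
+         # "Come to think of it, this improves results."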
+
+     def process_html(self, html_content, progress_callback=None):
+         """Main processing function with progress callback"""
+         if not html_content.strip():
+             return "Please provide HTML content."
+
+         # Store all script and style content to preserve it
+         script_placeholder = "###SCRIPT_PLACEHOLDER_{}###"
+         style_placeholder = "###STYLE_PLACEHOLDER_{}###"
+         preserved_scripts = []
+         preserved_styles = []
+
+         # Temporarily replace script and style tags with placeholders
+         soup_temp = BeautifulSoup(html_content, 'html.parser')
+
+         # Preserve all script tags
+         for idx, script in enumerate(soup_temp.find_all('script')):
+             placeholder = script_placeholder.format(idx)
+             preserved_scripts.append(str(script))
+             script.replace_with(placeholder)
+
+         # Preserve all style tags
+         for idx, style in enumerate(soup_temp.find_all('style')):
+             placeholder = style_placeholder.format(idx)
+             preserved_styles.append(str(style))
+             style.replace_with(placeholder)
+
+         # Get the modified HTML
+         html_content = str(soup_temp)
+
+         try:
+             # Extract text elements
+             soup, text_elements = self.extract_text_from_html(html_content)
+
+             total_elements = len(text_elements)
+             print(f"Found {total_elements} text elements to process (after filtering)")
+
+             # Process each text element
+             processed_count = 0
+
+             for i, element_info in enumerate(text_elements):
+                 original_text = element_info['text']
+
+                 # Skip placeholders
+                 if "###SCRIPT_PLACEHOLDER_" in original_text or "###STYLE_PLACEHOLDER_" in original_text:
+                     continue
+
+                 # Skip very short texts
+                 if len(original_text.split()) < 3:
+                     continue
+
+                 # First pass with Dipper
+                 paraphrased_text = self.paraphrase_with_dipper(
+                     original_text,
+                     lex_diversity=60,
+                     order_diversity=20
+                 )
+
+                 # Second pass with BART for longer texts (balanced probability)
+                 if self.use_bart and len(paraphrased_text.split()) > 8:
+                     # 30% chance to use BART for more variation (balanced)
+                     if random.random() < 0.3:
+                         paraphrased_text = self.paraphrase_with_bart(paraphrased_text)
+
+                 # Apply sentence variation
+                 paraphrased_text = self.apply_sentence_variation(paraphrased_text)
+
+                 # Add natural flow variations
+                 paraphrased_text = self.add_natural_flow_variations(paraphrased_text)
+
+                 # Fix punctuation and formatting
+                 paraphrased_text = self.fix_punctuation(paraphrased_text)
+
+                 # Final quality check
+                 if paraphrased_text and len(paraphrased_text.split()) >= 3:
+                     element_info['element'].replace_with(NavigableString(paraphrased_text))
+                     processed_count += 1
+
+                 # Progress update
+                 if progress_callback:
+                     progress_callback(i + 1, total_elements)
+
+                 if i % 10 == 0 or i == total_elements - 1:
+                     progress = (i + 1) / total_elements * 100
+                     print(f"Progress: {progress:.1f}%")
+
+             # Get the processed HTML
+             result = str(soup)
+
+             # Restore all script tags
+             for idx, script_content in enumerate(preserved_scripts):
+                 placeholder = script_placeholder.format(idx)
+                 result = result.replace(placeholder, script_content)
+
+             # Restore all style tags
+             for idx, style_content in enumerate(preserved_styles):
+                 placeholder = style_placeholder.format(idx)
+                 result = result.replace(placeholder, style_content)
+
+             # Post-process the entire HTML to fix bold/strong formatting
+             result = self.post_process_html(result)
+
+             # Validate and fix HTML syntax
+             result = self.validate_and_fix_html(result)
+
+             # Count skipped elements (approximate: the soup has already been
+             # mutated, so this compares remaining text nodes against the total)
+             all_text_elements = soup.find_all(string=True)
+             skipped = len([e for e in all_text_elements if e.strip() and e.parent.name not in ['script', 'style', 'noscript']]) - total_elements
+
+             print(f"Successfully processed {processed_count} text elements")
+             print(f"Skipped {skipped} elements (headings, CTAs, tables, testimonials, strong/bold tags, etc.)")
+             print(f"Preserved {len(preserved_scripts)} script tags and {len(preserved_styles)} style tags")
+
+             return result
+
+         except Exception as e:
+             import traceback
+             error_msg = f"Error processing HTML: {str(e)}\n{traceback.format_exc()}"
+             print(error_msg)
+             # Return the original HTML with the error message prepended as an HTML comment
+             return f"<!-- {error_msg} -->\n{html_content}"
+
+     def post_process_html(self, html_text):
+         """Post-process the entire HTML to fix formatting issues"""
+         # Fix empty angle brackets that might appear
+         html_text = re.sub(r'<>\s*([^<>]+?)\s*(?=\.|\s|<)', r'\1', html_text)  # Remove <> around text
+         html_text = re.sub(r'<>', '', html_text)  # Remove any remaining empty <>
+
+         # Fix double angle brackets around bold tags
+         html_text = re.sub(r'<<b>>', '<b>', html_text)
+         html_text = re.sub(r'<</b>>', '</b>', html_text)
+         html_text = re.sub(r'<<strong>>', '<strong>', html_text)
+         html_text = re.sub(r'<</strong>>', '</strong>', html_text)
+
+         # Fix periods around bold/strong tags (replacement strings must be raw
+         # so \1 is a backreference rather than the control character '\x01')
+         html_text = re.sub(r'\.\s*<(b|strong)>', r'. <\1>', html_text)  # Period before bold
+         html_text = re.sub(r'</(b|strong)>\s*\.', r'</\1>.', html_text)  # Period after bold
+         html_text = re.sub(r'\.<<(b|strong)>>', r'. <\1>', html_text)  # Fix double bracket cases
+         html_text = re.sub(r'</(b|strong)>>\.', r'</\1>.', html_text)
+
+         # Fix periods after colons
+         html_text = re.sub(r':\s*\.', ':', html_text)
+         html_text = re.sub(r'\.:', ':', html_text)
+
+         # Check if a line is a list item
+         def process_line(line):
+             # Check if this line contains a list pattern with bold
+             list_pattern = r'(?:^|\s)(?:[-•*▪▫◦‣⁃]\s*)?<(?:strong|b)>[^<]+:</(?:strong|b)>'
+             if re.search(list_pattern, line):
+                 # This is a list item, preserve the colon format
+                 return line
+
+             # Not a list item, apply regular fixes
+             # Remove periods immediately inside bold tags
+             line = re.sub(r'<(strong|b)>\s*\.\s*([^<]+)\s*\.\s*</\1>', r'<\1>\2</\1>', line)
+
+             # Fix sentence endings with bold
+             line = re.sub(r'</(strong|b)>\s*([.!?])', r'</\1>\2', line)
+
+             return line
+
+         # Process line by line to preserve list formatting
+         lines = html_text.split('\n')
+         processed_lines = [process_line(line) for line in lines]
+         html_text = '\n'.join(processed_lines)
+
+         # Fix sentence starts with bold
+         def fix_bold_sentence_start(match):
+             pre_context = match.group(1)
+             tag = match.group(2)  # bare tag name: 'strong' or 'b'
+             content = match.group(3)
+
+             # Skip if this is part of a list item with a colon
+             full_match = match.group(0)
+             if ':' in full_match and f'</{tag}>' in full_match:
+                 return full_match
+
+             # Check if this should start with a capital
+             if pre_context == '' or pre_context.endswith(('.', '!', '?', '>')):
+                 if content and content[0].islower():
+                     content = content[0].upper() + content[1:]
+
+             return f'{pre_context}<{tag}>{content}'
+
+         # Look for bold/strong tags and check their context; capture only the
+         # tag name so the rebuilt tag isn't double-bracketed
+         html_text = re.sub(r'(^|.*?)<(strong|b)>([a-zA-Z])', fix_bold_sentence_start, html_text)
+
+         # Clean up spacing around bold tags (but preserve list formatting)
+         # Split into segments to handle list items separately
+         segments = re.split(r'(<(?:strong|b)>[^<]*:</(?:strong|b)>)', html_text)
+         cleaned_segments = []
+
+         for i, segment in enumerate(segments):
+             if i % 2 == 1:  # This is a list item pattern
+                 cleaned_segments.append(segment)
+             else:
+                 # Apply spacing fixes to non-list segments
+                 segment = re.sub(r'\s+<(strong|b)>', r' <\1>', segment)
+                 segment = re.sub(r'</(strong|b)>\s+', r'</\1> ', segment)
+                 # Fix punctuation issues
+                 segment = re.sub(r'([.,!?;:])\s*([.,!?;:])', r'\1', segment)
+                 # Fix periods inside/around bold (raw replacement strings again)
+                 segment = re.sub(r'\.<(strong|b)>\.', r'. <\1>', segment)
+                 segment = re.sub(r'\.</(strong|b)>\.', r'</\1>.', segment)
+                 cleaned_segments.append(segment)
+
+         html_text = ''.join(cleaned_segments)
+
+         # Final cleanup
+         html_text = re.sub(r'\.{2,}', '.', html_text)  # Multiple periods
+         html_text = re.sub(r',{2,}', ',', html_text)  # Multiple commas
+         html_text = re.sub(r':{2,}', ':', html_text)  # Multiple colons
+         html_text = re.sub(r'\s+([.,!?;:])', r'\1', html_text)  # Space before punctuation
+
+         # Fix empty bold tags (but not those with just colons)
+         html_text = re.sub(r'<(strong|b)>\s*</\1>', '', html_text)
+
+         # Fix specific patterns in lists/stats:
+         # a bare stat like "5,000+" should not get a trailing period
+         html_text = re.sub(r'(\d+[,\d]*\+?)\s*\.\s*\n', r'\1\n', html_text)
+
+         # Clean up any remaining double brackets
+         html_text = re.sub(r'<<', '<', html_text)
+         html_text = re.sub(r'>>', '>', html_text)
+
+         # Apply final minimal grammar fixes
+         html_text = self.grammar_fixer.smart_fix(html_text)
+
+         return html_text
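+         # Why raw replacement strings matter here (illustrative):
+         #   re.sub(r'(a)', '<\1>', 'a')   -> '<\x01>'  ('\1' is an octal escape)
+         #   re.sub(r'(a)', r'<\1>', 'a')  -> '<a>'     (r'\1' is a backreference)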
 
+ # Initialize the humanizer
+ humanizer = EnhancedDipperHumanizer()
 
+ def humanize_html(html_input, progress=gr.Progress()):
+     """Gradio interface function with progress updates"""
+     if not html_input:
+         return "Please provide HTML content to humanize."
+
+     progress(0, desc="Starting processing...")
+     start_time = time.time()
+
+     # Create a wrapper to update progress
+     def progress_callback(current, total):
+         if total > 0:
+             progress(current / total, desc=f"Processing: {current}/{total} elements")
+
+     # Pass the progress callback to process_html
+     result = humanizer.process_html(
+         html_input,
+         progress_callback=progress_callback
+     )
+
+     processing_time = time.time() - start_time
+     print(f"Processing completed in {processing_time:.2f} seconds")
+     progress(1.0, desc="Complete!")
+
+     return result
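+     # Note: the gr.Progress() default argument lets Gradio inject a progress
+     # tracker when the function runs inside the UI; called directly, the
+     # default object may not update anything visible.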
 
+ # Create the Gradio interface (the queue is enabled at launch time)
+ iface = gr.Interface(
+     fn=humanize_html,
+     inputs=[
+         gr.Textbox(
+             lines=10,
+             placeholder="Paste your HTML content here...",
+             label="HTML Input"
+         )
+     ],
+     outputs=gr.Textbox(
+         lines=10,
+         label="Humanized HTML Output"
+     ),
+     title="Enhanced Dipper AI Humanizer - Optimized for Originality AI",
+     description="""
+     Aggressive humanizer aimed at raising human scores on both Undetectable AI and Originality AI.
+
+     Key features:
+     - High diversity settings (60% lexical, 20% order passed to Dipper) for natural variation
+     - Enhanced human patterns: personal opinions, self-corrections, thinking-out-loud
+     - Natural typos, contractions, and conversational flow
+     - Stream-of-consciousness elements and rhetorical questions
+     - Originality AI-specific optimizations: varied sentence starters, emphatic repetitions
+     - Skips content in <strong>, <b>, and heading tags (including inside tables)
+     - Tuned to target strict AI-detection systems
+
+     The aim is writing with genuinely human-like rhythm and variation.
+
+     ⚠️ Note: Processing may take 5-10 minutes for large HTML documents.
+     """,
+     examples=[
+         ["""<article>
+ <h1>The Benefits of Regular Exercise</h1>
+ <div class="author-intro">By John Doe, Fitness Expert | 10 years experience</div>
+ <p>Regular exercise is essential for maintaining good health. It helps improve cardiovascular fitness, strengthens muscles, and enhances mental well-being. Studies have shown that people who exercise regularly have lower risks of chronic diseases.</p>
+ <p>Additionally, exercise can boost mood and energy levels. It releases endorphins, which are natural mood elevators. Even moderate activities like walking can make a significant difference in overall health.</p>
+ </article>"""]
      ],
+     theme="default"
  )
 
 
  if __name__ == "__main__":
+     # Enable the queue for better handling of long-running processes
+     iface.queue(max_size=10)
+     iface.launch(share=True)
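+     # Note: share=True exposes a temporary public URL, and queue(max_size=10)
+     # bounds how many requests may wait at once - sensible defaults for a
+     # Space, not hard requirements.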