Vanelsz committed on
Commit
2fd2a64
·
verified ·
1 Parent(s): defea43

Upload README.md

Files changed (1)
  1. README.md +488 -3
README.md CHANGED
@@ -1,3 +1,488 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ <div align="center">
2
+ <img src="imgs/logo.jpg" width="80%" >
3
+ </div>
4
+
5
+
6
+ <p align="center">
7
+ 🤗 <a href="https://github.com/alibaba/Logics-Parsing">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/studios/Alibaba-DT/Logics-Parsing/summary">Demo</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2509.19760">Technical Report</a>
8
+ </p>
9
+
10
+ ## Introduction
11
+ <div align="center">
12
+ <img src="imgs/overview.png" alt="LogicsDocBench overview" style="width: 800px; height: 250px;">
13
+ </div>
14
+
15
+ <div align="center">
16
+ <table style="width: 800px;">
17
+ <tr>
18
+ <td align="center">
19
+ <img src="imgs/report.gif" alt="Research report example">
20
+ </td>
21
+ <td align="center">
22
+ <img src="imgs/chemistry.gif" alt="Chemical formula example">
23
+ </td>
24
+ <td align="center">
25
+ <img src="imgs/paper.gif" alt="Paper example">
26
+ </td>
27
+ <td align="center">
28
+ <img src="imgs/handwritten.gif" alt="Handwriting example">
29
+ </td>
30
+ </tr>
31
+ <tr>
32
+ <td align="center"><b>report</b></td>
33
+ <td align="center"><b>chemistry</b></td>
34
+ <td align="center"><b>paper</b></td>
35
+ <td align="center"><b>handwritten</b></td>
36
+ </tr>
37
+ </table>
38
+ </div>
39
+
40
+
41
+
42
+ Logics-Parsing is a powerful, end-to-end document parsing model built upon a general Vision-Language Model (VLM) through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It excels at accurately analyzing and structuring highly complex documents.
43
+
44
+ ## Key Features
45
+
46
+ * **Effortless End-to-End Processing**
47
+ * Our single-model architecture eliminates the need for complex, multi-stage pipelines. Deployment and inference are straightforward, going directly from a document image to structured output.
48
+ * It demonstrates exceptional performance on documents with challenging layouts.
49
+
50
+ * **Advanced Content Recognition**
51
+ * It accurately recognizes and structures difficult content, including intricate scientific formulas.
52
+ * Chemical structures are recognized and can be output in the standard **SMILES** format.
53
+
54
+ * **Rich, Structured HTML Output**
55
+ * The model generates a clean HTML representation of the document, preserving its logical structure.
56
+ * Each content block (e.g., paragraph, table, figure, formula) is tagged with its **category**, **bounding box coordinates**, and **OCR text**.
57
+ * It automatically identifies and filters out irrelevant elements like headers and footers, focusing only on the core content.
58
+
59
+ * **State-of-the-Art Performance**
60
+ * Logics-Parsing achieves the best overall score on our in-house benchmark, which is designed to evaluate a model’s parsing capability on complex-layout documents and STEM content.
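
The block-level tagging described under "Rich, Structured HTML Output" can be consumed with a small parser. The following is a minimal sketch only: the sample markup, the use of `class` for the category, and the `data-bbox` attribute are assumptions made for illustration, not the model's documented output schema, so adjust the attribute names to whatever the model actually emits.

```python
from html.parser import HTMLParser

# Hypothetical sample: the real tag and attribute names may differ.
SAMPLE_HTML = (
    '<div class="paragraph" data-bbox="10 20 300 60">Hello world</div>'
    '<div class="formula" data-bbox="10 80 300 120">E = mc^2</div>'
)


class BlockCollector(HTMLParser):
    """Collect category, bounding box, and text from top-level <div> blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self._depth += 1
        if self._depth == 1:
            attr = dict(attrs)
            # Assumed format: four space-separated integers (x1 y1 x2 y2).
            bbox = [int(v) for v in (attr.get("data-bbox") or "").split()]
            self._current = {
                "category": attr.get("class"),
                "bbox": bbox,
                "text": "",
            }

    def handle_data(self, data):
        # Accumulate all text that appears inside the current block.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self._depth == 1 and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.blocks.append(self._current)
            self._current = None
        self._depth = max(self._depth - 1, 0)


def parse_blocks(html_text):
    """Return a list of dicts, one per top-level block."""
    collector = BlockCollector()
    collector.feed(html_text)
    collector.close()
    return collector.blocks


if __name__ == "__main__":
    for block in parse_blocks(SAMPLE_HTML):
        print(block["category"], block["bbox"], block["text"])
```

Only the standard library is used here, so the sketch runs without extra dependencies; a production pipeline would more likely use a full HTML library.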
61
+
62
+
63
+
64
+
65
+
66
+ ## Benchmark
67
+
68
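+ Existing document-parsing benchmarks often provide limited coverage of complex layouts and STEM content. To address this, we constructed an in-house benchmark comprising 1,078 page-level images across nine major categories and over twenty sub-categories. On this benchmark, our model achieves the best overall, text, chemistry, and handwriting edit distances, and the second-best formula and reading-order scores. In the table below, bold marks the best result in each column and underline marks the second best.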
69
+ <div align="center">
70
+ <img src="imgs/BenchCls.png">
71
+ </div>
72
+ <table>
73
+ <tr>
74
+ <td rowspan="2">Model Type</td>
75
+ <td rowspan="2">Methods</td>
76
+ <td colspan="2">Overall <sup>Edit</sup> ↓</td>
77
+ <td colspan="2">Text <sup>Edit</sup> ↓</td>
78
+ <td colspan="2">Formula <sup>Edit</sup> ↓</td>
79
+ <td colspan="2">Table <sup>TEDS</sup> ↑</td>
80
+ <td colspan="2">Table <sup>Edit</sup> ↓</td>
81
+ <td colspan="2">Reading Order <sup>Edit</sup> ↓</td>
82
+ <td rowspan="1">Chemistry<sup>Edit</sup> ↓</td>
83
+ <td rowspan="1">Handwriting <sup>Edit</sup> ↓</td>
84
+ </tr>
85
+ <tr>
86
+ <td>EN</td>
87
+ <td>ZH</td>
88
+ <td>EN</td>
89
+ <td>ZH</td>
90
+ <td>EN</td>
91
+ <td>ZH</td>
92
+ <td>EN</td>
93
+ <td>ZH</td>
94
+ <td>EN</td>
95
+ <td>ZH</td>
96
+ <td>EN</td>
97
+ <td>ZH</td>
98
+ <td>ALL</td>
99
+ <td>ALL</td>
100
+ </tr>
101
+ <tr>
102
+ <td rowspan="7">Pipeline Tools</td>
103
+ <td>doc2x</td>
104
+ <td>0.209</td>
105
+ <td>0.188</td>
106
+ <td>0.128</td>
107
+ <td>0.194</td>
108
+ <td>0.377</td>
109
+ <td>0.321</td>
110
+ <td>81.1</td>
111
+ <td>85.3</td>
112
+ <td><ins>0.148</ins></td>
113
+ <td><ins>0.115</ins></td>
114
+ <td>0.146</td>
115
+ <td>0.122</td>
116
+ <td>1.0</td>
117
+ <td>0.307</td>
118
+ </tr>
119
+ <tr>
120
+ <td>Textin</td>
121
+ <td>0.153</td>
122
+ <td>0.158</td>
123
+ <td>0.132</td>
124
+ <td>0.190</td>
125
+ <td>0.185</td>
126
+ <td>0.223</td>
127
+ <td>76.7</td>
128
+ <td><ins>86.3</ins></td>
129
+ <td>0.176</td>
130
+ <td><b>0.113</b></td>
131
+ <td><b>0.118</b></td>
132
+ <td><b>0.104</b></td>
133
+ <td>1.0</td>
134
+ <td>0.344</td>
135
+ </tr>
136
+ <tr>
137
+ <td>Mathpix<sup>*</sup></td>
138
+ <td><ins>0.128</ins></td>
139
+ <td><ins>0.146</ins></td>
140
+ <td>0.128</td>
141
+ <td><ins>0.152</ins></td>
142
+ <td><b>0.06</b></td>
143
+ <td><b>0.142</b></td>
144
+ <td><b>86.2</b></td>
145
+ <td><b>86.6</b></td>
146
+ <td><b>0.120</b></td>
147
+ <td>0.127</td>
148
+ <td>0.204</td>
149
+ <td>0.164</td>
150
+ <td>0.552</td>
151
+ <td>0.263</td>
152
+ </tr>
153
+ <tr>
154
+ <td>PP-StructureV3</td>
155
+ <td>0.220</td>
156
+ <td>0.226</td>
157
+ <td>0.172</td>
158
+ <td>0.29</td>
159
+ <td>0.272</td>
160
+ <td>0.276</td>
161
+ <td>66</td>
162
+ <td>71.5</td>
163
+ <td>0.237</td>
164
+ <td>0.193</td>
165
+ <td>0.201</td>
166
+ <td>0.143</td>
167
+ <td>1.0</td>
168
+ <td>0.382</td>
169
+ </tr>
170
+ <tr>
171
+ <td>MinerU2</td>
172
+ <td>0.212</td>
173
+ <td>0.245</td>
174
+ <td>0.134</td>
175
+ <td>0.195</td>
176
+ <td>0.280</td>
177
+ <td>0.407</td>
178
+ <td>67.5</td>
179
+ <td>71.8</td>
180
+ <td>0.228</td>
181
+ <td>0.203</td>
182
+ <td>0.205</td>
183
+ <td>0.177</td>
184
+ <td>1.0</td>
185
+ <td>0.387</td>
186
+ </tr>
187
+ <tr>
188
+ <td>Marker</td>
189
+ <td>0.324</td>
190
+ <td>0.409</td>
191
+ <td>0.188</td>
192
+ <td>0.289</td>
193
+ <td>0.285</td>
194
+ <td>0.383</td>
195
+ <td>65.5</td>
196
+ <td>50.4</td>
197
+ <td>0.593</td>
198
+ <td>0.702</td>
199
+ <td>0.23</td>
200
+ <td>0.262</td>
201
+ <td>1.0</td>
202
+ <td>0.50</td>
203
+ </tr>
204
+ <tr>
205
+ <td>Pix2text</td>
206
+ <td>0.447</td>
207
+ <td>0.547</td>
208
+ <td>0.485</td>
209
+ <td>0.577</td>
210
+ <td>0.312</td>
211
+ <td>0.465</td>
212
+ <td>64.7</td>
213
+ <td>63.0</td>
214
+ <td>0.566</td>
215
+ <td>0.613</td>
216
+ <td>0.424</td>
217
+ <td>0.534</td>
218
+ <td>1.0</td>
219
+ <td>0.95</td>
220
+ </tr>
221
+ <tr>
222
+ <td rowspan="8">Expert VLMs</td>
223
+ <td>Dolphin</td>
224
+ <td>0.208</td>
225
+ <td>0.256</td>
226
+ <td>0.149</td>
227
+ <td>0.189</td>
228
+ <td>0.334</td>
229
+ <td>0.346</td>
230
+ <td>72.9</td>
231
+ <td>60.1</td>
232
+ <td>0.192</td>
233
+ <td>0.35</td>
234
+ <td>0.160</td>
235
+ <td>0.139</td>
236
+ <td>0.984</td>
237
+ <td>0.433</td>
238
+ </tr>
239
+ <tr>
240
+ <td>dots.ocr</td>
241
+ <td>0.186</td>
242
+ <td>0.198</td>
243
+ <td><ins>0.115</ins></td>
244
+ <td>0.169</td>
245
+ <td>0.291</td>
246
+ <td>0.358</td>
247
+ <td>79.5</td>
248
+ <td>82.5</td>
249
+ <td>0.172</td>
250
+ <td>0.141</td>
251
+ <td>0.165</td>
252
+ <td>0.123</td>
253
+ <td>1.0</td>
254
+ <td><ins>0.255</ins></td>
255
+ </tr>
256
+ <tr>
257
+ <td>MonkeyOCR</td>
258
+ <td>0.193</td>
259
+ <td>0.259</td>
260
+ <td>0.127</td>
261
+ <td>0.236</td>
262
+ <td>0.262</td>
263
+ <td>0.325</td>
264
+ <td>78.4</td>
265
+ <td>74.7</td>
266
+ <td>0.186</td>
267
+ <td>0.294</td>
268
+ <td>0.197</td>
269
+ <td>0.180</td>
270
+ <td>1.0</td>
271
+ <td>0.623</td>
272
+ </tr>
273
+ <tr>
274
+ <td>OCRFlux</td>
275
+ <td>0.252</td>
276
+ <td>0.254</td>
277
+ <td>0.134</td>
278
+ <td>0.195</td>
279
+ <td>0.326</td>
280
+ <td>0.405</td>
281
+ <td>58.3</td>
282
+ <td>70.2</td>
283
+ <td>0.358</td>
284
+ <td>0.260</td>
285
+ <td>0.191</td>
286
+ <td>0.156</td>
287
+ <td>1.0</td>
288
+ <td>0.284</td>
289
+ </tr>
290
+ <tr>
291
+ <td>GOT-OCR</td>
292
+ <td>0.247</td>
293
+ <td>0.249</td>
294
+ <td>0.181</td>
295
+ <td>0.213</td>
296
+ <td>0.231</td>
297
+ <td>0.318</td>
298
+ <td>59.5</td>
299
+ <td>74.7</td>
300
+ <td>0.38</td>
301
+ <td>0.299</td>
302
+ <td>0.195</td>
303
+ <td>0.164</td>
304
+ <td>0.969</td>
305
+ <td>0.446</td>
306
+ </tr>
307
+ <tr>
308
+ <td>olmOCR</td>
309
+ <td>0.341</td>
310
+ <td>0.382</td>
311
+ <td>0.125</td>
312
+ <td>0.205</td>
313
+ <td>0.719</td>
314
+ <td>0.766</td>
315
+ <td>57.1</td>
316
+ <td>56.6</td>
317
+ <td>0.327</td>
318
+ <td>0.389</td>
319
+ <td>0.191</td>
320
+ <td>0.169</td>
321
+ <td>1.0</td>
322
+ <td>0.294</td>
323
+ </tr>
324
+ <tr>
325
+ <td>SmolDocling</td>
326
+ <td>0.657</td>
327
+ <td>0.895</td>
328
+ <td>0.486</td>
329
+ <td>0.932</td>
330
+ <td>0.859</td>
331
+ <td>0.972</td>
332
+ <td>18.5</td>
333
+ <td>1.5</td>
334
+ <td>0.86</td>
335
+ <td>0.98</td>
336
+ <td>0.413</td>
337
+ <td>0.695</td>
338
+ <td>1.0</td>
339
+ <td>0.927</td>
340
+ </tr>
341
+ <tr>
342
+ <td><b>Logics-Parsing</b></td>
343
+ <td><b>0.124</b></td>
344
+ <td><b>0.145</b></td>
345
+ <td><b>0.089</b></td>
346
+ <td><b>0.139</b></td>
347
+ <td><ins>0.106</ins></td>
348
+ <td><ins>0.165</ins></td>
349
+ <td>76.6</td>
350
+ <td>79.5</td>
351
+ <td>0.165</td>
352
+ <td>0.166</td>
353
+ <td><ins>0.136</ins></td>
354
+ <td><ins>0.113</ins></td>
355
+ <td><b>0.519</b></td>
356
+ <td><b>0.252</b></td>
357
+ </tr>
358
+ <tr>
359
+ <td rowspan="5">General VLMs</td>
360
+ <td>Qwen2-VL-72B</td>
361
+ <td>0.298</td>
362
+ <td>0.342</td>
363
+ <td>0.142</td>
364
+ <td>0.244</td>
365
+ <td>0.431</td>
366
+ <td>0.363</td>
367
+ <td>64.2</td>
368
+ <td>55.5</td>
369
+ <td>0.425</td>
370
+ <td>0.581</td>
371
+ <td>0.193</td>
372
+ <td>0.182</td>
373
+ <td>0.792</td>
374
+ <td>0.359</td>
375
+ </tr>
376
+ <tr>
377
+ <td>Qwen2.5-VL-72B</td>
378
+ <td>0.233</td>
379
+ <td>0.263</td>
380
+ <td>0.162</td>
381
+ <td>0.24</td>
382
+ <td>0.251</td>
383
+ <td>0.257</td>
384
+ <td>69.6</td>
385
+ <td>67</td>
386
+ <td>0.313</td>
387
+ <td>0.353</td>
388
+ <td>0.205</td>
389
+ <td>0.204</td>
390
+ <td>0.597</td>
391
+ <td>0.349</td>
392
+ </tr>
393
+ <tr>
394
+ <td>Doubao-1.6</td>
395
+ <td>0.188</td>
396
+ <td>0.248</td>
397
+ <td>0.129</td>
398
+ <td>0.219</td>
399
+ <td>0.273</td>
400
+ <td>0.336</td>
401
+ <td>74.9</td>
402
+ <td>69.7</td>
403
+ <td>0.180</td>
404
+ <td>0.288</td>
405
+ <td>0.171</td>
406
+ <td>0.148</td>
407
+ <td>0.601</td>
408
+ <td>0.317</td>
409
+ </tr>
410
+ <tr>
411
+ <td>GPT-5</td>
412
+ <td>0.242</td>
413
+ <td>0.373</td>
414
+ <td>0.119</td>
415
+ <td>0.36</td>
416
+ <td>0.398</td>
417
+ <td>0.456</td>
418
+ <td>67.9</td>
419
+ <td>55.8</td>
420
+ <td>0.26</td>
421
+ <td>0.397</td>
422
+ <td>0.191</td>
423
+ <td>0.28</td>
424
+ <td>0.88</td>
425
+ <td>0.46</td>
426
+ </tr>
427
+ <tr>
428
+ <td>Gemini 2.5 Pro</td>
429
+ <td>0.185</td>
430
+ <td>0.20</td>
431
+ <td><ins>0.115</ins></td>
432
+ <td>0.155</td>
433
+ <td>0.288</td>
434
+ <td>0.326</td>
435
+ <td><ins>82.6</ins></td>
436
+ <td>80.3</td>
437
+ <td>0.154</td>
438
+ <td>0.182</td>
439
+ <td>0.181</td>
440
+ <td>0.136</td>
441
+ <td><ins>0.535</ins></td>
442
+ <td>0.26</td>
443
+ </tr>
444
+
445
+ </table>
446
+ <!-- Footnote -->
447
+ <tr>
448
+ <td colspan="5">
449
+ <sup>*</sup> Mathpix was tested on the v3/PDF Conversion API (August 2025 deployment).
450
+
451
+ </td>
452
+ </tr>
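
To make the "Edit ↓" columns easier to interpret, here is a minimal sketch of a normalized edit distance. It assumes the metric is a character-level Levenshtein distance divided by the length of the longer string, which is a common convention in document-parsing evaluation; the exact normalization and text preprocessing used for the table above are not stated here and may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Return a score in [0, 1]; 0 is a perfect match (assumed normalization)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    print(normalized_edit_distance("kitten", "sitting"))
    print(normalized_edit_distance("", "CCO"))
```

Under this assumed definition, an empty prediction against a non-empty reference scores exactly 1.0, the worst possible value.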
453
+
454
+
455
+ ## Quick Start
456
+ ### 1. Installation
457
+ ```shell
458
+ conda create -n logics-parsing python=3.10
459
+ conda activate logics-parsing
460
+
461
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
462
+
463
+ ```
464
+ ### 2. Download Model Weights
465
+
466
+ ```shell
467
+ # Download our model from ModelScope.
468
+ pip install modelscope
469
+ python download_model.py -t modelscope
470
+
471
+ # Download our model from Hugging Face.
472
+ pip install huggingface_hub
473
+ python download_model.py -t huggingface
474
+ ```
475
+
476
+ ### 3. Inference
477
+ ```shell
478
+ python3 inference.py --image_path PATH_TO_INPUT_IMG --output_path PATH_TO_OUTPUT --model_path PATH_TO_MODEL
479
+ ```
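
To process a folder of page images, the command above can be wrapped in a short script. This is a sketch under assumptions: only the three flags shown above are taken from this README, while the helper names `build_command` and `run_batch`, the `.png` filter, and the choice to write one `.html` file per image are hypothetical. Check `inference.py` for what `--output_path` actually expects.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image_path, output_dir, model_path):
    """Build the inference.py command line for a single image."""
    # Assumption: --output_path is a file path; one .html file per image.
    output_path = Path(output_dir) / (Path(image_path).stem + ".html")
    return [
        sys.executable,
        "inference.py",
        "--image_path", str(image_path),
        "--output_path", str(output_path),
        "--model_path", str(model_path),
    ]


def run_batch(image_dir, output_dir, model_path, dry_run=True):
    """Run inference.py once per PNG in image_dir; return the commands."""
    images = sorted(Path(image_dir).glob("*.png"))  # assumption: PNG inputs
    commands = [build_command(p, output_dir, model_path) for p in images]
    if not dry_run:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True stops the batch on the first failing page.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands without launching the model.
    for command in run_batch("PATH_TO_INPUT_DIR", "PATH_TO_OUTPUT", "PATH_TO_MODEL"):
        print(" ".join(command))
```

Launching one process per image reloads the model each time, so this is convenient for small batches only; for larger jobs, loading the model once inside a single Python process would be faster.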
480
+
481
+ ## Acknowledgments
482
+
483
+
484
+ We would like to acknowledge the following projects and services that provided inspiration and reference for this work:
485
+ - [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
486
+ - [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
487
+ - [Mathpix](https://mathpix.com/)
488
+