In real-world work, unstructured data is far more abundant than structured data. If these massive volumes of data cannot be parsed, their enormous value cannot be realized. Among unstructured data, PDF documents account for the majority, and handling PDF documents effectively also goes a long way toward managing other types of unstructured documents.

This article introduces methods for parsing PDF files, offering algorithms and references for parsing PDF documents effectively and extracting as much useful information as possible.

1. The Challenges of Parsing PDFs

PDF documents are representative of unstructured documents; however, extracting information from PDF documents is a challenging process.

It is more accurate to describe PDF as a collection of printing instructions than as a data format. A PDF file consists of a series of instructions that tell a PDF reader or printer where and how to display symbols on a screen or on paper. This is in contrast to file formats such as HTML and docx, which use tags like <p>, <w:p>, <table>, and <w:tbl> to organize different logical structures, as shown in Figure 2.
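To make this concrete, here is a simplified, hand-written fragment of the kind of drawing instructions a PDF content stream contains (illustrative only, not extracted from a real file). Note that nothing in it marks a paragraph or a table cell; it only says where to paint glyphs:

BT                  % begin a text object
/F1 12 Tf           % select font F1 at 12 points
100 700 Td          % move the text position to (100, 700)
(Hello, PDF) Tj     % paint the string at that position
ET                  % end the text object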

The challenge of parsing PDF documents lies in accurately extracting the layout of the entire page and translating the content, including tables, headings, paragraphs, and images, into a textual representation of the document. This process involves dealing with inaccuracies in text extraction and image recognition, as well as confusion in the row-column relationships within tables.

2. How to Parse PDF Documents

Generally speaking, there are three approaches to parsing PDFs:

2.1 Rule-Based Approaches

pypdf[1] is a widely used rule-based parser, and it is also the standard method for parsing PDF files in LangChain[4] and LlamaIndex.

The following uses pypdf to parse page 6 of the paper "Attention Is All You Need"[2]. The original page is shown in Figure 3.

The code is as follows:

import PyPDF2  # legacy package name; newer releases ship as pypdf (from pypdf import PdfReader)

filename = "/Users/Florian/Downloads/1706.03762.pdf"
pdf_file = open(filename, 'rb')

reader = PyPDF2.PdfReader(pdf_file)
page_num = 5  # pages are 0-indexed, so this is page 6 of the paper
page = reader.pages[page_num]
text = page.extract_text()

print('--------------------------------------------------')
print(text)

pdf_file.close()

The result of the execution is as follows (the rest is omitted for brevity):

(py) Florian:~ Florian$ pip list | grep pypdf
pypdf              3.17.4
pypdfium2          4.26.0
(py) Florian:~ Florian$ python /Users/Florian/Downloads/pypdf_test.py
--------------------------------------------------
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. nis the sequence length, dis the representation dimension, kis the kernel
size of convolutions and rthe size of the neighborhood in restricted self-attention.
Layer Type Complexity per Layer Sequential Maximum Path Length
Operations
Self-Attention O(n2·d) O(1) O(1)
Recurrent O(n·d2) O(n) O(n)
Convolutional O(k·n·d2) O(1) O(logk(n))
Self-Attention (restricted) O(r·n·d) O(1) O(n/r)
3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the
order of the sequence, we must inject some information about the relative or absolute position of the
tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and fixed [9].
In this work, we use sine and cosine functions of different frequencies:
PE(pos,2i)=sin(pos/100002i/d model)
PE(pos,2i+1)=cos(pos/100002i/d model)
where posis the position and iis the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from 2πto10000 ·2π. We
chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset k,PEpos+kcan be represented as a linear function of
PEpos.
...

From the PyPDF results above, it can be observed that PyPDF serializes the character sequences of the PDF into one long sequence without preserving any structural information. In other words, it treats each line of the document as a sequence separated by the newline character "\n", which prevents accurate identification of paragraphs and tables.

This limitation is an inherent characteristic of rule-based approaches.

2.2 Approaches Based on Deep Learning Models

The advantage of this approach is its ability to accurately identify the layout of the entire document, including tables and paragraphs. It can even understand the structures within tables. This means that it can divide the document into well-defined, complete units of information while preserving the intended meaning and structure.

However, this approach also has some limitations: the object detection and OCR stages can be time-consuming. It is therefore advisable to use a GPU or another acceleration device, and to process with multiple processes and threads, as sketched below.
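As a rough illustration of that last point, here is a minimal sketch of parallelizing parsing across files with Python's standard library; parse_one is a hypothetical placeholder for whichever deep-learning parsing pipeline you choose:

from concurrent.futures import ProcessPoolExecutor

def parse_one(path):
    # Hypothetical placeholder: invoke your chosen layout-detection + OCR
    # pipeline here and return the extracted elements for this file.
    ...

if __name__ == "__main__":
    pdf_paths = ["a.pdf", "b.pdf", "c.pdf"]
    # One worker process per file: the detection and OCR stages are
    # compute-bound, so process-level parallelism sidesteps the GIL.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(parse_one, pdf_paths))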

This approach involves object detection and OCR models. I have tested several representative open-source frameworks, such as unstructured[3], Layout-parser[5][6], and PP-StructureV2[7].

In addition to open-source tools, there are paid tools such as ChatDOC, which use layout-based recognition plus OCR to parse PDF documents.

Next, we will use the open-source unstructured[3] framework to parse PDFs and address three key challenges.

Challenge 1: How to Extract Data from Tables and Images

Here we use the unstructured[3] framework as an example. Detected table data can be exported directly as HTML. The code is as follows:

from unstructured.partition.pdf import partition_pdf

filename = "/Users/Florian/Downloads/Attention_Is_All_You_Need.pdf"

# infer_table_structure=True automatically selects the hi_res strategy
elements = partition_pdf(filename=filename, infer_table_structure=True)
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)
print('--------------------------------------------------')
print(tables[0].metadata.text_as_html)

The internal flow of the partition_pdf function is shown in Figure 5.

The result of running the code is as follows:

Layer Type Self-Attention Recurrent Convolutional Self-Attention (restricted) Complexity per Layer O(n2 · d) O(n · d2) O(k · n · d2) O(r · n · d) Sequential Maximum Path Length Operations O(1) O(n) O(1) O(1) O(1) O(n) O(logk(n)) O(n/r)
--------------------------------------------------
<table><thead><th>Layer Type</th><th>Complexity per Layer</th><th>Sequential Operations</th><th>Maximum Path Length</th></thead><tr><td>Self-Attention</td><td>O(n? - d)</td><td>O(1)</td><td>O(1)</td></tr><tr><td>Recurrent</td><td>O(n- d?)</td><td>O(n)</td><td>O(n)</td></tr><tr><td>Convolutional</td><td>O(k-n-d?)</td><td>O(1)</td><td>O(logy(n))</td></tr><tr><td>Self-Attention (restricted)</td><td>O(r-n-d)</td><td>ol)</td><td>O(n/r)</td></tr></table>

Copy the HTML markup and save it as an HTML file, then open it with Chrome, as shown in Figure 6.
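Instead of copying by hand, the markup can also be written out directly; a minimal sketch, reusing the tables list from the code above:

# Save the recovered table markup so it can be opened in a browser.
with open("table.html", "w") as f:
    f.write(tables[0].metadata.text_as_html)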

It can be observed that unstructured's algorithm has largely recovered the entire table.

Challenge 2: How to Rearrange the Detected Blocks, Especially for Double-Column PDFs

When handling double-column PDFs, let's take the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"[8] as an example, where the reading order is indicated by the red arrows.

After determining the layout, the unstructured[3] framework divides each page into several rectangular blocks, as shown in Figure 8.

The detailed information of each rectangular block can be obtained in the following format:

[
LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text='These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9519357085227966, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=196.5296173095703, y1=181.1507377777777, x2=815.468994140625, y2=512.548237777777), text='word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- In addi- train a deep bidirectional Transformer. tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9517233967781067, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=200.22352600097656, y1=539.1451822222216, x2=825.0242919921875, y2=870.542682222221), text='? We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9414362907409668, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=851.8727416992188, y1=599.8257377777753, x2=1468.0499267578125, y2=1420.4982377777742), text='ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-speci?c architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.938507616519928, image_path=None, parent=None),

LayoutElement(bbox=Rectangle(x1=199.3734130859375, y1=900.5257377777765, x2=824.69873046875, y2=1156.648237777776), text='? We show that pre-trained representations reduce the need for many heavily-engineered task- speci?c architectures. BERT is the ?rst ?ne- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-speci?c architectures. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9461237788200378, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=195.5695343017578, y1=1185.526123046875, x2=815.9393920898438, y2=1330.3272705078125), text='? BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9213815927505493, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=195.33956909179688, y1=1360.7886962890625, x2=447.47264000000007, y2=1397.038330078125), text='2 Related Work ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8663332462310791, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=197.7477264404297, y1=1419.3353271484375, x2=817.3308715820312, y2=1527.54443359375), text='There is a long history of pre-training general lan- guage representations, and we brie?y review the most widely-used approaches in this section. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.928022563457489, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=851.0028686523438, y1=1468.341394166663, x2=1420.4693603515625, y2=1498.6444497222187), text='2.2 Unsupervised Fine-tuning Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8346447348594666, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=853.5444444444446, y1=1526.3701822222185, x2=1470.989990234375, y2=1669.5843488888852), text='As with the feature-based approaches, the ?rst works in this direction only pre-trained word em- (Col- bedding parameters from unlabeled text lobert and Weston, 2008). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9344717860221863, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=200.00000000000009, y1=1556.2037353515625, x2=799.1743774414062, y2=1588.031982421875), text='2.1 Unsupervised Feature-based Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8317819237709045, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=198.64227294921875, y1=1606.3146266666645, x2=815.2886352539062, y2=2125.895459999998), text='Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering signi?cant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9450697302818298, image_path=None, parent=None),
LayoutElement(bbox=Rectangle(x1=853.4905395507812, y1=1681.5868488888855, x2=1467.8729248046875, y2=2125.8954599999965), text='More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and ?ne-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang language model- Left-to-right et al., 2018a). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9476840496063232, image_path=None, parent=None)
]

Here, (x1, y1) is the coordinate of the top-left vertex, and (x2, y2) is the coordinate of the bottom-right vertex:

(x_1, y_1) ----------
|                    |
|                    |
|                    |
---------- (x_2, y_2)

At this point, you can choose to re-arrange the reading order of the page. unstructured[3] has a built-in sorting algorithm, but I found its sorting results unsatisfactory when handling double-column layouts.

It is therefore necessary to design a custom algorithm. The simplest approach is to sort first by the horizontal coordinate of the top-left vertex and, if the horizontal coordinates are equal, then by the vertical coordinate. Its pseudo-code is as follows:

layout.sort(key=lambda z: (z.bbox.x1, z.bbox.y1, z.bbox.x2, z.bbox.y2))

However, we found that even blocks within the same column may vary in their horizontal coordinates. As shown in Figure 9, the horizontal coordinate bbox.x1 of the block with the purple line is actually further to the left, so when sorted it would be placed before the block with the green line, which clearly violates the reading order.

One possible algorithm for this case works as follows. First, determine the horizontal midline of the page:

# Horizontal midline of the page body: halfway between the leftmost
# and rightmost edges of all detected blocks.
x1_min = min([el.bbox.x1 for el in layout])
x2_max = max([el.bbox.x2 for el in layout])
mid_line_x_coordinate = (x2_max + x1_min) / 2

Next, if bbox.x1 < mid_line_x_coordinate, the block is classified as part of the left column. Otherwise, it is considered part of the right column.

Once classification is complete, sort the blocks within each column by their y-coordinate. Finally, concatenate the right column after the left column:

# Split the blocks into columns based on the midline.
left_column = []
right_column = []
for el in layout:
    if el.bbox.x1 < mid_line_x_coordinate:
        left_column.append(el)
    else:
        right_column.append(el)

# Sort each column top to bottom, then read the left column first.
left_column.sort(key=lambda z: z.bbox.y1)
right_column.sort(key=lambda z: z.bbox.y1)
sorted_layout = left_column + right_column

It is worth mentioning that this improvement is also compatible with single-column PDFs: there, every block starts to the left of the midline, so all blocks fall into left_column and are simply sorted from top to bottom.

Challenge 3: How to Extract Multi-Level Headings

The purpose of extracting headings, including multi-level headings, is to improve the accuracy of the LLM's answers.

For example, if a user wants to know the main content of Section 2.1 in Figure 9, then by accurately extracting the heading of Section 2.1 and sending it to the LLM as context together with the related content, the accuracy of the final answer improves significantly.

The algorithm still relies on the layout blocks shown in Figure 9. We can extract the blocks with type='Section-header' and compute their height difference (bbox.y2 - bbox.y1). The blocks with the largest height difference correspond to first-level headings, followed by second-level headings, and then third-level headings.
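A minimal sketch of this idea in code, assuming layout is the list of LayoutElement blocks shown earlier (the rounding step is an added assumption to absorb floating-point jitter, not part of the original description):

# Keep only the blocks detected as section headers.
headers = [el for el in layout if el.type == 'Section-header']

# Rank the distinct header heights in descending order: the tallest
# headers are treated as level-1 headings, the next tallest as level-2,
# and so on.
distinct_heights = sorted(
    {round(el.bbox.y2 - el.bbox.y1, 1) for el in headers},
    reverse=True,
)
level_of = {h: i + 1 for i, h in enumerate(distinct_heights)}

for el in headers:
    height = round(el.bbox.y2 - el.bbox.y1, 1)
    print(f"level {level_of[height]}: {el.text.strip()}")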

2.3 Parsing PDFs with Complex Structures Using Multimodal Large Models

Following the explosion of multimodal models, multimodal models can also be used to parse tables. LlamaIndex provides several examples of this[9].

After testing, the third of those methods was determined to be the most effective.

In addition, we can use multimodal models to extract or summarize key information from images (PDF files can easily be converted into images), as shown in Figure 10.
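As a rough sketch of this idea (this is not the exact code from the LlamaIndex examples[9]; the model name, prompt, and page number are assumptions for illustration), one could render a page to an image with pdf2image and ask a vision-capable model to transcribe the table:

import base64
from io import BytesIO

from openai import OpenAI
from pdf2image import convert_from_path

# Render one page of the paper as a PNG image (pages are 1-indexed here).
pages = convert_from_path("1706.03762.pdf", first_page=6, last_page=6)
buffer = BytesIO()
pages[0].save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode()

# Send the page image to a vision-capable model and ask for the table.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; any vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table on this page as HTML."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)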

3. Conclusion

In general, unstructured documents offer a high degree of flexibility and require a variety of parsing techniques. However, no consensus on best practices has yet emerged.

In this situation, it is recommended to choose the method best suited to your project's needs, applying specific techniques to specific types of PDFs. For example, papers, books, and financial statements may each warrant a distinct design based on their characteristics.

If possible, however, it is recommended to choose a deep-learning-based or multimodal-based method. These methods can effectively segment documents into well-defined, complete units of information, thereby maximally preserving the document's intended meaning and structure.

References:

[1] https://github.com/py-pdf/pypdf

[2] https://arxiv.org/pdf/1706.03762.pdf

[3] http://unstructured-io.github.io/unstructured/

[4] https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/document_loaders/pdf.py

[5] http://github.com/Layout-Parser/layout-parser

[6] https://layout-parser.github.io/platform/

[7] https://arxiv.org/pdf/2210.05391.pdf

[8] https://arxiv.org/pdf/1810.04805.pdf

[9] https://docs.llamaindex.ai/en/stable/examples/multi_modal/multi_modal_pdf_tables.html

(This article is reproduced from the WeChat official account @ArronAI.)
