Pdf2text
PDF2Text
Bases: PDFParserBase
Source code in Docs2KG/parser/pdf/pdf2text.py
__init__(*args, **kwargs)
Initialize the PDF2Text class
Source code in Docs2KG/parser/pdf/pdf2text.py
extract2markdown(output_csv=False)
Convert the extracted text to markdown
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_csv | bool | Whether to write the extracted data to a CSV file. | False |
Returns:
Name | Type | Description |
---|---|---|
md | str | The Markdown text |
output_file | Path | Where the Markdown text is saved |
df | DataFrame | The Markdown text for each page |
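To make the documented return values concrete, here is a minimal sketch of how per-page Markdown could be combined into the three values listed above. It is not the Docs2KG implementation: `pages_to_markdown`, the `md.md` file name, the `page_number`/`text` field names, and the tuple return shape are all assumptions for illustration, and a plain list of dicts stands in for the DataFrame so the example needs only the standard library.

```python
from pathlib import Path
import tempfile


def pages_to_markdown(pages, output_dir):
    """Join per-page Markdown strings and save them as one file.

    Hypothetical helper for illustration; not part of Docs2KG.
    """
    # One record per page, mirroring the documented per-page `df`.
    rows = [{"page_number": i, "text": text} for i, text in enumerate(pages)]
    # Separate pages with a blank line so Markdown blocks do not merge.
    md = "\n\n".join(row["text"] for row in rows)
    # Assumed file name; the real output location is not stated here.
    output_file = Path(output_dir) / "md.md"
    output_file.write_text(md, encoding="utf-8")
    return md, output_file, rows


with tempfile.TemporaryDirectory() as tmp:
    md, output_file, rows = pages_to_markdown(["# Title", "Body text"], tmp)
    print(output_file.name, len(rows))
```

Keeping the per-page records alongside the joined string lets a caller work with either the whole document or a single page without parsing the Markdown again.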
Source code in Docs2KG/parser/pdf/pdf2text.py
extract2text(output_csv=False)
Extract text from the pdf file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_csv | bool | Whether to write the extracted data to a CSV file. | False |
Returns:
Name | Type | Description |
---|---|---|
text | str | The extracted text |
output_file | Path | The path to the output file |
df | DataFrame | The dataframe containing the text information |
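The effect of the `output_csv` flag can be sketched with the standard library alone. This is an illustration of the documented contract, not the library's code: `pages_to_text`, the `text.csv` file name, the column names, the tuple return shape, and returning `None` for `output_file` when no CSV is written are all assumptions.

```python
import csv
from pathlib import Path
import tempfile


def pages_to_text(pages, output_dir, output_csv=False):
    """Collect per-page text and optionally write it to a CSV file.

    Hypothetical helper for illustration; not part of Docs2KG.
    """
    # A list of dicts stands in for the documented DataFrame.
    rows = [{"page_number": i, "text": text} for i, text in enumerate(pages)]
    text = "\n".join(row["text"] for row in rows)

    output_file = None  # assumption: no path is reported when nothing is written
    if output_csv:
        output_file = Path(output_dir) / "text.csv"
        # newline="" lets the csv module control line endings itself.
        with output_file.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["page_number", "text"])
            writer.writeheader()
            writer.writerows(rows)
    return text, output_file, rows


with tempfile.TemporaryDirectory() as tmp:
    text, output_file, rows = pages_to_text(["first page", "second page"], tmp, output_csv=True)
    print(output_file.name, len(rows))
```

With `output_csv=False` the sketch returns the same text and rows and skips the file write, which is how the default is read here.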
Source code in Docs2KG/parser/pdf/pdf2text.py