
How to process documents with Docs2KG?

We can currently process:

  • PDF files
    • Scanned PDFs
    • Generated/Exported PDFs
  • Web pages
  • Excel files
  • Emails

You can verify this in the source code: all the related modules live under the parser folder.


PDF Files

With the help of PyMuPDF, we can easily process PDF files, especially exported ones.

For scanned PDF files, we extract the text and then associate the page image with each page node.
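
As a point of reference, the underlying PyMuPDF calls look roughly like the minimal sketch below. This uses PyMuPDF directly rather than the Docs2KG parser module, and the file name is a placeholder:

import fitz  # PyMuPDF

doc = fitz.open("example.pdf")  # placeholder path
for page_index, page in enumerate(doc):
    text = page.get_text()              # plain text of the page (often empty for pure scans)
    pix = page.get_pixmap(dpi=150)      # render the page as an image
    pix.save(f"page_{page_index}.png")  # this image can then be attached to the page node
doc.close()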

PDF File Metadata Summary

We provide the function get_metadata_for_files to collect the metadata for your PDF files.

This gives you a better overview of a large batch of PDF files.

For example, you will know:

  • format: PDF 1.3/1.4, etc.
  • author
  • title
  • producer: this helps you estimate whether the file is scanned or generated

We also count the tokens in each document, so you can estimate the cost of running different OpenAI models over it.
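
For a sense of how such an estimate can be made, here is a rough sketch using the tiktoken library directly. The price constant is a placeholder, not a real quote; check the current OpenAI pricing for your model:

import fitz  # PyMuPDF
import tiktoken

PRICE_PER_1K_TOKENS = 0.0005  # placeholder value, replace with the current price of your model

doc = fitz.open("example.pdf")  # placeholder path
full_text = "".join(page.get_text() for page in doc)
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
num_tokens = len(encoding.encode(full_text))
print(f"{num_tokens} tokens, roughly ${num_tokens / 1000 * PRICE_PER_1K_TOKENS:.4f}")

The script below collects the metadata for a whole folder of PDFs: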

import argparse
from pathlib import Path

from Docs2KG.parser.pdf.pdf2metadata import get_metadata_for_files
from Docs2KG.utils.constants import DATA_INPUT_DIR, DATA_OUTPUT_DIR
from Docs2KG.utils.get_logger import get_logger

logger = get_logger(__name__)

if __name__ == "__main__":
    """
    Loop a folder of pdf files and process them
    """

    args = argparse.ArgumentParser()
    args.add_argument(
        "--input_dir",
        type=str,
        help="Input directory of pdf files",
        default=DATA_INPUT_DIR,
    )
    args = args.parse_args()
    data_input_dir = Path(args.input_dir)

    all_files = list(data_input_dir.rglob("*.pdf"))
    if len(all_files) == 0:
        logger.info("No pdf files found in the input directory")
        raise Exception("No pdf files found in the input directory")

    all_metadata_df = get_metadata_for_files(all_files, log_summary=True)
    """
    Then you can save it to a file

    Example:
        all_metadata_df.to_csv(DATA_OUTPUT_DIR / "metadata.csv", index=False)

    Or you can use the metadata as an orchestrator,
    so files can be directed to different processing pipelines
    and modules based on their metadata.
    """
    all_metadata_df.to_csv(DATA_OUTPUT_DIR / "metadata.csv", index=False)
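
A sketch of that orchestrator idea, reading the saved CSV back with pandas. The "producer" and "file_path" column names and the "scan" keyword heuristic are assumptions for illustration; check the actual DataFrame returned by get_metadata_for_files:

import pandas as pd

from Docs2KG.utils.constants import DATA_OUTPUT_DIR

metadata_df = pd.read_csv(DATA_OUTPUT_DIR / "metadata.csv")

# one possible heuristic: a producer string that mentions a scanner suggests a scanned PDF
# NOTE: column names below are assumptions, adjust them to the real CSV
scanned_like = metadata_df[metadata_df["producer"].astype(str).str.contains("scan", case=False)]
exported_like = metadata_df.drop(scanned_like.index)

for pdf_path in scanned_like["file_path"]:
    ...  # route to the scanned-PDF pipeline below
for pdf_path in exported_like["file_path"]:
    ...  # route to the exported-PDF pipeline below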

Exported PDF Process

Each individual PDF file is processed, and the intermediate results are saved into the output folder.

from Docs2KG.parser.pdf.pdf2blocks import PDF2Blocks
from Docs2KG.parser.pdf.pdf2metadata import PDF_TYPE_SCANNED, get_scanned_or_exported
from Docs2KG.parser.pdf.pdf2tables import PDF2Tables
from Docs2KG.parser.pdf.pdf2text import PDF2Text
from Docs2KG.utils.constants import DATA_INPUT_DIR, DATA_OUTPUT_DIR
from Docs2KG.utils.get_logger import get_logger

logger = get_logger(__name__)

if __name__ == "__main__":
    """
    Here what we want to achieve is how to get the exported PDF into the triplets for the Neo4j.

    We will require some of the OpenAI functions, we are currently default the model to `gpt-3.5-turbo`

    The cost itself is quite low, but you can determine whether you want to use it or not.

    Overall steps include:

    1. Extract Text, Images, Tables from the PDF
        - Extract text **block** and image **blocks** from the PDF
            - This will provide bounding boxes for the text and images
        - Extract tables from the PDF
        - Extract the text and markdown from the PDF, each page will be one markdown file
            - Output will be in a csv
    """

    # you can point this to your own file
    pdf_file = DATA_INPUT_DIR / "historic information.pdf"

    # by default, the output goes into the `DATA_OUTPUT_DIR / "historic information.pdf" /` folder
    output_folder = DATA_OUTPUT_DIR / "historic information.pdf"
    scanned_or_exported = get_scanned_or_exported(pdf_file)
    if scanned_or_exported == PDF_TYPE_SCANNED:
        logger.info("This is a scanned pdf, we will handle it in another demo")
    else:
        # This will extract the text, images.
        #
        # Output images/text with bounding boxes into a df
        pdf_2_blocks = PDF2Blocks(pdf_file)
        blocks_dict = pdf_2_blocks.extract_df(output_csv=True)
        logger.info(blocks_dict)

        # This will extract the tables from the pdf file
        # Output also will be the csv of the summary and each individual table csv

        pdf2tables = PDF2Tables(pdf_file)
        pdf2tables.extract2tables(output_csv=True)

        # Processing the text from the pdf file
        # For each page, we will have a markdown and text content,
        # Output will be in a csv

        pdf_to_text = PDF2Text(pdf_file)
        text = pdf_to_text.extract2text(output_csv=True)
        md_text = pdf_to_text.extract2markdown(output_csv=True)

        # By now, your output folder should look something like this
        # .
        # ├── historic information.pdf
        # ├── images
        # │         ├── blocks_images.csv
        # │         ├── page_0_block_1.jpeg
        # │         ├── page_0_block_4.jpeg
        # │         ├── ....
        # ├── metadata.json
        # ├── tables
        # │         ├── page_16-table_1.csv
        # │         ├── ....
        # │         └── tables.csv
        # └── texts
        #     ├── blocks_texts.csv
        #     ├── md.csv
        #     └── text.csv
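
If you want to sanity-check the extracted blocks, the CSVs can be inspected with pandas. The exact column layout depends on the Docs2KG version, so treat this only as an inspection sketch:

import pandas as pd

from Docs2KG.utils.constants import DATA_OUTPUT_DIR

output_folder = DATA_OUTPUT_DIR / "historic information.pdf"
blocks_df = pd.read_csv(output_folder / "texts" / "blocks_texts.csv")
print(blocks_df.head())  # each row should carry a text block plus its bounding box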

Scanned PDF Process

from Docs2KG.modules.llm.markdown2json import LLMMarkdown2Json
from Docs2KG.parser.pdf.pdf2blocks import PDF2Blocks
from Docs2KG.parser.pdf.pdf2image import PDF2Image
from Docs2KG.parser.pdf.pdf2metadata import PDF_TYPE_SCANNED, get_scanned_or_exported
from Docs2KG.parser.pdf.pdf2text import PDF2Text
from Docs2KG.utils.constants import DATA_INPUT_DIR, DATA_OUTPUT_DIR
from Docs2KG.utils.get_logger import get_logger

logger = get_logger(__name__)

if __name__ == "__main__":
    """
    Here what we want to achieve is how to get the scanned PDF into the triplets for the Neo4j.

    We will require some of the OpenAI functions, we are currently default the model to `gpt-3.5-turbo`

    The cost itself is quite low, but you can determine whether you want to use it or not.

    Different from the exported PDF, current state of art still face some challenges in the scanned PDF
    Especially when trying to extract the figures and tables from the scanned PDF

    So we skip that step, only extract the text, and construct a layout knowledge graph based on the text
    Next is add the

    1. Extract Text, Images, Tables from the PDF
        - Extract text **block** and image **blocks** from the PDF
            - This will provide bounding boxes for the text and images
        - Extract tables from the PDF
        - Extract the text and markdown from the PDF, each page will be one markdown file
            - Output will be in a csv
    """

    # you can point this to your own file
    pdf_file = DATA_INPUT_DIR / "3.pdf"

    # by default, the output goes into the `DATA_OUTPUT_DIR / "3.pdf" /` folder
    output_folder = DATA_OUTPUT_DIR / "3.pdf"
    scanned_or_exported = get_scanned_or_exported(pdf_file)
    if scanned_or_exported == PDF_TYPE_SCANNED:
        logger.info("This is a scanned pdf, we will handle it in another demo")

        # This will extract the text, images.
        #
        # Output images/text with bounding boxes into a df
        pdf_2_blocks = PDF2Blocks(pdf_file)
        blocks_dict = pdf_2_blocks.extract_df(output_csv=True)
        logger.info(blocks_dict)

        # Processing the text from the pdf file
        # For each page, we will have a markdown and text content,
        # Output will be in a csv

        pdf_to_text = PDF2Text(pdf_file)
        text = pdf_to_text.extract2text(output_csv=True)
        md_text = pdf_to_text.extract2markdown(output_csv=True)

        # By now, your output folder should look something like this
        # .
        # ├── 3.pdf
        # ├── images
        # │         ├── blocks_images.csv
        # │         ├── page_0_block_1.jpeg
        # │         ├── page_0_block_4.jpeg
        # │         ├── ....
        # ├── metadata.json
        # └── texts
        #     ├── blocks_texts.csv
        #     ├── md.csv
        #     └── text.csv
        # The images in the images folder are not valid figure extractions yet; we need better models for that.
        input_md_file = output_folder / "texts" / "md.csv"

        markdown2json = LLMMarkdown2Json(
            input_md_file,
            llm_model_name="gpt-3.5-turbo",
        )

        markdown2json.clean_markdown()
        markdown2json.markdown_file = output_folder / "texts" / "md.cleaned.csv"

        markdown2json.extract2json()

        pdf_2_image = PDF2Image(pdf_file)
        pdf_2_image.extract_page_2_image_df()
        # after this, an `md.json.csv` file is added to the `texts` folder
    else:
        logger.info("This is an exported pdf, we will handle it in another demo")


Web Pages

Web pages are comparatively easy to process, since HTML is already organised as a tree.

All we need to do is download the HTML and the corresponding images; this happens directly during the KG construction stage.
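
Conceptually, that download step looks like the sketch below. It uses requests and BeautifulSoup directly rather than the Docs2KG web parser, and the URL and file names are placeholders:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/page.html"  # placeholder URL
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# save the raw HTML
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)

# download every referenced image
for i, img in enumerate(soup.find_all("img")):
    src = img.get("src")
    if not src:
        continue
    image_bytes = requests.get(urljoin(url, src), timeout=30).content
    with open(f"image_{i}.jpg", "wb") as f:  # naive naming; real code should keep the original extension
        f.write(image_bytes)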


Excel

For Excel files, simple ones can be directly transformed into CSV.

However, complex workbooks contain multiple sheets (similar to the page concept in PDFs), mixing free-text descriptions with tabular data.

So the key task is to separate the description part from the table part within each sheet.
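
One simple heuristic for that split, sketched with pandas (a rough illustration, not the Docs2KG implementation; the file name is a placeholder): rows where most cells are empty are treated as description text, the rest as the table.

import pandas as pd

sheet = pd.read_excel("example.xlsx", sheet_name=0, header=None)  # placeholder file

# treat rows where fewer than half of the cells are filled as description rows
filled_ratio = sheet.notna().mean(axis=1)
description_rows = sheet[filled_ratio < 0.5]
table_rows = sheet[filled_ratio >= 0.5]

description_text = " ".join(str(v) for v in description_rows.values.flatten() if pd.notna(v))
print(description_text)
print(table_rows.head())

The full Docs2KG Excel pipeline looks like this: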

from Docs2KG.modules.llm.sheet2metadata import Sheet2Metadata
from Docs2KG.parser.excel.excel2image import Excel2Image
from Docs2KG.parser.excel.excel2markdown import Excel2Markdown
from Docs2KG.parser.excel.excel2table import Excel2Table
from Docs2KG.utils.constants import DATA_INPUT_DIR

if __name__ == "__main__":
    """
    Plan of the attack:

    1. For each sheet, extract the description stuff, and tables will be kept still in csv
    2. Then create the kg mainly based on the description
    """
    excel_file = DATA_INPUT_DIR / "excel" / "GCP_10002.xlsx"
    # extract the table parts of each sheet into CSV files
    excel2table = Excel2Table(excel_file=excel_file)
    excel2table.extract_tables_from_excel()

    # render each sheet to an image and a PDF
    excel2image = Excel2Image(excel_file=excel_file)
    excel2image.excel2image_and_pdf()

    # convert each sheet to markdown
    excel2markdown = Excel2Markdown(excel_file=excel_file)
    excel2markdown.extract2markdown()

    # use an LLM to extract the description/metadata from the markdown
    sheet_2_metadata = Sheet2Metadata(
        excel2markdown.md_csv,
        llm_model_name="gpt-3.5-turbo",
    )
    sheet_2_metadata.extract_metadata()


Emails

We provide modules to connect to a specific email account and download emails based on a search keyword or a target number of emails.

Each email is further transformed into a web page, and its attachments are routed to the corresponding modules.

Download module

This downloads each email as a .eml file, which contains all the data you need, including the attachments.

from Docs2KG.parser.email.utils.email_connector import EmailConnector

if __name__ == "__main__":
    email_address = ""
    password = ""
    imap_server = "imap.gmail.com"
    port = 993
    email_connector = EmailConnector(
        email_address=email_address,
        password=password,
        search_keyword="test search",
        num_emails=50,
        imap_server=imap_server,
        imap_port=port,
    )
    email_connector.pull()
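
Note that for providers such as Gmail, IMAP access typically requires an app password rather than your regular account password.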

Process module

The email is decomposed into different formats, ready for further processing.
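
For a sense of what that decomposition involves, here is a minimal sketch using Python's standard email library (an illustration only, not the Docs2KG implementation; the file name is a placeholder):

from email import policy
from email.parser import BytesParser

with open("email.eml", "rb") as f:  # placeholder path
    msg = BytesParser(policy=policy.default).parse(f)

# the HTML (or plain-text) body can later be treated like a web page
body = msg.get_body(preferencelist=("html", "plain"))
if body is not None:
    print(body.get_content()[:200])

# attachments are written out so they can be routed to the PDF/Excel parsers
for attachment in msg.iter_attachments():
    filename = attachment.get_filename() or "unnamed_attachment"
    with open(filename, "wb") as f:
        f.write(attachment.get_payload(decode=True))

The Docs2KG process module handles this for you: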

from Docs2KG.parser.email.email_compose import EmailDecompose
from Docs2KG.utils.constants import DATA_INPUT_DIR

if __name__ == "__main__":
    email_filename = DATA_INPUT_DIR / "email.eml"
    email_decomposer = EmailDecompose(email_file=email_filename)
    email_decomposer.decompose_email()