Skip to content

How to load the unified multimodal knowledge graph into the GraphDB?

The process we designed is that after the KG Construction process, you will have two files within your output dir

  • kg/layout_kg.json
  • kg/triplets_kg.json

Triplets Generation

Example format of the layout_kg.json

{
  "node_type": "document",
  "uuid": "77d231ae-93db-4073-a592-10caccca86d2",
  "node_properties": {
    "format": "PDF 1.3",
    "title": "Microsoft Word - WR_C19_2000_2006A.doc",
    "author": "michelle",
    "subject": "",
    "keywords": "",
    "creator": "PScript5.dll Version 5.2",
    "producer": "GNU Ghostscript 7.06",
    "creationDate": "2/7/2006 16:39:28",
    "modDate": "",
    "trapped": "",
    "encryption": null,
    "text_token": 25959,
    "estimated_price_gpt35": 0.103836,
    "estimated_price_gpt4o": 1.03836,
    "estimated_price_4_turbo": 2.07672,
    "file_path": "/Users/pascal/PhD/Docs2KG/data/input/tests_pdf/4.pdf",
    "scanned_or_exported": "exported"
  },
  "children": [
    {
      "node_type": "page",
      "uuid": "854072eb-58cf-4b79-b02f-0830d586a71f",
      "node_properties": {
        "page_number": 0,
        "page_text": "\n**Midwest Corporation Limited**\n****\n**Weld Range Tenement Group**\n****\n**Annual Report**\n****\n**Combined Report Group C19/2000**\n****\n**TR70/3902, M20/402, M20/403 and E20/176**\n****\n**5 January 2005 \u2013 6 January 2006**\n****\n\n\n\n\n\n**Author:**  **Michael Brown** \n         B.Sc., Geol, Grad Dip (GIS)\n  **Graeme Johnston** \nB.Sc., (Geol), M.Sc., D.I.C., F.G.S\n**Date:**   6 th January 2006\n\n\n\n\n32 Kings Park Rd West Perth 6005\n\n\n\n-----\n\n"
      },
      "children": [
        {
          "node_type": "h1",
          "uuid": "7dbb9ae2-27e0-4e8e-8737-23f0cd345f20",
          "node_properties": {
            "content": "Midwest Corporation Limited",
            "text": "",
            "records": []
          },
          "children": [
            {
              "node_type": "h2",
              "uuid": "309f0aa5-992a-4583-9d49-2a879992c231",
              "node_properties": {
                "content": "Weld Range Tenement Group",
                "text": "",
                "records": []
              },
              "children": []
            }
          ]
        }
      ]
    }
  ]
}

The first one is the nested JSON version of the KG.

If one node is connected to another node with relationships further than the tree, then we will have a linkage

for example mentioned_in field within the node properties.

The target node uuid will be stored in the mentioned_in field.

To get this format be able to load into the GraphDB, we need to convert it into the proper format.

We use Neo4j as an example here.

We will convert the kg_layout.json into a json with two list

{
  "nodes": [
    {
      "uuid": "4d2ab6c5-bcd0-4373-9e5c-15abdad53e92",
      "labels": [
        "TEXT_BLOCK"
      ],
      "properties": {
        "text_block_bbox": "(171.1199493408203, 417.1202087402344, 446.0379943847656, 432.7322082519531)",
        "content": " Plate 4: Looking up to W16 Hematite\u2013Goethite Lens ",
        "position": "below",
        "text_block_number": 10
      }
    }
  ],
  "relationships": [
    {
      "start_node": "4d2ab6c5-bcd0-4373-9e5c-15abdad53e92",
      "end_node": "4d2ab6c5-bcd0-4373-9e5c-15abdad53e92",
      "type": "mentioned_in",
      "properties": {
        "page_number": 0
      }
    }
  ]
}

This will be done in the class json2triplets.py

Load the KG into GraphDB

Then the next thing is easy, we just need to have a class to load it into the Neo4j with cypher.

You can use cypher directly to load the json into the Neo4j. Or use the class we provided neo4j_connector.py

uri = "bolt://localhost:7687"  # if it is a remote graph db, you can change it to the remote uri
username = "neo4j"
password = "testpassword"
json_file_path = excel2markdown.output_dir / "kg" / "triplets_kg.json"

neo4j_loader = Neo4jLoader(uri, username, password, json_file_path, clean=True)
neo4j_loader.load_data()
neo4j_loader.close()

Start the neo4j

Before you start it, if you do not have a neo4j instance, you can start one with our docker-compose file

docker compose -f examples/compose/docker-compose.yml up