Project Overview

This project covers the construction, labeling, visualization, and database ingestion of a Knowledge Graph (KG) built from an initial CSV dataset (data1.csv). It applies Natural Language Processing (NLP) to enrich nodes with semantic labels and exports the graph into a Neo4j database for advanced querying. The overall process emphasizes data cleaning, metadata generation, machine learning evaluation, and knowledge representation.

Project Structure

  File                                                 Purpose
  data1.csv                                            Cleaned CSV data containing triples (head, relationship, tail).
  results/generated_labels.csv                         CSV containing auto-generated labels (categories) for graph nodes.
  results/output_images/histogram_relationships.png    Histogram of relationship-type frequencies.
  results/output_images/knowledge_graph.png            Knowledge graph plot visualizing entities and their relationships.
  .env                                                 Environment file storing the Neo4j username/password securely.
  README.txt                                           This file, describing the full project.

Methodology

1. Load and Clean the Data
   File: data1.csv
   Process:
   - Add an incremental ID column for easier tracking.
   - Ensure the columns are head, relationship, and tail, all cast to the category dtype for efficiency.

2. Create the Knowledge Graph (NetworkX)
   Library used: NetworkX
   Process (see the example sketch at the end of this file):
   - Nodes and directed edges are created from the triples (head → tail, labeled by relationship).
   - If results/generated_labels.csv is available, it is used to add semantic metadata to each node.

3. Explore the Graph
   Outputs:
   - Basic statistics: number of nodes and edges.
   - A list of nodes missing semantic labels.
   - A histogram of relationship-type frequencies (histogram_relationships.png).

4. Visualize the Knowledge Graph
   Methodology:
   - The spring layout algorithm (nx.spring_layout) is used for plotting.
   - Edges are colored by relationship type.
   - The figure is saved as knowledge_graph.png.

5. Triples Management (PyKEEN TriplesFactory)
   Process:
   - Build a TriplesFactory object from the triples.
   - Perform a random 80/10/10 split into training, validation, and test sets.

6. Label Generation (NLP Categorization)
   Process:
   - Analyze node names.
   - Classify each node as institution, country, artefact, person, or unclassified_entity (maintaining the open-world assumption).

7. Train and Evaluate KGE Models
   Models trained: TransE, ComplEx, ConvE, DistMult.
   Process (see the example sketch at the end of this file):
   - Each model is trained for 1000 epochs.
   - Evaluation uses a rank-based evaluator to measure performance on the test split.
   - The evaluation results for each model are kept in memory as Python dictionaries containing:
     - Mean Reciprocal Rank (MRR)
     - Hits@1, Hits@3, Hits@10
     - Mean Rank (MR)
     - Additional link-prediction metrics

8. Push the Graph into Neo4j
   Process (see the example sketch at the end of this file):
   - Connect using the Neo4j Bolt driver.
   - For each triple, create the head and tail nodes (if missing) and the relationship between them.
   - Upload the semantic labels as actual Neo4j node labels.

9. UMAP Projection and Similarity Matrix
   Process (see the example sketch at the end of this file):
   - Extract the entity embeddings.
   - Use UMAP for 2D visualization and cosine similarity for semantic comparisons.
   - Retrieve the top-N most similar entities for a selected entity.
   - Generate a UMAP scatter plot.

10. Security and Reproducibility
   - Credentials (NEO4J_USERNAME, NEO4J_PASSWORD) are stored securely in the .env file.
   - Random seeds are set for random, numpy, and torch to ensure reproducible splits and results.
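
Example Snippets (illustrative)

The snippets below are minimal sketches of the main steps, not the exact project code: file names such as data1.csv and the output paths match the layout above, but variable names, hyperparameters, and helper details are assumptions.

Building the graph (steps 1-3), assuming data1.csv has head, relationship, and tail columns:

    import pandas as pd
    import networkx as nx

    # Step 1: load the cleaned triples, add an incremental ID, cast columns to category dtype
    df = pd.read_csv("data1.csv")
    df.insert(0, "id", range(len(df)))
    df[["head", "relationship", "tail"]] = df[["head", "relationship", "tail"]].astype("category")

    # Step 2: one directed edge per triple, labeled by its relationship
    G = nx.MultiDiGraph()
    for _, row in df.iterrows():
        G.add_edge(row["head"], row["tail"], relationship=row["relationship"])

    # Step 3: basic statistics
    print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")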
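
Triples split and KGE training with PyKEEN (steps 5 and 7). The sketch trains a single model (TransE); the project repeats this for ComplEx, ConvE, and DistMult. Metric key names may vary slightly across PyKEEN versions:

    import pandas as pd
    from pykeen.triples import TriplesFactory
    from pykeen.pipeline import pipeline

    df = pd.read_csv("data1.csv")

    # Step 5: build a TriplesFactory and perform the 80/10/10 random split
    tf = TriplesFactory.from_labeled_triples(
        df[["head", "relationship", "tail"]].astype(str).values
    )
    training, validation, testing = tf.split([0.8, 0.1, 0.1], random_state=42)

    # Step 7: train one model for 1000 epochs and evaluate it with the rank-based evaluator
    result = pipeline(
        training=training,
        validation=validation,
        testing=testing,
        model="TransE",          # repeat with "ComplEx", "ConvE", "DistMult"
        training_kwargs=dict(num_epochs=1000),
        random_seed=42,
    )

    # Collect the rank-based metrics into a plain dictionary
    metrics = {
        "MRR": result.metric_results.get_metric("mean_reciprocal_rank"),
        "Hits@1": result.metric_results.get_metric("hits@1"),
        "Hits@10": result.metric_results.get_metric("hits@10"),
        "MR": result.metric_results.get_metric("mean_rank"),
    }
    print(metrics)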
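
Pushing the triples into Neo4j over Bolt, with credentials read from .env (steps 8 and 10). Cypher cannot parameterize relationship types, so this sketch stores the relationship name as a property on a generic RELATED edge; the Entity label and the bolt URI are assumptions:

    import os
    import pandas as pd
    from dotenv import load_dotenv
    from neo4j import GraphDatabase

    load_dotenv()  # step 10: NEO4J_USERNAME / NEO4J_PASSWORD come from .env

    driver = GraphDatabase.driver(
        "bolt://localhost:7687",  # assumed URI; adjust to your instance
        auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
    )

    df = pd.read_csv("data1.csv")
    with driver.session() as session:
        for _, row in df.iterrows():
            # Step 8: create both nodes if missing, then the relationship between them
            session.run(
                "MERGE (h:Entity {name: $head}) "
                "MERGE (t:Entity {name: $tail}) "
                "MERGE (h)-[:RELATED {type: $rel}]->(t)",
                head=row["head"], tail=row["tail"], rel=row["relationship"],
            )
    driver.close()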
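
UMAP projection and cosine-similarity lookup over the learned entity embeddings (step 9). This reuses the trained PyKEEN result object from the previous sketch; reading entity_representations[0] is the usual PyKEEN pattern for real-valued models such as TransE, and the selected entity name and output path are placeholders:

    import numpy as np
    import umap
    import matplotlib.pyplot as plt
    from sklearn.metrics.pairwise import cosine_similarity

    # `result` is the trained PipelineResult from the previous sketch
    entity_to_id = result.training.entity_to_id
    id_to_entity = {i: e for e, i in entity_to_id.items()}

    # Extract the entity embeddings as a (num_entities, dim) numpy array
    emb = result.model.entity_representations[0](indices=None).detach().cpu().numpy()

    # 2D UMAP projection for the scatter plot
    proj = umap.UMAP(n_components=2, random_state=42).fit_transform(emb)
    plt.scatter(proj[:, 0], proj[:, 1], s=5)
    plt.savefig("results/output_images/umap_projection.png")  # assumed output path

    # Cosine-similarity matrix and top-N most similar entities for a selected entity
    sim = cosine_similarity(emb)
    idx = entity_to_id["SOME_ENTITY"]       # placeholder: replace with a real node name
    top_n = np.argsort(-sim[idx])[1:6]      # top 5, skipping the entity itself
    print([id_to_entity[i] for i in top_n])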