Project Overview

This project covers the construction, labeling, visualization, and database ingestion of a Knowledge Graph (KG) built from an initial CSV dataset (data1.csv). It applies Natural Language Processing (NLP) to enrich nodes with semantic labels and exports the graph into a Neo4j database for advanced querying. The overall process emphasizes data cleaning, metadata generation, machine learning evaluation, and knowledge representation.

Project Structure

  File                                                 Purpose
  data1.csv                                            Cleaned CSV data containing triples (head, relationship, tail).
  results/generated_labels.csv                         CSV containing auto-generated labels (categories) for graph nodes.
  results/output_images/histogram_relationships.png    Histogram of relationship-type frequencies.
  results/output_images/knowledge_graph.png            Knowledge graph plot visualizing entities and their relationships.
  .env                                                 Environment file storing the Neo4j username/password securely.
  README.txt                                           This file, describing the full project.

Methodology

1. Load and Clean the Data
   File: data1.csv
   Process:
   - Add an incremental ID column for easier tracking.
   - Ensure the columns are head, relationship, and tail, all cast to the category dtype for efficiency.

2. Create the Knowledge Graph (NetworkX)
   Library used: NetworkX
   Process (see the example sketch at the end of this file):
   - Nodes and directed edges are created from the triples (head → tail, labeled by relationship).
   - If results/generated_labels.csv is available, it is used to add semantic metadata to each node.

3. Explore the Graph
   Outputs:
   - Basic statistics: number of nodes and edges.
   - A list of nodes missing semantic labels.
   - A histogram of relationship-type frequencies (histogram_relationships.png).

4. Visualize the Knowledge Graph
   Methodology:
   - The spring layout algorithm (nx.spring_layout) is used for plotting.
   - Edges are colored by relationship type.
   - The figure is saved as knowledge_graph.png.

5. Triples Management (PyKEEN TriplesFactory)
   Process:
   - Build a TriplesFactory object from the triples.
   - Perform a random 80/10/10 split into training, validation, and test sets.

6. Label Generation (NLP Categorization)
   Process:
   - Analyze node names.
   - Classify each node as institution, country, artefact, person, or unclassified_entity (maintaining the open-world assumption).

7. Train and Evaluate KGE Models
   Models trained: TransE, ComplEx, ConvE, DistMult.
   Process (see the example sketch at the end of this file):
   - Each model is trained for 1000 epochs.
   - Evaluation uses a rank-based evaluator to measure performance on the test split.
   - The evaluation results for each model are kept in memory as Python dictionaries containing:
     - Mean Reciprocal Rank (MRR)
     - Hits@1, Hits@3, Hits@10
     - Mean Rank (MR)
     - Additional link-prediction metrics

8. Push the Graph into Neo4j
   Process (see the example sketch at the end of this file):
   - Connect using the Neo4j Bolt driver.
   - For each triple, create the head and tail nodes (if missing) and the relationship between them.
   - Upload the semantic labels as actual Neo4j node labels.

9. UMAP Projection and Similarity Matrix
   Process (see the example sketch at the end of this file):
   - Extract the entity embeddings.
   - Use UMAP for 2D visualization and cosine similarity for semantic comparisons.
   - Retrieve the top-N most similar entities for a selected entity.
   - Generate a UMAP scatter plot.

10. Security and Reproducibility
   - Credentials (NEO4J_USERNAME, NEO4J_PASSWORD) are stored securely in the .env file.
   - Random seeds are set for random, numpy, and torch to ensure reproducible splits and results.
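
Example Snippets (illustrative)

The snippets below are minimal sketches of the main steps, not the exact project code: file names such as data1.csv and the output paths match the layout above, but variable names, hyperparameters, and helper details are assumptions.

Building the graph (steps 1-3), assuming data1.csv has head, relationship, and tail columns:

    import pandas as pd
    import networkx as nx

    # Step 1: load the cleaned triples, add an incremental ID, cast columns to category dtype
    df = pd.read_csv("data1.csv")
    df.insert(0, "id", range(len(df)))
    df[["head", "relationship", "tail"]] = df[["head", "relationship", "tail"]].astype("category")

    # Step 2: one directed edge per triple, labeled by its relationship
    G = nx.MultiDiGraph()
    for _, row in df.iterrows():
        G.add_edge(row["head"], row["tail"], relationship=row["relationship"])

    # Step 3: basic statistics
    print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")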
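
Triples split and KGE training with PyKEEN (steps 5 and 7). The sketch trains a single model (TransE); the project repeats this for ComplEx, ConvE, and DistMult. Metric key names may vary slightly across PyKEEN versions:

    import pandas as pd
    from pykeen.triples import TriplesFactory
    from pykeen.pipeline import pipeline

    df = pd.read_csv("data1.csv")

    # Step 5: build a TriplesFactory and perform the 80/10/10 random split
    tf = TriplesFactory.from_labeled_triples(
        df[["head", "relationship", "tail"]].astype(str).values
    )
    training, validation, testing = tf.split([0.8, 0.1, 0.1], random_state=42)

    # Step 7: train one model for 1000 epochs and evaluate it with the rank-based evaluator
    result = pipeline(
        training=training,
        validation=validation,
        testing=testing,
        model="TransE",          # repeat with "ComplEx", "ConvE", "DistMult"
        training_kwargs=dict(num_epochs=1000),
        random_seed=42,
    )

    # Collect the rank-based metrics into a plain dictionary
    metrics = {
        "MRR": result.metric_results.get_metric("mean_reciprocal_rank"),
        "Hits@1": result.metric_results.get_metric("hits@1"),
        "Hits@10": result.metric_results.get_metric("hits@10"),
        "MR": result.metric_results.get_metric("mean_rank"),
    }
    print(metrics)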
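
Pushing the triples into Neo4j over Bolt, with credentials read from .env (steps 8 and 10). Cypher cannot parameterize relationship types, so this sketch stores the relationship name as a property on a generic RELATED edge; the Entity label and the bolt URI are assumptions:

    import os
    import pandas as pd
    from dotenv import load_dotenv
    from neo4j import GraphDatabase

    load_dotenv()  # step 10: NEO4J_USERNAME / NEO4J_PASSWORD come from .env

    driver = GraphDatabase.driver(
        "bolt://localhost:7687",  # assumed URI; adjust to your instance
        auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
    )

    df = pd.read_csv("data1.csv")
    with driver.session() as session:
        for _, row in df.iterrows():
            # Step 8: create both nodes if missing, then the relationship between them
            session.run(
                "MERGE (h:Entity {name: $head}) "
                "MERGE (t:Entity {name: $tail}) "
                "MERGE (h)-[:RELATED {type: $rel}]->(t)",
                head=row["head"], tail=row["tail"], rel=row["relationship"],
            )
    driver.close()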
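
UMAP projection and cosine-similarity lookup over the learned entity embeddings (step 9). This reuses the trained PyKEEN result object from the previous sketch; reading entity_representations[0] is the usual PyKEEN pattern for real-valued models such as TransE, and the selected entity name and output path are placeholders:

    import numpy as np
    import umap
    import matplotlib.pyplot as plt
    from sklearn.metrics.pairwise import cosine_similarity

    # `result` is the trained PipelineResult from the previous sketch
    entity_to_id = result.training.entity_to_id
    id_to_entity = {i: e for e, i in entity_to_id.items()}

    # Extract the entity embeddings as a (num_entities, dim) numpy array
    emb = result.model.entity_representations[0](indices=None).detach().cpu().numpy()

    # 2D UMAP projection for the scatter plot
    proj = umap.UMAP(n_components=2, random_state=42).fit_transform(emb)
    plt.scatter(proj[:, 0], proj[:, 1], s=5)
    plt.savefig("results/output_images/umap_projection.png")  # assumed output path

    # Cosine-similarity matrix and top-N most similar entities for a selected entity
    sim = cosine_similarity(emb)
    idx = entity_to_id["SOME_ENTITY"]       # placeholder: replace with a real node name
    top_n = np.argsort(-sim[idx])[1:6]      # top 5, skipping the entity itself
    print([id_to_entity[i] for i in top_n])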