Lego Price Regression
Description
Context and methodology
Research Domain:
This dataset was created as part of a machine-learning experiment in the domain of collectible toy valuation, specifically modelling the secondary (resale) price of LEGO® sets. It sits at the intersection of data science, retail analytics and cultural heritage (collector markets).
Purpose:
The goal is to predict the resale price of a LEGO set, rounded to an integer, from a handful of readily available attributes (theme, subtheme, production year, piece count and original MSRP). Framing this as a regression problem lets us build and evaluate models that help collectors, resellers or analytics platforms estimate fair market values.
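Because the target is an integer price, a regressor's continuous output has to be rounded before it is compared with the true values. The sketch below illustrates one way to do that. The helper names, the clamp at zero, the choice of mean absolute error as the metric and all numbers are assumptions made for illustration and are not taken from the dataset or its notebook.

```python
from typing import List, Sequence


def round_prices(predictions: Sequence[float]) -> List[int]:
    """Round continuous model outputs to non-negative integer prices.

    Note: Python's round() rounds halves to the nearest even integer.
    Clamping at zero is an assumption (a resale price cannot be negative).
    """
    return [max(0, round(p)) for p in predictions]


def mean_absolute_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average absolute difference between true and predicted prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, purely illustrative (not rows from the dataset).
raw_predictions = [49.6, 120.2, 15.5, -3.0]
actual_prices = [50, 118, 16, 5]

rounded = round_prices(raw_predictions)
mae = mean_absolute_error(actual_prices, rounded)
print(rounded, mae)
```

Evaluating on the rounded predictions rather than the raw ones keeps the reported error consistent with what a user of the model would actually see.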
Creation of the dataset:
The raw data, already split into training, test and validation sets, was fetched via the DBRepo3 API. The categorical columns (theme, theme_group, subtheme) were label-encoded.
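Label encoding replaces each distinct category with an integer code. The following is a minimal sketch of the idea in plain Python, not the code used to build this dataset: the function names and the example theme values are hypothetical, and the source does not state which split the encoding was fitted on. Codes are assigned in sorted order of the labels, which is also how scikit-learn's LabelEncoder orders its classes.

```python
from typing import Dict, Iterable, List


def fit_label_encoding(values: Iterable[str]) -> Dict[str, int]:
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values: Iterable[str], mapping: Dict[str, int]) -> List[int]:
    """Apply a fitted mapping; an unseen label raises KeyError."""
    return [mapping[v] for v in values]


# Hypothetical theme names, for illustration only.
themes = ["Star Wars", "City", "Technic", "City"]

theme_mapping = fit_label_encoding(themes)
theme_codes = encode(themes, theme_mapping)
print(theme_mapping, theme_codes)
```

The integer codes carry no meaningful order, so a code of 2 is not "larger" than a code of 0 in any sense that matters for pricing. Tree-based models are generally insensitive to this, while linear models may treat the codes as ordered quantities.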
Technical details
The dataset contains the columns theme, theme_group, subtheme, age, pieces, msrp_int, price_int and id. No special folder hierarchy or additional naming conventions are used.
Working with the dataset requires only a standard Python environment (version 3.8 or higher) with pandas and NumPy for data manipulation, scikit-learn for preprocessing and modeling, matplotlib for plotting, and the DBRepo3 REST client to fetch and store the splits.
Supplementary materials include a Jupyter notebook on GitHub that demonstrates all steps: modeling, evaluation, and artifact serialization. The DBRepo3 persistent identifiers of the three splits are “2401ab5e-693b-4235-b14e-0f3eb53ec773” for training, “723c26fe-f89d-475c-a69e-83336b32c7bf” for test, and “1e5040fc-8d05-4480-8176-975ad338f4d3” for validation.
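Once a split has been exported to CSV, the columns listed above can be checked and separated into features and target. The sketch below uses only the standard library and assumes that price_int is the prediction target and that id is a row identifier excluded from the features. Neither role is stated explicitly in this description, and the two sample rows are invented.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

EXPECTED_COLUMNS = ["theme", "theme_group", "subtheme", "age",
                    "pieces", "msrp_int", "price_int", "id"]
TARGET_COLUMN = "price_int"  # assumed target
FEATURE_COLUMNS = [c for c in EXPECTED_COLUMNS
                   if c not in (TARGET_COLUMN, "id")]  # id assumed non-predictive


def split_features_target(
    rows: Iterable[Dict[str, str]]
) -> Tuple[List[List[float]], List[int]]:
    """Validate column names and return (feature rows, integer targets)."""
    features: List[List[float]] = []
    targets: List[int] = []
    for row in rows:
        missing = set(EXPECTED_COLUMNS) - set(row)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        features.append([float(row[c]) for c in FEATURE_COLUMNS])
        targets.append(int(row[TARGET_COLUMN]))
    return features, targets


# Invented sample rows standing in for an exported split.
sample_csv = """theme,theme_group,subtheme,age,pieces,msrp_int,price_int,id
3,1,12,5,450,40,65,1001
7,2,4,2,1200,100,130,1002
"""

X, y = split_features_target(csv.DictReader(io.StringIO(sample_csv)))
print(FEATURE_COLUMNS, X, y)
```

The same check can be applied to each of the three splits so that a missing or renamed column is caught before model training.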