← Back to Projects
Real Estate Prediction in NYC

Real Estate Prediction in NYC

Projects

Real Estate Prediction in NYC

Project Overview

This project leverages machine learning to predict property values in New York City by analyzing historical data from NYC Open Data. The implementation tackles common issues in real estate prediction, such as:

  • Incomplete datasets with missing values
  • Variances in data collected over different time periods
  • The need for robust cross-validation and hyperparameter tuning for accurate predictions
  • Visualization of spatial property value predictions over NYC

By integrating rigorous data preprocessing, model comparisons, and advanced visualization techniques, the project provides stakeholders with an informative tool to analyze market trends and make informed decisions in the NYC real estate market.

Key Features

  • Data Preprocessing: Aggregates multiple datasets covering different time ranges, performs data cleaning including imputation of missing values, outlier removal, and feature scaling.
  • Machine Learning Pipeline: Compares several regression models (Linear Regression, Random Forest Regressor, and XGBRegressor) and implements GridSearchCV for hyperparameter optimization.
  • Visualization & Analysis: Generates scatter plots comparing actual vs. predicted property values and creates geographic heat maps to illustrate spatial trends in property valuations.

Project Implementation & Results

Libraries and Tools

The project utilizes a variety of Python libraries for data processing, analysis, and machine learning:

Libraries and Tools

Libraries and Tools

Data Collection and Preprocessing

Two primary datasets were used in this project:

  1. Dataset 1 (2010-2019): Contains historical property valuation data

    2010-2019 Dataset Categories

    2010-2019 Dataset Categories

  2. Dataset 2 (2021-2023): Contains more recent property valuation data

    2021-2023 Dataset Categories

    2021-2023 Dataset Categories

After analyzing both datasets, the most useful categories were identified:

Useful Categories from Dataset 1:

Useful Categories Dataset 1

Useful Categories Dataset 1

Useful Categories from Dataset 2:

Useful Categories Dataset 2

Useful Categories Dataset 2

Common Categories Between Datasets:

Common Categories

Common Categories

After merging and cleaning the datasets:

Merged and Cleaned Dataset

Merged and Cleaned Dataset

Data Analysis by Borough

The data was split by borough to account for different growth characteristics across New York City. This analysis revealed significant differences in data distribution:

Datapoints per Borough

Datapoints per Borough

During data cleaning, some datapoints were removed due to missing values or outliers:

Datapoints Removed in Cleaning by Borough

Datapoints Removed in Cleaning by Borough

The properties were visualized on maps to better understand their spatial distribution:

Manhattan Properties:

Properties in Manhattan Borough

Properties in Manhattan Borough

Brooklyn Properties:

Properties in Brooklyn Borough

Properties in Brooklyn Borough

Data Distribution Analysis

To understand the price distribution, specific blocks were analyzed:

Price Distribution of Buildings in Block 16 for 2017

Price Distribution of Buildings in Block 16 for 2017

Before Cleaning:

Price Distribution Before Cleaning

Price Distribution Before Cleaning

After Cleaning:

Price Distribution After Cleaning

Price Distribution After Cleaning

Model Optimization

GridSearchCV was used to find optimal hyperparameters for the models:

GridSearch Results for n_estimators:

GridSearch CV Results for model_n_estimators

GridSearch CV Results for model_n_estimators

GridSearch Results for learning_rate:

GridSearch CV Results for model_learning_rate

GridSearch CV Results for model_learning_rate

Best Parameters for XGBClassifier:

Best Parameters for XGBClassifier

Best Parameters for XGBClassifier

Best Parameters for XGBRegressor:

Best Parameters for XGBRegressor

Best Parameters for XGBRegressor

Model Evaluation

The models were evaluated using various metrics to assess their performance:

Actual vs Predicted Values:

Actual vs Predicted Values

Actual vs Predicted Values

Model Metrics:

Metrics

Metrics

Model Evaluation Metrics:

Model Evaluation Metrics

Model Evaluation Metrics

Classification Report:

Classification Report

Classification Report

Conclusion

The Real Estate Prediction in NYC project successfully demonstrates the application of machine learning techniques to predict property values in New York City. After comparing several regression models, the XGBRegressor proved to be the most effective, achieving the lowest Mean Squared Error and highest R-squared value.

The heatmap visualization of predicted house values for 2025 provides stakeholders with valuable insights into future real estate trends across different boroughs:

Heatmap of Predicted House Value for 2025 NYC All Borough

Heatmap of Predicted House Value for 2025 NYC All Borough

This project demonstrates the importance of thorough data preprocessing, appropriate model selection, and rigorous evaluation in creating reliable property value predictions, which can be valuable for real estate investors, urban planners, and policymakers.