Data Cleaning

Raw Data Overview

Code
glimpse(miami_raw)
Rows: 13,932
Columns: 17
$ LATITUDE          <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.89182, 25…
$ LONGITUDE         <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -80.1546…
$ PARCELNO          <dbl> 622280070620, 622280100460, 622280100470, 6222801005…
$ SALE_PRC          <dbl> 440000, 349000, 800000, 988000, 755000, 630000, 1020…
$ LND_SQFOOT        <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 10272, …
$ TOT_LVG_AREA      <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663, 1493…
$ SPEC_FEAT_VAL     <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933, 11668…
$ RAIL_DIST         <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4, 3277…
$ OCEAN_DIST        <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 13017.0…
$ WATER_DIST        <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 10.5, 5…
$ CNTR_DIST         <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 43135.1…
$ SUBCNTR_DI        <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 38176.2…
$ HWY_DIST          <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 15687.2…
$ age               <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51, 56, …
$ avno60plus        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ month_sold        <dbl> 8, 9, 2, 9, 7, 2, 2, 9, 3, 11, 2, 11, 7, 7, 9, 11, 6…
$ structure_quality <dbl> 4, 4, 4, 4, 4, 4, 5, 4, 4, 5, 4, 2, 2, 2, 5, 2, 2, 4…

The raw dataset contains 13,932 rows and 17 columns, all stored as doubles. Each row represents a single residential property sold in Miami in 2016. The variables cover location coordinates, distances to key geographic features, structural characteristics, and the outcome, sale price.

Column Name Standardization

Code
library(janitor)

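miami <- miami_raw %>%
  clean_names() %>%                      # convert names to lower-case snake_case
  rename(plane_noise = avno60plus) %>%   # give the noise indicator a readable name
  select(-parcelno)                      # drop the parcel identifier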

names(miami)
 [1] "latitude"          "longitude"         "sale_prc"         
 [4] "lnd_sqfoot"        "tot_lvg_area"      "spec_feat_val"    
 [7] "rail_dist"         "ocean_dist"        "water_dist"       
[10] "cntr_dist"         "subcntr_di"        "hwy_dist"         
[13] "age"               "plane_noise"       "month_sold"       
[16] "structure_quality"

parcelno is a unique property identifier with no predictive value and is dropped. avno60plus is renamed to plane_noise for readability.
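
Before dropping an identifier column, it can be worth confirming that it really is unique per row. The following is a minimal base-R sketch on a made-up three-row data frame; the name `toy_raw` and its values are illustrative assumptions, not part of the Miami data or the pipeline above.

```r
# Toy stand-in for the raw data; values are invented for illustration
toy_raw <- data.frame(
  PARCELNO = c(101, 102, 103),
  SALE_PRC = c(440000, 349000, 800000)
)

# anyDuplicated() returns 0 when every value is distinct
id_is_unique <- anyDuplicated(toy_raw$PARCELNO) == 0

# For all-caps names with underscores like these, lower-casing gives the
# same names that clean_names() produced in the output above
names(toy_raw) <- tolower(names(toy_raw))

# Drop the identifier, keeping the result a data frame
toy <- toy_raw[, setdiff(names(toy_raw), "parcelno"), drop = FALSE]
```

If the check came back FALSE on real data, repeated parcel numbers would point to properties that appear more than once, which is worth knowing before the identifier is discarded.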

Type Corrections

Code
miami <- miami %>%
  mutate(
    plane_noise       = factor(plane_noise),
    month_sold        = factor(month_sold),
    structure_quality = factor(structure_quality)
  )

miami %>%
  summarise(across(everything(), class)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "type")
# A tibble: 16 × 2
   variable          type   
   <chr>             <chr>  
 1 latitude          numeric
 2 longitude         numeric
 3 sale_prc          numeric
 4 lnd_sqfoot        numeric
 5 tot_lvg_area      numeric
 6 spec_feat_val     numeric
 7 rail_dist         numeric
 8 ocean_dist        numeric
 9 water_dist        numeric
10 cntr_dist         numeric
11 subcntr_di        numeric
12 hwy_dist          numeric
13 age               numeric
14 plane_noise       factor 
15 month_sold        factor 
16 structure_quality factor 

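Three variables are stored as numeric (double) values in the raw data but are categorical in nature: plane_noise is a binary indicator, month_sold is a calendar month, and structure_quality is a quality code. Converting them to factors ensures they are dummy-encoded during preprocessing rather than treated as continuous.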
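
To see why the conversion matters, the sketch below compares the design matrix built from a numeric code with the one built from a factor. It uses base R's `model.matrix()` as a stand-in for whatever preprocessing step does the dummy encoding later; the data frame `toy` and its values are invented for illustration.

```r
# Toy data: quality codes are invented for illustration
toy <- data.frame(structure_quality = c(1, 2, 4, 5, 4, 2))

# As a number: a single slope column, which implies the codes are
# equally spaced and have a linear effect
x_num <- model.matrix(~ structure_quality, data = toy)

# As a factor: one indicator column per level beyond the reference level
toy$structure_quality <- factor(toy$structure_quality)
x_fct <- model.matrix(~ structure_quality, data = toy)

colnames(x_num)
colnames(x_fct)
```

The numeric version yields an intercept plus one column, while the factor version yields an intercept plus one indicator for each non-reference level.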

Missingness Check

Code
miss_var_summary(miami) %>%
  filter(n_miss > 0)
# A tibble: 0 × 3
# ℹ 3 variables: variable <chr>, n_miss <int>, pct_miss <dbl>
Code
cat("Total missing values:", sum(is.na(miami)))
Total missing values: 0

No missing values are present in any variable, so no imputation is needed and no rows are dropped. The full dataset of 13,932 properties is preserved for modeling.
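
The same check can be sketched in base R without any extra packages. The example below runs on a toy data frame with one deliberately missing value so that the output is not empty; `toy` and its values are assumptions made for illustration.

```r
# Toy data with one deliberately missing value (invented for illustration)
toy <- data.frame(
  age = c(67, NA, 61),
  month_sold = factor(c(8, 9, 2))
)

# Missing values per column, then keep only the columns that have any
n_miss <- colSums(is.na(toy))
n_miss[n_miss > 0]

# Overall count, as in the sum(is.na(...)) call above
total_missing <- sum(is.na(toy))
```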

Outcome Variable Distribution

Code
p1 <- ggplot(miami, aes(x = sale_prc)) +
  geom_histogram(bins = 60, fill = "steelblue", color = "white") +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(title = "Original scale", x = "Sale price (USD)", y = "Count")

p2 <- ggplot(miami, aes(x = log(sale_prc))) +
  geom_histogram(bins = 60, fill = "steelblue", color = "white") +
  labs(title = "Log scale", x = "log(Sale price)", y = "Count")

library(patchwork)
p1 + p2 +
  plot_annotation(title = "Sale price before and after log transformation")

Sale price is heavily right-skewed, with a long tail of high-value properties. Most homes sold for between $100,000 and $500,000, but a small number of luxury properties sold for well over $1 million. On the log scale the distribution is much closer to symmetric and roughly bell-shaped, which helps stabilize variance and reduces the influence of extreme values; this generally benefits models that are sensitive to the scale of the outcome. All modeling is performed on log sale price; predictions are back-transformed with exp() for dollar-scale interpretation.
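
The fit-on-log, report-in-dollars workflow described above can be sketched as follows. This is a minimal illustration only: `lm()` stands in for whichever models are fitted later, and the data frame `toy`, its values, and `new_home` are invented.

```r
# Toy data; areas and prices are invented for illustration
toy <- data.frame(
  tot_lvg_area = c(1200, 1500, 1800, 2400, 3000, 4500),
  sale_prc     = c(180000, 240000, 310000, 420000, 650000, 1400000)
)

# Fit on the log scale
fit <- lm(log(sale_prc) ~ tot_lvg_area, data = toy)

# Predictions come back in log dollars; exp() returns them to dollars
new_home <- data.frame(tot_lvg_area = 2000)
pred_log <- predict(fit, newdata = new_home)
pred_usd <- exp(pred_log)
```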

Final Cleaned Data

Code
miami <- miami %>%
  mutate(sale_prc_log = log(sale_prc)) %>%
  select(-sale_prc)

skim(miami)
Data summary
Name miami
Number of rows 13932
Number of columns 16
_______________________
Column type frequency:
factor 3
numeric 13
________________________
Group variables None

Variable type: factor

skim_variable      n_missing  complete_rate  ordered  n_unique  top_counts
plane_noise                0              1  FALSE           2  0: 13724, 1: 208
month_sold                 0              1  FALSE          12  6: 1387, 8: 1275, 5: 1245, 4: 1234
structure_quality          0              1  FALSE           5  4: 7625, 2: 4110, 5: 2002, 1: 179

Variable type: numeric

skim_variable  n_missing  complete_rate      mean        sd       p0       p25       p50       p75       p100  hist
latitude               0              1     25.73      0.14    25.43     25.62     25.73     25.85      25.97  ▃▆▇▅▆
longitude              0              1    -80.33      0.09   -80.54    -80.40    -80.34    -80.26     -80.12  ▁▇▇▅▃
lnd_sqfoot             0              1   8620.88   6070.09  1248.00   5400.00   7500.00   9126.25   57064.00  ▇▁▁▁▁
tot_lvg_area           0              1   2058.04    813.54   854.00   1470.00   1877.50   2471.00    6287.00  ▇▅▂▁▁
spec_feat_val          0              1   9562.49  13890.97     0.00    810.00   2765.50  12352.25  175020.00  ▇▁▁▁▁
rail_dist              0              1   8348.55   6178.03    10.50   3299.45   7106.30  12102.60   29621.50  ▇▆▃▂▁
ocean_dist             0              1  31690.99  17595.08   236.10  18079.35  28541.75  44310.65   75744.90  ▅▇▅▃▂
water_dist             0              1  11960.29  11932.99     0.00   2675.85   6922.60  19200.00   50399.80  ▇▂▂▁▁
cntr_dist              0              1  68490.33  32008.47  3825.60  42823.10  65852.40  89358.32  159976.50  ▃▇▇▃▂
subcntr_di             0              1  41115.05  22161.83  1462.80  23996.25  41109.90  53949.38  110553.80  ▆▇▆▂▁
hwy_dist               0              1   7723.77   6068.94    90.20   2998.12   6159.75  10854.20   48167.30  ▇▂▁▁▁
age                    0              1     30.67     21.15     0.00     14.00     26.00     46.00      96.00  ▇▇▃▃▁
sale_prc_log           0              1     12.71      0.57    11.18     12.37     12.64     12.97      14.79  ▁▇▇▂▁

The skim summary confirms the dataset is ready for modeling. Log sale price ranges from 11.18 to 14.79, corresponding to roughly $72,000 to $2.65 million on the original scale. No variables show signs of data entry errors or implausible ranges. The three factor variables have 2, 12, and 5 levels respectively, few enough for straightforward dummy encoding.
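
As a quick arithmetic check, the minimum and maximum of sale_prc_log printed by skim (the p0 and p100 columns) can be back-transformed to dollars. The inputs are rounded to two decimals in the skim output, so the results are approximate; the names `log_range` and `usd_range` are introduced here only for illustration.

```r
# Minimum and maximum of sale_prc_log as printed by skim (p0 and p100)
log_range <- c(min = 11.18, max = 14.79)

# Back-transform to dollars; inputs are rounded, so results are approximate
usd_range <- exp(log_range)

# Round to the nearest hundred dollars for reading
round(usd_range, -2)
```

The two values come out at about $72,000 and $2.65 million.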

Code
write_rds(miami, "data/processed/miami_clean.rds")
cat("✓ miami_clean.rds written to data/processed/\n")
✓ miami_clean.rds written to data/processed/

The cleaned dataset contains 13,932 rows and 16 columns: 3 factor variables, 12 numeric predictors, and the log-transformed outcome, sale_prc_log.
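
One reason to save the cleaned data as .rds rather than a plain-text format is that column classes, including factors, survive the round trip. The sketch below shows this with base R's `saveRDS()` and `readRDS()`, which readr's `write_rds()` wraps. It writes to a temporary file rather than to data/processed/, and the data frame `toy` is made up for illustration.

```r
# Toy data with a factor column (values are illustrative)
toy <- data.frame(
  sale_prc_log = c(12.99, 12.76, 13.59),
  structure_quality = factor(c(4, 4, 5))
)

# Round-trip through a temporary .rds file
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
toy_back <- readRDS(path)

# Column classes survive the round trip, so the factor stays a factor
all.equal(toy, toy_back)
is.factor(toy_back$structure_quality)
```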