Data Cleaning

Raw Data Overview

Code

glimpse(miami_raw)

Rows: 13,932
Columns: 17
$ LATITUDE          <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.89182, 25…
$ LONGITUDE         <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -80.1546…
$ PARCELNO          <dbl> 622280070620, 622280100460, 622280100470, 6222801005…
$ SALE_PRC          <dbl> 440000, 349000, 800000, 988000, 755000, 630000, 1020…
$ LND_SQFOOT        <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 10272, …
$ TOT_LVG_AREA      <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663, 1493…
$ SPEC_FEAT_VAL     <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933, 11668…
$ RAIL_DIST         <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4, 3277…
$ OCEAN_DIST        <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 13017.0…
$ WATER_DIST        <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 10.5, 5…
$ CNTR_DIST         <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 43135.1…
$ SUBCNTR_DI        <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 38176.2…
$ HWY_DIST          <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 15687.2…
$ age               <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51, 56, …
$ avno60plus        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ month_sold        <dbl> 8, 9, 2, 9, 7, 2, 2, 9, 3, 11, 2, 11, 7, 7, 9, 11, 6…
$ structure_quality <dbl> 4, 4, 4, 4, 4, 4, 5, 4, 4, 5, 4, 2, 2, 2, 5, 2, 2, 4…

The raw dataset contains 13,932 rows and 17 columns, all stored as doubles. Each row represents a single residential property sold in Miami in 2016. Variables cover location coordinates, distances to key geographic features, structural characteristics, and the sale price outcome.

Column Name Standardization

Code

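library(dplyr)    # %>%, rename(), select()
library(janitor)  # clean_names()

miami <- miami_raw %>%
  clean_names() %>%                     # convert column names to lower snake_case
  rename(plane_noise = avno60plus) %>%  # more readable name
  select(-parcelno)                     # drop the parcel identifier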

names(miami)

 [1] "latitude"          "longitude"         "sale_prc"         
 [4] "lnd_sqfoot"        "tot_lvg_area"      "spec_feat_val"    
 [7] "rail_dist"         "ocean_dist"        "water_dist"       
[10] "cntr_dist"         "subcntr_di"        "hwy_dist"         
[13] "age"               "plane_noise"       "month_sold"       
[16] "structure_quality"

clean_names() converts every column name to lower snake_case. parcelno is a parcel identifier rather than a property characteristic, so it has no predictive value and is dropped. avno60plus is renamed to plane_noise for readability.

Type Corrections

Code

miami <- miami %>%
  mutate(
    plane_noise       = factor(plane_noise),
    month_sold        = factor(month_sold),
    structure_quality = factor(structure_quality)
  )

miami %>%
  summarise(across(everything(), class)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "type")

# A tibble: 16 × 2
   variable          type   
   <chr>             <chr>  
 1 latitude          numeric
 2 longitude         numeric
 3 sale_prc          numeric
 4 lnd_sqfoot        numeric
 5 tot_lvg_area      numeric
 6 spec_feat_val     numeric
 7 rail_dist         numeric
 8 ocean_dist        numeric
 9 water_dist        numeric
10 cntr_dist         numeric
11 subcntr_di        numeric
12 hwy_dist          numeric
13 age               numeric
14 plane_noise       factor 
15 month_sold        factor 
16 structure_quality factor

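Three variables are stored as doubles (<dbl>) in the raw data but are categorical in nature: plane_noise is a binary indicator, month_sold is a calendar month, and structure_quality is a quality grade. Converting them to factors ensures they are dummy-encoded during preprocessing rather than treated as continuous.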
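To illustrate why the conversion matters, the sketch below compares the design-matrix columns produced by a numeric code and by the same values stored as a factor. It uses a small made-up data frame (toy, with hypothetical columns quality and quality_f) and base R's model.matrix() as a stand-in for whatever dummy-encoding step the modeling pipeline actually uses.

```r
# Toy data: hypothetical quality codes, not taken from the Miami dataset
toy <- data.frame(quality = c(1, 2, 4, 5, 4, 2))

# Numeric coding: a single slope column, which assumes equal spacing between codes
x_numeric <- model.matrix(~ quality, data = toy)

# Factor coding: one indicator column per non-reference level
toy$quality_f <- factor(toy$quality)
x_factor <- model.matrix(~ quality_f, data = toy)

colnames(x_numeric)
colnames(x_factor)
```

With the numeric coding a model can only fit one linear effect of the code; with the factor coding each level gets its own coefficient relative to the reference level.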

Missingness Check

Code

miss_var_summary(miami) %>%
  filter(n_miss > 0)

# A tibble: 0 × 3
# ℹ 3 variables: variable <chr>, n_miss <int>, pct_miss <dbl>

Code

cat("Total missing values:", sum(is.na(miami)))

Total missing values: 0

No variable contains missing values, so no imputation is needed and no rows are dropped. The full dataset of 13,932 properties is preserved for modeling.
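The same check can be reproduced without any extra packages. The sketch below is a base R equivalent run on a made-up data frame (toy) with two injected missing values, so that the counts are non-trivial; the object names are illustrative only.

```r
# Hypothetical toy data frame with two injected missing values
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# Missing values per column, then the overall total
n_miss_by_col <- colSums(is.na(toy))
total_miss <- sum(is.na(toy))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

print(n_miss_by_col)
```

On a fully observed dataset such as miami, every per-column count would be zero and the number of complete rows would equal the number of rows.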

Outcome Variable Distribution

Code

p1 <- ggplot(miami, aes(x = sale_prc)) +
  geom_histogram(bins = 60, fill = "steelblue", color = "white") +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(title = "Original scale", x = "Sale price (USD)", y = "Count")

p2 <- ggplot(miami, aes(x = log(sale_prc))) +
  geom_histogram(bins = 60, fill = "steelblue", color = "white") +
  labs(title = "Log scale", x = "log(Sale price)", y = "Count")

library(patchwork)
p1 + p2 +
  plot_annotation(title = "Sale price before and after log transformation")

Sale price is heavily right-skewed, with a long tail of high-value properties. Most homes sold for between $100,000 and $500,000, but a small number of luxury properties extend well beyond $1 million. On the log scale the distribution is much more symmetric and approximately normal, which stabilizes variance and reduces the influence of extreme values; this generally helps models that are sensitive to a skewed outcome. All modeling is performed on log sale price, and predictions are back-transformed with exp() for dollar-scale interpretation.
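The back-transformation can be sketched on simulated data. The example below draws hypothetical log-normal "prices" (not the Miami data; the mean and standard deviation are arbitrary choices) and shows two things: exp() exactly undoes log() for individual values, and exponentiating an average of logs gives a geometric mean, which sits below the arithmetic mean for skewed data. The second point is worth keeping in mind when reading back-transformed predictions as typical dollar values.

```r
set.seed(1)

# Simulated right-skewed prices (log-normal); assumed parameters, not the Miami data
price <- exp(rnorm(1000, mean = 12.7, sd = 0.6))
log_price <- log(price)

# Round trip: exp() undoes log() for individual values
round_trip_ok <- isTRUE(all.equal(exp(log_price), price))

# exp(mean of logs) is the geometric mean, which is below the arithmetic mean here
geo_mean <- exp(mean(log_price))
arith_mean <- mean(price)

c(geometric = geo_mean, arithmetic = arith_mean)
```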

Final Cleaned Data

Code

miami <- miami %>%
  mutate(sale_prc_log = log(sale_prc)) %>%
  select(-sale_prc)

skim(miami)

Data summary
Name	miami
Number of rows	13932
Number of columns	16
_______________________
Column type frequency:
factor	3
numeric	13
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
plane_noise	1	FALSE	2	0: 13724, 1: 208
month_sold	1	FALSE	12	6: 1387, 8: 1275, 5: 1245, 4: 1234
structure_quality	1	FALSE	5	4: 7625, 2: 4110, 5: 2002, 1: 179

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
latitude	1	25.73	0.14	25.43	25.62	25.73	25.85	25.97	▃▆▇▅▆
longitude	1	-80.33	0.09	-80.54	-80.40	-80.34	-80.26	-80.12	▁▇▇▅▃
lnd_sqfoot	1	8620.88	6070.09	1248.00	5400.00	7500.00	9126.25	57064.00	▇▁▁▁▁
tot_lvg_area	1	2058.04	813.54	854.00	1470.00	1877.50	2471.00	6287.00	▇▅▂▁▁
spec_feat_val	1	9562.49	13890.97	0.00	810.00	2765.50	12352.25	175020.00	▇▁▁▁▁
rail_dist	1	8348.55	6178.03	10.50	3299.45	7106.30	12102.60	29621.50	▇▆▃▂▁
ocean_dist	1	31690.99	17595.08	236.10	18079.35	28541.75	44310.65	75744.90	▅▇▅▃▂
water_dist	1	11960.29	11932.99	0.00	2675.85	6922.60	19200.00	50399.80	▇▂▂▁▁
cntr_dist	1	68490.33	32008.47	3825.60	42823.10	65852.40	89358.32	159976.50	▃▇▇▃▂
subcntr_di	1	41115.05	22161.83	1462.80	23996.25	41109.90	53949.38	110553.80	▆▇▆▂▁
hwy_dist	1	7723.77	6068.94	90.20	2998.12	6159.75	10854.20	48167.30	▇▂▁▁▁
age	1	30.67	21.15	0.00	14.00	26.00	46.00	96.00	▇▇▃▃▁
sale_prc_log	1	12.71	0.57	11.18	12.37	12.64	12.97	14.79	▁▇▇▂▁

The skim summary confirms the dataset is ready for modeling. Log sale price ranges from 11.18 to 14.79, corresponding to roughly $72,000 to $2.65 million on the original scale. No variables show signs of data entry errors or implausible ranges. The three factor variables have between 2 and 12 levels, few enough for dummy encoding.
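The dollar-scale endpoints can be checked by exponentiating the p0 and p100 values of sale_prc_log reported in the skim table (11.18 and 14.79). Because those values are rounded to two decimals, the results are approximate.

```r
# p0 and p100 of sale_prc_log, copied from the skim output (rounded to 2 decimals)
log_range <- c(min = 11.18, max = 14.79)

# Back-transform to dollars and round to the nearest thousand
dollar_range <- round(exp(log_range), -3)

print(dollar_range)
```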

Code

write_rds(miami, "data/processed/miami_clean.rds")
cat("✓ miami_clean.rds written to data/processed/\n")

✓ miami_clean.rds written to data/processed/

The cleaned dataset contains 13,932 rows and 16 columns: 3 factor variables, 12 numeric predictors, and the log-transformed outcome variable.
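One reason to save in RDS format is that column types, including factors, survive the round trip, so the type corrections do not have to be repeated after loading. The sketch below demonstrates this with base R's saveRDS() and readRDS() on a tiny stand-in data frame (toy) written to a temporary file; it is an illustration, not part of the original pipeline.

```r
# Hypothetical stand-in for the cleaned data: one factor and one numeric column
toy <- data.frame(
  month_sold   = factor(c(8, 9, 2)),
  sale_prc_log = log(c(440000, 349000, 800000))
)

# Write to a temporary file so nothing under data/processed/ is touched
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

# Values and column classes are unchanged after reloading
same_object <- identical(toy, restored)
restored_classes <- vapply(restored, function(col) class(col)[1], character(1))

print(restored_classes)
```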