What type of data do I need?

Getting the correct data is most of the challenge. You’ll need the ‘assessment roll’ and the ‘sales roll’. Every jurisdiction releases this data slightly differently (ranging from three thousand page pdfs to csv files on open data portals) so this guide tries to capture most situations.

The Assessment Roll

The assessment roll must consist of an unique identifier for each property, the year of assessment, and the value of assessment. Each row is then an assessment for one property in one year.

The Sales Roll

The sales roll must consist of an unique identifier for each property, the year of sale, and the value of the sale. Each row is then a unique sale of one property in one year.

Preprocessing

In order to conduct a sales ratio study, data is required to have at least three columns: Sale Year, Sale Price, and Assessed Value. We want to compare the sale price to the assessed value at the time of sale. Here is an example:

head(cmfproperty::example_data)
#>              PIN SALE_YEAR SALE_PRICE ASSESSED_VALUE
#> 1 17273100931118      2015      53000          33860
#> 2 18013090421010      2018      80000          60390
#> 3 12111190201042      2018     118000         108300
#> 4 13093160601015      2017     125500          87200
#> 5 14322110150000      2018    3705000        3670740
#> 6 27021200080000      2016     345000         267280

Each row in this dataset should match a sale and an assessment from the same year for a single property.

After gathering data in this form, we have to identify the three required columns (sales, assessments, and sale year) for the package using reformat_data.

library(cmfproperty)
df <- cmfproperty::example_data
ratios <-
  cmfproperty::reformat_data(
    df,
    sale_col = "SALE_PRICE",
    assessment_col = "ASSESSED_VALUE",
    sale_year_col = "SALE_YEAR",
  )
#> [1] "Filtered out non-arm's length transactions"
#> [1] "Inflation adjusted to 2019"
head(as.data.frame(ratios))
#>              PIN SALE_YEAR SALE_PRICE ASSESSED_VALUE TAX_YEAR     RATIO
#> 1 17273100931118      2015      53000          33860     2015 0.6388679
#> 2 18013090421010      2018      80000          60390     2018 0.7548750
#> 3 12111190201042      2018     118000         108300     2018 0.9177966
#> 4 13093160601015      2017     125500          87200     2017 0.6948207
#> 5 14322110150000      2018    3705000        3670740     2018 0.9907530
#> 6 27021200080000      2016     345000         267280     2016 0.7747246
#>   arms_length_transaction SALE_PRICE_ADJ ASSESSED_VALUE_ADJ
#> 1                       1       59209.48           37827.04
#> 2                       1       82313.03           62136.05
#> 3                       1      121411.71          111431.26
#> 4                       1      132854.54           92310.09
#> 5                       1     3812122.01         3776871.45
#> 6                       1      376080.37          291358.73

Our data is now in the appropriate form for further analysis. More information on the output of reformat_data can be found on the Getting started page.

Let’s look at an example. The Cook County Assessor’s Office, which is the assessor for Cook County, Illinois, has an excellent open data portal which includes all the necessary information and is the source of the example_data from this package.

Example

Cook County releases both the sales roll and the assessment roll on their open data portal. Sales were found here and assessments were found here. These files can be downloaded manually or via RSocrata.

Here we download the files manually:

library(data.table)
library(tidyverse)

sales <- fread("~/../Downloads/Cook_County_Assessor_s_Residential_Sales_Data.csv", 
               colClasses = "character") #from 2013 to 2019
assessments <- fread("~/../Downloads/Cook_County_Assessor_s_Residential_Assessments.csv", 
                     colClasses = "character") #from 2015 to 2019

Then filter the data to select only the columns we need and to remove some sales which are clearly not between two independent parties:

sales <- sales %>% select(PIN, `Sale Year`, `Sale Price`, `Deed No.`) %>%
  filter(`Sale Year` > 2014)

assessments <- assessments %>% select(PIN, YEAR, CERTIFIED)

# Filtering data to remove duplicate sales and low value sales
sales <- sales %>% distinct(`Deed No.`, .keep_all = TRUE) %>% select(-`Deed No.`)
sales <- sales %>% filter(as.numeric(`Sale Price`) > 2500)

Finally, we conduct a left join of assessments on sales matching sales to assessment by unique identifier (PIN) and year.

# Join assessments to sales based on PIN (a unique identifier) and Year.
joined <- sales %>% left_join(assessments, by=c("PIN"="PIN", "Sale Year"="YEAR"))

# Remove spaces from column names (not necessary)
joined <- joined %>% rename(SALE_YEAR = `Sale Year`, SALE_PRICE = `Sale Price`, ASSESSED_VALUE = CERTIFIED)

This dataset is then identical to example_data:

head(cmfproperty::example_data)
#>              PIN SALE_YEAR SALE_PRICE ASSESSED_VALUE
#> 1 17273100931118      2015      53000          33860
#> 2 18013090421010      2018      80000          60390
#> 3 12111190201042      2018     118000         108300
#> 4 13093160601015      2017     125500          87200
#> 5 14322110150000      2018    3705000        3670740
#> 6 27021200080000      2016     345000         267280