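Problem description
I have address data in R in several address formats, and I would like to parse out at least the important address parts so that I can use the addresses to merge multiple datasets. Because the addresses come in several formats, I need something that can tell units or apartments apart from, for example, the street and the ZIP code.
The problem: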
testaddress1 <- "20 W 34th St, New York, NY 10001"
testaddress2 <- "20 West 34 St, New York City, NY 10001"
testaddress3 <- "20 WEST 34th, NYC, NY 10001"
Is there a simple way in R to parse out the address parts? Ideally, the following parts:
Number: 20; Direction: West; Street: 34; City: New York; State: NY; Zip: 10001
Units and recipients in the address are also a problem:
# Problem with units/apartments
testunit1 <- "UNIT 9A 740 Park Ave, NY 10021"
testunit2 <- "740 Park Ave 9A, NY 10021"
testunit3 <- "APT 9A, 740 Park Ave, NY 10021"
# Ideal parse
Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021
# Problem with recipient
testrec1 <- "John Doe UNIT 9A, NY 10021"
testrec2 <- "John Doe, 740 Park Ave 9A, NY 10021"
testrec3 <- "JOHN DOE APT 9A, NY 10021"
# Ideal parse
Recipient: John Doe; Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021
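I found this, but it looks like a mess and I am having trouble implementing it: https://slu-opengis.github.io/postmastr/articles/postmastr.html
Is there anything in R that parses addresses automatically?
Solution
postmastr seems to work quite well...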
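# The second address was missing its closing "NY 10001" part, which broke the string
v.addresses <- c("20 W 34th St, New York, NY 10001", "20 West 34 St, New York City, NY 10001", "20 WEST 34th, NYC, NY 10001")
df <- data.frame(address = v.addresses)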
library(postmastr)
library(magrittr)
library(tidycensus)
df
#***************************************************************
# STATES and POSTAL CODES #####
#***************************************************************
# Build the state dictionary
stateDict <- pm_dictionary(locale = "us", type = "state")
# Parse states and postal codes
answer_1 <- df %>%
  pm_identify(var = "address") %>%
  pm_prep(var = "address", type = "street")
answer <- answer_1 %>%
  pm_postal_parse() %>%
  pm_state_parse(dictionary = stateDict)
#***************************************************************
# CITIES #####
#***************************************************************
# Create a city dictionary based on the states in `answer`
# A Census API key is needed (see the postmastr vignette)
# Run the code below once
# census_api_key("#####", install = TRUE)
# readRenviron("~/.Renviron")
# End of one-time setup
cityDict <- pm_dictionary(type = "city", filter = unique(answer$pm.state), locale = "us")
# Some addresses have no city that matches the dictionary
answer %>% pm_city_none(dictionary = cityDict)
# pm.uid pm.address pm.state pm.zip
# <int> <chr> <chr> <chr>
# 1 2 20 West 34 St New York City NY 10001
# 2 3 20 WEST 34th NYC NY 10001
# So we append the missing city spellings to the dictionary
missingCity <- pm_append(type = "city", input = c("New York City", "NYC"), output = "New York", locale = "us")
# Build the new city dictionary
cityDict <- pm_dictionary(type = "city", append = missingCity, locale = "us")
# Do all lines have a city now?
answer %>% pm_city_all(dictionary = cityDict)
# TRUE
# Parse the cities
answer <- answer %>% pm_city_parse(dictionary = cityDict)
# pm.uid pm.address pm.city pm.state pm.zip
# <int> <chr> <chr> <chr> <chr>
# 1 1 20 W 34th St New York NY 10001
# 2 2 20 West 34 St New York NY 10001
# 3 3 20 WEST 34th New York NY 10001
#***************************************************************
# HOUSENUMBERS #####
#***************************************************************
answer <- answer %>% pm_house_parse()
# pm.uid pm.address pm.house pm.city pm.state pm.zip
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 W 34th St 20 New York NY 10001
# 2 2 West 34 St 20 New York NY 10001
# 3 3 WEST 34th 20 New York NY 10001
#***************************************************************
# STREETS #####
#***************************************************************
dirsDict <- pm_dictionary(type = "directional", locale = "us")
answer <- answer %>%
  pm_streetDir_parse(dictionary = dirsDict) %>%
  pm_streetSuf_parse() %>%
  pm_street_parse(ordinal = TRUE, drop = TRUE)
# Put the parsed columns back onto the prepared data
pm_replace(answer, source = answer_1)
# pm.uid pm.house pm.preDir pm.street pm.streetSuf pm.city pm.state pm.zip
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 20 W 34th St New York NY 10001
# 2 2 20 W 34 St New York NY 10001
# 3 3 20 W 34th NA New York NY 10001
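Since the goal is to merge datasets on the address, note that the parsed rows above still differ slightly: the street is 34th in two rows and 34 in one, and row 3 has no street suffix. The following is a minimal base-R sketch of one way to build a join key from the parsed columns. The helper name make_address_key is hypothetical and not part of postmastr, the small data frame simply retypes the output shown above, and leaving the suffix out of the key is an assumption that may over-merge (for example, 34th St and 34th Ave in the same ZIP code).

```r
# Hypothetical helper (not part of postmastr): build a merge key from
# house number, pre-directional, street name and ZIP code.
make_address_key <- function(house, pre_dir, street, zip) {
  # Lower-case the street and drop ordinal endings so "34th" and "34" match
  street_norm <- sub("^([0-9]+)(st|nd|rd|th)$", "\\1", tolower(street))
  # Treat a missing pre-directional as empty rather than the string "NA"
  pre_dir[is.na(pre_dir)] <- ""
  key <- paste(house, toupper(pre_dir), street_norm, zip)
  # Collapse repeated whitespace left behind by empty parts
  trimws(gsub("\\s+", " ", key))
}

# Retyped from the parsed output above (assumed column names as printed there)
parsed <- data.frame(
  pm.house  = c("20", "20", "20"),
  pm.preDir = c("W", "W", "W"),
  pm.street = c("34th", "34", "34th"),
  pm.zip    = c("10001", "10001", "10001"),
  stringsAsFactors = FALSE
)

parsed$key <- make_address_key(parsed$pm.house, parsed$pm.preDir,
                               parsed$pm.street, parsed$pm.zip)
unique(parsed$key)
# [1] "20 W 34 10001"
```

With a key column like this on each dataset, a plain merge(x, y, by = "key") would join the three spellings as one address. If suffix collisions matter in your data, add pm.streetSuf to the key once missing suffixes have been filled in.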