Parsing address strings in R

Problem description

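I have address data in R in a variety of formats, and I would like to parse it into at least the important address components so that I can use the address to merge several datasets. Because the addresses come in many formats, I need something that can tell a unit or apartment apart from, for example, the street and the ZIP code.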

The problem:

testaddress1 <- "20 W 34th St,New York,NY 10001"
testaddress2 <- "20 West 34 St,New York City,NY 10001"
testaddress3 <- "20 WEST 34th,NYC,NY 10001"

Is there a simple way in R to parse out the address parts? Ideally, into the following components:

Number: 20; Direction: West; Street: 34; City: New York; State: NY; Zip: 10001

Units and recipients in addresses are also a problem:

#Problem with units/apartments
testunit1 <- "UNIT 9A 740 Park Ave,NY 10021"
testunit2 <- "740 Park Ave 9A,NY 10021"
testunit3 <- "APT 9A,740 Park Ave,NY 10021"

#Ideal parse
Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021

#Problem with recipient
testrec1 <- "John Doe UNIT 9A,NY 10021"
testrec2 <- "John Doe,740 Park Ave 9A,NY 10021"
testrec3 <- "JOHN DOE APT 9A,NY 10021"

#Ideal parse
Recipient: John Doe; Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021

I found this, but it looks messy and I had trouble implementing it: https://slu-opengis.github.io/postmastr/articles/postmastr.html

Is there anything in R that parses addresses automatically?

Solution

postmastr seems to work quite well...

v.adresses <- c("20 W 34th St, New York, NY 10001",
                "20 West 34 St, New York City, NY 10001",
                "20 WEST 34th, NYC, NY 10001")

df <- data.frame(address = v.adresses)

library(postmastr)
library(magrittr)
library(tidycensus)
df
#***************************************************************
# STATES and POSTAL CODES #####
#***************************************************************
# Build states dictionary
stateDict <- pm_dictionary(locale = "us",type = "state")
#parse and get states + postalcodes
answer_1 <- df %>%
  pm_identify(var = "address") %>%
  pm_prep(var = "address",type = "street") 

answer <- answer_1 %>% 
  pm_postal_parse() %>%
  pm_state_parse(dictionary = stateDict)

#***************************************************************
# CITIES #####
#***************************************************************
# Create cities dictionary based on states in `answer` 
#  apikey needed (see postmastr-vignette)
# run below code once
#  census_api_key("#####",install = TRUE)
#  readRenviron("~/.Renviron")
# end run
cityDict <- pm_dictionary(type = "city",filter = unique(answer$pm.state),locale = "us")
#  There seem to be addresses without correct cities
answer %>% pm_city_none(dictionary = cityDict)
#   pm.uid pm.address                  pm.state pm.zip
#    <int> <chr>                       <chr>    <chr> 
# 1      2 20 West 34 St New York City NY       10001 
# 2      3 20 WEST 34th NYC            NY       10001 
# So we append the cities to the dictionary
missingCity <- pm_append(type = "city",input = c("New York City","NYC"),output = "New York",locale = "us")
# Build new cities dictionary
cityDict <- pm_dictionary(type = "city",append = missingCity,locale = "us")
# Do all lines have a city now?
answer %>% pm_city_all(dictionary = cityDict)
#TRUE
# Parse
answer <- answer %>% pm_city_parse(dictionary = cityDict)
#   pm.uid pm.address    pm.city  pm.state pm.zip
#    <int> <chr>         <chr>    <chr>    <chr> 
# 1      1 20 W 34th St  New York NY       10001 
# 2      2 20 West 34 St New York NY       10001 
# 3      3 20 WEST 34th  New York NY       10001 

#***************************************************************
# HOUSENUMBERS #####
#***************************************************************
answer <- answer %>% pm_house_parse()
#   pm.uid pm.address pm.house pm.city  pm.state pm.zip
#    <int> <chr>      <chr>    <chr>    <chr>    <chr> 
# 1      1 W 34th St  20       New York NY       10001 
# 2      2 West 34 St 20       New York NY       10001 
# 3      3 WEST 34th  20       New York NY       10001 

#***************************************************************
# STREETS #####
#***************************************************************
dirsDict <- pm_dictionary(type = "directional",locale = "us")
answer <- answer %>% 
  pm_streetDir_parse(dictionary = dirsDict) %>%
  pm_streetSuf_parse() %>%
  pm_street_parse(ordinal = TRUE,drop = TRUE)

pm_replace(answer,source = answer_1)
#   pm.uid pm.house pm.preDir pm.street pm.streetSuf pm.city  pm.state pm.zip
#    <int> <chr>    <chr>     <chr>     <chr>        <chr>    <chr>    <chr> 
# 1      1 20       W         34th      St           New York NY       10001 
# 2      2 20       W         34        St           New York NY       10001 
# 3      3 20       W         34th      NA           New York NY       10001
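
Since the goal is to merge datasets on the address, note that the parsed rows above still differ slightly ("34th" vs "34", and a missing street suffix in row 3). Below is a minimal base R sketch of one way to build a normalized merge key from the parsed columns. The `parsed` data frame is typed in by hand to mirror the output above, and `make_key()` is a hypothetical helper of my own, not part of postmastr; which columns belong in the key is an assumption you should adapt to your data.

```r
# Hand-typed stand-in for the parsed result shown above (assumption:
# your real object has the same pm.* columns)
parsed <- data.frame(
  pm.house     = c("20", "20", "20"),
  pm.preDir    = c("W", "W", "W"),
  pm.street    = c("34th", "34", "34th"),
  pm.streetSuf = c("St", "St", NA),
  pm.city      = c("New York", "New York", "New York"),
  pm.state     = c("NY", "NY", "NY"),
  pm.zip       = c("10001", "10001", "10001"),
  stringsAsFactors = FALSE
)

# make_key() is an illustrative helper, not a postmastr function
make_key <- function(d) {
  # Drop ordinal endings so that "34th" and "34" compare equal
  street <- sub("^([0-9]+)(st|nd|rd|th)$", "\\1", tolower(d$pm.street))
  # Treat a missing directional as an empty string
  dir <- ifelse(is.na(d$pm.preDir), "", d$pm.preDir)
  # The street suffix is left out on purpose: it is NA in row 3
  toupper(paste(d$pm.house, dir, street, d$pm.zip, sep = "|"))
}

parsed$key <- make_key(parsed)
unique(parsed$key)
```

With a `key` column built the same way in each dataset, something like `merge(df_a, df_b, by = "key")` should then line the records up. Leaving the suffix out makes the key more forgiving, at the cost of possibly conflating, say, "34 St" and "34 Ave" within the same ZIP code.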