Split a unicode string into a list on a delimiter

Problem description

I wrote a program that reads a set of names and then converts them to Unicode, for example:

StevensJohn:-:
WasouskiMike:-:
TimebombTime:-:
etc

Is there a way to build a list by splitting on that delimiter, so that it looks like this?

example_list = ["StevensJohn","WasouskiMike","TimebombTim"] 

This is dynamic: the names, and the number of distinct names, come back from a web scrape.

Any input would be appreciated.

Code

results = unicode("""
Hospitality
Customer Care
Wick,John 12:00-20:00
Wick,John 10:00-17:00
Obama,Barack 06:00-14:00
Musk,Elon 07:00-15:00
Wasouski,Mike 06:30-14:30
 Production
Fries
Piper,Billie 12:00-20:00
Tennent,David 06:30-14:30
Telsa,nikola 11:45-17:00
Beverages & Desserts in a Dual Lane Drive-thru with a split beverage cell
Timebomb,Tim 06:30-14:30
Freeman,Matt 08:00-16:00
Cool,Tre 11:45-17:00
Sausage
Prestly,Elvis 06:30-14:30
Fat,Mike 06:30-14:30
Knoxville,Johnny 06:00-14:00
Man,Wee 05:00-12:00
Heartness,Jack 09:00-16:00
Breakfast BOP
Schofield,Phillip 06:30-14:15
Burns,George 06:30-14:15
Johnson,Boris 06:30-14:30
Milliband,Edd 06:30-14:30
Trump,Donald 10:00-17:00
Biden,Joe 08:00-16:00
Tempering & Prep
Clinton,Hillary 11:00-19:00

""")

for span in results:
    results = results.replace(',','')
    results = results.replace(" ","")
    results = results.replace("/r","")
    results = results.replace(":-:","\r")
    results = ''.join([i for i in results if not i.isdigit()])
    print(results)

Solutions



import re

# Named `text` rather than `input` so the built-in input() is not shadowed.
text = 'StevensJohn:-:\nWasouskiMike:-:\nTimebombTime:-:\n'

class Names:
    def __init__(self, text, delimiter=':-:\n'):
        # re.split treats the delimiter as a regular expression;
        # `if x` drops the empty string left after the final delimiter.
        self.names = [x for x in re.split(delimiter, text) if x]
        # A set keeps exactly one copy of each name.
        self.different_names = set(self.names)

    def number_of_names(self):
        return len(self.names)

    def number_of_different_names(self):
        return len(self.different_names)

    def __str__(self):
        return str(self.names)

names = Names(text)
print(names)
print(names.number_of_names())
print(names.number_of_different_names())
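
Alternatively, a plain str.split on the delimiter is enough:

unicode_ex = 'StevensJohn:-:\nWasouskiMike:-:\nTimebombTime:-:\n'
# `if name` drops the empty string left after the final delimiter.
name_list = [name.replace(" ", "") for name in unicode_ex.split(":-:\n") if name]
print(name_list)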

Output

['StevensJohn', 'WasouskiMike', 'TimebombTime']
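
The split above assumes every entry ends in exactly `:-:` followed by `\n`. If the scraped text may use `\r\n` line endings or lack a trailing newline, a slightly more forgiving sketch is to split on `:-:` alone and strip whitespace from each piece. The function name `split_names` and the sample string are made up for illustration.

```python
def split_names(text, delimiter=":-:"):
    """Split on the delimiter, then discard surrounding whitespace and empty pieces."""
    parts = (part.strip() for part in text.split(delimiter))
    return [part for part in parts if part]

# Hypothetical input mixing \r\n and \n, with no newline after the last delimiter.
sample = "StevensJohn:-:\r\nWasouskiMike:-:\nTimebombTime:-:"
names = split_names(sample)
print(names)                        # the list of names
print(len(names), len(set(names)))  # total count and number of distinct names
```

Because the result is an ordinary list, `len(names)` and `len(set(names))` give the total and the distinct counts whatever the scrape returns.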

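Your edit shows that this really is an XY problem. Repeatedly trimming small substrings will inevitably run into corner cases where a substring should sometimes not be removed. A common alternative is to use a regular expression.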

class Comment:

    def __init__(self, author_name, body, ups):
        self.author_name = author_name
        self.body = body
        self.ups = ups

Demo: https://ideone.com/1syge8
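
As a rough illustration of the regular-expression approach (not the code from the linked demo), the sketch below pulls names straight from the raw schedule text instead of deleting characters one replace at a time. It assumes each shift line has the form `Surname,Forename HH:MM-HH:MM`. That format is inferred from the sample in the question, and `SHIFT_LINE` and `extract_names` are hypothetical names.

```python
import re

# Shortened excerpt of the scraped text shown in the question.
SAMPLE = """\
Hospitality
Customer Care
Wick,John 12:00-20:00
Wick,John 10:00-17:00
Obama,Barack 06:00-14:00
 Production
Beverages & Desserts in a Dual Lane Drive-thru with a split beverage cell
Timebomb,Tim 06:30-14:30
"""

# Assumed line format: "Surname,Forename HH:MM-HH:MM".
# Heading lines have no comma followed by a time range, so they never match.
SHIFT_LINE = re.compile(
    r"^[ \t]*(?P<surname>[^,\d\n]+),(?P<forename>[^,\d\n]+?)[ \t]+"
    r"\d{1,2}:\d{2}-\d{1,2}:\d{2}[ \t]*$",
    re.MULTILINE,
)

def extract_names(text):
    """Return 'SurnameForename' for every line that looks like a shift entry."""
    return [
        m.group("surname").strip() + m.group("forename").strip()
        for m in SHIFT_LINE.finditer(text)
    ]

names = extract_names(SAMPLE)
print(names)                        # names in order of appearance, duplicates kept
print(len(names), len(set(names)))  # total entries and distinct names
```

Matching whole lines this way leaves headings such as "Customer Care" out of the result, and a digit or space inside a heading cannot end up in a name.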

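A better solution still would be to use the structure of the surrounding HTML to extract only the specific spans you need. Most modern sites use CSS selectors for formatting, and these are very useful for scraping too. However, since we can't see the original page this string was extracted from, this is entirely speculative.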
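
To make that idea concrete, here is a minimal sketch using only the Python 3 standard library. The markup is entirely hypothetical, since the real page is not shown: it assumes each name sits in its own `<span class="employee">` element. The class name and the `EmployeeSpans` parser are invented for illustration.

```python
from html.parser import HTMLParser

# Entirely hypothetical markup; the real page structure is unknown.
HTML = """
<div class="station">Fries</div>
<span class="employee">Piper,Billie</span><span class="shift">12:00-20:00</span>
<span class="employee">Tennent,David</span><span class="shift">06:30-14:30</span>
"""

class EmployeeSpans(HTMLParser):
    """Collect the text of every <span class="employee"> element (no nested spans)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "employee") in attrs:
            self._inside = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.names[-1] += data

parser = EmployeeSpans()
parser.feed(HTML)
names = [name.replace(",", "").strip() for name in parser.names]
print(names)
```

With a scraping library that supports CSS selectors, the same extraction would usually be a single selector such as `span.employee`. The right selector depends on the actual page.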