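Python / Pandas: split text into columns with a delimiter and create a CSV file

Problem description

I have a very long text in which I inserted the delimiter ";" exactly where I want to split the text into separate columns. So far, whenever I try to split the text into "ID" and "ADText", I only get the first row. However, both columns should contain 1439 rows.

My text looks like this: 1234; a text written in multiple sentences that runs over multiple lines until at some point the next ID is written down 2345; then the new ad text begins, until the next ID 3456; and so on

I want to use the ";" to split my text into two columns, one for the ID and one for the ad text.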

Unfortunately, my attempt only works for the first entry and then stops.

Where am I going wrong? I would appreciate any advice =) Thanks!

Solution

Sample text:

FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16

Create the columns using ";" as the separator:

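import pandas as pd
# "aminoacids" is the file that contains the sample text above
f = "aminoacids"
# sep=";" splits every line into columns at each semicolon
df = pd.read_csv(f, sep=";")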
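
For a quick check without a file on disk, the same call can be sketched with the sample text held in a string. Here `io.StringIO` only stands in for the `aminoacids` file, and the variable name `sample` is an illustrative assumption, not part of the original answer.

```python
import io

import pandas as pd

# The sample text from above, held in memory instead of in a file.
sample = """FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16
"""

# sep=";" tells pandas to split each line at the semicolons.
df = pd.read_csv(io.StringIO(sample), sep=";")

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the first line
```

This only works when every record sits on its own line, which is not the case for the ad text in the question.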


Edit: Taking the comments into account, I assume the text looks like this:

t = """1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on1234; text in written from with multiple """

In that case, a regular expression like the following splits your string into IDs and texts, which you can then use to build a pandas DataFrame.

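import re
# The capture group keeps each matched ID in the result list
r = re.compile(r"([0-9]+);")
# Store the result so it can be reused below
a = re.split(r, t)
print(a)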

Output:

['', '1234', ' text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon ', '2345', ' then the new Ad-Text begins until the next ID ', '3456', ' and so on', '1234', ' text in written from with multiple ']
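
As an alternative sketch, the IDs and texts can be paired directly with `re.findall`, which avoids the leading empty string in the list above. The function name `split_ads` and the short sample string are assumptions made for illustration. The pattern also assumes that a run of digits followed by ";" never occurs inside an ad text.

```python
import re

# One record = digits, a semicolon, then everything up to the next "digits;"
# or the end of the string. re.DOTALL lets the text run across line breaks.
RECORD = re.compile(r"(\d+);(.*?)(?=\d+;|\Z)", re.DOTALL)


def split_ads(text):
    """Return a list of (id, ad_text) tuples found in text."""
    return [(ad_id, body.strip()) for ad_id, body in RECORD.findall(text)]


sample_text = (
    "1234; first ad text\n"
    "over two lines 2345; second ad text 3456; and so on"
)
pairs = split_ads(sample_text)
for ad_id, body in pairs:
    print(ad_id, repr(body))
```

Each tuple already has the shape of one row, so the list can be passed to `pd.DataFrame(pairs, columns=["ID", "ADText"])`.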

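Edit 2: This is a reply to the asker's follow-up question in the comments: how to turn this string into a pandas DataFrame with two columns, ID and text.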

import pandas as pd
# a is the output list from the previous part of this answer
# Texts: [::2] takes every other item, starting with the FIRST one;
# [1:] then drops the empty string at the start of the list.
texts = a[::2][1:]
print(texts)
# IDs: [1::2] takes every other item, starting with the SECOND one.
ids = a[1::2]
print(ids)
df = pd.DataFrame({"IDs": ids, "Texts": texts})
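
The question also asks for a CSV file, which the steps so far do not produce. A minimal end-to-end sketch, assuming a short stand-in text and a hypothetical output name `ads.csv`, could look like this:

```python
import re

import pandas as pd

# Short stand-in for the long ad text from the question.
t = "1234; first ad text 2345; second ad text 3456; and so on"

# Split on "digits;" and keep the digits thanks to the capture group.
a = re.split(r"([0-9]+);", t)

ids = a[1::2]                         # items 1, 3, 5, ... are the IDs
texts = [s.strip() for s in a[2::2]]  # items 2, 4, 6, ... are the texts

df = pd.DataFrame({"ID": ids, "ADText": texts})

# index=False leaves out the row numbers; sep=";" keeps the semicolon format.
df.to_csv("ads.csv", sep=";", index=False)
print(df)
```

Stripping the texts removes the spaces left over around each ID. If the ad texts can themselves contain ";", pandas quotes those fields automatically when writing.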