Problem description
I am new to Python and am currently working on an automatic text summarization project. I have a dataset in JSON Lines (.jsonl) format that looks like this:
{"category": "olahraga","gold_labels": [[true,true],[false,false,false],[true,true,false]],"id": "1475912720-messi-absen-argentina-kembali-gagal-menang","paragraphs": [[["Jakarta",","CNN","Indonesia","-","Timnas","Argentina","kembali","gagal","meraih","kemenangan","tanpa","kehadiran","Lionel","Messi","."],["Kali","ini","tim","Tango","ditahan","imbang","Peru","2","di","Stadion","Nasional","Lima","Kamis","(","6","/","10",")","malam","waktu","setempat","atau","Jumat","pagi","WIB","."]],[["Argentina","membuang","keunggulan","dua","kali","saat","menghadapi",["Ramiro","Funes","Mori","membawa","unggul","lewat","gol","pada","menit","ke","16",["Bek","Everton","itu","memanfaatkan","kemelut","kotak","penalti","skema","sepak","pojok",[["Keunggulan","satu","bertahan","hingga","jeda","babak","pertama",["Di","kedua","tepatnya","58","tuan","rumah","berhasil","menyamakan","kedudukan","kapten","Jose","Paolo","Guerrero",[["Usai","menerima","umpan","Miguel","Trauco","menahan","bola","dengan","dada","dan","lolos","dari","kawalan",["Mantan","penyerang","Bayern","Munich","kemudian","melepaskan","tendangan","mendatar","yang","tidak","bisa","dihentikan","kiper","Sergio","Romero","Gonzalo","Higuain","78",["El","Pipita","meneruskan","terobosan","Pablo","Zabaleta","mencungkil","melewati","Pedro","gallese",[["Sayang","ditangani","Edgardo","Bauza","mampu","mempertahankan",["Pada","84","mendapat","dieksekusi","sempurna","oleh","Christian","Cueva","setelah","menjatuhkan",[["Hasil","membuat","naik","peringkat","lima","klasemen","sementara","kualifikasi","Piala","Dunia","2018","zona","Conmebol",["Argentina","kini","mengoleksi","poin","tertinggal","tiga","Uruguay","puncak","untuk","beruntun","hasil",["Sebelumnya","Venezuela","September","lalu",["Menariknya","laga","tersebut","tampil","karena","cedera",[["Messi","sedang","menjalani","pemulihan","pangkal","paha","selama","pekan","didapatnya","melawan","Atletico","Madrid",["Penyerang","Barcelona","juga","dipastikan","absen","ketika","Paraguay","Cordoba"
,"11","Oktober","mendatang",["(","har",")"]]],"source": "cnn indonesia","source_url": "http://www.cnnindonesia.com/olahraga/20161007114818-142-163937/messi-absen-argentina-kembali-gagal-menang/","summary": [["Timnas","berhadapan","skor",["Tanpa",["Messi","dikabarkan","."]]}
{"category": "hiburan",[false],"id": "1494353011-kebakaran-di-kapuk-muara-12-unit-damkar-dikerahkan","paragraphs": [[["Kebakaran","melanda","kawasan","pemukiman","Jalan","Kapuk","Raya","Vila","Muara","I","Jakarta","Utara",["Api","menyala","telah","membakar","sejumlah",[["\"","Yang","terbakar","adalah","lapak","warga","terbuat","bahan","mudah","\"","kata","Petugas","Sudin","Pemadam","Kebakaran","Rangga","Riswanto","dihubungi","kumparan","kumparan.com","Selasa","9","5",[["Rangga","menyebut","pihaknya","laporan","mengenai","kebakaran","sekitar","pukul","03.40",["Sudah","12","unit","mobil","pemadam","dikerahkan","memadamkan","api","damkar","Barat","ujar","dia",[["Hingga","04.21","masih","berlangsung",["Rangga","berpotensi","meluas",["\"","Situasinya","perambatannya","membahayakan","perambatan","kanan","kiri","."]]],"source": "kumparan","source_url": "https://kumparan.com/taufik-rahadian/kebakaran-di-kapuk-muara-12-unit-damkar-dikerahkan","summary": [["Kebakaran",["Petugas","."]]}
{"category": "tajuk utama","gold_labels": [[false,"id": "1501893029-lula-kamal-dokter-ryan-thamrin-sakit-sejak-setahun","Dokter","Ryan","Thamrin","terkenal","acara","Oz","meninggal","dunia","4","8","dini","hari",["Dokter","Lula","Kamal","merupakan","selebriti","sekaligus","rekan","kerja","kawannya","sudah","sakit","sejak","setahun",[["Lula","menuturkan","mesti","vakum","semua","kegiatannya","termasuk","menjadi","pembawa",["Kondisi","harus","kampung","halamannya","Pekanbaru","Riau","istirahat","Setahu","saya","orangnya","sehat","tapi","tahun","dengar","Karena","sakitnya","ia","langsung","pulang","jadi","kami","mau","jenguk","susah",["Barangkali","ya","betul","kalau","isirahatnya","kepada","CNNIndonesia.com","mengenal","sebelum","aktif","berkarier","televisi","mengaku","belum","sempat","membesuk","lantaran","lokasi","jauh",["Dia","tak","tahu","penyakit","apa","diderita","Itu","enggak","selamanya","dijenguk",["Enggak","berat","sekali","bagaimana","tutur",[["walau","menderita","mengetahui","penyebab","pasti","kematian","Dr",["Meski","demikian","mendengar","beberapa","kabar","bahwa","jatuh","kamar","mandi",[["\u201c","Saya","barangkali","dulu","sama","sekarang","berbeda","kematiannya","beda","sebelumnya",["Kita","kan","mengambil","kesimpulan",[["Ryan","sebagai","dokter","rutin","membagikan","tips","informasi","kesehatan","tayangan","menempuh","Pendidikan","2002","Fakultas","Kedokteran","Universitas","Gadjah","Mada","melanjutkan","pendidikan","Klinis","Kesehatan","Reproduksi","Penyakit","Menular","Seksual","Mahachulalongkornrajavidyalaya","University","Bangkok","Thailand","2004","source_url": "https://www.cnnindonesia.com/hiburan/20170804120703-234-232443/lula-kamal-dokter-ryan-thamrin-sakit-sejak-setahun-lalu/","summary": [["Dokter",["Lula","."]]}
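From this data, I want to pair each value in "gold_labels" with the corresponding sentence in "paragraphs" (every sentence in "paragraphs" has already been tokenized). I have heard that json_normalize can produce flat table data, but so far I have not figured out how to apply it to jsonlines. Tutorials on jsonlines are also scarce.
A visualization of the DataFrame I want: check here
Sorry for my poor English. Thank you.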
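One way to approach the pairing described above, as a minimal sketch rather than a tested solution for this dataset. It assumes that "gold_labels" holds one list of booleans per paragraph, that "paragraphs" holds one list of tokenized sentences per paragraph, and that the two line up position by position. The function name pair_labels_with_sentences, the row keys, and the shortened sample record are made up for illustration.

```python
import json


def pair_labels_with_sentences(jsonl_lines):
    """Return one flat row per sentence from an iterable of JSON Lines strings."""
    rows = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)  # each non-empty line is one JSON object
        # Walk paragraphs and their label lists in parallel.
        # Note: zip() stops at the shorter input, so mismatched lengths
        # are silently truncated here rather than reported.
        paired = zip(record["paragraphs"], record["gold_labels"])
        for p_idx, (sentences, labels) in enumerate(paired):
            for s_idx, (tokens, label) in enumerate(zip(sentences, labels)):
                rows.append({
                    "id": record["id"],
                    "paragraph": p_idx,
                    "sentence_index": s_idx,
                    "sentence": " ".join(tokens),  # re-join the tokens
                    "gold_label": label,
                })
    return rows


# Hypothetical, hand-made record in the same shape as the dataset.
sample_line = json.dumps({
    "id": "sample-1",
    "gold_labels": [[True, False], [True]],
    "paragraphs": [
        [["Timnas", "Argentina", "gagal", "menang", "."],
         ["Messi", "absen", "."]],
        [["Laga", "berakhir", "imbang", "."]],
    ],
})

rows = pair_labels_with_sentences([sample_line])
for row in rows:
    print(row)
```

To run this on a real file, the list could be replaced by an open text file handle, since iterating over a file yields one line at a time. Passing the resulting rows to pandas.DataFrame(rows) should then give a flat table with one row per sentence, because the DataFrame constructor accepts a list of dicts.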