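Problem description
Out of curiosity, I would like to load a .7z file directly into a NumPy array in Python without extracting it first (I know I could decompress the .7z file and then read it), but I cannot find any code or package that does this.
The data is stored as floating-point numbers in a .csv file inside the .7z archive, and I want to convert it directly from the archive into a NumPy array.
I have read about several packages such as libarchive and pylzma, but they are more about decompression than about loading a file directly into an np array. This post also seems relevant, but it is vague to me.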
Solution
I found a solution. First, I got a sample .csv file from here.
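I then unzipped it and recompressed it with 7z. I used py7zr to extract the contents into the working directory, and genfromtxt from NumPy to convert the extracted file into a NumPy array: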
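import os
import py7zr
from numpy import genfromtxt

# Extract the archive contents into the current working directory
with py7zr.SevenZipFile('FL_insurance_sample.7z', mode='r') as z:
    z.extractall(os.getcwd())

# Parse the extracted CSV; non-numeric fields become nan
my_data = genfromtxt('FL_insurance_sample.csv', delimiter=',')
print(my_data)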
Output:
[[... nan]
 [1.19736e+05 nan nan ... nan nan
  1.00000e+00]
 [4.48094e+05 nan nan ... nan nan
  3.00000e+00]
 ...
 [7.91209e+05 nan nan ... nan nan
  4.00000e+00]
 [3.22627e+05 nan nan ... nan nan
  3.00000e+00]
 [3.98149e+05 nan nan ... nan nan
  1.00000e+00]]
Note that genfromtxt returns nan for any value that is non-numeric (this is not a problem here, since the OP's csv contains only numbers).
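The approach above still writes the .csv to disk before parsing it. Since genfromtxt also accepts file-like objects, the bytes of an archive member can in principle be parsed straight from memory. The following is only a rough sketch under stated assumptions: the helper names csv_bytes_to_array and load_csv_from_7z are hypothetical, and the archive helper assumes a py7zr 0.x release in which SevenZipFile.read() returns a dict mapping member names to in-memory buffers (newer releases may not offer this method, so check the documentation of the installed version).

```python
import io

import numpy as np


def csv_bytes_to_array(raw: bytes, skip_header: int = 0) -> np.ndarray:
    """Parse CSV bytes held in memory into a float array (hypothetical helper)."""
    text = raw.decode("utf-8")
    # genfromtxt accepts any file-like object, so no file on disk is needed
    return np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=skip_header)


def load_csv_from_7z(archive_path: str, member: str, skip_header: int = 0) -> np.ndarray:
    """Read one CSV member of a .7z archive without extracting it to disk."""
    import py7zr  # third-party; imported lazily so the parser above works without it

    with py7zr.SevenZipFile(archive_path, mode="r") as z:
        # Assumption: py7zr 0.x, where read() returns {member name: BytesIO}
        extracted = z.read([member])
    return csv_bytes_to_array(extracted[member].read(), skip_header=skip_header)


# Small made-up CSV to show the parsing step on its own
sample = b"id,label,value\n1,a,2.5\n2,b,4.0\n"
arr = csv_bytes_to_array(sample, skip_header=1)
print(arr)
```

Passing skip_header=1 drops the header line, which would otherwise come back as a row of nan; the non-numeric label column is still read as nan. With the archive from the answer, the call would look something like load_csv_from_7z('FL_insurance_sample.7z', 'FL_insurance_sample.csv', skip_header=1), assuming the member inside the archive has that name.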
Reference: