In theory, pandas provides a tool for this:

import pandas as pd
df = pd.read_orc(file_name)
However, pandas relies on a library called pyarrow under the hood, and the wheels published on pip have the pyarrow._orc module stripped out because of some ORC-related C++ link errors. This YouTube video explains the problem clearly and offers a workaround: use another library, pyorc.
The exact error is:
import pyarrow._orc
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-65-ca95b0424328> in <module>
----> 1 import pyarrow._orc
ModuleNotFoundError: No module named 'pyarrow._orc'
A quick Google search suggests that both macOS and Windows are affected. It probably won't be fixed any time soon; their ticket says they can't find volunteers. I tried another common format, Parquet, and that one works fine:
df = pd.read_parquet(file_name)
Below is a ready-to-use function. It takes an ORC file directory as input, reads every file under that directory, and merges them into a single pandas DataFrame, which it returns. Why read multiple files? Because in big-data systems such as Hive or Spark, data (e.g., a Hive table) is split into small part files. ptn is a glob pattern; sometimes you want to skip non-data files such as the _SUCCESS file.
from pathlib import Path
import pandas as pd
import pyorc

def orc_to_df(path, ptn='**/*'):
    dfs = []
    # keep only regular files: '**/*' can also match subdirectories
    paths = [p for p in Path(path).glob(ptn) if p.is_file()]
    print("to read files: ")
    for p in paths:
        print(p)
    i = 0
    for p in paths:
        with open(p, 'rb') as f:
            reader = pyorc.Reader(f)
            columns = reader.schema.fields
            # sort by column id to ensure correct order
            # (since "fields" is a dict, order may not be correct)
            columns = [y for x, y in sorted((reader.schema.find_column_id(c), c) for c in columns)]
            df = pd.DataFrame(reader, columns=columns)
        dfs += [df]
        i += 1
    print(f'loaded {i} part file(s).')
    # ignore_index avoids duplicate row indices across part files
    df = pd.concat(dfs, ignore_index=True)
    return df