In theory, pandas provides a tool for this:
import pandas as pd
df = pd.read_orc(file_name)
However, under the hood pandas delegates ORC reading to a library called pyarrow, and the wheels distributed on pip were built with the pyarrow._orc module stripped out; the maintainers ran into ORC-related C++ link errors.
This YouTube video explains the situation clearly and offers a workaround: use another library, pyorc.
The exact error is:
import pyarrow._orc
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-65-ca95b0424328> in <module>
----> 1 import pyarrow._orc
ModuleNotFoundError: No module named 'pyarrow._orc'
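If you would rather degrade gracefully than crash on this import, you can probe for the missing module first. A minimal sketch; `has_pyarrow_orc` is a name made up for illustration, not part of any library:

```python
import importlib


def has_pyarrow_orc():
    """Return True if the installed pyarrow wheel ships ORC support."""
    try:
        # pyarrow.orc is the public ORC module; it imports pyarrow._orc internally.
        importlib.import_module("pyarrow.orc")
        return True
    except ImportError:  # also covers ModuleNotFoundError and a missing pyarrow
        return False


print(has_pyarrow_orc())
```

With a check like this you can fall back to a pyorc-based reader only when needed.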
A quick Google search suggests both macOS and Windows are affected. It probably won't be fixed any time soon; their ticket says they can't find a volunteer. I tried another common format, Parquet, and that one works fine:
df = pd.read_parquet(file_name)
Below is a ready-to-use function. It takes an ORC file directory, reads every file under it, concatenates them into a single pandas DataFrame, and returns it. Why read multiple files? Because big-data systems such as Hive and Spark split data (for example, a Hive table) into small part files. ptn is a glob pattern; sometimes you want to skip non-data files such as the _SUCCESS marker.
import pandas as pd
import pyorc
from pathlib import Path

def orc_to_df(path, ptn='**/*'):
    dfs = []
    # Keep only regular files: '**/*' also matches subdirectories.
    paths = [p for p in Path(path).glob(ptn) if p.is_file()]
    print("to read files:")
    for p in paths:
        print(p)
    for p in paths:
        with open(p, 'rb') as f:
            reader = pyorc.Reader(f)
            columns = reader.schema.fields
            # "fields" is a dict, so sort by column id to restore the schema's column order.
            columns = [name for _, name in sorted(
                (reader.schema.find_column_id(c), c) for c in columns)]
            dfs.append(pd.DataFrame(reader, columns=columns))
    print(f'loaded {len(dfs)} part file(s).')
    return pd.concat(dfs)
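The effect of the ptn argument can be seen with the standard library alone. The sketch below builds a directory that mimics Hive/Spark output (two part files plus a _SUCCESS marker; these file names are made up for the demo) and compares a catch-all pattern with a stricter one:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    # Fake the layout a Hive/Spark job leaves behind.
    for name in ("part-00000", "part-00001", "_SUCCESS"):
        (root / name).touch()
    everything = sorted(p.name for p in root.glob("*"))
    # A pattern like this is what you would pass as ptn to skip _SUCCESS.
    data_only = sorted(p.name for p in root.glob("part-*"))
```

`everything` picks up the _SUCCESS marker as well, while `data_only` keeps just the part files, which is exactly why the function exposes ptn.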