How to read an ORC file and convert it to a pandas DataFrame

In theory, pandas provides a tool for this:

import pandas as pd

df = pd.read_orc(file_name)

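However, pandas relies on a library called pyarrow under the hood, and the package that pip install delivers has the pyarrow._orc module removed, because they ran into some C++ link errors with ORC.

This YouTube video explains the problem clearly and gives a workaround, which is to use another library, pyorc.

The specific error is: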

import pyarrow._orc
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-65-ca95b0424328> in <module>
----> 1 import pyarrow._orc

ModuleNotFoundError: No module named 'pyarrow._orc'
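
Before choosing a reader, it can help to check which modules actually import in the current environment. The following is only a minimal sketch: can_import is a hypothetical helper name, not part of pandas, pyarrow, or pyorc, and it simply attempts the import and reports whether it succeeded.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so the
        # missing pyarrow._orc case shown above is covered here.
        return False
    return True


print("pyarrow ORC support:", can_import("pyarrow._orc"))
print("pyorc available:", can_import("pyorc"))
```

If the first line prints False and the second prints True, pd.read_orc is likely to fail in this environment while pyorc remains usable.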

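I googled it, and it seems both Mac and Windows have this problem. It probably won't be fixed any time soon; their ticket says they can't find a volunteer. I tried another common format, parquet, and that works fine.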

df = pd.read_parquet(file_name)

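Here is a ready-to-use function I wrote. It takes a directory of ORC files, reads every file under that directory, merges them into a single pandas DataFrame, and returns it. Why read multiple files? Because in big-data systems such as Hive or Spark, data (for example, a Hive table) is split into small files (part files). ptn is a glob pattern; sometimes you want to skip non-data files, such as the _SUCCESS file.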

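from pathlib import Path

import pandas as pd
import pyorc


def orc_to_df(path, ptn='**/*'):
    """Read every file under `path` matching `ptn` and return one combined DataFrame."""
    # keep regular files only; '**/*' also matches directories, which cannot be opened
    paths = [p for p in Path(path).glob(ptn) if p.is_file()]
    print("to read files: ")
    for p in paths:
        print(p)

    dfs = []
    for i, p in enumerate(paths, start=1):
        # the with block closes the file once its rows have been loaded
        with open(p, 'rb') as f:
            reader = pyorc.Reader(f)
            columns = reader.schema.fields

            # sort by column id to ensure correct order (since "fields" is a dict, order may not be correct)
            columns = [y for x, y in sorted([(reader.schema.find_column_id(c), c) for c in columns])]
            df = pd.DataFrame(reader, columns=columns)
        dfs.append(df)
        print(f'loaded {i} part file(s).')

    # pd.concat raises ValueError on an empty list, so at least one file must match
    return pd.concat(dfs)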
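
Since ptn is handed to Path.glob, it is worth trying a pattern on the directory before loading anything. The sketch below uses only the standard library and a temporary directory; list_data_files is a hypothetical helper, and the file and directory names (_SUCCESS, dt=2020-01-01, part-00000.orc) are made up to mimic a typical Hive/Spark output layout.

```python
import tempfile
from pathlib import Path


def list_data_files(root, ptn="**/*"):
    """Return the sorted regular files under root that match the glob pattern."""
    return sorted(p for p in Path(root).glob(ptn) if p.is_file())


# Build a throwaway directory with one marker file and two part files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dt=2020-01-01").mkdir()
    for name in ["_SUCCESS",
                 "dt=2020-01-01/part-00000.orc",
                 "dt=2020-01-01/part-00001.orc"]:
        (root / name).touch()

    # The default pattern picks up the _SUCCESS marker as well.
    everything = [p.name for p in list_data_files(root)]
    # A narrower pattern keeps only the part files.
    parts_only = [p.name for p in list_data_files(root, "**/part-*")]

print(everything)
print(parts_only)
```

With a layout like this one, passing ptn='**/part-*' to the function would skip the _SUCCESS marker, which is not an ORC file and which the ORC reader would presumably reject.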
