How to read an ORC file and convert it to a pandas DataFrame?

In theory, pandas provides a tool for this:

import pandas as pd

df = pd.read_orc(file_name)

However, pandas relies on a library called pyarrow under the hood, and the binary wheels installed via pip ship without the pyarrow._orc part; they have some ORC-related C++ link errors.
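Before relying on pd.read_orc, you can check whether your installed pyarrow build includes ORC support with a quick import test (a minimal sketch added here for illustration):

# Does this pyarrow build ship ORC support?
try:
    from pyarrow import orc  # the public ORC module, which needs pyarrow._orc
    print("ORC support available")
except ImportError as exc:
    print(f"no ORC support in this pyarrow build: {exc}")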

This YouTube video explains the issue clearly and offers a workaround: use another library, pyorc.

The exact error is:

import pyarrow._orc
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-65-ca95b0424328> in <module>
----> 1 import pyarrow._orc

ModuleNotFoundError: No module named 'pyarrow._orc'
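For a single file, the pyorc workaround looks like this (a minimal sketch; "data.orc" is a placeholder file name, and pyorc comes from pip install pyorc):

import pandas as pd
import pyorc

# Read one ORC file with pyorc instead of pyarrow.
with open('data.orc', 'rb') as f:
    reader = pyorc.Reader(f)  # iterates over rows as tuples
    df = pd.DataFrame(reader, columns=list(reader.schema.fields))

Note that the dict order of schema.fields may not match the file's column order; the full function below sorts columns by column id to handle this.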

A quick Google search suggests both Mac and Windows have this problem. It probably won't be fixed any time soon; their ticket says they can't find volunteers. I tried another common format, Parquet, which works fine:

df = pd.read_parquet(file_name)

Here I wrote a ready-to-use function. It takes an ORC file directory as input, reads every file under that directory, merges them into one pandas DataFrame, and returns it. Why read multiple files? Because in big-data systems such as Hive or Spark, data (e.g., a Hive table) is split into small part files. ptn is a glob pattern; sometimes you want to skip non-data files such as the _SUCCESS file.

from pathlib import Path

import pandas as pd
import pyorc


def orc_to_df(path, ptn='**/*'):
    dfs = []

    # Keep only regular files; '**/*' can also match subdirectories.
    paths = [p for p in Path(path).glob(ptn) if p.is_file()]
    print("to read files: ")
    for p in paths:
        print(p)

    for i, p in enumerate(paths, start=1):
        with open(p, 'rb') as f:
            reader = pyorc.Reader(f)
            columns = reader.schema.fields

            # Sort by column id to ensure correct order
            # (since "fields" is a dict, its order may not match the file's).
            columns = [name for _, name in
                       sorted((reader.schema.find_column_id(c), c) for c in columns)]
            df = pd.DataFrame(reader, columns=columns)
        dfs.append(df)
        print(f'loaded {i} part file(s).')

    # Reset the index so rows from different part files don't share labels.
    return pd.concat(dfs, ignore_index=True)
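For example, to load a Hive table whose part files sit under one directory (the path and pattern below are hypothetical):

# Read all part files of a Hive table output directory,
# skipping _SUCCESS and other non-data files via the glob pattern.
df = orc_to_df('/tmp/my_table', ptn='**/part-*')
print(df.shape)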
