Python/Pandas实战: 处理IBKR Statement

Python/Pandas实战: 处理盈透Statement

Pandas是一个非常强大的数据分析方面的Python package. 如果是做Machine Learning或者数据分析, 掌握Pandas很省去很多麻烦. 许多Machine Learning前期的数据处理也是用Pandas做得.

IBKR(Interactive Brokers, 有时简称IB, 中文叫盈透证券)是美国老牌券商, 也是我的主要使用的券商. 又到了辞旧迎新的时候, 需要看看2020投资收益, 于是趁新年长周末写点小程序做点数据分析. 而这正好覆盖了Pandas的各种常用functions.

下载IB Statement

IB Statement提供多种方式下载, 比如html, pdf, csv. 用作数据处理选csv. 内容大概长这样

IB statement例子

Statement这个column是内容, 里面有很多项, 而这里我只看“Realized & Unrealized Performance Summary”, 然后相同的第一列后面的列数都是一样的. 上图是Jupyter Lab的显示有问题. 因为文件其实可以看成很多CSV files连在一起, 而第一列可以看成是小csv的文件名. 然后第二列是Header或者Data. Header那行就是告诉你后面的Data行里每一列都是什么. 于是我们可以把数据读到内存里.

建立Pandas DataFrame

import pandas as pd

field = 'Realized & Unrealized Performance Summary'

f = open('statement.csv', 'r')

rows = []
for line in f:
    cols = line.strip().split(',')
    if cols[0] == field:
        if cols[1] == 'Header':
            header_row = cols[2:]
        else:
            rows.append(cols[2:])

header_ros是个list of string

['Asset Category',
 'Symbol',
 'Cost Adj.',
 'Realized S/T Profit',
 'Realized S/T Loss',
 'Realized L/T Profit',
 'Realized L/T Loss',
 'Realized Total',
 'Unrealized S/T Profit',
 'Unrealized S/T Loss',
 'Unrealized L/T Profit',
 'Unrealized L/T Loss',
 'Unrealized Total',
 'Total',
 'Code']

而rows是list of list. 就是list of rows

[['Stocks',
  'AAPL',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '29564.510608',
  '-14.176325',
  '0',
  '0',
  '29550.334283',
  '29550.334283',
  '\n'],
 ['Stocks',
  'AMD',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '220.208447',
  '0',
  '0',
  '0',
  '220.208447',
  '220.208447',
  '\n'],
...
]

接下来就可以用Panda create dataframe了

df = pd.DataFrame(rows, columns = header_row)
df.head(5)

这里有个问题. Created出来的dataframe的数据类型有数字有string, 全部都是object. 如果sort by数字column得到的结果是sort by string, 并不是我们想要的. 可以用dtypes来确定

df.dtypes
Asset Category           object
Symbol                   object
Cost Adj.                object
Realized S/T Profit      object
Realized S/T Loss        object
Realized L/T Profit      object
Realized L/T Loss        object
Realized Total           object
Unrealized S/T Profit    object
Unrealized S/T Loss      object
Unrealized L/T Profit    object
Unrealized L/T Loss      object
Unrealized Total         object
Total                    object
Code                     object
dtype: object

这里我把需要转换成数字的columns转换了

header_text_col = set(['Asset Category', 'Symbol', 'Code'])
num_col = list(set(header_row) - header_text_col)
df[num_col] = df[num_col].apply(pd.to_numeric)
df.dtypes
Asset Category            object
Symbol                    object
Cost Adj.                  int64
Realized S/T Profit      float64
Realized S/T Loss        float64
Realized L/T Profit        int64
Realized L/T Loss        float64
Realized Total           float64
Unrealized S/T Profit    float64
Unrealized S/T Loss      float64
Unrealized L/T Profit      int64
Unrealized L/T Loss      float64
Unrealized Total         float64
Total                    float64
Code                      object
dtype: object

这样就可以对数字类型的column进行排序了, 比如下面指令做2步操作
1. 选出Asset Category是”Stocks”的rows, 相当于SQL里的where, 语法为df[df['Asset Category'] == 'Stock'], return是另一个filtered好的dataframe
2. 按Realized Total从小到大排序, return也是另一排序好的dataframe

df[df['Asset Category'] == 'Stock'].sort_values(by=['Realized Total'], ascending=True)

还可以画图

stock_gain = df[
    (df['Asset Category'] == 'Stocks') & (
        df['Realized Total'] != 0.0)][['Symbol', 'Realized Total']].sort_values(
    by=['Realized Total'], ascending=True)
stock_gain

还可以画图

stock_gain.plot.bar(x='Symbol')

技术小结

建立dataframe

rows是list of rows(lists), header_row是list of string

df = pd.DataFrame(rows, columns = header_row)

显示前n行

df.head(10)

显示后面n行

df.tail(10)

把特定columns转换成数字(numerical type)

num_col是list of strings, 下面命令会把num_col里的columns都转化成数字类. 类似SQL里的select cast(col1 as double), cast(col2 as double), ... from df

df[num_col] = df[num_col].apply(pd.to_numeric)

Filter和sort

类似SQL里, select * from df where asset_category = 'Stocks' order by realized_total asc;

df[df['Asset Category'] == 'Stocks'].sort_values(by=['Realized Total'], ascending=True)

Leave a Comment

Your email address will not be published.