Python/Pandas實戰: 處理IBKR Statement

Python/Pandas實戰: 處理盈透Statement

Pandas是一個非常強大的數據分析方面的Python package. 如果是做Machine Learning或者數據分析, 掌握Pandas很省去很多麻煩. 許多Machine Learning前期的數據處理也是用Pandas做得.

IBKR(Interactive Brokers, 有時簡稱IB, 中文叫盈透證券)是美國老牌券商, 也是我的主要使用的券商. 又到了辭舊迎新的時候, 需要看看2020投資收益, 於是趁新年長周末寫點小程序做點數據分析. 而這正好覆蓋了Pandas的各種常用functions.

下載IB Statement

IB Statement提供多種方式下載, 比如html, pdf, csv. 用作數據處理選csv. 內容大概長這樣

IB statement例子

Statement這個column是內容, 裏面有很多項, 而這裡我只看「Realized & Unrealized Performance Summary”, 然後相同的第一列後面的列數都是一樣的. 上圖是Jupyter Lab的顯示有問題. 因為文件其實可以看成很多CSV files連在一起, 而第一列可以看成是小csv的文件名. 然後第二列是Header或者Data. Header那行就是告訴你後面的Data行里每一列都是什麼. 於是我們可以把數據讀到內存里.

建立Pandas DataFrame

import pandas as pd

field = 'Realized & Unrealized Performance Summary'

f = open('statement.csv', 'r')

rows = []
for line in f:
    cols = line.strip().split(',')
    if cols[0] == field:
        if cols[1] == 'Header':
            header_row = cols[2:]
        else:
            rows.append(cols[2:])

header_ros是個list of string

['Asset Category',
 'Symbol',
 'Cost Adj.',
 'Realized S/T Profit',
 'Realized S/T Loss',
 'Realized L/T Profit',
 'Realized L/T Loss',
 'Realized Total',
 'Unrealized S/T Profit',
 'Unrealized S/T Loss',
 'Unrealized L/T Profit',
 'Unrealized L/T Loss',
 'Unrealized Total',
 'Total',
 'Code']

而rows是list of list. 就是list of rows

[['Stocks',
  'AAPL',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '29564.510608',
  '-14.176325',
  '0',
  '0',
  '29550.334283',
  '29550.334283',
  '\n'],
 ['Stocks',
  'AMD',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '220.208447',
  '0',
  '0',
  '0',
  '220.208447',
  '220.208447',
  '\n'],
...
]

接下來就可以用Panda create dataframe了

df = pd.DataFrame(rows, columns = header_row)
df.head(5)

這裡有個問題. Created出來的dataframe的數據類型有數字有string, 全部都是object. 如果sort by數字column得到的結果是sort by string, 並不是我們想要的. 可以用dtypes來確定

df.dtypes
Asset Category           object
Symbol                   object
Cost Adj.                object
Realized S/T Profit      object
Realized S/T Loss        object
Realized L/T Profit      object
Realized L/T Loss        object
Realized Total           object
Unrealized S/T Profit    object
Unrealized S/T Loss      object
Unrealized L/T Profit    object
Unrealized L/T Loss      object
Unrealized Total         object
Total                    object
Code                     object
dtype: object

這裡我把需要轉換成數字的columns轉換了

header_text_col = set(['Asset Category', 'Symbol', 'Code'])
num_col = list(set(header_row) - header_text_col)
df[num_col] = df[num_col].apply(pd.to_numeric)
df.dtypes
Asset Category            object
Symbol                    object
Cost Adj.                  int64
Realized S/T Profit      float64
Realized S/T Loss        float64
Realized L/T Profit        int64
Realized L/T Loss        float64
Realized Total           float64
Unrealized S/T Profit    float64
Unrealized S/T Loss      float64
Unrealized L/T Profit      int64
Unrealized L/T Loss      float64
Unrealized Total         float64
Total                    float64
Code                      object
dtype: object

這樣就可以對數字類型的column進行排序了, 比如下面指令做2步操作
1. 選出Asset Category是”Stocks”的rows, 相當於SQL里的where, 語法為df[df['Asset Category'] == 'Stock'], return是另一個filtered好的dataframe
2. 按Realized Total從小到大排序, return也是另一排序好的dataframe

df[df['Asset Category'] == 'Stock'].sort_values(by=['Realized Total'], ascending=True)

還可以畫圖

stock_gain = df[
    (df['Asset Category'] == 'Stocks') & (
        df['Realized Total'] != 0.0)][['Symbol', 'Realized Total']].sort_values(
    by=['Realized Total'], ascending=True)
stock_gain

還可以畫圖

stock_gain.plot.bar(x='Symbol')

技術小結

建立dataframe

rows是list of rows(lists), header_row是list of string

df = pd.DataFrame(rows, columns = header_row)

顯示前n行

df.head(10)

顯示後面n行

df.tail(10)

把特定columns轉換成數字(numerical type)

num_col是list of strings, 下面命令會把num_col里的columns都轉化成數字類. 類似SQL里的select cast(col1 as double), cast(col2 as double), ... from df

df[num_col] = df[num_col].apply(pd.to_numeric)

Filter和sort

類似SQL里, select * from df where asset_category = 'Stocks' order by realized_total asc;

df[df['Asset Category'] == 'Stocks'].sort_values(by=['Realized Total'], ascending=True)

Leave a Comment

Your email address will not be published.