Python/Pandas实战: 处理IBKR Statement

feellikelearning January 2, 2021 0 Comments

Pandas是一个非常强大的数据分析方面的Python package. 如果是做Machine Learning或者数据分析, 掌握Pandas很省去很多麻烦. 许多Machine Learning前期的数据处理也是用Pandas做得.

IBKR(Interactive Brokers, 有时简称IB, 中文叫盈透证券)是美国老牌券商, 也是我的主要使用的券商. 又到了辞旧迎新的时候, 需要看看2020投资收益, 于是趁新年长周末写点小程序做点数据分析. 而这正好覆盖了Pandas的各种常用functions.

下载IB Statement

IB Statement提供多种方式下载, 比如html, pdf, csv. 用作数据处理选csv. 内容大概长这样

Statement这个column是内容, 里面有很多项, 而这里我只看“Realized & Unrealized Performance Summary”, 然后相同的第一列后面的列数都是一样的. 上图是Jupyter Lab的显示有问题. 因为文件其实可以看成很多CSV files连在一起, 而第一列可以看成是小csv的文件名. 然后第二列是Header或者Data. Header那行就是告诉你后面的Data行里每一列都是什么. 于是我们可以把数据读到内存里.

建立Pandas DataFrame

import pandas as pd

field = 'Realized & Unrealized Performance Summary'

f = open('statement.csv', 'r')

rows = []

for line in f:

cols = line.strip().split(',')

if cols[0] == field:

if cols[1] == 'Header':

header_row = cols[2:]

else:

rows.append(cols[2:])

import pandas as pd field = 'Realized & Unrealized Performance Summary' f = open('statement.csv', 'r') rows = [] for line in f: cols = line.strip().split(',') if cols[0] == field: if cols[1] == 'Header': header_row = cols[2:] else: rows.append(cols[2:])

import pandas as pd

field = 'Realized & Unrealized Performance Summary'

f = open('statement.csv', 'r')

rows = []
for line in f:
    cols = line.strip().split(',')
    if cols[0] == field:
        if cols[1] == 'Header':
            header_row = cols[2:]
        else:
            rows.append(cols[2:])

header_ros是个list of string

['Asset Category',

'Symbol',

'Cost Adj.',

'Realized S/T Profit',

'Realized S/T Loss',

'Realized L/T Profit',

'Realized L/T Loss',

'Realized Total',

'Unrealized S/T Profit',

'Unrealized S/T Loss',

'Unrealized L/T Profit',

'Unrealized L/T Loss',

'Unrealized Total',

'Total',

'Code']

['Asset Category', 'Symbol', 'Cost Adj.', 'Realized S/T Profit', 'Realized S/T Loss', 'Realized L/T Profit', 'Realized L/T Loss', 'Realized Total', 'Unrealized S/T Profit', 'Unrealized S/T Loss', 'Unrealized L/T Profit', 'Unrealized L/T Loss', 'Unrealized Total', 'Total', 'Code']

['Asset Category',
 'Symbol',
 'Cost Adj.',
 'Realized S/T Profit',
 'Realized S/T Loss',
 'Realized L/T Profit',
 'Realized L/T Loss',
 'Realized Total',
 'Unrealized S/T Profit',
 'Unrealized S/T Loss',
 'Unrealized L/T Profit',
 'Unrealized L/T Loss',
 'Unrealized Total',
 'Total',
 'Code']

而rows是list of list. 就是list of rows

[['Stocks',

'AAPL',

'0',

'29564.510608',

'-14.176325',

'0',

'29550.334283',

'\n'],

['Stocks',

'AMD',

'0',

'220.208447',

'0',

'220.208447',

'\n'],

...

]

[['Stocks', 'AAPL', '0', '0', '0', '0', '0', '0', '29564.510608', '-14.176325', '0', '0', '29550.334283', '29550.334283', '\n'], ['Stocks', 'AMD', '0', '0', '0', '0', '0', '0', '220.208447', '0', '0', '0', '220.208447', '220.208447', '\n'], ... ]

[['Stocks',
  'AAPL',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '29564.510608',
  '-14.176325',
  '0',
  '0',
  '29550.334283',
  '29550.334283',
  '\n'],
 ['Stocks',
  'AMD',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '220.208447',
  '0',
  '0',
  '0',
  '220.208447',
  '220.208447',
  '\n'],
...
]

接下来就可以用Panda create dataframe了

df = pd.DataFrame(rows, columns = header_row)

df.head(5)

df = pd.DataFrame(rows, columns = header_row) df.head(5)

df = pd.DataFrame(rows, columns = header_row)
df.head(5)

这里有个问题. Created出来的dataframe的数据类型有数字有string, 全部都是object. 如果sort by数字column得到的结果是sort by string, 并不是我们想要的. 可以用dtypes来确定

df.dtypes

df.dtypes

Asset Category object

Symbol object

Cost Adj. object

Realized S/T Profit object

Realized S/T Loss object

Realized L/T Profit object

Realized L/T Loss object

Realized Total object

Unrealized S/T Profit object

Unrealized S/T Loss object

Unrealized L/T Profit object

Unrealized L/T Loss object

Unrealized Total object

Total object

Code object

dtype: object

Asset Category object Symbol object Cost Adj. object Realized S/T Profit object Realized S/T Loss object Realized L/T Profit object Realized L/T Loss object Realized Total object Unrealized S/T Profit object Unrealized S/T Loss object Unrealized L/T Profit object Unrealized L/T Loss object Unrealized Total object Total object Code object dtype: object

Asset Category           object
Symbol                   object
Cost Adj.                object
Realized S/T Profit      object
Realized S/T Loss        object
Realized L/T Profit      object
Realized L/T Loss        object
Realized Total           object
Unrealized S/T Profit    object
Unrealized S/T Loss      object
Unrealized L/T Profit    object
Unrealized L/T Loss      object
Unrealized Total         object
Total                    object
Code                     object
dtype: object

这里我把需要转换成数字的columns转换了

header_text_col = set(['Asset Category', 'Symbol', 'Code'])

num_col = list(set(header_row) - header_text_col)

df[num_col] = df[num_col].apply(pd.to_numeric)

df.dtypes

header_text_col = set(['Asset Category', 'Symbol', 'Code']) num_col = list(set(header_row) - header_text_col) df[num_col] = df[num_col].apply(pd.to_numeric) df.dtypes

header_text_col = set(['Asset Category', 'Symbol', 'Code'])
num_col = list(set(header_row) - header_text_col)
df[num_col] = df[num_col].apply(pd.to_numeric)
df.dtypes

Asset Category object

Symbol object

Cost Adj. int64

Realized S/T Profit float64

Realized S/T Loss float64

Realized L/T Profit int64

Realized L/T Loss float64

Realized Total float64

Unrealized S/T Profit float64

Unrealized S/T Loss float64

Unrealized L/T Profit int64

Unrealized L/T Loss float64

Unrealized Total float64

Total float64

Code object

dtype: object

Asset Category object Symbol object Cost Adj. int64 Realized S/T Profit float64 Realized S/T Loss float64 Realized L/T Profit int64 Realized L/T Loss float64 Realized Total float64 Unrealized S/T Profit float64 Unrealized S/T Loss float64 Unrealized L/T Profit int64 Unrealized L/T Loss float64 Unrealized Total float64 Total float64 Code object dtype: object

Asset Category            object
Symbol                    object
Cost Adj.                  int64
Realized S/T Profit      float64
Realized S/T Loss        float64
Realized L/T Profit        int64
Realized L/T Loss        float64
Realized Total           float64
Unrealized S/T Profit    float64
Unrealized S/T Loss      float64
Unrealized L/T Profit      int64
Unrealized L/T Loss      float64
Unrealized Total         float64
Total                    float64
Code                      object
dtype: object

这样就可以对数字类型的column进行排序了, 比如下面指令做2步操作
1. 选出Asset Category是”Stocks”的rows, 相当于SQL里的where, 语法为df[df['Asset Category'] == 'Stock'], return是另一个filtered好的dataframe
2. 按Realized Total从小到大排序, return也是另一排序好的dataframe

df[df['Asset Category'] == 'Stock'].sort_values(by=['Realized Total'], ascending=True)

df[df['Asset Category'] == 'Stock'].sort_values(by=['Realized Total'], ascending=True)

还可以画图

stock_gain = df[

(df['Asset Category'] == 'Stocks') & (

df['Realized Total'] != 0.0)][['Symbol', 'Realized Total']].sort_values(

by=['Realized Total'], ascending=True)

stock_gain

stock_gain = df[ (df['Asset Category'] == 'Stocks') & ( df['Realized Total'] != 0.0)][['Symbol', 'Realized Total']].sort_values( by=['Realized Total'], ascending=True) stock_gain

stock_gain = df[
    (df['Asset Category'] == 'Stocks') & (
        df['Realized Total'] != 0.0)][['Symbol', 'Realized Total']].sort_values(
    by=['Realized Total'], ascending=True)
stock_gain

还可以画图

stock_gain.plot.bar(x='Symbol')

stock_gain.plot.bar(x='Symbol')

技术小结

建立dataframe

rows是list of rows(lists), header_row是list of string

df = pd.DataFrame(rows, columns = header_row)

df = pd.DataFrame(rows, columns = header_row)

显示前n行

df.head(10)

df.head(10)

显示后面n行

df.tail(10)

df.tail(10)

把特定columns转换成数字(numerical type)

num_col是list of strings, 下面命令会把num_col里的columns都转化成数字类. 类似SQL里的select cast(col1 as double), cast(col2 as double), ... from df

df[num_col] = df[num_col].apply(pd.to_numeric)

df[num_col] = df[num_col].apply(pd.to_numeric)

Filter和sort

类似SQL里, select * from df where asset_category = 'Stocks' order by realized_total asc;

df[df['Asset Category'] == 'Stocks'].sort_values(by=['Realized Total'], ascending=True)

df[df['Asset Category'] == 'Stocks'].sort_values(by=['Realized Total'], ascending=True)

Python/Pandas实战: 处理IBKR Statement

下载IB Statement

建立Pandas DataFrame

技术小结

建立dataframe

显示前n行

显示后面n行

把特定columns转换成数字(numerical type)

Filter和sort

Related Posts

Leave a Comment Cancel reply