Pandas是一个非常强大的数据分析方面的Python package. 如果是做Machine Learning或者数据分析, 掌握Pandas很省去很多麻烦. 许多Machine Learning前期的数据处理也是用Pandas做得.
IBKR(Interactive Brokers, 有时简称IB, 中文叫盈透证券)是美国老牌券商, 也是我的主要使用的券商. 又到了辞旧迎新的时候, 需要看看2020投资收益, 于是趁新年长周末写点小程序做点数据分析. 而这正好覆盖了Pandas的各种常用functions.
下载IB Statement
IB Statement提供多种方式下载, 比如html, pdf, csv. 用作数据处理选csv. 内容大概长这样
IB statement例子
Statement这个column是内容, 里面有很多项, 而这里我只看“Realized & Unrealized Performance Summary”, 然后相同的第一列后面的列数都是一样的. 上图是Jupyter Lab的显示有问题. 因为文件其实可以看成很多CSV files连在一起, 而第一列可以看成是小csv的文件名. 然后第二列是Header或者Data. Header那行就是告诉你后面的Data行里每一列都是什么. 于是我们可以把数据读到内存里.
建立Pandas DataFrame
import pandas as pd
field = 'Realized & Unrealized Performance Summary'
f = open('statement.csv', 'r')
rows = []
for line in f:
cols = line.strip().split(',')
if cols[0] == field:
if cols[1] == 'Header':
header_row = cols[2:]
else:
rows.append(cols[2:])
header_ros是个list of string
['Asset Category',
'Symbol',
'Cost Adj.',
'Realized S/T Profit',
'Realized S/T Loss',
'Realized L/T Profit',
'Realized L/T Loss',
'Realized Total',
'Unrealized S/T Profit',
'Unrealized S/T Loss',
'Unrealized L/T Profit',
'Unrealized L/T Loss',
'Unrealized Total',
'Total',
'Code']
而rows是list of list. 就是list of rows
[['Stocks',
'AAPL',
'0',
'0',
'0',
'0',
'0',
'0',
'29564.510608',
'-14.176325',
'0',
'0',
'29550.334283',
'29550.334283',
'\n'],
['Stocks',
'AMD',
'0',
'0',
'0',
'0',
'0',
'0',
'220.208447',
'0',
'0',
'0',
'220.208447',
'220.208447',
'\n'],
...
]
接下来就可以用Panda create dataframe了
df = pd.DataFrame(rows, columns = header_row)
df.head(5)
这里有个问题. Created出来的dataframe的数据类型有数字有string, 全部都是object. 如果sort by数字column得到的结果是sort by string, 并不是我们想要的. 可以用dtypes来确定
df.dtypes
Asset Category object
Symbol object
Cost Adj. object
Realized S/T Profit object
Realized S/T Loss object
Realized L/T Profit object
Realized L/T Loss object
Realized Total object
Unrealized S/T Profit object
Unrealized S/T Loss object
Unrealized L/T Profit object
Unrealized L/T Loss object
Unrealized Total object
Total object
Code object
dtype: object
这里我把需要转换成数字的columns转换了
header_text_col = set(['Asset Category', 'Symbol', 'Code'])
num_col = list(set(header_row) - header_text_col)
df[num_col] = df[num_col].apply(pd.to_numeric)
df.dtypes
Asset Category object
Symbol object
Cost Adj. int64
Realized S/T Profit float64
Realized S/T Loss float64
Realized L/T Profit int64
Realized L/T Loss float64
Realized Total float64
Unrealized S/T Profit float64
Unrealized S/T Loss float64
Unrealized L/T Profit int64
Unrealized L/T Loss float64
Unrealized Total float64
Total float64
Code object
dtype: object
这样就可以对数字类型的column进行排序了, 比如下面指令做2步操作 1. 选出Asset Category是”Stocks”的rows, 相当于SQL里的where, 语法为df[df['Asset Category'] == 'Stock']
, return是另一个filtered好的dataframe 2. 按Realized Total从小到大排序, return也是另一排序好的dataframe
df[df['Asset Category'] == 'Stock'].sort_values(by=['Realized Total'], ascending=True)
还可以画图
stock_gain = df[
(df['Asset Category'] == 'Stocks') & (
df['Realized Total'] != 0.0)][['Symbol', 'Realized Total']].sort_values(
by=['Realized Total'], ascending=True)
stock_gain
还可以画图
stock_gain.plot.bar(x='Symbol')
技术小结
建立dataframe
rows是list of rows(lists), header_row是list of string
df = pd.DataFrame(rows, columns = header_row)
显示前n行
df.head(10)
显示后面n行
df.tail(10)
把特定columns转换成数字(numerical type)
num_col是list of strings, 下面命令会把num_col里的columns都转化成数字类. 类似SQL里的select cast(col1 as double), cast(col2 as double), ... from df
df[num_col] = df[num_col].apply(pd.to_numeric)
Filter和sort
类似SQL里, select * from df where asset_category = 'Stocks' order by realized_total asc;
df[df['Asset Category'] == 'Stocks'].sort_values(by=['Realized Total'], ascending=True)