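I have recently been working on making my scripts run faster, so I tried several ways of writing loops in Python and compared how fast each one runs. These are my notes.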
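1. Comparing the speed of different loops
Using Python's built-in for loop (written as a list comprehension), the map function, and the pandas and NumPy packages, each variant squares the integers from 0 to 99999. The "with filter" variants square only the odd numbers, so they do half the work.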
import numpy as np
import pandas as pd

func_dict = {
    '----- use list without filter -----': lambda x: [a ** 2 for a in range(x)],
    '----- use map without filter -----': lambda x: list(map(lambda a: a ** 2, range(x))),
    '----- use list with filter -----': lambda x: [a ** 2 for a in range(x) if a % 2],
    # list() forces evaluation; in Python 3 a bare map() is lazy and nothing would be computed
    '----- use map with filter -----': lambda x: list(map(lambda a: a ** 2, filter(lambda a: a % 2, range(x)))),
    '----- use numpy without change type -----': lambda x: np.array(range(x)) ** 2,
    '----- use numpy with change type -----': lambda x: (np.array(range(x)) ** 2).tolist(),
    '----- use pandas dataframe without change type -----': lambda x: pd.DataFrame(range(x)) ** 2,
    '----- use pandas dataframe with change type -----': lambda x: (pd.DataFrame(range(x)) ** 2)[0].values.tolist(),
    '----- use pandas series without change type -----': lambda x: pd.Series(range(x)) ** 2,
    '----- use pandas series with change type -----': lambda x: (pd.Series(range(x)) ** 2).values.tolist(),
}

# %timeit is an IPython/Jupyter magic, so run this loop in a notebook or an IPython shell
for s, func in func_dict.items():
    print(s)
    %timeit -n 100 func(100000)
    print('\n')
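The `%timeit` magic is only available in IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `bench` and the smaller repeat counts are my own assumptions, chosen to keep the run short; they are not the settings behind the numbers reported in this post.

```python
import timeit


def bench(func, n=100_000, number=3, repeat=3):
    """Return the best observed time in seconds for one call of func(n)."""
    timings = timeit.repeat(lambda: func(n), number=number, repeat=repeat)
    # Each entry in timings covers `number` calls, so divide to get per-call time
    return min(timings) / number


# Two of the variants from the dictionary above, under shorter labels
candidates = {
    'list comprehension': lambda x: [a ** 2 for a in range(x)],
    'map + lambda': lambda x: list(map(lambda a: a ** 2, range(x))),
}

results = {name: bench(func) for name, func in candidates.items()}
for name, seconds in results.items():
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```

Taking the minimum over the repeats is a common choice because slower runs are usually caused by other activity on the machine rather than by the code being measured.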
Results
----- use list without filter -----
27.3 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use map without filter -----
32.2 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use list with filter -----
17.2 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use map with filter -----
23.5 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use numpy without change type -----
9.45 ms ± 41.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use numpy with change type -----
11.9 ms ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use pandas dataframe without change type -----
755 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use pandas dataframe with change type -----
3.18 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use pandas series without change type -----
802 µs ± 87.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
----- use pandas series with change type -----
3.01 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
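- map with a lambda is not particularly fast; here it is even slower than the equivalent for loop written as a list comprehension.
- For numeric work, NumPy and pandas are both faster than a Python-level loop. pandas comes out surprisingly fast in this test, more than 10 times faster than NumPy. That gap most likely reflects how the input is built rather than the squaring itself: np.array(range(x)) walks the range object one element at a time, while pandas appears to special-case range input. pandas stores its data in NumPy arrays, so the arithmetic itself should cost about the same in both.
- NumPy and pandas are probably fast because the element-wise work runs in compiled C loops instead of in the Python interpreter, possibly with some low-level parallelism such as SIMD instructions.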
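One way to check where the NumPy time goes is to time the construction of the array separately from the squaring. The sketch below is not part of the original benchmark; it assumes that `np.arange(n)` is an acceptable stand-in for `np.array(range(n))`, and the variable names are made up for illustration.

```python
import timeit

import numpy as np

n = 100_000
runs = 20

from_range = np.array(range(n))   # built by iterating over the range object
from_arange = np.arange(n)        # filled directly by NumPy

# Both constructions hold the same values
same_values = np.array_equal(from_range, from_arange)

# Average per-call time of each step, in seconds
t_build_range = timeit.timeit(lambda: np.array(range(n)), number=runs) / runs
t_build_arange = timeit.timeit(lambda: np.arange(n), number=runs) / runs
t_square = timeit.timeit(lambda: from_arange ** 2, number=runs) / runs

print(f'np.array(range(n)): {t_build_range * 1e3:.3f} ms')
print(f'np.arange(n):       {t_build_arange * 1e3:.3f} ms')
print(f'array ** 2:         {t_square * 1e3:.3f} ms')
```

If the first line dominates on your machine, the NumPy rows in the results above are mostly measuring the conversion from `range`, not the arithmetic.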
2. Optimizing some common loop patterns
Prefer pd.Series over pd.DataFrame where possible
For example:
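# Assign a value to every column of df
for i in df.columns:
    df[i] = values
# If df has a single row, this can be rewritten as follows
sr = df.iloc[0].copy()  # copy so that the writes go to an independent Series
for i in sr.index:
    sr[i] = values
df.iloc[0] = sr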
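To make the pattern concrete, here is a small self-contained sketch of both versions. The one-row frame, the column names `a`, `b`, `c` and the constant `value` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0]], columns=['a', 'b', 'c'])
value = 7

# DataFrame version: one column assignment per iteration
df_by_column = df.copy()
for col in df_by_column.columns:
    df_by_column[col] = value

# Series version: update the single row as a Series, then write it back once
df_by_row = df.copy()
sr = df_by_row.iloc[0].copy()
for col in sr.index:
    sr[col] = value
df_by_row.iloc[0] = sr

print(df_by_column)
print(df_by_row)
```

Both frames end up holding the same row. The second version touches the DataFrame only twice, once to read the row and once to write it back, which is where any saving would come from.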
Replace explicit for loops with pandas' internal loops
For example:
# For each element of array, count how often it occurs in column idx of df
for i in array:
    count_df[i] = df[idx].apply(lambda x: x == i)
# Rewritten (index and values are the grouping column and the value column of your own data)
tmp_df = df.merge(pd.DataFrame({idx: array}), on=idx, how='inner')
count_df = pd.pivot_table(tmp_df, index=index, values=values, columns=idx, aggfunc='count', fill_value=0)
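The merge and pivot_table approach can be tried on toy data as follows. The column names `user`, `item` and `qty` and the list `wanted` are hypothetical stand-ins for `index`, `idx`, `values` and `array`.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u2'],
    'item': ['x', 'y', 'x', 'x', 'z'],
    'qty': [1, 1, 1, 1, 1],
})
wanted = ['x', 'y']  # the values we want to count

# The inner merge keeps only the rows whose 'item' appears in wanted
tmp_df = df.merge(pd.DataFrame({'item': wanted}), on='item', how='inner')

# One row per user, one column per wanted item, cells hold occurrence counts
count_df = pd.pivot_table(
    tmp_df,
    index='user',
    values='qty',
    columns='item',
    aggfunc='count',
    fill_value=0,
)
print(count_df)
```

Here `u2` bought `x` twice and never bought `y`, so its row reads 2 and 0, and the unwanted item `z` does not appear as a column.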