Tricks for efficient loops in Python


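I have recently been optimizing the runtime of some scripts, so I tried out the various ways of writing loops in Python and compared how fast each one runs. These are my notes.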

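1. Comparing the speed of each kind of loop

Each variant squares 100,000 numbers, using Python's built-in for loop (written as a list comprehension), the map function, and the pandas and numpy packages.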

import numpy as np
import pandas as pd

# Each entry maps a label to a function that squares the integers in range(x).
# "with filter" keeps odd numbers only; "with change type" converts the result back to a list.
func_dict = {
    '----- use list without filter -----'                 : lambda x: [a ** 2 for a in range(x)],
    '----- use map without filter -----'                  : lambda x: list(map(lambda a: a ** 2, range(x))),
    '----- use list with filter -----'                    : lambda x: [a ** 2 for a in range(x) if a % 2],
    # In Python 3, map is lazy, so list() is needed to actually run the computation
    '----- use map with filter -----'                     : lambda x: list(map(lambda a: a ** 2, filter(lambda a: a % 2, range(x)))),
    '----- use numpy without change type -----'           : lambda x: np.array(range(x)) ** 2,
    '----- use numpy with change type -----'              : lambda x: (np.array(range(x)) ** 2).tolist(),
    '----- use pandas dataframe without change type -----': lambda x: pd.DataFrame(range(x)) ** 2,
    '----- use pandas dataframe with change type -----'   : lambda x: (pd.DataFrame(range(x)) ** 2)[0].values.tolist(),
    '----- use pandas series without change type -----'   : lambda x: pd.Series(range(x)) ** 2,
    '----- use pandas series with change type -----'      : lambda x: (pd.Series(range(x)) ** 2).values.tolist(),
}

# %timeit is an IPython magic, so run this in IPython or a Jupyter notebook
for s, func in func_dict.items():
    print(s)
    %timeit -n 100 func(100000)
    print('\n')

Results

----- use list without filter -----
27.3 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use map without filter -----
32.2 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use list with filter -----
17.2 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use map with filter -----
23.5 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use numpy without change type -----
9.45 ms ± 41.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use numpy with change type -----
11.9 ms ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use pandas dataframe without change type -----
755 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use pandas dataframe with change type -----
3.18 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use pandas series without change type -----
802 µs ± 87.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


----- use pandas series with change type -----
3.01 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
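
  • map is not particularly efficient. Here it is even slower than the plain for loop: 32.2 ms versus 27.3 ms without a filter, and 23.5 ms versus 17.2 ms with one.
  • For numerical work, both numpy and pandas beat Python-level loops. pandas looks remarkably fast in this test, more than 10 times faster than numpy (755 µs versus 9.45 ms). Much of that gap probably comes from how the input is built rather than from the squaring itself: np.array(range(x)) walks the range element by element, whereas pandas appears to convert a range into an array directly, so np.arange(x) would be a fairer numpy baseline.
  • numpy and pandas are probably fast because they call straight into low-level C routines, so the loop runs in compiled code instead of the interpreter. Some parallelism (SIMD instructions, for example) may also be involved.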
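
One way to check how much of the numpy/pandas gap comes from building the input, rather than from the squaring, is to time np.arange alongside np.array(range(...)). The sketch below is only an illustration. It uses the standard-library timeit module instead of %timeit, the names N and candidates are made up here, and the dtype is pinned to int64 as an assumption so that every variant computes the same values on every platform. No timings are listed because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000  # same problem size as the benchmark above

# Different ways to build the input before squaring it.
candidates = {
    "np.array(range(N)) ** 2": lambda: np.array(range(N), dtype=np.int64) ** 2,
    "np.arange(N) ** 2": lambda: np.arange(N, dtype=np.int64) ** 2,
    "pd.Series(range(N)) ** 2": lambda: pd.Series(range(N), dtype="int64") ** 2,
    "pd.Series(np.arange(N)) ** 2": lambda: pd.Series(np.arange(N, dtype=np.int64)) ** 2,
}

# Sanity check: every variant must produce the same values.
reference = np.arange(N, dtype=np.int64) ** 2
for label, func in candidates.items():
    assert np.array_equal(np.asarray(func()), reference), label

# Report the best average time per call over a few repeats.
CALLS = 10
for label, func in candidates.items():
    best = min(timeit.repeat(func, number=CALLS, repeat=3)) / CALLS
    print(f"{label:32s} {best * 1e3:8.3f} ms per call")
```

If the np.arange row lands close to the pandas rows, the difference seen above is mostly construction cost, not a faster squaring routine in pandas.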

2. Optimizing some common loop patterns

  • Prefer pd.Series over pd.DataFrame where possible

    For example

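    # Assign a value to every column of df
    for i in df.columns:
        df[i] = values
    
    # If df has only one row, this can be rewritten as follows
    sr = df.iloc[0].copy()
    for i in sr.index:
        sr[i] = values  # assign to the element, not to the name sr
    df.iloc[0] = sr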
  • Replace explicit for loops with pandas' internal loops

    For example:

    # For each element of array, count how often it appears in the idx column of df
    for i in array:
        count_df[i] = df[idx].apply(lambda x: x == i)
    
    # Rewritten as
    # Keep only the rows whose idx value appears in array
    tmp_df = df.merge(pd.DataFrame({idx: array}), how='inner')
    count_df = pd.pivot_table(tmp_df, index=index, values=values, columns=idx, aggfunc='count', fill_value=0)
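
The idea of pushing the loop into pandas can be tried on a small self-contained case. The sketch below is a simplified illustration, not the merge and pivot_table code above. It counts occurrences with Series.value_counts instead, and the column name key and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; "key" stands in for the idx column.
df = pd.DataFrame({"key": ["a", "b", "a", "c", "a", "b", "d"]})
array = ["a", "b", "e"]

# Explicit Python-level loop: one full pass over the column per element.
loop_counts = {item: int((df["key"] == item).sum()) for item in array}

# Vectorized: count every distinct value once, then look up the wanted ones.
# reindex fills elements that never occur in the column with 0.
vec_counts = (
    df["key"].value_counts().reindex(array, fill_value=0).astype(int).to_dict()
)

print(loop_counts)  # {'a': 3, 'b': 2, 'e': 0}
print(vec_counts)   # same counts as the loop version
```

The loop scans the column once per element of array, while value_counts scans it once in total, which is where the saving comes from as array grows.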
