PYTHON DATA VISUALIZATION – PANDAS #2

Drawing a Stacked Area Plot

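To see how membership across the data science meetup groups is distributed over time, regroup the data by week and by meetup group. Compute a running total of members for each group with cumsum, then divide each row by its total across all groups, which gives each group's share of the total membership. The division uses the div method with the axis parameter set to 'index'.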
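The cumsum-then-div step can be tried on a tiny table first. This is a minimal sketch with made-up numbers; the names weekly, cumulative, row_totals, and shares, and the two groups, are hypothetical and not part of the meetup data.

```python
import pandas as pd

# Hypothetical weekly sign-up counts for two made-up groups.
weekly = pd.DataFrame(
    {"group_a": [1, 3, 0], "group_b": [1, 1, 2]},
    index=pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-19"]),
)

# Running total of members per group.
cumulative = weekly.cumsum()

# Total members across all groups for each week (one value per row).
row_totals = cumulative.sum(axis="columns")

# axis="index" aligns row_totals with the row labels, so each row
# is divided by its own total.
shares = cumulative.div(row_totals, axis="index")
print(shares)
```

With the default axis='columns', pandas would try to match the dates in row_totals against the column names instead, which should produce missing values rather than shares.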

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# Parse the join_date column as timestamps, then set it as the index.
meetup = pd.read_csv('data/meetup_groups.csv', parse_dates=['join_date'], index_col='join_date')
meetup
Out[2]:
                                          group     city  state  country
join_date
2016-11-18 02:41:29    houston machine learning  Houston     TX       us
2017-05-09 14:16:37    houston machine learning  Houston     TX       us
2016-12-30 02:34:16    houston machine learning  Houston     TX       us
2016-07-18 00:48:17    houston machine learning  Houston     TX       us
2017-05-25 12:58:16    houston machine learning  Houston     TX       us
...                                         ...      ...    ...      ...
2017-10-07 18:05:24  houston data visualization  Houston     TX       us
2017-06-24 14:06:26  houston data visualization  Houston     TX       us
2015-10-05 17:08:40  houston data visualization  Houston     TX       us
2016-11-04 22:36:24  houston data visualization  Houston     TX       us
2016-08-02 17:47:29  houston data visualization  Houston     TX       us

7671 rows × 4 columns

In [3]:
(meetup
 .groupby([pd.Grouper(freq='W'), 'group'])
 .size()
)
Out[3]:
join_date   group                       
2010-11-07  houstonr                         5
2010-11-14  houstonr                        11
2010-11-21  houstonr                         2
2010-12-05  houstonr                         1
2011-01-16  houstonr                         2
                                            ..
2017-10-15  houston data science            14
            houston data visualization      13
            houston energy data science      9
            houston machine learning        11
            houstonr                         2
Length: 763, dtype: int64
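To make the weekly grouping easier to follow, here is a small sketch on made-up join records. The frame toy and its dates are hypothetical; the sketch only assumes that freq='W' means weeks ending on Sunday, which matches the Sunday dates in the output above.

```python
import pandas as pd

# Hypothetical join records; the group names and dates are made up.
toy = pd.DataFrame(
    {"group": ["a", "a", "b", "a"]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-01-04", "2020-01-08"]
    ).rename("join_date"),
)

# pd.Grouper(freq='W') bins the DatetimeIndex into weeks ending on Sunday;
# 'group' then splits each week by group name.
counts = toy.groupby([pd.Grouper(freq="W"), "group"]).size()
print(counts)
```

The first three records fall in the week ending 2020-01-05 and the last one in the week ending 2020-01-12, so the result has one row per (week, group) pair that actually occurs.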
In [4]:
(meetup.groupby([pd.Grouper(freq='W'), 'group'])
 .size()
 .unstack('group', fill_value=0))
Out[4]:
group       houston data science  houston data visualization  houston energy data science  houston machine learning  houstonr
join_date
2010-11-07                     0                           0                            0                         0         5
2010-11-14                     0                           0                            0                         0        11
2010-11-21                     0                           0                            0                         0         2
2010-12-05                     0                           0                            0                         0         1
2011-01-16                     0                           0                            0                         0         2
...                          ...                         ...                          ...                       ...       ...
2017-09-17                    16                           2                            6                         5         0
2017-09-24                    19                           4                           16                        12         7
2017-10-01                    20                           6                            6                        20         1
2017-10-08                    22                          10                           10                         4         2
2017-10-15                    14                          13                            9                        11         2

278 rows × 5 columns
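The role of fill_value=0 in unstack is easier to see on a small Series. The sketch below uses a hypothetical two-level index with made-up counts, where group 'b' has no entry for the second week.

```python
import pandas as pd

# Hypothetical weekly counts with a two-level index (week, group).
index = pd.MultiIndex.from_tuples(
    [("2020-01-05", "a"), ("2020-01-05", "b"), ("2020-01-12", "a")],
    names=["join_date", "group"],
)
counts = pd.Series([2, 1, 1], index=index)

# Move the 'group' level into the columns. Group 'b' has no entry
# for the second week, so fill_value=0 fills that cell with 0.
wide = counts.unstack("group", fill_value=0)
print(wide)
```

Without fill_value, the missing cell would be NaN, and it would stay NaN after cumsum instead of carrying the running total forward.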

In [5]:
(meetup.groupby([pd.Grouper(freq='W'), 'group'])
 .size()
 .unstack('group', fill_value=0)
 .cumsum()
)
Out[5]:
group       houston data science  houston data visualization  houston energy data science  houston machine learning  houstonr
join_date
2010-11-07                     0                           0                            0                         0         5
2010-11-14                     0                           0                            0                         0        16
2010-11-21                     0                           0                            0                         0        18
2010-12-05                     0                           0                            0                         0        19
2011-01-16                     0                           0                            0                         0        21
...                          ...                         ...                          ...                       ...       ...
2017-09-17                  2105                        1708                         1886                       708      1056
2017-09-24                  2124                        1712                         1902                       720      1063
2017-10-01                  2144                        1718                         1908                       740      1064
2017-10-08                  2166                        1728                         1918                       744      1066
2017-10-15                  2180                        1741                         1927                       755      1068

278 rows × 5 columns

The stacked area chart uses each group's fraction of the total, so every row always sums to 1.

In [6]:
(meetup
 .groupby([pd.Grouper(freq='W'), 'group'])
 .size()
 .unstack('group', fill_value=0)
 .cumsum()
 .pipe(lambda df_: df_.div(df_.sum(axis='columns'), axis='index'))
 )
Out[6]:
group       houston data science  houston data visualization  houston energy data science  houston machine learning  houstonr
join_date
2010-11-07              0.000000                    0.000000                     0.000000                  0.000000  1.000000
2010-11-14              0.000000                    0.000000                     0.000000                  0.000000  1.000000
2010-11-21              0.000000                    0.000000                     0.000000                  0.000000  1.000000
2010-12-05              0.000000                    0.000000                     0.000000                  0.000000  1.000000
2011-01-16              0.000000                    0.000000                     0.000000                  0.000000  1.000000
...                          ...                         ...                          ...                       ...       ...
2017-09-17              0.282058                    0.228862                     0.252713                  0.094868  0.141498
2017-09-24              0.282409                    0.227629                     0.252892                  0.095732  0.141338
2017-10-01              0.283074                    0.226829                     0.251914                  0.097703  0.140481
2017-10-08              0.284177                    0.226712                     0.251640                  0.097612  0.139858
2017-10-15              0.284187                    0.226959                     0.251206                  0.098423  0.139226

278 rows × 5 columns

1. Creating the stacked area plot

In [7]:
fig, ax = plt.subplots(figsize=(18,6))
(meetup
 .groupby([pd.Grouper(freq='W'), 'group'])
 .size()
 .unstack('group', fill_value=0)
 .cumsum()
 .pipe(lambda df_: df_.div(df_.sum(axis='columns'), axis='index'))
 .plot.area(ax=ax, cmap='Greys', xlim=('2013-6', None), ylim=(0, 1), legend=False)
 )

ax.figure.suptitle('Houston Meetup Groups', size=25)
ax.set_xlabel('')
ax.yaxis.tick_right()
kwargs={'xycoords':'axes fraction','size':15}
# Pass the label text positionally; the 's' keyword was renamed 'text' in Matplotlib 3.3.
ax.annotate('R Users', xy=(.1, .7), color='w', **kwargs)
ax.annotate('Data Visualization', xy=(.25, .16), color='k', **kwargs)
ax.annotate('Energy Data Science', xy=(.5, .55), color='k', **kwargs)
ax.annotate('Data Science', xy=(.83, .07), color='k', **kwargs)
ax.annotate('Machine Learning', xy=(.86, .78), color='w', **kwargs)
fig.savefig('c13-stacked1.png')
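
The annotation step can also be tried on its own with a small made-up table. This is only a sketch: the shares values, the 'Group A' label, and the Agg backend are assumptions for illustration, not part of the meetup example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical shares for two made-up groups; each row sums to 1.
shares = pd.DataFrame(
    {"group_a": [0.5, 0.6, 0.7], "group_b": [0.5, 0.4, 0.3]},
    index=pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-19"]),
)

fig, ax = plt.subplots(figsize=(6, 3))
shares.plot.area(ax=ax, ylim=(0, 1), legend=False)

# The label goes first as a positional argument.
# With xycoords='axes fraction', (0, 0) is the bottom-left corner of the
# axes and (1, 1) is the top-right, independent of the data limits.
label = ax.annotate("Group A", xy=(0.1, 0.2), xycoords="axes fraction", size=12)
```

Because the positions are fractions of the axes rather than data coordinates, the labels stay in place when xlim changes, but they have to be adjusted by eye so that each one sits inside its own band.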
