Understanding pandas - binning, dummy variables, and string-to-datetime conversion

정듀이 2020. 7. 16. 15:37
pandas notes 4

Binning

In [25]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for this import
display(HTML("<style>.container { width:90% !important;}</style>"))
In [10]:
import numpy as np   # needed for np.histogram below
import pandas as pd

emp = pd.read_csv("c:/data/emp3.csv")
emp
Out[10]:
index empno ename job mgr hiredate sal comm deptno
0 1 7839 KING PRESIDENT NaN 1981-11-17 0:00 5000 NaN 10
1 2 7698 BLAKE MANAGER 7839.0 1981-05-01 0:00 2850 NaN 30
2 3 7782 CLARK MANAGER 7839.0 1981-05-09 0:00 2450 NaN 10
3 4 7566 JONES MANAGER 7839.0 1981-04-01 0:00 2975 NaN 20
4 5 7654 MARTIN SALESMAN 7698.0 1981-09-10 0:00 1250 1400.0 30
5 6 7499 ALLEN SALESMAN 7698.0 1981-02-11 0:00 1600 300.0 30
6 7 7844 TURNER SALESMAN 7698.0 1981-08-21 0:00 1500 0.0 30
7 8 7900 JAMES CLERK 7698.0 1981-12-11 0:00 950 NaN 30
8 9 7521 WARD SALESMAN 7698.0 1981-02-23 0:00 1250 500.0 30
9 10 7902 FORD ANALYST 7566.0 1981-12-11 0:00 3000 NaN 20
10 11 7369 SMITH CLERK 7902.0 1980-12-09 0:00 800 NaN 20
11 12 7788 SCOTT ANALYST 7566.0 1982-12-22 0:00 3000 NaN 20
12 13 7876 ADAMS CLERK 7788.0 1983-01-15 0:00 1100 NaN 20
13 14 7934 MILLER CLERK 7782.0 1982-01-11 0:00 1300 NaN 10
In [11]:
count, bin_dividers = np.histogram(emp.sal,bins=3)
print(count)
print(bin_dividers) # array of bin edges
[8 5 1]
[ 800. 2200. 3600. 5000.]
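When `bins` is an integer, `np.histogram` splits the range from the minimum to the maximum value into that many equal-width intervals. The following is a minimal sketch of that idea, with the salaries from the table above typed in by hand as a stand-in for `emp.sal`; `manual_edges` is a name introduced here only for illustration.

```python
import numpy as np

# Salaries copied from the emp table above (stand-in for emp.sal)
sal = np.array([5000, 2850, 2450, 2975, 1250, 1600, 1500,
                950, 1250, 3000, 800, 3000, 1100, 1300])

count, bin_dividers = np.histogram(sal, bins=3)

# Equal-width edges: bins + 1 evenly spaced points from min to max
manual_edges = np.linspace(sal.min(), sal.max(), 3 + 1)

print(count)
print(bin_dividers)
print(np.allclose(bin_dividers, manual_edges))
```

The edges can then be passed to `pd.cut` as the `bins` argument.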
In [12]:
bin_names = ['low income', 'middle income', 'high income']
# include_lowest=True keeps the minimum salary (800) inside the first bin;
# without it the first interval is (800, 2200] and SMITH's row becomes NaN
emp['sal_divide'] = pd.cut(x=emp.sal, bins=bin_dividers, labels=bin_names, include_lowest=True)
emp
Out[12]:
index empno ename job mgr hiredate sal comm deptno sal_divide
0 1 7839 KING PRESIDENT NaN 1981-11-17 0:00 5000 NaN 10 high income
1 2 7698 BLAKE MANAGER 7839.0 1981-05-01 0:00 2850 NaN 30 middle income
2 3 7782 CLARK MANAGER 7839.0 1981-05-09 0:00 2450 NaN 10 middle income
3 4 7566 JONES MANAGER 7839.0 1981-04-01 0:00 2975 NaN 20 middle income
4 5 7654 MARTIN SALESMAN 7698.0 1981-09-10 0:00 1250 1400.0 30 low income
5 6 7499 ALLEN SALESMAN 7698.0 1981-02-11 0:00 1600 300.0 30 low income
6 7 7844 TURNER SALESMAN 7698.0 1981-08-21 0:00 1500 0.0 30 low income
7 8 7900 JAMES CLERK 7698.0 1981-12-11 0:00 950 NaN 30 low income
8 9 7521 WARD SALESMAN 7698.0 1981-02-23 0:00 1250 500.0 30 low income
9 10 7902 FORD ANALYST 7566.0 1981-12-11 0:00 3000 NaN 20 middle income
10 11 7369 SMITH CLERK 7902.0 1980-12-09 0:00 800 NaN 20 low income
11 12 7788 SCOTT ANALYST 7566.0 1982-12-22 0:00 3000 NaN 20 middle income
12 13 7876 ADAMS CLERK 7788.0 1983-01-15 0:00 1100 NaN 20 low income
13 14 7934 MILLER CLERK 7782.0 1982-01-11 0:00 1300 NaN 10 low income
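One detail of `pd.cut` is easy to miss: by default the intervals are open on the left and closed on the right, so a value equal to the lowest edge falls outside the first bin and comes back as NaN unless `include_lowest=True` is passed. A small sketch comparing the two behaviours, using a few salaries from the table and shortened labels chosen only for this example:

```python
import pandas as pd

sal = pd.Series([800, 1250, 2850, 5000])   # a few salaries from the emp table
edges = [800, 2200, 3600, 5000]
labels = ['low', 'middle', 'high']         # illustrative labels

# Default: intervals are (800, 2200], (2200, 3600], (3600, 5000]
default_cut = pd.cut(sal, bins=edges, labels=labels)

# include_lowest=True also closes the first interval on the left
inclusive_cut = pd.cut(sal, bins=edges, labels=labels, include_lowest=True)

print(default_cut.tolist())    # the first value (800) is NaN here
print(inclusive_cut.tolist())  # the first value is 'low' here
```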

Dummy variables

In [13]:
pd.get_dummies(emp.deptno)
Out[13]:
10 20 30
0 1 0 0
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
5 0 0 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 1 0
10 0 1 0
11 0 1 0
12 0 1 0
13 1 0 0
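Recent pandas versions return boolean (True/False) columns from `get_dummies` by default, so to get the 0/1 layout shown above it may be necessary to pass `dtype=int`. The sketch below also adds a column prefix and joins the indicator columns back to the data. `emp_small` is a hand-typed stand-in for the emp table, and the `dept` prefix is just a choice made for this example.

```python
import pandas as pd

# Small stand-in for the emp table (values taken from the rows above)
emp_small = pd.DataFrame({
    'ename': ['KING', 'BLAKE', 'JONES'],
    'deptno': [10, 30, 20],
})

# prefix names the new columns; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(emp_small['deptno'], prefix='dept', dtype=int)

# Attach the indicator columns next to the original data
emp_encoded = pd.concat([emp_small, dummies], axis=1)
print(emp_encoded)
```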

Converting strings to datetime

In [14]:
df = pd.read_csv("c:/data/studyfile/stock-data.csv")
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    20 non-null     object
 1   Close   20 non-null     int64 
 2   Start   20 non-null     int64 
 3   High    20 non-null     int64 
 4   Low     20 non-null     int64 
 5   Volume  20 non-null     int64 
dtypes: int64(5), object(1)
memory usage: 1.1+ KB
None
Out[14]:
Date Close Start High Low Volume
0 2018-07-02 10100 10850 10900 10000 137977
1 2018-06-29 10700 10550 10900 9990 170253
2 2018-06-28 10400 10900 10950 10150 155769
3 2018-06-27 10900 10800 11050 10500 133548
4 2018-06-26 10800 10900 11000 10700 63039
In [15]:
df['Date'] = pd.to_datetime(df['Date'])  # replace the string column with datetime64 values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    20 non-null     datetime64[ns]
 1   Close   20 non-null     int64         
 2   Start   20 non-null     int64         
 3   High    20 non-null     int64         
 4   Low     20 non-null     int64         
 5   Volume  20 non-null     int64         
dtypes: datetime64[ns](1), int64(5)
memory usage: 1.1 KB
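When the strings are not all clean, `pd.to_datetime` can be given an explicit `format` and told what to do with values it cannot parse. A minimal sketch with made-up input (the third value is deliberately invalid):

```python
import pandas as pd

dates = pd.Series(['2018-07-02', '2018-06-29', 'not a date'])  # hypothetical input

# format spells out the layout; errors='coerce' turns unparseable values into NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```

With the default `errors='raise'`, the invalid value would raise an exception instead of becoming NaT.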
In [16]:
df.Date.dt.year.tail()
Out[16]:
15    2018
16    2018
17    2018
18    2018
19    2018
Name: Date, dtype: int64
In [17]:
df.Date.dt.month.head()
Out[17]:
0    7
1    6
2    6
3    6
4    6
Name: Date, dtype: int64
In [18]:
df.Date.dt.day.head()
Out[18]:
0     2
1    29
2    28
3    27
4    26
Name: Date, dtype: int64
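The `.dt` accessor offers more than year, month, and day. As a rough sketch, the parts can be collected into one DataFrame together with the weekday name and a year-month label; the three dates are copied from the table above, and the column names are chosen only for this example.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-07-02', '2018-06-29', '2018-06-28']))

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'weekday': dates.dt.day_name(),                     # English weekday name
    'year_month': dates.dt.to_period('M').astype(str),  # e.g. '2018-07'
})
print(parts)
```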

Making the index a datetime index

In [19]:
df.set_index('Date',inplace=True)
df
Out[19]:
Close Start High Low Volume
Date
2018-07-02 10100 10850 10900 10000 137977
2018-06-29 10700 10550 10900 9990 170253
2018-06-28 10400 10900 10950 10150 155769
2018-06-27 10900 10800 11050 10500 133548
2018-06-26 10800 10900 11000 10700 63039
2018-06-25 11150 11400 11450 11000 55519
2018-06-22 11300 11250 11450 10750 134805
2018-06-21 11200 11350 11750 11200 133002
2018-06-20 11550 11200 11600 10900 308596
2018-06-19 11300 11850 11950 11300 180656
2018-06-18 12000 13400 13400 12000 309787
2018-06-15 13400 13600 13600 12900 201376
2018-06-14 13450 13200 13700 13150 347451
2018-06-12 13200 12200 13300 12050 558148
2018-06-11 11950 12000 12250 11950 62293
2018-06-08 11950 11950 12200 11800 59258
2018-06-07 11950 12200 12300 11900 49088
2018-06-05 12150 11800 12250 11800 42485
2018-06-04 11900 11900 12200 11700 25171
2018-06-01 11900 11800 12100 11750 32062
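The conversion and `set_index` steps can also be folded into the read itself with the `parse_dates` and `index_col` arguments of `pd.read_csv`. The sketch below reads from an in-memory string so that it runs without the original file; `csv_text` and `stock` are names made up for this example.

```python
import io
import pandas as pd

# In-memory stand-in for stock-data.csv (two rows copied from the table above)
csv_text = "Date,Close,Volume\n2018-07-02,10100,137977\n2018-06-29,10700,170253\n"

# Parse the Date column as datetime and use it as the index in one step
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

print(stock)
print(type(stock.index))
```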
In [20]:
df.index
Out[20]:
DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='Date', freq=None)
In [21]:
df.index.year
Out[21]:
Int64Index([2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,
            2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
           dtype='int64', name='Date')
In [22]:
df.index.month
Out[22]:
Int64Index([7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], dtype='int64', name='Date')
In [23]:
df.index.day
Out[23]:
Int64Index([2, 29, 28, 27, 26, 25, 22, 21, 20, 19, 18, 15, 14, 12, 11, 8, 7, 5,
            4, 1],
           dtype='int64', name='Date')
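A practical benefit of a DatetimeIndex is that rows can be selected with date strings through `.loc`. A small sketch with a few rows copied from the table above; the index is sorted first, on the assumption that slicing is most predictable on an ascending index.

```python
import pandas as pd

# A few rows copied from the stock table above
stock = pd.DataFrame(
    {'Close': [10100, 10700, 10400, 11900]},
    index=pd.to_datetime(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-01']),
)
stock.index.name = 'Date'

# Sort ascending before slicing by date
stock = stock.sort_index()

june = stock.loc['2018-06']                       # every row in June 2018
late_june = stock.loc['2018-06-28':'2018-06-30']  # date range, both ends inclusive

print(june)
print(late_june)
```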