Digital Media Lab Machine Learning Summer Camp

Week 3 (4): Multivariable Linear Regression Practice

Bike Sharing Demand

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - weather condition, coded as:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
count - number of total rentals
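The coded fields above can be made easier to read by mapping the integer codes to labels. The following is a minimal sketch, not part of the original notebook: the tiny `sample` frame is made up, and the short weather labels are abbreviations chosen here for illustration.

```python
import pandas as pd

# Code-to-label mappings taken from the field descriptions above.
# The short weather labels are abbreviations chosen for this sketch.
SEASON_LABELS = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
WEATHER_LABELS = {1: "clear", 2: "mist", 3: "light rain/snow", 4: "heavy rain/snow"}

# A tiny made-up frame standing in for the real data.
sample = pd.DataFrame({"season": [1, 3, 4], "weather": [1, 2, 4]})

# Series.map looks each code up in the dictionary.
sample["season_name"] = sample["season"].map(SEASON_LABELS)
sample["weather_name"] = sample["weather"].map(WEATHER_LABELS)
print(sample)
```

The regression later in this notebook uses the raw integer codes directly, so labels like these are only a reading aid.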

Loading the Data

First, load the data with pandas.

import pandas as pd

data = pd.read_csv('bike_train.csv')
data[:10]
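After loading, it can help to check the shape and column types before going further. The sketch below assumes nothing about the real file beyond three rows and a few columns copied from the start of the data; `io.StringIO` stands in for `bike_train.csv` so the example runs on its own.

```python
import io
import pandas as pd

# Three rows from the start of the data, limited to a few columns.
# io.StringIO stands in for the 'bike_train.csv' file.
csv_text = """datetime,season,weather,temp,count
2011-01-01 0:00,1,1,9.84,16
2011-01-01 1:00,1,1,9.02,40
2011-01-01 2:00,1,1,9.02,32
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # 'datetime' is read as plain strings, not timestamps
print(sample.head(2))  # same rows as sample[:2]
```

Because `datetime` arrives as text, it has to be parsed before its year, month, day, or hour can be used.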

Processing the Datetime Variable

Split datetime into year, month, day, and hour.
Use pandas' to_datetime together with the dt.year, dt.month, dt.day, and dt.hour properties.
http://pandas.pydata.org/pandas-docs/version/0.17.0/api.html#datetimelike-properties
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_datetime.html

# Parse the datetime strings once and reuse the parsed result
datetime_series = pd.to_datetime(data['datetime'])
data['year'] = datetime_series.dt.year
data['month'] = datetime_series.dt.month
data['day'] = datetime_series.dt.day
data['hour'] = datetime_series.dt.hour

data.to_csv('bike_new.csv', index=False)
data[:10]
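To see what the `dt` properties return, the same split can be tried on a couple of hand-written timestamps. This is an illustrative sketch: both values in `sample` are made up, written in a format similar to the `datetime` column.

```python
import pandas as pd

# Two made-up timestamps in a format similar to the 'datetime' column.
sample = pd.DataFrame({"datetime": ["2011-01-01 05:00", "2012-12-19 23:00"]})

# Convert the strings to timestamps, then read each component via .dt
parsed = pd.to_datetime(sample["datetime"])
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.day
sample["hour"] = parsed.dt.hour

print(sample)
```

Each new column holds plain integers, which is what allows them to be passed to the regression as ordinary numeric inputs.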

Setting the Input Variables (X) and Output Variable (y)

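# Inputs: every column except the raw timestamp and the target
X = data.drop(['datetime', 'count'], axis=1)
# Output: the total rental count
y = data['count']

from sklearn import linear_model

# Fit an ordinary least-squares linear regression, H(x) = W.x + b
H = linear_model.LinearRegression()
H.fit(X, y)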

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

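Hypothesis of the LinearRegression Model Built with sklearn

H.coef_ : the weights W
H.intercept_ : the bias b
H.residues_ : the sum of squared residuals (the cost); this attribute is available in the older scikit-learn release used here, and later releases removed it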
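The meaning of these attributes can be checked on data whose true parameters are known. The sketch below uses synthetic data with made-up values (`true_W`, `true_b`) and computes the cost by hand from the residuals, so it does not rely on `residues_` being available.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known parameters (made up for this sketch):
# y = 2*x1 - 3*x2 + 5 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
true_W = np.array([2.0, -3.0])
true_b = 5.0
y_demo = X_demo @ true_W + true_b + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Sum of squared residuals, computed by hand from the predictions.
residuals = y_demo - model.predict(X_demo)
cost = float(np.sum(residuals ** 2))

print(model.coef_)       # close to true_W
print(model.intercept_)  # close to true_b
print(cost)
```

With low noise the fitted `coef_` and `intercept_` land close to the values used to generate the data, which confirms that they play the roles of W and b.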

print(list(X.columns))
print(H.coef_)
print(H.intercept_)

['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed', 'year', 'month', 'day', 'hour']
[ -7.73682199  -5.95993365   0.17299389  -4.88909329   1.64755957
   4.67426125  -2.03779123   0.60491752  82.76236005   9.93335653
   0.37788635   7.77808352]
-166442.552741

Therefore, numbering the inputs in column order, $$ x_1 : season,\ x_2 : holiday,\ x_3 : workingday,\ x_4 : weather\\ x_5 : temp,\ x_6 : atemp,\ x_7 : humidity,\ x_8 : windspeed\\ x_9 : year,\ x_{10} : month,\ x_{11} : day,\ x_{12} : hour $$

$$H(x) = -7.737x_1 - 5.960x_2 + 0.173x_3 - 4.889x_4 + 1.648x_5 + 4.674x_6 - 2.038x_7\\ + 0.605x_8 + 82.762x_9 + 9.933x_{10} + 0.378x_{11} + 7.778x_{12} - 166442.553$$
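The hypothesis can also be evaluated by hand as a dot product plus the intercept. The following sketch copies the coefficients and intercept from the printed output above; the input row `x` and the helper name `hypothesis` are made up for illustration.

```python
import numpy as np

# Coefficients and intercept copied from the printed output above.
W = np.array([-7.73682199, -5.95993365, 0.17299389, -4.88909329,
              1.64755957, 4.67426125, -2.03779123, 0.60491752,
              82.76236005, 9.93335653, 0.37788635, 7.77808352])
b = -166442.552741

def hypothesis(x):
    """Evaluate H(x) = W . x + b for one row of 12 feature values."""
    return float(np.dot(W, x) + b)

# A made-up input row in the column order listed above:
# season, holiday, workingday, weather, temp, atemp, humidity, windspeed,
# year, month, day, hour
x = np.array([3, 0, 1, 1, 25.0, 29.0, 60, 10.0, 2012, 7, 10, 17])
print(hypothesis(x))

# The year term alone is roughly +166,435 for 2011, which is why the
# intercept is such a large negative number: the two nearly cancel.
print(W[8] * 2011 + b)
```

The second print shows that the large intercept mostly offsets the year term, since `year` enters the model as a raw value near 2011 rather than as a small offset.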

First 10 rows of the data (data[:10]) after loading:

	datetime	season	weather	temp	atemp	humidity	windspeed	count
0	2011-01-01 0:00	1	1	9.84	14.395	81	0.0000	16
1	2011-01-01 1:00	1	1	9.02	13.635	80	0.0000	40
2	2011-01-01 2:00	1	1	9.02	13.635	80	0.0000	32
3	2011-01-01 3:00	1	1	9.84	14.395	75	0.0000	13
4	2011-01-01 4:00	1	1	9.84	14.395	75	0.0000	1
5	2011-01-01 5:00	1	2	9.84	12.880	75	6.0032	1
6	2011-01-01 6:00	1	1	9.02	13.635	80	0.0000	2
7	2011-01-01 7:00	1	1	8.20	12.880	86	0.0000	3
8	2011-01-01 8:00	1	1	9.84	14.395	75	0.0000	8
9	2011-01-01 9:00	1	1	13.12	17.425	76	0.0000	14

First 10 rows of the data (data[:10]) after adding year, month, day, and hour:

	datetime	season	weather	temp	atemp	humidity	windspeed	count	year	month	day	hour
0	2011-01-01 0:00	1	1	9.84	14.395	81	0.0000	16	2011	1	1	0
1	2011-01-01 1:00	1	1	9.02	13.635	80	0.0000	40	2011	1	1	1
2	2011-01-01 2:00	1	1	9.02	13.635	80	0.0000	32	2011	1	1	2
3	2011-01-01 3:00	1	1	9.84	14.395	75	0.0000	13	2011	1	1	3
4	2011-01-01 4:00	1	1	9.84	14.395	75	0.0000	1	2011	1	1	4
5	2011-01-01 5:00	1	2	9.84	12.880	75	6.0032	1	2011	1	1	5
6	2011-01-01 6:00	1	1	9.02	13.635	80	0.0000	2	2011	1	1	6
7	2011-01-01 7:00	1	1	8.20	12.880	86	0.0000	3	2011	1	1	7
8	2011-01-01 8:00	1	1	9.84	14.395	75	0.0000	8	2011	1	1	8
9	2011-01-01 9:00	1	1	13.12	17.425	76	0.0000	14	2011	1	1	9
