Digital Media Lab Machine Learning Summer Camp

Week 3 (4): Multivariable Linear Regression Practice

Bike Sharing Demand

  • datetime - hourly date + timestamp
  • season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather -
    1: Clear, Few clouds, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp - temperature in Celsius
  • atemp - "feels like" temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • count - number of total rentals

Loading the data

  • First, load the data with pandas.
In [1]:
import pandas as pd

data = pd.read_csv('bike_train.csv')
data[:10]
Out[1]:
          datetime  season  holiday  workingday  weather   temp   atemp  humidity  windspeed  count
0  2011-01-01 0:00       1        0           0        1   9.84  14.395        81     0.0000     16
1  2011-01-01 1:00       1        0           0        1   9.02  13.635        80     0.0000     40
2  2011-01-01 2:00       1        0           0        1   9.02  13.635        80     0.0000     32
3  2011-01-01 3:00       1        0           0        1   9.84  14.395        75     0.0000     13
4  2011-01-01 4:00       1        0           0        1   9.84  14.395        75     0.0000      1
5  2011-01-01 5:00       1        0           0        2   9.84  12.880        75     6.0032      1
6  2011-01-01 6:00       1        0           0        1   9.02  13.635        80     0.0000      2
7  2011-01-01 7:00       1        0           0        1   8.20  12.880        86     0.0000      3
8  2011-01-01 8:00       1        0           0        1   9.84  14.395        75     0.0000      8
9  2011-01-01 9:00       1        0           0        1  13.12  17.425        76     0.0000     14
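
After loading, it can help to confirm the shape, column types, and missing values before going further. The following is a minimal sketch that does not need `bike_train.csv`: it reads the first three rows shown above from an in-memory string. The names `csv_text` and `sample` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# First three rows of the output above, standing in for bike_train.csv.
csv_text = """datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
2011-01-01 0:00,1,0,0,1,9.84,14.395,81,0.0,16
2011-01-01 1:00,1,0,0,1,9.02,13.635,80,0.0,40
2011-01-01 2:00,1,0,0,1,9.02,13.635,80,0.0,32
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # datetime is read as plain text unless parsed
print(sample.isna().sum())  # number of missing values per column
```

The same three calls can be applied to `data` once the real file has been read.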

Processing the datetime variable

In [2]:
# Parse the datetime strings once, then derive the calendar fields from the result.
datetime_series = pd.to_datetime(data['datetime'])
data['year'] = datetime_series.dt.year
data['month'] = datetime_series.dt.month
data['day'] = datetime_series.dt.day
data['hour'] = datetime_series.dt.hour

data.to_csv('bike_new.csv', index=False)
data[:10]
Out[2]:
          datetime  season  holiday  workingday  weather   temp   atemp  humidity  windspeed  count  year  month  day  hour
0  2011-01-01 0:00       1        0           0        1   9.84  14.395        81     0.0000     16  2011      1    1     0
1  2011-01-01 1:00       1        0           0        1   9.02  13.635        80     0.0000     40  2011      1    1     1
2  2011-01-01 2:00       1        0           0        1   9.02  13.635        80     0.0000     32  2011      1    1     2
3  2011-01-01 3:00       1        0           0        1   9.84  14.395        75     0.0000     13  2011      1    1     3
4  2011-01-01 4:00       1        0           0        1   9.84  14.395        75     0.0000      1  2011      1    1     4
5  2011-01-01 5:00       1        0           0        2   9.84  12.880        75     6.0032      1  2011      1    1     5
6  2011-01-01 6:00       1        0           0        1   9.02  13.635        80     0.0000      2  2011      1    1     6
7  2011-01-01 7:00       1        0           0        1   8.20  12.880        86     0.0000      3  2011      1    1     7
8  2011-01-01 8:00       1        0           0        1   9.84  14.395        75     0.0000      8  2011      1    1     8
9  2011-01-01 9:00       1        0           0        1  13.12  17.425        76     0.0000     14  2011      1    1     9
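
To see what the `.dt` accessor returns for each field, here is a small self-contained sketch. It assumes `pd.to_datetime` infers the `YYYY-MM-DD H:MM` layout without an explicit format, as in the cell above. The frame `sample` is illustrative: its first two values come from the table above and the third is made up.

```python
import pandas as pd

# Timestamps in the same layout as the datetime column; the last one is made up.
sample = pd.DataFrame(
    {'datetime': ['2011-01-01 0:00', '2011-01-01 9:00', '2012-12-19 23:00']}
)

parsed = pd.to_datetime(sample['datetime'])  # text -> datetime Series

# Each .dt attribute yields one integer column.
for part in ['year', 'month', 'day', 'hour']:
    sample[part] = getattr(parsed.dt, part)

print(sample)
```

Splitting the timestamp this way turns one text column into four numeric columns that a linear model can use as inputs.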

Setting the input variables (X) and the output variable (y)

In [3]:
X = data.drop(['datetime', 'count'], axis=1)
y = data['count']
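
Note that `drop` with `axis=1` returns a new frame and leaves the original untouched, which is why `count` is still available when `y` is selected. A minimal sketch on a two-row frame (the names `toy`, `X_toy`, and `y_toy` are illustrative; the values are taken from the first two rows of the table above):

```python
import pandas as pd

toy = pd.DataFrame({
    'datetime': ['2011-01-01 0:00', '2011-01-01 1:00'],
    'temp': [9.84, 9.02],
    'hour': [0, 1],
    'count': [16, 40],
})

X_toy = toy.drop(['datetime', 'count'], axis=1)  # keep only the input columns
y_toy = toy['count']                             # target column

print(list(X_toy.columns))       # input column names
print(X_toy.shape, y_toy.shape)  # 2-D inputs, 1-D target
print('count' in toy.columns)    # the original frame is unchanged
```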
In [4]:
from sklearn import linear_model

H = linear_model.LinearRegression()
H.fit(X, y)
Out[4]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
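
To build intuition for what `fit` does, the following sketch fits the same model class to synthetic data generated from known weights and then predicts a new point. Everything here (`X_syn`, `true_W`, `true_b`, `x_new`) is made up for illustration and is unrelated to the bike data. Because the targets contain no noise, ordinary least squares should recover the generating line up to floating-point error.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 2))   # 50 samples, 2 input variables
true_W = np.array([2.0, -3.0])     # weights used to generate the data
true_b = 5.0                       # bias used to generate the data
y_syn = X_syn @ true_W + true_b    # noise-free targets

model = linear_model.LinearRegression()
model.fit(X_syn, y_syn)

x_new = np.array([[1.0, 1.0]])     # predict expects a 2-D array
print(model.predict(x_new))        # close to 2*1 - 3*1 + 5 = 4
```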

Hypothesis of the LinearRegression model created with sklearn

  • H.coef_ : the weights W
  • H.intercept_ : the bias b
  • H.residues_ : the sum of squared residuals (cost)
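
The link between these attributes and the model's predictions can be checked directly: `predict` should equal `X @ coef_ + intercept_`, and the cost is the sum of squared differences between `y` and those predictions. The sketch below uses a tiny made-up dataset (`X_toy`, `y_toy`, and `H_toy` are illustrative names) and computes the cost by hand, which works whether or not the installed scikit-learn version exposes `residues_`.

```python
import numpy as np
from sklearn import linear_model

# Made-up data: 5 samples, 2 input variables (not the bike data).
X_toy = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = np.array([3.0, 4.0, 7.0, 8.0, 12.0])

H_toy = linear_model.LinearRegression().fit(X_toy, y_toy)

manual_pred = X_toy @ H_toy.coef_ + H_toy.intercept_  # H(x) = Wx + b
residuals = y_toy - manual_pred                       # per-sample errors
cost = np.sum(residuals ** 2)                         # sum of squared residuals

print(np.allclose(manual_pred, H_toy.predict(X_toy)))
print(cost)
```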
In [5]:
print(list(X.columns))
print(H.coef_)
print(H.intercept_)
['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed', 'year', 'month', 'day', 'hour']
[ -7.73682199  -5.95993365   0.17299389  -4.88909329   1.64755957
   4.67426125  -2.03779123   0.60491752  82.76236005   9.93335653
   0.37788635   7.77808352]
-166442.552741
  • Therefore, labeling the inputs as $$ x_1 : season,\ x_2 : holiday,\ x_3 : workingday,\ x_4 : weather\\ x_5 : temp,\ x_6 : atemp,\ x_7 : humidity,\ x_8 : windspeed\\ x_9 : year,\ x_{10} : month,\ x_{11} : day,\ x_{12} : hour $$
$$H(x) = -7.737x_1 - 5.960x_2 + 0.173x_3 - 4.889x_4 + 1.648x_5 + 4.674x_6 - 2.038x_7\\ + 0.605x_8 + 82.762x_9 + 9.933x_{10} + 0.378x_{11} + 7.778x_{12} - 166442.553$$
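
As a worked example, the hypothesis can be evaluated by hand for the first row of the data (index 0). The sketch below copies the coefficients and intercept exactly as printed by In [5] and the input values from the table; the names `W`, `b`, and `x0` are illustrative, and only plain Python is used.

```python
# Coefficients and intercept as printed by H.coef_ and H.intercept_ above.
W = [-7.73682199, -5.95993365, 0.17299389, -4.88909329, 1.64755957, 4.67426125,
     -2.03779123, 0.60491752, 82.76236005, 9.93335653, 0.37788635, 7.77808352]
b = -166442.552741

# Row 0: season, holiday, workingday, weather, temp, atemp,
#        humidity, windspeed, year, month, day, hour
x0 = [1, 0, 0, 1, 9.84, 14.395, 81, 0.0, 2011, 1, 1, 0]

# H(x) = w_1*x_1 + ... + w_12*x_12 + b
prediction = sum(w * x for w, x in zip(W, x0)) + b
print(round(prediction, 2))
```

The result need not match the observed count of 16 for that row: the output of a linear model is unconstrained, so it can even be negative.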