
Seongho Jang


Data Science Lv.2

Full Summary

seonghojang 2023. 1. 22. 21:44

*In sklearn, fit a model by passing the independent variables (X) and the dependent variable (y) to fit()


Simple Linear Regression

from statsmodels.formula.api import ols
model = ols(formula = "dependent_var ~ independent_var", data = df).fit()
Fitted values : model.fittedvalues (for new data, use model.predict(df_test))
Residuals : model.resid
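A minimal sketch of the simple regression workflow above, on made-up data. The column names `x` and `y` and the toy values are assumptions for illustration only; the point is that fitted values and residuals live on the fitted model, while predictions for new rows come from `predict()`.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data: y is roughly 1 + 2 * x with small alternating noise.
x = list(range(10))
noise = [0.1 if i % 2 == 0 else -0.1 for i in x]
df = pd.DataFrame({"x": x, "y": [1 + 2 * xi + n for xi, n in zip(x, noise)]})

# Formula interface: "dependent ~ independent"
model = ols(formula="y ~ x", data=df).fit()

fitted = model.fittedvalues   # predictions for the rows used in fitting
resid = model.resid           # observed y minus fitted y

# Predictions for unseen rows go through predict(), not fittedvalues.
df_test = pd.DataFrame({"x": [10, 11]})
pred_test = model.predict(df_test)

print(model.params)
print(pred_test)
```

`model.summary()` prints the full regression table if the coefficients alone are not enough.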



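Multiple Linear Regression

import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# General form : y, X = dmatrices(formula, data = df, return_type = 'dataframe')
# The formula below assumes the dependent variable (casual) is the last column of df_sub
formula = "casual ~ " + " + ".join(df_sub.columns[:-1])
y, X = dmatrices(formula, data = df_sub, return_type = 'dataframe')
df_vif = pd.DataFrame()
df_vif["colname"] = X.columns
# VIF of each column of the design matrix (including the Intercept column)
df_vif["VIF"] = [vif(X.values, i) for i in range(X.shape[1])]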
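The VIF table above can be tried end to end on synthetic data. This is only a sketch: the columns `x1`, `x2`, `x3`, and `target` are invented, and `x2` is deliberately built as a near copy of `x1` so that the multicollinearity shows up as large VIF values.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# Hypothetical data: x2 is almost a duplicate of x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
df_sub = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=n),
    "x3": rng.normal(size=n),
})
# The dependent variable goes last so that columns[:-1] are the predictors.
df_sub["target"] = df_sub["x1"] + df_sub["x3"] + rng.normal(size=n)

formula = "target ~ " + " + ".join(df_sub.columns[:-1])
y, X = dmatrices(formula, data=df_sub, return_type="dataframe")

df_vif = pd.DataFrame()
df_vif["colname"] = X.columns
df_vif["VIF"] = [vif(X.values, i) for i in range(X.shape[1])]
print(df_vif)
```

A common rule of thumb treats a VIF above about 10 as a sign of multicollinearity; here `x1` and `x2` should land far above that, while `x3` stays close to 1.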



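Logistic Regression

import numpy as np
from statsmodels.api import Logit
from sklearn.linear_model import LogisticRegression

# Logit takes the data itself (not column names) and does not add an intercept on its own
model = Logit(endog = df['dependent_var'], exog = df[['independent_var']]).fit()
Odds ratios : np.exp(model.params)
pred = model.predict(df[['independent_var']])
pred_class = (pred > 0.5) + 0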


Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

# X takes the independent variables (2-D), y takes the dependent variable
model = DecisionTreeClassifier(random_state = 123).fit(X = df[['independent_var']], y = df['dependent_var'])
pred = model.predict(df[['independent_var']])
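A small runnable sketch of the decision tree calls above. The `hours` and `passed` columns are made up for illustration; note that `X` is selected with double brackets so that it stays two-dimensional.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the label flips from 0 to 1 between hours = 3 and hours = 10.
df = pd.DataFrame({
    "hours": [0, 1, 2, 3, 10, 11, 12, 13],
    "passed": [0, 0, 0, 0, 1, 1, 1, 1],
})

# X = independent variables (DataFrame), y = dependent variable (Series)
model = DecisionTreeClassifier(random_state=123).fit(X=df[["hours"]], y=df["passed"])

pred = model.predict(df[["hours"]])          # predictions on the training rows

df_new = pd.DataFrame({"hours": [1.5, 11.5]})
pred_new = model.predict(df_new)             # predictions on unseen rows

print(pred)
print(pred_new)
```

`DecisionTreeRegressor` is called the same way when the dependent variable is numeric rather than a class label.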



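KMeans

from sklearn.cluster import KMeans
# Clustering is unsupervised, so only the variables to cluster on are passed (no dependent variable)
model = KMeans(n_clusters = 3, random_state = 123).fit(df[['var_1', 'var_2']])
df['cluster'] = model.labels_ : adds a cluster column showing which group each row belongs to
model.cluster_centers_ : the center point of each cluster
df.groupby('cluster').mean() : analyzes the mean of each cluster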
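The KMeans workflow above, sketched on three hand-made blobs. The column names and points are assumptions; `n_init=10` is set explicitly only so the behavior does not depend on the installed sklearn version's default.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated groups of three points each.
df = pd.DataFrame({
    "x": [0, 0, 1, 10, 10, 11, 20, 20, 21],
    "y": [0, 1, 0, 10, 11, 10, 0, 1, 0],
})
features = ["x", "y"]

model = KMeans(n_clusters=3, n_init=10, random_state=123).fit(df[features])

df["cluster"] = model.labels_                 # cluster id for every row
centers = model.cluster_centers_              # one center per cluster
cluster_means = df.groupby("cluster").mean()  # per-cluster feature means

print(df)
print(cluster_means)
```

The cluster ids themselves are arbitrary, so it is the grouping of rows and the per-cluster means that carry the meaning.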



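Naive Bayes

from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(X = df[['independent_var']], y = df['dependent_var'])
model.class_prior_ : prior probabilities
# predict_proba returns one column per class; [:, 1] keeps the second class (binary case)
pred = model.predict_proba(X = df[['independent_var']])[:, 1]
pred_class = (pred > 0.5) + 0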
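A short sketch of the Gaussian Naive Bayes calls above, using invented `score` and `label` columns. It assumes a binary problem, which is why the second column of `predict_proba` is taken before applying the 0.5 cutoff.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: 6 rows of class 0 (low scores), 4 rows of class 1 (high scores).
df = pd.DataFrame({
    "score": [0.0, 0.5, 1.0, 1.5, 0.2, 0.8, 5.0, 5.5, 6.0, 4.5],
    "label": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

model = GaussianNB().fit(X=df[["score"]], y=df["label"])

priors = model.class_prior_                     # class shares in the training data
proba = model.predict_proba(X=df[["score"]])    # shape (n_rows, n_classes)
pred = proba[:, 1]                              # probability of class 1
pred_class = (pred > 0.5) + 0                   # 0/1 labels

print(priors)
print(pred_class)
```

For more than two classes, `model.predict(X)` returns the most probable class directly and avoids the manual cutoff.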



Scaler
Min-Max Scaler : normalizes each column so that its minimum becomes 0 and its maximum becomes 1
Computed as (x - xmin) / (xmax - xmin)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler().fit(df).transform(df)
df_minmax = pd.DataFrame(minmax, columns = df.columns)


MinMaxScaler needs both fit and transform (usually on the same data; fit_transform does both in one call)
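To check that the scaler really applies (x - xmin) / (xmax - xmin), the sketch below compares `MinMaxScaler` with the formula written out by hand. The DataFrame and its columns `a` and `b` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with two columns on different scales.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

# fit_transform = fit(df) followed by transform(df)
minmax = MinMaxScaler().fit_transform(df)
df_minmax = pd.DataFrame(minmax, columns=df.columns)

# The same scaling computed column by column from the formula.
df_manual = (df - df.min()) / (df.max() - df.min())

print(df_minmax)
print(df_manual)
```

When there is a separate test set, a common practice is to call `fit` on the training data only and then `transform` both sets with that fitted scaler, so the test data does not influence the minimum and maximum.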

 


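Machine Learning Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
# pred must hold predicted class labels, not probabilities
Score = accuracy_score(y_true = df['dependent_var'], y_pred = pred)
All four functions are called the same way; just make sure you know what each one measures.

How the metrics are calculated in classification problems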
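As a sketch of how these metrics are calculated for a binary problem, the example below counts the confusion-matrix cells by hand and compares the results with sklearn. The label vectors are invented, and class 1 is assumed to be the positive class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predicted labels (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

With these vectors the counts are TP = 3, TN = 4, FP = 2, FN = 1, so accuracy is 0.7, precision is 0.6, and recall is 0.75.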