[ad_1]
Features are important in an information science challenge as a result of they make the code extra modular, reusable, readable, and testable. Nevertheless, writing a messy perform that tries to do an excessive amount of can introduce upkeep hurdles and diminish the code’s readability.
Within the following code, the perform impute_missing_values
is lengthy, messy, and tries to do many issues. Since there are lots of hard-coded values, it will be unattainable for another person to reuse this perform for a DataFrame with totally different column names.
def impute_missing_values(df):
# Fill lacking values with group statistics
df["MSZoning"] = df.groupby("MSSubClass")["MSZoning"].rework(
lambda x: x.fillna(x.mode()[0])
)
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].rework(
lambda x: x.fillna(x.median())
)# Fill lacking values with fixed
df["Functional"] = df["Functional"].fillna("Typ")
df["Alley"] = df["Alley"].fillna("Lacking")
for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
df[col] = df[col].fillna("Lacking")
for col in ("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"):
df[col] = df[col].fillna("Lacking")
df["FireplaceQu"] = df["FireplaceQu"].fillna("Lacking")
df["PoolQC"] = df["PoolQC"].fillna("Lacking")
df["Fence"] = df["Fence"].fillna("Lacking")
df["MiscFeature"] = df["MiscFeature"].fillna("Lacking")
numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
for i in df.columns:
if df[i].dtype in numeric_dtypes:
df[i] = df[i].fillna(0)
# Fill lacking values with mode
df["Electrical"] = df["Electrical"].fillna("SBrkr")
df["KitchenQual"] = df["KitchenQual"].fillna("TA")
df["Exterior1st"] = df["Exterior1st"].fillna(df["Exterior1st"].mode()[0])
df["Exterior2nd"] = df["Exterior2nd"].fillna(df["Exterior2nd"].mode()[0])
df["SaleType"] = df["SaleType"].fillna(df["SaleType"].mode()[0])
for i in df.columns:
if df[i].dtype == object:
df[i] = df[i].fillna(df[i].mode()[0])
return df
This instance is customized from the pocket book titled How I Achieved Top 0.3% in a Kaggle Competition, with just a few alterations.
[ad_2]
Source link